This is a Cog wrapper around pyannote.audio that lets you easily run speaker diarization on Replicate and saves you the trouble of dependency hell.
The model takes an input audio file via the `audio` parameter, runs speaker diarization, and returns a list of audio files containing the individual speaker turns, split by speaker and indexed in order. Each output URL encodes information in its file name to make working with the outputs easier. The file name format is `{index}_{speaker}_{duration}`, which resolves to e.g. `0_SPEAKER_01_16`. `index` is the order of the speaker turn, `speaker` is the diarized speaker label, and `duration` is the length of the turn in seconds.
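If you want to recover these fields programmatically, here is a minimal sketch (the `SpeakerTurn` type and `parse_turn_filename` helper are hypothetical, and it assumes the file extension has already been stripped, e.g. with `pathlib.Path(url).stem`). Since the speaker label itself contains underscores (`SPEAKER_01`), the name is split once from the left and once from the right rather than naively on every `_`:

```python
from dataclasses import dataclass


@dataclass
class SpeakerTurn:
    index: int     # order of the speaker turn within the audio
    speaker: str   # diarized speaker label, e.g. "SPEAKER_01"
    duration: int  # length of the turn in seconds


def parse_turn_filename(name: str) -> SpeakerTurn:
    """Parse an output file name of the form {index}_{speaker}_{duration}."""
    index, rest = name.split("_", 1)         # "0", "SPEAKER_01_16"
    speaker, duration = rest.rsplit("_", 1)  # "SPEAKER_01", "16"
    return SpeakerTurn(int(index), speaker, int(duration))


print(parse_turn_filename("0_SPEAKER_01_16"))
# SpeakerTurn(index=0, speaker='SPEAKER_01', duration=16)
```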
- SSH into a Linux environment with a GPU
- Install Cog (using replicate/codespaces if you're using GitHub Codespaces)
- Create a HuggingFace token and add it to `predict.py` as `HUGGINGFACE_TOKEN` (TODO: move it out of `predict.py` somehow, maybe into a script that caches the weights); a sketch of how the token is typically used follows this list
- Accept the license agreements for these two models on HuggingFace:
  - pyannote/speaker-diarization
  - pyannote/segmentation
- Run `cog predict -i [email protected]`
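For context, here is a minimal sketch of the diarization call that `predict.py` wraps, based on the standard pyannote.audio API; the exact model version and token handling are assumptions, not necessarily what this repo does:

```python
from pyannote.audio import Pipeline

HUGGINGFACE_TOKEN = "hf_..."  # your HuggingFace access token (see TODO above)

# Loading this gated pipeline requires the license agreements above.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=HUGGINGFACE_TOKEN,
)

# Run diarization and iterate over speaker turns with start/end times.
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```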
Then, to deploy it to Replicate:
- Create a new model at replicate.com/create
- Run `cog push r8.im/your-username/your-model-name`
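Once pushed, you can run the model from Python with the `replicate` client. A minimal sketch, assuming `REPLICATE_API_TOKEN` is set in your environment and using a placeholder model name:

```python
import replicate

# Placeholder model name; substitute your own username/model-name.
output = replicate.run(
    "your-username/your-model-name",
    input={"audio": open("audio.wav", "rb")},
)

# The output is a list of URLs, one per speaker turn, named
# {index}_{speaker}_{duration} as described above.
for url in output:
    print(url)
```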