A speaker identification and diarization solution based on PyTorch and the VoxCeleb v2 example from Kaldi.
This work is a speaker identification system based on the Kaldi VoxCeleb v2 example. It enhances that recipe by replacing the nnet3-based neural network with one implemented in the PyTorch machine learning framework, which allows easier and more dynamic changes to the network architecture.
In addition to speaker identification with VoxCeleb, this project also adds the ability to run diarization tasks.
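Because the network is plain PyTorch code, changing its topology is a matter of editing a Python class. As a rough illustration only, here is a minimal sketch of an x-vector-style TDNN embedding network; it is not the exact architecture used in this repository, and the layer sizes, pooling step, and speaker count are placeholders:

```python
import torch
import torch.nn as nn


class XVectorSketch(nn.Module):
    """Illustrative x-vector-style TDNN; not the exact model of this repo."""

    def __init__(self, feat_dim=30, embed_dim=512, num_speakers=6000):
        super().__init__()
        # Frame-level TDNN layers expressed as dilated 1-D convolutions over time.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(1500),
        )
        # Segment-level layers applied after statistics pooling (mean + std over time).
        self.segment = nn.Linear(2 * 1500, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, feats):
        # feats: (batch, feat_dim, num_frames), e.g. MFCC features extracted with Kaldi.
        h = self.frame_layers(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        embedding = self.segment(stats)                          # speaker embedding (x-vector)
        logits = self.classifier(torch.relu(embedding))          # per-speaker scores for training
        return embedding, logits
```

Swapping in a different pooling layer (for example an attention-based one) or a different loss only requires editing such a class instead of rewriting nnet3 configuration files.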
Make sure the requirements listed in What you need are met. Then follow the steps described in How to Install.
## What you need

Before you can run this, make sure you have the required tools available; a quick Python check is shown below the list. You need:
- A CUDA-capable Nvidia graphics card with more than 2 GB of memory
- A current Linux distribution on an x86 computer
- A fully operational installation of the Kaldi framework
- PyTorch with CUDA support
- A copy of the VoxCeleb v1 and VoxCeleb v2 datasets
- A copy of the MUSAN dataset
- sox and ffmpeg for audio handling
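To quickly verify the CUDA- and audio-tool-related requirements, a small Python check along these lines can help (this snippet is just a convenience sketch and is not part of the project):

```python
import shutil
import torch

# Check that PyTorch was built with CUDA support and sees a GPU.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB memory")

# Check that sox and ffmpeg are on the PATH.
for tool in ("sox", "ffmpeg"):
    print(f"{tool}: {'found' if shutil.which(tool) else 'NOT found'}")
```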
## How to Install

Follow these steps to get this project running. If something does not work or is unclear, please open an issue and ask; I'll be happy to help:
- Make sure Kaldi and CUDA are installed and work correctly.
- Download this repo: `git clone https://github.com/theScrabi/kaldi_voxceleb_pytorch`
- Enter the root directory of the project: `cd kaldi_voxceleb_pytorch`
- Create a new Python virtual environment: `virtualenv venv`
- Activate the virtual environment: `source venv/bin/activate`
- Install the required Python packages: `pip install -r requirements.txt`
- Edit the file `sid/path.sh` and set the `KALDI` variable to the path of your Kaldi installation (e.g. `KALDI=/opt/kaldi`).
- If you want to use diarization, you also need to edit `diarization/path.sh` and set the `KALDI` variable there.
- Enter the `diarization` directory and run `./install.sh`. This will set up the required symlinks.
To run training and testing, use the `run.sh` scripts in the `sid` folder for speaker identification or in the `diarization` folder for diarization.
For speaker identification, please read the `README.md` inside the `sid` folder. For diarization, read the `README.md` in the `diarization` folder.
The purpose of this work was to see whether Angular Softmax with cosine distance comparison can enhance end-to-end speaker identification and diarization, and whether it could eventually outperform and replace the additional use of PLDA. It was also examined whether an attention layer can further enhance speaker identification and diarization.
This work was part of my bachelor's thesis.
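For illustration, scoring a trial with cosine similarity instead of a PLDA back-end can be as simple as the sketch below; the embedding dimension, the threshold, and all names here are placeholders and not taken from this repository:

```python
import torch
import torch.nn.functional as F


def cosine_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Cosine similarity between two speaker embeddings."""
    return F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()


# Hypothetical usage: accept a trial as "same speaker" when the score
# exceeds a threshold tuned on a development set.
emb_a = torch.randn(512)   # placeholder for an extracted embedding
emb_b = torch.randn(512)
threshold = 0.5            # placeholder; tune on held-out trials
print("same speaker:", cosine_score(emb_a, emb_b) > threshold)
```

The underlying idea is that an angular-margin softmax optimizes class separation directly in terms of angles, so the resulting embeddings should be comparable with plain cosine similarity, which is what raises the question of whether a separate PLDA back-end is still needed.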
- SphereFace: The original implementation of angular margin based softmax for face recognition.
- SpeechBrain: An all-in-one speech toolkit built on PyTorch.
- pyannote.metrics: A framework for diarization evaluation and error analysis.
- Kaldi with TensorFlow DNN: A TensorFlow implementation of the x-vector topology on top of Kaldi.