Minimized-and-Noised-Phase harmonic source Singing Voice Conversion [v2]
Language: English | 简体中文 | 한국어* | 日本語
(*: machine translation. PR is welcome to native translations!)
MNP-SVC is an open-source singing voice conversion project dedicated to the development of free AI voice changer software that can be popularized on personal computers. These aims are inherited from the original repository (DDSP-SVC).
Compared with the original, it does not use an external vocoder or diffusion models, improves noise robustness thanks to (DP)WavLM, changes the handling of unvoiced pitch, and (subjectively, in my opinion) produces better results. There are also many other improvements (e.g. changed losses, a fast interpolation method, a pretraining method that reduces source-speaker feature leakage) and implementation additions (e.g. easy intonation-curve tweaking).
This repo focuses on the following improvements:
- learning multiple speakers at once in a single model
- reducing leakage of the source speaker's characteristics and fitting the output to the target speaker's
- keeping the model size small
- producing more natural and smoother output
- keeping the computational cost low
MNP stands for: Minimized-and-Noised-Phase harmonic source.
After some experimentation, I changed the harmonic source signal of the synthesizer from a linear-phase sinc to a minimum-phase windowed sinc, based on the assumption that the unnatural, slightly unappealing quality of the results came from the phase being linear. (That may also have made the filters harder to learn.)
I think this is appropriate because all naturally occurring sounds, including the human voice, are minimum phase.
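For illustration, here is a minimal NumPy sketch (not the repository's actual code) of the standard homomorphic (real-cepstrum) method for turning a linear-phase windowed sinc into a minimum-phase filter with the same magnitude response; the filter length and cutoff are arbitrary example values:

```python
import numpy as np

def minimum_phase_fir(h: np.ndarray, n_fft: int = 4096) -> np.ndarray:
    """Convert an FIR filter to its minimum-phase counterpart via the real cepstrum:
    keep the magnitude spectrum, drop the original phase, and rebuild a causal,
    minimum-phase impulse response."""
    H = np.fft.fft(h, n_fft)
    cep = np.fft.ifft(np.log(np.maximum(np.abs(H), 1e-12))).real
    # Fold the anti-causal cepstrum onto the causal side (minimum-phase window).
    w = np.zeros(n_fft)
    w[0] = 1.0
    w[1:n_fft // 2] = 2.0
    w[n_fft // 2] = 1.0
    h_min = np.fft.ifft(np.exp(np.fft.fft(cep * w))).real[:len(h)]
    return h_min

# Example: a linear-phase windowed sinc (Hann window), then its minimum-phase version.
taps, cutoff = 257, 0.18                       # cutoff as a fraction of the sample rate
n = np.arange(taps) - (taps - 1) / 2
h_lin = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hanning(taps)
h_min = minimum_phase_fir(h_lin)               # energy concentrated at the start, no pre-ringing
```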
I also improved the acoustic model: the Noised-Phase Harmonic Source (a name I coined; I'm not a scholar).
(For now, the Noised-Phase feature is unused.)
The model structure differs from DDSP-SVC in the following ways (a rough sketch follows this list):
- ConvNeXt-V2-like convolution layers with a GLU-ish structure
- a speaker embedding (which you can optionally disable)
- a conv layer after combining F0, phase and speaker embedding
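For illustration only, here is a rough PyTorch sketch of a 1-D ConvNeXt-V2-style block with a GLU-like gate. The layer names, expansion ratio and sigmoid gating are my assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization from ConvNeXt-V2 (channel-last, 1-D)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):                                   # x: (B, T, C)
        gx = torch.norm(x, p=2, dim=1, keepdim=True)        # per-channel energy over time
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)    # relative channel importance
        return self.gamma * (x * nx) + self.beta + x

class ConvNeXtV2GLUBlock(nn.Module):
    """Depthwise conv -> LayerNorm -> gated (GLU-ish) pointwise MLP with GRN, plus residual."""
    def __init__(self, dim: int, expand: int = 2, kernel_size: int = 7):
        super().__init__()
        self.dwconv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pw_in = nn.Linear(dim, dim * expand * 2)        # doubled width for the gate path
        self.grn = GRN(dim * expand)
        self.pw_out = nn.Linear(dim * expand, dim)

    def forward(self, x):                                    # x: (B, T, C)
        res = x
        x = self.dwconv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm(x)
        h, g = self.pw_in(x).chunk(2, dim=-1)                # value and gate paths
        x = self.grn(h * torch.sigmoid(g))                   # GLU-style gating, then GRN
        x = self.pw_out(x)
        return res + x
```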
Disclaimer: Please make sure to only train MNP-SVC models with legally obtained, authorized data, and do not use these models or any audio they synthesize for illegal purposes. The author of this repository is not responsible for any infringement, fraud or other illegal acts caused by the use of these model checkpoints and audio.
Simply double-click `dl-models.bat` and `launch.bat`. On the first run, these scripts will:
- download the pre-trained models if they do not exist (`dl-models.bat`)
- download MicroMamba
- extract the downloaded archive and create a portable Python environment
- install the required packages into the portable Python environment
From the next time on, you can just launch the script and use its console.
We recommend first installing PyTorch from the official website, then run:
pip install -r requirements/main.txt
NOTE: I have only tested the code with Python 3.11.9/3.12.2 (Windows) + CUDA 12.1 + PyTorch 2.4.1; dependencies that are too new or too old may not work.
- Feature Encoders:
  - Download the pre-trained DPWavLM encoder and put it under the `models/pretrained/dphubert` folder.
  - Download the pre-trained wespeaker-voxceleb-resnet34-LM speaker embedding extractor (ported to pyannote.audio; both `pytorch_model.bin` and `config.yaml`) and put them under the `models/pretrained/pyannote.audio/wespeaker-voxceleb-resnet34-LM` folder,
    - or open a config (`configs/combsub-mnp.yaml`, or whichever one you want to use) and change the `data.spk_embed_encoder_ckpt` value to `pyannote/wespeaker-voxceleb-resnet34-LM`; this downloads the model from the Hugging Face model hub automatically (a config snippet follows this list).
- Pitch extractor:
  - Download the pre-trained RMVPE extractor and unzip it into the `models/pretrained/` folder.
- MNP-SVC pre-trained model:
  - Download the pre-trained weights and put them under the `models/pretrained/mnp-svc/` folder.
  - A model with only a few pre-trained conv layers is also available. This model was not trained on voice characteristics, speaker distributions, etc.
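For reference, the automatic-download option for the speaker embedding extractor corresponds to a config entry roughly like the following (an illustrative excerpt only, not a complete config; check your own config file for the surrounding keys):

```yaml
data:
  spk_embed_encoder_ckpt: pyannote/wespeaker-voxceleb-resnet34-LM
```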
Put all the dataset (audio clips) in the directory below: `dataset/audio`.
NOTE: Multi-speaker training is supported. If you want to train a multi-speaker model, the audio folders need to be named with a positive integer (the speaker id) and a friendly name, separated by an underscore "_"; the directory structure looks like this:
# the 1st speaker
dataset/audio/1_first-speaker/aaa.wav
dataset/audio/1_first-speaker/bbb.wav
...
# the 2nd speaker
dataset/audio/2_second-speaker/aaa.wav
dataset/audio/2_second-speaker/bbb.wav
...
A single-speaker directory structure is also supported, which looks like this:
# single speaker dataset
dataset/audio/aaa.wav
dataset/audio/bbb.wav
...
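To illustrate the naming convention only (this is not the project's actual loader), a multi-speaker folder name maps to a speaker id and a display name like this:

```python
from pathlib import Path

def parse_speaker_dir(path: str) -> tuple[int, str]:
    """Split a dataset folder name like '1_first-speaker' into (speaker_id, name)."""
    stem = Path(path).name
    spk_id, _, name = stem.partition("_")
    return int(spk_id), name

print(parse_speaker_dir("dataset/audio/1_first-speaker"))   # (1, 'first-speaker')
```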
Then run
python sortup.py -c configs/combsub-mnp-san.yaml
to divide your dataset into "train" and "test" sets automatically. If you want to adjust some parameters, run python sortup.py -h for help.
After that, run:
python preprocess.py -c configs/combsub-mnp-san.yaml
python train.py -c configs/combsub-mnp-san.yaml
You can safely interrupt training; running the same command line again will resume it.
You can also fine-tune the model: interrupt training, then re-preprocess with the new dataset or change the training parameters (batch size, learning rate, etc.), and run the same command line again.
# check the training status using tensorboard
tensorboard --logdir=exp
Test audio samples will be visible in TensorBoard after the first validation.
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange> -into <intonation curve> -id <speaker_id>
keychange: pitch shift in semitones
intonation curve: 1.0 follows the original pitch (default); smaller values flatten the pitch (calmer), larger values make it more dynamic (more excited)
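For example (the file names here are placeholders), raising the pitch by two semitones and slightly flattening the intonation for speaker 1 would look like:
python main.py -i input.wav -m model_best.pt -o output.wav -k 2 -into 0.8 -id 1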
For other options such as the f0 extractor and response threshold, see:
python main.py -h
Start a simple GUI with the following command:
python gui.py
The front-end uses techniques such as a sliding window, cross-fading, SOLA-based splicing and contextual semantic referencing, achieving sound quality close to non-real-time synthesis with low latency and low resource usage.
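To give an idea of what SOLA-based splicing does, here is a minimal NumPy sketch (not the GUI's actual code): the start of each new chunk is aligned to the tail of the previous output by cross-correlation, then cross-faded over the overlap.

```python
import numpy as np

def sola_splice(prev_tail: np.ndarray, new_chunk: np.ndarray, search: int = 256) -> np.ndarray:
    """Align new_chunk to prev_tail (SOLA) and cross-fade so the chunks join smoothly.
    new_chunk must be at least len(prev_tail) + search samples long."""
    n = len(prev_tail)
    # Pick the offset whose segment best correlates with the previous tail.
    scores = [np.dot(prev_tail, new_chunk[k:k + n]) / (np.linalg.norm(new_chunk[k:k + n]) + 1e-8)
              for k in range(search)]
    k = int(np.argmax(scores))
    aligned = new_chunk[k:].copy()
    # Linear cross-fade over the overlap region.
    fade = np.linspace(0.0, 1.0, n)
    aligned[:n] = prev_tail * (1.0 - fade) + aligned[:n] * fade
    return aligned
```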
Execute following command:
python -m tools.export_onnx -i <model_num.pt>
The model is exported to the same directory as the input file, with a name like model_num.onnx.
Other options can be found with python -m tools.export_onnx -h.
The exported ONNX files can be used in the same way for both real-time and non-real-time VC. For now, only CPU inference is supported.
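As a minimal sketch of CPU inference with the exported file (using onnxruntime; the input and output names depend on the exported graph, so inspect them rather than assuming):

```python
import onnxruntime as ort

# Load the exported model on CPU and list its input/output names.
sess = ort.InferenceSession("model_num.onnx", providers=["CPUExecutionProvider"])
print([i.name for i in sess.get_inputs()])
print([o.name for o in sess.get_outputs()])
```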
- Export to ONNX
- Make WebUI
- Many codes and ideas are based on these works, thanks a lot!