I realise Thorsten's tutorial is recommended as the best place to start with voice cloning. I've watched his video, and it's good, but I would like to dig deeper into how training VITS actually works.
Does anyone know of a more thorough tutorial for this? In particular:
How long voice samples can be,
How aligned they have to be with the text,
The sort of punctuation that's supported,
The computer specs you need to train efficiently,
The amount of audio you need to get a good clone,
Which checkpoints to use for which voice,
More direct information on the dataset and cloning procedure.
For anyone interested, this was a useful video I found on voice cloning for StyleTTS:
https://www.youtube.com/watch?v=5-Dk3ooxn2Q
If no such VITS tutorial exists, I will figure it out myself and post an article and video tutorial in the next couple of months.