Training vs tensorboard metrics #211
We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.
@jeremy110 Thank you for your response! I’m honestly a bit hooked on watching the progress as it keeps going down, so I can’t seem to stop checking in :-) Currently at 68 hours. I’m planning to create an 8-10 hour audio dataset for the next training session. Could you suggest what kind of text data I should gather for it? So far, I’ve used random articles and some ChatGPT-generated data, but I’ve heard that people sometimes read books, for example. Is there perhaps a dataset available with quality English sentences that covers a variety of language phenomena? I tried searching for one, but without results.
@smlkdev I haven’t specifically researched text types. My own dataset was professionally recorded, with sentences that resemble reading books. I’m not very familiar with English datasets—are you planning to train in English?
This is my first attempt at ML/training/voice cloning, and I decided to use English. I briefly read the Thai thread, and it was way too complex for me to start with. Your training was 32 hours long, and to me (I'm not an expert) the inferred voice matched the original :) That's really nice. Is that the voice that had 8-10 hours of audio, as you mentioned earlier?
Yes, that's correct. I tried both single-speaker and multi-speaker models, and the total duration is around 8-10 hours. If this is your first time getting into it, I recommend you try F5-TTS. There are a lot of people in the forums who have trained their own models, and some even wrote a Gradio interface, which is very convenient.
@jeremy110 thank you for your responses. Is F5-TTS better than MeloTTS in terms of quality? I just realized that my cloned MeloTTS voice doesn’t add breaks between sentences. I have to add them manually—by splitting the text into sentences, breaking it down into smaller parts, generating each part, and then merging everything back together after adding pauses. It can be automated, of course, but it's still a bit of work. (I was focusing on single sentences before, and I liked the quality.)
In terms of quality, I think F5-TTS is quite good. You can try it out on the Huggingface demo. The pauses within sentences mainly depend on your commas (","). The program adds a space after punctuation to create a pause. However, if the audio files you trained on have very little silence before and after the speech, the generated audio will also have little silence. Of course, you can add the pauses manually, but you could also address it by adjusting the training data.
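For what it's worth, the manual split-generate-merge workflow described above is easy to script once the per-sentence clips exist. A minimal sketch, assuming all clips were synthesized at the same sample rate; the file names and the 0.4 s pause are placeholders, not values from this thread:

```python
import numpy as np
import soundfile as sf

def merge_with_pauses(wav_paths, out_path, pause_sec=0.4):
    """Concatenate per-sentence clips, inserting a silence gap between them."""
    pieces = []
    sample_rate = None
    for path in wav_paths:
        audio, sr = sf.read(path)
        if sample_rate is None:
            sample_rate = sr
        assert sr == sample_rate, "all clips must share the same sample rate"
        pieces.append(audio)
        # silence to insert after this sentence
        pieces.append(np.zeros(int(pause_sec * sr), dtype=audio.dtype))
    # drop the trailing gap before writing
    sf.write(out_path, np.concatenate(pieces[:-1]), sample_rate)

# hypothetical usage: one clip was generated per sentence beforehand
merge_with_pauses(["sent_0.wav", "sent_1.wav", "sent_2.wav"], "merged.wav")
```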
@smlkdev I am training the MeloTTS model with sentiment data, but I couldn't get the TensorBoard graphs to work. Can you share your sample code?
These are written in the train.log file. Training is still ongoing. Are these important?
2024-11-19 09:22:10,339 example ERROR enc_p.language_emb.weight is not in the checkpoint
2024-11-19 09:22:10,340 example ERROR emb_g.weight is not in the checkpoint
I used the simplest cmd possible:
@jeremy110 Hello, I would like to inquire about the data preparation process when training on multiple speakers. Is it necessary for each speaker to have a comparable amount of data? For instance, if Speaker A has 10 hours of audio and Speaker B only has 1 hour, is it possible to create a good model, or does Speaker B also require approximately 10 hours of audio? Thank you
@manhcuong17072002 Hello~
@jeremy110 Oh, if that's the case, that's wonderful. Collecting data and training the model will become much easier with your idea. So, when training, you must have used many speaker IDs, right? And do you find their quality sufficient for deployment in a real-world environment? I'm really glad to hear your helpful feedback. Thank you very much!
Yes, there are about 15 speakers. Of course, if you have enough people, you can continue to increase the number. After 10 hours, the voice quality is quite close, but if you want better prosody, you might need more speakers and hours. Compared to the TTS systems I've heard, the voice quality is somewhat above average, but when it comes to deployment, you also need to consider inference time. For that, MeloTTS is quite fast.
@jeremy110 Thank you for the incredibly helpful information. Let me summarize a few points:
However, I've experimented with various TTS models and noticed that if the text isn't broken down into smaller chunks, the generated speech quality degrades towards the end of longer passages. Have you tested this with MeloTTS? If so, could you share your experimental process? Thank you so much.
@manhcuong17072002 Normally, during training, long audio files are avoided to prevent GPU OOM (Out of Memory) issues. Therefore, during inference, punctuation marks are typically used to segment the text, ensuring that each sentence is closer to the length used during training for better performance. MeloTTS performs this segmentation based on punctuation during inference, and then concatenates the individual audio files after synthesis.
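As a generic illustration of that idea (not MeloTTS's actual splitting code), text can be cut at punctuation and packed into chunks whose length stays close to the utterances seen during training; the character limit below is an arbitrary assumption:

```python
import re

def split_by_punct(text, max_chars=120):
    """Cut text at sentence punctuation, then pack the pieces into
    chunks no longer than max_chars for separate synthesis."""
    # keep the punctuation attached to the piece that precedes it
    sentences = [s.strip() for s in re.split(r"(?<=[.!?;])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

# each chunk would be synthesized on its own, then the audio concatenated
print(split_by_punct("First sentence. Second one is a bit longer! And a third?"))
```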
@jeremy110 I'm sorry, but I suddenly have a question about training on a multi-speaker dataset. Can the model make Speaker A pronounce words that appear in other speakers' data but not in A's? If not, dividing the dataset into multiple speakers would be pointless, and the model would not be able to cover the entire vocabulary of a language. Have you tried this before, and what are your thoughts on it? Thank you.
@manhcuong17072002
@jeremy110 Thanks for your useful information.
@jeremy110 Even if the latest messages were not addressed directly to me, I want to thank you as well; you are giving me a fresh point of view on how to look at the dataset.
I am training on this dataset: https://huggingface.co/datasets/reach-vb/jenny_tts_dataset. I set 10 epochs in the config settings, but the train.log file already shows epoch 30 and it is still training. What should I fix?
2024-11-24 06:22:09,477 example INFO Train Epoch: 31 [53%]
2024-11-24 06:22:09,477 example INFO [2.2495381832122803, 3.04087495803833, 9.244926452636719, 18.031190872192383, 1.9427911043167114, 2.0941860675811768, 100200, 0.0002988770366855993]
2024-11-24 06:23:10,591 example INFO Train Epoch: 31 [59%]
2024-11-24 06:23:10,592 example INFO [2.145620107650757, 3.066821336746216, 9.333406448364258, 19.36675453186035, 2.052659511566162, 2.4836974143981934, 100400, 0.0002988770366855993]
2024-11-24 06:24:11,103 example INFO Train Epoch: 31 [65%]
2024-11-24 06:24:11,104 example INFO [2.5389487743377686, 2.33595871925354, 7.182312488555908, 19.055206298828125, 1.9395025968551636, 1.7028437852859497, 100600, 0.0002988770366855993]
How do the graphs look to you? Can you interpret them?
@kadirnar hello~ I’m not very experienced in the TTS field, but for MeloTTS, it mainly uses loss functions common in GANs, involving the Discriminator and Generator. Personally, I check the loss/g/total and loss/g/mel to assess whether the training results are as expected. From your graphs, since there is no loss/g/total, I cannot judge the result. From my own training, the values typically range from 45 to 60, depending on your dataset.
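If loss/g/total never shows up in the TensorBoard UI, one way to confirm whether it was written at all is to read the event files directly with TensorBoard's EventAccumulator. A minimal sketch; the log directory path is an assumption, not taken from this thread:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# point this at the directory that holds the events.out.tfevents.* files
log_dir = "logs/example"  # hypothetical path
ea = EventAccumulator(log_dir)
ea.Reload()

print("available scalar tags:", ea.Tags()["scalars"])

tag = "loss/g/total"
if tag in ea.Tags()["scalars"]:
    events = ea.Scalars(tag)  # list of (wall_time, step, value) records
    print(f"{tag}: {len(events)} points, last value {events[-1].value:.2f} at step {events[-1].step}")
```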
@jeremy110 What do you think if I have a small dataset and I combine the audios in the dataset to create a larger one? For example, initially we have 2 audios A and B. We combine the 2 audios as follows: Audio A + Audio B = Audio C, from which we get 3 audios A, B, C. Do you think this will significantly affect the training results compared to the original small dataset? Thank you.
This approach can indeed enhance the data and may provide a slight improvement, but several points need to be considered. Since MeloTTS uses BERT to extract feature vectors, if we randomly concatenate the text of two audio files and then extract feature vectors, can it still effectively represent the prosody of the text? Additionally, you can refer to the Emilia-Dataset (https://huggingface.co/datasets/amphion/Emilia-Dataset) used by F5-TTS. It has a process for generating the final training data, which you might consider using as a method to collect data.
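A rough sketch of the A + B = C idea under the caveat above: join the two clips with a short silence gap and merge the transcripts into one training line. The file names, gap length, and the simple period-joining rule are illustrative assumptions:

```python
import numpy as np
import soundfile as sf

def concat_pair(wav_a, text_a, wav_b, text_b, out_wav, gap_sec=0.3):
    """Build a new training example C from clips A and B."""
    audio_a, sr_a = sf.read(wav_a)
    audio_b, sr_b = sf.read(wav_b)
    assert sr_a == sr_b, "resample first if the clips differ in sample rate"
    gap = np.zeros(int(gap_sec * sr_a), dtype=audio_a.dtype)
    sf.write(out_wav, np.concatenate([audio_a, gap, audio_b]), sr_a)
    # the merged transcript still has to read naturally for the BERT features
    return f"{text_a.rstrip('. ')}. {text_b}"

text_c = concat_pair("a.wav", "Hello there.", "b.wav", "How are you?", "c.wav")
```

Whether the BERT-derived prosody features still make sense for the stitched sentence is, as noted, the main open question with this kind of augmentation.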
Will my training yield better results over time? Currently, the training took about 9 hours.
I have 1500 wav samples, with a total audio length of approximately 2 hours.
What other metrics should I pay attention to in TensorBoard?