Training vs tensorboard metrics #211

Open
smlkdev opened this issue Nov 8, 2024 · 26 comments

@smlkdev

smlkdev commented Nov 8, 2024

Will my training yield better results over time? The training has taken about 9 hours so far.
I have 1,500 wav samples, with a total audio length of approximately 2 hours.

[screenshot: TensorBoard training curves]

What other metrics should I pay attention to in TensorBoard?

@smlkdev

smlkdev commented Nov 9, 2024

Update after ~34 h:
A little improvement is visible, but I'm not sure whether I should keep training longer, since the curves are flattening.

[screenshots: TensorBoard training curves after ~34 h]

@jeremy110

We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.

@smlkdev

smlkdev commented Nov 10, 2024

> We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.

@jeremy110 Thank you for your response! I’m honestly a bit hooked on watching the progress as it keeps going down, so I can’t seem to stop checking in :-)

Currently at 68 hours.

[screenshot: TensorBoard training curves at 68 hours]

I'm planning to create an 8-10 hour audio dataset for the next training session. Could you suggest what kind of text data I should gather for it? So far, I've used random articles and some ChatGPT-generated data, but I've heard that people sometimes read books, for example. Is there perhaps a dataset available with quality English sentences that covers a variety of language phenomena? I tried to find one, but with no luck.

@jeremy110

@smlkdev
Basically, this training can be kept short since it's just a fine-tuning session; no need to make it too long. Here's my previous tensorboard log for your reference (#120 (comment)).

I haven’t specifically researched text types. My own dataset was professionally recorded, with sentences that resemble reading books. I’m not very familiar with English datasets—are you planning to train in English?

@smlkdev

smlkdev commented Nov 11, 2024

This is my first attempt at ML/training/voice cloning, and I decided to use English. I briefly read the Thai thread, but it was way too complex for me to start with.

Your training was 32 hours long, and to me (I'm no expert) the inferred voice matched the original :) That's really nice. Is that the voice that had 8-10 hours of audio, as you mentioned earlier?

@jeremy110

Yes, that's correct. I tried both single-speaker and multi-speaker models, and the total duration is around 8-10 hours.

If this is your first time getting into it, I recommend you try F5-TTS. There are a lot of people in the forums who have trained their own models, and some even wrote a Gradio interface, which is very convenient.

@smlkdev

smlkdev commented Nov 12, 2024

@jeremy110 thank you for your responses.

Is F5-TTS better than MeloTTS in terms of quality?

I just realized that my cloned MeloTTS voice doesn't add breaks between sentences. I have to add them manually: splitting the text into sentences, breaking it into smaller parts, generating each part, and then merging everything back together after adding pauses. It can be automated, of course, but it's still a bit of work. (I was focusing on single sentences before, and I liked the quality.)
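
In case it helps anyone else, here is a minimal sketch of that split-generate-merge workaround. The `synthesize()` helper is hypothetical (wrap whatever TTS call you use there), and it assumes all generated clips share one sample rate:

```python
import re

import numpy as np
import soundfile as sf


def synthesize(sentence: str, out_path: str) -> None:
    """Hypothetical wrapper around your TTS inference call (e.g. MeloTTS)."""
    raise NotImplementedError


def tts_with_pauses(text: str, out_path: str, pause_s: float = 0.4) -> None:
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, sample_rate = [], None
    for i, sentence in enumerate(sentences):
        part_path = f"part_{i}.wav"
        synthesize(sentence, part_path)
        audio, sr = sf.read(part_path)
        sample_rate = sample_rate or sr
        chunks.append(audio)
        chunks.append(np.zeros(int(pause_s * sr), dtype=audio.dtype))  # pause between sentences
    sf.write(out_path, np.concatenate(chunks), sample_rate)
```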

@jeremy110

jeremy110 commented Nov 13, 2024

In terms of quality, I think F5-TTS is quite good. You can try it out on the Huggingface demo.

The pauses within sentences mainly depend on your commas (","). The program adds a space after punctuation to create a pause. However, if the audio files you trained on have very little silence before and after the speech, the generated audio will also have little silence. Of course, you can add the pauses manually, but you could also address it by adjusting the training data.
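
If you go the training-data route, here is a minimal sketch of normalizing the silence around each clip (trim, then re-pad with a fixed margin); the `top_db` threshold and 50 ms margin are assumptions you would tune for your recordings:

```python
import librosa
import numpy as np
import soundfile as sf


def repad_silence(in_path: str, out_path: str, margin_s: float = 0.05, top_db: int = 30) -> None:
    audio, sr = librosa.load(in_path, sr=None)                # keep the original sample rate
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)   # strip leading/trailing silence
    pad = np.zeros(int(margin_s * sr), dtype=trimmed.dtype)   # fixed silence margin on both sides
    sf.write(out_path, np.concatenate([pad, trimmed, pad]), sr)
```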

@kadirnar

@smlkdev I am training the MeloTTS model with sentiment data, but I couldn't get the TensorBoard graphs to work. Can you share your sample code?

@kadirnar

These are written in the train.log file. Training is still ongoing. Are these important?

2024-11-19 09:22:10,339	example	ERROR	enc_p.language_emb.weight is not in the checkpoint
2024-11-19 09:22:10,340	example	ERROR	emb_g.weight is not in the checkpoint

@smlkdev

smlkdev commented Nov 19, 2024

> @smlkdev I am training the MeloTTS model with sentiment data, but I couldn't get the TensorBoard graphs to work. Can you share your sample code?

I used the simplest command possible:

tensorboard --logdir PATH, where PATH is the logs folder inside ...MeloTTS/melo/logs/checkpoint_name (pointing to the folder with the checkpoints).
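
If the graphs still don't show up, you can also read the scalars directly from the event files. Here is a minimal sketch using TensorBoard's `EventAccumulator`; the log directory is whatever you would pass to `--logdir`, and it assumes the `loss/g/total` tag mentioned in this thread exists in your run:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

logdir = "MeloTTS/melo/logs/checkpoint_name"   # same folder you would pass to --logdir
acc = EventAccumulator(logdir)
acc.Reload()                                   # parse the event files on disk

print(acc.Tags()["scalars"])                   # list the scalar tags that were logged
for event in acc.Scalars("loss/g/total"):      # assumes this tag exists in your run
    print(event.step, event.value)
```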

@manhcuong17072002

@jeremy110 Hello, I would like to inquire about the data preparation process when training on multiple speakers. Is it necessary for each speaker to have a comparable amount of data? For instance, if Speaker A has 10 hours of audio and Speaker B only has 1 hour, is it possible to create a good model, or does Speaker B also require approximately 10 hours of audio? Thank you

@jeremy110

@manhcuong17072002 Hello~
In my training, some speakers had 1 or 2 hours of audio, while others had 30 minutes, and in the end, there were about 10 hours of total data. I was able to train a decent model, but for speakers with less data, their pronunciation wasn't as accurate.

@manhcuong17072002

@jeremy110 Oh, if that's the case, that's wonderful. Collecting data and training the model will become much easier with your idea. So, when training, you must have used many speaker IDs, right? And do you find their quality sufficient for deployment in a real-world environment? I'm really glad to hear your helpful feedback. Thank you very much!

@jeremy110

@manhcuong17072002

Yes, there are about 15 speakers. Of course, if you have enough people, you can continue to increase the number. After 10 hours, the voice quality is quite close, but if you want better prosody, you might need more speakers and hours.

From the TTS systems I've heard, the voice quality is above average, but when it comes to deployment, you also need to consider inference time. For that, MeloTTS is quite fast.

@manhcuong17072002

@jeremy110 Thank you for the incredibly helpful information. Let me summarize a few points:

  • Training data: 8 to 10 hours of audio is sufficient to train the MeloTTS model, and more data is always welcome.
  • Number of speakers: 15. 30 minutes to 2 hours of data per speaker yields good results. More data per speaker leads to better results.
  • Deployment speed: MeloTTS is relatively fast to deploy.

However, I've experimented with various TTS models and noticed that if the text isn't broken down into smaller chunks, the generated speech quality degrades towards the end of longer passages. Have you tested this with MeloTTS? If so, could you share your experimental process? Thank you so much.

@jeremy110

@manhcuong17072002
You're welcome, your conclusion is correct.

Normally, during training, long audio files are avoided to prevent GPU OOM (Out of Memory) issues. Therefore, during inference, punctuation marks are typically used to segment the text, ensuring that each sentence is closer to the length used during training for better performance. MeloTTS performs this segmentation based on punctuation during inference, and then concatenates the individual audio files after synthesis.

@manhcuong17072002

@jeremy110 I'm sorry, but I suddenly have another question about training on a multi-speaker dataset. Can Speaker A's voice pronounce words that appear only in other speakers' data and not in A's? If not, splitting the dataset across multiple speakers would be pointless, and the model would not be able to cover the entire vocabulary of a language. Have you tried this before, and what are your thoughts on it? Thank you.

@jeremy110

@manhcuong17072002
If we consider 30 minutes of audio, assuming each word takes about 0.3 seconds, there would be around 5000–6000 words. These words would then be converted into phoneme format, meaning they would be broken down into their phonetic components for training. With 6000 words, the model would learn most of the phonemes. However, when a new word is encountered, it will be broken down into the phonemes it has already learned. I haven't done rigorous testing, but in my case, the model is able to produce similar sounds.
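
As a rough sanity check on that coverage argument, here is a minimal sketch that counts how many distinct phonemes a transcript file covers, assuming the `g2p_en` package for English grapheme-to-phoneme conversion (MeloTTS's own phonemizer may differ) and a file with one sentence per line:

```python
from g2p_en import G2p   # pip install g2p-en

g2p = G2p()
phonemes_seen = set()

with open("transcripts.txt", encoding="utf-8") as f:   # one sentence per line (assumed format)
    for line in f:
        # g2p() returns ARPAbet tokens plus spaces/punctuation; keep only phoneme tokens.
        phonemes_seen.update(p for p in g2p(line.strip()) if p and p[0].isalpha())

print(f"distinct phonemes covered: {len(phonemes_seen)}")
```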

@manhcuong17072002

@jeremy110 Thanks for your useful information

@smlkdev

smlkdev commented Nov 22, 2024

@jeremy110 Even though the latest messages were not addressed directly to me, I want to thank you as well; you are giving me a fresh perspective on how to look at the dataset.

  1. Using your suggestion, I'm currently testing how F5-TTS works and what results it produces. Here are my charts (same dataset as in this thread). Should I stop fine-tuning now that the LR has dropped so low, or could it still improve meaningfully?

[screenshot: F5-TTS fine-tuning charts]

  2. How should I choose the right/best checkpoint? Is it a matter of listening to the generated audio, or should I rely on the loss from the chart, i.e., lower = better? Of course, it's hard to pick the exact checkpoint when I'm saving only every 2.5k steps.

  3. If I wanted to create a short dataset from scratch, how many times should each word appear in the audio recordings for it to be meaningful? Would just once be enough? I imagine that if I were creating a dataset to perform exceptionally well in a specific field, like "cooking," I could build one using words that are more frequent in that domain, such as "flour," "knife," or "tomato." I'm also guessing that the sentences in the dataset should include the most commonly used English words along with the niche-specific ones, so the model performs as well as possible in that area (though it might struggle in something like the "aviation" niche). A rough coverage-counting sketch follows below.
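
For that kind of domain-focused dataset, here is a minimal sketch of counting how often each target word actually appears in your candidate sentences; the file name and the small domain word list are just placeholders for illustration:

```python
import re
from collections import Counter

domain_words = {"flour", "knife", "tomato"}     # illustrative domain vocabulary

counts = Counter()
with open("candidate_sentences.txt", encoding="utf-8") as f:   # one sentence per line (assumed)
    for line in f:
        counts.update(re.findall(r"[a-z']+", line.lower()))

for word in sorted(domain_words):
    print(f"{word}: {counts[word]} occurrence(s)")

missing = [w for w in domain_words if counts[w] == 0]
print("domain words never covered:", missing or "none")
```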

@jeremy110

jeremy110 commented Nov 23, 2024

@smlkdev

  1. Basically, fine-tuning usually requires fewer than 10 epochs; the settings others have shared can be used as a reference.
    [image: reference fine-tuning settings]
  2. Typically, a model trained for 10 epochs is already good for use, and the last checkpoint is usually the best.
  3. If you're training a new language, a small dataset is not ideal; it still needs to be of a certain length to achieve better learning.

Since I'm not sure whether you're training a new language: if you are, I tested small datasets and found that around 20–40 hours of audio is necessary, with each clip lasting 2–10 seconds. It can be either multiple speakers or a single speaker, but zero-shot performance is poor with a single speaker. Additionally, I tested with 350 hours of audio and about 300 speakers, and the zero-shot performance was excellent; using my own voice as a reference, the synthesized voice was very close to my own. Finally, for F5-TTS to generate good audio, the reference audio is crucial. In my tests, 6–10 second clips worked best. So, for your third question, you can include these specific words in your reference, and the inference quality for that domain will improve.
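
If it helps, here is a minimal sketch for checking that a prepared clip folder meets those constraints (2–10 s per clip, total hours); the folder path is an assumption about your dataset layout:

```python
from pathlib import Path

import soundfile as sf

MIN_S, MAX_S = 2.0, 10.0            # per-clip bounds mentioned above
total_s, out_of_range = 0.0, []

for wav in sorted(Path("dataset/wavs").glob("*.wav")):   # assumed dataset layout
    duration = sf.info(str(wav)).duration
    total_s += duration
    if not MIN_S <= duration <= MAX_S:
        out_of_range.append((wav.name, round(duration, 2)))

print(f"total audio: {total_s / 3600:.1f} h")
print(f"clips outside {MIN_S}-{MAX_S} s: {len(out_of_range)}")
```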

@kadirnar

kadirnar commented Nov 24, 2024

@jeremy110

I am training on this dataset: https://huggingface.co/datasets/reach-vb/jenny_tts_dataset. I set 10 epochs in the config settings, but the train.log file already shows epoch 30 and training is still ongoing. What should I fix?

2024-11-24 06:22:09,477	example	INFO	Train Epoch: 31 [53%]
2024-11-24 06:22:09,477	example	INFO	[2.2495381832122803, 3.04087495803833, 9.244926452636719, 18.031190872192383, 1.9427911043167114, 2.0941860675811768, 100200, 0.0002988770366855993]
2024-11-24 06:23:10,591	example	INFO	Train Epoch: 31 [59%]
2024-11-24 06:23:10,592	example	INFO	[2.145620107650757, 3.066821336746216, 9.333406448364258, 19.36675453186035, 2.052659511566162, 2.4836974143981934, 100400, 0.0002988770366855993]
2024-11-24 06:24:11,103	example	INFO	Train Epoch: 31 [65%]
2024-11-24 06:24:11,104	example	INFO	[2.5389487743377686, 2.33595871925354, 7.182312488555908, 19.055206298828125, 1.9395025968551636, 1.7028437852859497, 100600, 0.0002988770366855993]

What do you think of the graphs? Can you interpret them?

[image: TensorBoard graphs]

@jeremy110

@kadirnar hello~
Typically, the parameters would be read from the YAML file, but it's okay. You can stop it at the appropriate time. I’ve kind of forgotten how many steps I originally set for training, but you can refer to my TensorBoard.

I’m not very experienced in the TTS field, but for MeloTTS, it mainly uses loss functions common in GANs, involving the Discriminator and Generator. Personally, I check the loss/g/total and loss/g/mel to assess whether the training results are as expected.

From your graphs, since there is no loss/g/total, I cannot judge the result. From my own training, the values typically range from 45 to 60, depending on your dataset.

@manhcuong17072002

@jeremy110 What do you think about combining the audio clips in a small dataset to create a larger one? For example, initially we have two clips, A and B. We concatenate them: Audio A + Audio B = Audio C, which gives us three clips: A, B, and C. Do you think this will significantly affect the training results compared to the original small dataset? Thank you.

@jeremy110

jeremy110 commented Nov 25, 2024

@manhcuong17072002

This approach can indeed enhance the data and may provide a slight improvement, but several points need to be considered. Since MeloTTS uses BERT to extract feature vectors, if we randomly concatenate the text of two audio files and then extract feature vectors, can it still effectively represent the prosody of the text?
You could try it out and see how it performs.
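
If you do try it, here is a minimal sketch of the concatenation itself (audio plus transcript), assuming both clips share one sample rate; whether the BERT features still capture sensible prosody for the joined text is exactly the open question above:

```python
import numpy as np
import soundfile as sf


def concat_clips(path_a: str, text_a: str, path_b: str, text_b: str,
                 out_path: str, gap_s: float = 0.2) -> str:
    audio_a, sr_a = sf.read(path_a)
    audio_b, sr_b = sf.read(path_b)
    assert sr_a == sr_b, "resample first if the sample rates differ"
    gap = np.zeros(int(gap_s * sr_a), dtype=audio_a.dtype)    # short pause between the two clips
    sf.write(out_path, np.concatenate([audio_a, gap, audio_b]), sr_a)
    return f"{text_a.strip()} {text_b.strip()}"               # transcript for the new clip C
```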

Additionally, you can refer to the Emilia-Dataset (https://huggingface.co/datasets/amphion/Emilia-Dataset) used by F5-TTS. It has a process for generating the final training data, which you might consider using as a method to collect data.
