Inference audio generated at higher speed than training files #79
Actually, I think I know where the issue comes from. In the training data, you require both note durations and phoneme durations, but during inference, you only require note durations. How does the system know where one note ends and the next begins? For example, if you have: Phoneme 1|Phoneme 2|Phoneme 3|Phoneme 4, it's clear that all four phonemes are sung over the note C, but it's not clear whether we're talking about one C with duration 1 (for a total duration of 1), two Cs with duration 1 each (for a total duration of 2), or four Cs with duration 1 each (for a total duration of 4). I think this explains why my playback speeds during inference are off. Am I missing something?
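To make the ambiguity concrete, here is a minimal sketch (the data layout and field names are hypothetical, not the project's actual input format): the same four phonemes sung over a C are consistent with several different note groupings, each with a different total duration.

```python
# Minimal sketch of the ambiguity described above. The representation
# (lists of (pitch, duration) tuples) is hypothetical, not the project's
# actual input schema.

# Four phonemes, all sung on C4:
phonemes = ["p1", "p2", "p3", "p4"]

# Interpretation A: one C4 of duration 1 -> total 1
notes_a = [("C4", 1.0)]                 # all four phonemes share one note

# Interpretation B: two C4s of duration 1 each -> total 2
notes_b = [("C4", 1.0), ("C4", 1.0)]    # two phonemes per note

# Interpretation C: four C4s of duration 1 each -> total 4
notes_c = [("C4", 1.0)] * 4             # one phoneme per note

# Without explicit note boundaries, all three are valid readings of the
# same phoneme sequence, so the predicted total duration can differ wildly.
for name, notes in [("A", notes_a), ("B", notes_b), ("C", notes_c)]:
    total = sum(dur for _, dur in notes)
    print(f"Interpretation {name}: total duration = {total}")
```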
Great previous question.
Are you using this for English or Chinese? I'm making some progress here (in English), but haven't solved it yet. Part of it, I think, is that the model was developed for Chinese, which I think has simpler syllable-formation rules, but that doesn't explain everything. I've made a few changes to the data and will do a 500K-step training run tomorrow just to make sure it isn't simply an undertraining problem. Interestingly, on seen data it also struggles with durations initially, but manages to learn them; it's on unseen data that it's wildly off. If worst comes to worst, I might infer in small chunks and time-stretch the results (see the sketch below), but I'm still hoping to be able to tweak the model so it comes out with the right durations. Let me know where you get to.
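For reference, a minimal sketch of that chunk-and-stretch workaround, using librosa's pitch-preserving time stretch. The file names and the expected duration are placeholder assumptions; in practice the expected duration would come from the score's note durations.

```python
import librosa
import soundfile as sf

# Sketch of the time-stretch workaround mentioned above. File names are
# placeholders; expected_seconds would be computed from the score.
y, sr = librosa.load("inference_chunk.wav", sr=None)  # keep native sample rate

expected_seconds = 4.0                # how long the score says this chunk should last
actual_seconds = len(y) / sr

# librosa's time_stretch preserves pitch; rate > 1 shortens the audio,
# rate < 1 lengthens it. Stretch so the output length matches the score.
rate = actual_seconds / expected_seconds
y_fixed = librosa.effects.time_stretch(y, rate=rate)

sf.write("inference_chunk_fixed.wav", y_fixed, sr)
```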
Did you get a resolution for this? I have the same problem.
Finally, after a lot of labor, I got a decent English singer out of the model, which is great. But the audio generated during inference consistently plays back about 1.3 times faster than the training data fed in. The pitch and the phonemes are correct, but everything's sped up. Any idea why that would be the case? Thank you.
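In case it helps others debug, one way to quantify the speed-up (a hypothetical diagnostic; the file names are placeholders) is to compare the generated audio's duration against a ground-truth recording of the same score. A ratio around 1.3 would match the behavior described here.

```python
import soundfile as sf

# Hypothetical diagnostic with placeholder file names: measure how much
# faster the inference output plays than the ground-truth recording.
ref, sr_ref = sf.read("ground_truth.wav")
gen, sr_gen = sf.read("inference_output.wav")

ref_seconds = len(ref) / sr_ref
gen_seconds = len(gen) / sr_gen

# Since pitch and phonemes are correct, this ratio isolates the duration
# (tempo) error rather than a sample-rate mismatch.
print(f"speed-up factor: {ref_seconds / gen_seconds:.2f}")
```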