Speed difference for longer input text #296

Open
Ananya21162 opened this issue Dec 9, 2024 · 3 comments

@Ananya21162

We are noticing very slow speech for short sentences, while for longer sentences the model starts at a normal pace and then gradually speeds up to a noticeably high rate, which often sounds unnatural.
What could be the possible cause for this? Can anyone please help!

@UmerrAhsan

Latency generally increases as the length of the input sentence grows. However, a slowdown for short sentences is not typical and might indicate an issue. I've worked with StyleTTS2 and successfully reduced its latency by 2.5-3 times. If you can share your model file, I can investigate further to pinpoint the issue.

One possible reason for unnatural output is that StyleTTS2 is trained on audiobook datasets, where the style is tailored toward narration. This makes it perform well for longer sentences but struggle with shorter text, leading to degraded quality. Additionally, the model is trained with a high maximum sequence length, which could also explain the inconsistency when dealing with shorter inputs.

@Ananya21162
Author

Thank you so much for your response.
I trained the model on LibriTTS plus 50 hours of additional audio, with max sequence length = 512.
For a very short input like "Slide 1", the output is very slow.
For a very long input like "The Supplier Accounts Receivable Specialist ensures the accurate submission of supplier invoices by verifying all required details, such as purchase order references and amounts, before uploading them into the system.", the output is relatively fast.
I am not sure what the reason could be. Is there something we can do while training the model?

@UmerrAhsan

Hi @Ananya21162,

Without seeing the code, I can't say much, but what I would suggest is performing an inner ablation study. Print the time taken by each component during inference—the text encoder, BERT, alignment, prosody predictor, decoder, diffusion, and any other relevant components. That way you can identify which specific component is causing the issue, which will help pinpoint the problem. Then let me know, and we can debug it further.
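A minimal sketch of that timing ablation. The stage names and the `fake_stage` placeholder below are illustrative stand-ins, not StyleTTS2's actual API—replace them with the real calls from your inference script:

```python
import time
from contextlib import contextmanager

# Accumulated per-stage wall-clock times, in seconds.
timings = {}

@contextmanager
def timed(name):
    """Accumulate wall-clock time for one named inference stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

# Placeholder for a real model component (hypothetical, for illustration).
def fake_stage():
    time.sleep(0.001)

# Wrap each inference stage; in your script these would be the actual
# text encoder, BERT, alignment, prosody predictor, diffusion, and
# decoder calls.
for stage in ["text_encoder", "bert", "alignment",
              "prosody_predictor", "diffusion", "decoder"]:
    with timed(stage):
        fake_stage()

# Report the slowest stages first.
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {seconds * 1000:7.2f} ms")
```

Running this once for a short input ("Slide 1") and once for a long sentence, then comparing the two tables, should show which stage dominates the short-input latency.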
