
Inference latency #288

Open
Ananya21162 opened this issue Oct 10, 2024 · 7 comments

Comments

@Ananya21162

Ananya21162 commented Oct 10, 2024

I was trying out the model on a 439-character input and saw 5–6 s average latency on the LibriTTS dataset. Is there a way to reduce the latency? (The decoder takes the most time.)
Also, after fine-tuning the model on a few samples from a new speaker, the latency increased by a further 600–700 ms; is this expected?
Is the latency expected to increase if the dataset is larger (English only)?
Similarly, if we add more languages, will the model's inference latency increase?
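To pin down where the time goes, each component (text encoder, predictor, decoder) can be timed separately. Below is a minimal sketch of a timing helper; the function name and structure are illustrative, not from the StyleTTS2 codebase. When running on GPU, `torch.cuda.synchronize()` must be called around the timed region or asynchronous kernel launches will skew the numbers.

```python
import time

def avg_latency(fn, *args, n_runs=10, warmup=2):
    """Average wall-clock latency of a callable over several runs.

    Note: on GPU, call torch.cuda.synchronize() before and after the
    timed region, otherwise asynchronous kernels skew the measurement.
    """
    for _ in range(warmup):          # discard cold-start runs (JIT, caches)
        fn(*args)
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(*args)
    return (time.perf_counter() - start) / n_runs

# Illustrative usage: time two dummy stages and compare their shares.
decoder_time = avg_latency(lambda: sum(i * i for i in range(10_000)))
encoder_time = avg_latency(lambda: sum(range(1_000)))
```

Timing the decoder call in isolation this way would confirm (or rule out) that it accounts for most of the 5–6 s.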

@Respaired

Respaired commented Oct 12, 2024

HifiGAN is simply a larger, heavier decoder.

You need to either find another checkpoint pretrained with the iSTFT decoder or train a new model yourself from scratch. You can also fine-tune on top of the LJ checkpoint; that's not recommended, but one of my friends managed to get reasonable results that way.

As for your other questions: no, the dataset has no impact on latency. Only your model's parameters matter, and mainly the size of the decoder.
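One way to check whether the decoder dominates the parameter budget is to count parameters per submodule. A hedged sketch, assuming the model is held as a dict of `torch.nn.Module`s (as StyleTTS2's `build_model` returns); the helper name is made up:

```python
import torch.nn as nn

def params_per_module(modules: dict) -> dict:
    """Count parameters (trainable + frozen) for each named submodule."""
    return {name: sum(p.numel() for p in m.parameters())
            for name, m in modules.items()}

# Illustrative usage with stand-in modules (not the real StyleTTS2 nets):
nets = {"decoder": nn.Linear(512, 512), "text_encoder": nn.Embedding(100, 64)}
counts = params_per_module(nets)
# counts == {'decoder': 262656, 'text_encoder': 6400}
```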

@Ananya21162
Author

Thanks for your reply. We have two models: one trained on LibriTTS-R (360 + 100 hrs) and the other fine-tuned from it on 20-minute audio samples for multiple speakers. We set max_len to 100 for the first and 400 for the second. The two models differ in average latency by nearly 1.5 s.
Is this parameter the cause? What would be the ideal value?

@Respaired

You're welcome.
As I said, your choice of max_len and the dataset shouldn't matter.
Only the decoder has a large impact.

@Ananya21162
Author

Ananya21162 commented Oct 21, 2024

Understood. But in our experiment we checked the decoder size for both models mentioned above, and it was the same for both: 217 MB. Yet the two models still differ in latency by 1.5 seconds. Do you know of any other possible cause?
In fact, we compared all the model components, and they are identical for both:

bert size: 201359360 / bit | 25.17 / MB
bert_encoder size: 12599296 / bit | 1.57 / MB
predictor size: 518227584 / bit | 64.78 / MB
decoder size: 1737263744 / bit | 217.16 / MB
text_encoder size: 179404800 / bit | 22.43 / MB
predictor_encoder size: 444186016 / bit | 55.52 / MB
style_encoder size: 444186016 / bit | 55.52 / MB
diffusion size: 1620926464 / bit | 202.62 / MB
text_aligner size: 251790464 / bit | 31.47 / MB
pitch_extractor size: 168037024 / bit | 21.00 / MB
mpd size: 1315384640 / bit | 164.42 / MB
msd size: 8988864 / bit | 1.12 / MB
wd size: 37556288 / bit | 4.69 / MB
Total Model size: 6939910560 / bit | 867.49 / MB
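As a sanity check, the figures above are internally consistent with 32-bit floats: bits = parameter count × 32, and MB = bits / 8 / 10⁶. A small sketch that reproduces one line of the listing (the helper name is hypothetical):

```python
def report_size(name: str, num_params: int, bits_per_param: int = 32) -> str:
    """Format a component's size the way the listing above does."""
    size_bits = num_params * bits_per_param
    size_mb = size_bits / 8 / 1e6          # bits -> bytes -> megabytes
    return f"{name} size: {size_bits} / bit | {size_mb:.2f} / MB"

# The decoder's 1737263744 bits correspond to ~54.3M fp32 parameters:
print(report_size("decoder", 54289492))
# -> decoder size: 1737263744 / bit | 217.16 / MB
```

Since identical sizes imply identical architectures, the 1.5 s gap must come from something other than parameter count (e.g. input length, diffusion steps, or runtime settings).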

@Ananya21162
Author

Also, one model is trained from scratch and the other is fine-tuned. Could that make a difference? The number of parameters and the model size are the same :/

@Respaired

Unless you change the decoder, or use very short samples with LFInference, there shouldn't be much latency overhead.

@UmerrAhsan

It's unusual that fine-tuning StyleTTS2 increases the checkpoint file size even though the number of parameters in the model stays the same. Has anyone identified the reason behind this size increase?
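One common cause (an assumption here, not confirmed in this thread) is that fine-tuning checkpoints also store optimizer state, EMA copies, or discriminator weights alongside the model weights. A sketch for inspecting what a checkpoint file actually contains; the helper name is made up:

```python
import torch

def checkpoint_breakdown(path):
    """Bytes of tensor data under each top-level key of a checkpoint."""
    ckpt = torch.load(path, map_location="cpu", weights_only=True)
    sizes = {}
    for key, val in ckpt.items():
        # A top-level entry is usually a state_dict (dict of tensors)
        # or a single tensor; ignore scalars like epoch counters.
        tensors = val.values() if isinstance(val, dict) else [val]
        sizes[key] = sum(t.numel() * t.element_size()
                         for t in tensors if torch.is_tensor(t))
    return sizes
```

Comparing the two checkpoints' breakdowns should reveal whether the extra megabytes sit under keys like `optimizer` rather than the model itself.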
