Data requirements and recommendations for training large-scale Zipformer2 #1580
bharathraj-v asked this question in Q&A (Unanswered)
Hi,
We have trained a normal-scale Zipformer2 transducer on around 1k hours, a subset of a 4-5k-hour Telugu dataset. The audio is low quality (lossy, 8 kHz sample rate, 20-30 kbps bitrate) and the labels are somewhat weakly supervised, with an error rate of about 4-7%. Compared to conformer_ctc3 and zipformer2_ctc, it gave good results on an unseen test set of 800 samples from the same dataset (in-domain) and 800 samples from Google FLEURS (out of domain). More details on the comparison:
We also fine-tuned the LibriSpeech large-scale Zipformer2 checkpoint on the same data to compare, and it performed worse than training from scratch. The reason we tried this is that, in a separate experiment, NVIDIA NeMo FastConformer-CTC en_large fine-tuned on the same data performed very well even without an LM, so we figured fine-tuning could help. Could fine-tuning the GigaSpeech Zipformer checkpoint have given better performance?
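To frame the question, this is roughly how we initialize from a pretrained checkpoint before fine-tuning. It is only a minimal sketch, not the icefall recipe itself: the checkpoint filename, the `"model"` key, and the `get_model(params)` builder are assumptions to be adapted to the actual setup.

```python
# Minimal sketch of initializing a Zipformer2 model from a downloaded
# checkpoint before fine-tuning. Assumptions: the file is "pretrained.pt"
# and may store its weights under a "model" key; get_model(params) stands
# in for the model builder used by the training script.
import torch


def load_pretrained_weights(model: torch.nn.Module, ckpt_path: str) -> None:
    """Copy matching weights from a pretrained checkpoint into `model`."""
    state = torch.load(ckpt_path, map_location="cpu")
    if isinstance(state, dict) and "model" in state:
        state = state["model"]  # unwrap the {"model": state_dict} layout
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")


# Usage before building the optimizer in the training script:
# model = get_model(params)  # hypothetical: the recipe's model builder
# load_pretrained_weights(model, "pretrained.pt")
```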
Mainly, we are looking to train a large-scale Zipformer2 model following the instructions from PR #1058 to get better accuracy. What are the data requirements for the large-scale model? Are the 4-5k hours of data described above, plus 600 hours of data from better sources, enough to train a large-scale model that is more robust and performs better?
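For concreteness, this is the kind of check we run to see how many hours are actually available in the prepared manifests; it is only a sketch, and the manifest paths are placeholders.

```python
# Sketch: add up the hours available across the weakly supervised set and
# the cleaner 600-hour set. Paths are placeholders for our lhotse manifests.
from lhotse import CutSet

manifests = [
    "data/manifests/telugu_weak_cuts_train.jsonl.gz",   # ~4-5k h, weak labels
    "data/manifests/telugu_clean_cuts_train.jsonl.gz",  # ~600 h, better sources
]

total_hours = 0.0
for path in manifests:
    cuts = CutSet.from_file(path)
    hours = sum(c.duration for c in cuts) / 3600
    print(f"{path}: {hours:.1f} h")
    total_hours += hours

print(f"total training data: {total_hours:.1f} h")
```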
Our goal is to prepare a base end-to-end Telugu model that can be fine-tuned on domain-specific telephony data. Any suggestions for that, or answers to the questions above, would be greatly appreciated and would be of much help!
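For the telephony use case, the rough plan is simply to upsample the 8 kHz audio to 16 kHz before fbank extraction, so it matches the 16 kHz setup used in the recipes we have been running. A minimal lhotse sketch (paths are placeholders, not our exact pipeline):

```python
# Sketch: upsample 8 kHz telephony-style cuts to 16 kHz with lhotse before
# computing fbank features. The manifest paths are placeholders.
from lhotse import CutSet

cuts = CutSet.from_file("data/manifests/telephony_cuts_train.jsonl.gz")
cuts_16k = cuts.resample(16000)  # resampling is applied when audio is loaded
cuts_16k.to_file("data/manifests/telephony_cuts_train_16k.jsonl.gz")
```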
Thank you!