Training strategy for Zipformer using fp16? #1461
Unanswered
ZQuang2202 asked this question in Q&A
Hi everyone,
I am a student attempting to reproduce the Zipformer results on LibriSpeech 100h, but hardware limitations prevent me from using the recommended configuration. Because of these constraints, I have reduced the batch size (max_duration) to 300, as opposed to the recommended 1000. However, I am struggling to find an appropriate configuration for Eden.
Following the training strategy that suggests decreasing the learning rate by √k when the batch size decreases by a factor of k, I initially set base_lr to 0.03 and kept the other configurations at their default values, but training diverged. Despite attempts to adjust lr_batches, lr_epochs (3.5-6), and base_lr (0.3-0.45), it still does not work. Notably, training diverges when batch_count is around 700-900, leading to 'parameter domination' issues in the embed_conv and some attention modules. I attach some log information below.
To address this, I tried reducing the gradient scale of the layers experiencing 'parameter domination', but this proved ineffective.
I have a few questions:
Thank you.
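
For reference, the √k scaling rule mentioned above works out as follows. This is only a sketch of the arithmetic, and it assumes the recipe's default base_lr of 0.045 and the recommended max_duration of 1000:

```python
import math

# Assumed numbers: recommended batch size (max_duration) vs. the reduced one.
recommended_max_duration = 1000
actual_max_duration = 300

# The batch size shrinks by a factor of k, so the rule scales the LR down by sqrt(k).
k = recommended_max_duration / actual_max_duration   # ~3.33
scaled_base_lr = 0.045 / math.sqrt(k)                # 0.045: assumed recipe default

print(f"k = {k:.2f}, scaled base_lr = {scaled_base_lr:.4f}")  # ~0.0246
```

That lands close to the 0.025 suggested in the reply below, and noticeably below the 0.03 that diverged.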
Replies: 1 comment, 1 reply
- Are you using a single GPU with max-duration=300? The gradient noise can be large with such a small batch size. You could try a smaller base-lr, such as 0.025, and keep lr_batches/lr_epochs unchanged. Usually you don't need to tune the Balancer and Whitener configurations.
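
To see how base_lr interacts with lr_batches and lr_epochs, here is a minimal sketch of the Eden schedule's shape. It paraphrases the decay formula in icefall's optim.py; the defaults used here (lr_batches=7500, lr_epochs=3.5, linear warmup over the first 500 batches) and the exact signature are assumptions, so treat this as an illustration rather than the recipe's implementation:

```python
import math

def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5,
            warmup_batches: float = 500.0) -> float:
    """Approximate shape of the Eden learning-rate schedule.

    The LR decays smoothly with both batch count and epoch; lr_batches and
    lr_epochs control where each decay kicks in. Defaults mirror the recipe's
    documented defaults (an assumption here).
    """
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    warmup = 1.0 if batch >= warmup_batches else 0.5 + 0.5 * batch / warmup_batches
    return base_lr * batch_factor * epoch_factor * warmup

# With base_lr=0.025, the LR around the batch counts where divergence was
# reported (~700-900) is still near the schedule's plateau, since
# lr_batches=7500 is much larger; lowering base_lr lowers it proportionally.
for b in (500, 800, 2000):
    print(b, round(eden_lr(0.025, batch=b, epoch=0.5), 5))
```

In practice this amounts to changing only --base-lr (e.g. to 0.025) in the train.py invocation while leaving --lr-batches and --lr-epochs at their defaults; the flag names follow the recipe's conventions and should be checked against the version of icefall in use.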