Zipformer Training Issues: Gradient too small and CUDA out of memory issue #1751
Comments
Could you use icefall/egs/librispeech/ASR/local/display_manifest_statistics.py to check the statistics of your data? Sample output is given in that script (lines 48 to 70 at 329e34a).
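A minimal sketch of getting the same kind of statistics for your own manifest with lhotse (the path below is a placeholder):

```python
from lhotse import load_manifest_lazy

# Path is a placeholder; point it at your own training cuts manifest.
cuts = load_manifest_lazy("data/fbank/cuts_train.jsonl.gz")
cuts.describe()  # prints the number of cuts, total duration, and duration percentiles
```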
By the way, have you enabled the check at icefall/egs/librispeech/ASR/zipformer/train.py line 1339 (at 329e34a), and if yes, what are the thresholds you are using?
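Assuming the check being referred to is the usual duration-based cut filter in that recipe, a minimal sketch looks like this (the 1 s / 20 s thresholds and the manifest path are illustrative; check your copy of train.py for the values actually in effect):

```python
from lhotse import load_manifest_lazy

train_cuts = load_manifest_lazy("data/fbank/cuts_train.jsonl.gz")  # placeholder path

def remove_short_and_long_utt(c):
    # Drop utterances that are too short or too long for stable training.
    return 1.0 <= c.duration <= 20.0

train_cuts = train_cuts.filter(remove_short_and_long_utt)
```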
I think I had the CUDA problem you report; reading the error and the PyTorch documentation suggested adding export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before starting training. This worked for me.
Attached is the manifest statistics file.
That option was not available in the version of PyTorch I was using, so I used export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 instead.
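Either allocator setting can also be applied from inside Python, as long as it happens before the first CUDA allocation; a minimal sketch:

```python
import os

# Must be set before the first CUDA allocation; pick the value your PyTorch version supports.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"  # older PyTorch versions

import torch

x = torch.zeros(1, device="cuda")  # the allocator reads the config on first CUDA use
```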
I have not explicitly set any thresholds, so it must be taking the default values. I will check where this is set in lhotse.
Default values are fine for your dataset as long as you have used
After resuming from the 4th epoch, I faced this error again, even with PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 set:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.86 GiB (GPU 0; 47.54 GiB total capacity; 43.98 GiB already allocated; 914.00 MiB free; 44.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Please advise whether we can train on an NVIDIA A6000 48 GB GPU.
Yes, you can. We are using 32 GB V100 GPUs. Which sampler are you using? Have you changed train.py or any other files? If yes, could you post the changes?
I have not changed the sampler; I am using all default settings. I have only changed the sampling rate to 8 kHz and the train, dev, and test folder names.
If you have removed all outlier audios, then you can try emptying the CUDA cache (torch.cuda.empty_cache()) during training.
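For reference, emptying the cache only returns cached-but-unused blocks to the driver; it does not reduce the memory a single batch needs. A minimal runnable sketch:

```python
import torch

# Allocate and free roughly 1 GiB, then hand the cached blocks back to the driver.
x = torch.empty(1024, 1024, 256, device="cuda")
del x
print("reserved before:", torch.cuda.memory_reserved())
torch.cuda.empty_cache()
print("reserved after: ", torch.cuda.memory_reserved())
```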
I tried emptying the cache as suggested; it generated a few checkpoints but ran into the same issue before completing an epoch. So I am thinking of reducing the data to around 1000 hours and checking, as I am not able to work out how to solve the error.
It should dump the offending batch if your script is fairly up to date. You could load the .pt file with torch.load() and inspect the characteristics of the batch; e.g. it might contain very short or very long utterances.
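A minimal sketch for inspecting such a dump, assuming the usual lhotse K2SpeechRecognitionDataset batch layout with "inputs" and "supervisions" keys (the file name is a placeholder; adjust it to whatever train.py wrote into your exp dir):

```python
import torch

batch = torch.load("zipformer/exp/batch-bad.pt", map_location="cpu")

features = batch["inputs"]            # (N, T, C) fbank features
supervisions = batch["supervisions"]

print("feature shape:", features.shape)
print("frames per utterance:", supervisions["num_frames"])
print("sample texts:", supervisions["text"][:5])
```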
Hello All,
We are training a Zipformer model on about 3400 hours of Tamil data.
We were facing this issue:
RuntimeError:
grad_scale is too small, exiting: 1.4901161193847656e-08
We have an NVIDIA A6000 48 GB GPU.
As per the suggestion in the terminal output, we reduced the max duration step by step from 1000 to 600, 500, 400, 300, and 150,
and changed the learning rate from 0.045 to 0.04, 0.035, and 0.02.
Finally, we used a max duration of 150 and a learning rate of 0.02 and trained for 4 epochs.
Then we faced the CUDA out-of-memory issue below:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.86 GiB (GPU 0; 47.54 GiB total capacity; 42.23 GiB already allocated; 2.80 GiB free; 43.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Kindly suggest if any parameter changes are required so that the training can be continued.