Zipformer Training Issues: Gradient too small and CUDA out of memory issue #1751
Comments
Could you use icefall/egs/librispeech/ASR/local/display_manifest_statistics.py to check the statistics of your data? Sample output is given in that script (lines 48 to 70 at 329e34a).
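A minimal sketch of getting the same kind of statistics for your own manifest with lhotse (the path below is a placeholder):

```python
from lhotse import load_manifest_lazy

# Path is a placeholder; point it at your own training cuts manifest.
cuts = load_manifest_lazy("data/fbank/cuts_train.jsonl.gz")
cuts.describe()  # prints the number of cuts, total duration, and duration percentiles
```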
By the way, have you enabled the check at icefall/egs/librispeech/ASR/zipformer/train.py line 1339 (at 329e34a), and if yes, what are the thresholds you are using?
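Assuming the check being referred to is the usual duration-based cut filter in that recipe, a minimal sketch looks like this (the 1 s / 20 s thresholds and the manifest path are illustrative; check your copy of train.py for the values actually in effect):

```python
from lhotse import load_manifest_lazy

train_cuts = load_manifest_lazy("data/fbank/cuts_train.jsonl.gz")  # placeholder path

def remove_short_and_long_utt(c):
    # Drop utterances that are too short or too long for stable training.
    return 1.0 <= c.duration <= 20.0

train_cuts = train_cuts.filter(remove_short_and_long_utt)
```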
I think I had the CUDA problem you report; reading the error and the PyTorch documentation suggested adding export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before starting training. This worked for me.
Attached is the manifest statistics file.
That option was not available in the version of PyTorch I was using, so I used export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 instead.
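Either allocator setting can also be applied from inside Python, as long as it happens before the first CUDA allocation; a minimal sketch:

```python
import os

# Must be set before the first CUDA allocation; pick the value your PyTorch version supports.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"  # older PyTorch versions

import torch

x = torch.zeros(1, device="cuda")  # the allocator reads the config on first CUDA use
```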
I have not explicitly set any thresholds, so it must be taking the default values. I will check where this is set in lhotse.
Default values are fine for your dataset as long as you have used
After resuming from the 4th epoch, I faced this error again, even with PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 set:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.86 GiB (GPU 0; 47.54 GiB total capacity; 43.98 GiB already allocated; 914.00 MiB free; 44.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Please advise whether we can train on an NVIDIA A6000 48 GB GPU.
Yes, you can. We are using 32 GB V100 GPUs. Which sampler are you using? Have you changed train.py or any other files? If yes, could you post the changes?
I have not changed the sampler; I am using all default settings. I have only changed the sampling rate to 8 kHz and the train, dev, and test folder names.
If you have removed all outlier audios, then you can try emptying the CUDA cache (torch.cuda.empty_cache()) during training.
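For reference, emptying the cache only returns cached-but-unused blocks to the driver; it does not reduce the memory a single batch needs. A minimal runnable sketch:

```python
import torch

# Allocate and free roughly 1 GiB, then hand the cached blocks back to the driver.
x = torch.empty(1024, 1024, 256, device="cuda")
del x
print("reserved before:", torch.cuda.memory_reserved())
torch.cuda.empty_cache()
print("reserved after: ", torch.cuda.memory_reserved())
```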
I tried emptying the cache as suggested; it generated a few checkpoints but ran into the same issue before completing an epoch. So I am thinking of reducing the data to around 1000 hours and checking, as I am not able to work out how to solve the error.
It should dump the offending batch if your script is fairly up to date. You could load the .pt file with torch.load() and inspect the characteristics of the batch; e.g. it might contain very short or very long utterances.
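A minimal sketch for inspecting such a dump, assuming the usual lhotse K2SpeechRecognitionDataset batch layout with "inputs" and "supervisions" keys (the file name is a placeholder; adjust it to whatever train.py wrote into your exp dir):

```python
import torch

batch = torch.load("zipformer/exp/batch-bad.pt", map_location="cpu")

features = batch["inputs"]            # (N, T, C) fbank features
supervisions = batch["supervisions"]

print("feature shape:", features.shape)
print("frames per utterance:", supervisions["num_frames"])
print("sample texts:", supervisions["text"][:5])
```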
Hello All,
We are training a Zipformer model on about 3400 hours of Tamil data.
We were facing this issue:
RuntimeError:
grad_scale is too small, exiting: 1.4901161193847656e-08
We have an NVIDIA A6000 48 GB GPU.
As per the suggestion in the terminal output, we reduced the max duration step by step from 1000 to 600, 500, 400, 300, and 150,
and changed the learning rate from 0.045 to 0.04, 0.035, and 0.02.
Finally, we used a max duration of 150 and a learning rate of 0.02 and trained for 4 epochs.
Then we faced the CUDA out-of-memory issue below:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.86 GiB (GPU 0; 47.54 GiB total capacity; 42.23 GiB already allocated; 2.80 GiB free; 43.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Kindly suggest if any parameter changes are required so that the training can be continued.