grad_norm is abnormal in the initial stage of training. Is this normal? #11708

6lixueting opened this issue May 13, 2024 · 0 comments
During training, the grad_norm is abnormal and sometimes becomes inf. Part of the training log is shown below. This generally happens only during the first and second epochs; afterwards training gradually stabilizes and the metrics look roughly normal. Does this indicate a problem with the code, or is this early-training instability, presumably caused by weight initialization, considered normal? If it is abnormal, could you suggest how to modify or debug the code?
2024/05/10 15:46:31 - mmengine - INFO - Epoch(train) [2][ 50/788] lr: 2.0000e-03 eta: 4:14:00 time: 0.8167 data_time: 0.0169 memory: 15943 grad_norm: 3.4572 loss: 1.5043 loss_cls: 0.8781 loss_bbox: 0.6262
2024/05/10 15:47:12 - mmengine - INFO - Epoch(train) [2][100/788] lr: 2.0000e-03 eta: 4:12:59 time: 0.8248 data_time: 0.0158 memory: 15871 grad_norm: 31101415.5974 loss: 63.1261 loss_cls: 61.5992 loss_bbox: 1.5268
2024/05/10 15:47:54 - mmengine - INFO - Epoch(train) [2][150/788] lr: 2.0000e-03 eta: 4:12:14 time: 0.8396 data_time: 0.0166 memory: 15559 grad_norm: 3.1308 loss: 1.4770 loss_cls: 0.8578 loss_bbox: 0.6191
2024/05/10 15:48:37 - mmengine - INFO - Epoch(train) [2][200/788] lr: 2.0000e-03 eta: 4:11:35 time: 0.8455 data_time: 0.0161 memory: 15638 grad_norm: 3.0802 loss: 1.5022 loss_cls: 0.8754 loss_bbox: 0.6268
2024/05/10 15:49:19 - mmengine - INFO - Epoch(train) [2][250/788] lr: 2.0000e-03 eta: 4:10:57 time: 0.8465 data_time: 0.0159 memory: 15683 grad_norm: 38140.4242 loss: 9.6294 loss_cls: 9.0064 loss_bbox: 0.6230
2024/05/10 15:50:02 - mmengine - INFO - Epoch(train) [2][300/788] lr: 2.0000e-03 eta: 4:10:22 time: 0.8507 data_time: 0.0160 memory: 15608 grad_norm: 5.6323 loss: 1.4896 loss_cls: 0.8704 loss_bbox: 0.6192
2024/05/10 15:50:44 - mmengine - INFO - Epoch(train) [2][350/788] lr: 2.0000e-03 eta: 4:09:45 time: 0.8502 data_time: 0.0163 memory: 16224 grad_norm: 11363.6991 loss: 1.8404 loss_cls: 1.2254 loss_bbox: 0.6150
2024/05/10 15:51:27 - mmengine - INFO - Epoch(train) [2][400/788] lr: 2.0000e-03 eta: 4:09:08 time: 0.8494 data_time: 0.0155 memory: 15800 grad_norm: 7.4980 loss: 1.4227 loss_cls: 0.8127 loss_bbox: 0.6100
2024/05/10 15:52:09 - mmengine - INFO - Epoch(train) [2][450/788] lr: 2.0000e-03 eta: 4:08:29 time: 0.8475 data_time: 0.0156 memory: 15777 grad_norm: 3347.6338 loss: 1.9629 loss_cls: 1.3547 loss_bbox: 0.6082
2024/05/10 15:52:51 - mmengine - INFO - Epoch(train) [2][500/788] lr: 2.0000e-03 eta: 4:07:50 time: 0.8482 data_time: 0.0156 memory: 15591 grad_norm: 10206866.6160 loss: 162.7081 loss_cls: 162.0813 loss_bbox: 0.6268
2024/05/10 15:53:34 - mmengine - INFO - Epoch(train) [2][550/788] lr: 2.0000e-03 eta: 4:07:10 time: 0.8476 data_time: 0.0159 memory: 15672 grad_norm: inf loss: 54140869.3325 loss_cls: 224337.7220 loss_bbox: 53916531.3748
2024/05/10 15:54:16 - mmengine - INFO - Epoch(train) [2][600/788] lr: 2.0000e-03 eta: 4:06:26 time: 0.8412 data_time: 0.0166 memory: 15208 grad_norm: inf loss: 4266526276.7297 loss_cls: 177515.1533 loss_bbox: 4266348847.3060
2024/05/10 15:54:58 - mmengine - INFO - Epoch(train) [2][650/788] lr: 2.0000e-03 eta: 4:05:39 time: 0.8352 data_time: 0.0147 memory: 15530 grad_norm: 394415498607192704.0000 loss: 50129647.2727 loss_cls: 49865.3645 loss_bbox: 50079782.2788
2024/05/10 15:55:39 - mmengine - INFO - Epoch(train) [2][700/788] lr: 2.0000e-03 eta: 4:04:52 time: 0.8362 data_time: 0.0155 memory: 15958 grad_norm: 3940673478440881.5000 loss: 16192429.2060 loss_cls: 5022.1809 loss_bbox: 16187407.0018
2024/05/10 15:56:22 - mmengine - INFO - Epoch(train) [2][750/788] lr: 2.0000e-03 eta: 4:04:11 time: 0.8456 data_time: 0.0157 memory: 15777 grad_norm: 2433308972409.9951 loss: 294119.2577 loss_cls: 188.6984 loss_bbox: 293930.5601
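
For reference, one mitigation I am considering for spikes like the ones above is gradient clipping plus a short learning-rate warmup. Below is a rough sketch of how that might look in the config; the specific values (e.g. `max_norm=35`, a 500-iteration warmup, the SGD settings) are guesses on my part, not verified or recommended settings:

```python
# Illustrative MMEngine/MMDetection config snippet (values are placeholders).

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=2e-3, momentum=0.9, weight_decay=1e-4),
    # Clip the global gradient norm so a single bad batch cannot produce
    # an enormous parameter update.
    clip_grad=dict(max_norm=35, norm_type=2))

param_scheduler = [
    # Linear warmup over the first 500 iterations keeps the effective
    # learning rate small while the randomly initialized heads settle.
    dict(type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=500),
    # Example main schedule; this would be replaced by the experiment's
    # original scheduler.
    dict(type='MultiStepLR', by_epoch=True, milestones=[8, 11], gamma=0.1),
]
```

If I understand the OptimWrapper docs correctly, `clip_grad` is forwarded to `torch.nn.utils.clip_grad_norm_`, so the update from an outlier batch stays bounded even if its raw gradients explode.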
