Hello guys,
I've implemented a transformer language model to train on a translation task. When I set devices=1 in pl.Trainer, training works fine and the loss is a normal (non-NaN) value, but when I set devices=2, the training loss becomes NaN and "NaN or Inf found in input tensor." is printed during training. What might be the problem here?
File "/home/airobotics/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 287, in _call_strategy_hook output = fn(*args, **kwargs) File "/home/airobotics/.local/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 200, in backward self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs) File "/home/airobotics/.local/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 67, in backward model.backward(tensor, *args, **kwargs) File "/home/airobotics/.local/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1046, in backward loss.backward(*args, **kwargs) File "/home/airobotics/.local/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward torch.autograd.backward( File "/home/airobotics/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
this is the output i got when i passed in detect_anomaly=True
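For context, this is roughly how I create the Trainer in both runs; everything except devices stays the same, and max_epochs here is just a placeholder for my actual value:

```python
import pytorch_lightning as pl

# Single-GPU run – trains fine, loss stays finite
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=10,        # placeholder value
    detect_anomaly=True,  # enabled to get the traceback above
)

# Two-GPU run – this is where train_loss_step becomes NaN
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    max_epochs=10,        # placeholder value
    detect_anomaly=True,
)
```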
Sample code:

```python
config = EncDec_Configs()
config.vocab_size = tokenizer.vocab_size
config.sinusoid = True
```

Training output:

```
Epoch 0:   8%|██████▏   | 50/653 [00:10<02:06, 4.76it/s, v_num=16, train_loss_step=nan.0]NaN or Inf found in input tensor.
Epoch 0:  11%|█████████ | 74/653 [00:14<01:56, 4.97it/s, v_num=16, train_loss_step=nan.0]
```
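And a rough sketch of how the config is used downstream; EncDecTransformer and make_dataloader are placeholder names for my own LightningModule and data-loading code:

```python
config = EncDec_Configs()
config.vocab_size = tokenizer.vocab_size
config.sinusoid = True

model = EncDecTransformer(config)          # placeholder name for my LightningModule
train_loader = make_dataloader(tokenizer)  # placeholder for my DataLoader setup

# trainer is the devices=2 Trainer shown above
trainer.fit(model, train_dataloaders=train_loader)
```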