Hi, I really like this repo, and I am running experiments with it on several datasets and models.
Unfortunately, I keep hitting the following error while training the inversion model. It appears to be caused by target indices falling outside the model's number of classes. I am training on the MSMARCO dataset and validating on a custom validation set (text-to-embedding pairs). I don't believe this type of dataset should cause such an error. The error also only occurs after training has run for some time. Could you advise on possible causes and debugging directions?
I would be very grateful if you could help me.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
{'loss': 2.8884, 'grad_norm': 1.3910571336746216, 'learning_rate': 0.0005, 'epoch': 0.15792706109437604}
{'loss': 2.8976, 'grad_norm': 1.9730249643325806, 'learning_rate': 0.0005, 'epoch': 0.15938934869710175}
Traceback (most recent call last):
File "vec2text/vec2text/run.py", line 16, in <module>
main()
File "vec2text/vec2text/run.py", line 12, in main
experiment.run()
File "vec2text/vec2text/experiments.py", line 153, in run
self.train()
File "vec2text/vec2text/experiments.py", line 184, in train
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File ".pyenv/versions/3.9.6/lib/python3.9/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File ".pyenv/versions/3.9.6/lib/python3.9/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "vec2text/vec2text/../vec2text/trainers/inversion.py", line 32, in training_step
return super().training_step(model, inputs)
File "python3.9/site-packages/transformers/trainer.py", line 3241, in training_step
torch.cuda.empty_cache()
File ".pyenv/versions/3.9.6/lib/python3.9/site-packages/torch/cuda/memory.py", line 162, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
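The assertion in the log above, `t >= 0 && t < n_classes`, fires when a target index is negative or not smaller than the number of classes (targets equal to the loss's `ignore_index`, which defaults to -100 in PyTorch, are skipped). One way to narrow this down might be to scan each batch's labels on the CPU before they reach the loss. The following is only a minimal sketch in plain Python: the function name `find_out_of_range_labels` and the value 32128 are hypothetical, chosen for illustration, and are not taken from vec2text. With tensors, something like `labels.tolist()` would produce the nested lists it expects.

```python
def find_out_of_range_labels(labels, n_classes, ignore_index=-100):
    """Return (row, col, value) for every label the loss would reject.

    `labels` is a nested list of ints (one inner list per sequence).
    Labels equal to `ignore_index` are skipped, as the loss ignores them.
    """
    bad = []
    for row, seq in enumerate(labels):
        for col, t in enumerate(seq):
            if t == ignore_index:
                continue
            if not (0 <= t < n_classes):
                bad.append((row, col, t))
    return bad


# Hypothetical batch: the second sequence contains an id equal to n_classes,
# which is one past the largest valid class index.
batch = [[5, 17, 2, -100], [9, 32128, 1, -100]]
print(find_out_of_range_labels(batch, n_classes=32128))
# prints [(1, 1, 32128)]
```

If a check like this flags a batch, comparing the offending ids against the tokenizer's vocabulary size and the model's output dimension would be a natural next step.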
Hi! I haven't seen this error before, and I'll need more information before I can reproduce it. Here are a few questions:
What command are you running?
Did you change any code inside vec2text or did you just install it and train a model via the command-line?
What happens when you run the same command with CUDA_LAUNCH_BLOCKING=1? (that will give us a real error message; I don't think that empty_cache() is really what's causing the problem)
What hardware, CUDA version, torch version, and Linux version are you on? (try updating everything)
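To answer the environment questions above in one go, a small script along these lines could be pasted into a reply. It is a sketch, not part of vec2text: `collect_env_report` is a hypothetical helper name, and the torch import is guarded so the script still runs where torch is not installed. PyTorch also ships `python -m torch.utils.collect_env`, which prints a more complete report.

```python
import os
import platform
import sys


def collect_env_report():
    """Gather basic version info useful for a bug report."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # Shows whether the run was launched with synchronous CUDA kernels.
        "CUDA_LAUNCH_BLOCKING": os.environ.get("CUDA_LAUNCH_BLOCKING"),
    }
    try:
        import torch
    except ImportError:
        report["torch"] = None
        return report
    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


for key, value in collect_env_report().items():
    print(f"{key}: {value}")
```

For the `CUDA_LAUNCH_BLOCKING=1` run, the variable needs to be set in the environment before the training command starts, so that the traceback points at the kernel that actually failed rather than at a later call such as `empty_cache()`.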