About train error in inversion experiment #59

K1sna · 2024-07-25T05:50:58Z

Hi, I really like this repo, and I am using it to run experiments on some datasets and models of my own.

Unfortunately, I repeatedly hit the following error while training the inversion model. It seems to be caused by target ids falling outside the model's number of classes. I am training on the MSMARCO dataset and validating on a custom validation set (text-to-embedding pairs). I don't believe this kind of dataset should cause such an error. The error also only appears after training has run for some time. Could you advise on possible causes and debugging directions?

I would be very grateful if you could help me.

../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
{'loss': 2.8884, 'grad_norm': 1.3910571336746216, 'learning_rate': 0.0005, 'epoch': 0.15792706109437604}
{'loss': 2.8976, 'grad_norm': 1.9730249643325806, 'learning_rate': 0.0005, 'epoch': 0.15938934869710175}
Traceback (most recent call last):
  File "vec2text/vec2text/run.py", line 16, in <module>
    main()
  File "vec2text/vec2text/run.py", line 12, in main
    experiment.run()
  File "vec2text/vec2text/experiments.py", line 153, in run
    self.train()
  File "vec2text/vec2text/experiments.py", line 184, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File ".pyenv/versions/3.9.6/lib/python3.9/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File ".pyenv/versions/3.9.6/lib/python3.9/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "vec2text/vec2text/../vec2text/trainers/inversion.py", line 32, in training_step
    return super().training_step(model, inputs)
  File "python3.9/site-packages/transformers/trainer.py", line 3241, in training_step
    torch.cuda.empty_cache()
  File ".pyenv/versions/3.9.6/lib/python3.9/site-packages/torch/cuda/memory.py", line 162, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
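
For reference, the assertion on the first line of the log (`t >= 0 && t < n_classes`) fires when a target id passed to the NLL loss is negative or not smaller than the number of output classes. Below is a minimal sketch of a check that could be run on a batch before the loss is computed. It assumes the batch's label ids are available as a nested list of ints (for example via `.tolist()` on a label tensor) and that padded positions use `-100`, PyTorch's default `ignore_index`. The function name `find_out_of_range_labels` is hypothetical and not part of vec2text.

```python
def find_out_of_range_labels(labels, n_classes, ignore_index=-100):
    """Return (row, column, value) for every label id the loss would reject.

    labels:       nested list of ints, one inner list per sequence
    n_classes:    size of the model's output vocabulary (number of logits)
    ignore_index: label value the loss skips (PyTorch default is -100)
    """
    bad = []
    for row_idx, row in enumerate(labels):
        for col_idx, t in enumerate(row):
            if t == ignore_index:
                continue  # ignored positions never reach the assertion
            if t < 0 or t >= n_classes:
                bad.append((row_idx, col_idx, t))
    return bad


if __name__ == "__main__":
    # Toy batch: the second sequence has one id equal to n_classes and one negative id.
    batch_labels = [[0, 5, -100], [32100, 3, -1]]
    print(find_out_of_range_labels(batch_labels, n_classes=32100))
```

If this returns a non-empty list for some batch, likely things to check are whether the tokenizer used to build the labels has a larger vocabulary than the model's output layer, and whether padding positions were replaced with the ignore index. Both are only guesses until the failing batch is inspected.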


jxmorris12 · 2024-07-25T18:46:12Z

Hi! Your error is one that I have not seen before, but I certainly need more information before I'll be able to reproduce it. Here are a few questions:

What command are you running?
Did you change any code inside vec2text or did you just install it and train a model via the command-line?
What happens when you run the same command with CUDA_LAUNCH_BLOCKING=1? (that will give us a real error message; I don't think that empty_cache() is really what's causing the problem)
What hardware, CUDA version, torch version, and Linux version are you on? (try updating everything)
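
One possible way to gather the version information asked for above is the small script below. It is a sketch that uses only the standard library and assumes the packages were installed under the distribution names `torch` and `transformers`. The helper `collect_env_report` is hypothetical. It does not report the CUDA version, which can be read from `torch.version.cuda` in an environment where torch is installed.

```python
import os
import platform
import sys
from importlib import metadata


def collect_env_report(packages=("torch", "transformers")):
    """Collect basic environment details without importing the packages."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # Shows whether synchronous kernel launches were requested for this process.
        "CUDA_LAUNCH_BLOCKING": os.environ.get("CUDA_LAUNCH_BLOCKING"),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # package is not installed in this environment
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running it with `CUDA_LAUNCH_BLOCKING=1` set in the shell, as for the training command, confirms that the variable is visible to the Python process.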

jxmorris12 added the bug label on Jul 25, 2024
