About train error in inversion experiment #59

Open
K1sna opened this issue Jul 25, 2024 · 1 comment
Labels
bug Something isn't working

Comments


K1sna commented Jul 25, 2024

Hi, I really like this repo, and I am using it to run experiments on several datasets and models.

Unfortunately, I keep hitting the error below while training the inversion model. It seems to be caused by targets that fall outside the model's number of classes. I am training on the MSMARCO dataset and validating on a custom validation set (text-to-embedding pairs), and I don't believe this kind of dataset should cause such an error. The error also only appears after training has run for some time. Could you advise on possible causes and debugging directions?

I would be very grateful if you could help me.

../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
{'loss': 2.8884, 'grad_norm': 1.3910571336746216, 'learning_rate': 0.0005, 'epoch': 0.15792706109437604}
{'loss': 2.8976, 'grad_norm': 1.9730249643325806, 'learning_rate': 0.0005, 'epoch': 0.15938934869710175}
Traceback (most recent call last):
  File "vec2text/vec2text/run.py", line 16, in <module>
    main()
  File "vec2text/vec2text/run.py", line 12, in main
    experiment.run()
  File "vec2text/vec2text/experiments.py", line 153, in run
    self.train()
  File "vec2text/vec2text/experiments.py", line 184, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File ".pyenv/versions/3.9.6/lib/python3.9/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File ".pyenv/versions/3.9.6/lib/python3.9/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "vec2text/vec2text/../vec2text/trainers/inversion.py", line 32, in training_step
    return super().training_step(model, inputs)
  File "python3.9/site-packages/transformers/trainer.py", line 3241, in training_step
    torch.cuda.empty_cache()
  File ".pyenv/versions/3.9.6/lib/python3.9/site-packages/torch/cuda/memory.py", line 162, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
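
For reference, the condition in the assertion above (`t >= 0 && t < n_classes`) can be checked on the CPU before a batch reaches the loss. This is a minimal sketch, not code from vec2text: the function name `find_invalid_labels` and the example values are hypothetical, and it assumes the default `ignore_index` of -100 that PyTorch's cross-entropy/NLL loss skips before the range check.

```python
IGNORE_INDEX = -100  # assumed: PyTorch's default ignore_index for cross-entropy / NLL loss


def find_invalid_labels(labels, n_classes, ignore_index=IGNORE_INDEX):
    """Return (row, col, value) for every target that violates 0 <= t < n_classes.

    `labels` is a nested list of ints, e.g. batch["labels"].tolist() for a tensor.
    `n_classes` should be the size of the last dimension of the model's logits.
    """
    bad = []
    for row, seq in enumerate(labels):
        for col, t in enumerate(seq):
            if t == ignore_index:
                continue  # ignored positions are never range-checked by the loss
            if not (0 <= t < n_classes):
                bad.append((row, col, t))
    return bad


if __name__ == "__main__":
    # Hypothetical batch: 10 classes, one target equal to n_classes and one negative.
    example = [[1, 2, -100], [3, 10, -1]]
    for row, col, value in find_invalid_labels(example, n_classes=10):
        print(f"invalid target {value} at row {row}, position {col}")
```

Running a check like this on each batch (or once over the tokenized dataset) would show whether a specific example carries a target outside the model's output range, which would be consistent with the error only appearing after some training steps.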
jxmorris12 added the bug label Jul 25, 2024
@jxmorris12
Owner

Hi! Your error is one that I have not seen before, but I certainly need more information before I'll be able to reproduce it. Here are a few questions:

  1. What command are you running?
  2. Did you change any code inside vec2text or did you just install it and train a model via the command-line?
  3. What happens when you run the same command with CUDA_LAUNCH_BLOCKING=1? (That should give us the real error message; I don't think empty_cache() is what's actually causing the problem.)
  4. What hardware, CUDA version, torch version, and Linux version are you on? (Try updating everything.)
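
The information asked for in questions 3 and 4 could be gathered with something like the sketch below. It is only an illustration: the helper name `collect_environment` is hypothetical, and it assumes the variable is set before torch is imported so that CUDA kernels launch synchronously and the traceback points at the failing call.

```python
import os
import platform

# Set before importing torch so CUDA launches are synchronous in this process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def collect_environment():
    """Collect the version details requested above into a plain dict."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "CUDA_LAUNCH_BLOCKING": os.environ.get("CUDA_LAUNCH_BLOCKING"),
        "torch": None,
        "cuda": None,
        "gpu": None,
    }
    try:
        import torch
    except ImportError:
        return info  # torch not installed; report what we have
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # CUDA version torch was built with, or None
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

PyTorch also ships `python -m torch.utils.collect_env`, which prints a fuller report that can be pasted into the issue.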
