
When running dpo_finetuning_example.ipynb, training loss is zero starting from the second step #61

Open
fairydreaming opened this issue Dec 6, 2024 · 7 comments

Comments

@fairydreaming

When performing DPO fine-tuning with this notebook: dpo_finetuning_example.ipynb, the loss value is zero starting from the second step:
[screenshot dpo-zero-loss: training log with the loss at zero from the second step onward]

I don't think this is the expected behavior.

I tried the notebook both locally and in Colab and it happens in both environments.

@fairydreaming
Author

I found that the cause was loading the model in fp16 precision. For some reason this introduced NaN values into the gradient computations, which manifested as zero loss values.

When I changed it to torch_dtype=torch.float32 in model creation:

import torch
from transformers import AutoModelForCausalLM

# Model to fine-tune (model_name and device are defined earlier in the notebook)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,  # was torch.float16
).to(device)

I got this:
[screenshot dpo-nonan: training log with non-zero loss values after the change]

I'm going to plot the losses next to see if it actually learns anything.
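
For anyone hitting the same thing, a quick way to confirm that fp16 really is producing non-finite gradients (rather than something else) is to inspect the gradients right after a backward pass. A minimal sketch, assuming `model` is the fp16 model from the notebook:

```python
import torch

def find_bad_grads(model):
    """Return names of parameters whose gradients contain NaN/Inf.

    Call after loss.backward() (e.g. at a debugger breakpoint inside a
    training step) to see which layers blow up in half precision.
    """
    return [
        name
        for name, param in model.named_parameters()
        if param.grad is not None and not torch.isfinite(param.grad).all()
    ]

# Optional: make autograd raise as soon as a backward op produces NaN,
# which points at the offending operation instead of the symptom (zero loss).
torch.autograd.set_detect_anomaly(True)
```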

@fairydreaming
Author

OK, here's the plot of the loss function values:
[plot dpo_loss: training loss values over the run]

Doesn't look very good...

@burtenshaw
Collaborator

Thanks for sharing this. Here are a few points; try them out and feel free to share plots again:

  • I would try out wandb or tensorboard for plotting, because they give all the standard metrics for each trainer (a minimal logging sketch follows this list).
  • We are mainly looking for an upward trend in reward margins. There is a brief example of that in the DPOTrainer docs.
  • Here is an example of logs from a successful DPO run.
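
A minimal sketch of the logging setup, assuming the notebook's trainer is otherwise unchanged (the config values below are placeholders, not the notebook's actual settings):

```python
from trl import DPOConfig

# Only the logging-related arguments are shown; everything else in the
# notebook's DPOConfig stays as it is.
training_args = DPOConfig(
    output_dir="./smollm2-dpo",   # placeholder path
    report_to="tensorboard",      # or "wandb"
    logging_steps=10,
)
```

With that in place, the trainer should log `rewards/chosen`, `rewards/rejected`, `rewards/accuracies` and `rewards/margins` alongside the loss; the margins curve is the one to watch.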

@burtenshaw
Collaborator

burtenshaw commented Dec 7, 2024

Also, if you open a [SUBMISSION] PR with your changes, including the plots, we can help you out with a review. 🙂

@fairydreaming
Author

* We are mainly looking for an upward trend in reward margins. There is a brief example of that in the [`DPOTrainer`](https://huggingface.co/docs/trl/main/en/dpo_trainer) docs.

That's interesting: why isn't this mentioned in dpo.md? IMHO the ability to distinguish a successful fine-tuning run from a failed one is about the most fundamental thing we should learn from the course. Meanwhile, the only info I found in dpo.md is:

During training, carefully monitor the loss convergence

So in my case I carefully monitored the loss divergence ;)
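
For what it's worth, once those metrics are being logged, a crude success/failure check is whether `rewards/margins` trends upward over training. A sketch, assuming the metric shows up in `trainer.state.log_history` as it does for recent `DPOTrainer` versions:

```python
# Compare the average reward margin in the first and last third of training.
# A healthy DPO run should show a clear increase from early to late.
margins = [
    entry["rewards/margins"]
    for entry in trainer.state.log_history
    if "rewards/margins" in entry
]
third = max(1, len(margins) // 3)
early = sum(margins[:third]) / third
late = sum(margins[-third:]) / third
print(f"early margin: {early:.4f}  late margin: {late:.4f}")
```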

@burtenshaw
Collaborator

Thanks for the continued feedback, @fairydreaming. We're still ironing out the creases on this module. I'll get to work on a PR to make it clearer.

@mdagost

mdagost commented Dec 11, 2024

I'm seeing the same weirdness with torch.float16. I added the test split as an eval set:

[screenshot: training and eval metrics logged with torch.float16]

When I change to torch.float32, I see results that make much more sense, including increasing reward margins:

[screenshot: training and eval metrics logged with torch.float32, showing increasing reward margins]
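
In case it helps others reproduce this, wiring in the test split only needs an `eval_dataset` plus an evaluation schedule. A sketch assuming the notebook's preference dataset has `train`/`test` splits and that `model` and `tokenizer` are already set up (argument names vary a little across trl/transformers versions, as noted in the comments):

```python
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="./smollm2-dpo-eval",  # placeholder path
    eval_strategy="steps",            # older versions: evaluation_strategy
    eval_steps=50,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,                      # load in float32, as discussed above
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,       # older trl releases use tokenizer=
)
trainer.train()
```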
