
When running dpo_finetuning_example.ipynb, training loss is zero starting from the second step #61

Open
fairydreaming opened this issue Dec 6, 2024 · 7 comments

Comments

@fairydreaming

When performing DPO fine-tuning with this notebook: dpo_finetuning_example.ipynb, the loss value is zero starting from the second step:
[screenshot dpo-zero-loss: training log with the loss at zero from the second step onward]

I don't think this is the expected behavior.

I tried the notebook both locally and in Colab and it happens in both environments.

@fairydreaming
Author

I found that the cause was loading the model in fp16 precision. For some reason this introduced NaN values into the gradient computations, which manifested as zero loss values.

When I changed it to torch_dtype=torch.float32 in model creation:

import torch
from transformers import AutoModelForCausalLM

# Model to fine-tune (model_name and device are defined earlier in the notebook)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,  # was torch.float16
).to(device)

I got this:
[screenshot dpo-nonan: training log with non-zero loss values after the change]

I'm going to plot the losses next to see if it actually learns anything.
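
For anyone hitting the same thing, a quick way to confirm that fp16 really is producing non-finite gradients (rather than something else) is to inspect the gradients right after a backward pass. A minimal sketch, assuming `model` is the fp16 model from the notebook:

```python
import torch

def find_bad_grads(model):
    """Return names of parameters whose gradients contain NaN/Inf.

    Call after loss.backward() (e.g. at a debugger breakpoint inside a
    training step) to see which layers blow up in half precision.
    """
    return [
        name
        for name, param in model.named_parameters()
        if param.grad is not None and not torch.isfinite(param.grad).all()
    ]

# Optional: make autograd raise as soon as a backward op produces NaN,
# which points at the offending operation instead of the symptom (zero loss).
torch.autograd.set_detect_anomaly(True)
```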

@fairydreaming
Author

OK, here's the plot of the loss function values:
[plot dpo_loss: training loss values over the run]

Doesn't look very good...

@burtenshaw
Collaborator

Thanks for sharing this. Here are a few points; try them out and feel free to share plots again:

  • I would try out wandb or tensorboard for plotting, because they give all the standard metrics for each trainer (a minimal logging sketch follows this list).
  • We are mainly looking for an upward trend in reward margins. There is a brief example of that in the DPOTrainer docs.
  • Here is an example of logs from a successful DPO run.
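
A minimal sketch of the logging setup, assuming the notebook's trainer is otherwise unchanged (the config values below are placeholders, not the notebook's actual settings):

```python
from trl import DPOConfig

# Only the logging-related arguments are shown; everything else in the
# notebook's DPOConfig stays as it is.
training_args = DPOConfig(
    output_dir="./smollm2-dpo",   # placeholder path
    report_to="tensorboard",      # or "wandb"
    logging_steps=10,
)
```

With that in place, the trainer should log `rewards/chosen`, `rewards/rejected`, `rewards/accuracies` and `rewards/margins` alongside the loss; the margins curve is the one to watch.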

@burtenshaw
Collaborator

burtenshaw commented Dec 7, 2024

Also, if you open a [SUBMISSION] PR with your changes, including the plots, we can help you out with a review. 🙂

@fairydreaming
Author

* We are mainly looking for an upward trend in reward margins. There is a brief example of that in the [`DPOTrainer`](https://huggingface.co/docs/trl/main/en/dpo_trainer) docs.

That's interesting: why isn't this mentioned in dpo.md? IMHO the ability to distinguish a successful fine-tuning run from a failed one is about the most fundamental thing we should learn from the course. Meanwhile, the only info I found in dpo.md is:

During training, carefully monitor the loss convergence

So in my case I carefully monitored the loss divergence ;)
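
For what it's worth, once those metrics are being logged, a crude success/failure check is whether `rewards/margins` trends upward over training. A sketch, assuming the metric shows up in `trainer.state.log_history` as it does for recent `DPOTrainer` versions:

```python
# Compare the average reward margin in the first and last third of training.
# A healthy DPO run should show a clear increase from early to late.
margins = [
    entry["rewards/margins"]
    for entry in trainer.state.log_history
    if "rewards/margins" in entry
]
third = max(1, len(margins) // 3)
early = sum(margins[:third]) / third
late = sum(margins[-third:]) / third
print(f"early margin: {early:.4f}  late margin: {late:.4f}")
```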

@burtenshaw
Collaborator

Thanks for the continued feedback, @fairydreaming. We're still ironing out the creases on this module. I'll get to work on a PR to make it clearer.

@mdagost

mdagost commented Dec 11, 2024

I'm seeing the same weirdness with torch.float16. I added the test split as an eval set:

[screenshot: training and eval metrics logged with torch.float16]

When I change to torch.float32, I see results that make much more sense, including increasing reward margins:

[screenshot: training and eval metrics logged with torch.float32, showing increasing reward margins]
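
In case it helps others reproduce this, wiring in the test split only needs an `eval_dataset` plus an evaluation schedule. A sketch assuming the notebook's preference dataset has `train`/`test` splits and that `model` and `tokenizer` are already set up (argument names vary a little across trl/transformers versions, as noted in the comments):

```python
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="./smollm2-dpo-eval",  # placeholder path
    eval_strategy="steps",            # older versions: evaluation_strategy
    eval_steps=50,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,                      # load in float32, as discussed above
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,       # older trl releases use tokenizer=
)
trainer.train()
```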
