Significant Increase in Training Loss after Upgrading from Transformers 4.47.1 to 4.48.0 #35787

mjkmain opened this issue Jan 20, 2025 · 1 comment
mjkmain commented Jan 20, 2025

System Info

  • huggingface_hub version: 0.27.1
  • Platform: Linux-5.15.0-1029-nvidia-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Running in iPython ?: No
  • Running in notebook ?: No
  • Running in Google Colab ?: No
  • Running in Google Colab Enterprise ?: No
  • Token path ?: /raid/MLP/.cache/huggingface/token
  • Has saved token ?: True
  • Who am I ?: mjkmain
  • Configured git credential helpers: store
  • FastAI: N/A
  • Tensorflow: N/A
  • Torch: 2.4.1
  • Jinja2: 3.1.4
  • Graphviz: N/A
  • keras: N/A
  • Pydot: N/A
  • Pillow: 11.0.0
  • hf_transfer: 0.1.9
  • gradio: 5.9.1
  • tensorboard: N/A
  • numpy: 1.26.4
  • pydantic: 2.10.4
  • aiohttp: 3.11.9
  • ENDPOINT: https://huggingface.co
  • HF_HUB_CACHE: /raid/MLP/.cache/huggingface/hub
  • HF_ASSETS_CACHE: /raid/MLP/.cache/huggingface/assets
  • HF_TOKEN_PATH: /raid/MLP/.cache/huggingface/token
  • HF_STORED_TOKENS_PATH: /raid/MLP/.cache/huggingface/stored_tokens
  • HF_HUB_OFFLINE: False
  • HF_HUB_DISABLE_TELEMETRY: False
  • HF_HUB_DISABLE_PROGRESS_BARS: None
  • HF_HUB_DISABLE_SYMLINKS_WARNING: False
  • HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
  • HF_HUB_DISABLE_IMPLICIT_TOKEN: False
  • HF_HUB_ENABLE_HF_TRANSFER: False
  • HF_HUB_ETAG_TIMEOUT: 10
  • HF_HUB_DOWNLOAD_TIMEOUT: 10

Who can help?

@ArthurZucker, @muellerzr

Issue Description

After upgrading the Transformers library from version 4.47.1 to 4.48.0, I've observed a drastic increase in loss values during training. With the same training script and configuration, the loss in 4.47.1 stays around 2.48-2.54, while in 4.48.0 it suddenly jumps to 40+ (and sometimes even higher).

Below are sample logs from the first few training steps. The only change is the Transformers version; everything else remains identical:

  • Transformers v4.47.1:

{'loss': 2.4859, 'grad_norm': 1.453125, 'learning_rate': 1.0660980810234542e-07, 'epoch': 0.0}
{'loss': 2.5428, 'grad_norm': 1.3359375, 'learning_rate': 2.1321961620469084e-07, 'epoch': 0.0}
{'loss': 2.5043, 'grad_norm': 1.234375, 'learning_rate': 3.1982942430703626e-07, 'epoch': 0.0}
{'loss': 2.4861, 'grad_norm': 1.3203125, 'learning_rate': 4.264392324093817e-07, 'epoch': 0.0}
  • Transformers v4.48.0:

{'loss': 40.6004, 'grad_norm': 34.5, 'learning_rate': 1.0660980810234542e-07, 'epoch': 0.0}
{'loss': 42.416, 'grad_norm': 34.0, 'learning_rate': 2.1321961620469084e-07, 'epoch': 0.0}
{'loss': 41.237, 'grad_norm': 32.5, 'learning_rate': 3.1982942430703626e-07, 'epoch': 0.0}
{'loss': 42.2229, 'grad_norm': 39.0, 'learning_rate': 4.264392324093817e-07, 'epoch': 0.0}

In the same setup using 4.48.0, if I change only the gradient accumulation steps from 16 to 1, the loss behaves similarly to 4.47.1, as shown below:

{'loss': 2.3792, 'grad_norm': 8.25, 'learning_rate': 6.666666666666668e-09, 'epoch': 0.0}
{'loss': 2.4288, 'grad_norm': 5.96875, 'learning_rate': 1.3333333333333335e-08, 'epoch': 0.0}
{'loss': 2.5774, 'grad_norm': 6.1875, 'learning_rate': 2e-08, 'epoch': 0.0}
{'loss': 2.4495, 'grad_norm': 6.96875, 'learning_rate': 2.666666666666667e-08, 'epoch': 0.0}
{'loss': 2.8155, 'grad_norm': 7.53125, 'learning_rate': 3.3333333333333334e-08, 'epoch': 0.0}
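For what it's worth, the inflated values are roughly the healthy values multiplied by the accumulation factor (about 2.5 × 16 ≈ 40), which would be consistent with the logged loss being summed or scaled over the 16 accumulation micro-steps rather than averaged. A minimal illustration of the two reporting conventions, with hypothetical numbers (this is not the actual Trainer code):

```python
# Hypothetical per-micro-batch losses with gradient_accumulation_steps = 16.
micro_losses = [2.49] * 16
steps = len(micro_losses)

mean_loss = sum(micro_losses) / steps  # ~2.49, matches the 4.47.1 logs
summed_loss = sum(micro_losses)        # ~39.8, close to the ~40 values seen with 4.48.0

print(f"averaged: {mean_loss:.2f}  summed: {summed_loss:.2f}")
```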

Any insights into changes between 4.47.1 and 4.48.0 that might cause this behavior?

Thank you for your time, and I appreciate any help or pointers to relevant changes or fixes!

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Environment

  • Transformers versions tested: 4.47.1 (good), 4.48.0 (issue)
  • Tested models: meta-llama/Llama-3.2-1B
  • Batch size: 1
  • Optimizer: adamw_bnb_8bit (from bitsandbytes)
  • Gradient checkpointing: True
  • Gradient accumulation steps: 16 (issue observed); 1 (issue disappears)
  • Number of devices: 8 (distributed training; a minimal configuration sketch follows this list)
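Since the original training script is not attached, below is a minimal sketch of the configuration listed above. The dataset and tokenization are toy placeholders (not the original data), the 8-GPU distributed launch is not shown, and the output directory name is arbitrary; only the listed training arguments matter for the comparison.

```python
# Minimal sketch of the reported setup (assumptions: toy dataset, single process;
# the original script and data are not included in this issue).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=128)
    out["labels"] = out["input_ids"].copy()  # standard causal-LM labels
    return out

# Toy placeholder dataset so the script is self-contained.
train_dataset = Dataset.from_dict({"text": ["Hello world."] * 64}).map(
    tokenize, batched=True, remove_columns=["text"]
)

args = TrainingArguments(
    output_dir="repro-4.48-loss",        # arbitrary
    per_device_train_batch_size=1,       # batch size 1
    gradient_accumulation_steps=16,      # 16 shows the issue; 1 does not
    gradient_checkpointing=True,
    optim="adamw_bnb_8bit",              # bitsandbytes 8-bit AdamW
    logging_steps=1,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```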

Steps to Reproduce

1. Install Transformers 4.47.1 and run the training script with the above parameters (gradient accumulation steps = 16) — observe normal loss values around 2-3.
2. Upgrade to Transformers 4.48.0 (no other changes in code or environment) and rerun the same training script (gradient accumulation steps = 16) — notice a large increase in loss values (40+).
3. Still using Transformers 4.48.0, change gradient accumulation steps to 1 — observe that loss now returns to normal levels (around 2-3).

Expected behavior

Loss values should remain consistent (as in 4.47.1) if there are no major changes in hyperparameters, data, or environment aside from the Transformers library version.

mjkmain added the bug label on Jan 20, 2025
ArthurZucker (Collaborator) commented Jan 20, 2025

A patch is coming today for #35651, which I hope should fix this.
