Significant Increase in Training Loss after Upgrading from Transformers 4.47.1 to 4.48.0 #35787

mjkmain opened this issue Jan 20, 2025 · 1 comment
mjkmain commented Jan 20, 2025

System Info

  • huggingface_hub version: 0.27.1
  • Platform: Linux-5.15.0-1029-nvidia-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Running in iPython ?: No
  • Running in notebook ?: No
  • Running in Google Colab ?: No
  • Running in Google Colab Enterprise ?: No
  • Token path ?: /raid/MLP/.cache/huggingface/token
  • Has saved token ?: True
  • Who am I ?: mjkmain
  • Configured git credential helpers: store
  • FastAI: N/A
  • Tensorflow: N/A
  • Torch: 2.4.1
  • Jinja2: 3.1.4
  • Graphviz: N/A
  • keras: N/A
  • Pydot: N/A
  • Pillow: 11.0.0
  • hf_transfer: 0.1.9
  • gradio: 5.9.1
  • tensorboard: N/A
  • numpy: 1.26.4
  • pydantic: 2.10.4
  • aiohttp: 3.11.9
  • ENDPOINT: https://huggingface.co
  • HF_HUB_CACHE: /raid/MLP/.cache/huggingface/hub
  • HF_ASSETS_CACHE: /raid/MLP/.cache/huggingface/assets
  • HF_TOKEN_PATH: /raid/MLP/.cache/huggingface/token
  • HF_STORED_TOKENS_PATH: /raid/MLP/.cache/huggingface/stored_tokens
  • HF_HUB_OFFLINE: False
  • HF_HUB_DISABLE_TELEMETRY: False
  • HF_HUB_DISABLE_PROGRESS_BARS: None
  • HF_HUB_DISABLE_SYMLINKS_WARNING: False
  • HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
  • HF_HUB_DISABLE_IMPLICIT_TOKEN: False
  • HF_HUB_ENABLE_HF_TRANSFER: False
  • HF_HUB_ETAG_TIMEOUT: 10
  • HF_HUB_DOWNLOAD_TIMEOUT: 10

Who can help?

@ArthurZucker, @muellerzr

Issue Description

After upgrading the Transformers library from version 4.47.1 to 4.48.0, I've observed a drastic increase in loss values during training. With the same training script and configuration, the loss in 4.47.1 stays around 2.48-2.54, while in 4.48.0 it suddenly jumps to 40+ (and sometimes even higher).

Below are sample logs from the first few training steps. The only change is the Transformers version; everything else remains identical:

  • Transformers v4.47.1:

{'loss': 2.4859, 'grad_norm': 1.453125, 'learning_rate': 1.0660980810234542e-07, 'epoch': 0.0}
{'loss': 2.5428, 'grad_norm': 1.3359375, 'learning_rate': 2.1321961620469084e-07, 'epoch': 0.0}
{'loss': 2.5043, 'grad_norm': 1.234375, 'learning_rate': 3.1982942430703626e-07, 'epoch': 0.0}
{'loss': 2.4861, 'grad_norm': 1.3203125, 'learning_rate': 4.264392324093817e-07, 'epoch': 0.0}
  • Transformers v4.48.0:

{'loss': 40.6004, 'grad_norm': 34.5, 'learning_rate': 1.0660980810234542e-07, 'epoch': 0.0}
{'loss': 42.416, 'grad_norm': 34.0, 'learning_rate': 2.1321961620469084e-07, 'epoch': 0.0}
{'loss': 41.237, 'grad_norm': 32.5, 'learning_rate': 3.1982942430703626e-07, 'epoch': 0.0}
{'loss': 42.2229, 'grad_norm': 39.0, 'learning_rate': 4.264392324093817e-07, 'epoch': 0.0}

In the same setup using 4.48.0, if I change only the gradient accumulation steps from 16 to 1, the loss behaves similarly to 4.47.1, as shown below:

{'loss': 2.3792, 'grad_norm': 8.25, 'learning_rate': 6.666666666666668e-09, 'epoch': 0.0}
{'loss': 2.4288, 'grad_norm': 5.96875, 'learning_rate': 1.3333333333333335e-08, 'epoch': 0.0}
{'loss': 2.5774, 'grad_norm': 6.1875, 'learning_rate': 2e-08, 'epoch': 0.0}
{'loss': 2.4495, 'grad_norm': 6.96875, 'learning_rate': 2.666666666666667e-08, 'epoch': 0.0}
{'loss': 2.8155, 'grad_norm': 7.53125, 'learning_rate': 3.3333333333333334e-08, 'epoch': 0.0}
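For what it's worth, the inflated values are roughly the healthy values multiplied by the accumulation factor (about 2.5 × 16 ≈ 40), which would be consistent with the logged loss being summed or scaled over the 16 accumulation micro-steps rather than averaged. A minimal illustration of the two reporting conventions, with hypothetical numbers (this is not the actual Trainer code):

```python
# Hypothetical per-micro-batch losses with gradient_accumulation_steps = 16.
micro_losses = [2.49] * 16
steps = len(micro_losses)

mean_loss = sum(micro_losses) / steps  # ~2.49, matches the 4.47.1 logs
summed_loss = sum(micro_losses)        # ~39.8, close to the ~40 values seen with 4.48.0

print(f"averaged: {mean_loss:.2f}  summed: {summed_loss:.2f}")
```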

Any insights into changes between 4.47.1 and 4.48.0 that might cause this behavior?

Thank you for your time, and I appreciate any help or pointers to relevant changes or fixes!

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Environment

  • Transformers versions tested: 4.47.1 (good), 4.48.0 (issue)
  • Tested models: meta-llama/Llama-3.2-1B
  • Batch size: 1
  • Optimizer: adamw_bnb_8bit (from bitsandbytes)
  • Gradient checkpointing: True
  • Gradient accumulation steps: 16 (issue observed); 1 (issue disappears)
  • Number of devices: 8 (distributed training; a minimal configuration sketch follows this list)
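Since the original training script is not attached, below is a minimal sketch of the configuration listed above. The dataset and tokenization are toy placeholders (not the original data), the 8-GPU distributed launch is not shown, and the output directory name is arbitrary; only the listed training arguments matter for the comparison.

```python
# Minimal sketch of the reported setup (assumptions: toy dataset, single process;
# the original script and data are not included in this issue).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=128)
    out["labels"] = out["input_ids"].copy()  # standard causal-LM labels
    return out

# Toy placeholder dataset so the script is self-contained.
train_dataset = Dataset.from_dict({"text": ["Hello world."] * 64}).map(
    tokenize, batched=True, remove_columns=["text"]
)

args = TrainingArguments(
    output_dir="repro-4.48-loss",        # arbitrary
    per_device_train_batch_size=1,       # batch size 1
    gradient_accumulation_steps=16,      # 16 shows the issue; 1 does not
    gradient_checkpointing=True,
    optim="adamw_bnb_8bit",              # bitsandbytes 8-bit AdamW
    logging_steps=1,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```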

Steps to Reproduce

1. Install Transformers 4.47.1 and run the training script with the above parameters (gradient accumulation steps = 16) — observe normal loss values around 2-3.
2. Upgrade to Transformers 4.48.0 (no other changes in code or environment) and rerun the same training script (gradient accumulation steps = 16) — notice a large increase in loss values (40+).
3. Still using Transformers 4.48.0, change gradient accumulation steps to 1 — observe that loss now returns to normal levels (around 2-3).

Expected behavior

Loss values should remain consistent (as in 4.47.1) if there are no major changes in hyperparameters, data, or environment aside from the Transformers library version.

mjkmain added the bug label on Jan 20, 2025
ArthurZucker (Collaborator) commented Jan 20, 2025

A patch is coming today for #35651, which I hope should fix this.
