feat: Improve model checkpoint loading #253

5Hyeons · 2024-06-13T05:51:20Z

Summary

This PR fixes the checkpoint loading issue in the second stage of training when using a single GPU. The second stage adds a 'module.' prefix to all parameter names, causing a mismatch with the first stage parameters.

Changes

Improved checkpoint loading to handle mismatched state_dict keys.
If direct loading fails, a new state_dict with adjusted keys is created and loaded.
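
The fallback described above could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: the helper name `load_with_key_fallback` is hypothetical, and the sketch assumes the only mismatch is a leading 'module.' prefix on one side.

```python
from collections import OrderedDict

import torch
from torch import nn

PREFIX = "module."


def load_with_key_fallback(module, state_dict):
    """Hypothetical helper: strict load first, then retry with adjusted keys."""
    try:
        module.load_state_dict(state_dict, strict=True)
        return "direct"
    except RuntimeError:
        # strict=True raises RuntimeError on missing or unexpected keys.
        pass

    target_keys = set(module.state_dict().keys())
    adjusted = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(PREFIX) and key[len(PREFIX):] in target_keys:
            # Checkpoint has the prefix, the model does not: strip it.
            adjusted[key[len(PREFIX):]] = value
        elif PREFIX + key in target_keys:
            # Model has the prefix, the checkpoint does not: add it.
            adjusted[PREFIX + key] = value
        else:
            adjusted[key] = value

    # Still strict, so any remaining mismatch fails loudly instead of silently.
    module.load_state_dict(adjusted, strict=True)
    return "adjusted"


# Simulate a checkpoint whose keys carry the 'module.' prefix.
net = nn.Linear(3, 2)
ckpt = {PREFIX + k: torch.ones_like(v) for k, v in net.state_dict().items()}
mode = load_with_key_fallback(net, ckpt)
```

Keeping `strict=True` on the retry means a checkpoint that still does not fit raises an error rather than being skipped.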

Notes

Previously, print('%s loaded' % key) reported parameters as loaded even when they were not: with strict=False, load_state_dict silently skips keys that do not match, so the affected parameters kept their existing values. This PR ensures the parameters are actually loaded.
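
A small reproduction of the silent skip, as a sketch: the `nn.Linear` model and the manually prefixed checkpoint are stand-ins chosen for illustration, not taken from the repository.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)

# Snapshot the current weights so we can check whether anything changed.
before = {k: v.clone() for k, v in model.state_dict().items()}

# Stand-in checkpoint: same tensors shifted by 1.0, but every key is prefixed.
prefixed = {"module." + k: v.clone() + 1.0 for k, v in model.state_dict().items()}

# strict=False does not raise; it returns the keys it could not match.
result = model.load_state_dict(prefixed, strict=False)

unchanged = all(torch.equal(before[k], v) for k, v in model.state_dict().items())
print("missing:", result.missing_keys)
print("unexpected:", result.unexpected_keys)
print("weights unchanged:", unchanged)
```

Checking `missing_keys` and `unexpected_keys` on the returned value is one way to detect this case before printing a success message.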

Related Issue

Issue #120

feat: Improve model checkpoint loading

1d46593

5Hyeons mentioned this pull request Jun 13, 2024

Stage 2 Training Fails with NaN Loss on Single GPU Due to Inconsistent Checkpoint Keys #254

Open

martinambrus mentioned this pull request Aug 21, 2024

error in first train gen loss=0.0 #206

Open

martinambrus mentioned this pull request Sep 1, 2024

Help Wanted For Stage-1 #239

Open
