Checkpoint every_n_steps reruns epoch on restore #19815
heth27 added the labels bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) on Apr 25, 2024
I think this is also related to #18595. The fact that the ModelCheckpoint is saved before all parts of the counters are properly incremented seems to lead to a host of unforeseen and hard-to-debug issues.

I think it is also related to issue #18060.

Yes, it's the same issue; I didn't check thoroughly enough whether it already existed.
Bug description
The checkpoint callback is run before
batch_progress.increment_completed()
in training_epoch_loop's advance method. Thus in the checkpoint,
checkpoint['loops']['fit_loop']['epoch_loop.batch_progress']['total']['completed'] (e.g. 9)
is one smaller than, for example,
checkpoint['loops']['fit_loop']['epoch_loop.batch_progress']['total']['processed'] (e.g. 10) or the global step.
The same holds for
checkpoint['loops']['fit_loop']['epoch_loop.state_dict']['_batches_that_stepped'].
Thus, when restoring from the checkpoint, the batch with batch_idx 9 is run again, even though the optimizer step was already done for that batch.
This behavior is unexpected enough to at least warrant a hint in the documentation, if it is not regarded as a bug.
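The off-by-one described above can be sketched in plain Python. This is a toy simulation of the ordering reported in the issue, not Lightning's actual loop code: the checkpoint snapshot is captured before the "completed" counter is incremented, so the saved "completed" value lags "processed" by one, and a naive restore would resume from (and rerun) the last stepped batch.

```python
# Toy sketch (hypothetical names, not Lightning internals) of the reported
# ordering: the checkpoint callback fires before increment_completed().

class BatchProgress:
    def __init__(self):
        self.processed = 0   # incremented once the batch has been processed
        self.completed = 0   # incremented only after the checkpoint callback

def train(num_batches, checkpoint_at):
    """Run a toy loop; snapshot counters *before* completing the batch,
    mirroring the order described in the issue."""
    progress = BatchProgress()
    checkpoint = None
    for batch_idx in range(num_batches):
        progress.processed += 1  # batch processed, optimizer already stepped
        if batch_idx + 1 == checkpoint_at:
            # checkpoint callback runs here, before increment_completed()
            checkpoint = {"processed": progress.processed,
                          "completed": progress.completed}
        progress.completed += 1  # increment_completed() runs after the save
    return checkpoint

ckpt = train(num_batches=10, checkpoint_at=10)
# 'completed' is one smaller than 'processed'; a restore that resumes
# from batch_idx = ckpt["completed"] reruns the already-stepped batch.
print(ckpt)  # {'processed': 10, 'completed': 9}
```

Saving the snapshot after incrementing both counters (or resuming from "processed" rather than "completed") would avoid rerunning the batch in this toy model.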
What version are you seeing the problem on?
master
How to reproduce the bug
Error messages and logs
None
Environment
More info
No response