Loading a checkpoint trains past 100% #3096

Open
sam598 opened this issue Apr 19, 2024 · 0 comments

sam598 commented Apr 19, 2024

Describe the bug
Loading a previously trained checkpoint causes training to continue past --max-num-iterations.

To Reproduce
Steps to reproduce the behavior:

  1. Train a model (I used Splatfacto)
  2. Train it again with --load-dir
  3. Have --max-num-iterations set to higher than the saved checkpoint
  4. The model will not stop training for 24+ hours.

Expected behavior
The training should stop when it reaches --max-num-iterations

Additional context
I trained a Splatfacto model to 10,000 steps. Afterwards I loaded the saved checkpoint with --load-dir, and set --max-num-iterations to 20,000.

It started training with output that looked like this:

10090 (50.10%) 3m 45s

When it approached 20,000 steps it looked like this:

19810 (98.80%) 10s

Then it keeps training, apparently for 24 hours if unstopped.

20090 (101.00%) 23h 59m 40s
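
One possible explanation for the "23h 59m 40s" figure, offered as an assumption rather than something confirmed from the source: if the remaining time goes negative once the step passes the target and is then split into days, hours, minutes, and seconds with floor division, a value of about -20 seconds renders as just under 24 hours. The `format_eta` helper below is hypothetical and is not nerfstudio's actual formatter.

```python
def format_eta(seconds: int) -> str:
    """Hypothetical ETA formatter (an assumption, not nerfstudio's code)."""
    # divmod floors toward negative infinity, so a negative input
    # yields days == -1 and a large positive remainder.
    days, seconds = divmod(seconds, 86400)
    hours, seconds = divmod(seconds, 3600)
    minutes, seconds = divmod(seconds, 60)
    if days > 0:
        return f"{days}d {hours}h {minutes}m {seconds}s"
    if hours > 0:
        return f"{hours}h {minutes}m {seconds}s"
    if minutes > 0:
        return f"{minutes}m {seconds}s"
    return f"{seconds}s"


print(format_eta(225))  # ETA before the target is reached
print(format_eta(-20))  # ETA once the step count has passed the target
```

Under that assumption the ETA would be a display artifact of a negative remaining time, not a real 24-hour estimate.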

If --max-num-iterations was not set, or set lower than the checkpoint steps, I get why it would train indefinitely. But the logical (and more useful) behavior would be for it to train to the defined value.

What is perplexing is that, looking at the code in trainer.py, this does not seem like it should be possible. This code looks like it should behave the way I expect.

for step in range(self._start_step, self._start_step + num_iterations):

Where is this indefinite training coming from?
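
For reference, a minimal sketch of the arithmetic in the quoted loop. It assumes that `self._start_step` is the step restored from the checkpoint (10,000) and that `num_iterations` is the value passed as --max-num-iterations (20,000); neither assumption is verified against trainer.py. The `planned_steps` helper is hypothetical and only mirrors the `range(...)` expression.

```python
def planned_steps(start_step: int, num_iterations: int) -> range:
    """Mirror the range() expression from the quoted loop (hypothetical helper)."""
    return range(start_step, start_step + num_iterations)


# Assumed values: checkpoint saved at step 10,000, --max-num-iterations 20000.
steps = planned_steps(10_000, 20_000)
first_step = steps[0]
last_step = steps[-1]

print(first_step)  # first step executed after resuming
print(last_step)   # last step the loop would execute
print(len(steps))  # number of additional iterations
```

If those assumptions hold, the loop runs 20,000 additional iterations and ends at step 29,999 instead of stopping at step 20,000, so the value would act as a count of extra steps rather than an absolute target. That would make the run long but not indefinite.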
