JLD2OutputWriter and Checkpointer don't work when max_filesize and part are specified. #3399

Open
xkykai opened this issue Dec 1, 2023 · 2 comments

xkykai commented Dec 1, 2023

Below is a minimal working example of the problem:

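using Oceananigans
using Printf

# Output directory used below; the name is taken from the description and error messages that follow
FILE_DIR = "./test_outputwriter"
mkpath(FILE_DIR)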

grid = RectilinearGrid(Float64,
                       size = (4, 4, 4),
                       x = (0, 1),
                       y = (0, 1),
                       z = (-1, 0),
                       topology = (Periodic, Periodic, Bounded))

b_initial(x, y, z) = rand()

model = NonhydrostaticModel(; 
            grid = grid,
            buoyancy = BuoyancyTracer(),
            tracers = (:b),
            timestepper = :RungeKutta3)

simulation = Simulation(model, Δt=0.1, stop_iteration=200)
outputs = merge(model.velocities, model.tracers)

simulation.output_writers[:jld2] = JLD2OutputWriter(model, outputs,
                                                          filename = "$(FILE_DIR)/instantaneous_fields.jld2",
                                                          schedule = IterationInterval(1),
                                                          max_filesize=200e3,
                                                          part=1)

simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=TimeInterval(100), prefix="$(FILE_DIR)/model_checkpoint")

run!(simulation)

# up until here most recent output file is instantaneous_fields_part4.jld2, but I want to continue running the simulation

simulation.stop_iteration = 400

simulation.output_writers[:jld2] = JLD2OutputWriter(model, outputs,
                                                          filename = "$(FILE_DIR)/instantaneous_fields.jld2",
                                                          schedule = IterationInterval(1),
                                                          max_filesize=200e3,
                                                          part=4)

run!(simulation, pickup="$(FILE_DIR)/model_checkpoint_iteration0.jld2")

What I'm doing is creating a directory test_outputwriter, and then writing fields into it with a specified maximum file size and starting part number.
After the first run!(simulation), four output files have been written, the most recent being instantaneous_fields_part4.jld2, along with a checkpoint file model_checkpoint_iteration0.jld2.

Let's say I want to keep running this model, so I increase simulation.stop_iteration. I pick up the model from the most recent checkpoint and specify part=4 (the most recent file written). This creates an instantaneous_fields.jld2 and keeps writing into it, while throwing a warning:

Warning: Failed to save and serialize [:grid, :coriolis, :buoyancy, :closure] in ./test_outputwriter/instantaneous_fields.jld2 because ArgumentError: ArgumentError: a group or dataset named Nx is already present within this group

It never actually writes into instantaneous_fields_part4.jld2; it keeps writing and rewriting into instantaneous_fields.jld2. If I instead specify part=10, or any number larger than 4, the same problem occurs.

If I use part=1 in my second run of the simulation, it throws:

ERROR: ArgumentError: '.\./test_outputwriter/instantaneous_fields_part1.jld2' exists. `force=true` is required to remove '.\./test_outputwriter/instantaneous_fields_part1.jld2' before moving.
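
The wording of this error matches what Julia's Base.mv raises when the destination already exists and force is left at its default of false. That suggests a file is being renamed onto the existing part 1, although this is an inference from the message and has not been checked against the Oceananigans source. A minimal stand-alone reproduction, using throwaway text files as placeholders for the real output:

```julia
# Reproduce the `mv` error outside Oceananigans, using throwaway files.
dir = mktempdir()
src = joinpath(dir, "instantaneous_fields.jld2")
dst = joinpath(dir, "instantaneous_fields_part1.jld2")
write(src, "new output")
write(dst, "old output")   # stands in for a part file left over from the first run

# force=false is the default, so moving onto an existing file throws.
err = try
    mv(src, dst)
    nothing
catch e
    e
end
println(err)

# With force=true the old part is replaced instead of raising an error.
mv(src, dst; force=true)
```

Note that force=true would silently discard the earlier part, which is probably not what someone continuing a run wants.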

I'm not sure what the intended user experience is, but I was imagining that if the simulation stops for some reason and I want to rerun it from a checkpoint, two potential options would be available:

  1. The model runs from the latest checkpoint, and continues writing into the most recent output file once it catches up to the latest unsaved iteration. Because the model restarts from the checkpoint, the iterations it passes through may already be saved in earlier parts than the most recent output. The simulation should know that and only start writing into the latest part once it has caught up to the latest saved iteration.
  2. I specify a part number that is larger than all the previous output files, and the simulation picks up from the checkpoint and writes into the new part number. This could mean that duplicate iterations are saved across the output files (new and old).

Option 1 is potentially the most important and common use case, but option 2 might not be unreasonable either. However, neither can be achieved in the current implementation.
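
To make option 1 concrete, here is a rough sketch of the "catch up, then resume writing" rule. The helper should_write and the iteration numbers are made up for illustration; nothing here is existing Oceananigans API.

```julia
# Hypothetical helper: after picking up from a checkpoint that is older than
# the newest saved output, only iterations beyond the newest saved one are written.
should_write(iteration, last_saved_iteration) = iteration > last_saved_iteration

checkpoint_iteration = 100   # illustrative: where the pickup restarts
last_saved_iteration = 150   # illustrative: newest iteration already on disk
stop_iteration       = 155

# Iterations 100 to 150 are recomputed but skipped; writing resumes at 151.
written = [i for i in checkpoint_iteration:stop_iteration
           if should_write(i, last_saved_iteration)]
println(written)
```

Under this rule no iteration would appear twice in the output, which is the difference from option 2.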

glwagner commented Dec 2, 2023

The intended user experience is that only one line should need to be changed: pickup=false to pickup=true in run!.

Therefore, users should not have to manually specify the "part" that they want to pick up from. I don't like option 2 above.

I think that fixing this problem may become much easier if we can "delay" the creation of the output file. Right now, the output file is created when we build the output writer. But at that point, we have no way of knowing whether we are going to pick up or not.

I've long wanted to implement this "delay" but more pressing matters have intervened...

The basic thing we need to do is to add an initialize!(output_writer, sim) utility, which will create the output file. That function will then know whether the simulation is starting fresh (because iteration(sim) == 0) or whether it is "continuing". One huge feature this will enable is the ability to avoid overwriting an existing file when it represents the output from the current continuing run. A big problem with the current interface is that you have to be really careful about overwrite_existing if you are trying to pick up from a checkpoint.
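
A minimal sketch of what delayed file creation could look like, in plain Julia. LazyWriter is a hypothetical stand-in for an output writer, the integer iteration argument stands in for iteration(sim), and plain text files stand in for JLD2 files; none of this is current Oceananigans code.

```julia
# Hypothetical stand-in for an output writer whose file is created lazily.
mutable struct LazyWriter
    filepath::String
    initialized::Bool
end

# Building the writer does not touch the file system.
LazyWriter(filepath) = LazyWriter(filepath, false)

# `iteration` plays the role of iteration(sim).
function initialize!(writer::LazyWriter, iteration::Integer)
    starting_fresh = iteration == 0
    if starting_fresh || !isfile(writer.filepath)
        write(writer.filepath, "")   # create (or truncate) the output file
    end
    # When continuing and the file already exists, it is left untouched.
    writer.initialized = true
    return writer
end

path = joinpath(mktempdir(), "fields.jld2")   # text stand-in, not a real JLD2 file

initialize!(LazyWriter(path), 0)      # fresh start: an empty file is created
write(path, "existing output")        # pretend the first run wrote data

initialize!(LazyWriter(path), 200)    # continuing: existing output survives
println(read(path, String))
```

The point of the design is that the decision to create or keep the file moves from construction time, when the pickup status is unknown, to the moment the run actually starts.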

With that feature I think we can also figure out how to handle output that is split into multiple files, because if we know that a simulation is continuing, we know that we have to figure out which part to use (if any).
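
For the multi-file case, one possible way to work out which part to continue from is to scan the output directory. The helper latest_part below is hypothetical, and it assumes the "<prefix>_partN.jld2" naming seen in the file names above and a prefix without regex special characters.

```julia
# Hypothetical helper: highest N among files named "<prefix>_partN.jld2" in `dir`,
# or `nothing` if no such file exists.
function latest_part(dir, prefix)
    pattern = Regex("^" * prefix * "_part(\\d+)\\.jld2\$")
    parts = Int[]
    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing || push!(parts, parse(Int, m.captures[1]))
    end
    return isempty(parts) ? nothing : maximum(parts)
end

# Mimic the directory contents after the first run in the example above.
dir = mktempdir()
for n in 1:4
    touch(joinpath(dir, "instantaneous_fields_part$n.jld2"))
end
touch(joinpath(dir, "model_checkpoint_iteration0.jld2"))   # ignored by the pattern

println(latest_part(dir, "instantaneous_fields"))
```

Parsing the part number as an integer avoids the lexicographic trap where "part10" would sort before "part2".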

> continues writing into the most recent output file once it catches up to the latest unsaved iteration.

This is a separate feature from what I was talking about, but I think it's also a great idea! It may also offer a clue for solving a roundoff error issue, where two outputs are written one iteration apart but at virtually identical times (e.g., distinguished only by machine epsilon).
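
As a rough illustration of the roundoff case, a guard along these lines could treat a time as already saved when it matches the last saved time to within a tolerance. The helper already_saved and the rtol value are assumptions for the sake of the sketch, not an existing or proposed Oceananigans interface.

```julia
# Hypothetical guard: a time counts as already saved if it equals the last
# saved time up to floating-point roundoff. The tolerance is an arbitrary choice.
already_saved(t, last_saved_time; rtol=1e-12) = isapprox(t, last_saved_time; rtol=rtol)

t_saved = 100.0
t_next  = nextfloat(100.0)   # differs from t_saved only in the last bit

println(already_saved(t_next, t_saved))   # times differ only by roundoff
println(already_saved(100.1, t_saved))    # a genuinely later time
```

Here the first call reports a duplicate and the second does not, so only the second time would be written.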

PS: I simplified the example a bit to help me understand it

glwagner commented Dec 2, 2023

Why do we even have the "part" keyword argument for JLD2OutputWriter? I feel this is a weird detail that users should not have to set.
