Compaction races with modifying the timeline's archival config.
offload_timeline gets called for a timeline that was recently archived, and errors out if the timeline isn't archived any more (because an archival config change raced with it). However, by that point it has already called DeleteTimelineFlow::prepare, which puts the timeline into the Stopping state.
So that's how it's stuck right now: the Timeline exists but is in the Stopping state.
I think that moving the timeline out of the Stopping state would not be good, i.e. Stopping should be an irreversible path to walk. Otherwise we'd have to implement an "initialization light" operation and think about how all the different timeline components would deal with re-initialization.
So, more or less, what should happen is: once the timeline is in the Stopping state, we should complete the offload operation, and then unoffload when the archival config request is retried. As the offload operation is triggered by the compaction task, we'll need some mechanism that continues the offload once it errors out: it's untenable to error on all archival config requests until compaction gets to the timeline again.
Actionable items:
We need some thread-safe switch that both timeline offloading and the archival config change need to flip to "their" direction, so that there is no race going on. This would fix the bug at hand. For offloading, it could be a flag in the Stopping state, or a new IsUnarchiving state. Or we add it somewhere to the upload queue and have it live under its locks.
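To make the "switch" idea concrete, here is a minimal sketch of such a guard: a mutex-protected enum that either direction must claim before proceeding, so offload and unarchive can never interleave. All names (`ArchivalTransition`, `TransitionGuard`, `try_claim`) are hypothetical, not actual pageserver types.

```rust
use std::sync::Mutex;

/// Which direction, if any, has currently claimed the timeline.
/// Hypothetical names for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArchivalTransition {
    Idle,
    Offloading,  // the compaction-triggered offload claimed the timeline
    Unarchiving, // an archival config request claimed it
}

struct TransitionGuard {
    state: Mutex<ArchivalTransition>,
}

impl TransitionGuard {
    fn new() -> Self {
        Self {
            state: Mutex::new(ArchivalTransition::Idle),
        }
    }

    /// Atomically claim the transition for `want`. Fails (returning the
    /// current holder) if the opposite direction already claimed it, so
    /// the two paths serialize instead of racing.
    fn try_claim(&self, want: ArchivalTransition) -> Result<(), ArchivalTransition> {
        let mut state = self.state.lock().unwrap();
        match *state {
            ArchivalTransition::Idle => {
                *state = want;
                Ok(())
            }
            other if other == want => Ok(()), // idempotent re-claim on retry
            other => Err(other),
        }
    }

    /// Reset once the claimed operation has run to completion.
    fn release(&self) {
        *self.state.lock().unwrap() = ArchivalTransition::Idle;
    }
}
```

With this shape, offload_timeline would `try_claim(Offloading)` before calling DeleteTimelineFlow::prepare, and the archival config handler would `try_claim(Unarchiving)` first, erroring (or waiting) if the other side holds the claim.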
We should also think about retries of offloading (issued by compaction). If it fails, it might leave the timeline in the Stopping state, so waiting until the compaction loop gets to it again isn't good. Maybe we should spawn a task that retries indefinitely? Or have the compaction loop run a sub-loop that retries offloading until completion? What if it errors each time?
We should also think about retries of the archival config: say a timeline gets put into an IsUnarchiving state: is there some running task that ensures this runs to completion? Or do we want to depend on users retrying unarchival rather than giving up eventually? I think the latter is where we want to move deletion eventually, so we can probably demand the same for unarchival: if an unarchive operation is started, we expect it to be retried until completion.
Think about retries of offloading (issued by compaction):
As the compaction loop is per-tenant, it's probably not a good idea to block compaction of other timelines on this. However, compaction doesn't sleep if there is still work left to do, so if there is an error during offloading, we could perhaps make it piggy-back on that mechanism. Of course, we should make sure that actual compaction doesn't get in the way.
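The bounded sub-loop variant could look like the sketch below: retry a few times with a growing backoff, then give up and let the next compaction iteration (which doesn't sleep while work remains) pick the timeline up again. `offload` and all parameters are stand-ins, not the real pageserver API.

```rust
use std::time::Duration;

/// Hypothetical sketch: a bounded retry sub-loop the per-tenant compaction
/// task could run when offloading fails, instead of waiting out a full
/// compaction period. `offload` stands in for the real offload call.
fn offload_with_retries<E>(
    mut offload: impl FnMut() -> Result<(), E>,
    max_attempts: u32,
    backoff: Duration,
) -> Result<(), E> {
    let mut attempt = 0;
    loop {
        match offload() {
            Ok(()) => return Ok(()),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    // Give up for now; report the error upward so the
                    // compaction loop's "work remains" path retries later.
                    return Err(e);
                }
                // Linear backoff between attempts.
                std::thread::sleep(backoff * attempt);
            }
        }
    }
}
```

Keeping the loop bounded is what avoids blocking compaction of the tenant's other timelines for too long; the "retry each compaction pass" behavior comes for free from the existing loop.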
via INC-357.
https://neondb.slack.com/archives/C085L8N9B4P