pageserver: compaction racing with archival config can leave Timeline stuck in Stopping. #10220

jcsp · 2024-12-20T13:57:29Z

via INC-357.

Compaction races with modifying the timeline's archival config.
offload_timeline gets called for a timeline that was recently archived, and errors out if the timeline isn't archived (e.g. because it was unarchived in the meantime). However, by that point it has already called DeleteTimelineFlow::prepare, which puts the timeline into Stopping state.
So that's how it's stuck right now: the Timeline exists but is in Stopping state.
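
To make the ordering problem concrete, here is a minimal toy model of the sequence described above. It is not the pageserver code: `Timeline`, `State`, `OffloadError` and `offload_buggy` are hypothetical stand-ins, and the single assignment to `state` only stands in for what DeleteTimelineFlow::prepare does to the timeline state.

```rust
use std::sync::Mutex;

/// Hypothetical, heavily simplified timeline state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Active,
    Stopping,
}

/// Hypothetical stand-in for a timeline: only the two fields that matter here.
pub struct Timeline {
    pub state: Mutex<State>,
    pub archived: Mutex<bool>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum OffloadError {
    NotArchived,
}

/// Models the problematic order: the state transition happens first,
/// the archival check happens afterwards.
pub fn offload_buggy(tl: &Timeline) -> Result<(), OffloadError> {
    // Stand-in for DeleteTimelineFlow::prepare: the timeline enters Stopping.
    *tl.state.lock().unwrap() = State::Stopping;

    // The archival config may have been changed concurrently.
    if !*tl.archived.lock().unwrap() {
        // Early return: nothing moves the timeline out of Stopping again.
        return Err(OffloadError::NotArchived);
    }
    Ok(())
}
```

If the timeline was unarchived just before the call, `offload_buggy` returns `Err(OffloadError::NotArchived)` and the model timeline is left in `State::Stopping`, which matches the stuck state reported here.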

https://neondb.slack.com/archives/C085L8N9B4P

arpad-m · 2024-12-20T17:35:06Z

I think that moving the timeline out from a Stopping state would not be good. I.e. Stopping should be an irreversible path to walk. Otherwise we'd have to implement an "initialization light" operation, and think about how all the different timeline components deal with initialization.

So more or less, what should happen is that once the timeline is in Stopping state, we should be completing the offload operation, and then unoffload it when the archival config request is retried. As the offload operation is triggered by the compaction task, we'll need some mechanism that continues the offload once that errors out: it's untenable to error on all archival config requests until compaction gets to the timeline again.

Actionable items:

We need some thread-safe switch that both timeline offloading and timeline unarchival need to flip to "their" direction so that there is no race going on. This would fix the bug at hand. For offloading, it could be a flag in the Stopping state maybe, as well as a new IsUnarchiving state, idk. Or we add it somewhere to the upload queue and have it live under its locks, idk.
We should also think about retries of offloading (issued by compaction). If it fails, it might leave the timeline in a Stopping state, so waiting until the compaction loop gets to it again isn't good. Maybe we should spawn a task that retries indefinitely? Or have the compaction loop just have a sub-loop that retries offloading until completion? What if it errors each time?
We should also think about retries of archival config: say a timeline gets put into IsUnarchiving state: is there some running task that ensures this runs to completion? Or do we want to be dependent on users to retry unarchival, and not just give up eventually? I think the latter is where we want to move deletion eventually, so we can probably just demand this for unarchival as well: if an unarchive operation is started, we expect it to be retried until completion.
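
As a rough illustration of the first item, the "switch" could be a single atomic value that each side tries to claim with a compare-and-swap before doing anything irreversible. This is only a sketch of one of the options floated above, not the pageserver's design: `ArchivalSwitch`, `Direction` and `try_claim` are hypothetical names, and where such a value would live (timeline state, upload queue, ...) is left open.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const IDLE: u8 = 0;
const OFFLOADING: u8 = 1;
const UNARCHIVING: u8 = 2;

/// Which side currently owns the timeline's archival transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Direction {
    Offloading,
    Unarchiving,
}

/// Hypothetical per-timeline switch; the first side to flip it wins.
pub struct ArchivalSwitch(AtomicU8);

impl ArchivalSwitch {
    pub fn new() -> Self {
        Self(AtomicU8::new(IDLE))
    }

    /// Try to flip the switch towards `want`.
    /// Returns Ok(()) if this side owns the switch (including on retry),
    /// or Err(other) naming the side that claimed it first.
    pub fn try_claim(&self, want: Direction) -> Result<(), Direction> {
        let want_raw = match want {
            Direction::Offloading => OFFLOADING,
            Direction::Unarchiving => UNARCHIVING,
        };
        match self
            .0
            .compare_exchange(IDLE, want_raw, Ordering::AcqRel, Ordering::Acquire)
        {
            Ok(_) => Ok(()),
            // Already claimed by the same side: treat a retry as idempotent.
            Err(cur) if cur == want_raw => Ok(()),
            Err(OFFLOADING) => Err(Direction::Offloading),
            Err(_) => Err(Direction::Unarchiving),
        }
    }
}
```

In this sketch, offloading would call `try_claim(Direction::Offloading)` before the timeline is put into Stopping, and back off without touching the timeline if the unarchival side got there first. Releasing the switch after completion is deliberately left out.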

arpad-m · 2024-12-20T20:46:24Z

think about retries of offloading (issued by compaction)

as the compaction loop is per-tenant, it's probably not a good idea to block compactions of other timelines on this. however, compaction doesn't sleep if there is still work left to do. so maybe if there is an error during offloading, we could make it piggyback on that mechanism. of course, we should make sure that actual compaction doesn't get in the way.
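
A minimal sketch of that piggybacking idea, assuming the per-tenant loop collects one outcome per timeline on each pass: a failed offload is treated like leftover compaction work, so the loop skips its sleep. `TimelineOutcome` and `next_delay` are hypothetical names, not pageserver APIs.

```rust
use std::time::Duration;

/// Hypothetical result of one compaction-loop pass over a single timeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimelineOutcome {
    Done,
    MoreCompactionWork,
    OffloadFailed,
}

/// How long the per-tenant loop should sleep before its next pass.
/// Any timeline with pending work, including a failed offload,
/// makes the loop run again immediately.
pub fn next_delay(outcomes: &[TimelineOutcome], period: Duration) -> Duration {
    let pending = outcomes.iter().any(|o| *o != TimelineOutcome::Done);
    if pending {
        Duration::ZERO
    } else {
        period
    }
}
```

This does not block other timelines on the failed offload, but as written it would spin if offloading errors every time, so a real version would presumably need some backoff for that case.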

jcsp added the c/storage/pageserver (Component: storage: pageserver) and t/bug (Issue Type: Bug) labels Dec 20, 2024

jcsp assigned arpad-m Dec 20, 2024
