Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: compaction racing with archival config can leave Timeline stuck in Stopping. #10220

Open
jcsp opened this issue Dec 20, 2024 · 2 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@jcsp
Copy link
Collaborator

jcsp commented Dec 20, 2024

via INC-357.

Compaction races with modifying the timeline's archival config.
offload_timeline gets called for a timeline that was recently archived, errors out if the timeline isn't archived. However, by that point it has already called DeleteTimelineFlow::prepare, which puts the timeline into Stopping state.
So that's how it's stuck right now: the Timeline exists but is in Stopping state.

https://neondb.slack.com/archives/C085L8N9B4P

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Dec 20, 2024
@arpad-m
Copy link
Member

arpad-m commented Dec 20, 2024

I think that moving the timeline out from a Stopping state would not be good. I.e. Stopping should be an irreversible path to walk. Otherwise we'd have to implement a "initialization light" operation, and think about how all the different timeline components deal with initialization.

So more or less, what should happen is that once the timeline is in Stopping state, we should be completing the offload operation, and then unoffload it when the archival config request is retried. As the offload operation is triggered by the compaction task, we'll need some mechanism that continues the offload once that errors out: it's untenable to error on all archival config requests until compaction gets to the timeline again.

Actionable items:

  • We need some thread-safe switch that both timeline offloading and timeline need to flip to "their" direction so that there is no race going on. This would fix the bug at hand. For offloading, it could be a flag in the Stopping state maybe, as well as a new IsUnarchiving state, idk. Or we add it somewhere to the upload queue and have it live under its locks, idk.
  • We should also think about retries of offloading (issued by compaction). if it fails, it might leave the timeline in a stopping state, so waiting until the compaction loop gets to it again isn't good. Maybe we should spawn a task that retries indefinitely? Or have the compaction loop just have a sub-loop that retries offloading until completion? What if it errors each time?
  • We should also think about retries of archival config: say a timeline gets put into IsUnarchiving state: is there some running task that ensures this runs to completion? Or do we want to be dependent on users to retry unarchival, and not just giving up eventually? I think latter is where we want to move deletion eventually, so we can probably just demand this for unarchival as well: if an unarchive operation is started, we expect it to be retried until completion.

@arpad-m
Copy link
Member

arpad-m commented Dec 20, 2024

think about retries of offloading (issued by compaction)

as the compaction loop is per-tenant, it's probably not a good idea to block compactions of other timelines on this. however, compaction doesn't sleep if there is still work left to do. so maybe if there is an error during offloading, we could make it piggy back on that mechanism. of course, we should make sure that actual compaction doesn't get into the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

2 participants