-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix snapshot related issue during restart of follower. #214
base: main
Are you sure you want to change the base?
Conversation
We should return false if we are not persistig snapshot. Same return null for last snapshot.
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #214 +/- ##
==========================================
- Coverage 68.69% 66.78% -1.91%
==========================================
Files 30 32 +2
Lines 1581 1725 +144
Branches 163 186 +23
==========================================
+ Hits 1086 1152 +66
- Misses 408 480 +72
- Partials 87 93 +6 ☔ View full report in Codecov by Sentry. |
It looks like this is still a draft? |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you share why we got oom? It looks to me like we wrongly return true for apply_snapshot, it could have correctness issue but why oom?
Also we don't truncate, why we went through the baseline resync?
How the last_snapshot is used on follower side?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should not make any change to snapshot code here.
what we should do is to disable snapshot before we complete baseline resync.
so I think we can disable this item (in RaftReplService::start) for now.
or set consensus.snapshot_freq_distance to uint32_max
.with_snapshot_enabled(HS_DYNAMIC_CONFIG(consensus.snapshot_freq_distance))
pls spend some time to figure out the root cause. I don`t think we can merge this PR before that |
I dont think we have to figure out the root cause because the feature is not complete. Snapshot is incomplete without the write PR and snapshot save and recovery from you. I only did the read part. How can we test end to end without this ? If we see this issue even after the write PR is done, we have to find the root cause. It was already false. By mistake I made to to true. Already checked with Jungsang and its not a valid case that we return true but we dont save the state. If we return true, state has to be saved. |
can we disable snapshot by disabling this? or can we revert the read_part code in master branch and move it to a feature branch? as we disscussed last week, we need other changes to read_part code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I dont know how to review this PR honestly. I didnt get
- apply_snapshot returns false it will shutdown the system --- are we doing that in this PR to make it like an assert?
- last_snapshot return nullptr ---- how this help to fix the problem?
Yes; there is some confusion about how snapshot works. It's non trivial and it took me a lot of time to understand it when doing it for NuBlox. Let's have a meeting about this so everyone understands how it's supposed to work, i can whiteboard it if there's confusion. |
|
@xiaoxichen @szmyd @sanebay this the error stack I find, can you proceed this PR ? so that issue will be solved |
We should return false if we are not persistig snapshot. apply_snapshot expects to return true if it saved and false will cause it exit. Same return null for last snapshot till we dont persist the snapshot. Behaviour is undefined. Write baseline PR can add this feature to save the snapshot and recover it and return in last_snapshot.