Fix snapshot related issue during restart of follower. #214

sanebay · 2024-09-17T03:44:07Z

We should return false if we are not persistig snapshot. apply_snapshot expects to return true if it saved and false will cause it exit. Same return null for last snapshot till we dont persist the snapshot. Behaviour is undefined. Write baseline PR can add this feature to save the snapshot and recover it and return in last_snapshot.

We should return false if we are not persistig snapshot. Same return null for last snapshot.

codecov-commenter · 2024-09-17T03:59:48Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.

Project coverage is 66.78%. Comparing base (acb04e8) to head (6549ed9).
Report is 20 commits behind head on main.

Files with missing lines	Patch %	Lines
...ib/homestore_backend/replication_state_machine.cpp	14.28%	6 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #214      +/-   ##
==========================================
- Coverage   68.69%   66.78%   -1.91%     
==========================================
  Files          30       32       +2     
  Lines        1581     1725     +144     
  Branches      163      186      +23     
==========================================
+ Hits         1086     1152      +66     
- Misses        408      480      +72     
- Partials       87       93       +6

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

xiaoxichen · 2024-09-17T07:47:25Z

It looks like this is still a draft?

sanebay · 2024-09-17T21:37:42Z

It looks like this is still a draft?
No its not draft. Ran the tests couple of times and we dont see OOM.

xiaoxichen

could you share why we got oom? It looks to me like we wrongly return true for apply_snapshot, it could have correctness issue but why oom?

Also we don't truncate, why we went through the baseline resync?

How the last_snapshot is used on follower side?

sanebay · 2024-09-18T00:37:05Z

could you share why we got oom? It looks to me like we wrongly return true for I
I dont know either. Checked with jungsang too. Its not a valid condition and so we dont know the behaviour. We will have to check once write is enabled if we hit OOM.

JacksonYao287

I think we should not make any change to snapshot code here.

what we should do is to disable snapshot before we complete baseline resync.

so I think we can disable this item (in RaftReplService::start) for now.
or set consensus.snapshot_freq_distance to uint32_max

  .with_snapshot_enabled(HS_DYNAMIC_CONFIG(consensus.snapshot_freq_distance))

src/lib/homestore_backend/replication_state_machine.cpp

JacksonYao287 · 2024-09-18T03:28:38Z

I dont know either

pls spend some time to figure out the root cause. I don`t think we can merge this PR before that

sanebay · 2024-09-18T04:29:23Z

I dont know either

pls spend some time to figure out the root cause. I don`t think we can merge this PR before that

I dont think we have to figure out the root cause because the feature is not complete. Snapshot is incomplete without the write PR and snapshot save and recovery from you. I only did the read part. How can we test end to end without this ? If we see this issue even after the write PR is done, we have to find the root cause. It was already false. By mistake I made to to true. Already checked with Jungsang and its not a valid case that we return true but we dont save the state. If we return true, state has to be saved.

JacksonYao287 · 2024-09-18T09:37:20Z

.with_snapshot_enabled(HS_DYNAMIC_CONFIG(consensus.snapshot_freq_distance))

can we disable snapshot by disabling this?

or can we revert the read_part code in master branch and move it to a feature branch?
when we have 3-replica test framework for HO, and read/write part all pass the UT , then we merge it to master branch.

as we disscussed last week, we need other changes to read_part code.
1 when we create shard, we need the chunk id , so that we can guarantee the follower has the same layout of each chunk.
2 if we involve 1 , we need change the baseline resync proto.

xiaoxichen

sorry I dont know how to review this PR honestly. I didnt get

apply_snapshot returns false it will shutdown the system --- are we doing that in this PR to make it like an assert?
last_snapshot return nullptr ---- how this help to fix the problem?

szmyd · 2024-09-19T03:38:57Z

Yes; there is some confusion about how snapshot works. It's non trivial and it took me a lot of time to understand it when doing it for NuBlox. Let's have a meeting about this so everyone understands how it's supposed to work, i can whiteboard it if there's confusion.
Safer is to fail create_snapshot for now if we want to avoid it; but this will not prevent read_logical_snapshot from being called, the two are unrelated (though seem to be at first glance). Keeping the leader's log_store->start_index() at 0 will ensure a log sync is used rather than a snapshot.

sanebay · 2024-09-19T18:50:28Z

sorry I dont know how to review this PR honestly. I didnt get

apply_snapshot returns false it will shutdown the system --- are we doing that in this PR to make it like an assert?

last_snapshot return nullptr ---- how this help to fix the problem?

apply_snapshot() is called at the end of is_last_obj only when read/write snapshot is called. In this case no snapshot is happening. I tested this and we dont cause OOM. This change has no effect. Whoever writes the snapshot save and recovery will change it. https://github.com/eBay/NuRaft/blob/master/src/handle_snapshot_sync.cxx#L563
It was already nullptr with Arjun's change 9042b8d. The problem is snapshot should be saved and recovered which was created during create_snapshot. But test case has follower restart and because its restarted, here last_snapshot() is returning a shared_ptrnuraft::snapshot with snapshot struct is not initialized and invalid.
https://github.com/eBay/NuRaft/blob/master/src/raft_server.cxx#L112

JacksonYao287 · 2024-10-16T08:09:10Z

@xiaoxichen @szmyd
now homeobject tests will be blocked by this issue and this PR will fix this issue. since we will definitely make other changes about snapshot in the near future to adopt identical shard layout, so I suggest we can merge this PR for now to keep homeobject from being blocked and revisit here if needed when we add other changes to snapshot. i am +1

@sanebay this the error stack I find, can you proceed this PR ? so that issue will be solved

Fix snapshot related issue during restart of follower.

6549ed9

We should return false if we are not persistig snapshot. Same return null for last snapshot.

sanebay requested review from xiaoxichen, JacksonYao287 and raakella1 September 17, 2024 03:44

xiaoxichen reviewed Sep 18, 2024

View reviewed changes

JacksonYao287 reviewed Sep 18, 2024

View reviewed changes

src/lib/homestore_backend/replication_state_machine.cpp Show resolved Hide resolved

sanebay requested a review from szmyd September 18, 2024 21:13

xiaoxichen reviewed Sep 19, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix snapshot related issue during restart of follower. #214

Fix snapshot related issue during restart of follower. #214

sanebay commented Sep 17, 2024 •

edited

Loading

codecov-commenter commented Sep 17, 2024

xiaoxichen commented Sep 17, 2024

sanebay commented Sep 17, 2024

xiaoxichen left a comment

sanebay commented Sep 18, 2024

JacksonYao287 left a comment •

edited

Loading

JacksonYao287 commented Sep 18, 2024

sanebay commented Sep 18, 2024

JacksonYao287 commented Sep 18, 2024 •

edited

Loading

xiaoxichen left a comment

szmyd commented Sep 19, 2024 •

edited

Loading

sanebay commented Sep 19, 2024 •

edited

Loading

JacksonYao287 commented Oct 16, 2024 •

edited

Loading

Fix snapshot related issue during restart of follower. #214

Are you sure you want to change the base?

Fix snapshot related issue during restart of follower. #214

Conversation

sanebay commented Sep 17, 2024 • edited Loading

codecov-commenter commented Sep 17, 2024

Codecov Report

xiaoxichen commented Sep 17, 2024

sanebay commented Sep 17, 2024

xiaoxichen left a comment

Choose a reason for hiding this comment

sanebay commented Sep 18, 2024

JacksonYao287 left a comment • edited Loading

Choose a reason for hiding this comment

JacksonYao287 commented Sep 18, 2024

sanebay commented Sep 18, 2024

JacksonYao287 commented Sep 18, 2024 • edited Loading

xiaoxichen left a comment

Choose a reason for hiding this comment

szmyd commented Sep 19, 2024 • edited Loading

sanebay commented Sep 19, 2024 • edited Loading

JacksonYao287 commented Oct 16, 2024 • edited Loading

sanebay commented Sep 17, 2024 •

edited

Loading

JacksonYao287 left a comment •

edited

Loading

JacksonYao287 commented Sep 18, 2024 •

edited

Loading

szmyd commented Sep 19, 2024 •

edited

Loading

sanebay commented Sep 19, 2024 •

edited

Loading

JacksonYao287 commented Oct 16, 2024 •

edited

Loading