
Fix snapshot related issue during restart of follower. #214

Open · wants to merge 1 commit into base: main
Conversation

@sanebay (Contributor) commented Sep 17, 2024

We should return false if we are not persisting the snapshot. apply_snapshot is expected to return true only when the snapshot was actually saved; returning false makes the caller exit. Similarly, last_snapshot should return null until we actually persist the snapshot; otherwise the behaviour is undefined. The write/baseline PR can add the feature to save the snapshot, recover it, and return it from last_snapshot.

We should return false if we are not persisting the snapshot.
Similarly, return null for last_snapshot.
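
To make the intended contract concrete, here is a minimal, self-contained sketch; the type and class names below are placeholders, not the actual replication_state_machine.cpp signatures:

```cpp
#include <memory>

// Placeholder for the real snapshot/context type used by the state machine.
struct SnapshotCtx {};

// Illustrative only: shows the contract this PR argues for.
struct SnapshotContract {
    // Snapshot persistence is not implemented yet, so report failure rather
    // than pretending the snapshot was saved; returning true would tell the
    // raft layer the snapshot is durable when it is not.
    bool apply_snapshot(SnapshotCtx const&) { return false; }

    // Until a snapshot is actually persisted and recoverable, advertise
    // "no snapshot" instead of handing back an uninitialized object.
    std::shared_ptr< SnapshotCtx > last_snapshot() { return nullptr; }
};
```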
@codecov-commenter commented

Codecov Report

Attention: Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.

Project coverage is 66.78%. Comparing base (acb04e8) to head (6549ed9).
Report is 20 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ib/homestore_backend/replication_state_machine.cpp | 14.28% | 6 Missing ⚠️ |


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #214      +/-   ##
==========================================
- Coverage   68.69%   66.78%   -1.91%     
==========================================
  Files          30       32       +2     
  Lines        1581     1725     +144     
  Branches      163      186      +23     
==========================================
+ Hits         1086     1152      +66     
- Misses        408      480      +72     
- Partials       87       93       +6     

☔ View full report in Codecov by Sentry.

@xiaoxichen (Collaborator) commented

It looks like this is still a draft?

@sanebay (Contributor, Author) commented Sep 17, 2024

> It looks like this is still a draft?

No, it's not a draft. Ran the tests a couple of times and we don't see the OOM.

@xiaoxichen (Collaborator) left a comment

Could you share why we got the OOM? It looks to me like we wrongly return true from apply_snapshot; that could cause a correctness issue, but why an OOM?

Also, we don't truncate, so why did we go through the baseline resync?

How is last_snapshot used on the follower side?

@sanebay (Contributor, Author) commented Sep 18, 2024

> Could you share why we got the OOM? It looks to me like we wrongly return true from apply_snapshot...

I don't know either. Checked with Jungsang too. It's not a valid condition, so we don't know the behaviour. We will have to check whether we hit the OOM once write is enabled.

@JacksonYao287 (Collaborator) left a comment

I think we should not make any change to the snapshot code here.

What we should do is disable snapshots until we complete baseline resync.

So I think we can disable this item (in RaftReplService::start) for now, or set consensus.snapshot_freq_distance to UINT32_MAX:

  .with_snapshot_enabled(HS_DYNAMIC_CONFIG(consensus.snapshot_freq_distance))
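
For reference, a hedged sketch of that workaround; it assumes a snapshot distance of 0 (or a very large value, as suggested above) effectively disables automatic snapshot creation, which should be confirmed against the nuraft_mesg option's actual semantics:

```cpp
#include <cstdint>
#include <limits>

// Sketch only: pick a snapshot distance that effectively disables automatic
// snapshots while baseline resync support is incomplete.  Whether 0 means
// "disabled" should be confirmed; UINT32_MAX at least pushes the next
// snapshot far out of reach.
constexpr uint32_t k_snapshot_distance_disabled =
    std::numeric_limits< uint32_t >::max();

// In RaftReplService::start() this would replace the dynamic-config value:
//   .with_snapshot_enabled(k_snapshot_distance_disabled)
```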

@JacksonYao287 (Collaborator) commented

> I don't know either

Please spend some time figuring out the root cause. I don't think we can merge this PR before that.

@sanebay (Contributor, Author) commented Sep 18, 2024

> > I don't know either
>
> Please spend some time figuring out the root cause. I don't think we can merge this PR before that.

I don't think we have to figure out the root cause, because the feature is not complete. Snapshot support is incomplete without the write PR and the snapshot save/recovery work from you; I only did the read part. How can we test this end to end without those? If we still see this issue after the write PR is done, then we have to find the root cause. The value was already false; by mistake I changed it to true. I already checked with Jungsang, and it is not a valid case to return true without saving the state. If we return true, the state has to be saved.

@JacksonYao287 (Collaborator) commented Sep 18, 2024

> .with_snapshot_enabled(HS_DYNAMIC_CONFIG(consensus.snapshot_freq_distance))

Can we disable snapshot by disabling this?

Or can we revert the read-part code on the master branch and move it to a feature branch? When we have a 3-replica test framework for HO and the read/write parts all pass the UT, then we merge it to the master branch.

As we discussed last week, we need other changes to the read-part code:
1. When we create a shard, we need the chunk id, so that we can guarantee the follower has the same layout for each chunk.
2. If we do 1, we need to change the baseline resync proto.

@sanebay sanebay requested a review from szmyd September 18, 2024 21:13
@xiaoxichen (Collaborator) left a comment

Sorry, I honestly don't know how to review this PR. I didn't get:

  1. apply_snapshot returning false will shut down the system --- are we doing that in this PR to make it like an assert?
  2. last_snapshot returning nullptr --- how does this help fix the problem?

@szmyd (Collaborator) commented Sep 19, 2024

Yes; there is some confusion about how snapshot works. It's non-trivial, and it took me a lot of time to understand it when doing it for NuBlox. Let's have a meeting about this so everyone understands how it's supposed to work; I can whiteboard it if there's confusion.
The safer option is to fail create_snapshot for now if we want to avoid it; but this will not prevent read_logical_snapshot from being called, as the two are unrelated (though they seem related at first glance). Keeping the leader's log_store->start_index() at 0 will ensure a log sync is used rather than a snapshot.
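
As a reference for the "fail create_snapshot" route, here is a minimal sketch against NuRaft's state_machine::create_snapshot callback; it only illustrates completing the async result with false, and the actual hook in replication_state_machine.cpp may differ:

```cpp
#include <libnuraft/nuraft.hxx>

// Sketch only: decline snapshot creation by completing the async result
// with 'false'.  Names follow NuRaft's state_machine interface, not the
// exact replication_state_machine.cpp signatures.
void create_snapshot_stub(nuraft::snapshot& /*s*/,
                          nuraft::async_result< bool >::handler_type& when_done) {
    bool result = false;                          // snapshot creation declined
    nuraft::ptr< std::exception > err(nullptr);   // no exception to report
    when_done(result, err);                       // report failure back to NuRaft
}
```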

@sanebay (Contributor, Author) commented Sep 19, 2024

> Sorry, I honestly don't know how to review this PR. I didn't get:
>
> 1. apply_snapshot returning false will shut down the system --- are we doing that in this PR to make it like an assert?
> 2. last_snapshot returning nullptr --- how does this help fix the problem?

1. apply_snapshot() is called at the end of is_last_obj only when the snapshot read/write path actually runs. In this case no snapshot is happening; I tested this and we don't hit the OOM, so this change has no effect. Whoever implements snapshot save and recovery will change it. https://github.com/eBay/NuRaft/blob/master/src/handle_snapshot_sync.cxx#L563

2. It was already nullptr with Arjun's change 9042b8d. The problem is that the snapshot created during create_snapshot should be saved and recovered, but the test case restarts the follower, and after the restart last_snapshot() returns a shared_ptr<nuraft::snapshot> whose snapshot struct is not initialized and is invalid.
   https://github.com/eBay/NuRaft/blob/master/src/raft_server.cxx#L112

@JacksonYao287 (Collaborator) commented Oct 16, 2024

@xiaoxichen @szmyd
The homeobject tests are now blocked by this issue, and this PR fixes it. Since we will definitely make other changes to snapshot in the near future to adopt an identical shard layout, I suggest we merge this PR for now to keep homeobject from being blocked, and revisit this if needed when we add other changes to snapshot. I am +1.

@sanebay this is the error stack I found; can you proceed with this PR so that the issue gets resolved?
