Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only push latest snapshot hashes in SnapshotGossipManager #31154

Merged
merged 1 commit into from
Apr 12, 2023

Conversation

brooksprumo
Copy link
Contributor

@brooksprumo brooksprumo commented Apr 11, 2023

Problem

Discovery of snapshots in bootstrap is hard (see #30759). The main reason is that full snapshots and incremental snapshots are gossiped out on different CRDS values.

Going forward, we can (read: will) use CRDS SnapshotHashes to gossip out the full and incremental snapshots together at the same time.

We also need to maintain compatibility with the existing behavior that expects to get full snapshots via LegacySnapshotHashes, and incremental snapshots via SnapshotHashes (formerly called IncrementalSnapshotHashes).

Luckily, that's basically what the current code already does!

The bootstrap code requires that the full snapshot info from LegacySnapshotHashes and the base snapshot info from SnapshotHashes are the same, which is an invariant we will maintain. And here in SnapshotPackagerService we always use the latest full snapshot to populate the incremental snapshot information anyway.

Also, bootstrap only cares about the latest snapshots. This is the meat of this change: we only need to push the latest snapshot hashes to the cluster.

Summary of Changes

Refactor the SnapshotGossipManager to only push the latest snapshot hashes (for SnapshotHashes). We retain the legacy behavior of keeping all the full snapshots for LegacySnapshotHashes too.

Note: Due to the repurposing of how SnapshotHashes is now used, the diff is not the most helpful as the design of SGM has changed. I'll try to point out any tricky and/or big changes.

Note 2: It is important that nodes running a version older (i.e. beta) are compatible. Specifically, this means that bootstrap with v1.14 must still operate basically the same if its other nodes in the cluster are running with this change.

@brooksprumo brooksprumo self-assigned this Apr 11, 2023
Copy link
Contributor Author

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling out a few spots that I think may be helpful. It also may be beneficial to read the old code and new code on their own, separate from the diff (as the GitHub diff doesn't really do a good job of matching up the code, IMO).

Comment on lines 17 to +18
max_full_snapshot_hashes: usize,
max_incremental_snapshot_hashes: usize,
full_snapshot_hashes: FullSnapshotHashes,
incremental_snapshot_hashes: IncrementalSnapshotHashes,
legacy_full_snapshot_hashes: FullSnapshotHashes,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually both of these fields will go away. The exist currently to support LegacySnapshotHashes.

@@ -26,120 +24,153 @@ impl SnapshotGossipManager {
pub fn new(
cluster_info: Arc<ClusterInfo>,
max_full_snapshot_hashes: usize,
max_incremental_snapshot_hashes: usize,
_max_incremental_snapshot_hashes: usize,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left this parameter here to make the diff smaller. I have a subsequent PR that removes it.

Comment on lines +44 to +46
let Some(starting_snapshot_hashes) = starting_snapshot_hashes else {
return;
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function was originally coded before the let-else chains were added to Rust. I find this impl to be more readable. Also, it'll make the next PR where I move this fn into new() a bit more obvious.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything that reduces nesting gets a thumbs up from me

Comment on lines -76 to -80
assert_eq!(
base_slot, latest_full_snapshot_hash.0,
"the incremental snapshot's base slot ({}) must match the latest full snapshot hash's slot ({})",
base_slot, latest_full_snapshot_hash.0,
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assert is moved, not deleted.

(Now is in update_incremental_snapshot_hash(), which is called by push_incremental_snapshot_hash.)

fn update_latest_full_snapshot_hash(&mut self, full_snapshot_hash: (Slot, SnapshotHash)) {
self.latest_snapshot_hashes = Some(LatestSnapshotHashes {
full: full_snapshot_hash,
incremental: None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we've gotten a new full snapshot, we know the cannot be any incremental snapshots yet (based on this full snapshot).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be nice as a comment in the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #31171 to handle this request.

Comment on lines +114 to +117
let latest_snapshot_hashes = self
.latest_snapshot_hashes
.as_mut()
.expect("there must already be a full snapshot hash");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the other side of the same coin is here. In order to get a new incremental snapshot, there must already be a full snapshot to base the incremental on.

);
latest_snapshot_hashes.incremental = Some(incremental_snapshot_hash);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we only care about the latest snapshots, we don't need to push to a vector and then trim the size. Just directly set the new value.

// Check to see what changed in `push_snapshot_hashes()` and handle the new
// error condition here.
/// Push the latest snapshot hashes to the cluster via CRDS
fn push_latest_snapshot_hashes_to_cluster(&self) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The impl of this function looks similar to the bottom half of the old push_incremental_snapshot_hash(). (It was mostly copied from it 😸 )


/// Add `full_snapshot_hash` to the vector of full snapshot hashes, then push that vector to
/// the cluster via CRDS.
fn push_legacy_full_snapshot_hash(&mut self, full_snapshot_hash: FullSnapshotHash) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fn is unchanged from the old push_full_snapshot_hash(). Just renamed and moved here.

@codecov
Copy link

codecov bot commented Apr 11, 2023

Codecov Report

Merging #31154 (0701395) into master (e1cb8a6) will decrease coverage by 0.1%.
The diff coverage is 14.4%.

@@            Coverage Diff            @@
##           master   #31154     +/-   ##
=========================================
- Coverage    81.5%    81.5%   -0.1%     
=========================================
  Files         728      729      +1     
  Lines      206197   206209     +12     
=========================================
+ Hits       168210   168219      +9     
- Misses      37987    37990      +3     

@brooksprumo brooksprumo marked this pull request as ready for review April 12, 2023 01:56
@brooksprumo brooksprumo requested a review from bw-solana April 12, 2023 01:56
Copy link
Contributor

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Couple questions/nits

@brooksprumo brooksprumo merged commit 944b9d5 into solana-labs:master Apr 12, 2023
@brooksprumo brooksprumo deleted the snapshot-gossip/sps branch April 12, 2023 15:00
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Apr 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants