Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bootstrap: Sunset using LegacySnapshotHashes for snapshot discovery #31275

Conversation

brooksprumo
Copy link
Contributor

@brooksprumo brooksprumo commented Apr 19, 2023

Problem

The bootstrap code uses both LegacySnapshotHashes and SnapshotHashes from CRDS to discover snapshots. This sometimes doesn't work out, since these two data sources are not synchronized. So getting the full snapshots from Legacy and the incremental snapshots from non-Legacy can cause bootstrap to incorrectly not download an incremental snapshot.

The full context and background can be found here:
#26180

Since that issue, we've now changed the non-Legacy CRDS data to always send correct, up-to-date full and incremental snapshot hashes together. This obviates the need for using LegacySnapshotHashes entirely.

We can now robustly discover all snapshot hashes, and can remove many conditional/error/edge cases that could arise when using two CRDS data sources.

Summary of Changes

Remove all use of LegacySnapshotHashes.

Note: I've added comments (#31275 (review)) that may be easier to read first, before looking at the diff top-down.

Additional Testing

I bootstrapped a validator running this PR against mnb and confirmed it was successfully able to download snapshots—and the correct ones!

Fixes: #30759
Fixes: #30740
Fixes: #28478
Fixes: #26180

@brooksprumo brooksprumo self-assigned this Apr 19, 2023
@brooksprumo brooksprumo force-pushed the bsb/bootstrap/remove-legacy-snapshot-hashes branch from 91f9215 to f3932c8 Compare April 19, 2023 20:44
Copy link
Contributor Author

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of reading the diff top-to-bottom, I've added comments in a way that may make understanding the changes easier (at least for me!).

/// The `get_snapshot_hashes_for_node` parameter is a function that map a pubkey to its snapshot
/// hashes. This parameter exist to provide a way to test the inner algorithm without needing
/// runtime information such as the ClusterInfo or ValidatorConfig.
fn build_known_snapshot_hashes<'a>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main changes are here in build_known_snapshot_hashes. I'll come back to add more PR comments; but the main thing to notice is that there are no longer separate parameter to get the full and incremental snapshot hashes per node.

.map(|hashes| (hashes.full, hashes.incremental))
};
// Get the snapshot hashes for a node from CRDS
let get_snapshot_hashes_for_node = |node| get_snapshot_hashes_for_node(cluster_info, node);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here within get_snapshot_hashes_from_known_validators() is where we call build_known_snapshot_hashes(). And this line is how we get the full and incremental snapshot hashes from a node.

Before, we would call into cluster_info, but now we can use the same function that we later use to ask the peer (i.e. non-known) nodes what their snapshot hashes are (get_snapshot_hashes_for_node()). Yay!

Comment on lines +1294 to +1295
fn get_snapshot_hashes_for_node(cluster_info: &ClusterInfo, node: &Pubkey) -> Option<SnapshotHash> {
cluster_info.get_snapshot_hashes_for_node(node).map(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old impl had separate functions for getting the full and the incremental snapshot hashes. The full hashes came from LegacySnapshotHashes and the incremental hashes came from SnapshotHashes. Now we get both from SnapshotHashes.

This impl is basically the same as the old get_highest_incremental_snapshot_hash_for_peer(), just with a few renames.

Comment on lines +990 to +993
let peer_snapshot_hashes = rpc_peers
.iter()
.flat_map(|rpc_peer| {
get_snapshot_hashes_for_node(cluster_info, &rpc_peer.id).map(|snapshot_hash| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And getting the peer snapshot hashes here now becomes a lot easier. Since each node only has one snapshot hash (and not a vec anymore), or None, we map and collect up all the Some ones.

This inner call to get_snapshot_hashes_for_node() does the heavy-ish lifting of getting the node's highest snapshot hash.

};
get_snapshot_hashes_for_node: impl Fn(&'a Pubkey) -> Option<SnapshotHash>,
) -> bool {
let node_has_snapshot_hashes = |node| get_snapshot_hashes_for_node(node).is_some();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking if the known validators now have snapshot hashes is now much simpler too. We don't need to check multiple CRDS sources, just the one. And again since we don't have a vec of snapshot hashes, we can simply ask if the node has snapshot hashes or not.

Comment on lines +926 to +928
let Some(SnapshotHash {full: full_snapshot_hash, incr: incremental_snapshot_hash}) = get_snapshot_hashes_for_node(node) else {
continue 'to_next_node;
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the best part. There's only a maximum of one full and incremental snapshot hash per node, instead of vecs of each. So we entirely remove the complexity of looping over both possible loops.

Comment on lines +949 to +950
let known_incremental_snapshot_hashes =
known_snapshot_hashes.entry(full_snapshot_hash).or_default();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of this, the whole else block on old-line 1053 becomes impossible. With .entry() we always have the correct spot to insert an incremental snapshot hash (if there is one).

And then we also don't need to re-find the correct location, like in old-line 1026.

Comment on lines +1356 to +1357
// simulate a set of known validators with various snapshot hashes
let oracle = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the test for build_known_snapshot_hashes(), it's now much easier to generate the oracle as well. Yay!

@codecov
Copy link

codecov bot commented Apr 19, 2023

Codecov Report

Merging #31275 (f3932c8) into master (b4ea104) will increase coverage by 0.0%.
The diff coverage is 70.3%.

@@           Coverage Diff            @@
##           master   #31275    +/-   ##
========================================
  Coverage    81.5%    81.5%            
========================================
  Files         732      732            
  Lines      206999   206861   -138     
========================================
- Hits       168767   168699    -68     
+ Misses      38232    38162    -70     

@brooksprumo brooksprumo marked this pull request as ready for review April 19, 2023 22:19
@brooksprumo brooksprumo requested a review from bw-solana April 19, 2023 22:19
);
continue 'to_next_node;
}
let Some(SnapshotHash {full: full_snapshot_hash, incr: incremental_snapshot_hash}) = get_snapshot_hashes_for_node(node) else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brooks is the king of the let/else

@@ -977,134 +923,81 @@ where
}

'to_next_node: for node in nodes {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop the label now that the inner loop is gone

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can. Would that be preferred? I kinda like the label; since the for-loop is so big, whenever I encounter a continue it has the label alongside it. (The for-loop is not nearly as big as it was before, so this may not be an issue anymore.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your call. I can see the merits of either

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Since this PR has already passed CI, I'd like to leave it. It can be removed in a subsequent PR, and will be trivial to review as well.

@bw-solana
Copy link
Contributor

image
Winning

Copy link
Contributor

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 nit, but this looks great to me!

@brooksprumo brooksprumo merged commit 04bbf3b into solana-labs:master Apr 20, 2023
@brooksprumo brooksprumo deleted the bsb/bootstrap/remove-legacy-snapshot-hashes branch April 20, 2023 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment