Bootstrap cannot differentiate between "incremental snapshots don't exist" and "incremental snapshots haven't been seen yet" #30759

brooksprumo · 2023-03-16T17:58:10Z

Problem

Bootstrap runs into issues when it is discovering snapshots due to full snapshots and incremental snapshots gossiping out separately. This makes it impossible to know if an incremental snapshot does not exist yet, or if incremental snapshots just haven't been seen yet in gossip.

The current impl was an issue for a cluster restart. More context here #30740

For example:

full snapshot interval: 100 slots
incremental snapshot interval: 10 slots

scenario 1:

cluster at slot 101
full snapshot at 100 exists
incremental snapshot with base slot of 100 does not exist
bootstrap sees full snapshot and no incremental snapshot; cannot determine that an incremental snapshot does not exist yet

scenario 2:

cluster at slot 115
full snapshot at 100 exists
incremental snapshot at 110, with base slot of 100, exists
bootstrap sees both, all is good

scenario 3:

cluster at slot 115
full snapshot at 100 exists
incremental snapshot at 110, with base slot of 100, exists
bootstrap doesn't have its gossip tables fully populated yet, only sees full snapshot and no incremental snapshot; cannot determine that an incremental snapshot does exist and it just hasn't seen it yet.

Possible Solutions

Solution 1: Combine full and incremental snapshots together into a new CRDS tag

will require backports
will require duplicating gossip traffic for a while until everyone is updated (maybe a dealbreaker?)
can fully solve the problem

Solution 2: Push a sentinel "empty" incremental snapshot on new full snapshot generation

probably should still require backports, but could be made without (depends on sentinel value)
would not change gossip, which is good
cannot fully solve issue, since would only fix the case where we do have incremental snapshot gossip tables populated
- if the gossip table is unpopulated, we would know to wait longer, which could/should be sufficient for an incremental (ha) improvement over the current impl
  - this would fix the issue see at cluster restart in Bootstrap does not download a snapshot in cluster restart scenarios #30740
sentinel values steal values from the values gossiped
- probably isn't an issue; could have the incremental snapshot sentinel be slot 0 and hash [0; 32], or all FFs too. I like the lower slot better, because we know it's invalid to have an incremental snapshot with a slot lower than its full snapshot's slot.

If we could go back in time, maybe Solution 1 would be better, and we could design it to accommodate more changes in the future. For the current network, likely Solution 2 is more practical.

Edit: Looks like Solution 1 has buy in!

The text was updated successfully, but these errors were encountered:

mvines · 2023-03-16T18:09:48Z

In the solution 1 case, wdyt about the new crds value only holding a single full snapshot hash and Option representing the latest it holds? The node may actually store more in case an older snapshot is requested

brooksprumo · 2023-03-16T20:36:28Z

In the solution 1 case, wdyt about the new crds value only holding a single full snapshot hash and Option representing the latest it holds? The node may actually store more in case an older snapshot is requested

Yeah, that likely should work.

I'm reluctant to add more crds values for snapshots based on previous comments I've seen from @behzadnouri. I'm not sure what the process is for being able to deprecate and remove old crds values. Having only the single latest full snapshot and (optional) single latest incremental snapshot would be more straight forward in the bootstrap code, and be less data sent to gossip.

behzadnouri · 2023-03-16T20:40:19Z

I'm not sure what the process is for being able to deprecate and remove old crds values.

Yeah, this is going to be a very painful slow process. see the "Update plan:" here: #29596

That being said it might be the right option to go with if in the longer term there are fewer values to gossip (as in the case of new ContactInfo).

brooksprumo · 2023-03-16T21:02:23Z

I'm not sure what the process is for being able to deprecate and remove old crds values.

Yeah, this is going to be a very painful slow process. see the "Update plan:" here: #29596

Gotcha, yeah figured as much. How do the patches roll out across different clusters? Are feature-gates needed/used, or do targeted backports happen (i.e. a PR is only backported to stable once some large percent of beta is running it on testnet and ensures everything is working)?

That being said it might be the right option to go with if in the longer term there are fewer values to gossip (as in the case of new ContactInfo).

The two snapshot CRDS values combined are allowed to hold up to 16 + 25 + 1 = 42 entries of (Slot, Hash) (not counting the pubkey and wallclock fields). We could reduce that down to

struct NewSnapshotCrdsValue { 
    full: (Slot, Hash), 
    incremental: Option<(Slot, Hash)>,
}

(and could even try out changing incremental's Slot to a NonZeroU64¹, which could allow the Option to use that spare bit).

Is that sufficient savings for it to be worth it?

Since an incremental snapshot must have a full snapshot that it's based on, the lowest possible incremental snapshot slot is 1, so we're guaranteed the incremental slot will be non-zero. ↩

behzadnouri · 2023-03-17T15:02:10Z

How do the patches roll out across different clusters? Are feature-gates needed/used, or do targeted backports happen

Just as new releases get adopted by the cluster.
There is no concept of "slot" in gossip, so feature gating has the risk that at epoch boundary half the cluster does one thing and the other half another thing (because their root bank are not in sync), so you may get catastrophic traffic amplification.

Is that sufficient savings for it to be worth it?

yeah, probably.

mvines · 2023-03-17T16:15:17Z

Might be nice to reserve space in a NewSnapshotCrdsValue for future snapshot types too, as one day ideally we'd have a snapshot type for ancient accounts too.

struct NewSnapshotCrdsValue { 
    latest: Vec<(SnapshotType, Slot, Hash)>, 
}

where SnapshotType is is enum with Full and Incremental for now, and could be expanded later.

brooksprumo · 2023-03-30T21:28:39Z

Might be nice to reserve space in a NewSnapshotCrdsValue for future snapshot types too, as one day ideally we'd have a snapshot type for ancient accounts too.
struct NewSnapshotCrdsValue { 
    latest: Vec<(SnapshotType, Slot, Hash)>, 
}
where SnapshotType is is enum with Full and Incremental for now, and could be expanded later.

Yeah, that makes sense. I like the idea of planning for the future, but I'm also a little wary of an open-ended Vec here, or if we need to have something more than just Slot and Hash later. I'm leaning more towards a type that's just for the current full and incremental snapshots. And then addressing future changes when they come up.

@behzadnouri Do you have thoughts on how to handle/plan for forwards compatibility? I imagine smaller size of the data is good for performance, and also adding new versions of values can be painful...

behzadnouri · 2023-03-31T15:50:23Z

@behzadnouri Do you have thoughts on how to handle/plan for forwards compatibility? I imagine smaller size of the data is good for performance, and also adding new versions of values can be painful...

I think it would be better to just add a new enum variant here:
https://github.com/solana-labs/solana/blob/ff9a42a35/gossip/src/crds_value.rs#L79-L97
if we want to modify the type in the future; same as LegacyContactInfo => ContactInfo.
Hopefully we don't have to make too many such changes.

brooksprumo changed the title ~~Bootstrap cannot differentiate between incremental snapshots don't exist and incremental snapshots haven't been seen~~ Bootstrap cannot differentiate between "incremental snapshots don't exist" and "incremental snapshots haven't been seen yet" Mar 16, 2023

brooksprumo mentioned this issue Mar 31, 2023

Uses AccountsHashes type for AccountsHashes CrdsData variant #31003

Merged

brooksprumo self-assigned this Mar 31, 2023

brooksprumo added this to Better Snapshot Bootstrap (BSB) Mar 31, 2023

brooksprumo moved this to Todo in Better Snapshot Bootstrap (BSB) Apr 4, 2023

behzadnouri mentioned this issue Apr 10, 2023

Rename SnapshotHashes to LegacySnapshotHashes #31086

Merged

brooksprumo mentioned this issue Apr 11, 2023

Only push latest snapshot hashes in SnapshotGossipManager #31154

Merged

brooksprumo linked a pull request Apr 20, 2023 that will close this issue

bootstrap: Sunset using LegacySnapshotHashes for snapshot discovery #31275

Merged

brooksprumo mentioned this issue Apr 20, 2023

bootstrap: Sunset using LegacySnapshotHashes for snapshot discovery #31275

Merged

brooksprumo closed this as completed Apr 20, 2023

github-project-automation bot moved this from Todo to Done in Better Snapshot Bootstrap (BSB) Apr 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bootstrap cannot differentiate between "incremental snapshots don't exist" and "incremental snapshots haven't been seen yet" #30759

Bootstrap cannot differentiate between "incremental snapshots don't exist" and "incremental snapshots haven't been seen yet" #30759

brooksprumo commented Mar 16, 2023 •

edited

Loading

mvines commented Mar 16, 2023

brooksprumo commented Mar 16, 2023

behzadnouri commented Mar 16, 2023

brooksprumo commented Mar 16, 2023

behzadnouri commented Mar 17, 2023

mvines commented Mar 17, 2023

brooksprumo commented Mar 30, 2023

behzadnouri commented Mar 31, 2023

Bootstrap cannot differentiate between "incremental snapshots don't exist" and "incremental snapshots haven't been seen yet" #30759

Bootstrap cannot differentiate between "incremental snapshots don't exist" and "incremental snapshots haven't been seen yet" #30759

Comments

brooksprumo commented Mar 16, 2023 • edited Loading

Problem

Possible Solutions

Solution 1: Combine full and incremental snapshots together into a new CRDS tag

Solution 2: Push a sentinel "empty" incremental snapshot on new full snapshot generation

mvines commented Mar 16, 2023

brooksprumo commented Mar 16, 2023

behzadnouri commented Mar 16, 2023

brooksprumo commented Mar 16, 2023

Footnotes

behzadnouri commented Mar 17, 2023

mvines commented Mar 17, 2023

brooksprumo commented Mar 30, 2023

behzadnouri commented Mar 31, 2023

brooksprumo commented Mar 16, 2023 •

edited

Loading