re-evaluate effectiveness of broadcast retries triggered by replay/propagated-stats #30098

Closed
behzadnouri opened this issue Feb 2, 2023 · 10 comments
Labels
do-not-close Add this tag to exempt an issue/PR from being closed by the stalebot

Comments

@behzadnouri commented Feb 2, 2023

Problem

Based on PropagatedStats, replay code may trigger broadcast retries for its leader slots:
https://github.com/solana-labs/solana/blob/ae7803a55/core/src/progress_map.rs#L192
https://github.com/solana-labs/solana/blob/ae7803a55/core/src/replay_stage.rs#L927-L931
https://github.com/solana-labs/solana/blob/ae7803a55/core/src/replay_stage.rs#L1747-L1754
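For context, a minimal sketch of the kind of check involved, assuming a simplified PropagatedStats; the field names, threshold, and function signature here are illustrative, not the actual code at the links above:

```rust
pub struct PropagatedStats {
    pub propagated_validators_stake: u64,
    pub total_epoch_stake: u64,
}

// Illustrative propagation threshold; the real code defines its own constant.
const PROPAGATION_THRESHOLD: f64 = 0.1;

impl PropagatedStats {
    pub fn is_propagated(&self) -> bool {
        self.propagated_validators_stake as f64
            >= PROPAGATION_THRESHOLD * self.total_epoch_stake as f64
    }
}

// If not enough stake has observed our leader slot, replay signals broadcast
// to retransmit that slot's shreds instead of letting the cluster skip it.
pub fn maybe_retransmit_unpropagated_slot(
    slot: u64,
    stats: &PropagatedStats,
    retransmit: impl Fn(u64),
) {
    if !stats.is_propagated() {
        retransmit(slot);
    }
}
```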

The downside is that this can exacerbate network traffic: it produces duplicate shreds, and two nodes end up simultaneously broadcasting shreds for their respective leader slots, which may further hurt shred propagation for the following slots.

The alternative is to just let the cluster skip the slot.
Similar issue with repair: #28637

Proposed Solution

  • Evaluate whether these broadcast retries are effective/helpful.
    • e.g. one metric could be how often a retried slot ultimately got rooted, versus skipped anyway (see the sketch after this list). Retried slots are tracked by: https://github.com/solana-labs/solana/blob/b77e12b3a/core/src/replay_stage.rs#L1068-L1073
    • or, how often retrying a slot causes more instability for the following slots; e.g. in terms of repair metrics for those slots.
    • or, we could run an A/B experiment with these retries disabled and compare.
  • If there is not enough evidence that retries help the cluster overall, remove this code.
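A minimal sketch of the first metric, assuming we already have the set of retried slots (from the replay_stage tracking linked above) and the set of rooted slots; the helper name and types are illustrative:

```rust
use std::collections::HashSet;

// Fraction of retried slots that ultimately got rooted. A value near zero
// would suggest the retries are not rescuing slots that would otherwise be
// skipped.
fn retried_slot_root_rate(retried: &HashSet<u64>, rooted: &HashSet<u64>) -> Option<f64> {
    if retried.is_empty() {
        return None;
    }
    let rooted_after_retry = retried.intersection(rooted).count();
    Some(rooted_after_retry as f64 / retried.len() as f64)
}
```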

cc @carllin @jbiseda @bw-solana

@behzadnouri

From discord:
https://discord.com/channels/428295358100013066/439194979856809985/1071195371402964992

I think this broadcast retry was introduced to handle long partitions where you have multiple forks, say 10 forks each with 10% stake. Then per the design, this retry would:

  1. Prevent the individual partitions from adding more blocks before the network connections healed (b/c they would stop making new blocks in favor of retransmitting their old ones)
  2. After the network healed, leaders would rebroadcast which would allow for faster resolution of the partitions than relying only on repair

@carllin
I think we can temporarily pause block production regardless of retrying old shreds or not (i.e. that sounds orthogonal). That being said:

  • If my node pauses block production during its leader slots, what prevents the next leader from generating ticks and skipping my leader slots anyway? In other words, wouldn't this require even more coordination between nodes and make things even more unstable if they are not in sync?

  • What is the best way to measure if retrying broadcast is effective in achieving this?

    faster resolution of the partitions than relying only on repair

    One issue is that what if half the cluster is doing one thing and the other half another thing? For example, my node is retrying its old shreds while the current leader is also generating shreds, wouldn't that 2x shreds traffic and make things worse?

Also worth noting that current turbine is less subject to network partitions, in the sense that staked nodes' placement on the randomly shuffled broadcast tree is the same across nodes, even if some are not visible in gossip. And with 32:32 erasure batches there should, theoretically at least, be more resilience to network partitions.
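A toy sketch of the "same placement across nodes" point: the ordering is a pure function of a per-shred seed and the node set, so every validator derives the same tree locally with no coordination. This toy version ignores stake weighting and the real seed derivation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy deterministic ordering: every node that starts from the same seed and
// the same set of pubkeys computes the same order, with no coordination.
// The real turbine tree is additionally stake-weighted.
fn deterministic_order(seed: u64, mut nodes: Vec<[u8; 32]>) -> Vec<[u8; 32]> {
    nodes.sort_by_key(|pubkey| {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        pubkey.hash(&mut hasher);
        hasher.finish()
    });
    nodes
}
```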

@carllin commented Feb 6, 2023

If my node pauses block production during its leader slots, what prevents the next leader from generating ticks and skipping my leader slots anyway? In other words, wouldn't this require even more coordination between nodes and make things even more unstable if they are not in sync?

Nothing :), this was designed assuming everybody would halt block production during a major partition

What is the best way to measure if retrying broadcast is effective in achieving this?

I think measuring if a retried slot gets confirmed is probably the best way

One issue is that what if half the cluster is doing one thing and the other half another thing? For example, my node is retrying its old shreds while the current leader is also generating shreds, wouldn't that 2x shreds traffic and make things worse?

The idea was that because the leader is skipping its new slot in favor of rebroadcasting the old one, the bandwidth usage shouldn't be too much worse, but I can see how the traffic could be an issue.

Also worth noting that current turbine is less subject to network partitions, in the sense that staked nodes' placement on the randomly shuffled broadcast tree is the same across nodes, even if some are not visible in gossip. And with 32:32 erasure batches there should, theoretically at least, be more resilience to network partitions.

Yeah this is a fair point, but what would happen during a long-running 66-33 partition? I'm guessing each side of the partition would still see their own blocks because repairing from the leader makes up for the holes in turbine?

Then when the partition resolves, the smaller side would repair the blocks made by the heavier side? Should we have an alternative fast/large repair protocol for this? Regular repair might take a while if there are a lot of blocks on the heavy side that didn't stop block production. I think if such a protocol existed we could get rid of the rebroadcast logic.

@behzadnouri

The idea was that because the leader is skipping its new slot in favor of rebroadcasting the old one, the bandwidth usage shouldn't be too much worse

But there are 2 different leaders, no?
I mean while my node is retrying its old shreds, another node is the current leader and might be broadcasting shreds for its own leader slot. So 2x shreds traffic.

what would happen during a long-running 66-33 partition? I'm guessing each side of the partition would still see their own blocks because repairing from the leader makes up for the holes in turbine?

I think this is where repair can be improved. For one thing, nodes shouldn't be repairing from the slot leader because that would make it easy for a malicious leader to create partitions in the cluster. Related issue:

Should we have an alternative fast/large repair protocol for this?

Might be a bit tricky security-wise if the slot is not rooted yet. I mean if a node gets all shreds of a certain slot from only a few nodes, that again might make it easier to create partitions.

@bw-solana

The idea was that because the leader is skipping its new slot in favor of rebroadcasting the old one, the bandwidth usage shouldn't be too much worse

But there are 2 different leaders, no? I mean while my node is retrying its old shreds, another node is the current leader and might be broadcasting shreds for its own leader slot. So 2x shreds traffic.

There are two places where we will signal to retransmit shreds:

  1. Near the end of replay stage
  2. As part of checking if we should start leader

I suspect 1 is where the bulk of the retransmit signal occurrences would be happening. In that case, I believe @behzadnouri is right that we would see shreds for multiple blocks from different leaders transmitting over turbine concurrently. I'm guessing 2 is the case that @carllin is referring to, which wouldn't exacerbate network traffic because this node would be "stealing" its own leader slot.
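A rough sketch of case 2, using illustrative helper closures (not the actual maybe_start_leader signature): the node rebroadcasts its prior unpropagated leader slot instead of producing a new block, so total shred traffic stays roughly flat.

```rust
// Illustrative only: if our previous leader slot has not propagated, retry
// its broadcast instead of starting a new block, i.e. "steal" our own leader
// slot rather than add a second stream of shreds on top of another leader's.
fn maybe_start_leader(
    next_leader_slot: u64,
    last_leader_slot: Option<u64>,
    is_propagated: impl Fn(u64) -> bool,
    retransmit: impl Fn(u64),
    start_leader: impl Fn(u64),
) {
    match last_leader_slot {
        Some(prev) if !is_propagated(prev) => retransmit(prev),
        _ => start_leader(next_leader_slot),
    }
}
```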

@bw-solana

Created draft PR #30155 for adding retransmit slot metrics. It's pretty ugly, but I can clean it up if this is roughly what we're going for.

@bw-solana commented Feb 7, 2023

So far, looking back 24 hours, I'm not seeing any evidence that retransmitting is helping.

Looking at metrics on testnet:

  • No reporting of replay_stage-retransmit metrics by any node
  • replay_stage-retransmit-timing-based entries for node td3n5NGhP7JKWrL638gzau3NY7mF4K3ztZww3GkpywJ for slots 173,756,697 & 173,756,698. There are >5k retry iterations spanning multiple days. Slot 173,756,695 was confirmed back in mid-Jan.

I mean... time to give up my guy

Looking at metrics on MNB:

  • No reporting of replay_stage-retransmit metrics by any node
  • BYHuaEKgW3KsvGLeydrs9cuEkfDjUGmLZxwjC1FWdQtR 176097752-5. >260 retry iterations spanning multiple hours. None of these slots were confirmed.
  • 6NmKxzGAdJ9ttewC8hwJEFaJdn4GScjpqimaL5BnbWgx 176309096-9. 18 retries. None of these slots were confirmed.
  • 4RLcStbSkt5S1Xm4mStup6N13PGyAfAWy7Vs5d7yJRnY 176354880-3. 9 retries. None of these slots were confirmed.
  • 5LGN129TL3TsMU7NkUsAZJvxyNHwFAXhNYy3Z7WUsbr1 176361692-5. 36 retries. None of these slots were confirmed.

@carllin commented Feb 7, 2023

There are two places where we will signal to retransmit shreds:
Near the end of replay stage

Ah yeah, thanks for clarifying

I think this is where repair can be improved. For one thing, nodes shouldn't be repairing from the slot leader because that would make it easy for a malicious leader to create partitions in the cluster. Related issue:

@behzadnouri this makes sense

I think we can temporarily pause block production regardless of retrying old shreds or not (i.e that sounds orthogonal).

Ok, I think we can remove the rebroadcast logic but keep the block production pausing?

I think we can have a separate design discussion for efficient repair after a long partition. Even with block production paused, PoH is still running, so there will be a lot of ticks; we'll probably need tick compression @bw-solana 😃

@behzadnouri commented Mar 10, 2023

From the metrics on mainnet right now, effectively every slot is being retried only a few milliseconds after the initial broadcast, which might indicate there is a bug somewhere.
[screenshots: 230310-1505-retransmit-retries, 230310-1522-retransmit-timing-based]

@behzadnouri

Seems like this .unwrap_or(true) is the bug:
https://github.com/solana-labs/solana/blob/94ef881de/core/src/progress_map.rs#L178
It always returns true the first time around because retry_time is None.
So every slot is broadcast twice.

#30681 to patch this.
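A minimal reproduction of the behavior described above, with illustrative names and interval (the actual fix in #30681 may differ in detail):

```rust
use std::time::{Duration, Instant};

// Illustrative retry interval.
const RETRANSMIT_INTERVAL: Duration = Duration::from_millis(1000);

struct RetransmitInfo {
    retry_time: Option<Instant>,
}

impl RetransmitInfo {
    // Buggy check: when retry_time is still None (i.e. right after the
    // initial broadcast, before any retry clock was started), unwrap_or(true)
    // reports "ready to retry" immediately, so every slot gets broadcast twice.
    fn reached_retransmit_threshold_buggy(&self) -> bool {
        self.retry_time
            .map(|t| t.elapsed() > RETRANSMIT_INTERVAL)
            .unwrap_or(true)
    }

    // One way to patch it: a missing retry_time means "not eligible yet", so
    // the first retry only fires after the interval has actually elapsed.
    fn reached_retransmit_threshold_fixed(&self) -> bool {
        self.retry_time
            .map(|t| t.elapsed() > RETRANSMIT_INTERVAL)
            .unwrap_or(false)
    }
}
```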

mmcgee-jump added a commit to firedancer-io/solana that referenced this issue Sep 18, 2023
The POH recorder is changed to send directly to Firedancer via a new
mcache / dcache pair. There is some unpleasantness in this,

 (a) We serialize on the poh recorder thread. This actually doesn't
     matter so much, since it's similar cost to a memcpy and the
     poh recorder sleeps a lot due to having spare cycles.

 (b) We can backpressure the thread, which would cause it to fall
     behind. There's not really a good solution here. Solana just uses
     an infinite buffer size for this channel.

But then there's major pleasantness as well, mainly that we save some
copies of the data across crossbeam channels.

In the process of removing all the wiring, we also saw and subsequently
removed the `maybe_retransmit_unpropagated_slots` functionality from the
replay stage. This is considered "band-aid" functionality and is
potentially being removed from Solana Labs, so we choose to remove it
rather than plumb this channel through to our shred stage as well. See
solana-labs#30098
mmcgee-jump added a commit to firedancer-io/solana that referenced this issue Sep 18, 2023
mmcgee-jump added a commit to firedancer-io/solana that referenced this issue Sep 18, 2023
ptaffet-jump pushed a commit to firedancer-io/solana that referenced this issue Sep 19, 2023
mmcgee-jump added a commit to firedancer-io/solana that referenced this issue Sep 26, 2023
mmcgee-jump added a commit to firedancer-io/solana that referenced this issue Sep 26, 2023
mmcgee-jump added a commit to firedancer-io/solana that referenced this issue Oct 3, 2023
mmcgee-jump added a commit to firedancer-io/solana that referenced this issue Oct 5, 2023
github-actions bot added the stale label Mar 11, 2024
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 19, 2024
behzadnouri added the do-not-close label and removed the stale label Mar 20, 2024
behzadnouri reopened this Mar 20, 2024
This repository is no longer in use. Please re-open this issue in the agave repo: https://github.com/anza-xyz/agave
