re-evaluate effectiveness of broadcast retries triggered by replay/propagated-stats #30098
Comments
From discord:
@carllin
Also should note that current turbine is less subject to network partitions in the sense that staked nodes' placement on the randomly shuffled broadcast tree is the same across nodes, even if some are not visible in gossip. Also with 32:32 erasure batches, theoretically at least, there should be more resilience to network partitions.
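[Editor's note: a rough way to see the 32:32 resilience claim is that with 32 data shreds and 32 coding shreds per erasure batch, any 32 of the 64 shreds suffice to reconstruct the batch. A minimal sketch using the reed-solomon-erasure crate; the shard count matches the 32:32 batch discussed here, but the payload size and setup are illustrative, not Solana's actual shredder code.]

```rust
use reed_solomon_erasure::galois_8::ReedSolomon;

fn main() -> Result<(), reed_solomon_erasure::Error> {
    // 32 data shards + 32 coding (parity) shards, as in a 32:32 erasure batch.
    let rs = ReedSolomon::new(32, 32)?;

    // Toy payloads: 64 equally sized shards (the byte length is illustrative).
    // encode() overwrites the last 32 slots with computed parity.
    let mut shards: Vec<Vec<u8>> = (0..64u8).map(|i| vec![i; 1228]).collect();
    rs.encode(&mut shards)?;

    // Simulate losing all 32 data shards in transit.
    let mut received: Vec<Option<Vec<u8>>> =
        shards.iter().cloned().map(Some).collect();
    for shard in received.iter_mut().take(32) {
        *shard = None;
    }

    // Any 32 surviving shards are enough to rebuild the whole batch.
    rs.reconstruct(&mut received)?;
    assert!(received.iter().all(|s| s.is_some()));
    Ok(())
}
```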
Nothing :), this was designed assuming everybody would halt block production during a major partition
I think measuring if a retried slot gets confirmed is probably the best way
The idea was that because the leader is skipping its new slot in favor of rebroadcasting the old one, the bandwidth usage shouldn't be too much worse, but I can see how the traffic could be an issue.
Yeah this is a fair point, but what would happen during a long-running 66-33 partition? I'm guessing each side of the partition would still see their own blocks because repairing from the leader makes up for the holes in turbine? Then when the partition resolves, the smaller side would repair the blocks made by the heavier side? Should we have an alternative fast/large repair protocol for this? Regular repair might take a while if there are a lot of blocks on the heavy side that didn't stop block production. I think if such a protocol existed we could get rid of the rebroadcast logic.
But there are 2 different leaders, no?
I think this is where repair can be improved. For one thing, nodes shouldn't be repairing from the slot leader because that would make it easy for a malicious leader to create partitions in the cluster. Related issue:
Might be a bit tricky security-wise if the slot is not rooted yet. I mean, if a node gets all shreds of a certain slot from a few nodes, then again that might make it easier to create partitions.
There are two places where we will signal to retransmit shreds:
I suspect 1 is where the bulk of the retransmit signals would occur. In this case, I believe @behzadnouri is right that we would see shreds for multiple blocks from different leaders transmitting over turbine concurrently. I'm guessing 2 is the case that @carllin is referring to, which wouldn't exacerbate network traffic because this node would be "stealing" its own leader slot.
Created draft PR #30155 for adding retransmit slot metrics. It's pretty ugly, but I can clean it up if this is roughly what we're going for.
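[Editor's note: for context, a minimal sketch of what such a metric could look like using the datapoint_info! macro from solana-metrics. The metric name, fields, and helper are hypothetical, not necessarily what #30155 actually adds.]

```rust
use solana_metrics::datapoint_info;

/// Hypothetical helper: record each time replay signals broadcast to
/// retransmit one of our leader slots, so dashboards can later check
/// whether the retried slot ended up confirmed.
fn report_retransmit_signal(slot: u64, leader: &str, propagated_stake_pct: f64) {
    datapoint_info!(
        "replay-retransmit-slot",                 // hypothetical metric name
        ("slot", slot as i64, i64),
        ("leader", leader.to_string(), String),
        ("propagated_stake_pct", propagated_stake_pct, f64)
    );
}
```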
So far, looking back 24 hours, I'm not seeing any evidence that retransmitting is helping. Looking at metrics on testnet:
I mean... time to give up, my guy. Looking at metrics on MNB:
Ah yeah, thanks for clarifying
@behzadnouri this makes sense
Ok, I think we can remove the rebroadcast logic but keep the block production pausing? I think we can have a separate design discussion for efficient repair after a long partition. Even with block production paused, PoH is still running so there will be a lot of ticks; we'll probably need tick compression @bw-solana 😃
Seems like #30681 patches this.
The POH recorder is changed to send directly to Firedancer via a new mcache / dcache pair. There is some unpleasantness in this: (a) we serialize on the PoH recorder thread. This actually doesn't matter so much, since it's a similar cost to a memcpy and the PoH recorder sleeps a lot due to having spare cycles. (b) We can backpressure the thread, which would cause it to fall behind. There's not really a good solution here; Solana just uses an infinite buffer size for this channel. But there's major pleasantness as well, mainly that we save some copies of the data across crossbeam channels. In the process of removing all the wiring, we also saw and subsequently removed the `maybe_retransmit_unpropagated_slots` functionality from the replay stage. This is considered "band-aid" functionality and is potentially being removed from Solana Labs, so we choose to remove it rather than plumb this channel through to our shred stage as well. See solana-labs#30098
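[Editor's note: the backpressure tradeoff the commit message describes can be illustrated with crossbeam channels directly: a bounded channel stalls the PoH-recorder-side sender once the consumer falls behind, while an unbounded one (effectively what the Labs client uses) never blocks but can grow without limit. A toy sketch, not Firedancer's mcache/dcache code.]

```rust
use crossbeam_channel::{bounded, unbounded, TrySendError};

fn main() {
    // Bounded: the producer sees backpressure once the consumer falls behind.
    let (tx, _rx) = bounded::<Vec<u8>>(8);
    for i in 0..16u8 {
        match tx.try_send(vec![i; 1024]) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => {
                // A PoH-recorder-like thread would have to stall or drop here.
                println!("backpressure at entry {i}");
                break;
            }
            Err(TrySendError::Disconnected(_)) => break,
        }
    }

    // Unbounded: sends never block, but memory use is unchecked if the
    // downstream stage (shredding/broadcast) cannot keep up.
    let (tx, rx) = unbounded::<Vec<u8>>();
    for i in 0..16u8 {
        tx.send(vec![i; 1024]).unwrap();
    }
    drop(tx);
    println!("queued entries: {}", rx.len());
}
```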
This repository is no longer in use. Please re-open this issue in the agave repo: https://github.com/anza-xyz/agave
Problem
Based on PropagatedStats, replay code may trigger broadcast retries for its leader slots:
https://github.com/solana-labs/solana/blob/ae7803a55/core/src/progress_map.rs#L192
https://github.com/solana-labs/solana/blob/ae7803a55/core/src/replay_stage.rs#L927-L931
https://github.com/solana-labs/solana/blob/ae7803a55/core/src/replay_stage.rs#L1747-L1754
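[Editor's note: a simplified sketch of the mechanism in question, with hypothetical names and an illustrative one-third threshold rather than the actual progress_map.rs / replay_stage.rs code: replay tracks how much stake has been observed propagating a leader slot, and if that stake stays below the threshold, it signals broadcast to retransmit the slot's shreds.]

```rust
/// Hypothetical, trimmed-down view of per-slot propagation tracking;
/// field and function names are illustrative, not the real identifiers.
struct PropagatedStats {
    propagated_validators_stake: u64,
    total_epoch_stake: u64,
    is_propagated: bool,
    retransmitted: bool,
}

/// Illustrative threshold: treat a slot as propagated once this fraction
/// of the epoch stake has been observed voting on it.
const PROPAGATION_THRESHOLD: f64 = 1.0 / 3.0;

impl PropagatedStats {
    fn update_propagated(&mut self) {
        self.is_propagated = self.propagated_validators_stake as f64
            >= self.total_epoch_stake as f64 * PROPAGATION_THRESHOLD;
    }

    /// Returns true if replay should ask broadcast to resend this slot's
    /// shreds: it is one of our leader slots, it has not propagated, and
    /// we have not already signaled a retry for it.
    fn should_retransmit(&mut self, is_our_leader_slot: bool) -> bool {
        if is_our_leader_slot && !self.is_propagated && !self.retransmitted {
            self.retransmitted = true;
            return true;
        }
        false
    }
}
```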
The downside is that this can exacerbate network traffic: it produces duplicate shreds, and two nodes end up simultaneously broadcasting shreds for their respective leader slots, which might further hurt shred propagation for the following slots.
The alternative is to just let the cluster skip the slot.
Similar issue with repair: #28637
Proposed Solution
cc @carllin @jbiseda @bw-solana