
WIP feat: bounded consensus events channel #8251

Conversation

fgimenez
Member

Requires #8193

There is an unbounded channel for consensus events sent from the RPC.

The channel to_engine is used by the BeaconConsensusEngineHandle to handle messages from the RPC. The channel is unbounded and therefore susceptible to DoS attacks or excessive resource consumption in times of congestion.

The RPC is only callable from the consensus client, which is considered a trusted actor, so the risk of a DoS attack is low. However, during times of congestion a large number of events could build up. Processing events includes executing blocks and updating the blockchain tree, so events may be added to the channel faster than they can be processed and the channel will grow in volume.

This PR replaces the unbounded channel with a bounded one, for now with a default channel size of 1000; this should be tuned after gathering usage metrics.
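A minimal sketch of the shape of the change, assuming tokio's mpsc channels; the message enum, constant name, and call sites are illustrative placeholders, not the actual reth code:

```rust
use tokio::sync::mpsc;

// Placeholder standing in for the real engine message type.
#[allow(dead_code)]
enum EngineMessage {
    ForkchoiceUpdated,
    NewPayload,
}

// Illustrative default; the PR proposes 1000 and notes it should be tuned
// once usage metrics are available.
const ENGINE_MESSAGE_CHANNEL_SIZE: usize = 1000;

#[tokio::main]
async fn main() {
    // Before: mpsc::unbounded_channel() can grow without limit under congestion.
    // After: a bounded channel applies backpressure once the buffer is full.
    let (to_engine, mut from_rpc) = mpsc::channel::<EngineMessage>(ENGINE_MESSAGE_CHANNEL_SIZE);

    // The RPC side now awaits capacity instead of queueing without limit.
    to_engine.send(EngineMessage::ForkchoiceUpdated).await.unwrap();

    // The engine side drains messages as it is able to process them.
    if let Some(_msg) = from_rpc.recv().await {
        // execute the block / update the tree ...
    }
}
```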

@fgimenez fgimenez added A-consensus Related to the consensus engine C-security Issue or pull request related to security. labels May 14, 2024
@fgimenez fgimenez force-pushed the fgimenez/bounded-consensus-events-channel branch from 23581c8 to 3c3bdb2 on May 14, 2024 12:27
@emhane emhane added the S-blocked This cannot move forward until something else changes label May 18, 2024
Member

@emhane emhane left a comment


here I would like to see the MeteredSender and MeteredReceiver types already.

not convinced dropping engine messages in congestion is sufficient error handling. can the node recover if it drops any engine message? otherwise this introduces an even bigger vulnerability than the one it aims to fix.

if we cannot guarantee recovery from dropping all engine message types, then I lean towards using UnboundedMeteredSender and UnboundedMeteredReceiver here for now, or a re-fetch system if that's possible to implement.
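For context, the metered-sender idea is roughly a thin wrapper that counts messages so channel traffic can be exported as a metric; a generic sketch of the pattern (not reth's actual MeteredSender/UnboundedMeteredSender types) could look like:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use tokio::sync::mpsc;

/// Illustrative wrapper around an unbounded sender that counts messages sent,
/// so the channel's traffic can be reported as a metric. A sketch of the
/// pattern only, not reth's actual `UnboundedMeteredSender`.
struct CountingUnboundedSender<T> {
    inner: mpsc::UnboundedSender<T>,
    messages_sent: Arc<AtomicU64>,
}

impl<T> CountingUnboundedSender<T> {
    fn send(&self, msg: T) -> Result<(), mpsc::error::SendError<T>> {
        self.inner.send(msg)?;
        // Record the send; a metrics exporter could read this counter.
        self.messages_sent.fetch_add(1, Ordering::Relaxed);
        Ok(())
    }
}
```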

@kirk-baird @AgeManning @mattsse @Rjected wdyt?

@emhane
Member

emhane commented May 18, 2024

indeed, dropping engine messages in congestion conflicts with this issue about not dropping engine messages #8203

@kirk-baird

Hmm, this is a tough one; neither option seems great. To make matters worse, the Engine API specs do not prevent replay attacks, which would be a problem here. Note the timestamp needs to be within 1 minute of the current clock, so replays are only valid for that long.

The authentication scheme is not designed to

  • Prevent attackers with capability to read ('sniff') network traffic from reading the traffic,
  • Prevent attackers with capability to read ('sniff') network traffic from performing replay-attacks of earlier messages.

I don't imagine this is a common case, but someone with read access to the EL - CL connection could replay messages. An example of a malicious case would be two FCUs which switch between chain A and chain B. If the attacker spammed the connection with FCU A - FCU B - FCU A - FCU B - ... it would cause quite a lot of processing. With a bounded channel they'd also be able to use this replay to cause genuine messages to be dropped, which is just as bad as a resource DoS.

In terms of dropping excess FCUs, if the most recent FCU has a valid canonical head then all of the older FCUs can be safely dropped. However, if an FCU does not contain the canonical head then older FCUs are relevant as they could update the canonical head. So there's potential here to minimize the number of FCUs we process by filtering out some old ones.

In terms of dropping new payloads, it would not be good to drop any canonical blocks as these would need to be fetched again and processed later. So I believe we'd need to process all of the new payloads unless some logic is implemented to determine which are not canonical.

tl;dr So maybe a tidier solution here is to not bound the channel and instead drain all of the messages into a queue where some management can occur (see the sketch after this list):

  • detecting and combining duplicate FCUs or new payloads
  • some optimistic ordering of FCUs to process the most recent first (would need to make sure we've processed enough new payloads)
  • dropping any FCUs older than the most recent valid FCU that has been processed.
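A rough sketch of the first point above, assuming FCUs are identified by their head block hash; the types and dedup policy are illustrative only, not reth's actual data structures:

```rust
use std::collections::VecDeque;

/// Hypothetical forkchoice state, identified here only by its head block hash.
#[derive(Clone, PartialEq, Eq)]
struct ForkchoiceState {
    head_block_hash: [u8; 32],
}

/// Illustrative queue that drains engine messages out of the channel and
/// collapses duplicate FCUs so a replayed A-B-A-B pattern only leaves the
/// latest copy of each state to process.
#[allow(dead_code)]
struct FcuQueue {
    pending: VecDeque<ForkchoiceState>,
}

#[allow(dead_code)]
impl FcuQueue {
    fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// Push a newly received FCU. If an identical forkchoice state is already
    /// queued, drop the older copy so only the most recent instance remains.
    fn push(&mut self, fcu: ForkchoiceState) {
        self.pending.retain(|queued| queued != &fcu);
        self.pending.push_back(fcu);
    }

    /// Pop the next FCU to process, newest first (optimistic ordering).
    fn pop_newest(&mut self) -> Option<ForkchoiceState> {
        self.pending.pop_back()
    }
}
```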

@fgimenez fgimenez force-pushed the fgimenez/event-listeners-broadcast-channel branch from 5da034a to 77df31b on May 20, 2024 17:36
@emhane
Member

emhane commented May 21, 2024

how about just upgrading the connection to CL to https @kirk-baird ?

@kirk-baird

how about just upgrading the connection to CL to https @kirk-baird ?

Yep HTTPS has replay protection so that would prevent re-using old requests.

@fgimenez fgimenez force-pushed the fgimenez/event-listeners-broadcast-channel branch from 4f7d39d to c400887 on May 22, 2024 07:16
@emhane
Member

emhane commented May 22, 2024

how about just upgrading the connection to CL to https @kirk-baird ?

Yep HTTPS has replay protection so that would prevent re-using old requests.

would it require a fix on the CL side to support this as an extra security feature, or is it already supported? @kirk-baird

@Rjected
Member

Rjected commented May 22, 2024

The engine api connection requires a valid jwt; how feasible is something like this? I assume an attacker on the same machine is out of scope; would it need to feed the right messages to the CL s.t. it sends these messages?

@Rjected Rjected closed this May 22, 2024
@Rjected Rjected reopened this May 22, 2024
@Rjected
Member

Rjected commented May 22, 2024

sorry clicked the wrong button...

@fgimenez fgimenez deleted the branch paradigmxyz:main May 22, 2024 17:50
@fgimenez fgimenez closed this May 22, 2024
@fgimenez
Member Author

oops, this was pointing to #8193 's branch and was automatically closed when that PR was merged and the branch removed; reopening

@fgimenez fgimenez reopened this May 22, 2024
@fgimenez fgimenez changed the base branch from fgimenez/event-listeners-broadcast-channel to main May 22, 2024 18:44
@emhane
Member

emhane commented May 22, 2024

The engine api connection requires a valid jwt; how feasible is something like this? I assume an attacker on the same machine is out of scope; would it need to feed the right messages to the CL s.t. it sends these messages?

right, the risk is very low. however, with this fix, a new risk appears: the node might not recover during network congestion due to dropped engine messages @Rjected

@kirk-baird

kirk-baird commented May 23, 2024

The engine api connection requires a valid jwt; how feasible is something like this? I assume an attacker on the same machine is out of scope; would it need to feed the right messages to the CL s.t. it sends these messages?

Good question. The issue here is that a valid jwt is replayable. That means if an attacker reads a message with a JWT they can replay that message as many times as they desire until it expires (~1 minute). An attacker isn't able to create new messages since they can't forge a jwt, only re-use existing tokens. Note that each JWT is tied to a single message via HMAC, so an attacker can't take a valid one and use it on a different message.

So the likelihood of this issue depends on how easy it is to read messages and send them again. By default these connections are not encrypted and use http or websockets. If the CL + EL are on the same machine then reading the connection means you have access to the machine and can mount much worse attacks, so we're not worried about this case.

The case to think about is when the EL+CL are on different machines and communicate over an untrusted network e.g. regular HTTP can be eavesdropped. Upgrading this connection to HTTPS when they're not on the same machine would resolve this issue as it prevents replay and also encrypts the messages.

@gnattishness

Setting the channel to bounded, without doing much else, could cause similar trouble, with messages sitting in waiting tokio tasks instead of in the channel.
Instead, you'd likely want to think about a solution involving try_send or send_timeout; a rough sketch follows.
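A minimal sketch of both options, assuming a bounded tokio mpsc channel; the one-second grace period and the log-and-drop handling are placeholders, not a recommended policy:

```rust
use std::time::Duration;
use tokio::sync::mpsc;

/// Placeholder standing in for the real engine message type.
struct EngineMessage;

/// Sketch of forwarding an RPC request into a bounded engine channel without
/// parking the task indefinitely.
async fn forward_to_engine(tx: &mpsc::Sender<EngineMessage>, msg: EngineMessage) {
    // Option 1: fail fast if the channel is full instead of waiting.
    match tx.try_send(msg) {
        Ok(()) => {}
        Err(mpsc::error::TrySendError::Full(msg)) => {
            // Option 2: wait for capacity, but only up to a deadline, so the
            // request can be rejected before the CL-side timeout expires.
            if tx.send_timeout(msg, Duration::from_secs(1)).await.is_err() {
                eprintln!("engine channel congested, message not delivered");
            }
        }
        Err(mpsc::error::TrySendError::Closed(_)) => {
            eprintln!("engine channel closed");
        }
    }
}
```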

In terms of correctness with the CL, the execution API spec already describes a notion of a timeout and retry.
When under heavy load with an unbounded channel, Reth's eventual response could be long past the timeout and ignored.
The channel could also contain duplicate (retried) requests.

@emhane
Member

emhane commented May 25, 2024

thanks @gnattishness

engine_newPayloadV3 and engine_forkchoiceUpdatedV3 have 8 second timeouts.

how about keeping the unbounded channel in that case, but polling it at short intervals instead of only when the EL is ready to process the engine message. we do this and queue the engine messages in a type that evicts stale entries based on the timeout when new entries are inserted, much like this type https://github.com/sigp/discv5/blob/8eac0fee8da65c4021c6bba7c976d395151309ed/src/lru_time_cache.rs#L7-L18.

I think that type can replace the VecDeque mentioned in the conflicting issue #8203.
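A minimal sketch of such a time-evicting queue, assuming the 8-second Engine API timeout mentioned above; it is loosely modeled on the linked lru_time_cache, not taken from it:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Illustrative queue for engine messages: each entry carries the time it was
/// received, and entries older than the Engine API timeout are evicted when
/// new entries are inserted, since the CL has already given up waiting on them.
#[allow(dead_code)]
struct TimedEngineQueue<T> {
    entries: VecDeque<(Instant, T)>,
    timeout: Duration,
}

#[allow(dead_code)]
impl<T> TimedEngineQueue<T> {
    fn new(timeout: Duration) -> Self {
        Self { entries: VecDeque::new(), timeout }
    }

    /// Insert a new message, first dropping any message whose timeout expired.
    fn insert(&mut self, msg: T) {
        let now = Instant::now();
        while matches!(self.entries.front(), Some((t, _)) if now.duration_since(*t) > self.timeout) {
            self.entries.pop_front();
        }
        self.entries.push_back((now, msg));
    }

    /// Take the oldest still-valid message for processing.
    fn pop(&mut self) -> Option<T> {
        self.entries.pop_front().map(|(_, msg)| msg)
    }
}
```

Constructed with `TimedEngineQueue::new(Duration::from_secs(8))`, anything the CL has already timed out on would be dropped on insert rather than processed late.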

@emhane
Member

emhane commented May 30, 2024

closing in favour of addressing #8203 wrt the spec'd timeouts for messages
