Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should a NetworkBehavior be informed about a dropped ToSwarm::NotifyHandler event? #5718

Open
elenaf9 opened this issue Dec 5, 2024 · 9 comments

Comments

@elenaf9
Copy link
Contributor

elenaf9 commented Dec 5, 2024

Description

Behaviors can send events to a connection handler by emitting a ToSwarm::NotifyHandler event.
Depending on the PendingNotifyHandler::{One(_), Any(_)} parameter, the swarm will then select either one specific, or any connection handler, that will receive the event.

If the connection has closed in the mean time (either the one specific that is targeted, or all connections), the event gets silently discarded.

@jxs and I had a brief out-of-band discussion that we might want to inform the behavior about the event being dropped.

However, thinking about it now, I am not sure if the behavior even needs that info.
As already mentioned, the event is only dropped if the connection closed. The behavior is being informed about connections closing. So behaviors like request-response can simply track pending events for each connection, and based on that infer that a pending event/ request is discarded.
That must be done anyway because it could be that the handler already consumed the event and then the connection closes.

@jxs is there a specific case where it's still problematic that the NotifyHandler event is silently dropped? And if so, how does that differ from the second scenario that I described, where the connection closes after the handler consumed the event?

@drHuangMHT
Copy link
Contributor

drHuangMHT commented Dec 9, 2024

The behaviour can check which peers it has connection to beforehand to have a good guess on whether the event will be dropped, though this may not be reliable enough for automated systems.
The question is down to how important the event is. If it is important, we should notify the behaviour, otherwise we could get away with it.
The solution can be another optional callback on NetworkBehaviour trait. It can be another variant of FromSwarm event but I think that interface has already become a bit cluttered.

@jxs
Copy link
Member

jxs commented Dec 9, 2024

Thanks for opening and elaborating this @elenaf9.

However, thinking about it now, I am not sure if the behavior even needs that info.
As already mentioned, the event is only dropped if the connection closed. The behavior is being informed about connections closing.

Yeah this is true, but to be precise if I understand correctly connections in this case are ConnectionIds (see 1) which translate to ConnectionHandler instances ( or streams) to another PeerId. A connection to a PeerId may be made of multiple ConnectionIds.

This to say that it may be bug prone for Behaviors to keep track of all the ConnectionIds to a given peer. And when there's a disconnect event check if there are remaining ConnectionIds left to the given PeerId.
My suggestion is that along that, we offer an event like FromSwarm::NotifyHandlerError when NotifyHandler::One(peer_id) peer_id is no longer connected and when NotifyHandler::Any cannot find any ConnectionId so that users are informed when it happens.
Wdyt? Cc @dariusc93 @guillaumemichel

Currently we don't even log when this event happens (see 1 and 2) so I suggest we at least error log when these events happen.

@elenaf9
Copy link
Contributor Author

elenaf9 commented Dec 9, 2024

Yeah this is true, but to be precise if I understand correctly connections in this case are ConnectionIds (see 1) which translate to ConnectionHandler instances ( or streams) to another PeerId. A connection to a PeerId may be made of multiple ConnectionIds.

Afaik there is exactly one ConnectionId and one ConnectionHandler per connection. But there may be multiple connections to a peer, and per connection there may be multiple streams (stream != connection handler instance). Was that what you mean?

My suggestion is that along that, we offer an event like FromSwarm::NotifyHandlerError when NotifyHandler::One(peer_id) peer_id is no longer connected and when NotifyHandler::Any cannot find any ConnectionId so that users are informed when it happens.

But what if sending the NotifyHandler succeeded, but then the connection closes immediately afterwards for whatever reason? Then the NetworkBehavior still won't be informed, right?

Currently we don't even log when this event happens (see 1 and 2) so I suggest we at least error log when these events happen.

Adding a log message sounds good to me.

@drHuangMHT
Copy link
Contributor

But what if sending the NotifyHandler succeeded, but then the connection closes immediately afterwards for whatever reason? Then the NetworkBehavior still won't be informed, right?

In this case I think we'd better implement the ACK ourselves along the ConnectionHandler::FromSwarm, be it a oneshot channel or else.

@jxs
Copy link
Member

jxs commented Dec 10, 2024

Afaik there is exactly one ConnectionId and one ConnectionHandler per connection. But there may be multiple connections to a >peer, and per connection there may be multiple streams (stream != connection handler instance). Was that what you mean?

Thanks for the correction Elena, yup you are right, a ConnectionId and it's ConnectionHandler match a connection, but there may be multiple connections to a peer (different transports/ports), and per connection there may be multiple streams, but a stream doesn't have parity to ConnectionIds nor ConnectinHandlers.

In this case I think we'd better implement the ACK ourselves along the ConnectionHandler::FromSwarm, be it a oneshot channel or else.

yeah exactly, as we already to for gossipsub after #5595

@guillaumemichel
Copy link
Contributor

My suggestion is that along that, we offer an event like FromSwarm::NotifyHandlerError when NotifyHandler::One(peer_id) peer_id is no longer connected and when NotifyHandler::Any cannot find any ConnectionId so that users are informed when it happens.

I am not too familiar with this part of the codebase, but what @jxs suggests sounds good

@jxs
Copy link
Member

jxs commented Dec 10, 2024

Had an off band conversation with @elenaf9 about this issue where Elena helped me recall why I think this should be addressed with more than just logging.

There is currently a potential data race when a NetworkBehaviour dispatches an event to the ConnectionHandler where the connection has already been closed but as we poll the NetworkBehaviour prior to polling the connection pool so NetworkBehaviour won't know that the peer has already disconnected before returning ToSwarm::NotifyHandler and this will be silently discarded without the end user being able to act upon it.
the aforementioned situation happens here:

rust-libp2p/swarm/src/lib.rs

Lines 1187 to 1197 in 276ce84

// This loop polls the components below in a prioritized order.
//
// 1. [`NetworkBehaviour`]
// 2. Connection [`Pool`]
// 3. [`ListenersStream`]
//
// (1) is polled before (2) to prioritize local work over work coming from a remote.
//
// (2) is polled before (3) to prioritize existing connections
// over upgrading new incoming connections.
loop {

Another hypothesis to try to address this situation is to try to prioritize disconnection events to always take precedence over any other Swarm event, so that when a NetworkBehaviour is polled it already has received the FromSwarm::ConnectionClosed event

@AgeManning
Copy link
Contributor

I recall we did run into this problem and we made some work around, but I have now since forgotten the problem.

Lets say we do implement an NotifyHandlerError event. Do we have an exact use-case for this right now? Like is there a behaviour that would non-trivially handle this event?

@elenaf9
Copy link
Contributor Author

elenaf9 commented Dec 11, 2024

Lets say we do implement an NotifyHandlerError event. Do we have an exact use-case for this right now? Like is there a behaviour that would non-trivially handle this event?

Yes that was also what I was wondering about. Specifically: would any network behavior handle this event differently that a general connection-close event (when there are actions/ requests pending on that connection)?

An action/ request can fail on many steps if the associated connection closes:

  1. when sending the Handler Event to the connection handler
  2. after sending (channel to the handler as default capacity of 32), but before the handler gets to process it
  3. after the handler processed it, but before a related action (which in most cases involves some interaction with the remote peer) succeeded
  4. ... etc

With the NotifyHandlerError, case 1. is treated special and solved. All other cases still need to be catched by the network behavior by tracking pending actions/ requests and handling a FromSwarm::ConnectionClosed.

However, one use-case that came to my mind when talking yesterday with @jxs is that we could return in the NotifyHandlerError the original event. This would allow e.g. request-response to then pick another connection for the request. But only if there are any other remaining connections in the first place, and I am not sure how common that is in practice.

I wonder if we instead can avoid the race condition mentioned by @jxs as much as possible in the first place (it's not 100% avoidable), then maybe we don't need this special treatment.

That said, I am not completely opposed to it. Just would like to avoid growing the API of NetworkBehavior even more. But also mostly have request-response in my head right now, so maybe in other behaviors (e.g. gossipsub) this additional callback does bring greater benefit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants