webrtc: test pion fixes for state change callbacks ordering #2732
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes: #2614
This is a problem with pion/webrtc's OnConnectionStateChange handler. It should wait for the handler to finish. Right now connection notifications might get reordered leading to listener thinking the connection has established and the dialer thinks connection hasn't established yet.
The fix in pion/webrtc is here:
https://github.com/pion/webrtc/pull/2702/files
The same error in pion/ice
https://github.com/pion/ice/pull/656/files
Update: the text below helped me explains how I went about debugging this
I think this fixes the problemThis might help debug the problem of tests taking 10m. My theory is that the ICE connection succeeds on dialer and times out on listener(20s). So the listener removes the ufrag from its udp mux. After keepalive time, the dialer sends a STUN Request and initiates a new connection on the listener. This of course will fail because the dialer has exited the ice connection establishment mode and is simply sending STUN packets to keep the connection alive.If we look at the failure here: https://github.com/libp2p/go-libp2p/actions/runs/8244362080/job/22546464639
listener is stuck on
2024-03-12T06:24:47.2310488Z /home/runner/work/go-libp2p/go-libp2p/p2p/transport/webrtc/listener.go:234 +0x886
This context expires after 20 seconds.
And all the goroutines on the dialer side are stuck for 9 minutes. And all the goroutines on the listener side are stuck for < minute(which is why that's not shown in the logs)
The logs also show that the dialer believes that ice connection on its side has completed.
Unfortunately this is very difficult to test. This has not flaked in the 16 tries I have done after that.
We still have to explain why the dialer doesn't exit on DTLS failure. DTLS failure should reflect in the OnConnectionStateChange callback.