Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(gossipsub): peer discovery #1168

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

AlejandroCabeza
Copy link
Collaborator

@AlejandroCabeza AlejandroCabeza commented Aug 7, 2024

Implement Gossipsub's test plan's Peer Discovery block.

Note

One of the tests is commented out due to the reasons explained in the comment.

@AlejandroCabeza AlejandroCabeza self-assigned this Aug 7, 2024
@AlejandroCabeza AlejandroCabeza force-pushed the tests/gossipsub/peer-discovery branch 2 times, most recently from 1852506 to 561ca6d Compare August 14, 2024 15:39
@AlejandroCabeza AlejandroCabeza force-pushed the tests/gossipsub/peer-discovery branch from e2e278e to e53d041 Compare August 27, 2024 16:56
@AlejandroCabeza AlejandroCabeza marked this pull request as ready for review August 27, 2024 16:56
@@ -18,7 +18,7 @@ import
discovery/discoverymngr,
discovery/rendezvousinterface,
]
import ./helpers
import ./helpers, ./asyncunit, ./utils/[async, assertions, futures]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those tests need to be moved to a new PR with a different scope as they are unrelated to GossipSub.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I follow. Why are you linking the comment to the imports?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I mean the tests in this file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say they are related to Gossipsub given they are specified in Gossipsub Testplan's Peer Discovery block, as linked in the description.

Do you perhaps mean to move them to a different file?

Copy link
Contributor

@diegomrsantos diegomrsantos Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rendezvous and Peer discovery aren't related to GossipSub.

Copy link
Contributor

@diegomrsantos diegomrsantos Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how gossipsub is informed of newly connected peers https://github.com/vacp2p/nim-libp2p/blob/master/libp2p/protocols/pubsub/pubsub.nim#L665. We don't need to use any discovery mechanism to test this, just connecting to peers using the switch is enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever a new peer is connected, the gossipsub implementation checks to see if the peer implements floodsub and/or gossipsub, and if so, it sends it a hello packet that announces the topics that it is currently subscribing to.

I don't know if we are doing this part checks to see if the peer implements floodsub and/or gossipsub.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are two different things, I believe.
What you mention is the callback that lets you run code when a peer joins or leaves.
This is a discovery mechanism that allows you to find peers available for connection.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I believe that having removed the xasyncTest, this PR's scope is just peer discovery.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as we agreed, it's best to keep PRs small and focused on a single scope.

That said, we should avoid prolonging the PR process unnecessarily. In this case, for example, the cost of delaying to debate whether to split the PR outweighs simply merging it, provided the code is sound.

Approaching the discussion more collaboratively might have helped resolve the issue more quickly.

tests/utils/tests.nim Outdated Show resolved Hide resolved
tests/utils/tests.nim Outdated Show resolved Hide resolved
tests/utils/async.nim Outdated Show resolved Hide resolved
@AlejandroCabeza AlejandroCabeza force-pushed the tests/gossipsub/peer-discovery branch from 7fd3a28 to b40eaa8 Compare September 10, 2024 15:38
@@ -277,54 +278,6 @@ suite "GossipSub":

await allFuturesThrowing(nodes[0].switch.stop(), nodes[1].switch.stop())

asyncTest "GossipSub unsub - resub faster than backoff":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has this been removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because that test is wrong. The correct one is the one I implemented but marked with xasyncTest because currently there's no mechanism to make it work as it should.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the test wrong?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it doesn't take into account the fanout mechanism. The message is sent via fanout, it doesn't care about backoff period.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add the test back.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if it's not working?

Copy link
Contributor

@diegomrsantos diegomrsantos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rendezvous tests are unrelated to GossipSub, they shound't be in GossipSub test cases. Please create a new PR with a different description.

)
gossip1.routingRecordsHandler.add(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First time I see RoutingRecordsHandler. Would you happen to know what is the purpose of this? Is it something mentioned in the spec?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the callback for peer exchange.

Comment on lines +10 to +25
proc toResult*[T](future: Future[T]): Result[T, string] =
if future.cancelled():
return results.err("Future cancelled/timed out.")
elif future.finished():
if not future.failed():
return future.toOk()
else:
return results.err("Future finished but failed.")
else:
return results.err("Future still not finished.")

proc waitForResult*[T](
future: Future[T], timeout = DURATION_TIMEOUT
): Future[Result[T, string]] {.async.} =
discard await future.withTimeout(timeout)
return future.toResult()
Copy link
Contributor

@diegomrsantos diegomrsantos Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this complexity isn't necessary. Result is a substitute for an Exception, not a Future. Our focus should be to make the codebase simpler and avoid unnecessary complexity.

Copy link
Collaborator Author

@AlejandroCabeza AlejandroCabeza Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Result is not a substitute for Exception, but for a more general "any failed computation".
In this case, this simplifies and makes more robust a lot of checks in tests. When you run withTimeout you are effectively collapsing the "uncertainty of the future", that is, making it either have a value, or an error.
Casting that "collapsed future" into a result allows for a more visual and easier handling of that situation.

nodes[1].unsubscribe("foobar", handler)

await passed.wait(5.seconds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why has this been removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's been replaced by the waitForResult call. Conceptually similar, but handier for the checks below.

Comment on lines +3 to +5
const
DURATION_TIMEOUT* = 1.seconds
DURATION_TIMEOUT_EXTENDED* = 1500.milliseconds
Copy link
Contributor

@diegomrsantos diegomrsantos Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this shouldn't be defined in an utils file. The constant names don't have any meaningful description and we don't have any requirement or consensus about the use of those values in the test suite. If you want to avoid magic numbers and repetition in your tests, I'd suggest to define them for a particular test file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say DURATION_TIMEOUT is not a horrible name. It specifies the type (duration) and the context (timeout). Given that's the default measurement I'm using, I decided to leave the name as is. I believe the EXTENDED suffix can be easily understood from that.

The reason I'm specifying them in a file is precisely because I'm using those numbers widely across different files, so it felt a bit repetitive having to repeat them all over the test codebase.

@@ -201,3 +207,18 @@ proc waitSubGraph*(nodes: seq[PubSub], key: string) {.async.} =

await sleepAsync(5.milliseconds)
doAssert Moment.now() < timeout, "waitSubGraph timeout!"

proc waitForMesh*(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was used in a test that was removed by your suggestion. Removing it.

await rdvB.unsubscribe(namespace)
var
query2 = dmA.request(rdvNamespace)
res2 = await query2.getPeer().waitForResult(1.seconds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use withTimeoutand check the result is false.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I could. That being said, I consider casting the Future to Result a much cleaner approach which also allows for value checking if need be (not in this case, ofc).

dmB.advertise(rdvNamespace)
let
query1 = dmA.request(rdvNamespace)
res1 = await query1.getPeer().waitForResult(1.seconds)
Copy link
Contributor

@diegomrsantos diegomrsantos Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't supposed to hang forever, so you can just use awaitand check the value is returned as expected. I'd suggest using a more descriptive name than res1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using await means there's no timeout, which is something quite handy for this situation. This operation should take way lower than 1 second, which means if the timeout is hit, there's something very wrong with this and the test should quite. If, by any chance, the reason the timeout is hit is the task is hanging, this makes the test exit and not leave the test suite running forever.

Regarding the name, if you have a look at the tests you will see it's kind of standard practice. If you have a suggestion I'm all ears :)

Comment on lines +102 to +103
query1 = dmA.request(rdvNamespace)
res1 = await query1.getPeer().waitForResult(1.seconds)
Copy link
Contributor

@lchenut lchenut Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A query is a tool to re-use. Recreating one every loop is not the way to do.
It should work just by writing

asyncTest "Frequent sub/unsub":
  let query = dmA.request(rdvNamespace)
  for i in 0 ..< 10:
    (...)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did that and it doesn't work, it breaks at the 3rd iteration. I don't know what might the correlation be.

Copy link
Contributor

@lchenut lchenut Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then there's a problem in rendezvous or the discovery manager 😕
But I think I already tried to understand this. It rings a bell.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you rather I do with this? Leave it as is and open an issue or would you like to investigate it before merging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to investigate, it's a real problem. I'll do this next week.


res2.assertIsErr()

asyncTest "Frequent sub/unsub with multiple clients":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing in this test, you should create only one query per discovery manager.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: backlog
Development

Successfully merging this pull request may close these issues.

4 participants