Async cannot scale #39

Open
panagiotious opened this issue Feb 13, 2016 · 5 comments

Comments

@panagiotious

Hello,

After spending several hours debugging, rewriting and implementing a few ideas, I think I can safely conclude that the asynchronous functionality does not scale. The timeout counter starts as soon as items are added to the Context object as address calls; that's devastating! When resolving a handful of domains without caring about precision, it is not an issue, but when the resolution requests number in the millions, it is impossible to scale.

Even in the trivial case of resolving 10,000 domain names, a few hundred end up timing out. The more QNAMEs are added, the more timeout events occur. Here is a small example:

In [44]: ctx = getdns.Context()

In [45]: ctx.resolution_type = getdns.RESOLUTION_STUB

In [46]: ctx.upstream_recursive_servers = [{'address_data': '8.8.8.8', 'address_type': 'IPv4'}]

In [47]: ctx.suffix = []

In [48]: ctx.timeout = 10000

In [49]: ctx.address(name='www.google.com', extensions={}, callback='cbk', userarg='www.google.com')
Out[49]: -2897608153477377669

In [50]: time.sleep(10)

In [51]: ctx.run()
Query timed out for www.google.com
@MelindaShore
Contributor

That's an issue with the underlying getdns library, which can handle a larger number of asynchronous queries on a single context but which will also start getting timeouts on a very large number. I'll discuss it with the team on Monday.

@panagiotious
Author

I am concerned that the problem has to do not with the number of domains, but rather with the wait time between initializing the requests table and submitting them. Most async libraries I have worked with, for example twisted or pyuv, submit the requests as they come (in a first come, first served fashion). In the case of getdns and the Python bindings, the library is expecting a list (?) to be initialized first.

Thank you for your attention! You have been very helpful and I really appreciate your commitment to the project!

@MelindaShore
Contributor

I actually do hand off the requests to libgetdns as soon as they come in and they are dispatched immediately. If you jump into wireshark or some such and run the Python interpreter interactively, you can watch the queries go out immediately - there's no waiting around. That said, there clearly are some scaling issues.

I should add that one thing that's been on the back burner but should probably be moved up is exposing a file descriptor on a Context() that can be polled by external async libraries like Twisted, etc. But I am relatively certain that it won't improve the number of queries you can spin off without getting timeouts.

@seb-at-nzrs

I'd like to second this issue, and also point out, regarding your comment above, that getdns sends all the queries it has in one go, despite the limit_outstanding_queries parameter. I've been playing with different values and capturing the queries, and it always sends the full list in one go.

Like @panagiotious, I'm trying to resolve millions of names against a local resolver.

@wtoorop
Contributor

wtoorop commented Jan 31, 2017

Hi Sebastian and panagiotious, this issue is indeed related to the underlying C library. I've imported this issue there. I don't think you'll be subscribed automatically; you probably need to leave a comment there first. Here is the new issue: getdnsapi/getdns#257
Sorry for not dealing with this earlier. It slipped my attention because until now I focused only on the C library issues. I'll start keeping an eye on these issues from now on too.

For completeness I'll include my response to the issue here too:

Indeed, the limit_outstanding_queries parameter affects full recursion only. We simply forgot to implement it in stub resolution mode. This needs to be addressed quickly (before the 1.1 release).
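
Until the limit is honoured in stub mode, one possible application-side workaround is to cap the number of in-flight queries yourself: submit names in fixed-size batches and drain each batch before scheduling the next. The sketch below is only an illustration and not part of getdns. `batches`, `resolve_in_batches`, `submit` and `drain` are hypothetical names; `submit` would wrap something like the `ctx.address(...)` call from the session earlier in this thread, and `drain` would wrap `ctx.run()`, assuming `ctx.run()` returns once the scheduled queries have completed or timed out.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batches(names: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` names."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(names)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def resolve_in_batches(
    names: Iterable[str],
    submit: Callable[[str], object],   # hypothetical hook: schedule one query
    drain: Callable[[], object],       # hypothetical hook: run scheduled queries
    max_outstanding: int = 100,
) -> int:
    """Schedule at most `max_outstanding` queries, then drain them.

    Returns the number of batches processed.
    """
    count = 0
    for batch in batches(names, max_outstanding):
        for name in batch:
            submit(name)
        drain()  # nothing new is scheduled until this batch has been run
        count += 1
    return count
```

With the Python bindings, `submit` could be `lambda n: ctx.address(name=n, extensions={}, callback='cbk', userarg=n)` and `drain` could be `ctx.run`. Draining right after scheduling also keeps the gap between adding a query and running it short, which matters if the timeout clock starts at scheduling time as reported above.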

Also, note that using an external eventloop is strongly advised when using getdns with many simultaneous queries. The default eventloop is based on select and can handle only a limited number of simultaneous queries. This is documented in the README.md (of the C library) as a known issue, by the way.

Neilcook currently has a pull request that replaces select with poll in the default eventloop extension. I intend to polish it up a little bit before merging (use of custom memory functions, turn it into another eventloop extension for platforms that don't have poll), but it might be worthwhile to try it out already if you want to schedule many simultaneous queries without using an external event library right now.
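
For readers unfamiliar with the distinction: `select` tracks descriptors in a fixed-size set (`FD_SETSIZE`, commonly 1024), whereas `poll` and newer mechanisms have no such fixed cap. As a small Python illustration, independent of getdns and of the pull request mentioned above, the standard `selectors` module picks the best mechanism the platform offers. The helper name `ready_count` is hypothetical, and the socket pairs merely stand in for outstanding queries.

```python
import selectors
import socket


def ready_count(n_pairs: int = 8, timeout: float = 1.0) -> int:
    """Register n_pairs sockets for reading and count how many become readable."""
    # DefaultSelector uses epoll/kqueue/poll where available, else select.
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in range(n_pairs)]
    try:
        for reader, writer in pairs:
            reader.setblocking(False)
            sel.register(reader, selectors.EVENT_READ)
            writer.send(b"x")  # make the reader side readable
        ready = set()
        while len(ready) < n_pairs:
            events = sel.select(timeout)
            if not events:
                break  # timed out waiting for more sockets
            for key, _mask in events:
                ready.add(key.fd)
                key.fileobj.recv(16)         # consume the pending byte
                sel.unregister(key.fileobj)  # done with this socket
        return len(ready)
    finally:
        sel.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()
```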

@panagiotious Indeed, since UDP has no buffer for outgoing messages it is conventional (in other asynchronous libraries) to write a message out immediately. This does not work for TCP and TLS, which need handshakes etc. before the socket can be written to. I don't think it matters much to be honest, but I'm willing to have a look if writing to UDP sockets immediately can be implemented without too much difficulty.
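
The difference in connection setup can be seen in a few lines of Python, independent of getdns: a datagram socket can send as soon as it exists, while a TCP or TLS socket has to finish its handshake first. This is a loopback-only sketch; `udp_roundtrip` is a hypothetical helper and the payload is arbitrary bytes, not a DNS message.

```python
import socket


def udp_roundtrip(payload: bytes = b"ping") -> bytes:
    """Send one datagram over loopback without any connection setup."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2.0)
        # No connect() or handshake: the first call on the sender is the send.
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(1024)
        return data
    finally:
        sender.close()
        receiver.close()
```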
