Adguard DNS name resolution times out with OpenWrt/stubby #150
Comments
Interesting! Quick question: is there anything besides stubby in this configuration? dnsmasq or something like that?
Thanks for the quick response! Yes, dnsmasq (version 2.80-16.3) is used. Here's that configuration as it appears in /etc/config/dhcp:
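For reference, the dnsmasq section of an OpenWrt /etc/config/dhcp in this kind of stubby setup typically looks something like the following; this is an illustrative sketch (port 5453 is the OpenWrt stubby package's default listen port), not the exact configuration from this report:

```
config dnsmasq
        option domainneeded '1'
        option localise_queries '1'
        option local '/lan/'
        option domain 'lan'
        # Forward upstream queries to stubby on the loopback interface
        # (the OpenWrt stubby package listens on 127.0.0.1#5453 by default)
        list server '127.0.0.1#5453'
        # Ignore resolvers handed out by the ISP
        option noresolv '1'
```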
As I understand, there's a way to enable stubby logs if you modify the service file to run stubby with a debug logging flag.
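One way to do that is to stop the managed service and run stubby in the foreground; this sketch assumes the usual OpenWrt config path and stubby's -C (config file) and -l (log all debug levels to stdout) options, so check `stubby -h` on your build:

```sh
# Stop the managed service, then run stubby in the foreground with full logging
service stubby stop
stubby -C /etc/stubby/stubby.yml -l
```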
Sure, good idea. Here's the output:
Everything seemed fine at first; I was running nslookup throughout, and results returned fine. After a while, though, the failures started appearing in the log. I hope that helps! Any other ideas/thoughts?
As a data point, I was running stubby with debug output for a good 15-20 minutes with only my computer connected to my router, making occasional DNS queries, and it continued working fine during that time. So I stopped stubby, started it again with debug output enabled, and then connected my phone and my Roku to the router to increase the number of connected devices and DNS queries. Very soon thereafter, within a minute or so, those failures started appearing. So it appears that the number and/or frequency of DNS queries could be contributing to this issue, and potentially the number of actively connected devices on the router.

But again, with Cloudflare DNS as the configured provider and all other settings the same, it continues working, even under high DNS load with many connected devices. I've been running stubby with debug output, using Cloudflare DNS, for the past 30 minutes or so with lots of DNS activity across multiple devices, and I didn't see a single failure.
My idea is that stubby does not handle the remote end closing the connection properly. On our side, we're using an idle timeout, so the server closes connections after a period of inactivity. Please try setting a lower idle_timeout in your stubby configuration.
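For reference, stubby's idle_timeout is set in milliseconds in stubby.yml; a sketch with an illustrative value (not an official recommendation):

```yaml
# Close idle upstream TLS connections from our side after 5 seconds,
# before the server's own idle timeout kicks in (value is in milliseconds)
idle_timeout: 5000
```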
@saradickinson Sara, could you please take a look? What could be the reason for this stubby behavior?
@ameshkov So the logs are showing that all the connections are shut down by the far end, but this is not causing stubby to back off from using the upstream; stubby is continuing to connect and send requests, though it would be better to lower the idle_timeout. What I do see is that there are a bunch of timeouts on one connection (e.g. the one to 94.140.14.14 closed at 22:26:06.343125), and the connection problems start after that, with all connections closed after ~5 seconds. I haven't seen this behaviour before, but the fact that restarting stubby fixes it does seem to indicate a problem in the failure handling... I'll try to dig into it!
Thanks for the additional investigation and suggestions. I lowered the idle_timeout and tried again; I then lowered it further still.

So, with light to moderate DNS traffic, the failures started occurring in about 12 minutes even at a much lower idle_timeout.
As another update, I've been running stubby connected to AdGuard DNS with debug output enabled and a lowered idle_timeout, and the same failures eventually appeared.

After killing the stubby process and restarting it, it started working normally again with the same settings. This specifically looks more like a stubby issue than an Adguard DNS issue, so maybe @saradickinson would be best to address this one. But again, I don't experience any of these issues when using Cloudflare DNS with all the same settings otherwise, so there appears to be some kind of incompatibility between stubby and Adguard DNS, which started at some point in the past week (with no recent OpenWrt or stubby updates/changes on my end). I'd be happy to try anything else suggested to troubleshoot further.
@Ryan-Goldstein could you please try recording a tcpdump for connections to the AdGuard DNS server addresses?
Sure, here you go: https://pastebin.com/MzCtb72E

Requests started timing out maybe a minute before the end of that recording. Please let me know if you'd like me to do the same with the -v or -vv flags.
For fun (haha), I ran it with -vv. The beginning of it was pushed out of the backlog, but I believe I still captured the time before name resolution started timing out. Unfortunately, it's really hard to pinpoint exactly when it stops working; I think it might have been at 16:11:12.947884, but it's hard to say for sure. Here's that verbose capture: https://pastebin.com/cAURkYwG
Ah, sorry for not being specific enough. I wanted a tcpdump capture with full packet data, so that I can analyze it in Wireshark and see what TLS packets are being sent.
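A command along these lines produces such a capture (a sketch of the kind of invocation meant here; 94.140.14.14 appears in the logs above, 94.140.15.15 is AdGuard DNS's usual second resolver address, and the interface and output path are illustrative):

```sh
# Capture full packet payloads (-s 0) to a pcap file for Wireshark analysis
tcpdump -i any -s 0 -w /tmp/cap.pcap 'host 94.140.14.14 or host 94.140.15.15'
```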
Sure, I just did that. During this capture, instead of showing the usual errors, stubby failed somewhat differently.

Here's another one where I started tcpdump, then ran stubby in debug mode; it resolved a few names, then the failures began.

This is a more typical one, where it was working fine for a couple of minutes, then I connected my phone and Roku to increase the number of DNS queries, and I started seeing the stubby failures.

Hope that helps!
In case you want to do a comparison with Cloudflare DNS, which worked fine, I ran the same tcpdump command, except using 'host 1.1.1.1'. I started tcpdump, then started stubby in debug mode. I then followed the same process that I did to cause the error condition during the captures above: opening the Google News app on my phone and clicking through a bunch of articles, most of which cause a high number of DNS queries (and many ad domains that are blocked by Adguard DNS). Stubby with Cloudflare DNS continued working fine during this time. Here's that capture: cap4-cloudflare.zip
@Ryan-Goldstein the patched version is live, could you please try again?
Okay, to be precise, it's partly live. It should be rolled out to every server in about 15 minutes :)
Unfortunately, the issue persists. I've done another capture: cap5.zip

EDIT: I didn't see your second message until after I started recording, so the fix might not have been fully live when I did the capture above. I'll wait a bit and try again.
Well, from the pcap it seems that you were using the new version. The first two connections were closed gracefully; the other connections were closed forcibly by Stubby, and it's unclear what the reason was. Tbh, I am out of ideas now, so we'll have to wait. As a temporary solution, you could try using dnsproxy instead of Stubby until the issue is fixed.
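For anyone trying that workaround: dnsproxy can serve plain DNS locally and forward it upstream over DoT. A minimal sketch using its -l (listen address), -p (port), and -u (upstream) flags, with an illustrative port; dns.adguard.com was AdGuard's DoT hostname at the time of this thread:

```sh
# Listen on loopback:5353 and forward queries to AdGuard DNS over TLS
dnsproxy -l 127.0.0.1 -p 5353 -u tls://dns.adguard.com
```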
Thanks for the investigation and suggestion. I'll check out dnsproxy for now. In the meantime, in case it helps, I wrote this before I saw your most recent comment: I tried again 25 minutes after your post saying it should be rolled out everywhere in 15 minutes, and I'm still encountering this issue. I first started tcpdump to listen for connections to/from both AG DNS servers, using essentially the same full-packet tcpdump command as before.
Then I started stubby (and it immediately indicated a failure).

And the tcpdump capture is here: cap6.zip
@ameshkov Unfortunately, there doesn't appear to be a straightforward way to get dnsproxy installed on my OpenWrt router (which has less than 1 MB of free space). I may try Unbound or something else in place of stubby, but I'd prefer stubby, as the other options appear to have more overhead. @saradickinson Please let me know if you want me to do any debugging on the stubby side beyond what I've already done.
The only thing I can add is that in these dumps every connection is closed from the Stubby side, not from AdGuard DNS.
@ameshkov I have some packet captures taken just now that seem to show some malformed packets being sent from 94.140.14.14, which seem to trigger the failure mode in stubby. I'm still investigating, but I'll email you the captures instead of posting them here.
Sure, thank you!
Update: it looks like the issue is caused by malformed packets coming from the server (intermingled lengths and packets). Leaving this issue open to address improving stubby's failure handling for this scenario, because it should handle this case more gracefully.
Thanks to Sara's help, I've been able to identify the issue. What's especially stupid about this issue is that I had already fixed it in the upstream of the library we're using, but simply forgot to migrate that fix to our fork. @Ryan-Goldstein we're planning to roll out the fix a little later today.
Great news! Please let me know when the fix is rolled out, and I'd be happy to test/confirm that it resolves the issue for me.
@Ryan-Goldstein should be okay now.
It does appear to be fixed! Try as I might, I haven't been able to cause name resolution failures, and I haven't seen any errors in the stubby debug output.

I'll leave it up to you when/whether to close this issue. I know @saradickinson still wants to improve stubby's error handling; I'm not sure whether leaving this issue open will help keep that on the radar, or whether a new issue should be opened there if one hasn't been already. Thanks again to both of you for your quick investigation and for ultimately resolving this issue!
@ameshkov I do have one quick question: what's the officially recommended idle_timeout for use with AdGuard DNS?
I guess this depends on stubby's own logic. AdGuard DNS's own idle timeout is 8 seconds, so I'd say it should be somewhere close to that.

I guess it's not that important to keep this issue open, so I'd prefer to close it. @saradickinson if you need my assistance, please let me know; I can set up a copy of AG DNS with that bug if you need it.
@ameshkov @Ryan-Goldstein Noted on the idle_timeout. I've also opened an issue in the getdns repo to further investigate the problem: getdnsapi/getdns#501
May I ask: do I understand correctly that this failure in Stubby's resolving has to do with the handling of the 'edns_client_subnet_private option=1' setting being 'on' in the stubby configuration? Google's authoritative servers were serving records for China instead of records for the AdGuard DNS servers' location. If stubby had not failed to respond correctly (by closing the connection), would this not have been a problem?
@hanvinke no, this is completely unrelated. This was a bug on the AdGuard DNS side: it could randomly send a malformed DNS packet in response, and this triggered a bug in stubby.
Thanks for the info, I was afraid of that. It's just that I myself have issues with EDNS and dnsmasq.
Original issue description:

I've been using AdGuard DNS for many months by connecting to it via DoT through stubby on my OpenWrt-flashed router, running the latest versions of both OpenWrt (19.07.7) and stubby (0.3.0-1).
It had been working well up until about a week ago, when I started seeing resolution timeouts, mostly on Google-related domains; I understand that issue has since been fixed, per this blog post.
In the past few days, possibly corresponding to the changes made to fix the aforementioned timeout issue, I have started experiencing a different problem: after 5-15 minutes, name resolution stops working on all clients and on the router itself. Running nslookup on any domain when the router enters this state results in a timeout.
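A timed-out lookup typically looks something like this (illustrative output, not quoted from the report; the exact text varies by nslookup implementation):

```
$ nslookup google.com
;; connection timed out; no servers could be reached
```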
There's nothing indicated in the OpenWrt system logs at the debug log level; name resolution simply stops working, with all attempts timing out. Restarting stubby with

service stubby restart

resolves the issue for approximately another 5-15 minutes, at which point the same problem occurs.

Keeping all stubby settings the same but switching from Adguard DNS to Cloudflare DNS, the issue no longer occurs, so it seems attributable specifically to Adguard DNS.
Here is my stubby configuration with Adguard DNS being used:
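(A sketch of what such a stubby.yml typically looks like; the values below are illustrative rather than the poster's exact file, and dns.adguard.com was AdGuard's documented DoT hostname at the time.)

```yaml
resolution_type: GETDNS_RESOLUTION_STUB
dns_transport_list:
  - GETDNS_TRANSPORT_TLS
tls_authentication: GETDNS_AUTHENTICATION_REQUIRED
round_robin_upstreams: 1
idle_timeout: 10000
listen_addresses:
  - 127.0.0.1@5453
upstream_recursive_servers:
  # AdGuard DNS over TLS
  - address_data: 94.140.14.14
    tls_auth_name: "dns.adguard.com"
  - address_data: 94.140.15.15
    tls_auth_name: "dns.adguard.com"
# Cloudflare DNS (commented out; swap with the AdGuard entries to test)
#  - address_data: 1.1.1.1
#    tls_auth_name: "cloudflare-dns.com"
```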
When switching to Cloudflare DNS, by uncommenting the Cloudflare lines and commenting out the Adguard DNS lines, then restarting stubby with

service stubby restart

name resolution works and remains working without having to restart the stubby service.

I'd be happy to provide any other requested information and/or follow any suggested troubleshooting steps. Thanks in advance!