Dnsmasq: Maximum concurrent DNS queries limit

Continuing the discussion from OpenWrt 23.05.0-rc1 first release candidate:

I wondered what was going on here, so I read the man page entry, which didn't enlighten me much, and I'm guessing you're not running a large "web-server log file resolver".

So, I asked about this error over on the dnsmasq mail list, and got this response from Buck Horn:

In my never-ending quest for root causes :thinking:, is it possible I could convince you two, @erayrafet and @Neverends4, to try some experiments? (Anyone else who has seen this is highly encouraged to jump in!)

First, is this easily reproducible? If not, maybe we shouldn't waste time on it, but if it is...

Have you noted any particular circumstances when this happens? For example, every morning I load like 100+ tabs in my browser to news sites, lkml and so on. Are you doing something similar that you can pinpoint, or does it just happen spontaneously?

Tests

I think the easiest one to try is swapping upstream DNS providers when the problem occurs. If you're using 8.8.8.8, then try 1.1.1.1 or 9.9.9.9, see if it goes away or persists, then swap back to verify that the problem returns.
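If it helps, here's roughly how I'd swap the upstream from a shell on the router - just a sketch assuming dnsmasq forwards straight to plain upstream resolvers via the stock /etc/config/dhcp layout; if you're running a DoH proxy, change the provider in its config instead:

  # point dnsmasq at Quad9 instead of whatever is there now (example addresses)
  uci -q delete dhcp.@dnsmasq[0].server
  uci add_list dhcp.@dnsmasq[0].server='9.9.9.9'
  uci add_list dhcp.@dnsmasq[0].server='149.112.112.112'
  uci commit dhcp
  /etc/init.d/dnsmasq restart

Swapping back is the same commands with the original addresses.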

DNS loop detection is harder; I'm still looking into it. You need to know which name is giving the DNS resolver grief, then do a delegation trace with something like dig +trace @up.stream bad.site.url. Finding the URL to try is probably going to be hard - it could be an embedded href in some page ten layers deep, so no fun at all.
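For the trace, something like this from any machine that has dig should do; example.com is just a stand-in for whatever name turns out to be the troublemaker (on OpenWrt itself I think dig comes from the bind-dig package):

  # follow the delegation from the root servers and watch for loops or timeouts
  dig +trace example.com

  # or ask the configured upstream directly and look for SERVFAIL or timeouts
  dig @9.9.9.9 example.com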


Nope. It's rare - maybe once a week at most.

I am guessing it happens due to chat apps loading content asynchronously, especially Discord which I use a lot. I can't be certain though.

I have AdGuard DoH set up with https-dns-proxy.

Thanks, I think your post fixed my year-old Surfshark WireGuard connectivity issue (still under test, but it looks promising)... a few more days and I will know.

Background:

I configured wireguard using both Mullvad and Surfshark last year. The Mullvad router always stayed connected. The SS router would lose connectivity after 2 days. A year later I repeated the tests. Same results.

I looked at the system log on the SS router after it lost connectivity, and it had around 20 or so of the DNS query messages at the end of the log. Not that many, but...

Following @erayrafet's suggestion, I increased the maximum concurrent DNS queries to 500 and the router regained connectivity. By this time I had tested SS around 3-4 times, on 2 different routers and on v21/v22, over the course of a year, and it always failed around the 2-day mark, so it looks like this might have fixed it.
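For reference, the change amounts to roughly this from a shell (a sketch; the UCI option maps to dnsmasq's --dns-forward-max, and 500 was a guess on my part rather than a tuned value):

  uci set dhcp.@dnsmasq[0].dnsforwardmax='500'
  uci commit dhcp
  /etc/init.d/dnsmasq restart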

What I think happened:

I always run DNSSEC (dnsmasq-full).

SS has 2 listed DNS IPs. The test routers were very lightly used - just me surfing YouTube etc. I guess the SS DNS servers were generating more failed lookups than Mullvad's? Maybe they couldn't resolve the NTP requests or some of the OpenWrt housekeeping? I have no idea.

I figured it was just something I was doing, since if there were a real problem we would see more posts about it in the forums - and there weren't. I guess most users don't load dnsmasq-full? Or they were running the auto-renew script another user created (I don't use that). Since Mullvad worked, I just went with that, especially as SS didn't even offer official support until 2022.

Anyhow thanks again! (especially if it continues working!)

Currently running wireguard on EA8300, v22.03.3
250-275Mbps

SS wireguard would lose connectivity after 48 hours without fail (at the default 150 max DNS queries). Rebooting the router did not restore connectivity.

Well, everyone has a different use scenario, so I'll just explain mine:

  1. In my case the "Maximum number of concurrent DNS queries reached" issue is rare; I encountered it only twice in one year, both under 22.03.x on my Xiaomi Redmi AC2100 router (MT7621 + 128 MB memory). Maybe there were more occurrences, but I only noticed those two.

  2. Every time this issue occurred, my whole network went down - no DHCP/DNS - and I had to manually enter an IP on my Windows PC to access the router. That's when I found this entry in the syslog. Everything went back to normal after a reboot, with no config change needed.

  3. I have WireGuard and a SoftEther VPN server running on the router, with https-dns-proxy as the upstream DNS server for dnsmasq. I also run P2P apps in Docker on one of my Linux boxes, so my concurrent connections burst past the OpenWrt connection-tracking limit (16384 by default) from time to time - see the sketch below.
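Roughly how I check whether the conntrack table is the thing overflowing (standard Linux sysctls; the 32768 figure is just an example, not a recommendation):

  # current number of tracked connections vs. the limit
  sysctl net.netfilter.nf_conntrack_count
  sysctl net.netfilter.nf_conntrack_max

  # raise the limit for testing; persist it via /etc/sysctl.conf if it helps
  sysctl -w net.netfilter.nf_conntrack_max=32768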

Got RC2 from the Firmware Selector.

"Bad cell count"

Keeps popping up every time I reboot.

Fri Jun 30 09:41:12 2023 daemon.err uhttpd[1640]: [info] luci: accepted login on / for root from 192.168.1.100

It considers a login action as an error. Strange.

Can't upload a .txt log file here.

So far, both of you run https-dns-proxy upstream of dnsmasq. https-dns-proxy starts much later than dnsmasq during boot, and it offers the option to force DNS via a firewall rule.

Interesting, but inconclusive.
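For context, the force-DNS option is essentially a LAN-side DNS intercept. It boils down to something like this in /etc/config/firewall (a generic sketch of the technique, not necessarily the exact rule the package installs):

  config redirect
      option name 'Intercept-DNS'
      option src 'lan'
      option src_dport '53'
      option proto 'tcp udp'
      option target 'DNAT'

With no dest_ip given, port 53 traffic from LAN clients is redirected to the router itself, i.e. into dnsmasq.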


I ran into this maximum number of concurrent queries limit a couple times earlier this year with my NanoPi R2S using dnsmasq with Quad9 as my upstream. No DoH or DNSSEC. Probably 22.03.3 as the firmware.

I've got a few iOS devices and Macs, and I kind of figured they were making a lot of queries at the time.

I doubt https-dns-proxy is the culprit, because having no DHCP/DNS indicates dnsmasq itself went down. Now I recall I also saw a log entry something like "dnsmasq crashed xxx loops", so maybe it is a bug in the dnsmasq version used in 22.03.x?

An FYI. It seems you're not the only one who thought it was too small...

From git.openwrt.org

dnsmasq: set an increased cachesize default value

commit     a57796b137494fc20e984d0049e8e7430e9ebb25
parent     f183ce35b8ea2fd991ac489fb223b09a1ecb4db0
tree       fe84b5640330d976e889a2217c0273c9c64d12ed
author     Hannu Nyman <hannu.nyman@iki.fi>          Sun, 27 Nov 2022 22:27:06 +0200
committer  Christian Marangi <ansuelsmth@gmail.com>  Sat, 21 Jan 2023 11:13:44 +0100

Dnsmasq DNS cache size is only 150 by default.
Set the uci default value to 1000, so that cache gets used more
and unnecessary DNS queries to upstream can be avoided.

I'm confused, since it calls it the cache, not the concurrent connections. As I see it in LuCI, concurrent queries default to 150, and then there's what's called the cache size - I forget its default, but it has a max of 10000? Confused by the naming here...

I see this change in main, predating the release of 23.05.0-rc1 and -rc2, but not in the 22.03.4 history. I haven't tried the 23.05 candidates yet - is 1000 the new default?

Edit: Did some reading; there are both a concurrent-connections number and a cache-size number, and I probably forgot that they both default to 150. I use a mini PC as my router... lots of memory, so I typically jack up the cache size. Sorry for confusing the issue.

Anyone able to shed light on what good values for cache and concurrent connections are? Pros and cons of larger or smaller numbers?

AFAIK those are different things -- concurrent queries vs cache size in the commit you've linked.
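To make the distinction concrete, these are the two separate knobs as they would appear in /etc/config/dhcp (a sketch - whether either line exists in your config depends on the release; the values shown are just the defaults discussed in this thread):

  config dnsmasq
      # concurrent upstream queries (dnsmasq --dns-forward-max; built-in default 150)
      option dnsforwardmax '150'
      # number of cached names (dnsmasq --cache-size; the commit above raises the packaged default from 150 to 1000)
      option cachesize '1000'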

Yep... that dawned on me when I looked at the docs and found them both at 150. I'd spent so much time with a bigger cache that I forgot, and thought the commit was referencing the concurrent connections.

What do you think about "proper" sizes for both of these, in case my edit overlapped your comment?