Intermittent loss of DNS resolution using DoH

Hello, I am using DoH (https_dns_proxy, set up through its LuCI app) with dnsmasq. Since setting up DoH, I have noticed a loss of DNS resolution on all clients almost daily. It happens anywhere from once every two days to several times a day, often for hours at a time. Most of the time a reboot fixes it. Restarting https_dns_proxy or dnsmasq doesn't solve the problem.

My router:
Model: TP-Link Archer A7 v5
Firmware: OpenWrt SNAPSHOT r11582-b3779e920e / LuCI Master git-19.327.83508-5e1253f (recently updated to see if it fixed the problem, but it didn't)
Config: DHCP for all LAN clients, IPv6 disabled everywhere possible, adblock enabled, UPnP disabled

https_dns_proxy config (set via luci app, using cloudflare and dns.sb):

https_dns_proxy.@https_dns_proxy[0]=https_dns_proxy
https_dns_proxy.@https_dns_proxy[0].listen_addr='127.0.0.1'
https_dns_proxy.@https_dns_proxy[0].listen_port='5053'
https_dns_proxy.@https_dns_proxy[0].user='nobody'
https_dns_proxy.@https_dns_proxy[0].group='nogroup'
https_dns_proxy.@https_dns_proxy[0].bootstrap_dns='1.1.1.1,1.0.0.1'
https_dns_proxy.@https_dns_proxy[0].url_prefix='https://cloudflare-dns.com/dns-query?ct=application/dns-json&'
https_dns_proxy.@https_dns_proxy[1]=https_dns_proxy
https_dns_proxy.@https_dns_proxy[1].listen_addr='127.0.0.1'
https_dns_proxy.@https_dns_proxy[1].listen_port='5054'
https_dns_proxy.@https_dns_proxy[1].user='nobody'
https_dns_proxy.@https_dns_proxy[1].group='nogroup'
https_dns_proxy.@https_dns_proxy[1].bootstrap_dns='185.222.222.222,185.184.222.222'
https_dns_proxy.@https_dns_proxy[1].url_prefix='https://doh.dns.sb/dns-query?'

/etc/config/dhcp:

config dnsmasq
        option domainneeded '1'
        option localise_queries '1'
        option rebind_protection '1'
        option rebind_localhost '1'
        option local '/lan/'
        option domain 'lan'
        option expandhosts '1'
        option authoritative '1'
        option readethers '1'
        option leasefile '/tmp/dhcp.leases'
        option nonwildcard '1'
        option localservice '1'
        option noresolv '1'
        list doh_backup_server '127.0.0.1#5053'
        list addnhosts '/tmp/adb_list.overall'
        option logqueries '1'
        option logdhcp '1'
        option logfacility '/tmp/dnsmasq.log'
        list server '127.0.0.1#5053'
        list server '127.0.0.1#5054'

When DNS resolution fails, I can still use nslookup on the router itself via ssh, since the router does not use the local DoH proxy for its own lookups. But nslookup on clients times out:

nslookup openwrt.org
;; connection timed out; no servers could be reached

Since resolution still works on the router, I can run nslookup there to find the IP of a site and then ping that IP from the clients. Entering the IP address directly in a client's browser also loads pages. So in my view the issue is DNS related.

Example output of /tmp/dnsmasq.log when DNS resolution is WORKING:

Nov 30 19:39:14 dnsmasq[2125]: 1505 192.168.1.149/41641 query[A] openwrt.org from 192.168.1.149
Nov 30 19:39:14 dnsmasq[2125]: 1505 192.168.1.149/41641 forwarded openwrt.org to 127.0.0.1
Nov 30 19:39:14 dnsmasq[2125]: 1505 192.168.1.149/41641 forwarded openwrt.org to 127.0.0.1
Nov 30 19:39:14 dnsmasq[2125]: 1506 192.168.1.149/45237 query[AAAA] openwrt.org from 192.168.1.149
Nov 30 19:39:14 dnsmasq[2125]: 1506 192.168.1.149/45237 forwarded openwrt.org to 127.0.0.1
Nov 30 19:39:14 dnsmasq[2125]: 1505 192.168.1.149/41641 reply openwrt.org is 139.59.209.225
Nov 30 19:39:14 dnsmasq[2125]: 1506 192.168.1.149/45237 reply openwrt.org is 2a03:b0c0:3:d0::1af1:1

When it fails, I do not get the "reply" lines, just the first three lines I think: the query and the two forwards. (I would give an example, but I am not having the problem at the moment.)
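
To catch the failing case the next time it happens, the log can be filtered for queries that never got an answer. This is only a minimal sketch, assuming the log format shown above (field 5 is the query serial number, field 7 the action, field 8 the name) and assuming that any line for a serial number other than "query[...]" or "forwarded" counts as an answer. The function name unanswered_queries is made up for illustration.

```shell
#!/bin/sh
# Print queries from a dnsmasq query log that were never answered.
# Assumed layout: $4 = "dnsmasq[PID]:", $5 = serial number,
# $7 = action (query[TYPE], forwarded, reply, cached, ...), $8 = name.
unanswered_queries() {
    awk '
        # Remember every query by its serial number.
        $4 ~ /^dnsmasq\[/ && $7 ~ /^query\[/ { pending[$5] = $7 " " $8; next }
        # Any other line except "forwarded" is treated as an answer.
        $4 ~ /^dnsmasq\[/ && $7 != "forwarded" { delete pending[$5] }
        # Whatever is left was forwarded (or not) but never answered.
        END { for (id in pending) print id, pending[id] }
    ' "$@" | sort -n
}
```

Run as `unanswered_queries /tmp/dnsmasq.log`. Serial numbers start over when dnsmasq restarts, so the result is only meaningful for a log covering a single run.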

Anyone have any ideas?

I think you have hit the same problem I had a while ago.

config dnsmasq
        option domainneeded '1'
        option localise_queries '1'
        option rebind_protection '1'
        option rebind_localhost '1'
        **option local '/lan/'**

Change the line marked ** above to

        list local '/lan/'

Background: I have dnscrypt-proxy set up for my outbound DNS. A while ago I was moving internal name resolution to the same system. I made the config changes on the CLI with the uci set command, copying the lines to change from the output of uci show. I found out that uci show displays list items the same way as option lines, and that uci set turns a list item into an option item without any complaint. Only later did I find the recommendation in the UCI documentation to edit config files manually and leave uci for scripted operations. Until then I had the impression that uci and manual config file edits were mostly equivalent, and I believed that using uci would protect me better from config mistakes.
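
Because the two forms can look the same in uci show output, one way to tell them apart is to look at the config file itself. A minimal sketch, assuming the usual UCI file layout of one "option KEY VALUE" or "list KEY VALUE" entry per line. The helper name uci_key_kind is hypothetical, and it only reports what is on disk; it does not decide which form a given key should use.

```shell
#!/bin/sh
# Report whether KEY is stored as "option" or "list" in a UCI config file.
# $1 = config file (e.g. /etc/config/dhcp), $2 = key name (e.g. server)
uci_key_kind() {
    # Leading tabs/spaces are skipped by awk's default field splitting.
    awk -v key="$2" '
        ($1 == "option" || $1 == "list") && $2 == key { print $1 }
    ' "$1" | sort -u
}
```

For example, `uci_key_kind /etc/config/dhcp server` prints list when every server entry is a list item, and both words if the file mixes the two forms. When appending to a list from the CLI, uci add_list is the subcommand meant for that, rather than uci set.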

Thank you, I will try that and see if the problem returns.

Also, change the doh_backup_server entry to a non-DoH server, to be used when the DoH proxy is stopped.

Yes, I will. I had thought that line looked out of place, since the backup server was a redundant address already listed elsewhere. Can I just put '1.1.1.1' there, for example?

To update: With the above changes made, I am having fewer issues. But I still have problems with https_dns_proxy. Whenever I tax the router and download near my ISP's maximum speed, the https_dns_proxy service either crashes or uses 100% memory. However, I can now recover without a reboot by restarting the service. If I know I am going to keep throughput on the router high, I keep the service disabled until I'm done and use normal DNS in the meantime; otherwise the DoH service just crashes or hangs again.

Perhaps I can do some profiling or something on the service to see what it's hung up on?
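
Short of real profiling, a first step could be to log the proxy's memory use over time and line it up with the dnsmasq log. This is only a sketch: it reads VmRSS from /proc, and the process name passed to watch_rss is an assumption (check the actual name with ps). Both function names are made up for illustration.

```shell
#!/bin/sh
# Print the resident memory (VmRSS, in kB) of one process, read from /proc.
# $1 = process ID
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Log the memory of every process with a given name at a fixed interval.
# $1 = process name, $2 = interval in seconds
watch_rss() {
    while sleep "$2"; do
        for pid in $(pidof "$1"); do
            # Timestamp in the same style as the dnsmasq log, to compare the two.
            echo "$(date '+%b %d %H:%M:%S') pid=$pid rss_kb=$(rss_kb "$pid")"
        done
    done
}
```

Started with something like `watch_rss https_dns_proxy 10 >> /tmp/doh_mem.log &`, it writes one line per interval. A gap in the log or a changed pid would suggest the service crashed or was restarted, while a steadily growing rss_kb would point at memory growth under load.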

Smells like a bug in https_dns_proxy. Still, the question is what in your environment triggers it for you.

Do you have SQM set up for your uplink? If not, setting it up may provide some relief for https_dns_proxy.

Logs? Also, https_dns_proxy has been retired in favor of the newer https-dns-proxy, which uses RFC 8484 instead of the JSON API, so it's pretty much all new code. I would try that.