Dnsmasq occasionally caches the wrong reply? Getting empty replies

Hello. I'm using OpenWrt (22.03.2 r19803-9a599fee93 / LuCI openwrt-22.03 branch git-22.304.65171-ec905e6) on an EdgeRouter X.

I kept running into "WARN: NO valid IP found" warnings from ddns-scripts, so I dug deeper. Indeed, I would occasionally get

/usr/bin/host -t AAAA <domain>
<domain> has no AAAA record

as a reply. I then ran

while true; do /usr/bin/host -t AAAA <domain>  ; sleep 1; done

and the output randomly toggles between

<domain> has IPv6 address 2001:...
<domain> has no AAAA record
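
To put a number on how often the empty reply shows up, the output of the loop can be tallied. This is only a sketch for a POSIX shell such as OpenWrt's ash; the function names `classify_host_reply` and `tally_replies` are made up for illustration, and the patterns are simply the two `host` messages quoted above, which other versions of `host` may word differently.

```shell
#!/bin/sh
# Classify each line of `host -t AAAA` output read from stdin.
# Prints "ok", "empty" or "other" per line. The patterns are the two
# messages quoted in this post (assumption: your host prints the same text).
classify_host_reply() {
    while IFS= read -r line; do
        case "$line" in
            *"has IPv6 address"*)   echo ok ;;
            *"has no AAAA record"*) echo empty ;;
            *)                      echo other ;;
        esac
    done
}

# Count the classifications read from stdin (one per line) and print a summary.
tally_replies() {
    awk '{ n[$1]++ } END { printf "ok=%d empty=%d other=%d\n", n["ok"], n["empty"], n["other"] }'
}
```

It could be used like this, with the summary printed once the loop finishes:

for i in $(seq 1 60); do /usr/bin/host -t AAAA <domain>; sleep 1; done | classify_host_reply | tally_replies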

I also run a PiHole in my network and have configured dnsmasq to use it as the upstream server:

# /etc/config/
config dnsmasq
	option domainneeded '1'
	option localise_queries '1'
	option rebind_protection '1'
	option rebind_localhost '1'
	option local '/lan/'
	option domain 'lan'
	option expandhosts '1'
	option authoritative '1'
	option readethers '1'
	option leasefile '/tmp/dhcp.leases'
	option resolvfile '/tmp/resolv.conf.d/resolv.conf.auto'
	option localservice '1'
	option ednspacket_max '1232'
	list server '192.168.1.6'

A tcpdump reveals that I do get a valid reply:

 tcpdump -n -i br-lan.100 -vvv 'port 53'
tcpdump: listening on br-lan.100, link-type EN10MB (Ethernet), capture size 262144 bytes
19:07:03.514377 IP (tos 0x0, ttl 64, id 31165, offset 0, flags [DF], proto UDP (17), length 75)
    192.168.1.1.38852 > 192.168.1.6.53: [bad udp cksum 0x83a0 -> 0xd1b4!] 10779+ AAAA? <domain>. (47)
19:07:03.667436 IP (tos 0x0, ttl 64, id 49091, offset 0, flags [DF], proto UDP (17), length 103)
    192.168.1.6.53 > 192.168.1.1.38852: [udp sum ok] 10779 q: AAAA? <domain>. 1/0/0 <domain>. [1m] AAAA 2001:... (75)
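
In tcpdump's DNS output, a response carries a triple such as 1/0/0, which gives the answer, authority and additional record counts. A small filter makes it easier to spot upstream replies that arrive with zero answers. This is a sketch; the name `dns_answer_count` is hypothetical, and it assumes the triple is surrounded by spaces as in the capture above.

```shell
#!/bin/sh
# For each tcpdump line on stdin that contains an "an/ns/ar" count triple,
# print the first number (the answer count). Lines without such a triple,
# for example the query lines, produce no output.
dns_answer_count() {
    sed -n 's|.* \([0-9][0-9]*\)/[0-9][0-9]*/[0-9][0-9]* .*|\1|p'
}
```

For example, `tcpdump -n -l -i br-lan.100 -vvv 'port 53' | dns_answer_count` would print one number per response; a 0 would mean the upstream server itself sent a reply without an AAAA answer.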

But the host utility (because of dnsmasq, I assume?) still emits the "has no AAAA record" message.

Could this have something to do with the domain's short TTL of 60 seconds? (INWX, where I host this domain, sets it up that way specifically for dynamic DNS.) Or am I running into some sort of timeout?

Edit:
Additional info: tcpdump reveals a delay of about 170 ms between query and reply.
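
The delay can be computed from any pair of tcpdump timestamps instead of being estimated by eye. A minimal sketch, assuming timestamps of the form HH:MM:SS.uuuuuu taken on the same day; the helper name `ts_delta_us` is made up for this example.

```shell
#!/bin/sh
# Print the difference in microseconds between two tcpdump timestamps.
# $1: earlier timestamp, $2: later timestamp, both as HH:MM:SS.uuuuuu
# (assumes exactly six fractional digits and no midnight rollover).
ts_delta_us() {
    printf '%s %s\n' "$1" "$2" | awk '
        function to_us(ts,    p) {
            split(ts, p, /[:.]/)
            return ((p[1] * 60 + p[2]) * 60 + p[3]) * 1000000 + p[4]
        }
        { printf "%d\n", to_us($2) - to_us($1) }'
}
```

For the query/reply pair quoted earlier in this post, `ts_delta_us 19:07:03.514377 19:07:03.667436` gives 153059 microseconds, i.e. about 153 ms; other pairs in the capture may differ.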

dnsmasq is version 2.86-15

Have you checked your pihole to see if it too is having the same resolution problems?

I switched from bind-host to drill for testing.

The PiHole doesn't have the resolution problem. That is, if I run the following on my EdgeRouter under OpenWRT (via ssh)

while true; do /usr/bin/drill -V0 -u <domain> AAAA @192.168.1.6 ; sleep 1 ; done

I will always get the correct reply, even when the TTL expires.
Meanwhile, when I run

while true; do /usr/bin/drill -V0 -u <domain> AAAA; sleep 1 ; done

the local dnsmasq is queried (and forwards the query to the PiHole) and, as soon as the TTL expires, returns an empty reply.

I'm at a loss. But I have dnsmasq/DHCP disabled on my router and let my PiHole do all that work, so I'm not that familiar with OpenWrt's implementation of those services or their configuration.

I'm sure someone will come along and provide you with an answer.

Good Luck

I'm currently trying to circumvent the problem by disabling caching in the OpenWRT dnsmasq (since PiHole already does the caching). It worked yesterday, but stopped working today. I'm occasionally getting an empty reply (if the TTL is over 40 s?) if I query OpenWRT's dnsmasq.

I haven't tried it with any other domains yet, though. Maybe it is related to ddns-scripts checking the status?

By default, the wan isn't using your dnsmasq for DNS lookups unless you tell it to.
In the stock config, that would create a loop.

It'll use the upstream DNS servers provided by your ISP.

root@OpenWrt:~# service dnsmasq status
inactive
root@OpenWrt:~# ping www.google.gr
PING www.google.gr (142.251.36.3): 56 data bytes
64 bytes from 142.251.36.3: seq=0 ttl=107 time=28.675 ms

Uhm. Did I set the upstream DNS correctly? I went to Network -> DHCP and DNS -> DNS forwardings and simply set the PiHole (192.168.1.6) there.

/etc/resolv.conf itself contains 127.0.0.1 as server.

Check /tmp/resolv.conf.d/resolv.conf.auto

The WAN DNS config is under the wan interface > Advanced Settings.
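
Since the dnsmasq config in this thread points `resolvfile` at that file, any nameserver listed there is used by dnsmasq in addition to the `list server` entry. A quick way to see whether something other than the PiHole is in there, sketched for a POSIX shell; the function name `unexpected_nameservers` is hypothetical.

```shell
#!/bin/sh
# Print every nameserver in a resolv.conf-style file that is NOT the
# expected upstream. No output means only the expected server is listed.
# $1: path to the resolv file, $2: expected upstream address
unexpected_nameservers() {
    awk -v want="$2" '$1 == "nameserver" && $2 != want { print $2 }' "$1"
}
```

For the setup described here, that would be `unexpected_nameservers /tmp/resolv.conf.d/resolv.conf.auto 192.168.1.6`; any address it prints is an extra upstream that dnsmasq may also query.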

Thanks for the hint! That does indeed solve the problem. The "loop" diagnosis in particular was probably right: two queries arrived at the PiHole at nearly the same time, one from OpenWrt and one from the upstream router. That led to the error.