Trying to debug an issue that's been plaguing my network for a while now. The symptom: at random times throughout the day, certain websites stop working. The browser reports 'Server Not Found' and apps report a generic network error. It always self-resolves after 30-60 seconds, and nothing is reported in the OpenWRT logs when it happens.
I always suspected it was a DNS issue, and today I was quick enough to bring up a terminal window and run tracepath against a site that wasn't working. Sure enough, it reported a DNS error:
tracepath: wikipedia.org: Temporary failure in name resolution
What is the significance of the 'Temporary' wording in that error? Is that a specific error from the upstream resolver? Or just a generic error message (i.e. how does tracepath know it's temporary and not permanent?).
I use Stubby in round-robin mode across several upstream resolvers. I have individually checked each one and, at the time of testing, they all worked, so it isn't that one of my upstream pool has permanently failed. Most likely one (or more) of the upstream resolvers is having intermittent issues.
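In case it's useful to anyone, a single DoT upstream can be spot-checked directly, bypassing Stubby entirely, along these lines (a sketch: kdig comes from the knot-dnsutils package, and the 1.1.1.1 address here is only a placeholder, not necessarily one of my pool):

```shell
# Query one DNS-over-TLS upstream directly, bypassing Stubby.
# 1.1.1.1 is an example address -- substitute a resolver from your own pool.
if kdig +tls +timeout=2 +retry=0 @1.1.1.1 wikipedia.org >/dev/null 2>&1; then
    STATUS=OK
else
    STATUS=FAIL
fi
echo "upstream 1.1.1.1: $STATUS"
```

A FAIL here during an outage window, while the other pool members still answer, would finger the bad upstream.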
Other than pulling each resolver out of the pool and monitoring for a few days each (which would take weeks in total), is there a level of logging I could turn on to get quicker insight into what's happening?
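For reference, the knob I've found so far is Stubby's own verbosity via its UCI config — a sketch, and the option name may differ between OpenWRT package versions; level 7 is the most verbose getdns debug level:

```
# /etc/config/stubby -- fragment, assuming the OpenWRT stubby package
config stubby 'global'
        option log_level '7'    # 0=emergency .. 7=debug; very chatty
```

Then `/etc/init.d/stubby restart` and watch `logread -f` — but see below for why this isn't a great option.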
That message indicates that there was no response from the upstream DNS server(s). If a server in the pool is dead, Stubby is supposed to silently try another and blacklist the bad one for a while.
A permanent failure is when a server does respond, but with a message that there is no record of that name registered in the DNS system (NXDOMAIN).
Thanks, that's what I thought it would mean. Clients on the network are configured (with a catch-all firewall rule to sweep up any rogues) to use only the local OpenWRT router as their DNS resolver. dnsmasq then forwards to Stubby locally, and Stubby picks a resolver from its pool in round-robin fashion. So I assume one upstream resolver is failing, and when Stubby moves on to the next resolver it starts working again.
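For context, the dnsmasq-to-Stubby hand-off is the usual OpenWRT arrangement, roughly like this (a fragment; 5453 is the Stubby package's default local listen port, but check yours):

```
# /etc/config/dhcp -- fragment: dnsmasq forwards all queries to Stubby
config dnsmasq
        option noresolv '1'            # ignore ISP resolvers in /etc/resolv.conf
        list server '127.0.0.1#5453'   # Stubby's local DNS-over-TLS forwarder
```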
Frustratingly, Stubby seems to lack the ability to log such individual upstream failures. It will only log an error if ALL upstream resolvers are unavailable (and will do so mercilessly!)
So I'm stuck on how to identify which of the upstream servers is having a hard time.
I would have to run this during an error condition, right? Part of the problem is being next to a device with a terminal when it happens! It took me a week to get the chance to run a tracepath!
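One way around the "never at a terminal" problem might be a small watchdog that polls each upstream directly and timestamps any failure — a sketch, assuming kdig (knot-dnsutils) is installed; the addresses and the round count are placeholders, and in practice you'd run it in an infinite loop or from cron:

```shell
#!/bin/sh
# dns-watch.sh -- poll each DoT upstream directly and log a timestamped result.
# kdig is from the knot-dnsutils package; the IPs below are examples only --
# substitute the resolvers actually in your Stubby pool.
UPSTREAMS="1.1.1.1 9.9.9.9"
LOG=/tmp/dns-watch.log
ROUNDS=2          # small number for illustration; loop forever in real use

: > "$LOG"        # start with a fresh log
i=0
while [ "$i" -lt "$ROUNDS" ]; do
    for ip in $UPSTREAMS; do
        if kdig +tls +timeout=2 +retry=0 @"$ip" wikipedia.org >/dev/null 2>&1; then
            echo "$(date '+%F %T') OK   $ip" >> "$LOG"
        else
            echo "$(date '+%F %T') FAIL $ip" >> "$LOG"
        fi
    done
    i=$((i + 1))
    sleep 1
done
```

Left running, a grep for FAIL in the log should show which upstream was misbehaving at the moment a client saw the outage.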
I should say: it lacks the ability to log JUST individual upstream failures. The only option is to log every DNS query, which fills the log up in about 10 seconds.
It's not supposed to return a SERVFAIL unless they all failed to respond. That generally means there is no link to the Internet at all, or, if you are using servers that are only reachable through a VPN tunnel, that the tunnel is down.
Thanks. This shifted my investigation to dnsmasq. I believe it might be related to having DNSSEC enabled in dnsmasq: dnsmasq will set the status of any query to SERVFAIL if it cannot validate its DNSSEC chain. This could cause a single upstream resolver with a trust-chain issue to appear to sporadically issue SERVFAIL responses to clients. Still interested in how this would occur, however, as the domains that fail are not 'badly configured' domains that would normally result in DNSSEC failures. Seems to be very challenging to debug.
Am testing with DNSSEC disabled and it looks good so far, so I will probably leave it disabled.
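For anyone finding this later: the toggle in question lives in dnsmasq's UCI config (a fragment; option names are per the OpenWRT dhcp config and may differ by release — on other distros it's the `dnssec` line in dnsmasq.conf):

```
# /etc/config/dhcp -- fragment: the DNSSEC option I toggled off
config dnsmasq
        option dnssec '0'    # was '1'; validation was causing sporadic SERVFAILs
```

A useful spot-check while debugging: a name that SERVFAILs normally but resolves when queried with dig's +cd ("checking disabled") flag is failing DNSSEC validation, not resolution.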