[Solved] Luci floods dnsmasq log with PTR queries

jow · August 31, 2021, 4:09pm

I guess I didn’t quite understand your issue then. What exactly do you mean by flooding? The fact that a bunch of PTRs are queried over and over again every few seconds?

The obvious solution would be to not log those. If that is not possible, then reenabling the local dnsmasq (without DHCP service) should reduce the amount of queries that end up upstream.

If by flooding you mean many queries at a much shorter interval than every few seconds, then this could be a bug that needs addressing. To track it down, it would be useful to know exactly which pages trigger that behavior.

Edit: seems I conflated the OP's issue with the network topology described by @akardam.

If logging is enabled at the local dnsmasq and something local is querying a lot, then a lot of logs are written, obviously. Moving caching logic into the local LuCI resolver could be a solution, but it would essentially duplicate dnsmasq’s caching logic which doesn’t make that much sense either. The idea is to rely upon dnsmasq to cache and deduplicate queries.

akardam · August 31, 2021, 5:44pm

I found that the following settings produced the desired results:

Re-enable and start dnsmasq (firewall and odhcpd remained disabled and stopped)

Network > Interfaces > LAN > Advanced Settings:

Use custom DNS servers - removed all entries
DNS search domains - removed entry

Network > Interfaces > LAN > DHCP Server > General Setup:

Ignore interface - unchecked (was already - previously unchecked during initial setup)

Network > DHCP and DNS > General Settings:

Domain required - unchecked
Authoritative - unchecked
Local domain - set to <my.internal.domain>
DNS forwardings - added <my.internal.dns.server.ip> entry
Rebind protection - unchecked
Local service only - checked
Non-wildcard - checked
Exclude interfaces - added "lan" entry

Network > DHCP and DNS > Advanced Settings:

Filter private - unchecked
Expand hosts - checked (was already)

Save and apply all changes
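
For anyone who prefers the command line, the same settings could probably be applied with uci, roughly as sketched below. The option names are my own mapping of the LuCI labels to /etc/config/network and /etc/config/dhcp and were not part of the steps above, so treat them as assumptions and check them against your release. The function name and its two arguments (local domain, internal DNS server IP) are placeholders.

```shell
#!/bin/sh
# Sketch only: uci equivalent of the LuCI settings listed above.
# ASSUMPTION: the option names are my mapping of the LuCI labels.
# $1 = local domain, $2 = internal DNS server IP (both placeholders).
apply_local_dnsmasq_settings() {
    domain="$1"
    upstream="$2"

    # Network > Interfaces > LAN > Advanced Settings
    uci -q delete network.lan.dns          # Use custom DNS servers: removed
    uci -q delete network.lan.dns_search   # DNS search domains: removed

    # Network > DHCP and DNS > General Settings
    uci set 'dhcp.@dnsmasq[0].domainneeded=0'        # Domain required: unchecked
    uci set 'dhcp.@dnsmasq[0].authoritative=0'       # Authoritative: unchecked
    uci set "dhcp.@dnsmasq[0].domain=$domain"        # Local domain
    uci add_list "dhcp.@dnsmasq[0].server=$upstream" # DNS forwardings
    uci set 'dhcp.@dnsmasq[0].rebind_protection=0'   # Rebind protection: unchecked
    uci set 'dhcp.@dnsmasq[0].localservice=1'        # Local service only: checked
    uci set 'dhcp.@dnsmasq[0].nonwildcard=1'         # Non-wildcard: checked
    uci add_list 'dhcp.@dnsmasq[0].notinterface=lan' # Exclude interfaces

    # Network > DHCP and DNS > Advanced Settings
    uci set 'dhcp.@dnsmasq[0].boguspriv=0'           # Filter private: unchecked
    uci set 'dhcp.@dnsmasq[0].expandhosts=1'         # Expand hosts: checked

    # Save all changes
    uci commit network
    uci commit dhcp
}
```

It would be called as `apply_local_dnsmasq_settings <my.internal.domain> <my.internal.dns.server.ip>` with the real values filled in, followed by `/etc/init.d/dnsmasq enable` and `/etc/init.d/dnsmasq start` for the re-enable and start step.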

dnsmasq only listening on localhost:

$ netstat -ltupn | grep -E 'dnsmasq|:53'
tcp        0      0 127.0.0.1:53            0.0.0.0:*               LISTEN      16057/dnsmasq
udp        0      0 127.0.0.1:53            0.0.0.0:*                           16057/dnsmasq
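
If you want to check this without eyeballing the output, a small filter along these lines might help. It is only a sketch: the function name is made up, and it assumes the netstat column layout shown above, where the local address is the fourth field.

```shell
#!/bin/sh
# Sketch: reads `netstat -ltupn` output on stdin and prints every
# listener on port 53 whose local address is not a loopback address.
# No output means DNS is bound to loopback only.
# ASSUMPTION: the local address is the 4th whitespace-separated field.
non_loopback_dns_listeners() {
    awk '$4 ~ /:53$/ && $4 !~ /^(127\.|::1:)/ { print $1, $4 }'
}
```

Usage would be `netstat -ltupn | non_loopback_dns_listeners`; with the output above it prints nothing.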

ping an internal domain host by hostname only works:

$ ping host-foobar
PING host-foobar (x.x.x.x)
64 bytes from x.x.x.x: (bla bla bla)
64 bytes from x.x.x.x: (bla bla bla)
...

wireshark shows two initial PTR lookups (PI IP and default gateway IP) when logging into the web GUI. Then I can flip back and forth between the status overview, interfaces page, etc, w/o additional lookups. Haven't checked the behavior after the response TTL expires, but I imagine it will be the same - only 2 lookups and 2 responses.

jow · August 31, 2021, 6:06pm

That's exactly how it is supposed to work, and how it should behave in the default setup. I guess the OP's problem is that he enabled query logging on the local dnsmasq and that, unfortunately, dnsmasq logs inbound queries before consulting its cache. This means any resolve attempt is logged, even if it can be (and is) answered from dnsmasq's internal cache.
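
To see how much of a query log is actually made up of repeated PTR lookups, something like the following could be used. It is a rough sketch: the function name is made up, and it assumes the usual dnsmasq query log lines of the form `... dnsmasq[PID]: query[PTR] <name> from <client>`.

```shell
#!/bin/sh
# Sketch: reads dnsmasq log lines on stdin and prints "<count> <name>"
# for every name queried via PTR, most frequent first.
# ASSUMPTION: each logged query contains the token "query[PTR]"
# immediately followed by the queried name.
count_ptr_queries() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "query[PTR]") seen[$(i + 1)]++
    }
    END { for (name in seen) print seen[name], name }' | sort -rn
}
```

On the router it could be fed from the system log, e.g. `logread | count_ptr_queries`, or from whatever file the query log is written to.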

The only solution to that issue would be not producing many queries in the first place which implies duplicating (some of) the DNS response cache logic from dnsmasq into the LuCI code itself. That brings many additional problems or undesired effects though:

if caching happens on the browser/JS side, the queries are still repeated after every page change
if caching happens somewhere on the router side (e.g. in the responsible rpcd rdns plugin) then you're essentially storing the cached DNS replies twice in system memory (once dnsmasq, once rpcd or similar)
if caching negative DNS replies, hosts will keep having no hostname even if they come online later unless the secondary cache is purged or the page forcibly reloaded
if not caching negative DNS replies, queries for offline hosts will keep getting repeated
- maybe solvable with a trade-off of unconditionally caching negative replies for 30s to 2m or so
the suggestion to not query devices "which are neither connected nor have a valid lease record" is not easy to implement since not every host uses DHCP (PTR querying is also done for neighbour table entries) and not every host responds to e.g. ping tests, so you can only blindly query and see what you get
even if the PTR query flood were not an issue in the first place, the OP's logging setup could still pose a problem in case random LAN clients perform lots of DNS queries. Unbounded logs in /tmp are always problematic; in such a setup it is not a question of if the system memory gets exhausted, but when

To improve the general "query load" though I think it makes sense to ...

implement client side caching within the same page, either adhering to DNS reply provided TTLs or simply caching forever (which slightly worsens the UX for hosts being offline on page load and coming online later)
- this should solve the case of an autorefresh page being left open and causing logs to overflow
maybe implement some system wide control to disable LuCI originated PTR requests completely

Some points on why PTR lookups are done in the first place:

hostnames for locally connected clients can come from many different sources such as the lease database, the ethers(5) database, an arbitrary number of hosts(5) files or directories of such files which may even appear and disappear at runtime
- reasoning is to query local dnsmasq / the configured resolver and follow its normal resolve logic instead of reimplementing all those data sources again, manually
DHCP might be served off-device by another host in the network, which means name information can only be obtained via DNS
likewise, hosts using static IP settings and only discoverable via the local ARP / IPv6 neighbour table might not be present in the lease table at all but have associated DNS records in either an upstream NS or any of the locally available hosts(5) files

Barney · August 31, 2021, 6:48pm

My problem is:

My questions are:

Do you need to query every 5 secs?
Do you need to query for non-connected devices?

Lowering the query frequency to 30 or 60 secs would tremendously reduce the amount of log data.
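
As a back-of-the-envelope illustration, the sketch below estimates the number of PTR queries per day for a few poll intervals. The 20 hosts are an invented example figure, and it assumes one PTR query per host per poll with a status page left open all day, which may not match a real setup.

```shell
#!/bin/sh
# Sketch: estimated PTR queries per day.
# ASSUMPTIONS: one PTR query per host per poll cycle, page open for
# 24 hours (86400 s). Usage: ptr_queries_per_day HOSTS INTERVAL_SECONDS
ptr_queries_per_day() {
    hosts="$1"
    interval="$2"
    echo $(( 86400 * hosts / interval ))
}

# Hypothetical example: 20 hosts at 5 s, 30 s and 60 s poll intervals.
for interval in 5 30 60; do
    echo "${interval}s: $(ptr_queries_per_day 20 "$interval") queries/day"
done
```

Under these assumptions that is 345600 queries per day at 5 s versus 57600 at 30 s and 28800 at 60 s.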

jow · September 1, 2021, 9:03am

Those 5s are the system default for all refresh cycles. Since we want status information to be reasonably current (e.g. to be able to debug network issues, see hosts come and go in near real time, follow wireless rate selection, traffic counters etc.), a relatively small interval has been chosen. You can modify it via the CLI using uci set luci.main.pollinterval=30; uci commit luci. The value is specified in seconds.
Yes, since there is no reasonable definition of "not connected"; we want to cover e.g. non-DHCP hosts.
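
If the interval gets changed more than once, the two uci commands could be wrapped in a small helper, sketched below. The function name and the input check are my own additions; only the `luci.main.pollinterval` option and the two uci calls come from the explanation above.

```shell
#!/bin/sh
# Sketch: set the LuCI poll interval (in seconds) and commit it.
# The helper name and the validation are illustrative additions.
set_poll_interval() {
    case "$1" in
        ''|*[!0-9]*)
            # Reject empty or non-numeric input
            echo "usage: set_poll_interval SECONDS" >&2
            return 1
            ;;
    esac
    uci set "luci.main.pollinterval=$1" && uci commit luci
}
```

For example, `set_poll_interval 30` does the same as the one-liner above, and `uci get luci.main.pollinterval` can be used afterwards to confirm the stored value.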

Barney · September 2, 2021, 12:27pm

Okay, that seems reasonable.

Is the uci way the only way to change the poll interval? I prefer to change values/options in a config file.

I consider changing the poll interval as an acceptable work-around for my problem.

jow · September 2, 2021, 5:32pm

You can achieve the same effect by editing /etc/config/luci and adding/changing the option pollinterval ... line in the config core main section.

Barney · September 4, 2021, 12:48pm

Thanks a lot for your hint, it works.

I appreciate your help.
