OOM-killer kills dnsmasq while free physical RAM remains

I'm not sure we'd want to enable the process count restriction by default. I suspect it is only necessary when large adblock lists are being used. A setting of 1 is quite aggressive, meaning that only a single child process can be forked to handle TCP requests.

A more robust solution might be to add functionality to LuCI's dnsmasq page so it can be configured, while leaving it unset by default and not changing the init.d script...
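
Purely as an illustration of what that might end up looking like (none of this exists today; the uci option name below is made up, only the process limit itself comes from the patched dnsmasq discussed in this thread):

uci set dhcp.@dnsmasq[0].maxprocs='5'   # hypothetical option, not handled by the current init.d script
uci commit dhcp
/etc/init.d/dnsmasq restart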

Well, whatever is felt as appropriate.

I just needed to put something in there so the option was recognized, since the discussion made clear that setting it to 1 made the issue go away.

Since the default is 20 and the minimum would be 1, what would be a safe middle ground as a default? 3? 5? 10? Or just leave it out of the dhcp config entirely? Would that fall back to the default?

For a quick test you can try merging this developer's staging branch: https://git.openwrt.org/?p=openwrt/staging/ldir.git;a=shortlog;h=refs/heads/mine

You can also see what happens if you revert back to dnsmasq 2.85 with:

git revert ed7769aa405fe246b89c9c97b7fb552dfb0b4995
git revert 02a2b44eabf607fb5405ff0d7da4ad0748d3e1b1
git revert d2d0044ebf01b71f63cde609e09f6ac68cdfeccb

1 Like

Yes, just leaving it out entirely would be the way to go. That will leave it at the default setting of 20.

If this is the cause of your problem, it's likely only to occur when a large adblock list is being used AND DNS lookups are being made over TCP.

Probably the kind of thing that should be documented in the wiki so that anyone experiencing oom-killer crashes of this nature can go and set the parameter.

1 Like

I've set it to max-procs=5 and loaded a set of blocked domains.

I will see how long it takes to kick over, and then reduce it and repeat. I'll let you know.

It would probably be quicker if you first established exactly how many times the process can fork before you're out of memory. In the thread where I originally posted the dnsmasq patch, I mentioned how you can use netcat to determine this. Just set up a loop in a shell script that runs netcat -t 127.0.0.1 53 and see how many times it iterates before you're out of memory. Once you've determined that, configure an appropriate max-procs value and leave it running to see if it still falls over.
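
Something along these lines should do; just a rough sketch that adds a counter and a free-memory readout to the idea above:

#!/bin/sh
# rough sketch only: keep opening TCP connections to dnsmasq and print a
# running count plus free memory, to see where it falls over
i=0
while true; do
       netcat -t 127.0.0.1 53 &
       i=$((i + 1))
       echo "connection $i: $(grep MemFree /proc/meminfo)"
done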

2 Likes

I need to get video capture of it, but something is very off with this.

If I manually run netcat -t 127.0.0.1 53 & a few times from SSH, I can hit the max 20 process forks without issue.

The moment I run a script (see below), I can watch the memory leak happen. This happens even with only ONE instance running. The memleak stops the moment I kill the script below, but it does NOT free the RAM. Again, this does NOT happen if I manually invoke netcat -t 127.0.0.1 53 &. I know there is a memory leak, but I'm not sure it's in dnsmasq; I think it's just manifesting in dnsmasq.

root@OpenWrt:/etc# cat dnstest.sh
#!/bin/sh

while true; do
       instances=$(ps | grep netcat | wc -l)
       [ "$instances" -ge 1 ] && continue
       netcat -t 127.0.0.1 53 &
done

Hard to tell what's going on here, but maybe this is a symptom of a race condition of some sort? The script you posted is effectively a denial of service, so the key difference is that it executes much more quickly than your interactive commands. What happens if you put in a sleep (2 seconds should be good) just before the netcat command?
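
i.e. your script with the sleep added just before the netcat call (sketch only):

#!/bin/sh
# same loop as before, but throttled: give dnsmasq a couple of seconds
# between connections to rule out a pure timing effect
while true; do
       instances=$(ps | grep netcat | wc -l)
       [ "$instances" -ge 1 ] && continue
       sleep 2        # pause before forking the next netcat
       netcat -t 127.0.0.1 53 &
done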

1 Like

I've enabled CONFIG_KERNEL_SLABINFO and CONFIG_DEBUG_KMEMLEAK, and installed the slabtop utility.

Even without loading anything other than boot, I get http://hastebin.com/vegoranozu.yaml over and over... So, something else is going on. I'm still trying to stabilize things, so I've not put the sleep stuff in yet for testing.
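
If CONFIG_DEBUG_KMEMLEAK made it into the running kernel, you can also query it directly over debugfs:

mount -t debugfs none /sys/kernel/debug 2>/dev/null   # usually mounted already
echo scan > /sys/kernel/debug/kmemleak                # trigger a scan
cat /sys/kernel/debug/kmemleak                        # dump suspected leaks, if any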

To prevent that kind of duplication in memory, you can simply tick this option in adblock under Advanced DNS Settings:

(screenshot: the adblock option under Advanced DNS Settings)

Of course, the help text is a bit irritating - replace "as well" with "begins" ... :wink:
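
For reference, the same checkbox can be flipped from the shell; adb_dnsflush is my assumption for the underlying option name:

uci set adblock.global.adb_dnsflush='1'   # assumed option name behind that checkbox
uci commit adblock
/etc/init.d/adblock reload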

To be honest, I do not connect "DNS cache" with the "DNS ruleset".

I think that "DNS cache" in this contexts mainly means the cache of the fetched addresses, so that already fetched add site IP addresses get cleared along the ruleset update and the new ruleset gets immediate effect.

But I do not mentally connect it to the static adblock ruleset.

Two ideas for you:

  • clarify the help text. Maybe add a note about the RAM duplication

  • tweak adblock to mitigate the problem with ultra-large rulesets. For example, if the number of blocked hosts is over 200k addresses (??), the cache gets cleared by default. Or something like that (rough sketch below). Needing a half-gigabyte RAM footprint for adblock lists is crazy in any case, but we might try to be somewhat helpful for those who use gigantic blocklists.
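
A rough illustration of that second idea (the list path and the threshold are both assumptions):

# illustrative only: flush the cache when the generated blocklist is huge
list="/tmp/dnsmasq.d/adb_list.overall"    # assumed adblock output path
if [ -f "$list" ] && [ "$(wc -l < "$list")" -gt 200000 ]; then
       /etc/init.d/dnsmasq restart        # crude way to drop the DNS cache
fi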

I honestly don't think this is an adblock problem. With nearly 800k blocked domains, the compressed backup was only 10 MB... even with text compression, the lists aren't going to be THAT big.

This appears to be a memleak somewhere manifesting in dnsmasq, and I've got no idea how to track it down yet.

1 Like

The dnsmasq runtime config, including the list of blocked hosts, is stored in /tmp, i.e. on tmpfs.
The size of /tmp is smaller than RAM, and I'm not sure whether its compression (if any) is good enough.
Apparently the config is loaded into RAM uncompressed, otherwise it wouldn't consume so much RAM.
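
A couple of quick checks show the on-disk versus in-RAM cost (the /tmp/dnsmasq.d path is an assumption, adjust to whatever adblock actually writes on your build):

df -h /tmp                                   # tmpfs size and usage
ls -lh /tmp/dnsmasq.d/ 2>/dev/null           # size of the generated list file(s), path assumed
grep VmRSS /proc/$(pidof dnsmasq)/status     # dnsmasq's resident memory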

This should be easy to figure out... I will just leave Adblock out :smiley:

I'll run a defconfig, add luci-ssl and the above kernel debug options, leave the debug symbols in, etc., and see what happens.

Well, I can confirm it isn't Adblock. After the uptime below, look at the used memory. This is without Adblock in the image at all.

(screenshot: memory usage)

(screenshot: process list, sorted by Mem%)

Suggestions?

Try isolating the cause of the problem by minimizing the dnsmasq config.
If the issue persists, you should escalate it upstream.
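
For example (a sketch using standard dnsmasq flags; port and upstream are arbitrary choices), you could stop the service and run a near-empty instance in the foreground, then poke it with the same netcat trick on port 5353 and watch its RSS:

/etc/init.d/dnsmasq stop
dnsmasq --keep-in-foreground --conf-file=/dev/null --no-resolv --server=9.9.9.9 --port=5353 --log-queries
# in another shell: netcat -t 127.0.0.1 5353 &
# and watch:        grep VmRSS /proc/$(pidof dnsmasq)/status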

1 Like

It's the stock network setup, including DNS/DHCP, but *shrug*. Any suggestions on 1) which upstream, and 2) the best way to figure out where the leak is?

root@OpenWrt:/# uci show network
network.loopback=interface
network.loopback.device='lo'
network.loopback.proto='static'
network.loopback.ipaddr='127.0.0.1'
network.loopback.netmask='255.0.0.0'
network.globals=globals
network.globals.ula_prefix='fd1b:fb67:86b3::/48'
network.@device[0]=device
network.@device[0].name='br-lan'
network.@device[0].type='bridge'
network.@device[0].ports='eth1' 'eth2'
network.lan=interface
network.lan.device='br-lan'
network.lan.proto='static'
network.lan.ipaddr='192.168.1.1'
network.lan.netmask='255.255.255.0'
network.lan.ip6assign='60'
network.wan=interface
network.wan.device='eth0'
network.wan.proto='dhcp'
network.wan6=interface
network.wan6.device='eth0'
network.wan6.proto='dhcpv6'
root@OpenWrt:/# uci show dhcp
dhcp.@dnsmasq[0]=dnsmasq
dhcp.@dnsmasq[0].domainneeded='1'
dhcp.@dnsmasq[0].boguspriv='1'
dhcp.@dnsmasq[0].filterwin2k='0'
dhcp.@dnsmasq[0].localise_queries='1'
dhcp.@dnsmasq[0].rebind_protection='1'
dhcp.@dnsmasq[0].rebind_localhost='1'
dhcp.@dnsmasq[0].local='/lan/'
dhcp.@dnsmasq[0].domain='lan'
dhcp.@dnsmasq[0].expandhosts='1'
dhcp.@dnsmasq[0].nonegcache='0'
dhcp.@dnsmasq[0].authoritative='1'
dhcp.@dnsmasq[0].readethers='1'
dhcp.@dnsmasq[0].leasefile='/tmp/dhcp.leases'
dhcp.@dnsmasq[0].resolvfile='/tmp/resolv.conf.d/resolv.conf.auto'
dhcp.@dnsmasq[0].nonwildcard='1'
dhcp.@dnsmasq[0].localservice='1'
dhcp.@dnsmasq[0].ednspacket_max='1232'
dhcp.lan=dhcp
dhcp.lan.interface='lan'
dhcp.lan.start='100'
dhcp.lan.limit='150'
dhcp.lan.leasetime='12h'
dhcp.lan.dhcpv4='server'
dhcp.lan.dhcpv6='server'
dhcp.lan.ra='server'
dhcp.lan.ra_slaac='1'
dhcp.lan.ra_flags='managed-config' 'other-config'
dhcp.wan=dhcp
dhcp.wan.interface='wan'
dhcp.wan.ignore='1'
dhcp.odhcpd=odhcpd
dhcp.odhcpd.maindhcp='0'
dhcp.odhcpd.leasefile='/tmp/hosts/odhcpd'
dhcp.odhcpd.leasetrigger='/usr/sbin/odhcpd-update'
dhcp.odhcpd.loglevel='4'

I watched the Mem used slowly crawl up until it died.

This is the crash-log.
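
For what it's worth, a trivial logger (sketch below) would capture that crawl over time and show whether the growth tracks dnsmasq's own RSS or the system as a whole:

#!/bin/sh
# log free memory and dnsmasq's RSS once a minute
while true; do
       date
       grep -E 'MemFree|MemAvailable' /proc/meminfo
       grep VmRSS /proc/"$(pidof dnsmasq)"/status 2>/dev/null
       sleep 60
done >> /tmp/memlog.txt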

So, I put the 5.4 kernel on the device and have witnessed no apparent memory increase.

@adrianschmutzler: It appears to be something with the 5.10 kernel and the main branch on the Octeon target. I was mis-attributing the issues I was seeing. Suggestions would be welcome.

1 Like

Maybe that's also what stintel was seeing here: https://github.com/openwrt/openwrt/pull/4610#issuecomment-931213870

Unfortunately, I'm no help at all here, and it will be quite hard to find somebody to debug stuff like this on octeon.

1 Like