Dnsmasq crashing intermittently when disabling interfaces, would like to adjust procd respawn parameters

Hey I'm running the following setup.

Model	ELECOM WRC-2533GST2
Architecture	MediaTek MT7621 ver:1 eco:3
Target Platform	ramips/mt7621
Firmware Version	OpenWrt 22.03.2 r19803-9a599fee93

I've made quite afew customizations including openvpn, adblock, etc. They don't directly relate to what I am attempting to do and I'm uncertain if they are the cause of the problem or not.

The problem occurs intermittently when I connect to the openwrt box via ssh and execute "wifi down".
Periodically dnsmasq dies due to repeated failures. I've included the exact error message below.

Sun Feb  5 19:41:08 2023 kern.info kernel: [290008.019012] device wlan0 left promiscuous mode
Sun Feb  5 19:41:08 2023 kern.info kernel: [290008.028743] br-lan: port 5(wlan0) entered disabled state
Sun Feb  5 19:41:08 2023 daemon.notice netifd: Wireless device 'radio1' is now down
Sun Feb  5 19:41:09 2023 daemon.notice netifd: Network device 'wlan0' link is down
Sun Feb  5 19:41:09 2023 daemon.notice netifd: Wireless device 'radio0' is now down
Sun Feb  5 19:41:11 2023 daemon.info dnsmasq[1]: exiting on receipt of SIGTERM
Sun Feb  5 19:41:11 2023 daemon.info procd: Instance dnsmasq::cfg01411c s in a crash loop 6 crashes, 0 seconds since last crash

The error does not always happen and sometimes dnsmasq dies and restarts fine.

Sun Feb  5 21:06:27 2023 kern.info kernel: [295126.929874] device wlan0 left promiscuous mode
Sun Feb  5 21:06:27 2023 kern.info kernel: [295126.939353] br-lan: port 4(wlan0) entered disabled state
Sun Feb  5 21:06:27 2023 daemon.notice netifd: Network device 'wlan0' link is down
Sun Feb  5 21:06:27 2023 daemon.notice netifd: Wireless device 'radio1' is now down
Sun Feb  5 21:06:28 2023 daemon.notice netifd: Wireless device 'radio0' is now down
Sun Feb  5 21:06:29 2023 daemon.info dnsmasq[1]: exiting on receipt of SIGTERM
Sun Feb  5 21:06:29 2023 daemon.info dnsmasq[1]: started, version 2.86 cachesize 150
Sun Feb  5 21:06:29 2023 daemon.info dnsmasq[1]: DNS service limited to local subnets

EDIT:
Just to be absolutely clear there were no other instances of dnsmasq crashing in the hour prior to the error the only prior instance was the following. So this is not matter of 5 errors occurring within the default 3600 second period specified for respawn_threshold.

Sun Feb  5 18:11:52 2023 daemon.info dnsmasq[1]: exiting on receipt of SIGTERM
Sun Feb  5 18:11:57 2023 daemon.info dnsmasq[1]: started, version 2.86 cachesize 150

Also in this case where it restarted successfully the respawn takes exactly 5 seconds.

Even in the second successful respawn of dnsmasq above (the one after calling "wifi down" it seems like the restart time after receiving the SIGTERM was 0 seconds (not 5).
It seems like perhaps after receiving a SIGTERM from an interface going down, it does not user the same "respawn_timeout" of 5 seconds, but 0 seconds. It looks like the "respawn_retry" are still 5 as it says it failed 6 times.

It's also unclear to me how it failed 6 times with 0 seconds since last failure if it were using the procd.sh parameters. As based on what I read online the procd.sh should default the "respawn_timeout" to 5 seconds so the last respawn attempt should have been 5 seconds ago not 0 seconds ago as the error message says. But perhaps this is just a quirk of which timestamp that error message is checking against.

https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/system/procd/files/procd.sh

 464         local respawn_vals
 465         _json_no_warning=1
 466         if json_select respawn ; then
 467                 json_get_values respawn_vals
 468                 if [ -z "$respawn_vals" ]; then
 469                         local respawn_threshold=$(uci_get system.@service[0].respawn_threshold)
 470                         local respawn_timeout=$(uci_get system.@service[0].respawn_timeout)
 471                         local respawn_retry=$(uci_get system.@service[0].respawn_retry)
 472                         _procd_add_array_data ${respawn_threshold:-3600} ${respawn_timeout:-5} ${respawn_retry:-5}
 473                 fi
 474                 json_select ..
 475         fi
 476 
 477         json_close_object

Anyways what I would like to do is change the respawn_retry parameter for dnsmasq so that it has 30 retries instead of just 5. And also to ensure that respawn_timeout is at least 1 second (and possibly the default 5 seconds) and not 0 seconds. I'm hopeful this may eliminate my problem with dnsmasq crashing leaving all devices on my network stranded.

I'm unsure where to make this change though. I looked at the dnsmasq init.d script and it doesn't appear to supply any specific respawn parameters. I'm unsure if this means it should be using the defaults or not.

/etc/init.d/dnsmasq

	procd_open_instance $cfg
	procd_set_param command $PROG -C $CONFIGFILE -k -x /var/run/dnsmasq/dnsmasq."${cfg}".pid
	procd_set_param file $CONFIGFILE
	[ -n "$user_dhcpscript" ] && procd_set_param env USER_DHCPSCRIPT="$user_dhcpscript"
	procd_set_param respawn

Is the most appropriate place for me to add my desired respawn_retry and respawn_timeout as additional "procd_set_param" statements after the one pasted above inside /etc/init.d/dnsmasq , or is there a better place to put this that doesn't require modifying the init.d script for dnsmasq?

Or perhaps changing the procd respawn is not related to my error at all, and there is some other respawn mechanism backed into dnsmasq for when an interface goes down. Perhaps I need to add some manual delay to this or increase the retry count....

Thanks in advance, also if anyone has any tips to prevent it crashing intermittently I'd be more than happy to see if they help.

1 Like

By itself, it uses the default values 3600 5 5.

I’ve never strayed from respawn defaults, primarily because any such fails were usually by my own design/stupidity, but according to the wiki https://openwrt.org/docs/guide-developer/procd-init-scripts#service_parameters procd_set_param is how to set/override parameters. Whether you still need to issue a following procd_set_param respawn implicitly I can’t say.

1 Like

Thanks so much for your reply. Based on the logs I included above, it seems like there is something strange going on after receiving a SIGTERM due to "wifi up" or "wifi down". Sometimes it waits 5 seconds before respawning the dnsmasq service, however othertimes it does it immediately.

# after executing "wifi up"
Mon Feb  6 18:21:43 2023 daemon.info dnsmasq[1]: exiting on receipt of SIGTERM
Mon Feb  6 18:21:43 2023 daemon.info dnsmasq[1]: started, version 2.86 cachesize 150
# after executing "wifi down"
Mon Feb  6 19:52:00 2023 daemon.info dnsmasq[1]: exiting on receipt of SIGTERM
Mon Feb  6 19:52:05 2023 daemon.info dnsmasq[1]: started, version 2.86 cachesize 150

# after executing "wifi up" a second time
Mon Feb  6 23:32:48 2023 daemon.info dnsmasq[1]: exiting on receipt of SIGTERM
Mon Feb  6 23:32:48 2023 daemon.info dnsmasq[1]: started, version 2.86 cachesize 150
# after executing "wifi down" a second time
Mon Feb  6 23:37:26 2023 daemon.info dnsmasq[1]: exiting on receipt of SIGTERM
Mon Feb  6 23:37:26 2023 daemon.info dnsmasq[1]: started, version 2.86 cachesize 150

I think when it attempts to do it immediately it is also setting the delay to 0 seconds instead of the default 5 it is supposed to. Then on some rare occasions this means it retries 5 times and kills the dnsmasq service. It hasn't happened in the last day so I will need to wait for it to happen again and report back when it does. After that I will probably write my own watchdog init.d service which just checks if dnsmasq service is running and restarts it if it has stopped for more than 30 seconds.

Still no additional failures I suspect it may take another few days for it to fail again. I am curious about the following lines in /etc/init.d/dnsmasq Specifically why does it need to call start_service , when the default behaviour of "procd_send_signal" is SIGHUP which simple reloads the dnsmasq service without restarting (as I show below). Is this some kind of fail-safe to restart the service if it has died?

reload_service() {
	rc_procd start_service "$@"
	procd_send_signal dnsmasq "$@"
}

Based on the documentation for procd_send_signal at https://openwrt.org/docs/guide-developer/procd-init-scripts

The signal argument is SIGHUP by default

Then based on my testing using a custom script, it does indeed send SIGHUP and SIGHUP causes a reload of the service, without the SIGTERM I am seeing in the logs every time an interface goes up or down.

#!/bin/sh
. /lib/functions/procd.sh
procd_send_signal dnsmasq cfg01411c

Also my second question is, does anyone know why when an interface goes up or down a SIGTERM is getting sent (thus restarting it) instead of a SIGHUP (to reload) as it looks like should be the case based on my understanding above and the code below from /etc/init.d/dnsmasq.

add_interface_trigger()
{
	local interface ignore

	config_get interface "$1" interface
	config_get_bool ignore "$1" ignore 0

	[ -n "$interface" ] && [ $ignore -eq 0 ] && procd_add_interface_trigger "interface.*" "$interface" /etc/init.d/dnsmasq reload
}

I found two bug reports for openwrt that seem like they may be related to my issue.

Although I don't have the plugin https-dns-proxy installed it seems like simple-adblock is doing the same thing.

dnsmasq_restart() { /etc/init.d/dnsmasq restart >/dev/null 2>&1; }

I'm not entirely sure under what conditions simple-adblock needs to restart dnsmasq but perhaps I will just comment that out the restart and see if things continue to work with simple-adblock (as I don't change my adblock filters currently).

I posted a question over at the simple-adblock forum Simple-adblock: fast, lean and fully uci/luci configurable AdBlocking - #399 by ev6ds asking why an interface up/down is causing simple-adblock to restart dnsmasq, and also why when it does restart it procd isn't respecting the default respawn_timeout of 5 seconds.

1 Like

Ok so it seems that simple-adblock isn't what is causing dnsmasq to restart when "wifi up" or "wifi down" is called. Since I don't have a spare router available right now it's hard for me to know what part of my configuration (aka which addon) is causing "wifi up" and "wifi down" to cause the dnsmasq service to restart. I'm not sure if there is anyway to tell what addon might be restarting dnsmasq. I will update here if I find a way.

Latest update, I'm getting the crashing problem a lot again so I started to try out who was sending the SIGTERM signal that was restarting dnsmasq. The results were inconclusive but seem like they may suggest that it's killing itself. I did this first connecting with strace and calling wifi up in another shell as follows.

root@OpenWrt:~/misc# ps | grep dnsmasq
 3296 root      2664 S    {dnsmasq} /sbin/ujail -t 5 -n dnsmasq -u -l -r /bin/ubus -r /etc/TZ -r /etc/dnsmasq.conf -r /etc/ethers -r /etc/group -r /etc/hosts -r /etc/passwd -w /tmp/dhcp.leases -r /
 3302 dnsmasq   4628 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
 3489 root      1312 R    grep dnsmasq
root@OpenWrt:~/misc# strace -p 3302 -e 'trace=!all'
strace: Process 3302 attached
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
+++ exited with 0 +++
root@OpenWrt:~/misc# ps | grep dnsmasq
 3784 root      2664 S    {dnsmasq} /sbin/ujail -t 5 -n dnsmasq -u -l -r /bin/ubus -r /etc/TZ -r /etc/dnsmasq.conf -r /etc/ethers -r /etc/group -r /etc/hosts -r /etc/passwd -w /tmp/dhcp.leases -r /
 3795 dnsmasq   4628 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
 3873 root      1312 S    grep dnsmasq

The si_pid reported by strace is 0. Which appears to be undocumented or some kind of error, but I did find one stackexchange comment that suggested it may be the result when a process sends a SIGTERM to itself.

As a second attempt I added log statements to the init.d script for dnsmasq that would print out the parent process names and ids of whoever called the init.d reload, start, and stop (provided I understand how procd scripts actually work which may not be the case).

script /bin/dump_pid_tree

#!/bin/ash

function echo_parent_pids {
    # Look up the parent of the given PID.
    pid=${1:-$$}
    stat="$(cat /proc/${pid}/stat)"
    ppid="$(echo ${stat} | cut -d " " -f 4)"
    echo "$(echo ${stat} | cut -d " " -f 1-4)"
    if [[ ${ppid} -eq 1 ]] ; then
        return
    else
        echo_parent_pids ${ppid}
    fi
}

echo_parent_pids

exit 0

The output was also inconclusive and showed the following.

Tue Feb 14 13:42:03 2023 user.notice dnsmasq: reload_service() 3608 (dump_pid_tree) S 3588 3588 (dnsmasq) S 1
Tue Feb 14 13:42:04 2023 user.notice dnsmasq: dnsmasq_start() 3628 (dump_pid_tree) S 3588 3588 (dnsmasq) S 1
Tue Feb 14 13:42:04 2023 daemon.info dnsmasq[1]: exiting on receipt of SIGTERM
Tue Feb 14 13:42:05 2023 daemon.info dnsmasq[1]: started, version 2.86 cachesize 150
Tue Feb 14 13:42:05 2023 daemon.info dnsmasq[1]: DNS service limited to local subnets
Tue Feb 14 13:42:05 2023 daemon.info dnsmasq[1]: compile time options: IPv6 GNU-getopt no-DBus UBus no-i18n no-IDN DHCP no-DHCPv6 no-Lua TFTP no-conntrack no-ipset no-auth no-cryptohash no-DNSSEC no-ID loop-detect inotify dumpfile

So my best guess at this point is something about taking the "wifi up" or "wifi down" is causing dnsmasq to send a SIGTERM to itself, after which it sometimes gets stuck in a loop of repeated failures which all happens in less than 5 seconds and causes the service to stop restarting.

My next attempt to fix this is going to be to set respawn_retry to some giant number and see if the 6 failures increases to the giant number + 1.

Just incase my fix works and I forget to update this thread the fix I am currently testing is changing the respawn line in /etc/init.d/dnsmasq to the following.

procd_set_param respawn ${respawn_threshold:-3600} ${respawn_timeout:-5} ${respawn_retry:-50}

Just to update, I didn't get the error for over 1 month. But then just today I had 50 sequential failures. I'll up it to 500 or something and see if that fixes it for even longer.

Tue Mar 28 10:48:26 2023 kern.info kernel: [3519291.339134] br-lan: port 5(wlan0) entered disabled state
Tue Mar 28 10:48:26 2023 daemon.notice netifd: Wireless device 'radio1' is now down
Tue Mar 28 10:48:26 2023 daemon.notice netifd: Network device 'wlan0' link is down
Tue Mar 28 10:48:26 2023 daemon.notice netifd: Wireless device 'radio0' is now down
Tue Mar 28 10:48:27 2023 user.notice dnsmasq: reload_service() 10007 (dump_pid_tree) S 9964 9964 (dnsmasq) S 1
Tue Mar 28 10:48:27 2023 user.notice dnsmasq: dnsmasq_start() 10027 (dump_pid_tree) S 9964 9964 (dnsmasq) S 1
Tue Mar 28 10:48:28 2023 daemon.info dnsmasq[1]: exiting on receipt of SIGTERM
Tue Mar 28 10:48:28 2023 user.err : jail: can't read from child
Tue Mar 28 10:48:28 2023 user.err : jail: can't read from parent
Tue Mar 28 10:48:28 2023 daemon.info procd: Instance dnsmasq::cfg01411c s in a crash loop 51 crashes, 0 seconds since last crash
1 Like