Dnsmasq dies on boot after receiving SIGTERM

Hi,

I have three OpenWrt 23.05.0 routers (call them r1, r2, r3) on the same subnet. I'm using keepalived + conntrackd to virtualize two IPs (the subnet gateway, DNS and NTP). All three run a DHCP server on the same subnet, using three different pools, and I have a script to keep the leases file in sync.
All works beautifully, but... if I reboot all three at the same time, after the reboot one of them has no dnsmasq running.
So I enabled a log file and this is what I see on the router without dnsmasq running:

root@router:~# cat /tmp/log/dnsmasq.log
Nov 12 04:16:49 dnsmasq[1]: started, version 2.89 cachesize 1000
...
Nov 12 04:16:58 dnsmasq[1]: exiting on receipt of SIGTERM
Nov 12 04:17:00 dnsmasq[1]: started, version 2.89 cachesize 1000
...
Nov 12 04:17:00 dnsmasq[1]: exiting on receipt of SIGTERM
Nov 12 04:17:04 dnsmasq[1]: started, version 2.89 cachesize 1000
...
Nov 12 04:17:05 dnsmasq[1]: exiting on receipt of SIGTERM

It starts, then restarts twice, then no more. There are no errors in the full logs apart from an occasional ubus error. It is consistent on every reboot. Most of the time it is the same router, r3, that ends up without dnsmasq running; sometimes it is one of the other two.
I disabled the https-dns-proxy option that updates the dnsmasq configuration on start/stop; same behaviour. At this point I have no clue.

Is there a way to see what is sending the SIGTERM to dnsmasq? Is there some facility to trace signals?

Regards

I think that dnsmasq is sometimes forced to reload its config by terminating it. I don't remember exactly why (and didn't try to look into it). But it might be that, with multiple devices starting at the same time, the same condition is triggered multiple times. The limit might need to be more than 5 in your situation.

You might need to adjust the respawn limits for procd in the dnsmasq init file. Procd has crash detection, and by default it stops respawning after 5 tries if the process dies multiple times quickly.

But it is also possible that the init on each router retriggers the same condition circularly, so that you have a never-ending loop :frowning:
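For reference, the respawn behaviour is controlled by a single procd parameter in the service's init script; a minimal sketch (the parameter values here are illustrative, not the shipped defaults):

```
# Fragment of a procd init script (sketch; values are illustrative):
# procd_set_param respawn <threshold> <timeout> <retry>
#   threshold - seconds of stable running after which the retry counter resets
#   timeout   - seconds procd waits before respawning a dead process
#   retry     - how many quick deaths are tolerated before procd gives up
procd_set_param respawn 3600 5 10
```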


Thanks for the fast support.

I didn't have time to dig in yet. I left the routers running yesterday with dnsmasq up, and today r3 had no dnsmasq running; the last line in the log is: "Nov 12 07:13:25 dnsmasq[1]: exiting on receipt of SIGTERM".
There were no clients served by the routers. The only variable I can see is the wan interface; it is a wireless STA, and it might lose its connection from time to time. Does OpenWrt reload dnsmasq on wifi disconnect/reconnect?

Note: as r2 and r3 are clones (same HW), before leaving I swapped r1 and r3, so r1 became the new r3 and r3 the new r1. And r3 (both the old and the new one) sits in a spot where the wifi signal is a bit weak; and most of the time it is r3 that loses dnsmasq, both old r3 and new r3. At this point it looks very probable that it is related to wifi connect/disconnect/reconnect.

Probably I should make the wan interface a bridge and assign the wifi STA to it, so that the interface stays permanently up. Is that possible? In 19.07.0 I had problems using a wireless interface in a bridge, and I'm not familiar with the new switch/bridge facility. Ideas?
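If this is even possible, I imagine the STA would need 4-address (WDS) mode so it can sit in a bridge; a rough sketch of what I have in mind (the section name is a placeholder, and the upstream AP would have to support WDS):

```
# /etc/config/wireless (sketch; 'wifinet0' is a placeholder section name)
config wifi-iface 'wifinet0'
        option mode 'sta'
        option wds '1'          # 4-address mode, needed to bridge a STA
        option network 'wan'    # attach to the (bridged) wan interface
```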

Just a wild guess, but could this be related to something your script does when synchronizing leases? I can't tell without looking at the script file, but does dnsmasq restart when your script runs?

Thanks for trying, but the script is not running; I haven't set up the cron job yet. BTW, it's the vanilla script from the high-availability page of the OpenWrt wiki.

And currently there are no DHCP clients in the network, so no client can trigger that. There is only one wired host on the r1-r2-r3 collision domain: another router with a static IP, NATing a bunch of servers and a couple of my terminals; from r1-r2-r3's perspective that is a single host. And no AP is configured on r1-r2-r3. No clients.

In the meantime I noticed I can reboot r1 and r2 and get dnsmasq SIGTERM'ed on r3. I just triple-checked; it happened 3 times out of 3 reboots. When this happens, r3 becomes VRRP MASTER for both virtual IPs, as both higher-priority routers are gone. But dnsmasq dies. Logread:

Mon Nov 13 03:36:24 2023 daemon.info Keepalived_vrrp[2778]: Printing VRRP as json for process(2778) on signal
[... repeated "Printing VRRP as json" lines trimmed ...]
Mon Nov 13 03:36:36 2023 daemon.info Keepalived_vrrp[2778]: (primary) Entering MASTER STATE
Mon Nov 13 03:36:36 2023 daemon.info Keepalived_vrrp[2778]: (secondary) Entering MASTER STATE
Mon Nov 13 03:36:39 2023 kern.info kernel: [ 1046.420306] Atheros AR8216/AR8236/AR8316 mdio.0:00: Port 1 is down
Mon Nov 13 03:36:48 2023 kern.info kernel: [ 1055.780008] Atheros AR8216/AR8236/AR8316 mdio.0:00: Port 1 is up
Mon Nov 13 03:37:14 2023 kern.info kernel: [ 1081.780181] Atheros AR8216/AR8236/AR8316 mdio.0:00: Port 1 is down
Mon Nov 13 03:37:18 2023 kern.info kernel: [ 1085.939925] Atheros AR8216/AR8236/AR8316 mdio.0:00: Port 1 is up
Mon Nov 13 03:37:19 2023 kern.info kernel: [ 1086.980171] Atheros AR8216/AR8236/AR8316 mdio.0:00: Port 1 is down
Mon Nov 13 03:37:38 2023 kern.info kernel: [ 1105.699887] Atheros AR8216/AR8236/AR8316 mdio.0:00: Port 1 is up
[... repeated "Printing VRRP as json" lines trimmed ...]
Mon Nov 13 03:37:46 2023 daemon.info Keepalived_vrrp[2778]: (primary) Master received advert from 192.168.168.250 with higher priority 255, ours 253
Mon Nov 13 03:37:46 2023 daemon.info Keepalived_vrrp[2778]: (primary) Entering BACKUP STATE
Mon Nov 13 03:37:47 2023 authpriv.info dropbear[6369]: Early exit: Terminated by signal
Mon Nov 13 03:37:47 2023 authpriv.info dropbear[8522]: Not backgrounding
[... repeated "Printing VRRP as json" lines trimmed ...]
Mon Nov 13 03:38:26 2023 daemon.info Keepalived_vrrp[2778]: (secondary) Master received advert from 192.168.168.249 with higher priority 255, ours 253
Mon Nov 13 03:38:26 2023 daemon.info Keepalived_vrrp[2778]: (secondary) Entering BACKUP STATE
Mon Nov 13 03:38:26 2023 authpriv.info dropbear[8522]: Early exit: Terminated by signal
Mon Nov 13 03:38:27 2023 authpriv.info dropbear[8805]: Not backgrounding
[... repeated "Printing VRRP as json" lines trimmed ...]
Mon Nov 13 03:39:20 2023 daemon.warn odhcpd[1800]: No default route present, overriding ra_lifetime!
[... repeated "Printing VRRP as json" lines trimmed ...]

Apart from the log being flooded with "Printing VRRP as json..." lines, it shows:

  1. r3 becoming MASTER for both virtual IPs (named "primary" and "secondary").
  2. The ethernet link going down/up 3 times. That's the cable connecting r1 to r3; as r1 is rebooting, r3 loses the ethernet link. That link is part of r3's br-lan bridge, i.e. the "lan" network interface on r3, the one dnsmasq serves DHCP on.
  3. r3 going back to BACKUP state for both virtual IPs as soon as it receives a new probe from r1 and r2.
  4. My SSH session to dropbear resetting (my terminal is also wired to r1, so it loses its link too when I reboot r1).

So, yes, it looks like the wireless interfaces on wan are innocent. It's the lan ethernet link resetting on reboot: r1 booting makes r3's ethernet link go down/up 3 times, the same number of dnsmasq restarts. Then procd detects a faulty condition and doesn't start dnsmasq any more.

At this point I should place a switch between the routers, as it should be. But I don't have one handy; I had to retire the one I had a few days ago... a storm cooked it and now there's AC on the ethernet cables connected to it... it's a good Tesla coil rather than a switch...

My peculiar situation aside (a switch missing between the two routers), my humble advice to OpenWrt's devs is to raise procd's respawn limit to 5, as @hnyman stated above, and make that the default for OpenWrt.
I didn't check why the ethernet link goes down/up three times, but I imagine it is fairly normal for a rebooting device that has its ethernet interface in a bridge. And it's a pretty common setup on consumer-grade equipment (i.e. many users don't have a separate L2 switch and L3 router; they have one device doing it all). A respawn limit is good to have, but 3 sounds a bit too strict.

I'll report back once the issue is solved, just to confirm that the things stated in this message are correct.


I added "3600 5 10" in /etc/init.d/dnsmasq:

procd_set_param respawn 3600 5 10

(10 respawns allowed, every 5 seconds, within 3600s time frame)

But it didn't fix the issue. Every time r1 is rebooted, r3's dnsmasq restarts a few times and then stops for good.

The interfaces aren't responsible for the dnsmasq respawns. I set dnsmasq's log facility back to syslog, in order to have a clean timeline, and it seems dnsmasq receives the last SIGTERM right after keepalived enters BACKUP state (i.e. when the virtual IP is removed from br-lan). If I turn off r2 to simplify the testbed and start/stop keepalived on r1, I can trigger dnsmasq start/stop on r3. No idea why.
Every time I stop keepalived on r1, r3 goes from BACKUP to MASTER state and restarts dnsmasq; when I restart keepalived on r1, r3 goes from MASTER back to BACKUP state and dnsmasq receives a SIGTERM. This is wild. Some undocumented feature?

Look at your "ucitrack" file to see which services affect dhcp/dnsmasq,
and check the hotplug files (e.g. ntp) etc.
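Something along these lines should show what is tracked and which hotplug scripts touch dnsmasq (just diagnostic commands; paths are as on a default 23.05 install):

```
# Which ucitrack entries mention dhcp/network, i.e. what can restart dnsmasq
uci show ucitrack | grep -i -e dhcp -e network

# Any hotplug script that references dnsmasq
grep -rl dnsmasq /etc/hotplug.d/ 2>/dev/null
```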

Whether this is related to your issue or not (I have not looked into it in detail), a few comments on respawn in procd:

  1. The default for procd retry respawn is 5 (see documentation - https://openwrt.org/docs/guide-developer/procd-init-scripts#service_parameters )
  2. In 23.05, the dnsmasq init.d script does not set its own respawn retry, so it should use the default of 5, as far as I can see.
  3. Adding procd_set_param respawn to an init.d script would normally be done in start_service(), but in 23.05 it is in dnsmasq_start(), where you will find procd_set_param respawn with no parameters.

You could try changing the line in dnsmasq_start() to:
procd_set_param respawn 150 10 10

This will wait 10 seconds before retrying (second parameter) and retry up to 10 times (third parameter).

The first parameter is the respawn threshold: the time interval after which procd resets the retry count.

Next add a new line after the one you just edited:
procd_set_param term_timeout 60

This sets how long procd waits for dnsmasq to exit after sending SIGTERM before it follows up with a SIGKILL.
The default is 5 seconds, and this might not be enough for your use case.

Finally, after saving the changes, to activate this you must do:

service dnsmasq stop
service dnsmasq disable
service dnsmasq enable
service dnsmasq start

Well, the only "affects" entry is "network", but that seems a pretty broad one. VRRP (keepalived) adds/removes an alias on an interface on state changes, and this may trigger "network"; in any case I can't remove the "affects", as that would surely break other things. BTW, it is correct to restart dnsmasq when the IP alias gets added/removed, since dnsmasq must bind/unbind that address too. The problem is that dnsmasq should be restarted, not stopped.

That's exactly what I did with the 3600 5 10 values: I saw the default (5) and extended it to 10. Then I noticed the respawn limit might not be the issue, and the other poster pointed me to ucitrack... I'm currently stuck figuring out why dnsmasq stops instead of restarting when keepalived changes state (master->backup, or the opposite).

EDIT: I tried "150 10 10" as well. No fix.

I suppose that if the automatic respawn fails, there should be a reason for it. You could manually try to spawn dnsmasq in this condition and track down the failing element.
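For example, running it in the foreground usually makes the failure visible on the console (the generated config file name varies per install, hence the lookup):

```
# Stop the managed instance, then run dnsmasq in the foreground with the
# config that netifd generated, to see any error directly on the console
service dnsmasq stop
dnsmasq --no-daemon --conf-file="$(ls /var/etc/dnsmasq.conf.* | head -n 1)"
```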

Hi!
I am facing exactly the same issue and symptoms. Any news since last November?

Nope, I still have the issue. I'm waiting for some hardware to get here before going back to work on the routers; probably a couple more weeks.

If you can find a solution please post it here. Thanks.

Hello, I am a user from China. I'm sorry, my English is very poor and I can only talk to you through the DeepL translator; I hope you can understand these words. I had the same problem as you on my own OpenWrt install. Then I found, in the /tmp/keepalived.conf file, that keepalived calls /sbin/hotplug-call when switching between backup and master, and I guessed that this is why dnsmasq was stopped. I found the file 551-dnsmasq in the /etc/hotplug.d/keepalived folder, and changing set_stop_if_backup to set_restart_if_backup seems to fix the problem: at least across a few tests, the behaviour of dnsmasq changes from stopping to restarting. I may need some more time to verify this, but I thought you could give it a try.
By the way, if dnsmasq has DHCP enabled, this change leaves multiple DHCP servers on the LAN; I guess that's why the keepalived hotplug script stops dnsmasq. So it's best to do something about that, such as finding a way to run the DHCP server only on the master.
Since this is from DeepL translator, I'll summarise it.
edit file: /etc/hotplug.d/keepalived/551-dnsmasq
set_stop_if_backup => set_restart_if_backup
save.
Test it.
Good luck. :grin:
ps: My openwrt version: 23.05.2
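In case it helps, the edit can also be applied with sed; back up the file first (the path and function names are as found on my 23.05.2 system, please verify them on yours):

```
# Keep a backup, then change the backup-state action from stop to restart
cp /etc/hotplug.d/keepalived/551-dnsmasq /root/551-dnsmasq.bak
sed -i 's/set_stop_if_backup/set_restart_if_backup/' /etc/hotplug.d/keepalived/551-dnsmasq
```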


For now, I rely on a watchdog script, run from cron every 5 minutes:

#!/bin/sh
# Check up to 5 times, 5 seconds apart, whether dnsmasq is running;
# exit quietly as soon as it is.
for i in 0 1 2 3 4 ; do
  /etc/init.d/dnsmasq running && exit 0
  sleep 5
done

# Still not running after ~25 seconds: log it and restart the service.
logger -t dnsmasq-watchdog "dnsmasq looks down: restarting."
/etc/init.d/dnsmasq restart

It works, but it is a dirty workaround...
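For completeness, the cron side is just a standard entry (the script path is simply where I happened to save it):

```
# /etc/crontabs/root -- run the watchdog every 5 minutes
*/5 * * * * /root/dnsmasq-watchdog.sh
```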

@Blackfeather: Thanks, I am going to give your finding a try. It is promising.