[Solved] ntp fails/time lags when dnsmasq service is disabled

I'm test-driving/shaking down a Netgear WAX206 in preparation for it to replace a R7800 as the main house router.

During this shakedown phase WAN and WAN6 are disabled. Upstream is instead provided via the lan interface, configured as a DHCP client, with a 2.4GHz mesh interface connecting to the R7800. There's also a single AP on the 5GHz side, also bridged (as is the default) to the LAN side.

Edit: DHCP is disabled on the LAN interface, and the dnsmasq service is also disabled.

# cat /etc/config/network 

config interface 'loopback'
	option device 'lo'
	option proto 'static'
	option ipaddr '127.0.0.1'
	option netmask '255.0.0.0'

config globals 'globals'
	option ula_prefix ...

config interface 'lan'
	option device 'br-lan'
	option proto 'dhcp'

config interface 'wan'
	option device 'wan'
	option proto 'dhcp'
	option auto '0'

config interface 'wan6'
	option device 'wan'
	option proto 'dhcpv6'
	option auto '0'
	option reqaddress 'try'
	option reqprefix 'auto'

# cat /etc/config/wireless

config wifi-device 'radio0'
	option type 'mac80211'
	option path 'platform/18000000.wmac'
	option channel 'auto'
	option band '2g'
	option htmode 'HT20'
	option cell_density '0'
	option country 'US'
	option txpower '14'

config wifi-device 'radio1'
	option type 'mac80211'
	option path '1a143000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0'
	option channel '120'
	option band '5g'
	option htmode 'HE40'
	option cell_density '0'
	option country 'US'
	option txpower '14'

config wifi-iface 'default_radio1'
	option device 'radio1'
	option network 'lan'
	option mode 'ap'
	option ssid ...
	option encryption 'sae'
	option key ...
	option wpa_disable_eapol_key_retries '1'

config wifi-iface 'wifinet3'
	option device 'radio0'
	option mode 'mesh'
	option encryption 'sae'
	option mesh_id ...
	option mesh_fwding '1'
	option mesh_rssi_threshold '0'
	option key ...
	option network 'lan'

After finalizing this configuration, I left the router (ok, more of an access point atm) running for a couple of hours. When I checked back, I noticed that the system clock was behind by about 30 minutes.

I toggled NTP off and then on from the System → System LuCI screen: I cleared the Enable NTP client checkbox, saved and applied, then restored the config as it was before and saved and applied again.

I suspect that there's some (hotplug?) configuration that ties ntp specifically to events on the WAN interfaces I've disabled. Is that correct? Any ideas on how to tweak the configuration so that time stays synchronized on my setup?

If there's no WAN access, then the NTP client daemon won't be able to read time from 1.openwrt.pool.ntp.org and friends, so it can't sync up the clock. I'm sort of surprised it's only off by 30 minutes; I'd have expected some arbitrary time. There shouldn't be any hotplug involved, ntpd just polls those addresses at some regular interval and updates the system clock accordingly. If you can create a route so that pinging ntp.org works, then the clock should sync up pretty quickly.

The internet is accessible over br-lan though. I'm typing this on a laptop that's a STA on the 5G AP, and on the WAX206 itself right now:

# ip r
default via 192.168.2.1 dev br-lan  src 192.168.2.110 
192.168.2.0/24 dev br-lan scope link  src 192.168.2.110 
# ping 0.openwrt.pool.ntp.org
PING 0.openwrt.pool.ntp.org (216.31.16.12): 56 data bytes
64 bytes from 216.31.16.12: seq=0 ttl=46 time=67.775 ms

I suspect that the drift had something to do with the sequence in which interfaces come up/connectivity is achieved and when/how ntp is started. Some kind of startup-time race condition.

Hmm, curious, isn't it? If you look at System -> Startup, you'll see that ntpd is one of the last things to start, down at 98, so it does try to wait until everything else is up.

There is a hotplug that ntpd fires off to let dnsmasq know the clock is valid, but that's on my machine with dnsmasq running. Do you see this file? (May not be there since you've got dnsmasq disabled.) I can't see how a failure in signaling dnsmasq would propagate back into the ntpd, though...

$ cat /var/state/dnsmasqsec
ntpd says time is valid

30 minutes is a large amount of drift for a few hours of runtime. That doesn't make sense even if NTP only sync'd at the beginning.

I suspect that there is some issue with NTP sync, but possibly slightly different than what you're expecting...

My guess is that the device rebooted at some point, approximately 30 minutes after you finalized the configuration and NTP failed to sync when/after it rebooted. Because most router type devices don't have a real-time clock, OpenWrt will use the timestamp of the last file to be saved as the "current time" upon boot and until corrected by a successful NTP sync.
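If you want to see what that fallback time would be on your device, here is a minimal sketch. As far as I know it mirrors what the sysfixtime init script does at boot (it takes the newest file timestamp under /etc), but treat that as an assumption. The function name `newest_mtime` is made up for illustration.

```shell
# Print the newest modification time (epoch seconds) of any regular file
# under the given directory. Prints nothing if the directory has no files.
newest_mtime() {
    find "$1" -type f -exec date -r {} +%s \; | sort -nr | head -n1
}
```

Comparing the output of `newest_mtime /etc` with `date +%s` shortly after boot shows how close the clock still is to that fallback value. If the two are only minutes apart, NTP probably has not corrected the clock yet.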

To answer the original question -- yes, as long as there is a proper upstream connection on br-lan, NTP sync should be possible and reliable. By "proper upstream," I am referring to the complete IP configuration (IP address, subnet mask, DNS server, gateway) appropriate for that upstream connection that provides proper connectivity (and of course, a working internet connection via that upstream network). Since you have your lan set up as a DHCP client, I don't see any issue with that.

But it is critical that you can verify that the internet connection is working from the router itself -- are you able to ping sites on the internet (like google.com) from an ssh session into the router? And can you force an ntp update and have a successful update occur?

I do see that file, but its last modification timestamp is several hours old; around the time I was fiddling with the network configuration and disabling dnsmasq.

# cat /var/state/dnsmasqsec 
ntpd says time is valid
# ls -l /var/state/dnsmasqsec 
-rw-r--r--    1 root     root            24 Jun 19 14:31 /var/state/dnsmasqsec
# date
Mon Jun 19 16:40:24 PDT 2023

Which brings us to:

Yes, I was definitely messing around with the configuration, so files were modified/timestamped. After the network setup was working overall, I don't believe I rebooted the device before I left the house for a while, returning a little while later when I noticed the drift. It's plausible that the intermediate states that the network setup was in while I was tweaking the configurations got ntpd into a funky state (still a stretch since it's supposed to be stateless and just periodically pinging over UDP).

Anyway, I'll go ahead and reboot the device now and check on it periodically over the next few hours.

Thanks both for chiming in! I'll update this thread if only for posterity.

It could have been a reboot caused by something other than you actively rebooting it. A crash, power issue, or other situation could have been the root cause of a reboot. Log data can be useful (logs will be lost on reboot, but you might see evidence of a reboot). Alternatively, you could send logs to a syslog server if you have one set up (or can get that running).
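A quick way to tell whether the device rebooted behind your back is to look at its uptime. Here is a small sketch, assuming a Linux-style /proc/uptime. The helper name `uptime_minutes` is hypothetical.

```shell
# Print whole minutes since boot. The first field of /proc/uptime is the
# uptime in seconds; an alternate file can be passed as $1 for testing.
uptime_minutes() {
    awk '{ printf "%d\n", $1 / 60 }' "${1:-/proc/uptime}"
}
```

If `uptime_minutes` prints a number much smaller than the time since you last touched the device, it restarted in between. After that, `logread | grep -i ntp` can show what ntpd has logged since that boot.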

Interestingly, after a reboot, the device is now again several hours "behind". Presumably, that's the timestamp of the most recently modified (config?) file. So ntpd failed on at least its initial attempt to get current time.

This time there's no /var/state/dnsmasqsec, which makes sense since the service was already disabled at boot time.

So there's some kind of race condition when ntpd is started, now to see what and how sticky it is...

When you're ssh'd into the device, can you ping public web servers?
What happens when you force the ntp sync?

Yes, by the time I ssh in there seems to be good connectivity:

# ping www.google.com
PING www.google.com (172.217.14.196): 56 data bytes
64 bytes from 172.217.14.196: seq=0 ttl=58 time=12.007 ms
# ping 0.openwrt.pool.ntp.org
PING 0.openwrt.pool.ntp.org (162.159.200.123): 56 data bytes
64 bytes from 162.159.200.123: seq=0 ttl=55 time=9.406 ms

The system time remains (as of a few minutes after reboot anyway) several hours behind.

Did you force the NTP sync?

Not when I had tested connectivity.

Once I do that it catches up:

root@angua:~# service sysntpd status
running
root@angua:~# date
Mon Jun 19 14:52:22 PDT 2023
root@angua:~# service sysntpd restart
root@angua:~# service sysntpd status
running
root@angua:~# date
Mon Jun 19 16:58:17 PDT 2023

You might be running into this issue, which can result in the NTP client not working when dnsmasq is disabled.

That's a good bit of info!

@dimitris - Generally speaking, it is not necessary or desirable to disable dnsmasq. It is obviously critical that you do not have another DHCP server running on your network, so this is the reason many people disable dnsmasq. However, the preferred method is to set the lan network DHCP server to ignore the interface so that the server itself is explicitly disabled as a function of the configuration. This is considered best because dnsmasq will become re-enabled when you run a sysupgrade at some point in the future (and/or could be accidentally or intentionally re-enabled for other reasons).

Therefore, I'd recommend that you try ignoring the lan interface in the dhcp server config and re-enable dnsmasq. See if that fixes the issue.
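In shell terms, that looks roughly like the sketch below. It assumes the DHCP section is named `lan` in /etc/config/dhcp (the default) and that the `service` helper is available, as it is in an interactive OpenWrt shell. The function name is made up for illustration.

```shell
# Keep dnsmasq running but stop it from serving DHCP on the lan interface.
# Assumption: the DHCP section in /etc/config/dhcp is called 'lan'.
keep_dnsmasq_without_lan_dhcp() {
    uci set dhcp.lan.ignore='1'   # dnsmasq ignores lan for DHCP
    uci commit dhcp               # write the change to /etc/config/dhcp
    service dnsmasq enable        # start dnsmasq at boot again
    service dnsmasq restart       # apply the new config now
}
```

After running `keep_dnsmasq_without_lan_dhcp`, a `service sysntpd restart` followed by `date` should show whether the clock catches up.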

Oh, good find! That does seem to explain the behavior dimitris is seeing.

That makes perfect sense, thanks!

This did the trick. I actually already had the lan interface ignored; I had disabled dnsmasq only to avoid any DNS issues for devices that use the WAX206. That doesn't seem to be a problem with dnsmasq re-enabled.

On the WAX206 itself, trying to resolve other devices that dnsmasq on the "bridged upstream" R7800 knows about fails, since I've left both routers using the default .lan domain. That's ok though, this is a temporary setup.

I'll try the alternative of explicitly providing the NTP DHCP option from upstream later.
