DHCP slow to grant addresses when STP is enabled

I have a mesh setup made of 3 routers. The WAN uplink router is the only DHCP and DNS server; the remaining two are dumb mesh repeaters/access points, daisy-chained, with a computer connected at the end of the chain. STP is enabled on all of them.

That computer has dhcpcd installed and is configured to request a known IP via DHCPINFORM (the inform config line). Whenever I boot this computer I have to wait for a minute or so, because usually only the 4th or 5th DHCP request makes it to the DHCP server.
This only happens while STP is enabled on the routers.
Device clocks are synchronized to within ~1 second.

relevant part of computer log
daemon.notice: Aug 12 17:21:59 dhcpcd: enp6s0: sending INFORM (xid 0xed14fe60), next in 3.8 seconds
daemon.notice: Aug 12 17:22:02 dhcpcd: enp6s0: sending INFORM (xid 0xed14fe60), next in 8.4 seconds
daemon.notice: Aug 12 17:22:11 dhcpcd: enp6s0: sending INFORM (xid 0xed14fe60), next in 16.9 seconds
daemon.notice: Aug 12 17:22:28 dhcpcd: enp6s0: sending INFORM (xid 0xed14fe60), next in 32.5 seconds
daemon.info: Aug 12 17:22:28 dhcpcd[9436]: enp6s0: received approval for 192.168.1.8
daemon.notice: Aug 12 17:22:28 dhcpcd: enp6s0: received approval for 192.168.1.8
daemon.notice: Aug 12 17:22:28 dhcpcd: enp6s0: adding IP address 192.168.1.8/24 broadcast 192.168.1.255
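The retry intervals in this log (3.8, 8.4, 16.9, 32.5 seconds) look like the RFC 2131 retransmission schedule: a nominal 4 seconds that doubles on every retry, with about a second of random jitter. Below is a minimal sketch of the resulting wait, assuming that schedule and ignoring the jitter. The name backoff_total is made up for this illustration and is not part of dhcpcd.

```sh
#!/bin/sh
# Nominal time, in seconds, between the first request and request number N,
# assuming delays of 4, 8, 16, 32, 64, 64, ... seconds (jitter ignored).
backoff_total() {
    n=$1
    delay=4
    total=0
    i=1
    while [ "$i" -lt "$n" ]; do
        total=$((total + delay))
        delay=$((delay * 2))
        # Assumed cap on the retry interval.
        if [ "$delay" -gt 64 ]; then
            delay=64
        fi
        i=$((i + 1))
    done
    echo "$total"
}
```

Calling backoff_total 4 prints 28 and backoff_total 5 prints 60, so if only the 4th or 5th request is answered, the wait is roughly half a minute to a minute.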
DHCP server log
Mon Aug 12 17:22:10 2024 daemon.info dnsmasq-dhcp[1]: DHCPINFORM(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
Mon Aug 12 17:22:10 2024 daemon.info dnsmasq-dhcp[1]: DHCPACK(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
Mon Aug 12 17:22:27 2024 daemon.info dnsmasq-dhcp[1]: DHCPINFORM(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
Mon Aug 12 17:22:27 2024 daemon.info dnsmasq-dhcp[1]: DHCPACK(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
relevant part of /etc/config/network
config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'lan2'
        list ports 'lan3'
        list ports 'lan4'
        option priority '10'
        option stp '1'
        option hello_time '2'
        option forward_delay '2'
        option ipv6 '0'

config interface 'lan'
        option device 'br-lan'
        option proto 'static'
        option ipaddr '192.168.1.1'
        option netmask '255.255.255.0'
        option ipv6 '0'
        list dns '8.8.8.8'
        option delegate '0'

/etc/config/dhcp
config dnsmasq
        option domainneeded '1'
        option boguspriv '1'
        option filterwin2k '0'
        option localise_queries '1'
        option rebind_protection '1'
        option rebind_localhost '1'
        option local '/lan/'
        option domain 'lan'
        option expandhosts '1'
        option nonegcache '0'
        option cachesize '1000'
        option authoritative '1'
        option readethers '1'
        option leasefile '/tmp/dhcp.leases'
        option resolvfile '/tmp/resolv.conf.d/resolv.conf.auto'
        option nonwildcard '1'
        option localservice '1'
        option ednspacket_max '1232'
        option filter_aaaa '0'
        option filter_a '0'

config dhcp 'lan'
        option interface 'lan'
        option start '50'
        option limit '150'
        option leasetime '12h'
        option dhcpv4 'server'

config dhcp 'wan'
        option interface 'wan'
        option ignore '1'

config odhcpd 'odhcpd'
        option maindhcp '0'
        option leasefile '/tmp/hosts/odhcpd'
        option leasetrigger '/usr/sbin/odhcpd-update'
        option loglevel '4'

STP keeps the port blocked at startup while it tries to detect a bridge loop.

But only for the first 4 seconds; the LAN port reaches the forwarding state before the second request is made. How long does it take for the change to propagate?
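For context on the 4 seconds: with classic (non-rapid) STP, a port that comes up spends one forward_delay in the listening state and another in the learning state before it forwards traffic, so forward_delay '2' gives about 2 × 2 = 4 seconds. On a Linux bridge the current per-port state can be read from sysfs. The sketch below assumes the kernel bridge's own STP implementation is in use; the function names are made up for illustration.

```sh
#!/bin/sh
# Seconds a port spends in listening + learning before it forwards.
stp_time_to_forwarding() {
    echo $(( 2 * $1 ))
}

# Translate the numeric port state used by the Linux bridge into a name.
stp_state_name() {
    case "$1" in
        0) echo disabled ;;
        1) echo listening ;;
        2) echo learning ;;
        3) echo forwarding ;;
        4) echo blocking ;;
        *) echo unknown ;;
    esac
}

# Print the state of every port of a bridge, e.g. "show_port_states br-lan".
show_port_states() {
    for f in /sys/class/net/"$1"/brif/*/state; do
        # Skip the unexpanded pattern when the bridge has no ports or is missing.
        [ -r "$f" ] || continue
        port=${f%/state}
        echo "${port##*/}: $(stp_state_name "$(cat "$f")")"
    done
}
```

Running show_port_states br-lan repeatedly while the computer boots would show how long each port actually stays out of the forwarding state. This only covers the local bridge; it says nothing about the STP state on the other routers in the chain.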

It is between bridge checkboxes

What exactly is between checkboxes? Do you mean some configuration options in Luci?

Once you enable STP in LuCI, you see options for the various STP hold-off timers, timeouts, etc.

Why do you have STP enabled?
Unless you have non-mesh segments in your backhaul, you will not have mesh bridge loops.
In addition, mesh bridge loops cannot be controlled by STP, because it works at bridge level, not interface level.

The question remains as to why you are having this problem that to my knowledge does not occur in a mesh network.
Perhaps sharing some of the OpenWrt logs would have been a better idea.

I was trying my best to stabilise my mesh, which has a tendency to stop working.

I run DHCP on OpenWrt; here is a bigger excerpt from that period. 192.168.1.197 is my phone.

log
Mon Aug 12 13:03:29 2024 daemon.warn dnsmasq[12]: possible DNS-rebind attack detected: dns.msftncsi.com
Mon Aug 12 17:13:19 2024 daemon.info dnsmasq-dhcp[1]: DHCPDISCOVER(br-lan) 0c:cb:85:a0:4d:98
Mon Aug 12 17:13:19 2024 daemon.info dnsmasq-dhcp[1]: DHCPOFFER(br-lan) 192.168.1.197 0c:cb:85:a0:4d:98
Mon Aug 12 17:13:21 2024 daemon.info dnsmasq-dhcp[1]: DHCPDISCOVER(br-lan) 0c:cb:85:a0:4d:98
Mon Aug 12 17:13:21 2024 daemon.info dnsmasq-dhcp[1]: DHCPOFFER(br-lan) 192.168.1.197 0c:cb:85:a0:4d:98
Mon Aug 12 17:13:21 2024 daemon.info dnsmasq-dhcp[1]: DHCPREQUEST(br-lan) 192.168.1.197 0c:cb:85:a0:4d:98
Mon Aug 12 17:13:21 2024 daemon.info dnsmasq-dhcp[1]: DHCPACK(br-lan) 192.168.1.197 0c:cb:85:a0:4d:98
Mon Aug 12 17:22:10 2024 daemon.info dnsmasq-dhcp[1]: DHCPINFORM(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
Mon Aug 12 17:22:10 2024 daemon.info dnsmasq-dhcp[1]: DHCPACK(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
Mon Aug 12 17:22:27 2024 daemon.info dnsmasq-dhcp[1]: DHCPINFORM(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
Mon Aug 12 17:22:27 2024 daemon.info dnsmasq-dhcp[1]: DHCPACK(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
Mon Aug 12 17:26:22 2024 authpriv.info dropbear[4832]: Child connection from 192.168.1.8:53570

That is a known bug in the mt76/79 driver - dma buffer overflow.
Fixed in current snapshot.
See commit:

I flashed yesterday's snapshot but the problem is still there: as soon as an end point connects to the router via either WLAN or wire, the router can't reach my other mesh points.

It is very unclear what you mean by this.
What do you define as an "end point"?

The phrase "as soon as" implies immediately, but your title is "DHCP slow...". You are being somewhat inconsistent, which makes it hard to understand your problem. "A tendency to stop working" is a bit different from "as soon as".

Does this imply you actually do have a wired segment of backhaul?
This would explain your attempts to stabilise the mesh using STP.

If you want some cabled segments, you need either mesh11sd or batman-adv.

I apologize for my unclear description of the situation.
A purely wireless mesh is deployed on 3 routers: 2 of them are Xiaomi AX3000T, the remaining one is a Cudy RE3000. The first AX3000T is WAN-connected to my ISP; the rest are daisy-chained: the RE3000 connects via mesh to the first AX3000T, and the second AX3000T connects to the RE3000. There are no loops; the two AX3000T units can't reach each other because the signal is not strong enough for that to work. Each router serves as a wireless AP for a fast-roaming network with a shared mobility domain.

I have disabled STP now, so the DHCP issues are gone.

The offending router has always been the third, the last in the chain. The computer is connected to it by wire. Sporadically it becomes unable to reach the other routers via the mesh, even though the mesh itself is not disassociated. It is most prevalent after a period of inactivity, such as overnight: often when I boot the computer or connect to Wi-Fi with my phone in the morning, the router itself is accessible but watchcat suddenly starts reporting that it cannot ping the other routers in the mesh.

If you use your computer to bridge Wi-Fi and wire, STP kicks in and kills one of the two. For example, if Docker enabled global forwarding or similar.

This is typical of the problem that seems to be fixed with the commit mentioned earlier, but not necessarily. It could also be a bad mesh config allowing hop changes in the backhaul that break layer 3 connections.
You say the mesh nodes at each end of your daisy chain cannot see each other.

A possible scenario:
In a domestic/indoor environment, "signal is not strong enough" is often not 100% true. More likely, when things are quiet, enough signal is available for a connection, allowing a path hop count change. This breaks layer 3, but it eventually recovers via processes such as DHCP and/or ARP.
In the morning, you connect, the background gets noisy again and the path changes, breaking layer 3 again. Eventually layer 3 recovers yet again. It could take minutes or hours, depending on conditions.

So all I need to do is tune txpower on the edge routers? The logs never show any association between those two at any point in time.
For the last couple of days I ran a configuration where my mid-point and end-of-chain routers were swapped, and the problem was still there.
So the solution may be as simple as that; I'll give it a try.

Yes, but more importantly, mesh_rssi_threshold.

I reduced txpower on all my routers and set mesh_rssi_threshold to 75 but, as I expected, that wasn't the issue.
I set up watchcat to run a script every time it can't ping my main WAN router:

wifirestart script
#!/bin/sh
# Run by watchcat when the main WAN router stops answering pings.
echo "####################################" | logger -p daemon.warning -t wifirestart
echo " Mesh routing down, wifi restarting " | logger -p daemon.warning -t wifirestart
echo "####################################" | logger -p daemon.warning -t wifirestart
# Log destination MAC and path-change count for every mesh path,
# skipping the header line of the mpath dump.
iw dev backbone mpath dump | awk '{print $1 "  " $12}' \
        | tail -n +2 | logger -p daemon.warning -t wifirestart
# Restart the 5 GHz radio.
wifi down radio5
sleep 2
wifi up radio5

The iw dev backbone mpath dump part shows the number of mesh peer path changes just before the 5 GHz radio is reset.
Every time it logs only 1:

part of log output
daemon.warn wifirestart: ####################################
daemon.warn wifirestart:  Mesh routing down, wifi restarting
daemon.warn wifirestart: ####################################
daemon.warn wifirestart: cc:d8:43:b0:a3:df  1
daemon.warn wifirestart: 82:af:ca:13:ce:41  1
daemon.notice hostapd: Set new config for phy phy1:
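A slightly wider variant of the awk filter in the script might make it easier to see whether the path to each node runs directly or through another node. This is a sketch that assumes the usual 12-column layout of the iw mpath dump output (destination, next hop, interface, SN, metric, queue length, expiry, discovery timeout, discovery retries, flags, hop count, path changes). The function name is made up, and the column positions should be checked against the installed iw version.

```sh
#!/bin/sh
# Read "iw dev <iface> mpath dump" output on stdin and print, per path:
# destination, next hop, hop count and number of path changes.
# The first line (column headers) is skipped.
mpath_summary() {
    awk 'NR > 1 && NF >= 12 {
        printf "%s via %s hops=%s changes=%s\n", $1, $2, $11, $12
    }'
}
```

In the script it could replace the existing pipeline, for example: iw dev backbone mpath dump | mpath_summary | logger -p daemon.warning -t wifirestart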

I really don't know how to continue from this point; it feels like I have exhausted all my options.

To be sure, are you still saying the mesh link is staying up but layer 3 is failing (e.g. your ping fails)?

Try restarting the network instead of the wireless, i.e. service network restart.
If this works, it might shed some light on the problem, as it would give some weight to the DMA buffer overflow issue being the cause.
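One way to answer the layer 2 versus layer 3 question from a script is to compare the number of established peer links with the result of a ping. This is a rough sketch, assuming the mesh interface is named backbone, the uplink router is 192.168.1.1, and that the iw station dump output marks established peers with a "mesh plink: ESTAB" line. The function names are invented for this example.

```sh
#!/bin/sh
# Count established peer links in "iw dev <iface> station dump" output (stdin).
count_estab_links() {
    awk '/mesh plink:[ \t]*ESTAB/ { n++ } END { print n + 0 }'
}

# diagnose <established_links> <ping_ok: 1 or 0>
diagnose() {
    if [ "$1" -eq 0 ]; then
        echo "layer 2 down: no established mesh peer links"
    elif [ "$2" -eq 1 ]; then
        echo "ok: mesh links up and ping succeeds"
    else
        echo "layer 3 failing: mesh links up but ping fails"
    fi
}

# Run on the router, e.g. "check_mesh backbone 192.168.1.1".
check_mesh() {
    links=$(iw dev "$1" station dump | count_estab_links)
    if ping -c 3 -W 2 "$2" >/dev/null 2>&1; then
        diagnose "$links" 1
    else
        diagnose "$links" 0
    fi
}
```

Logging the output of check_mesh just before any restart would record which of the two cases was seen at the moment of failure.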

Also, what values are you setting for mesh_hwmp_rootmode and mesh_plink_timeout? (Inappropriate settings for these can cause dropouts, but usually at layer 2, not just layer 3.)

You wrote that you set mesh_rssi_threshold to 75. Is that a typo? The value must be negative or it will be ignored.
i.e.:
option mesh_rssi_threshold '-75'

By the way, -75 dBm is an excessively high sensitivity in most cases, especially indoors.
Multipath reflections and diffraction of the microwave signals can cause dynamic signal hotspots (the technical term is an interference pattern, where reflected signals constructively and destructively interfere with each other).
Mesh11sd defaults to -65dBm for a reason.

option mesh_hwmp_rootmode '2' and option mesh_plink_timeout '6000'

I set mesh_rssi_threshold to -75 because in some cases my backbone signal drops to -72~73 dBm.
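To make the trade-off concrete, here is a simplified model of how the threshold is understood to act: a candidate peer is accepted only if its averaged signal is stronger than the threshold, and a threshold of 0 is taken to mean no filtering. This is an illustration of the idea, not the driver's actual code, and peer_allowed is a made-up name.

```sh
#!/bin/sh
# peer_allowed <signal_dBm> <threshold_dBm>
# Succeeds (exit status 0) if a peer at the given signal level would pass
# the threshold under the simplified model described above.
peer_allowed() {
    signal=$1
    threshold=$2
    # Assumption: a threshold of 0 disables the check.
    if [ "$threshold" -eq 0 ]; then
        return 0
    fi
    [ "$signal" -gt "$threshold" ]
}
```

With a backbone that dips to -73 dBm, peer_allowed -73 -75 succeeds while peer_allowed -73 -65 fails. That is the reason for the lower threshold here, at the cost of also admitting weaker links.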

For an indoor situation, this is generally far too low, as destructive interference can give intervals of complete dropout. Outdoors this will usually be just fine as destructive interference will be very rare.

If the mesh is on 5 GHz, you could try changing to 2.4 GHz and go up to HE40 mode (should give ~500 Mb/s or more on these devices). There is less attenuation on 2.4 GHz.

Otherwise you could try relocating the mesh nodes, which is a bit hit and miss.

As for mesh_plink_timeout, a value of 6000 is fine outdoors. In an indoor multipath environment you could try changing this to 10 to timeout quickly, forcing connection retries as soon as possible, rather than waiting 100 minutes.
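The 100-minute figure follows from mesh_plink_timeout being expressed in seconds. Below is a small sketch of the conversion, plus a helper that reads the values currently active on a mesh interface. It assumes the interface is named backbone, as in the script earlier in the thread, and that the installed iw supports the get mesh_param command. Both function names are made up.

```sh
#!/bin/sh
# Convert a mesh_plink_timeout value (seconds) to whole minutes.
plink_timeout_minutes() {
    echo $(( $1 / 60 ))
}

# Print the live values of the mesh parameters discussed in this thread.
# Intended to be run on the router, e.g. "show_mesh_params backbone".
show_mesh_params() {
    for p in mesh_plink_timeout mesh_hwmp_rootmode mesh_rssi_threshold; do
        printf '%s: ' "$p"
        iw dev "$1" get mesh_param "$p"
    done
}
```

Calling plink_timeout_minutes 6000 prints 100, while a value of 10 is well under a minute. Comparing the output of show_mesh_params before and after a config change is a quick way to confirm that a setting was actually applied.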
Worth a try.

Also it might be worth trying mesh_hwmp_rootmode set to 4 on the first in the chain. This mode is significantly more efficient and more resilient, but this is probably not going to make any difference because of the poor signal condition.

The last resort is to add another node into the chain.