I have a mesh setup made of 3 routers. The WAN uplink router is the only DHCP and DNS server; the remaining two are dumb mesh repeaters/access points, daisy-chained, with a computer at the end of the chain. STP is enabled on all of them.
That computer has dhcpcd installed and is configured to request a known IP via DHCPINFORM (the inform config line). Whenever I boot this computer I have to wait for a minute or so, because usually only the 4th or 5th DHCP request makes it to the DHCP server.
This only happens while STP is enabled on the routers.
Device clocks are synchronized to within ~1 second.
relevant part of computer log
daemon.notice: Aug 12 17:21:59 dhcpcd: enp6s0: sending INFORM (xid 0xed14fe60), next in 3.8 seconds
daemon.notice: Aug 12 17:22:02 dhcpcd: enp6s0: sending INFORM (xid 0xed14fe60), next in 8.4 seconds
daemon.notice: Aug 12 17:22:11 dhcpcd: enp6s0: sending INFORM (xid 0xed14fe60), next in 16.9 seconds
daemon.notice: Aug 12 17:22:28 dhcpcd: enp6s0: sending INFORM (xid 0xed14fe60), next in 32.5 seconds
daemon.info: Aug 12 17:22:28 dhcpcd[9436]: enp6s0: received approval for 192.168.1.8
daemon.notice: Aug 12 17:22:28 dhcpcd: enp6s0: received approval for 192.168.1.8
daemon.notice: Aug 12 17:22:28 dhcpcd: enp6s0: adding IP address 192.168.1.8/24 broadcast 192.168.1.255
DHCP server log
Mon Aug 12 17:22:10 2024 daemon.info dnsmasq-dhcp[1]: DHCPINFORM(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
Mon Aug 12 17:22:10 2024 daemon.info dnsmasq-dhcp[1]: DHCPACK(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
Mon Aug 12 17:22:27 2024 daemon.info dnsmasq-dhcp[1]: DHCPINFORM(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
Mon Aug 12 17:22:27 2024 daemon.info dnsmasq-dhcp[1]: DHCPACK(br-lan) 192.168.1.8 e0:69:95:b0:d8:d3
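Lining the two logs up makes the loss visible: dhcpcd sent four INFORMs (17:21:59, 17:22:02, 17:22:11, 17:22:28), but dnsmasq logged only two (17:22:10 and 17:22:27). With clocks about 1 second apart, that suggests the first two never reached the server. Below is a minimal sketch for measuring this from a saved dhcpcd log. It assumes the log layout shown above (time in the fourth field), the name inform_delay is made up for illustration, and it does not handle a run that crosses midnight.

```shell
# inform_delay: read a dhcpcd log on stdin and report how many INFORMs
# were sent up to the first approval, and how long that took.
# Assumes the time (HH:MM:SS) is the 4th whitespace-separated field.
inform_delay() {
    awk '
        function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        seen { next }                      # ignore everything after the first approval
        /sending INFORM/ { n++; if (n == 1) start = secs($4) }
        /received approval/ && n > 0 { print n " attempts, " (secs($4) - start) " s"; seen = 1 }
    '
}

# Usage (dhcpcd.log is a hypothetical file name):
#   inform_delay < dhcpcd.log
```

For the client log excerpt above this prints "4 attempts, 29 s".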
relevant part of /etc/config/network
config device
	option name 'br-lan'
	option type 'bridge'
	list ports 'lan2'
	list ports 'lan3'
	list ports 'lan4'
	option priority '10'
	option stp '1'
	option hello_time '2'
	option forward_delay '2'
	option ipv6 '0'

config interface 'lan'
	option device 'br-lan'
	option proto 'static'
	option ipaddr '192.168.1.1'
	option netmask '255.255.255.0'
	option ipv6 '0'
	list dns '8.8.8.8'
	option delegate '0'
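Since the delay only shows up with STP on, one way to confirm the cause is to switch STP off on the bridge and reboot the client. A hedged sketch using uci follows. The section reference network.@device[0] assumes br-lan is the first config device section, and disable_bridge_stp is just an illustrative helper name.

```shell
# disable_bridge_stp SECTION: turn STP off for one bridge device section
# and write the change to /etc/config/network.
disable_bridge_stp() {
    uci set "$1.stp=0"
    uci commit network
}

# Usage on the router (the section name is an assumption; check it with
# `uci show network` first):
#   disable_bridge_stp 'network.@device[0]'
#   service network restart
```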
Why do you have STP enabled?
Unless you have non-mesh segments in your backhaul, you will not have mesh bridge loops.
In addition, mesh bridge loops cannot be controlled by STP, because it works at bridge level, not interface level.
The question remains as to why you are having this problem that to my knowledge does not occur in a mesh network.
Perhaps sharing some of the OpenWrt logs would have been a better idea.
I flashed yesterday's snapshot but the problem is still there: as soon as an end point connects to the router via either WLAN or wire, the router can't reach my other mesh points.
It is very unclear what you mean by this.
What do you define as an "end point"?
The phrase "as soon as" implies immediately, but your title is "DHCP slow...". You are being somewhat inconsistent, which makes it hard to understand your problem. "A tendency to stop working" is a bit different from "as soon as".
Does this imply you actually do have a wired segment of backhaul?
This would explain your attempts to stabilise the mesh using STP.
If you want some cabled segments, you need either mesh11sd or batman-adv.
I apologize for my unclear description of the situation.
A purely wireless mesh is deployed on 3 routers: two are Xiaomi AX3000T and the remaining one is a Cudy RE3000. The first AX3000T is connected via WAN to my ISP; the rest are daisy-chained: the RE3000 connects via mesh to the first AX3000T, and the second AX3000T connects to the RE3000. There are no loops; the two AX3000Ts can't reach each other because the signal is not strong enough. Each router serves as a wireless AP for a fast-roaming network with a shared mobility domain.
I have disabled STP now, so the DHCP issues are gone.
The offending router has always been the third, the last in the chain. The computer is connected to it by wire. Sporadically it stops being able to reach the other routers via the mesh, even though the mesh itself is not disassociated. It is most prevalent after a period of inactivity, such as overnight: often when I boot the computer or connect to Wi-Fi with my phone in the morning, the router itself is accessible but watchcat suddenly starts reporting that it cannot ping the other routers in the mesh.
This is typical of the problem that seems to be fixed with the commit mentioned earlier, but not necessarily. It could also be a bad mesh config allowing hop changes in the backhaul that break layer 3 connections.
You say the mesh nodes at each end of your daisy chain cannot see each other:
A possible scenario:
In a domestic/indoor environment, "signal is not strong enough" is often not 100% true. More likely is that when things are quiet, enough signal is available for a connection, allowing a path hop count change. This breaks layer 3, but it eventually recovers via processes such as dhcp and/or arp.
In the morning, you connect, the background gets noisy again and the path changes, breaking layer 3 again. Eventually layer 3 recovers yet again. It could be minutes or hours; it depends.
So all I need to do is tune txpower on the edge routers? The logs never show any association between them at any point in time.
For the last couple of days I ran a swapped configuration, with my mid-point and end-of-chain routers exchanged, and the problem was still there.
So the solution may be as simple as that, I'll give it a try.
I reduced txpower on all my routers, set mesh_rssi_threshold to 75, but as I expected it wasn't the issue.
I set up watchcat to run a script every time it can't ping my main WAN router.
The iw dev backbone mpath dump part shows the number of mesh peer path changes just before the 5GHz radio is reset.
Every time it logs only 1:
part of log output
daemon.warn wifirestart: ####################################
daemon.warn wifirestart: Mesh routing down, wifi restarting
daemon.warn wifirestart: ####################################
daemon.warn wifirestart: cc:d8:43:b0:a3:df 1
daemon.warn wifirestart: 82:af:ca:13:ce:41 1
daemon.notice hostapd: Set new config for phy phy1:
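For readers following along, a script of this kind might look like the sketch below. It is not the original script: the function names are invented, and it assumes the iw build prints the path-change counter as the last column of mpath dump, which would match the MAC-plus-count lines in the log. Note that wifi down / wifi up restarts all radios, not only the 5GHz one.

```shell
# summarise_mpath: reduce `iw dev <iface> mpath dump` output (on stdin)
# to "<destination MAC> <last column>", skipping the header line.
summarise_mpath() {
    awk '$1 ~ /^[0-9A-Fa-f][0-9A-Fa-f]:/ { print $1, $NF }'
}

# log_and_restart IFACE: log the per-path summary, then restart wifi.
log_and_restart() {
    logger -p daemon.warn -t wifirestart "Mesh routing down, wifi restarting"
    iw dev "$1" mpath dump | summarise_mpath | while read -r line; do
        logger -p daemon.warn -t wifirestart "$line"
    done
    wifi down
    wifi up
}

# Usage from the watchcat hook: log_and_restart backbone
```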
I really don't know how to continue from this point, feels like I exhausted all my options.
To be sure, are you still saying the mesh link is staying up but layer 3 is failing (eg your ping fails)?
Try restarting the network instead of the wireless, i.e. service network restart.
If this works, it might shed some light on the problem: it would give some weight to the DMA buffer overflow issue being the cause.
Also, what values are you setting for mesh_hwmp_rootmode and mesh_plink_timeout? (Inappropriate settings for these can cause dropouts, but usually at layer 2, not just layer 3.)
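The values currently in effect can be read back from the running interface. Here is a small sketch, assuming the mesh interface is called backbone as in the iw command mentioned earlier. The helper name show_mesh_params is made up.

```shell
# show_mesh_params IFACE: print the live value of a few mesh parameters
# as reported by `iw dev IFACE get mesh_param <name>`.
show_mesh_params() {
    for p in mesh_hwmp_rootmode mesh_plink_timeout mesh_rssi_threshold; do
        printf '%s: %s\n' "$p" "$(iw dev "$1" get mesh_param "$p")"
    done
}

# Usage: show_mesh_params backbone
```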
You wrote that you set mesh_rssi_threshold to 75. Is that a typo? The value must be negative or it will be ignored.
ie: option mesh_rssi_threshold '-75'
By the way, -75dBm is an excessively high sensitivity in most cases, especially indoors.
Multipath reflections and diffraction of the microwave signals can cause dynamic signal-hotspots, (the technical term is an interference pattern, where reflected signals constructively and destructively interfere with each other).
Mesh11sd defaults to -65dBm for a reason.
For an indoor situation, this is generally far too low, as destructive interference can give intervals of complete dropout. Outdoors this will usually be just fine as destructive interference will be very rare.
If the mesh is on 5GHz, you could try changing to 2.4GHz and go up to HE40 mode (should give 500+ Mb/s on these devices). There is less attenuation on 2.4GHz.
Otherwise you could try relocating the mesh nodes, which is a bit hit and miss.
As for mesh_plink_timeout, a value of 6000 is fine outdoors. In an indoor multipath environment you could try changing this to 10 to timeout quickly, forcing connection retries as soon as possible, rather than waiting 100 minutes.
Worth a try.
Also it might be worth trying mesh_hwmp_rootmode set to 4 on the first in the chain. This mode is significantly more efficient and more resilient, but this is probably not going to make any difference because of the poor signal condition.
The last resort is to add another node into the chain.