I updated the firmware to the latest OpenWrt master, but the problem keeps happening. I changed my script to log each MAC address that ends up in the rogue routes, and to turn wifi off and on again if it detects the issue on 2 consecutive runs.
I am now noticing interesting things which may help to shed light on what is going on. I will collect more data and report here.
I will also try to update hostapd, even though I am quite confident it is not due to hostapd: now that I log more details, I have started to recognize some MAC addresses that are present in the LAN and do not talk to the mesh via 802.11 in any way.
I suspect something is wrong with the mesh peer autodiscovery. It just happened again, and I noticed that the AP that couldn't connect to the mesh had the wrong MAC for its peer (it was a valid MAC, but not the one for the mesh0 interface).
I'm now testing it with hardcoded peers: iw dev mesh0 mpath new <XYZ> next_hop <XYZ>
I recompiled the latest OpenWrt master with the latest hostapd main branch, then flashed it on my devices.
The bug keeps showing up.
Since I now log the MAC addresses of the rogue routes that are inserted, I just spotted the MAC address of my own laptop as well as the MAC address of another OpenWrt router on the same LAN!
So the MAC addresses are not invented after all: they belong to devices that are connected to the LAN in one way or another (via ethernet cable or wifi). The mesh interfaces are bridged with the LAN, and somehow these addresses end up as rogue routes in the routing table of the mesh.
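One way to check this on a node is to compare the flagged mesh destinations with the MAC addresses the router already knows from the LAN. The following is only a sketch under assumptions: the helper names flagged_dests and neigh_macs are made up for illustration, the mpath dump is assumed to print destination and next hop as its first two columns, and `ip neigh show` is assumed to print the MAC right after the word lladdr.

```shell
#!/bin/sh
# Hypothetical helpers; both read text on stdin so they can be tried offline.

# Print (lowercased) the destination of every mpath entry whose next hop
# is all zeros. The header line is skipped because its 2nd field is not a MAC.
flagged_dests() {
    awk '$2 == "00:00:00:00:00:00" { print tolower($1) }'
}

# Print (lowercased) the MAC that follows "lladdr" in `ip neigh show` output.
# Entries without a link-layer address (e.g. FAILED) are ignored.
neigh_macs() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "lladdr") print tolower($(i + 1)) }'
}

# Possible use on the router (file names are arbitrary):
#   iw dev mesh0 mpath dump | flagged_dests | sort -u > /tmp/flagged
#   ip neigh show dev br-lan | neigh_macs | sort -u > /tmp/lan
#   grep -F -x -f /tmp/lan /tmp/flagged   # flagged MACs also seen on the LAN
```

Any address printed by the last command is both a flagged mesh destination and a known LAN neighbour, which is the overlap described above.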
Here's the latest version of the script I am using:
#!/bin/sh
# Clear mesh paths with an all-zero next hop; reload wifi if the problem
# is detected on two consecutive runs.
FOUND_BUG="0"
PATTERN='/\s00:00:00:00:00:00\s/s/\s.*$//p'
MESH_BUG_FILE="/tmp/mesh-bug"

# Print the destination MAC of every suspicious path on interface $1.
get_routes() {
    iw dev "$1" mpath dump \
        | grep 00:00:00:00:00:00 \
        | grep "0\t0\t0\t0\t1600\t4\t0x0\t0\t0" \
        | sed -n -e "$PATTERN"
}

for iface in mesh0 mesh1; do
    for mac_addr in $(get_routes "$iface"); do
        FOUND_BUG="1"
        iw dev "$iface" mpath del "$mac_addr"
        logger -t mesh "Cleared rogue $iface route $mac_addr"
    done
done

if [ "$FOUND_BUG" = "0" ]; then
    # Nothing found: forget any earlier detection.
    rm -f "$MESH_BUG_FILE"
elif [ ! -f "$MESH_BUG_FILE" ]; then
    # First detection: flush neighbours and remember it for the next run.
    ip neigh flush dev br-lan
    logger -t mesh "Found mesh bug, mpath routes and ARP cleared"
    touch "$MESH_BUG_FILE"
else
    # Second consecutive detection: restart wifi.
    ip neigh flush dev br-lan
    logger -t mesh "Found persistent mesh bug, reloading wifi"
    rm -f "$MESH_BUG_FILE"
    wifi down
    sleep 1
    wifi up
fi
The script seems to be effective in fixing the blackhole issue without manual intervention.
However, I think we need to understand where this bug is originating from and send a bug report.
I wonder as well, but I have no exact way to replicate this issue, it just happens randomly.
Today I had the issue again after a long time, and I mitigated it with the script I mentioned above. This time I saw something different in the logs:
root@OpenWrt:~ # logread | grep mesh
Mon Oct 11 04:12:28 2021 kern.info kernel: [ 727.597214] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
Mon Oct 11 04:12:28 2021 kern.info kernel: [ 727.610298] br-lan: port 4(mesh1) entered listening state
Mon Oct 11 04:12:28 2021 kern.info kernel: [ 727.694934] br-lan: port 4(mesh1) entered blocking state
Mon Oct 11 04:40:24 2021 kern.info kernel: [ 2403.773171] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
Mon Oct 11 04:40:24 2021 kern.info kernel: [ 2403.786241] br-lan: port 4(mesh1) entered listening state
Mon Oct 11 04:40:24 2021 kern.info kernel: [ 2403.944355] br-lan: port 4(mesh1) entered blocking state
Mon Oct 11 04:40:52 2021 kern.info kernel: [ 2431.933449] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
Mon Oct 11 04:40:52 2021 kern.info kernel: [ 2431.946484] br-lan: port 4(mesh1) entered listening state
Mon Oct 11 04:40:52 2021 kern.info kernel: [ 2432.041426] br-lan: port 4(mesh1) entered blocking state
Mon Oct 11 04:41:28 2021 kern.info kernel: [ 2467.783968] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
Mon Oct 11 04:41:28 2021 kern.info kernel: [ 2467.797034] br-lan: port 4(mesh1) entered listening state
Mon Oct 11 04:41:30 2021 kern.info kernel: [ 2469.563320] br-lan: port 4(mesh1) entered blocking state
Mon Oct 11 04:41:44 2021 kern.info kernel: [ 2483.773950] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
Mon Oct 11 04:41:44 2021 kern.info kernel: [ 2483.787005] br-lan: port 4(mesh1) entered listening state
Mon Oct 11 04:41:44 2021 kern.info kernel: [ 2483.909368] br-lan: port 4(mesh1) entered blocking state
Mon Oct 11 04:42:24 2021 kern.info kernel: [ 2524.094326] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
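To get a quick feel for how often this happens, the "neighbor ... lost" messages can be counted per bridge port. This is just a sketch: lost_per_port is a made-up helper name, and it assumes the kernel messages keep the "port N(iface) neighbor ... lost" wording shown in the log above.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print
# "<count> port N(iface)" for each port that reported a lost neighbor.
lost_per_port() {
    awk '/ neighbor .* lost$/ {
        # Pull out the "port N(iface)" part of the message and count it.
        if (match($0, /port [0-9]+\([^)]*\)/))
            n[substr($0, RSTART, RLENGTH)]++
    }
    END { for (p in n) print n[p], p }'
}

# Possible use on the router:
#   logread | lost_per_port
```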
Given I am using a build from master of 8 days ago (ade56b8d9e) and the problem is still happening while comments in the bug report regarding WDS claim it's fixed, I think this is a different bug, not related to WDS.
Sounds like something that should be taken upstream, at least to get some guidance on how to debug this further. The OpenWrt wireless stack developer community is already very small, and the number of people familiar with 802.11s details is even smaller.
I think I have the same issue. If I mesh 2 OpenWrt devices (Netgear R6120), the whole network becomes unusable after a few seconds: the default gateway, the internet and all systems in the same network are unreachable.
Running latest OpenWrt 21.02.1 r16325-88151b8303. After pulling the network cable from one Netgear, the network is reachable again.
Or is it forbidden to use a network cable in mesh mode?
Edit: I made a network trace and found that my Samsung television floods UDP multicast to 239.255.255.250:15600 into the network when I mesh the two OpenWrt Netgears. That's strange, and kind of funny.
That sounds like a network loop. Ensure the Spanning Tree Protocol is enabled on the bridge; in OpenWrt 21 it has to be set in the new "device" section.
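For reference, a minimal sketch of enabling STP through UCI on 21.02. It assumes the bridge is the first "config device" section in /etc/config/network (check with `uci show network` first); enable_bridge_stp and the DRY_RUN variable are made-up names for this example, not part of OpenWrt.

```shell
#!/bin/sh
# Hypothetical helper: enable STP on a bridge "device" section.
# $1 is the UCI section (assumed default: the first device section).
# With DRY_RUN set, the commands are only printed, not executed.
enable_bridge_stp() {
    section="${1:-@device[0]}"
    run=${DRY_RUN:+echo}                   # "echo" in dry-run mode, empty otherwise
    $run uci set "network.$section.stp=1"  # same as: option stp '1'
    $run uci commit network
    $run /etc/init.d/network reload
}

# Possible use on the router:
#   DRY_RUN=1; enable_bridge_stp    # show what would be run
#   unset DRY_RUN; enable_bridge_stp
```

The equivalent manual change is adding option stp '1' to the bridge's "config device" section and reloading the network.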
I found out that UPnP devices create a UDP multicast storm. If a device has UPnP active, like my Samsung TV or my TP-Link TL-WR841N (under Forwarding/UPnP), it storms my whole network with UDP so hard that nothing is reachable any more. Crazy stuff! So I created a separate device for the mesh interface (which is not the eth0 LAN), and now it's working.