Mesh 802.11s routing table gets filled with garbage causing a black hole (OpenWrt 21.02 RC4, mt7603e, mt7615e)

I updated the firmware to the latest OpenWrt master, but the problem keeps happening. I changed my script to log each MAC address that ends up in the rogue routes, and to turn wifi off and on again if it detects the issue on 2 consecutive runs.
I am now noticing interesting things which may help to shed light on what is going on. I will collect more data and report here.

I will also try to update hostapd, even though I am quite confident it is not due to hostapd: now that I log more details, I have started to recognize some MAC addresses that are present in the LAN and do not talk to the mesh via 802.11 in any way.

I suspect something is wrong with the mesh peer autodiscovery. It just happened again, and I noticed that the AP that couldn't connect to the mesh had the wrong peer's MAC (it was a valid MAC, but not for the mesh0 interface).

I'm now testing it with hardcoded peers:
iw dev mesh0 mpath new <XYZ> next_hop <XYZ>

Here's an update:

  • I re-compiled the latest OpenWrt master + latest hostapd main branch, then flashed it on my devices
  • the bug keeps showing up
  • Since I now log the MAC addresses of the rogue routes that are inserted, I just spotted the MAC address of my own laptop as well as the MAC address of another OpenWrt router on the same LAN!

So the MAC addresses are not invented after all: they belong to devices connected to the LAN in one way or another (via ethernet cable or wifi). The mesh interfaces are bridged with the LAN, and somehow these addresses end up as rogue routes in the routing table of the mesh.

Here's the latest version of the script I am using:

#!/bin/sh
# Detect and clear rogue 802.11s mesh paths; reload wifi if the
# problem is found on two consecutive runs.

FOUND_BUG="0"
# Keep only the destination MAC (first column) of a matching line
PATTERN="/\s00:00:00:00:00:00\s/s/\s.*$//p"
# Marker file: exists if the previous run also found rogue paths
MESH_BUG_FILE="/tmp/mesh-bug"

# Print the destination MACs of rogue paths on interface $1:
# next hop all zeros and the remaining columns at fixed values
get_routes() {
	iw dev "$1" mpath dump \
		| grep 00:00:00:00:00:00 \
		| grep "0\t0\t0\t0\t1600\t4\t0x0\t0\t0" \
		| sed -n -e "$PATTERN"
}

for iface in mesh0 mesh1; do
	for mac_addr in $(get_routes "$iface"); do
		FOUND_BUG="1"
		iw dev "$iface" mpath del "$mac_addr"
		logger -t mesh "Cleared rogue $iface route $mac_addr"
	done
done

if [ "$FOUND_BUG" = "0" ]; then
	rm -f "$MESH_BUG_FILE"
elif [ ! -f "$MESH_BUG_FILE" ]; then
	# First detection: flush the neighbour cache and remember it
	ip neigh flush dev br-lan
	logger -t mesh "Found mesh bug, mpath routes and ARP cleared"
	touch "$MESH_BUG_FILE"
else
	# Second consecutive detection: restart wifi
	ip neigh flush dev br-lan
	logger -t mesh "Found persistent mesh bug, reloading wifi"
	rm -f "$MESH_BUG_FILE"
	wifi down
	sleep 1
	wifi up
fi
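
A note on portability: the \t and \s escapes in the grep and sed patterns of the script are not part of POSIX regular expressions, so whether they match depends on the grep/sed build in use. Below is a minimal sketch of the same filter in awk. It assumes the column order implied by the grep pattern (destination, next hop, interface, then nine numeric columns), and the function name rogue_dests is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the destination MACs of mesh paths whose columns match
# the same fixed values as the grep pattern in the script.
# Reads `iw dev <iface> mpath dump` output on stdin.
# Assumed columns: 1=dest 2=next hop 3=iface, then 4..12 as in the
# grep pattern (0 0 0 0 1600 4 0x0 0 0).
rogue_dests() {
	awk '$2 == "00:00:00:00:00:00" &&
	     $4 == 0 && $5 == 0 && $6 == 0 && $7 == 0 &&
	     $8 == 1600 && $9 == 4 && $10 == "0x0" &&
	     $11 == 0 && $12 == 0 { print $1 }'
}
```

It would be used as: iw dev mesh0 mpath dump | rogue_dests. Unlike the unanchored grep, this compares each column exactly.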

The script seems to be effective in fixing the blackhole issue without manual intervention.

However, I think we need to understand where this bug is originating from and send a bug report.

I am wondering if this is related to

https://bugs.openwrt.org/index.php?do=details&task_id=3961

Fix for above was out shortly after your latest test.

I wonder as well, but I have no exact way to replicate this issue; it just happens randomly.
Today I had the issue again after a long time, and I mitigated it with the script I mentioned above. This time I saw something different in the logs:

root@OpenWrt:~ # logread | grep mesh
Mon Oct 11 04:12:28 2021 kern.info kernel: [  727.597214] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
Mon Oct 11 04:12:28 2021 kern.info kernel: [  727.610298] br-lan: port 4(mesh1) entered listening state
Mon Oct 11 04:12:28 2021 kern.info kernel: [  727.694934] br-lan: port 4(mesh1) entered blocking state
Mon Oct 11 04:40:24 2021 kern.info kernel: [ 2403.773171] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
Mon Oct 11 04:40:24 2021 kern.info kernel: [ 2403.786241] br-lan: port 4(mesh1) entered listening state
Mon Oct 11 04:40:24 2021 kern.info kernel: [ 2403.944355] br-lan: port 4(mesh1) entered blocking state
Mon Oct 11 04:40:52 2021 kern.info kernel: [ 2431.933449] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
Mon Oct 11 04:40:52 2021 kern.info kernel: [ 2431.946484] br-lan: port 4(mesh1) entered listening state
Mon Oct 11 04:40:52 2021 kern.info kernel: [ 2432.041426] br-lan: port 4(mesh1) entered blocking state
Mon Oct 11 04:41:28 2021 kern.info kernel: [ 2467.783968] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
Mon Oct 11 04:41:28 2021 kern.info kernel: [ 2467.797034] br-lan: port 4(mesh1) entered listening state
Mon Oct 11 04:41:30 2021 kern.info kernel: [ 2469.563320] br-lan: port 4(mesh1) entered blocking state
Mon Oct 11 04:41:44 2021 kern.info kernel: [ 2483.773950] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
Mon Oct 11 04:41:44 2021 kern.info kernel: [ 2483.787005] br-lan: port 4(mesh1) entered listening state
Mon Oct 11 04:41:44 2021 kern.info kernel: [ 2483.909368] br-lan: port 4(mesh1) entered blocking state
Mon Oct 11 04:42:24 2021 kern.info kernel: [ 2524.094326] br-lan: port 6(mesh0) neighbor 0fa1.80:**:**:**:**:** lost
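
To quantify how often this happens, the "neighbor ... lost" events can be counted per bridge port. This is a minimal sketch that assumes the log format shown above; count_neighbor_lost is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count bridge "neighbor ... lost" events per port.
# Reads syslog lines (e.g. from logread) on stdin.
count_neighbor_lost() {
	awk '/br-lan: port [0-9]+\([^)]*\) neighbor .* lost/ {
		# The field shaped like 6(mesh0) identifies the port
		for (i = 1; i <= NF; i++)
			if ($i ~ /^[0-9]+\(.*\)$/) count[$i]++
	}
	END { for (p in count) print p, count[p] }'
}
```

On the router it could be fed with: logread | count_neighbor_lost. For the excerpt above it would report 6 events on port 6(mesh0).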

Given I am using a build from master of 8 days ago (ade56b8d9e) and the problem is still happening while comments in the bug report regarding WDS claim it's fixed, I think this is a different bug, not related to WDS.

I opened a bug report: https://bugs.openwrt.org/index.php?do=details&task_id=4099.

Sounds like something that should be carried upstream, at least to get some guidance on how to debug this further. The OpenWrt wireless stack developer community is already very small, and the number of people familiar with 802.11s details is even smaller.

I totally understand and agree, do you have any suggestion on where to report this? Thanks.

I would suggest getting in touch with the Linux Wireless mailing list.

I sent it, thanks for your suggestion.

https://lore.kernel.org/linux-wireless/CAAGgX6+okJRVP4v+a5scg1ZOTvBgRmbOAsoQhJVmBiSwiVamcw@mail.gmail.com/T/

Fingers crossed.

I think I have the same issue. If I mesh 2 OpenWrt devices (Netgear R6120), the whole network becomes unusable after a few seconds: the default gateway, the internet, and all systems in the same network are unreachable.
Running latest OpenWrt 21.02.1 r16325-88151b8303. After pulling the network cable from one Netgear, the network is reachable again.
Or is it forbidden to use a network cable in mesh mode?
Edit: I made a network trace and found that my Samsung television floods UDP multicast to 239.255.255.250:15600 into the network when I mesh the two OpenWrt Netgears. That's strange and kind of funny.

You're probably creating a broadcast storm by bridging the mesh interface with the eth interface. Definitely not the same issue.

That sounds like a network loop. Ensure the Spanning Tree Protocol is enabled on the bridge; in OpenWrt 21 it has to be set in the new "device" section.
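
For reference, here is a minimal sketch of enabling STP from the shell. It assumes the bridge is declared as a "config device" section with option name 'br-lan' in /etc/config/network (the 21.02 layout); find_bridge_section and enable_stp are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: enable STP on an OpenWrt 21.02 bridge "device" section.

# Print the UCI section (e.g. network.@device[0]) whose name option
# equals $1. Reads `uci show network` output on stdin.
find_bridge_section() {
	sed -n "s/^\(network\.[^.]*\)\.name='$1'\$/\1/p" | head -n 1
}

# Enable STP on bridge $1 and reload the network configuration.
# Assumes a device section with a matching name already exists.
enable_stp() {
	section=$(uci show network | find_bridge_section "$1")
	if [ -z "$section" ]; then
		echo "no device section named $1 found" >&2
		return 1
	fi
	uci set "$section.stp=1"
	uci commit network
	/etc/init.d/network reload
}
```

On each node this would be run as: enable_stp br-lan.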

I found out that UPnP devices create a UDP multicast storm. If a device has UPnP active, like my Samsung TV or my TP-Link TL-WR841N (under Forwarding/UPnP), it floods my whole network with UDP so heavily that nothing is reachable any more. Crazy stuff! So I created a separate device for the mesh interface (one that is not the eth0 LAN), and now it's working.
