Mesh 802.11s routing table gets filled with garbage causing a black hole (OpenWrt 21.02 RC4, mt7603e, mt7615e)

Out of the blue a mesh router running OpenWrt 21.02 RC4 connected via 802.11s mesh mode (and mesh forwarding active and no other routing deamon if not the default 802.11s routing protocol) gets filled with garbage and the L2 connection between the mesh nodes to the router is lost (so the internet connection too).

After several times this thing has happened I finally found a weird thing which looks a symptom of the issue.

This happened on 24th of August:

iw mesh1 mpath dump
DEST ADDR         NEXT HOP          IFACE	SN	METRIC	QLEN	EXPTIME		DTIM	DRET	FLAGS	HOP_COUNT	PATH_CHANGE
90:38:3d:dd:42:61 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
3a:bd:b8:ec:00:da 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
2e:bc:05:55:1b:fe 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
b6:75:f9:d4:75:21 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
2a:81:a7:88:b2:3d 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
38:cb:3b:46:a3:b0 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
1c:48:1f:b9:ff:2f 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
a4:e1:c5:1d:0b:67 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
0c:33:7e:b6:6f:aa 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
e4:fa:0b:af:d5:eb 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
32:e7:70:d6:00:b8 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
16:90:cc:07:c7:f4 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
f8:71:4d:52:77:92 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
0e:7e:0d:96:51:37 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
9a:bb:4e:36:2a:bc 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
a4:79:30:19:69:3c 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
56:6a:17:c7:c9:34 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
42:2d:3a:7a:5c:bb 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
c4:b3:59:4d:bc:34 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
50:7e:6b:64:9f:3b 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
24:0c:82:93:ab:13 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
b4:f6:42:5c:61:ba 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
62:53:bf:54:35:17 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
74:63:b3:cf:a8:7c 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
f6:03:a7:7e:d7:1d 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
fa:36:ca:7b:b1:b8 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
b2:18:10:ef:f1:01 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
c4:65:02:5d:d5:70 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
7c:b0:7f:84:c8:9a 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
ea:81:c8:f4:b2:7e 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
3c:2b:b6:77:6b:bd 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
ba:fb:37:ec:b6:a6 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
82:41:34:4c:43:0e 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
56:6d:a2:f8:b6:2f 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
c0:16:8a:96:cc:51 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
d6:d3:82:ad:2f:f1 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
a8:a9:46:b3:5f:ca 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
fa:df:0b:b6:2e:8a 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
d4:c9:a7:97:76:03 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
6a:d0:15:8a:42:2d 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
80:cc:d9:35:de:6c 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
40:be:bc:c5:fd:24 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
60:bf:68:ae:27:af 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
3c:36:c8:a9:25:1b 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
3e:41:c8:45:9a:9f 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
a4:e2:72:c5:02:b1 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
9c:c2:86:1b:af:b3 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
48:fe:59:b5:2e:2e 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
d2:2d:04:35:92:88 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
e2:d3:b2:81:cb:9f 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
38:ad:f7:d5:c0:16 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
ea:49:73:38:b0:3b 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
fa:c6:cb:89:dc:af 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
10:3a:ac:ff:8a:ca 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
ac:20:5b:27:eb:93 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
9a:0f:90:71:70:b7 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
84:ae:6f:11:6b:3d 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
14:eb:df:d2:02:dc 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
b2:cd:2a:46:e7:c4 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
fc:fd:65:9b:02:db 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
ac:c7:31:cc:f4:89 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
80:3f:5d:**:**:** 80:3f:5d:**:**:** mesh1	0	4857	0	0	0	0	0x10	1	1

Only the last route should be present here, the other routes I am not sure where they're coming from. There are many devices on the network but I don't think there are so many.

This has happened just tonight:

28 of August:

iw mesh1 mpath dump
DEST ADDR         NEXT HOP          IFACE	SN	METRIC	QLEN	EXPTIME		DTIM	DRET	FLAGS	HOP_COUNT	PATH_CHANGE
d8:54:0b:7c:20:46 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
ce:43:28:84:44:7e 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
26:75:58:0b:39:18 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
80:3f:5d:**:**:** 80:3f:5d:**:**:** mesh1	0	4857	0	0	0	0	0x10	1	1

After a couple of minutes that I ran the same command, more routes are shown:

iw mesh1 mpath dump
DEST ADDR         NEXT HOP          IFACE	SN	METRIC	QLEN	EXPTIME		DTIM	DRET	FLAGS	HOP_COUNT	PATH_CHANGE
16:dd:0c:a4:ba:aa 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
fc:93:c3:3b:0b:fe 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
90:f4:c0:8f:de:80 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
bc:a1:da:cb:87:a8 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
1e:f7:95:47:4a:b3 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
3a:e2:e6:88:65:fb 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
6c:cd:48:37:af:bc 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
d8:54:0b:7c:20:46 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
ce:43:28:84:44:7e 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
26:75:58:0b:39:18 00:00:00:00:00:00 mesh1	0	0	0	0	1600	4	0x0	0	0
80:3f:5d:**:**:** 80:3f:5d:**:**:** mesh1	0	4857	0	0	0	0	0x10	1	1

And if I leave it on, it gets filled with more and more.

Rebooting or giving wifi down && sleep 2 && wifi up fixes it.

I tried removing the routes by hand but it didn't seem to fix it.

I have been having 2 other issues recently, one with the ethernet mediatek switch, another with wifi clients being disconnected all the time or not being able to connect.

I think that I may try the same build on the latest OpenWrt 19 branch to see if these issues have been around for a while or if they're new.

I may be wrong, but my impression is that WiFi with mediatek drivers is not working well in OpenWrt 21.02.

5GHz driver I'm using:

# readlink /sys/class/net/wlan1/device/driver
../../../../../../bus/pci/drivers/mt7615e

2GHz driver:

# readlink /sys/class/net/wlan0/device/driver
../../../../../../bus/pci/drivers/mt7603e

I am not sure the driver is related though, because the problem can happen regardless of the interface (I have 2 mesh interfaces, one per radio), but I'm adding the info for the sake of completeness.

3 Likes

It happened again today for the 3rd time until I discovered this issue with the routing table, I lose internet connection and I can't reach the router node anymore, so I have to physically go there (or reach it from a management VPN via mobile data) to reload the wifi process, but before doing so I checked the routing table again and same problem, it gets filled with a lot of bogus routes which look complete garbage to me.

Edit: after a few days it happened one more time, there's definitely a problem here.

1 Like

I've noticed the same issue with my mesh, but so far it's only one extra MAC added with a null next hop, and that MAC is actually another wifi client (in this case it happened to be a webcam). The mesh's wifi adapter sometimes goes offline and I have to restart wifi manually every time that happens. I still haven't figured out why this is happening.

I'm on a Linksys EA8500. What's your hardware?

# iw dev mesh0 mpath dump
DEST ADDR         NEXT HOP          IFACE       SN      METRIC  QLEN    EXPTIME         DTIM    DRET    FLAGS   HOP_COUNT       PATH_CHANGE
00:16:6c:9f:XX:XX 00:00:00:00:00:00 mesh0       0       0       0       0       1600    4       0x0     0       0
cc:40:d0:56:XX:XX cc:40:d0:56:XX:XX mesh0       91246   9       0       3700    100     0       0x5     1       1

You can use something like this as a workaround until the issue is fixed:

cat << "EOF" > /root/iw-mpath-fix
iw dev \
| sed -n -e "/^\s*Interface\smesh[0-9]*$/s/^.*\s//p" \
| while read -r IW_DEV
do iw dev "${IW_DEV}" mpath dump \
| sed -n -e "/\s00:00:00:00:00:00\s/s/\s.*$//p" \
| while read -r IW_MAC
do iw dev "${IW_DEV}" mpath del "${IW_MAC}"
done
done
EOF
cat << "EOF" >> /etc/crontabs/root
*/5 * * * * . /root/iw-mpath-fix
EOF
uci set system.@system[0].cronloglevel="9"
uci commit system
/etc/init.d/cron restart

@vgaetera are you sure this works?
I had tried removing all routes manually but it didn't work.
I will try this.

Perhaps you also need to clear the ARP/neighbor cache and/or flush the route cache.

Flushing arp like "ip neigh flush dev br-lan"?
What do you mean for flushing route cache exactly?

1 Like

Looks like there's no need:
https://man.cx/ip-route(8)#heading4

1 Like

The bug happened again and I tested the solution.

I did what was suggested but while trying to flush the routes I removed all routes which obviously made it worse, so I had to run /etc/init.d/network restart which is undesirable because it takes a while to execute.

Here's the script I have just added to my crontab:

FOUND_BUG="0"
PATTERN="/\s00:00:00:00:00:00\s/s/\s.*$//p"

get_routes() {
	iw dev "$1" mpath dump | sed -n -e "$PATTERN"
}

for mac_addr in $(get_routes mesh0); do
	FOUND_BUG="1"
	iw dev mesh0 mpath del "$mac_addr"
done

for mac_addr in $(get_routes mesh1); do
	FOUND_BUG="1"
	iw dev mesh1 mpath del "$mac_addr"
done

if [ "$FOUND_BUG" == "1" ]; then
	ip neigh flush dev br-lan
	logger "Found mesh bug, mpath routes and ARP cleared"
fi

The logger call should help me determine if this script is helping out at all.

If the problem is fixed by this script, I should not notice the problem anymore but I should find these log lines in the logs.

If the problem isn't fixed by this script I'll find out because the internet connection will stop working, in that case the next thing I could try is to add wifi reload, which is also not great because it would cause the mesh links to go down for 10-20 seconds, but if you have any better idea please let me know.

Question: what's the best way to report this bug? I guess we should find out if it's something with the Linux kernel or with OpenWrt.

1 Like

Try to isolate the issue as much as possible to figure out how to handle it.
I'm just coding here in my spare time and have no experience with mesh. :sweat_smile:

A few minutes ago the log line appeared but I didn't notice any issue. I will wait to see it a few more times before claiming the script shared in my previous post works.

BTW: I'm running the script every 20 seconds (using sleep 20 && ... in the crontab entry).

1 Like

My hardware is not a Linksys EA8500, it's a Winstars device built for mesh, it's not officially supported yet. However, I doubt this is a hardware related issue. It took me a long time to figure out what was really happening, probably other people are having this issue and haven't figured out what's going on yet.

One question: are you using standard 802.11s routing protocol or another routing protocol (OLSRd2, batman-adv, babel, etc)?

I wonder if this issue happens only with 802.11s or any other routing deamon as well.

Yes, I'm using the standard 802.11.s. I also think it's a software issue, I checked one of the routers and found this suspicious crash in the kernel log around the time the mesh went down. See if you find anything similar.

What's weird is that even if it crashes for any reason, it doesn't recover and the mesh interface remains disabled.

[ 7155.780773] do_page_fault(): sending SIGSEGV to wpa_supplicant for invalid read access from 0000006c
[ 7155.790114] epc = 77b9dfd5 in libubus.so.20210630[77b9c000+13000]
[ 7155.796416] ra  = 77b9dfcd in libubus.so.20210630[77b9c000+13000]
[ 7155.808275] br-lan: port 4(mesh0) entered disabled state
[ 7156.601675] device wlan0-1 left promiscuous mode
[ 7156.606572] br-lan: port 3(wlan0-1) entered disabled state
[ 7156.613651] ath10k_pci 0000:00:00.0: mac flush null vif, drop 0 queues 0xffff
[ 7156.623382] ath10k_pci 0000:00:00.0: peer-unmap-event: unknown peer id 0
[ 7156.630312] ath10k_pci 0000:00:00.0: peer-unmap-event: unknown peer id 0
[ 7156.694691] ath10k_pci 0000:00:00.0: peer-unmap-event: unknown peer id 1
[ 7156.701657] ath10k_pci 0000:00:00.0: peer-unmap-event: unknown peer id 1
[ 7157.148972] br-lan: port 4(mesh0) entered disabled state
[ 7157.162280] device mesh0 left promiscuous mode
[ 7157.166836] br-lan: port 4(mesh0) entered disabled state
[ 7159.487549] ath10k_pci 0000:00:00.0: 10.4 wmi init: vdevs: 16  peers: 48  tid: 96
[ 7159.495184] ath10k_pci 0000:00:00.0: msdu-desc: 2500  skid: 32
[ 7159.570803] ath10k_pci 0000:00:00.0: wmi print 'P 48/48 V 16 K 144 PH 176 T 186  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[ 7159.581933] ath10k_pci 0000:00:00.0: wmi print 'free: 31080 iram: 23028 sram: 9596'
[ 7159.941183] ath10k_pci 0000:00:00.0: rts threshold -1
[ 7159.946840] ath10k_pci 0000:00:00.0: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[ 7159.963577] br-lan: port 3(wlan0-1) entered blocking state
[ 7159.969292] br-lan: port 3(wlan0-1) entered disabled state
[ 7159.975154] device wlan0-1 entered promiscuous mode
[ 7160.377481] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0-1: link becomes ready

Is that dmesg output?
I haven't checked that when the issue happens yet, I usually check at the output of logread which I also collect with syslog and hence is easier for me to check postmortem.

I added some info to the initial post regarding drivers I'm using:

5GHz driver I'm using:

# readlink /sys/class/net/wlan1/device/driver
../../../../../../bus/pci/drivers/mt7615e

2GHz driver:

# readlink /sys/class/net/wlan0/device/driver
../../../../../../bus/pci/drivers/mt7603e

I am not sure the driver is related though, because the problem can happen regardless of the interface (I have 2 mesh interfaces, one per radio), but I'm adding the info for the sake of completeness.

I also wanted to give an update regarding the script shared earlier, I see in the log that is being triggered often:

Sep 16 12:25:20 Mesh-Router root: Found mesh bug, mpath routes and ARP cleared
Sep 16 19:08:21 Mesh-Router root: Found mesh bug, mpath routes and ARP cleared
Sep 17 00:57:20 Mesh-Router root: Found mesh bug, mpath routes and ARP cleared
Sep 17 00:57:40 Mesh-Router root: Found mesh bug, mpath routes and ARP cleared
Sep 17 00:58:00 Mesh-Router root: Found mesh bug, mpath routes and ARP cleared
Sep 17 04:44:00 Mesh-Router root: Found mesh bug, mpath routes and ARP cleared

I remember very clearly that on Sep 17 00:57:20 I noticed issues with the connection, however, it didn't stop completely from working, it did suck for a while though, but surely better than a persistent black hole!

1 Like

Yes, that's in the dmesg output. It might be a bug in hostapd related to 802.11s. Openwrt is pretty far behind with upstream updates from hostapd (currently it's pinned to a version from May 2021). You can try, if you have time, to pull the latest hostapd code, reapply the openwrt patches and rebuild.

I looks like it's happening more often for you. Mine does this only a few times after I reboot the mesh APs, then things settle and it can work for weeks.

It seems there hasn't been a stable hostapd release for a while.
What if it's a bug in the implementation of 802.11s in Linux?
The question I am asking myself is wether we are the only ones having this issue or not.

I don't know - I doubt it. At this point I suspect hostapd, but it's really hard for me to repro this. For example it hasn't happened in more than a week now.

The number of people using 802.11s with openwrt is probably small and for most it's pretty hard to diagnose this kind of bugs.

Freifunk, LibreMesh and other similar communities use 802.11s regularly, but so far I didn't see them complaining about this issue. Maybe it only affects the default HWMP routing protocol.

The amount is small because many do not know that this is a function.

Unfortunately the script is not a definitive solution, recently it happened to me that the bogus routes were being added faster than the script could remove them. So unless wifi is stopped and turned on again the script is not 100% effective. This may be done if the script detects the same failure in a short time span.