Dropped ethernet route on an 802.11s gateway node

I've noticed that my workstation (zephyr), which is connected over WLAN, occasionally either has awful network latency or cannot ping my main OpenWrt router at all.

The setup is like this:
Workstation <--802.11ac--> mesh4 <--ethernet--> openwrt (main router) <--WAN--> ISP.

mesh4 is the gateway node for an 802.11s mesh, but the connection from my workstation does not use the mesh; it uses the wired backhaul from mesh4 to openwrt.

When the bad latency starts, pings from zephyr to openwrt balloon to 400-600 ms, or openwrt becomes unreachable altogether. I'm still able to ping mesh4 from zephyr, however.

On mesh4, this is in the logs:

Tue Oct 22 13:40:11 2024 kern.info kernel: [767977.091916] br-lan: port 1(eth0) neighbor 7fff.0c:80:63:5a:18:14 lost
Tue Oct 22 13:40:11 2024 kern.info kernel: [767977.098460] br-lan: topology change detected, propagating
Tue Oct 22 13:40:11 2024 kern.info kernel: [767977.104792] br-lan: port 3(phy1-mesh0) received tcn bpdu
Tue Oct 22 13:40:11 2024 kern.info kernel: [767977.110191] br-lan: topology change detected, propagating
Tue Oct 22 13:40:12 2024 kern.info kernel: [767978.156175] br-lan: port 3(phy1-mesh0) received tcn bpdu
Tue Oct 22 13:40:12 2024 kern.info kernel: [767978.161581] br-lan: topology change detected, propagating

The MAC address of the LAN interface on openwrt is 0C:80:63:5A:18:14.
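
For reference, the "neighbor" value in the first log line is a bridge ID in the kernel's priority.MAC form (priority in hex, then the bridge's MAC), so it can be split and compared against the MAC quoted above. A minimal sketch in POSIX shell; the function names are made up for illustration and are not part of any tool:

```sh
#!/bin/sh
# Split a Linux bridge ID of the form "<priority-hex>.<mac>".
# bridge_id_priority and bridge_id_mac are illustrative names.

bridge_id_priority() {
    # Text before the first dot is the priority in hex; print it in decimal.
    echo $(( 0x${1%%.*} ))
}

bridge_id_mac() {
    # Text after the first dot is the MAC; lower-case it for comparison.
    printf '%s\n' "${1#*.}" | tr 'A-F' 'a-f'
}

neighbor='7fff.0c:80:63:5a:18:14'    # from the "neighbor ... lost" log line
openwrt_lan_mac='0C:80:63:5A:18:14'  # LAN MAC of openwrt, as stated above

lost_mac=$(bridge_id_mac "$neighbor")
want_mac=$(printf '%s\n' "$openwrt_lan_mac" | tr 'A-F' 'a-f')

echo "priority: $(bridge_id_priority "$neighbor")"
if [ "$lost_mac" = "$want_mac" ]; then
    echo "lost neighbor is openwrt"
else
    echo "lost neighbor is some other bridge"
fi
```

With the values from this thread, the lost neighbor is openwrt's bridge at priority 32767 (0x7fff), so the log line means mesh4 stopped hearing openwrt's BPDUs on eth0.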

On both mesh4 and openwrt, I have enabled STP on br-lan.

So, why does mesh4 lose its route to openwrt on br-lan? Nothing changed in the actual physical connections. They're both plugged into a dumb switch.

Why do you need STP if there aren't loops in the network?
Will STP through your dumb switch work as expected?
I've actually had poor experiences mixing STP-capable devices with non-STP devices.

I haven't used vanilla 802.11s with a gateway.

Is this vanilla 802.11s or are you running something on top of that?

I think another way to start debugging would be to post the config (with any personal or sensitive information redacted) along with a block diagram of the whole network.

These symptoms look like a classic "mesh-bridge" loop storm (different to a "bridge loop").

You do not give much information about your network, but:

The node name "mesh4" implies you have 4 mesh nodes.

The "wired backhaul from mesh4 to openwrt" indicates you possibly have non-mesh segments in your mesh backhaul.

My conclusions could be wrong, but the indications are you have a mesh-bridge storm that builds, generating increasing latency until the network stops working.

STP will not help on its own with a mesh-bridge loop storm. This is because STP works at the bridge level, but mesh works at the interface level. The response of STP would be exactly as you see in your logs.

If I am correct, then you need to do one of the following:

  • Remove any non-mesh segments in your mesh backhaul
  • Install either mesh11sd or batman-adv, as both of these have built-in mesh-loop storm mitigation. The one you choose depends on what you are trying to achieve.

Note some terminology, so we are talking about the same things:

  • mesh gateway - a mesh node that has a downstream network connection, e.g. wireless-ap, ethernet-lan etc.
  • mesh portal - a mesh node that has an upstream network connection, e.g. a wan connection to an isp router.
  • A mesh portal can also be a mesh gateway, e.g. it has both an upstream wan link and downstream wireless-ap and/or ethernet-lan links.

I was getting loops from my mesh network (i.e. log messages saying it received a packet with its own address as source address), so after searching I enabled STP on any bridges where a mesh intersected with the wired LAN (in my case, mesh4 and openwrt).

Here's a network diagram.

You can see I have two separate meshes, as I had an older batman-adv mesh with VLANs to isolate my IoT devices.

I installed batman-adv on all the home mesh nodes, which seems to have improved the overall stability of the network. The home mesh has mesh4 as the mesh gateway, and it is a batman-adv Server, while all the other nodes are Clients. I'm still getting weird routes, however.

E.g. from my workstation zephyr connected to AP mesh4 I can't reliably ping mesh1.

❯ ping mesh1
PING mesh1.techne.net (192.168.0.41): 56 data bytes
64 bytes from 192.168.0.41: icmp_seq=0 ttl=64 time=10.383 ms
92 bytes from 192.168.0.47: Redirect Host(New addr: 192.168.0.41)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 90ed   0 0000  3f  01 693c 192.168.0.6  192.168.0.41

Request timeout for icmp_seq 1
92 bytes from 192.168.0.47: Redirect Host(New addr: 192.168.0.41)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 b5de   0 0000  3f  01 444b 192.168.0.6  192.168.0.41

Request timeout for icmp_seq 2
92 bytes from 192.168.0.47: Redirect Host(New addr: 192.168.0.41)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 1386   0 0000  3f  01 e6a3 192.168.0.6  192.168.0.41

Request timeout for icmp_seq 3
92 bytes from 192.168.0.47: Redirect Host(New addr: 192.168.0.41)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 2b6e   0 0000  3f  01 cebb 192.168.0.6  192.168.0.41

92 bytes from 192.168.0.47: Destination Host Unreachable
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 5400 90ed   0 0000  3f  01 693c 192.168.0.6  192.168.0.41

^C
--- mesh1.techne.net ping statistics ---
5 packets transmitted, 1 packets received, 80.0% packet loss
round-trip min/avg/max/stddev = 10.383/10.383/10.383/0.000 ms
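
To make the pattern in that output easier to see, here is a small sketch that tallies the ICMP redirect and unreachable notices per sending address from saved ping output. The function name summarize_icmp is made up for illustration, and the matching assumes the message wording shown above:

```sh
#!/bin/sh
# Count ICMP redirect / unreachable notices per sending address in ping
# output read from stdin. summarize_icmp is an illustrative name.

summarize_icmp() {
    awk '
        /^[0-9]+ bytes from .*: (Redirect Host|Destination Host Unreachable)/ {
            sender = $4
            sub(/:$/, "", sender)   # strip the trailing colon
            kind = ($5 == "Redirect") ? "redirect" : "unreachable"
            count[sender " " kind]++
        }
        END {
            for (k in count) print k, count[k]
        }
    ' | sort
}

# The relevant lines from the ping run above.
summarize_icmp <<'EOF'
64 bytes from 192.168.0.41: icmp_seq=0 ttl=64 time=10.383 ms
92 bytes from 192.168.0.47: Redirect Host(New addr: 192.168.0.41)
Request timeout for icmp_seq 1
92 bytes from 192.168.0.47: Redirect Host(New addr: 192.168.0.41)
Request timeout for icmp_seq 2
92 bytes from 192.168.0.47: Redirect Host(New addr: 192.168.0.41)
Request timeout for icmp_seq 3
92 bytes from 192.168.0.47: Redirect Host(New addr: 192.168.0.41)
92 bytes from 192.168.0.47: Destination Host Unreachable
EOF
```

Every notice comes from 192.168.0.47, and each redirect names the ping target itself as the new address. One possible reading is that the frames for 192.168.0.41 are being delivered to the host at 192.168.0.47 at layer 2, which then tells zephyr the target is directly reachable.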

I can ping mesh1 from a terminal on openwrt and on mesh4, but not from mesh7. A traceroute from zephyr shows it's going through mesh7.

❯ traceroute mesh1
traceroute to mesh1.techne.net (192.168.0.41), 64 hops max, 52 byte packets
 1  192.168.0.47 (192.168.0.47)  3.216 ms  1.169 ms  1.717 ms
 2  192.168.0.47 (192.168.0.47)  3163.216 ms !H  3352.484 ms !H  3120.784 ms !H

On mesh7, the default route appears correct:

root@mesh7:~# ip route
default via 192.168.0.1 dev br-lan
192.168.0.0/16 dev br-lan scope link  src 192.168.0.47

I am just very confused about why zephyr, connected to AP mesh4, cannot ping mesh1 when mesh4 can. I also don't know why any traffic from zephyr is going through mesh7. All the nodes are reachable from openwrt.

OK, I have limited experience with batman-adv. I've not had great success with mixed wired/wireless setups.

So are all the nodes running batman? Or is it that the legacy network is batman, whilst the new one is pure 802.11s?

As of today, both meshes are on (separate) batman-adv. Yesterday, the home mesh was vanilla 802.11s.


Awesome. Thanks for clarifying.

See my response upthread for details on my network topology, but I did not have non-mesh backhauls on the mesh nodes when I was seeing the behavior that indicated a mesh-bridge storm.

It seems mesh7 and mesh1 can communicate directly.
If so, the two mesh backhauls must have the same mesh ID and key.

Plain 802.11s, batman-adv and mesh11sd are all concerned only with layer 2 communications. The backhaul neither knows nor cares about layer 3 and above.

If mesh7 and mesh1 can communicate directly (even if only sometimes), then the wired link between mesh4 and "openwrt" will become the source of a mesh-bridge loop.

Maybe you could share the wireless configs of both mesh7 and mesh1 using:

uci show wireless

Also, what does the following reveal (on mesh1 and mesh7):

iw dev [mesh_interface] mpath dump

where [mesh_interface] is the mesh interface name.

The two meshes have different IDs, are on different channels, and have different security keys.

mesh1:

root@mesh1:~# uci show wireless
wireless.radio0=wifi-device
wireless.radio0.type='mac80211'
wireless.radio0.path='pci0000:00/0000:00:00.0'
wireless.radio0.channel='36'
wireless.radio0.band='5g'
wireless.radio0.htmode='VHT80'
wireless.radio0.cell_density='0'
wireless.radio0.country='US'
wireless.radio1=wifi-device
wireless.radio1.type='mac80211'
wireless.radio1.path='platform/ahb/18100000.wmac'
wireless.radio1.band='2g'
wireless.radio1.htmode='HT20'
wireless.radio1.channel='11'
wireless.radio1.cell_density='0'
wireless.radio1.country='US'
wireless.wifinet2=wifi-iface
wireless.wifinet2.device='radio0'
wireless.wifinet2.mode='mesh'
wireless.wifinet2.encryption='sae'
wireless.wifinet2.mesh_id='iot-mesh'
wireless.wifinet2.mesh_rssi_threshold='0'
wireless.wifinet2.key='<redacted>'
wireless.wifinet2.network='batmesh'
wireless.wifinet2.mesh_fwding='0'
wireless.wifinet2.disabled='0'
wireless.wifinet1=wifi-iface
wireless.wifinet1.device='radio1'
wireless.wifinet1.mode='ap'
wireless.wifinet1.ssid='mitchelliot'
wireless.wifinet1.encryption='psk2'
wireless.wifinet1.key='<redacted>'
wireless.wifinet1.network='IoT'
wireless.wifinet1.disabled='0'
wireless.wifinet3=wifi-iface
wireless.wifinet3.device='radio0'
wireless.wifinet3.mode='ap'
wireless.wifinet3.ssid='mitchell-test'
wireless.wifinet3.encryption='psk2'
wireless.wifinet3.key='<redacted>'
wireless.wifinet3.ieee80211r='1'
wireless.wifinet3.mobility_domain='531f'
wireless.wifinet3.ft_psk_generate_local='1'
wireless.wifinet3.network='lan'
wireless.wifinet3.ft_over_ds='0'
wireless.wifinet3.disabled='0'
wireless.wifinet3.reassociation_deadline='20000'
wireless.wifinet4=wifi-iface
wireless.wifinet4.device='radio1'
wireless.wifinet4.mode='ap'
wireless.wifinet4.ssid='mitchell-test-2.4'
wireless.wifinet4.encryption='psk2'
wireless.wifinet4.key='<redacted>'
wireless.wifinet4.ieee80211r='1'
wireless.wifinet4.mobility_domain='6ca4'
wireless.wifinet4.ft_over_ds='0'
wireless.wifinet4.ft_psk_generate_local='1'
wireless.wifinet4.network='lan'
wireless.wifinet4.disabled='0'
wireless.wifinet4.reassociation_deadline='20000'

mesh7:

root@mesh7:/etc/config# uci show wireless
wireless.radio0=wifi-device
wireless.radio0.type='mac80211'
wireless.radio0.path='platform/18000000.wifi'
wireless.radio0.channel='auto'
wireless.radio0.band='2g'
wireless.radio0.htmode='HE20'
wireless.radio0.cell_density='0'
wireless.radio0.country='US'
wireless.radio0.disabled='1'
wireless.radio1=wifi-device
wireless.radio1.type='mac80211'
wireless.radio1.path='platform/18000000.wifi+1'
wireless.radio1.channel='149'
wireless.radio1.band='5g'
wireless.radio1.htmode='HE80'
wireless.radio1.cell_density='0'
wireless.radio1.country='US'
wireless.wifinet4=wifi-iface
wireless.wifinet4.device='radio1'
wireless.wifinet4.mode='mesh'
wireless.wifinet4.encryption='sae'
wireless.wifinet4.mesh_id='home-mesh'
wireless.wifinet4.mesh_fwding='0'
wireless.wifinet4.mesh_rssi_threshold='0'
wireless.wifinet4.key='<redacted>'
wireless.wifinet4.network='batmesh'
wireless.wifinet4.disabled='0'

And the output of mpath dump.

mesh1:

root@mesh1:~# iw dev phy0-mesh0 mpath dump
DEST ADDR         NEXT HOP          IFACE       SN      METRIC  QLEN    EXPTIME DTIM    DRET    FLAGS   HOP_COUNT     PATH_CHANGE
54:af:97:62:7a:2c 54:af:97:62:7a:2c phy0-mesh0  20206   245     0       0       100     0       0x14    1    184
0c:80:63:5a:18:13 0c:80:63:5a:18:13 phy0-mesh0  31370   565     0       3060    100     0       0x5     1    1680

mesh7:

root@mesh7:~# iw dev phy1-mesh0 mpath dump
DEST ADDR         NEXT HOP          IFACE       SN      METRIC  QLEN    EXPTIME DTIM    DRET    FLAGS   HOP_COUNT     PATH_CHANGE
4a:ed:e6:2a:20:84 4a:ed:e6:2a:20:84 phy1-mesh0  38337   8       0       4360    100     0       0x15    1    4
96:83:c4:54:38:cb 96:83:c4:54:38:cb phy1-mesh0  36510   42      0       4020    100     0       0x15    1    5
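
The FLAGS column in these dumps is a bitmask. Below is a small sketch for decoding it, assuming the bit values of the Linux nl80211 mesh path flags (ACTIVE 0x1, RESOLVING 0x2, SN_VALID 0x4, FIXED 0x8, RESOLVED 0x10). Check those values against nl80211.h for your kernel before relying on them. The function name decode_mpath_flags is made up for illustration:

```sh
#!/bin/sh
# Decode the FLAGS value from "iw dev <iface> mpath dump".
# Bit values are assumed from the Linux nl80211 mesh path flags;
# decode_mpath_flags is an illustrative name.

decode_mpath_flags() {
    flags=$(( $1 ))   # accepts hex such as 0x15 or plain decimal
    out=""
    for pair in 1:ACTIVE 2:RESOLVING 4:SN_VALID 8:FIXED 16:RESOLVED; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( flags & bit )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}

# Flag values seen in the two dumps.
for f in 0x14 0x5 0x15; do
    echo "$f: $(decode_mpath_flags "$f")"
done
```

Under those assumptions, both paths on mesh7 (0x15) are active and resolved, while the path on mesh1 with flags 0x14 and an EXPTIME of 0 is resolved but not currently marked active.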

My network has been stable for the last week, after rebooting my workstation and the routers that were misbehaving. I'm going to assume this must have cleared the bad routes from the various ARP tables. I can ping nodes on both meshes, and each node can ping every other node.