Best practice for multi-AP WLAN, 802.11s, and route loops

Hello all!

I'm struggling to get a WLAN working reliably that:

  • Contains 2 access points connected to a managed switch via ethernet
  • Contains 2 access points utilizing 802.11s to join the larger LAN network
  • Contains a managed switch (see above)

My intention is that my wireless devices will seek the strongest local AP through 802.11r. All the APs should provide access to the LAN. Due to distances involved two of the APs have to rely on 802.11s to join the LAN.

I've enabled STP on br-lan on all devices, and enabled "Loopback Protection" on the switch (the internet seems a little vague on this, but suggests that this setting enables STP/RSTP on the switch as well). I've created an 802.11s network (Access Point(WDS), etc etc), added DAWN to help with AP roaming, and am able to see traffic flowing across the entire LAN. Except... well, except the performance isn't very good. And eventually the switch seems to panic (the LEDs on all the ports blink rapidly in a synchronized pattern) and has to be restarted.

So what I'm struggling with is whether this configuration will actually work. I can understand why a route loop would form, but I was under the impression that STP was meant to prevent this.

The workaround I've used at the moment is to have AP1 advertise "mesh1" and AP2 advertise "mesh2", effectively creating two 802.11s networks. This prevents either of the eth tethered APs from possibly interfering with the MAC routing of the other. And that's cool, I guess, but it seems to be counter to the intention of the mesh. Surely all the mesh network devices should seek whatever route takes them to eth first, and having two eth exit points from the 802.11s mesh only increases the resiliency of the network.

Is there a best practice I'm overlooking? Or is the best practice "every mesh network should have a single eth connection"?

Thanks!

What do you mean by this?

Access point WDS is a very different thing to 802.11s mesh and if you use both between nodes, you will indeed be creating loops.

It would be better if you made a diagram for us to see.....

Ahh. Very good catch. That's what I wrote, but certainly not what I meant to write.

  • AP1 advertises two networks: Access Point ("HouseWifi") and 802.11s ("mesh"). Has eth connection to switch
  • AP2 advertises two networks: Access Point ("HouseWifi") and 802.11s ("mesh"). Does not have eth connection to switch
  • AP3 advertises two networks: Access Point ("HouseWifi") and 802.11s ("mesh"). Has eth connection to switch
  • AP4 advertises two networks: Access Point ("HouseWifi") and 802.11s ("mesh"). Does not have eth connection to switch

So in double-checking everything I found that AP3 was advertising "Access Point (WDS)" and 802.11s ("mesh"). This seems to have been a legacy configuration from when I was following some write-up on having AP4 be a "range extender" rather than a participant in an 802.11s mesh network. From my reading, the WDS configuration will create a Layer 2 route that can potentially contribute to performance issues and to Layer 2 route loops.

I've fixed that, and am now verifying that things are working as expected. In the past it could take up to 24 hours for performance to degrade, so I'm in a holding pattern watching to see what happens.

A minor update: I restarted the clock on watching for bad behavior after disabling "Access Point (WDS)" on AP3. This came after finding a different post by @bluewavenet that recommended setting a minimum received signal strength (-80), setting mesh_gate_announcements to 1 after the mesh network is established, and moving AP2 farther away from AP1 and closer to AP3, such that it now prefers to route through AP3 and its eth connection.

At the moment things appear to be much more stable. I don't know which of the above changes may be contributing, but so far - looking good.

See:

Thanks for the pointer to mesh11sd. I had that installed but had not changed its settings.

I may have identified some artifact that shows that my Wifi performance is degrading when all APs are on the same 802.11s mesh. It seems as though the primary House AP ("AP1" in the diagram) is preferring to route its traffic over the mesh network to AP3 rather than use its own eth interface.

I've installed iperf3 on the main router and on all the mesh APs. This is the output from AP1 (the traffic goes through the switch and to the router where the iperf3 server is running). Test 1 shows the result with the mesh network interface active and chatting with a single peer. Test 2 shows the result with the mesh interface simply turned off.

Connecting to host 192.168.0.1, port 5201
[  5] local 192.168.0.5 port 49836 connected to 192.168.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.23 MBytes  27.1 Mbits/sec    0    115 KBytes
[  5]   1.00-2.00   sec  2.49 MBytes  20.9 Mbits/sec   10   93.3 KBytes
[  5]   2.00-3.00   sec  1.80 MBytes  15.1 Mbits/sec    0   99.0 KBytes
[  5]   3.00-4.00   sec  2.17 MBytes  18.2 Mbits/sec    0    113 KBytes
[  5]   4.00-5.00   sec  2.05 MBytes  17.2 Mbits/sec   10   84.8 KBytes
[  5]   5.00-6.00   sec  2.61 MBytes  21.9 Mbits/sec    0   93.3 KBytes
[  5]   6.00-7.00   sec  2.98 MBytes  24.9 Mbits/sec    0   96.2 KBytes
[  5]   7.00-8.00   sec  2.30 MBytes  19.4 Mbits/sec    0    100 KBytes
[  5]   8.00-9.00   sec  3.11 MBytes  26.1 Mbits/sec    0    106 KBytes
[  5]   9.00-10.00  sec  2.98 MBytes  25.0 Mbits/sec    0    110 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  25.7 MBytes  21.6 Mbits/sec   20             sender
[  5]   0.00-10.01  sec  25.4 MBytes  21.3 Mbits/sec                  receiver

iperf Done.

# iperf3 -c 192.168.0.1
Connecting to host 192.168.0.1, port 5201
[  5] local 192.168.0.5 port 37216 connected to 192.168.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  20.8 MBytes   174 Mbits/sec   70   66.5 KBytes
[  5]   1.00-2.00   sec  23.6 MBytes   198 Mbits/sec   82   63.6 KBytes
[  5]   2.00-3.01   sec  24.4 MBytes   204 Mbits/sec   88   55.1 KBytes
[  5]   3.01-4.01   sec  23.6 MBytes   197 Mbits/sec  101   50.9 KBytes
[  5]   4.01-5.00   sec  22.9 MBytes   194 Mbits/sec   70   50.9 KBytes
[  5]   5.00-6.04   sec  20.3 MBytes   163 Mbits/sec   55   28.3 KBytes
[  5]   6.04-7.07   sec  19.4 MBytes   159 Mbits/sec   74   66.5 KBytes
[  5]   7.07-8.00   sec  23.2 MBytes   208 Mbits/sec   88   65.0 KBytes
[  5]   8.00-9.02   sec  24.1 MBytes   199 Mbits/sec  104   52.3 KBytes
[  5]   9.02-10.11  sec  23.2 MBytes   179 Mbits/sec   89   63.6 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.11  sec   225 MBytes   187 Mbits/sec  821             sender
[  5]   0.00-10.12  sec   225 MBytes   187 Mbits/sec                  receiver

iperf Done.

From ~22Mb/s to ~187Mb/s is a pretty dramatic change (figuring out why 187Mb/s is the high-water mark between two devices connected via a switch and supposedly 1G ethernet ports is a challenge for another day).
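For anyone repeating the comparison, here is a minimal sketch that pulls the sender-side average out of iperf3's summary line and computes the ratio between the two runs. The function names are made up for illustration, and it assumes the summary is reported in Mbits/sec, as it is in both runs above.

```shell
# Print the sender-side average bitrate (Mbits/sec) from iperf3 client output on stdin.
# Assumption: the summary line reports Mbits/sec; other units print nothing.
iperf_sender_mbps() {
    awk '/sender/ { for (i = 2; i <= NF; i++) if ($i == "Mbits/sec") print $(i - 1) }'
}

# Ratio of two bitrates, rounded to one decimal place.
bitrate_ratio() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", fast / slow }'
}

# Summary lines copied from the two runs above.
mesh_on=$(echo '[  5]   0.00-10.00  sec  25.7 MBytes  21.6 Mbits/sec   20             sender' | iperf_sender_mbps)
mesh_off=$(echo '[  5]   0.00-10.11  sec   225 MBytes   187 Mbits/sec  821             sender' | iperf_sender_mbps)

echo "mesh on: $mesh_on Mbits/sec, mesh off: $mesh_off Mbits/sec"
echo "speed-up with mesh off: $(bitrate_ratio "$mesh_on" "$mesh_off")x"
```

On the AP itself the same function can be fed directly, e.g. `iperf3 -c 192.168.0.1 | iperf_sender_mbps`.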

Any idea about what I can check to see why this node (AP1) is preferring to route traffic via mesh rather than ethernet?

I just want to nitpick on the terminology here:
This is a "Switching Loop", see https://en.wikipedia.org/wiki/Switching_loop

On Layer 3, it's a "Routing Loop", see https://en.wikipedia.org/wiki/Routing_loop

To prevent loops on Layer 2, there is https://en.wikipedia.org/wiki/Spanning_Tree_Protocol but it is somewhat arcane and slow, and has multiple issues even in smaller-scale setups. That's why there are "Rapid STP" and "Multiple STP"; even though there are implementations available for Linux, sadly they are not often used. Many issues can, however, be prevented by a different design of the Layer 1 and 2 topology, and for larger deployments by using only Layer 3 and avoiding so-called stretched Layer 2 altogether.

On your use-case I would recommend, either

  • Use single wireless band with native 802.11s or with mesh11sd, or
  • multiple wireless band links and/or Ethernet, and batman-adv (which brings its own "loop avoidance" and is able to prioritize interfaces)

Degradation of throughput is expected. If all mesh nodes are on the same channel, and you have a transmission going from node1 to node2 to node3 (and node1 and node3 don't have a direct link), each node has to wait until the channel is free before it can send. While node2 is receiving data from node1, it has to wait until it is able to transmit that data to node3, and so on. Also, running iperf3 on the AP itself will most of the time give "false" data, because the CPU on most devices is too weak to do both: run iperf3 (generating the data) and push all the traffic through the CPU. If you want to get "better" data, connect 2 laptops via Ethernet to the AP, or, if you want to test wifi, ensure each client is on the expected AP and run, for instance, a simple webserver on a laptop and request data from your phone... something like this.
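As a rough illustration of the channel-sharing point above: if every hop uses the same channel and only one node can transmit at a time, each extra hop re-sends the same data over the same airtime, so usable throughput is at best about the link rate divided by the hop count. This is only a rule of thumb under those assumptions; the 100 Mbit/s link rate and the function name are made up for the example.

```shell
# Crude upper bound on end-to-end throughput over a single-channel mesh path.
# Assumptions: all hops share one channel, only one node transmits at a time,
# and every hop runs at the same link rate (real links will differ).
mesh_throughput_estimate() {
    # $1 = per-hop link rate in Mbit/s, $2 = number of wireless hops
    awk -v rate="$1" -v hops="$2" 'BEGIN { printf "%.1f\n", rate / hops }'
}

mesh_throughput_estimate 100 1   # direct link
mesh_throughput_estimate 100 2   # node1 -> node2 -> node3
mesh_throughput_estimate 100 3   # one more relay in between
```

Real numbers will usually be lower still because of contention and retries.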

(Regarding prioritizing interfaces: I'm not quite sure if mesh11sd is able to do so; batman-adv, however, is.)

Coolio. Thanks for the clarification on the terminology. This level of network depth is not my usual haunt, so I'm learning new things every time someone replies.

I had read that STP was needed to avoid the switching loop, so I enabled it on the br-lan for both APs that are connected via 802.11s and Eth. Good to its word, it has prevented the switching loop, but not in a way I had anticipated (it makes sense that the solution would be "ok, I test the network to see if I have a switching loop and if I find one then I don't switch traffic in that direction," but if this is how STP resolves the conflict, I'm not finding it clearly stated in docs).
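For reference, the first step of how STP resolves this can be sketched in a few lines: the bridges elect a root (lowest bridge ID, compared by priority first and then MAC address), every other bridge keeps its cheapest path towards that root forwarding, and redundant ports are put into blocking. The snippet below only illustrates the root election; the bridge IDs are hypothetical and the function name is made up.

```shell
# Sketch of STP root election: the numerically lowest bridge ID wins.
# A bridge ID is written here as "priority.mac" (hypothetical values below).
elect_root() {
    # Sort by priority as a number, then by MAC as a byte string; keep the lowest.
    LC_ALL=C sort -t . -k 1,1n -k 2,2 | head -n 1
}

# Three hypothetical bridges: two at the common default priority 32768,
# one deliberately configured with a lower priority so that it becomes root.
printf '%s\n' \
    '32768.aabbccddee01' \
    '32768.001122334455' \
    '4096.ffeeddccbb99' | elect_root
```

So "I don't switch traffic in that direction" is roughly right: with two paths between the mesh and the switch, one of them ends up blocked instead of both being used.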

I absolutely realize that traffic routed from one AP to another causes latency and slowdowns. That's why I want multiple APs to have Eth connections to the switch. The more escape paths from the mesh, the better the performance of the overall mesh (in theory).

You're right about the CPU with iperf3. I can absolutely see how CPU saturation would impact performance. I should try running another test while closely monitoring the CPU. If it's blocked on I/O, or thread starved, or something similar - all that could impact the overall performance.

I'm not sure I understand your recommendation of a "single wireless band with native 802.11s." I think that's what I have now, with the complication being two ingress/egress points into the mesh network.

I'll read up on batman-adv and give it a go. I had understood it to be overkill for the size of my small mesh network, but if it works - good enough!

Thanks!

Which docs? OpenWRT's?
I find this book in general pretty good, so maybe have a look at the sub-chapter about STP as well: https://book.systemsapproach.org/internetworking/ethernet.html#spanning-tree-algorithm It could be overkill/far too deep, but I think it's well written, a "gentle" introduction, and the explanations are solid.

What I wanted to say: configure 1 mesh network which is used by all APs, and therefore on a single band, 2.4 GHz or 5 GHz. Using 2 802.11s networks may complicate things IF they are configured on all APs.

(You could however do something like: AP1 has mesh1, AP2 has mesh1 and mesh2, and AP3 has mesh2, but then again, if both mesh-networks are using the same frequency then you will gain nothing...)

When I first encountered and configured batman-adv I found it highly confusing, because everyone was copy/pasting from each other without "good" explanations... Later I realized that nearly everyone sets options which are not needed, or which just repeat the defaults anyway. Here is a minimal config you can use as a starting point if you want to try it out...

/etc/config/network

# We configure the actual batman-adv interface...
config interface            'bat0'
    option  proto           'batadv'
    option  routing_algo    'BATMAN_IV'

# My Archer C7 supports an MTU of 2304 (which is AFAIK the max. possible for wifi, so I use that...)
# One "Hardware Interface" will be used for 2.4 and one is used for 5.0 GHz
config interface            'bat0_hardif_mesh0'
    option  proto           'batadv_hardif'
    option  master          'bat0'
    option  mtu             '2304'

config interface            'bat0_hardif_mesh1'
    option  proto           'batadv_hardif'
    option  master          'bat0'
    option  mtu             '2304'

# Example Network-Device / -Interface stanza
# The relevant part is, that the batman device/interface is attached to the/a bridge
config device
    option  name            'br-vlan16'
    option  type            'bridge'
    list    ports           'eth0.16'
    list    ports           'bat0.16'

config interface            'vlan16'
    option  device          'br-vlan16'
    ....
    option  proto           'static'
    option  ipaddr          '192.168.16.1/24'
    ....
    list    ip6ifaceid      '::1'
    list    ip6ifaceid      'eui64'
    option  ip6assign       '64'
    ....


# /etc/config/wireless

config wifi-device 'radio0'
    ....

config wifi-iface 'mesh0'
    option  device      'radio0'
    option  ifname      'mesh0'
    option  network     'bat0_hardif_mesh0' # That's the name we use in network config for the interface
    option  mode        'mesh'
    option  mesh_fwding '0'
    option  mesh_id     'MyMeshFooBarBaz'
    option  encryption  'psk2+ccmp'  # You will need hostapd / mesh packages which includes encryption, I would have to look it up which are needed....
    option  key         'InsertRandomStringHere'

# Same goes for the other radio...
config wifi-device 'radio1'
    ....

config wifi-iface 'mesh1'

    option  device      'radio1'
    option  ifname      'mesh1'
    option  network     'bat0_hardif_mesh1'
    option  mode        'mesh'
    option  mesh_fwding '0'
    option  mesh_id     'MyMeshFooBarBaz'
    option  encryption  'psk2+ccmp'
    option  key         'InsertRandomStringHere'

Maybe this gives you some pointers....

Mhm, hard to tell. I have seen local networks (including mine) going from only 2 or 3 mesh nodes to multiple dozen. The only issue I'm aware of is that some folks tried to build batman-adv networks with over one hundred nodes, interconnected via VPN (over WAN) too, and then wondered why the broadcast and multicast traffic on layer 2 was killing their network... but if you stick with local area networks and below roughly 50 to 100 nodes there are no show stoppers... The good thing: A network can't be too small :slight_smile:

Good luck. (If you have followup questions feel free to dump them.)

PS: @bluewavenet could probably recommend configs if you want to stay with only 802.11s and/or mesh11sd...

Thanks for the sample network file. That went a long way to helping me get a functional batman-adv configuration running. Ultimately the VLAN configuration seemed to be overkill for my needs and I ended up with a configuration like:

/etc/config/wireless

[...]
config wifi-iface 'mesh0'
        option device 'radio0'
        option ifname 'mesh0'
        option network 'bat0_hardif_mesh0'
        option mode 'mesh'
        option mesh_fwding '0'
        option mesh_id 'MeshNet'
        option encryption 'psk2+ccmp'
        option key 'becc443528fa156129651762'

config wifi-iface 'mesh1'
        option device 'radio1'
        option ifname 'mesh1'
        option network 'bat0_hardif_mesh1'
        option mode 'mesh'
        option mesh_fwding '0'
        option mesh_id 'MeshNet'
        option encryption 'psk2+ccmp'
        option key 'becc443528fa156129651762'
        option disabled '1'

/etc/config/network

config interface 'loopback'
        option device 'lo'
        option proto 'static'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'

config globals 'globals'
        option ula_prefix 'fd0e:e528:3936::/48'

config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'eth0'
        list ports 'bat0.1'
        option stp '1'
        option hello_time '4'

config interface 'lan'
        option device 'br-lan'
        option proto 'dhcp'
        option hostname '*'

# We configure the actual batman-adv interface...
config interface            'bat0'
    option  proto           'batadv'
    option  routing_algo    'BATMAN_IV'
    option  aggregated_ogms '1'
    option  ap_isolation '0'
    option  bonding '0'
    option  fragmentation '1'
    option  gw_mode 'off'
    option  log_level '0'
    option  orig_interval '1000'
    option  bridge_loop_avoidance '1'
    option  distributed_arp_table '1'
    option  multicast_mode '1'
    option  network_coding '0'
    option  hop_penalty '30'
    option  isolation_mark '0x00000000/0x00000000'

config interface            'bat0_hardif_mesh0'
    option  proto           'batadv_hardif'
    option  master          'bat0'
    option  mtu             '2304'

config interface            'bat0_hardif_mesh1'
    option  proto           'batadv_hardif'
    option  master          'bat0'
    option  mtu             '2304'

Rather than route a VLAN over the interface as in your example, I chose to simply have the mesh network be bridged to all the interfaces on the AP. Future me may want to play with the VLAN configuration a bit, but for now all interfaces being bridged suits my needs.

Unfortunately, after getting it running I was again confronted with switching loops and other odd behaviors. These were immediately resolved if I reduced the APs with ethernet connections to 1, or if I removed them from the mesh network (effectively severing the switching loop).

I spent some time trying different batman-adv options, etc to no avail. Eventually I found the bottom of the OpenWRT batman-adv page with this note:

Multiple Nodes Bridging to Same Network Segment

Use of multiple nodes bridged to the same wired networks has not been deeply examined at this time. STP might be sufficient as a “poor-man's” approach, though there have been cases with other networking protocols where bridge loops involving the on-device switches did not seem to be detected and resolved by STP alone.

A quick test with two of the OpenWrt nodes (of the five deployed and participating in batman-adv) connected to the wired network through Cisco SG300-series switches had a “fail-over” occur after unplugging the “active” cable in ~90 seconds, with disturbances evident for another half minute. The output of batctl cl (“claim table”) appears to empty and update on about the same time scale. STP in the OpenWrt bridges has a hello_time of 2.00 s, max_age of 20.00 s, and forward_delay of 2.00 s, suggesting an STP cut-over time of ~26 seconds. Watching the claim table on one node while bringing down bat0 on the “preferred gateway” (without changing Ethernet connectivity) showed a one-minute delay before those associated with the down node were removed. As a result, at this time the delays are believed to be primarily due to batman-adv operation.

There is a batman-adv feature around advertising gateways. It appears to be designed for larger-scale deployments and seems to work by moderating DHCP assignments, rather than by dynamically routing packets with the mesh-routing logic itself.
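The ~26 second figure quoted above can be reproduced with simple arithmetic. The wiki text does not show its working, so the formula here is an assumption: one hello interval, plus max_age for stale information to expire, plus two forward_delay periods (listening, then learning). The function name is made up for illustration.

```shell
# Estimate STP cut-over time in whole seconds.
# Assumed formula (not stated in the quoted text): hello_time + max_age + 2 * forward_delay
stp_cutover_estimate() {
    # $1 = hello_time, $2 = max_age, $3 = forward_delay
    echo $(( $1 + $2 + 2 * $3 ))
}

# Timer values as quoted from the OpenWrt wiki: 2 s, 20 s, 2 s.
stp_cutover_estimate 2 20 2
```

With the quoted timers this prints 26, which is well short of the ~90 seconds observed in that test.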

So ultimately it looks as though batman-adv suffers from the same loop problem that 802.11s was wrestling with and, at least for now, the best path available to me is to have the mesh network (whether 802.11s or batman-adv) have only one ethernet gateway.

Thanks again for the advice and pointers.

And now that I've written that, it seems to be working correctly. Bridge Loop is now not happening.

Please. (!!!!) Compare these with the defaults. I would bet, that most of them are the default value anyway.

And it stopped working again. What's interesting about the failure, when it happens, is that I can PING and SSH to all of my internal hosts EXCEPT for my router (the router does not have Wifi enabled and is not a participant in batman-adv node system). When the failure occurs all communication to the router stops. As soon as I disable the mesh* interfaces on one of the two wired APs communication with the router resumes.

You're right that I began re-adding configuration options from your basic template. Variations from defaults are:
network_coding 0 (def 1)

So, yeah, I can clean up the readability of the stanza by removing every line that only restates a default. On the other hand, explicitly setting the value to the default (in theory) does no harm and increases knowledge of what the working configuration is.

Reading through these again, and the problem I see crop up (access to router, not a member of the batman-adv group), makes me think that this is creating an ARP problem. I wonder if distributed_arp_table is breaking the local network.

I would propose that every AP and router is part of the batman-adv mesh.
batman builds a (big) switch (layer 2), and every switch/bridge needs to be aware of the ARP table.
In the end it's one of the main goals to propagate this information (efficiently): which MAC can be reached where, and how.
I would also not disable Distributed ARP Table, see https://www.open-mesh.org/projects/batman-adv/wiki/DistributedArpTable

In addition, I would disable STP (i.e. don't configure it; the OpenWRT default is 0), and use only the batman-adv loop avoidance, which is enabled by default: https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance

My comment on "remove everything which is using the defaults anyway" was meant in the sense of:
Even if it does not hurt to explicitly configure the defaults, I would assume that you will not follow the change log of batman-adv :wink: And as the defaults are quite good, and there is rarely a (good) reason to "tune" batman-adv in small deployments, you will miss new defaults from upstream when they come. And I want to stress my point: it feels like 90% of the web content regarding batman-adv is just blindly copy-pasted from the same one or two sources without thinking or checking. And this helps nobody...

Back to your mesh: Could you sketch / explain briefly your topology?

  • Which devices? (How many Routers and WAP)
  • How they are interconnected (wireless or ethernet only; or both), and
  • Who has batman configured if you (for some reason) have it not running on all routers and access points?

Also: if you play around, like unplugging cables, disabling wireless, powering off a mesh node, etc., do not expect immediate failover. It can take a moment or a few. It depends on your topology and what not. Personally I would expect that it can take up to 10 or 20 seconds. If it takes longer or doesn't come back at all, I would start to dig around.

I hope this somehow helps you...

Is this topology still up to date?

The switch: What is it? (Model and firmware) Do you still have STP/RSTP enabled? It may be interfering...

Here is the updated topology.

The switch is a TP-Link TL-SG116E v2. It is running the most up to date factory firmware, 1.0.0 Build 20210512 Rel.40890. "Loop prevention" is enabled on the switch. It's a little unclear what that does, but various sources on the Internet claim that this is STP/RSTP.

The only devices that have batman-adv installed and configured are the four wireless APs. I haven't looked into whether it's actually possible/meaningful to run batman-adv on the network devices (in my case, the router) that do not have a wireless interface.

When I was messing with 802.11s the managed switch, and an unmanaged switch, would both exhibit the same behavior - after a period of time (sometimes days) all the ports would begin blinking rapidly in unison. The only way to recover was to unplug all the Cat-6, then power cycle the switch, then plug in the Cat-6 one at a time.

With batman-adv that behavior has changed. When AP1 and AP3 (both connected to the switch via Cat-6) are participating in the mesh the overall mesh performance is poor, and eventually the router stops responding to pings (odd, since its only connection to the other devices is via the Cat-6 connection on the switch). All the other LAN devices still respond to pings, can be logged into, etc. Just the router is cut out. If I then unplug the Cat-6 of AP1 OR if I disable the batman-adv mesh on AP1, almost immediately (less than a second) the router begins responding to pings again.

Well here's new news. Things have been reasonably stable over the last week (a good thing since I was off site and wouldn't be able to do anything if it hadn't been). This morning I posted the above. And just now the connection to the router (connected to the switch via Cat-6, IP addy 192.168.0.1) stopped working.

Current network topology: AP1 is not participating in the batman-adv mesh. AP 2, 3, 4 are, with AP3 connected to the switch via Cat-6.

Things tried to restore ping to router:

  1. Unplug router Cat-6 from switch, plug back in. Failed.
  2. Unplug AP1 Cat-6 from switch, plug back in. Failed. (Why AP1? Because it has directly contributed to this problem before).
  3. Unplug AP3 Cat-6 from switch, plug back in. Worked. Full access to Router is restored.

I've disabled "Loop prevention" on the switch, and have made no other changes to the topology. If this configuration is stable then I'll try adding AP1 back to the mesh and see if that was contributing in the past. I may also try switching out the managed switch with a dumb switch just to see if its "smarts" is a lie.

And connectivity to the Router (192.168.0.1) dropped again. Unplugging the ETH from AP3, counting to five, and plugging it back in restores comms to the router. The router logs contain:

Mon Jun 26 12:35:37 2023 kern.warn kernel: [2973901.546226] br-lan: received packet on lan4 with own address as source address (addr:94:10:3e:80:18:d7, vlan:0)
Mon Jun 26 12:35:41 2023 kern.warn kernel: [2973905.212554] br-lan: received packet on lan4 with own address as source address (addr:94:10:3e:80:18:d7, vlan:0)

At that same time stamp AP1 threw this error:

Mon Jun 26 12:35:36 2023 kern.warn kernel: [604709.708171] br-lan: received packet on eth0.1 with own address as source address (addr:d8:07:b6:20:ed:72, vlan:0)
Mon Jun 26 12:35:36 2023 kern.warn kernel: [604709.719217] br-lan: received packet on eth0.1 with own address as source address (addr:d8:07:b6:20:ed:72, vlan:0)
Mon Jun 26 12:35:36 2023 kern.warn kernel: [604709.734606] br-lan: received packet on eth0.1 with own address as source address (addr:d8:07:b6:20:ed:72, vlan:0)

And this is AP3 (which I unplugged, then plugged back in, and which restored network connectivity for the router):

Mon Jun 26 12:39:08 2023 kern.info kernel: [1447147.582180] eth0: link down
Mon Jun 26 12:39:08 2023 kern.info kernel: [1447147.587455] br-lan: port 1(eth0) entered disabled state
Mon Jun 26 12:39:16 2023 kern.info kernel: [1447154.863418] eth0: link up (1000Mbps/Full duplex)
Mon Jun 26 12:39:16 2023 kern.info kernel: [1447154.868980] br-lan: port 1(eth0) entered blocking state
Mon Jun 26 12:39:16 2023 kern.info kernel: [1447154.874643] br-lan: port 1(eth0) entered listening state
Mon Jun 26 12:39:16 2023 kern.info kernel: [1447155.009998] br-lan: port 1(eth0) received tcn bpdu
Mon Jun 26 12:39:16 2023 kern.info kernel: [1447155.015201] br-lan: topology change detected, propagating
Mon Jun 26 12:39:24 2023 kern.info kernel: [1447162.941776] br-lan: port 1(eth0) entered learning state
Mon Jun 26 12:39:32 2023 kern.info kernel: [1447171.261757] br-lan: port 1(eth0) entered forwarding state
Mon Jun 26 12:39:32 2023 kern.info kernel: [1447171.267550] br-lan: topology change detected, propagating

No messages at all that indicate a problem. Just the messages indicating that I unplugged ETH for 8 seconds, then plugged it back in and the interface came alive again.
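To put numbers on the port state changes, here is a small sketch that reads the kernel timestamps (the value in square brackets) and prints how long the port spent in each STP state. The function name is made up, and the sample lines are copied from the AP3 log above.

```shell
# Print the time spent between consecutive bridge port states,
# using the kernel timestamp in [brackets] on each "entered ... state" line.
stp_state_durations() {
    awk '/entered .* state/ {
        open = index($0, "["); close_ = index($0, "]")
        ts = substr($0, open + 1, close_ - open - 1) + 0   # seconds since boot
        state = $(NF - 1)                                  # word before "state"
        if (prev_state != "")
            printf "%s -> %s: %.1f s\n", prev_state, state, ts - prev_ts
        prev_state = state; prev_ts = ts
    }'
}

stp_state_durations <<'EOF'
Mon Jun 26 12:39:16 2023 kern.info kernel: [1447154.874643] br-lan: port 1(eth0) entered listening state
Mon Jun 26 12:39:24 2023 kern.info kernel: [1447162.941776] br-lan: port 1(eth0) entered learning state
Mon Jun 26 12:39:32 2023 kern.info kernel: [1447171.261757] br-lan: port 1(eth0) entered forwarding state
EOF
```

For these lines it reports roughly 8 seconds in listening and another 8 in learning, so about 16 seconds pass after link-up before AP3's eth0 forwards traffic again. On a live device the input could come from `logread` instead of the pasted sample.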

Note that the switch itself does not indicate any problems with either port during this period, and APs 2, 3 and 4 are reporting no similar log entries at the same time.

So ... as best I can tell the batman-adv enabled APs are, sometimes, sending packets to the router that claim to be from its own MAC address. Or maybe it's doing this on its own.
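That kernel warning usually means a frame the bridge itself sent has come back in on one of its ports, which is what a switching loop looks like from the inside. A minimal sketch for tallying the warnings per port and MAC address follows; the function name is made up, and the sample lines are the router and AP1 entries pasted above.

```shell
# Count "own address as source address" bridge warnings per (port, MAC) pair.
own_address_warnings() {
    awk '/received packet on .* with own address as source address/ {
        port = ""; mac = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "on") port = $(i + 1)          # interface the frame arrived on
            if ($i ~ /^\(addr:/) {                   # field looks like "(addr:aa:bb:...,"
                mac = $i
                sub(/^\(addr:/, "", mac)
                sub(/,$/, "", mac)
            }
        }
        count[port " " mac]++
    }
    END { for (k in count) print count[k], k }'
}

own_address_warnings <<'EOF' | sort
Mon Jun 26 12:35:37 2023 kern.warn kernel: [2973901.546226] br-lan: received packet on lan4 with own address as source address (addr:94:10:3e:80:18:d7, vlan:0)
Mon Jun 26 12:35:41 2023 kern.warn kernel: [2973905.212554] br-lan: received packet on lan4 with own address as source address (addr:94:10:3e:80:18:d7, vlan:0)
Mon Jun 26 12:35:36 2023 kern.warn kernel: [604709.708171] br-lan: received packet on eth0.1 with own address as source address (addr:d8:07:b6:20:ed:72, vlan:0)
Mon Jun 26 12:35:36 2023 kern.warn kernel: [604709.719217] br-lan: received packet on eth0.1 with own address as source address (addr:d8:07:b6:20:ed:72, vlan:0)
Mon Jun 26 12:35:36 2023 kern.warn kernel: [604709.734606] br-lan: received packet on eth0.1 with own address as source address (addr:d8:07:b6:20:ed:72, vlan:0)
EOF
```

Running `logread | own_address_warnings` on each device and comparing the counts over time should show which ports see the looped frames and when.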

Maybe this is because the router is a Linksys WRT1900AC v1 on the mvebu/cortexa9 platform, and this is a manifestation of the bug currently preventing newer OpenWRT builds (something sending packets to all ports on the switch, I think). Maybe it's time to install a snapshot build that contains a fix for this bug.

You don't mention what routers/APs you use (except 1), but if you're using ath10k then you need to use the non-ct firmware. All I know is that running batman is stable af if everything is set up correctly. I've tried booting up 10+ routers, rebooting them, doing all kinds of sh*t, and none of it will break the mesh.

Also if you use batman, do NOT use mesh11sd.

  • everything bernd wrote.