Unresponsive batman wired+wireless network

I have a working wireless batman setup with VLANs on 24.10-rc5. All mesh nodes - a gateway and several "dumb" APs - share (and they must do so) the same network/wireless configuration and have a single port (lan).

Some APs are wired to the gateway and some are not, but any one of them may or may not be at any point, therefore a batman wired interface (bat_eth) is also configured on all nodes. If this interface is enabled the network becomes extremely unresponsive, with nodes regularly becoming unreachable for minutes. What is causing this and what is wrong with my configuration (STP and BLA are enabled)?

/etc/config/network for the nodes:

config globals 'globals'
        option packet_steering '1'
        option ula_prefix 'fd82:e78d:65e8::/48'

config interface 'loopback'
        option device 'lo'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'
        option proto 'static'

config interface 'bat0'
        option bridge_loop_avoidance '1'
        option gw_mode 'client' # 'server' on the gateway
        option hop_penalty '30'
        option proto 'batadv'
        option routing_algo 'BATMAN_IV'

config interface 'bat_wifi_2g'
        option master 'bat0'
        option mtu '1536'
        option proto 'batadv_hardif'

config interface 'bat_wifi_5g'
        option master 'bat0'
        option mtu '1536'
        option proto 'batadv_hardif'

config interface 'bat_eth'
        option device 'lan.4'
        option disabled '1'
        option master 'bat0'
        option mtu '1536'
        option proto 'batadv_hardif'

config device 'device_br'
        option igmp_snooping '1'
        option mtu '1536'
        option mtu6 '1536'
        option name 'br'
        option ports 'bat0.1 bat0.3 bat0.5 lan'
        option stp '1'
        option type 'bridge'

config device 'device_lan'
        option mtu '1536'
        option name 'lan'

config device 'device_br_1'
        option ifname 'br'
        option macaddr '02:00:00:00:01:<id>'
        option name 'br.1'
        option type '8021q'
        option vid '1'

config device 'device_br_3'
        option ifname 'br'
        option macaddr '02:00:00:00:03:<id>'
        option name 'br.3'
        option type '8021q'
        option vid '3'

config device 'device_br_5'
        option ifname 'br'
        option macaddr '02:00:00:00:05:<id>'
        option name 'br.5'
        option type '8021q'
        option vid '5'

config device 'device_lan_4'
        option ifname 'lan'
        option macaddr '02:00:00:00:04:<id>'
        option name 'lan.4'
        option type '8021q'
        option vid '4'

config bridge-vlan 'vlan_lan'
        option device 'br'
        option ports 'lan:t bat0.1'
        option vlan '1'

config bridge-vlan 'vlan_guest'
        option device 'br'
        option ports 'lan bat0.3'
        option vlan '3'

config bridge-vlan 'vlan_iot'
        option device 'br'
        option ports 'lan:t bat0.5'
        option vlan '5'

config interface 'lan'
        option device 'br.1'
        list dns '10.0.1.1'
        list dns 'fd82:e78d:65e8:1::1'
        option dns_search '<search_domain>'
        option gateway '10.0.1.1' # not present on the gateway
        option ip6addr 'fd82:e78d:65e8:1::<id>/64'
        option ip6gw 'fd82:e78d:65e8:1::1' # not present on the gateway
        option ipaddr '10.0.1.<id>'
        option netmask '255.255.255.0'
        option proto 'static'

config interface 'guest' # static configuration on the gateway
        option device 'br.3'
        option proto 'none'

config interface 'iot' # static configuration on the gateway
        option device 'br.5'
        option proto 'none'

# wan interface on the gateway

Also for some reason I cannot specify a bridge VLAN (e.g. br.4) as the device for bat_eth ("resource busy"), only a single tagged port (lan.4 in my case): see this issue.

You are likely running into issues because of a loop. Do you have a device that you can run Wireshark on? I'll give your config a look, but with Wireshark you should be able to tell quickly whether there is a loop: you will see the same packets over and over until the entire network is overwhelmed.
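If you don't have a machine for Wireshark handy, a rough equivalent can be run on a node itself. This is only a sketch: it assumes tcpdump is installed on the node, and count_dupes is a made-up helper name. It strips the timestamp from each captured line and reports frames that show up more than once.

```shell
# count_dupes (hypothetical helper): read tcpdump text output on stdin,
# drop the leading timestamp field, and print "<count> <frame>" for every
# frame description seen more than once, most frequent first.
count_dupes() {
  awk '
    { sub(/^[^ ]+ /, ""); seen[$0]++ }
    END { for (k in seen) if (seen[k] > 1) printf "%d %s\n", seen[k], k }
  ' | sort -rn
}

# Usage on a node (capture 500 frames on the wired port, with MAC addresses):
#   tcpdump -i lan -e -n -c 500 2>/dev/null | count_dupes | head
```

A few repeats are normal (periodic ARP or multicast), so treat a high count of identical frames within a short capture as a hint of a loop, not as proof.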

You can activate batman-adv's loop avoidance by adding option bridge_loop_avoidance '1' to the config. It is probably also an option in LuCI.

Why do you need batman-adv? If both APs are wired you don't need a mesh.

I was afraid it was loops of some sort but I don't understand how they are possible, given that both STP and batman's bridge loop avoidance are already enabled!

I need batman: not all APs are wired, only two out of four. More APs may come in the future and they must all share the same OpenWisp-deployed config (agnostic to the fact that they are wired or not).

Are all devices a part of the batman interface? Make sure you put the mesh, wired port and SSID in bat0.

The network config is what you see, I could share the wireless config but it's pointless since it all works flawlessly with the bat_eth interface disabled.

I think you have to disable STP when using batman-adv. Loop avoidance is enabled by default; just check with batctl...

But please verify this first, it's only a faint memory...

I've disabled STP on all nodes and verified that BLA is enabled with batctl. Still, the gateway became unresponsive very soon after enabling the bat_eth interface.
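For anyone following along, the batctl check could be scripted roughly like this. It is a sketch under assumptions: bla_enabled is a made-up helper name, it assumes a batctl version that accepts the meshif syntax and prints the single word enabled or disabled, and the claimtable and backbonetable subcommands may be missing from the smaller batctl package variants.

```shell
# bla_enabled (hypothetical helper): succeed if batctl reports bridge loop
# avoidance as "enabled" on the given mesh interface (default: bat0).
bla_enabled() {
  mesh="${1:-bat0}"
  [ "$(batctl meshif "$mesh" bridge_loop_avoidance 2>/dev/null)" = "enabled" ]
}

# Usage on a node:
#   bla_enabled bat0 && echo "BLA on" || echo "BLA off or batctl missing"
#   batctl meshif bat0 claimtable      # clients claimed by backbone gateways
#   batctl meshif bat0 backbonetable   # other backbone gateways on the LAN
```

If the wired nodes never appear in each other's backbone table, BLA is probably not seeing them on a common LAN, which would be worth knowing before looking elsewhere.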

I knew that STP was not strictly necessary with BLA, but I've never heard that it can actually cause problems.

What does brctl show and bridge vlan say?

brctl show:

bridge name     bridge id               STP enabled     interfaces
br              7fff.f64d5cfb76ac       yes             bat0.1
                                                        lan
                                                        ap-2g-guest
                                                        ap-5g-iot
                                                        ap-5g-lan
                                                        ap-2g-iot
                                                        ap-5g-guest
                                                        bat0.5
                                                        ap-2g-lan
                                                        bat0.3

What do you mean by bridge vlan? It's not a valid command

If you can, please install ip-bridge and if you are on it, also ip-full.

Example output x86 with DSA:

root@cpe:~# bridge vlan
port              vlan-id
eth0              16
                  17
                  24
                  49
                  56
                  64
                  65
                  71
                  76
                  77
eth1              4094 PVID Egress Untagged
br-vlan           16
                  17
                  24
                  49
                  56
                  64
                  65
                  71
                  76
                  77
br1-vlan          4094
bat0.16           16 PVID Egress Untagged
bat0.17           17 PVID Egress Untagged
bat0.24           24 PVID Egress Untagged
bat0.49           49 PVID Egress Untagged
bat0.56           56 PVID Egress Untagged
bat0.64           64 PVID Egress Untagged
bat0.65           65 PVID Egress Untagged
bat0.71           71 PVID Egress Untagged
bat0.76           76 PVID Egress Untagged
bat0.77           77 PVID Egress Untagged

@_bernd It may be too early to declare victory. But what I've tried now is setting bat0's gw_mode to off instead of client on all the APs. It was one of the few deviations of my config from the guides. 23 minutes without any issues, coincidence? I have no idea how this setting could be relevant and what has really changed under the hood...

Isn't showing you this equivalent?

No clue. I don't use LuCI. I'm not sure whether LuCI shows the config or the actual state.

I was curious about the output of brctl and bridge because they once led me to a misconfiguration where a link was used with untagged traffic, which caused me issues.

gw_mode didn't actually do anything, I'm having issues again.

As you requested, bridge vlan:

port              vlan-id  
lan               1
                  3 PVID Egress Untagged
                  5
ap-2g-lan         1 PVID Egress Untagged
ap-2g-guest       3 PVID Egress Untagged
ap-2g-iot         5 PVID Egress Untagged
br                1
                  3
                  5
bat0.1            1 PVID Egress Untagged
bat0.3            3 PVID Egress Untagged
bat0.5            5 PVID Egress Untagged

Do you know why I can't use a bridge VLAN (e.g. a new br.4) as the device for bat_eth and must use a single tagged port instead (lan.4 in my case)? The former would be more consistent and useful on APs that have more than one port but I get the error

netifd: bat_eth (5155): Error - failed to add interface br.4: Resource busy

It's late, but at first glance: remove the tagged bat devices here.
Later in your config you attach the tagged bat0 device correctly afaics.

You mean add bat0 instead of bat0.1 bat0.3 bat0.5 to the bridge and then set it tagged (bat0:t) in the bridge-vlan sections?

If so, that was my old configuration and it was working (wireless only) on 23.05.5. On 24.10 it started giving me errors of the kind

batman_adv: bat0: adding TT local entry <macaddr> to non-existent VLAN <vid>

which I could only get rid of with the current approach.
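To see which VLAN IDs these messages concern, the log can be summarized with a small filter. This is a sketch: missing_vlans is a made-up helper name, and it assumes each message ends with the VLAN ID, as in the line quoted above.

```shell
# missing_vlans (hypothetical helper): read log lines on stdin and print
# "<vid> <count>" for every VLAN ID named in a
# "adding TT local entry ... to non-existent VLAN <vid>" message.
missing_vlans() {
  awk '/to non-existent VLAN/ { n[$NF]++ }
       END { for (v in n) print v, n[v] }' | sort -n
}

# Usage on a node:
#   logread | missing_vlans
```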

No.

That's enough:

config device
    option  name            'bat0'
    option  macaddr         '02:00:10:00:00:01'

config interface            'bat0'
    option  proto           'batadv'
    option  routing_algo    'BATMAN_IV'

config interface            'bat0_mesh0'
    option  proto           'batadv_hardif'
    option  master          'bat0'

config interface            'bat0_mesh1'
    option  proto           'batadv_hardif'
    option  master          'bat0'

config interface            'bat0_eth0'
    option  proto           'batadv_hardif'
    option  master          'bat0'


# Example VLAN with swconfig
config switch_vlan
    option  device          'switch0'
    option  vlan            '16'
    option  ports           '0t 1t'

config device
    option  name            'br-vlan16'
    option  type            'bridge'
    list    ports           'bat0.16'
    list    ports           'eth0.16'
    option  macaddr         '02:00:10:01:00:10'

# Example with DSA
config bridge-vlan
    option  device          'br-vlan'
    option  vlan            '16'
    list    ports           'eth0:t'
    list    ports           'bat0.16'

You set up your tagged interfaces manually. That should be fine. However, like I said, the bat0.N interfaces only need to be listed in the device config of the VLAN bridge.
And you don't have to add eth0 to bat0, otherwise you end up with a "hybrid" port, which carries tagged and untagged traffic, but you want a "trunk" port, which only carries tagged networks.

Never seen such a config: bat0.<vid> interfaces are always added to the main bridge in all the guides I've read, including the official OpenWrt docs:

If you don't add any device (lan.4 in my case) to your bat0_eth0 interface (bat_eth in my case) then what is it there for? This is what I'm trying to accomplish in case you're not understanding.

A hybrid trunk port is exactly what I have and need regardless of batman: the guest VLAN is the untagged one as you can see, since my guests should be able to connect to my network via ethernet without any special configuration on their devices (I don't have managed switches in the network). I repeat, everything is working flawlessly with the bat_eth interface disabled.

https://openwrt.org/docs/guide-user/network/wifi/mesh/batman#bridge_vlans_over_batman-adv

However, I would not trust the wiki all the time.
That's why I also told you multiple times to check the actual state of the interface as the kernel sees it!
Having untagged traffic side by side with tagged traffic can cause issues. It doesn't have to, but it can. That's why many people say: don't mix it! Either have an access port or a trunk port.
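To check from the kernel's point of view which ports mix tagged and untagged VLANs, the output of bridge vlan can be filtered. This is a sketch: hybrid_ports is a made-up helper name, and it assumes the column layout shown in the outputs earlier in this thread, where a new port name starts in the first column.

```shell
# hybrid_ports (hypothetical helper): read "bridge vlan" output on stdin and
# print every port that carries at least one "Egress Untagged" VLAN and at
# least one VLAN without that flag (i.e. tagged on egress).
hybrid_ports() {
  awk '
    $1 == "port" && $2 == "vlan-id" { next }   # header line
    NF == 0 { next }                            # blank separator lines
    /^[^ \t]/ { port = $1 }                     # new port starts in column 1
    /Egress Untagged/ { untagged[port] = 1; next }
    { tagged[port] = 1 }
    END { for (p in untagged) if (p in tagged) print p }
  ' | sort
}

# Usage on a node (needs the ip-bridge package):
#   bridge vlan show | hybrid_ports
```

Run against the bridge vlan output posted above, this reports only lan, which is the port that has VLAN 3 untagged next to tagged VLANs 1 and 5.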

I will not watch it, sorry. IMHO half the dudes on YT are talking shit. (If this dude were so clever, we would not have like 5 people per week watching these videos, still ending up with a half messed up setup, and then coming here to call for help. /rant)

Again. The minimal config is: create a VLAN-aware bridge with eth0; create bat0 (without any device! only use master); and then for each VLAN bridge subinterface add the ports (lan1, lan2, ...) and bat0.<N>.