I have a working wireless batman setup with VLANs on 24.10-rc5. All mesh nodes - a gateway and several "dumb" APs - share (and they must do so) the same network/wireless configuration and have a single port (lan).
Some APs are wired to the gateway and some are not, and any of them may be wired or unwired at any point, so a batman wired interface (bat_eth) is also configured on all nodes. If this interface is enabled, the network becomes extremely unresponsive, with nodes regularly becoming unreachable for minutes. What is causing this and what is wrong with my configuration (STP and BLA are enabled)?
/etc/config/network for the nodes:
config globals 'globals'
option packet_steering '1'
option ula_prefix 'fd82:e78d:65e8::/48'
config interface 'loopback'
option device 'lo'
option ipaddr '127.0.0.1'
option netmask '255.0.0.0'
option proto 'static'
config interface 'bat0'
option bridge_loop_avoidance '1'
option gw_mode 'client' # 'server' on the gateway
option hop_penalty '30'
option proto 'batadv'
option routing_algo 'BATMAN_IV'
config interface 'bat_wifi_2g'
option master 'bat0'
option mtu '1536'
option proto 'batadv_hardif'
config interface 'bat_wifi_5g'
option master 'bat0'
option mtu '1536'
option proto 'batadv_hardif'
config interface 'bat_eth'
option device 'lan.4'
option disabled '1'
option master 'bat0'
option mtu '1536'
option proto 'batadv_hardif'
config device 'device_br'
option igmp_snooping '1'
option mtu '1536'
option mtu6 '1536'
option name 'br'
option ports 'bat0.1 bat0.3 bat0.5 lan'
option stp '1'
option type 'bridge'
config device 'device_lan'
option mtu '1536'
option name 'lan'
config device 'device_br_1'
option ifname 'br'
option macaddr '02:00:00:00:01:<id>'
option name 'br.1'
option type '8021q'
option vid '1'
config device 'device_br_3'
option ifname 'br'
option macaddr '02:00:00:00:03:<id>'
option name 'br.3'
option type '8021q'
option vid '3'
config device 'device_br_5'
option ifname 'br'
option macaddr '02:00:00:00:05:<id>'
option name 'br.5'
option type '8021q'
option vid '5'
config device 'device_lan_4'
option ifname 'lan'
option macaddr '02:00:00:00:04:<id>'
option name 'lan.4'
option type '8021q'
option vid '4'
config bridge-vlan 'vlan_lan'
option device 'br'
option ports 'lan:t bat0.1'
option vlan '1'
config bridge-vlan 'vlan_guest'
option device 'br'
option ports 'lan bat0.3'
option vlan '3'
config bridge-vlan 'vlan_iot'
option device 'br'
option ports 'lan:t bat0.5'
option vlan '5'
config interface 'lan'
option device 'br.1'
list dns '10.0.1.1'
list dns 'fd82:e78d:65e8:1::1'
option dns_search '<search_domain>'
option gateway '10.0.1.1' # not present on the gateway
option ip6addr 'fd82:e78d:65e8:1::<id>/64'
option ip6gw 'fd82:e78d:65e8:1::1' # not present on the gateway
option ipaddr '10.0.1.<id>'
option netmask '255.255.255.0'
option proto 'static'
config interface 'guest' # static configuration on the gateway
option device 'br.3'
option proto 'none'
config interface 'iot' # static configuration on the gateway
option device 'br.5'
option proto 'none'
# wan interface on the gateway
Also for some reason I cannot specify a bridge VLAN (e.g. br.4) as the device for bat_eth ("resource busy"), only a single tagged port (lan.4 in my case): see this issue.
You are likely running into issues because of a loop. Do you have a device that you can run Wireshark on? I'll give your config a look, but with Wireshark you should be able to tell quickly whether there is a loop, because you will start seeing the same packets over and over until the entire network is overwhelmed.
I was afraid it was loops of some sort but I don't understand how they are possible, given that both STP and batman's bridge loop avoidance are already enabled!
I need batman: not all APs are wired, only two out of four. More APs may come in the future and they must all share the same OpenWISP-deployed config (agnostic to whether they are wired or not).
The network config is what you see, I could share the wireless config but it's pointless since it all works flawlessly with the bat_eth interface disabled.
I've disabled STP on all nodes and verified that BLA is enabled with batctl. Still, the gateway became unresponsive very soon after enabling the bat_eth interface.
I knew that STP was not strictly necessary with BLA, but I've never heard that it can actually cause problems.
@_bernd It may be too early to declare victory. But what I've tried now is setting bat0's gw_mode to off instead of client on all the APs. It was one of the few deviations of my config from the guides. 23 minutes without any issues, coincidence? I have no idea how this setting could be relevant and what has really changed under the hood...
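For reference, the change described above is a one-line difference in the bat0 section on the APs. This is only a sketch based on the config posted at the top of the thread, with every other option left as it was there. Whether this setting is really what resolved the problem has not been established.

```
config interface 'bat0'
    option proto 'batadv'
    option routing_algo 'BATMAN_IV'
    option bridge_loop_avoidance '1'
    option hop_penalty '30'
    option gw_mode 'off'    # was 'client' on the APs; the gateway keeps 'server'
```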
No clue. I don't use LuCI. I'm not sure whether LuCI is showing the config or the actual state.
I was curious about the output of brctl and bridge because it once led me to a misconfiguration where a link was carrying untagged traffic, which caused me issues.
Do you know why I can't use a bridge VLAN (e.g. a new br.4) as the device for bat_eth and must use a single tagged port instead (lan.4 in my case)? The former would be more consistent and useful on APs that have more than one port but I get the error
config device
option name 'bat0'
option macaddr '02:00:10:00:00:01'
config interface 'bat0'
option proto 'batadv'
option routing_algo 'BATMAN_IV'
config interface 'bat0_mesh0'
option proto 'batadv_hardif'
option master 'bat0'
config interface 'bat0_mesh1'
option proto 'batadv_hardif'
option master 'bat0'
config interface 'bat0_eth0'
option proto 'batadv_hardif'
option master 'bat0'
# Example VLAN with swconfig
config switch_vlan
option device 'switch0'
option vlan '16'
option ports '0t 1t'
config device
option name 'br-vlan16'
option type 'bridge'
list ports 'bat0.16'
list ports 'eth0.16'
option macaddr '02:00:10:01:00:10'
# Example with DSA
config bridge-vlan
option device 'br-vlan'
option vlan '16'
list ports 'eth0:t'
list ports 'bat0.16'
You set up your tagged interfaces manually. That should be fine. However, like I said, the bat0.N interfaces only need to be listed in the device config of the VLAN bridge.
And you don't have to add eth0 to bat0; otherwise you end up with a "hybrid" port, which carries both tagged and untagged traffic, whereas you want a "trunk" port, which carries only tagged networks.
If you don't add any device (lan.4 in my case) to your bat0_eth0 interface (bat_eth in my case) then what is it there for? This is what I'm trying to accomplish in case you're not understanding.
A hybrid trunk port is exactly what I have and need regardless of batman: the guest VLAN is the untagged one as you can see, since my guests should be able to connect to my network via ethernet without any special configuration on their devices (I don't have managed switches in the network). I repeat, everything is working flawlessly with the bat_eth interface disabled.
However I would not trust the wiki all the time.
That's why I also said multiple times to check the actual state of the interface as the kernel sees it!
Having untagged traffic side by side with tagged traffic can cause issues. It does not have to, but it can. That's why many people say: don't mix it! Either have an access port or a trunk port.
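To illustrate the distinction with the bridge-vlan sections from the config at the top of the thread: there, lan is untagged in VLAN 3 and tagged in VLANs 1 and 5, which makes it a hybrid port. A trunk-only variant would tag VLAN 3 on lan as well. This is only a sketch for testing whether the untagged VLAN is involved. It would break the plug-and-play guest access on ethernet that the original poster wants, so treat it as a diagnostic step rather than a recommendation.

```
config bridge-vlan 'vlan_guest'
    option device 'br'
    option ports 'lan:t bat0.3'    # 'lan' is untagged here in the original config
    option vlan '3'
```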
I will not watch it, sorry. IMHO half the dudes on YT are talking shit. (If this dude were so clever, we would not have like 5 people per week watching these videos, still ending up with a half messed-up setup, and then coming here to call for help. /rant)
Again, the minimal config is: create a VLAN-aware bridge with eth0; create bat0 (without any device! only use master); and then for each VLAN bridge subinterface add the ports (lan1, lan2, ...) and bat0.<N>.
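Put together, that minimal layout might look roughly like the following. This is a sketch only: it reuses the names from the config at the top of the thread (br, lan, bat_wifi_5g) instead of eth0, shows a single VLAN, and assumes the wireless hard interface gets its device from the wireless config, which is not shown in the thread.

```
# VLAN-aware bridge holding the physical port and the bat0 VLAN subinterface
config device
    option name 'br'
    option type 'bridge'
    list ports 'lan'
    list ports 'bat0.1'

# batman-adv soft interface: no 'device' option here
config interface 'bat0'
    option proto 'batadv'
    option routing_algo 'BATMAN_IV'

# hard interface, attached to bat0 only via 'master'
config interface 'bat_wifi_5g'
    option proto 'batadv_hardif'
    option master 'bat0'

# per-VLAN membership: physical port (tagged) plus bat0.<N>
config bridge-vlan
    option device 'br'
    option vlan '1'
    list ports 'lan:t'
    list ports 'bat0.1'
```

Further VLANs would repeat the bridge-vlan section with their own vlan number and bat0.<N> port, and each bat0.<N> would also be added to the bridge's ports list.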