Batman-adv + vlan-aware bridge = no go, why?

Out of curiosity, I tried an 802.11s mesh with batman-adv in my network. Yes, I know that it's not a good solution with only two wireless devices; yet, the goal was to learn, and it has been achieved, minus one question.

The target setup was to pass two separate VLANs (VID 1 and VID 4) through a wireless mesh. Mesh + batman-adv = new bat0 interface, which is documented to be VLAN-aware.

I already have a VLAN-aware bridge as br-lan; I have created two VLANs as follows, and it works this way:

config interface 'bat0'
	option proto 'batadv'
	option routing_algo 'BATMAN_V'
	option aggregated_ogms '1'
	option bonding '1'
	option bridge_loop_avoidance '1'
	option gw_mode 'off'
	option hop_penalty '30'
	option defaultroute '0'

# This is referenced from /etc/config/wireless
config interface 'nwi0'
	option proto 'batadv_hardif'
	option master 'bat0'
	option mtu '2304'
	option defaultroute '0'

config device
	option name 'br-lan'
	option type 'bridge'
	list ports 'bat0.1'
	list ports 'bat0.4'
	list ports 'lan1'
	list ports 'lan2'
	list ports 'lan3'
	list ports 'lan4'
	list ports 'lan5'

config interface 'lan'
	option device 'br-lan.1'
	option proto 'static'
	option ipaddr '192.168.10.1'
	option netmask '255.255.255.0'
	option ip6assign '64'
	list ip6class 'local'
	list ip6class 'wan_6'

config interface 'rus'
	option proto 'static'
	option device 'br-lan.4'
	option ipaddr '192.168.13.1'
	option netmask '255.255.255.0'
	option defaultroute '0'
	option delegate '0'
	option ip4table '4'

config bridge-vlan
	option device 'br-lan'
	option vlan '1'
	list ports 'bat0.1:u*'
	list ports 'lan1'
	list ports 'lan2'
	list ports 'lan3'
	list ports 'lan4'

config bridge-vlan
	option device 'br-lan'
	option vlan '4'
	list ports 'bat0.4:u*'
	list ports 'lan5'

# "unmanaged" doesn't bring the interfaces up, so "static"
config interface 'bat01'
	option proto 'static'
	option device 'bat0.1'

config interface 'bat04'
	option proto 'static'
	option device 'bat0.4'

...plus the same setup, but with the 192.168.x.2 addresses, on the other side. Ignore the option ip4table '4' on the rus interface, it only affects routing, while this post is 100% about Layer 2.

With the above setup, I can ping 192.168.10.1 and 192.168.13.1 from the other side, and the packets go through the expected VLANs.

The question is: why do I have to create individual VLANs on the bat0 interface and list them as untagged bridge ports, thus defeating the point of a VLAN-aware bridge?

In other words, why doesn't it work if I delete the "bat01" and "bat04" interfaces, and replace the "bridge-vlan" sections as follows?

# no "bat01" and "bat04" interfaces

config device
	option name 'br-lan'
	option type 'bridge'
	list ports 'bat0'
	list ports 'lan1'
	list ports 'lan2'
	list ports 'lan3'
	list ports 'lan4'
	list ports 'lan5'

config bridge-vlan
	option device 'br-lan'
	option vlan '1'
	list ports 'bat0:t'
	list ports 'lan1'
	list ports 'lan2'
	list ports 'lan3'
	list ports 'lan4'

config bridge-vlan
	option device 'br-lan'
	option vlan '4'
	list ports 'bat0:t'
	list ports 'lan5'

The kernel then starts spewing messages like this:

[16819.958117] batman_adv: bat0: adding TT local entry 94:83:c4:a7:ab:c2 to non-existent VLAN 4
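
To see at a glance which VLAN IDs batman-adv is complaining about, the kernel log can be filtered. This is a minimal sketch that assumes the message format shown above; missing_vlans is a name I made up for illustration, not part of any tool:

```shell
#!/bin/sh
# Print the distinct VLAN IDs named in batman-adv
# "non-existent VLAN" kernel messages read from stdin.
# missing_vlans is a hypothetical helper name.
missing_vlans() {
    sed -n 's/.*to non-existent VLAN \([0-9][0-9]*\).*/\1/p' | sort -n -u
}

# Example with the captured line from this post.
# On a live system the input would be: dmesg | missing_vlans
printf '%s\n' '[16819.958117] batman_adv: bat0: adding TT local entry 94:83:c4:a7:ab:c2 to non-existent VLAN 4' | missing_vlans
```

For the single line above this prints 4; with a longer log it prints each affected VLAN ID once.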

Searching for this message leads to the batman-adv FAQ, which says:

batman-adv since 2014.0.0 is 802.1Q VLAN-aware. It is only able to forward VLAN frames when it knows about the VLAN. This can either be done by creating a 802.1Q VLAN device with the correct VID on top of the batadv (bat0) device:

ip link add link bat0 name bat0.23 type vlan id 23

Or in case of a VLAN-aware bridge, it is better to add the VLANs as required to the specific ports:

bridge vlan add vid 23 dev bat0

Well, the first method of letting batman-adv know about the VLANs definitely works, and is implemented in the first (working) UCI configuration snippet.

I believe the second (non-working) UCI configuration snippet to be equivalent to the bridge vlan add vid XX dev bat0 method, which the FAQ suggests should also work, and should be preferred.

I also tried removing bat0 from the bridge entirely, adding it manually using the ip link set dev bat0 master br-lan command, and running the suggested bridge command manually to add the VLAN. This results in no traffic and many kernel messages about the non-existent VLAN.
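
For reference, the manual test described above can be sketched roughly as follows. The exact commands I typed are not reproduced here, so treat this as a reconstruction from the description; attach_tagged and the RUN variable are hypothetical names. By default the script only prints the commands instead of executing them:

```shell
#!/bin/sh
# Sketch of the manual test: enslave the batman-adv interface to the
# bridge, then add each VLAN ID to that port with the FAQ's command.
# RUN defaults to "echo", so commands are only printed (dry run).
# Set RUN to an empty string to actually execute them (needs root).
RUN="${RUN-echo}"

# attach_tagged is a hypothetical helper name.
attach_tagged() {
    bridge_dev="$1"
    port="$2"
    shift 2
    $RUN ip link set dev "$port" master "$bridge_dev"
    for vid in "$@"; do
        $RUN bridge vlan add vid "$vid" dev "$port"
    done
}

attach_tagged br-lan bat0 1 4
```

Running it as `RUN= sh script.sh` would apply the changes for real; without that, it is safe to run anywhere to review what would be executed.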

In the (non-working) manual test, the output of bridge vlan is identical to that from the second (i.e., non-working) UCI-based config:

root@gl-inet-main:~# bridge vlan
port              vlan-id  
lan2              1 PVID Egress Untagged
lan3              1 PVID Egress Untagged
lan4              1 PVID Egress Untagged
lan5              4 PVID Egress Untagged
lan1              1 PVID Egress Untagged
br-lan            1
                  4
phy0-ap0          1 PVID Egress Untagged
phy1-ap0          1 PVID Egress Untagged
phy1-ap1          1 PVID Egress Untagged
bat0              1
                  4
phy0-ap1          4 PVID Egress Untagged

For comparison, this is the output in the working case:

root@gl-inet-main:~# bridge vlan
port              vlan-id  
lan2              1 PVID Egress Untagged
lan3              1 PVID Egress Untagged
lan4              1 PVID Egress Untagged
lan5              4 PVID Egress Untagged
lan1              1 PVID Egress Untagged
br-lan            1
                  4
phy0-ap0          1 PVID Egress Untagged
phy1-ap0          1 PVID Egress Untagged
phy1-ap1          1 PVID Egress Untagged
phy0-ap1          4 PVID Egress Untagged
bat0.4            4 PVID Egress Untagged
bat0.1            1 PVID Egress Untagged
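
The two listings differ only in how the mesh shows up: bat0 as a tagged member of VLANs 1 and 4 in the non-working case, versus bat0.1 and bat0.4 as untagged ports in the working case. To pull a single port's VLAN membership out of such a listing, a small filter can help. This is a minimal sketch that assumes the column layout shown above; port_vlans is a hypothetical helper name, not an iproute2 feature:

```shell
#!/bin/sh
# List the VLAN IDs a given port carries, reading `bridge vlan`
# output from stdin. port_vlans is a hypothetical helper name.
port_vlans() {
    awk -v port="$1" '
        NR == 1 && $1 == "port" { next }   # skip the header line
        /^[^ \t]/ { cur = $1; vid = $2 }   # line that starts with a port name
        /^[ \t]/  { vid = $1 }             # continuation line: VLAN ID only
        NF && cur == port { print vid }
    '
}

# Example using an excerpt of the non-working listing.
# On a live system the input would be: bridge vlan | port_vlans bat0
port_vlans bat0 <<'EOF'
port              vlan-id
br-lan            1
                  4
bat0              1
                  4
phy0-ap1          4 PVID Egress Untagged
EOF
```

For the excerpt above this prints 1 and 4, one per line. Fed the working listing instead, `port_vlans bat0` prints nothing, because only bat0.1 and bat0.4 are bridge ports there.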

Is this a case of outdated upstream batman-adv documentation, or a bug in OpenWrt?

In the main bridge, list ports should be just bat0, in the same way that the Ethernet ports are listed by their base port names. Then specify the port as tagged in each applicable bridge-vlan section.

In other words, treat a virtual bat port the same as a physical eth port.

That's exactly what the second (non-working) configuration does.

EDIT: I forgot to copy-paste this change in the original post - will correct now.

EDIT 2: done - please review the configuration again.

EDIT 3: If I do this with VLAN 4 in the non-working config, it starts working - but I shouldn't need to perform this dance:

root@gl-inet-main:~# ping 192.168.13.2
PING 192.168.13.2 (192.168.13.2): 56 data bytes
^C
--- 192.168.13.2 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
root@gl-inet-main:~# ip link add name bat0.4 link bat0 type vlan id 4
root@gl-inet-main:~# ip link set bat0.4 up
root@gl-inet-main:~# ping 192.168.13.2
PING 192.168.13.2 (192.168.13.2): 56 data bytes
^C
--- 192.168.13.2 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
root@gl-inet-main:~# ip link del bat0.4
root@gl-inet-main:~# ping 192.168.13.2
PING 192.168.13.2 (192.168.13.2): 56 data bytes
64 bytes from 192.168.13.2: seq=0 ttl=64 time=1.689 ms
64 bytes from 192.168.13.2: seq=1 ttl=64 time=1.173 ms
64 bytes from 192.168.13.2: seq=2 ttl=64 time=1.538 ms
<...>

The information was written in 2014, so I would think this is now in effect.

Note: Do not rely on VLAN packets being filtered when no VLAN is added on top of bat0. This is likely subject to change in the future.

Well, this is not how I read it. Let me expand the thought by rewording it.

When a particular VLAN is not added on top of bat0, its packets are filtered out.

Which is exactly the current behavior; yet a nearby passage claims that bridge vlan add vid should also work when bat0 is part of a VLAN-aware bridge.

We reserve the right to change this behavior in the future, i.e., allow such packets under certain conditions, so please don't rely on them always being filtered out.

I am not relying on that. I need packets that match the VLAN ID mentioned in the OpenWrt vlan-aware bridge configuration as "bat0:t" to pass through and not generate any "non-existent VLAN" kernel messages.

So, let me reword my question:

Does this configuration snippet differ in any material aspect from the documented bridge vlan add vid 4 dev bat0 command? If it differs, I would have to file an enhancement request to OpenWrt. If not, I would have to report a bug to batman-adv.

config bridge-vlan
	option device 'br-lan'
	option vlan '4'
	list ports 'bat0:t'

The online documentation and community support are all I can rely on, so I can't really help you much further than that. Sorry.

This is the config that works without any "non-existent VLAN" kernel message.

config device 'br_lan'
        option name 'br_lan'
        option type 'bridge'
        list ports 'bat0'
        list ports 'lan2'

config bridge-vlan 'br_lan_vlan1'
        option vlan '1'
        option device 'br_lan'
        list ports 'bat0:u*'
        list ports 'lan2:u*'

config bridge-vlan 'br_lan_vlan40'
        option vlan '40'
        option device 'br_lan'
        list ports 'lan2:t'
        list ports 'bat0.40:u*'