Broken 802.11r Fast Transition / roaming and 5 GHz issues, probably caused by the DSA implementation

Debug Environment:

  • Devices: Two OpenWrt routers connected via a VLAN-bridged network (bridge name: br).
  • VLAN Tagging: LAN traffic uses VLAN tag 10.
  • Second Bridge: A second bridge (brl) sits on top of br.10, only to give a clearer view of the bridged ports and wireless interfaces.
  • LAN Configuration: LAN network is attached to the brl bridge.
  • Note: Both OpenWrt devices have identical configurations.

Symptoms & Observations:

  1. Unexpected DHCP Discover: When a STA moves from the vicinity of one device to the other, it sends a DHCP Discover message, which is unexpected when Fast Transition is enabled.
  2. DHCP Offer Limited to First Device: The DHCP Offer reply is only visible on the first device, which acts as the DHCP server. It never reaches the second device, to which the STA has roamed.

Bridge FDB Observations:

I ran bridge fdb show | grep 'macaddr of STA' on the second OpenWrt device. The STA's MAC address shows up in several entries across different interfaces and VLANs, which suggests that some of them are stale or wrong. Because of these entries, the second device seems to either discard the packets or send them back toward the source.

First device:

bridge fdb show | grep '??:??:??:??:??:??'
??:??:??:??:??:?? dev lan1 vlan 10 master br
??:??:??:??:??:?? dev br.10 master brl

Second device:

bridge fdb show | grep '??:??:??:??:??:??'
??:??:??:??:??:?? dev sw-eth1 vlan 10 master br
??:??:??:??:??:?? dev sw-eth1 vlan 10 self
??:??:??:??:??:?? dev phy1-ap0 master brl

phy0-ap0: 2.4 GHz wireless interface
phy1-ap0: 5 GHz wireless interface

The same thing happens when the STA just switches from 'phy0-ap0' to 'phy1-ap0' (2.4 GHz to 5 GHz), both with and without FT enabled.
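
To make the check above repeatable, here is a minimal sketch that lists the distinct ports on which one MAC address appears in bridge fdb show output. The function name fdb_ports_for_mac and the MAC address in the usage comment are placeholders of mine, not part of any tool. More than one port in the result for a roaming STA points at the kind of leftover entry shown above.

```shell
#!/bin/sh
# Sketch: print each port ("dev" field) on which a MAC address appears.
# Reads `bridge fdb show` style text on stdin, so it also works on saved output.
# $1 = MAC address to look for (case-insensitive).
fdb_ports_for_mac() {
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    awk -v mac="$mac" '
        tolower($1) == mac {
            for (i = 2; i < NF; i++) {
                if ($i == "dev") {
                    port = $(i + 1)
                    # Report every port only once, in order of first appearance.
                    if (!(port in seen)) { seen[port] = 1; print port }
                }
            }
        }'
}

# Usage on a live system (placeholder MAC):
#   bridge fdb show | fdb_ports_for_mac aa:bb:cc:dd:ee:ff
```

On the second device above, this would report both sw-eth1 and phy1-ap0 for the STA.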

Setting the MAC address aging time of the bridge that carries the VLAN to a low value (here 3 seconds)

brctl setageing br 3

seems to fix both issues, at least as a temporary workaround.
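
If brctl is not installed, the same setting should be reachable through iproute2. The sketch below only prints the command instead of running it. It assumes that ip link takes ageing_time in units of 1/100 s (the default 300 s shows up as 30000 in ip -d link show), so please verify that on your build before relying on it. The helper name ageing_cmd is hypothetical.

```shell
#!/bin/sh
# Sketch: build the iproute2 equivalent of `brctl setageing <bridge> <seconds>`.
# Assumption: `ageing_time` is given in centiseconds, so seconds are multiplied by 100.
# $1 = bridge device, $2 = aging time in whole seconds.
ageing_cmd() {
    bridge_dev=$1
    seconds=$2
    printf 'ip link set dev %s type bridge ageing_time %d\n' \
        "$bridge_dev" "$((seconds * 100))"
}

# Dry run for the bridge and value used above.
ageing_cmd br 3
```

Neither command persists across a reboot or network restart, so this stays a runtime workaround.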

Conclusion:

I believe there is a flaw or shortcoming in the DSA implementation when dealing with VLANs and MAC address aging. It seems the bridge does not drop stale MAC entries quickly enough after a STA moves, which affects both Fast Transition and 5 GHz connectivity.

Problem is present on:

  • 23.05.0-rc3 r23389-5deed175a5
  • SNAPSHOT r23995-ce7209bd21
  • also some SNAPSHOT r24xxx build (I forgot to write down the exact revision)

One more thing to note:
My first attempt at solving the issue was to delete the old entries dynamically with a script. When I tried to remove them by hand, this command:

bridge fdb del '??:??:??:??:??:??' dev sw-eth1 vlan 10 master

executes with no issues, but

bridge fdb del '??:??:??:??:??:??' dev sw-eth1 vlan 10 self

always fails with

RTNETLINK answers: No such file or directory

so the next logical step was to set the aging time to a very low value.
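
For reference, the scripted cleanup I had in mind could look roughly like the sketch below. It only prints bridge fdb del commands for 'master' entries of one MAC on every port except the one the STA is currently on. It skips 'self' entries on purpose, since deleting those failed as shown above. The function name stale_fdb_del_cmds and the MAC in the usage comment are placeholders, and this is untested as a real fix.

```shell
#!/bin/sh
# Sketch: print (do not run) delete commands for stale "master" FDB entries.
# Reads `bridge fdb show` style text on stdin.
# $1 = station MAC, $2 = port the station is currently associated on.
stale_fdb_del_cmds() {
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    awk -v mac="$mac" -v keep="$2" '
        tolower($1) == mac {
            dev = ""; vlan = ""; is_master = 0; is_perm = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "dev")       dev = $(i + 1)
                if ($i == "vlan")      vlan = $(i + 1)
                if ($i == "master")    is_master = 1
                if ($i == "permanent") is_perm = 1
            }
            # Leave permanent entries, "self" entries and the current port alone.
            if (is_master && !is_perm && dev != "" && dev != keep)
                printf "bridge fdb del %s dev %s%s master\n", \
                    mac, dev, (vlan != "" ? " vlan " vlan : "")
        }'
}

# Usage on a live system (placeholder MAC), review the output before running it:
#   bridge fdb show | stale_fdb_del_cmds aa:bb:cc:dd:ee:ff phy1-ap0
```

Since the 'self' entry stays behind either way, this alone would probably not have been enough, which is why I went with the aging time.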
