Broken 802.11r Fast Transition / Roaming and 5GHz issues probably caused by DSA implementation

Hello Everyone,

I want to bring up two issues that many of us have probably faced: broken 802.11r Fast Transition (Roaming) and intermittent 5GHz connection issues. After some intensive debugging, I've noticed that the common denominator for both problems appears to be the DSA implementation.

After a deep dive into the system logs and configurations, I found that these problems occur primarily when VLANs are being utilized.

The Issue:

The issue seems to be linked to how MAC addresses age on the bridge. Entries that age out too slowly appear to break both 802.11r fast transition and switching a client from 2.4GHz to 5GHz.

Workaround:

I managed to resolve both issues by setting the bridge's MAC address ageing time to a low value. You can do this by running:

brctl setageing [bridge_name] 3

Setting the ageing time to 3 seconds seems to clear up the problems. While this is more of a workaround than a permanent fix, it should offer immediate relief to those experiencing similar issues.
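
To confirm that the setting actually took effect (or to see what it was before), the current ageing time can be read back from sysfs. This is only a minimal sketch: the helper name bridge_ageing_seconds and the SYSFS_NET override are made up for illustration and are not something OpenWrt ships. The kernel reports the value in hundredths of a second, so the default of 300 seconds shows up as 30000.

```shell
#!/bin/sh
# bridge_ageing_seconds BRIDGE
# Prints the bridge's current FDB ageing time in whole seconds.
# SYSFS_NET is a hypothetical override (for testing); default is /sys/class/net.
bridge_ageing_seconds() {
    f="${SYSFS_NET:-/sys/class/net}/$1/bridge/ageing_time"
    [ -r "$f" ] || { echo "no such bridge: $1" >&2; return 1; }
    # sysfs exposes the value in centiseconds, so divide by 100.
    echo $(( $(cat "$f") / 100 ))
}
```

Called with the bridge name, it should print 3 once the ageing time has been set to 3 seconds as above.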

I'm leaning towards the possibility that the DSA implementation has some quirks when dealing with VLANs and MAC address ageing on bridges (Probably missing some flush functionality). More investigation is definitely needed, but I thought I'd share this quick fix in case anyone else is in the same boat.

Regards,
David

Great write-up!

Could you please also help the less-initiated with examples of symptoms and log messages?

I'm curious. Source?
AFAIK the bridge code in Linux has not changed much. DSA only governs how a bridge and its attached interfaces are handled; the bridge module itself has been stable for ages.

Debug Environment:

  • Devices: Two OpenWrt routers connected via a VLAN-bridged network (bridge name: br).
  • VLAN Tagging: LAN traffic uses VLAN tag 10.
  • Second Bridge: Created a second bridge (brl) on top of br.10 (just for a clearer view of the bridged ports and wireless interfaces).
  • LAN Configuration: LAN network is attached to the brl bridge.
  • Note: Both OpenWrt devices have identical configurations.

Symptoms & Observations:

  1. Unexpected DHCP Discovery: When an STA moves from the proximity of one device to that of the other, it sends a DHCP Discovery message, which is unexpected behavior when Fast Transition is enabled.
  2. DHCP Offer Limited to First Device: The DHCP Offer reply packet is only visible on the first device, which acts as the DHCP server, and does not propagate to the second device where the STA has roamed.

Bridge FDB Observations:

I performed bridge fdb show | grep 'macaddr of STA' on the second OpenWrt device. It reveals multiple MAC address entries across various interfaces and VLANs, indicating that something is amiss. It seems the second device either discards the packets or perhaps sends them back to the source, due to these stale or erroneous entries.

First device:

bridge fdb show | grep '??:??:??:??:??:??'
??:??:??:??:??:?? dev lan1 vlan 10 master br
??:??:??:??:??:?? dev br.10 master brl

Second device:

bridge fdb show | grep '??:??:??:??:??:??'
??:??:??:??:??:?? dev sw-eth1 vlan 10 master br
??:??:??:??:??:?? dev sw-eth1 vlan 10 self
??:??:??:??:??:?? dev phy1-ap0 master brl

phy0-ap0: wireless interface 2.4GHz
phy1-ap0: wireless interface 5GHz

The same thing happens when a client merely switches from 'phy0-ap0' to 'phy1-ap0' (2.4GHz to 5GHz), both with and without FT enabled.
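
To spot this state quickly, the bridge fdb show output can be filtered for entries of the client's MAC that point anywhere other than a wireless interface. A rough sketch, assuming wireless interface names start with phy as in the listings above; the function name fdb_stale_ports is made up for illustration.

```shell
#!/bin/sh
# fdb_stale_ports MAC [WIFI_PREFIX]
# Reads `bridge fdb show` output on stdin and prints every entry for MAC
# whose device name does NOT start with WIFI_PREFIX (default: phy).
# Both the function name and the prefix convention are assumptions.
fdb_stale_ports() {
    awk -v mac="$1" -v pfx="${2:-phy}" '
        tolower($1) == tolower(mac) {
            dev = ""
            # Find the interface name that follows the "dev" keyword.
            for (i = 2; i < NF; i++) if ($i == "dev") dev = $(i + 1)
            if (dev != "" && index(dev, pfx) != 1) print
        }'
}
```

Run as bridge fdb show | fdb_stale_ports '??:??:??:??:??:??' on the AP the client is currently associated with. Any output means that a wired-port entry for that client still exists there.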

Setting the MAC address ageing time of the bridge containing the VLAN to a lower value

brctl setageing br 3

seems to fix both issues, at least as a temporary solution.

Conclusion:

I believe there is a flaw or shortcoming in the DSA implementation when dealing with VLANs and MAC address aging. It seems like the bridge doesn't efficiently flush stale MAC entries, which affects both Fast Transition and 5GHz connectivity.

Problem is present on:

  • 23.05.0-rc3 r23389-5deed175a5
  • SNAPSHOT r23995-ce7209bd21
  • also some SNAPSHOT r24xxx build (I forgot to write down the exact number)

Also something to note:
My first attempt at solving this was to delete the stale entries dynamically from a script. I first tried removing them by hand. This command:

bridge fdb del '??:??:??:??:??:??' dev sw-eth1 vlan 10 master

executes with no issues, but

bridge fdb del '??:??:??:??:??:??' dev sw-eth1 vlan 10 self

always fails with

RTNETLINK answers: No such file or directory

so the next logical step was to set the ageing time to a very low value.

I have questions:

  1. Do you have a DHCP server running on both routers?
  2. When multiple bridges form one layer-2 domain, ARP entries (aka IP neighbor entries) are just replaced, not flushed and then recreated. Without a control plane on all devices, that has to happen on the root of the bridge.
  3. If the station sends a new DHCP request, then there is an issue with the roaming process, but that has nothing to do with how the Linux bridge works, does it?

I added some more info to the second post, could you please evaluate it once more?

  1. DHCP is running only on the first device.
  2. The described behavior occurs even after removing the brl bridge and attaching lan directly to br.
  3. No DHCP request is sent at all when the tables are in the 'correct' state.

Put simply, when the rows

??:??:??:??:??:?? dev sw-eth1 vlan 10 master br
??:??:??:??:??:?? dev sw-eth1 vlan 10 self

on the second device disappear quickly enough, for example after lowering the ageing value, both issues are gone, and there are no DHCP messages either. I think the DHCP traffic is just the STA's way of handling the fault state.

Guess I found something: Regression: DSA breaks roaming to WLAN bridged to VLAN #11650

Quick Update

Hello OpenWrt Community,

I wanted to provide a quick update on the workaround for WiFi roaming, i.e. lowering the ageing time of the bridge containing the VLANs. Unfortunately, the workaround no longer appears to work in these OpenWrt versions:

  • 23.05.0-rc4
  • SNAPSHOT - r24054-fe10f97439

Reverting to an older OpenWrt version was the only option for me:

  • SNAPSHOT - r23930-6cf27094e9

Observed Behavior

An FDB entry lingers on the second device, as shown below:

??:??:??:??:??:?? dev sw-eth1 vlan 10 self

This behavior leads to a connectivity interruption lasting approximately 5 minutes (while the entry exists).
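
The roughly 5 minutes happen to match the Linux bridge's default ageing time of 300 seconds, although whether that is what actually expires the self entry is not verified here. To measure how long the entry really lingers after a roam, something like the following could be used. It is only a sketch: the function names and the FDB_CMD variable are made up for illustration, and FDB_CMD exists only so the dump command can be swapped out.

```shell
#!/bin/sh
# FDB_CMD is a hypothetical knob: the command that dumps the FDB.
FDB_CMD="${FDB_CMD:-bridge fdb show}"

# has_self_entry MAC
# Succeeds if the FDB dump on stdin contains a `self` entry for MAC.
has_self_entry() {
    awk -v mac="$1" '
        tolower($1) == tolower(mac) {
            for (i = 2; i <= NF; i++) if ($i == "self") found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# wait_fdb_self_gone MAC [INTERVAL]
# Polls every INTERVAL seconds (default 1) until the `self` entry is gone,
# then prints roughly how many seconds that took.
wait_fdb_self_gone() {
    mac=$1
    interval=${2:-1}
    elapsed=0
    while $FDB_CMD | has_self_entry "$mac"; do
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "$elapsed"
}
```

Started on the second device right after the client roams, wait_fdb_self_gone '??:??:??:??:??:??' 5 prints the approximate lifetime of the lingering entry in seconds.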

Regards,
David

Just to clarify 802.11r: I think it is common for a client to validate its address with the DHCP server after roaming to a new AP. FT only deals with getting the connection; the client decides what to do after that, and it has no way of knowing whether the new AP is on a different subnet unless it sends the DHCP server a request. FWIW I could be wrong, but this is my understanding. 802.11r deals with this:

When a client device roams from one access point (AP) to another using 802.11r, it undergoes a series of steps to ensure a fast and secure handoff. The state machine for a client during an 802.11r roam typically involves the following stages:

  1. Discovery of Candidate APs (Access Points):
     • The client device listens for beacons or probe responses from nearby APs to identify potential candidates for roaming.
  2. Roaming Decision:
     • The client device decides to roam based on various metrics such as signal strength, load, or the proprietary algorithms of the client device.
  3. FT Authentication Process:
     • Once a roaming candidate is identified, the client begins the FT authentication process with the target AP while still connected to the current AP.
     • It involves the exchange of FT authentication frames (FT Request and FT Response) which contain the necessary information to establish a security context with the new AP. This is the key step that reduces the time typically required for full authentication.
  4. Reassociation and Key Management:
     • After the FT authentication completes, the client sends a Reassociation Request to the target AP.
     • The target AP responds with a Reassociation Response.
     • The client and the target AP derive temporal keys using the negotiated PMK-R1 (Pairwise Master Key-R1) which was established during the FT authentication phase.

This obviously doesn't change the other parts of the problem you have documented. My hope was just to clarify why the client sends the DHCP request.

That's indeed what I observe in 23.05. Unfortunately, (dynamic) VLAN assignment appears to be broken in the latest stable release: OpenWrt 23.05.0 - First stable release - #364 by fodiator

The workaround from @davidrapan's post above (Broken 802.11r Fast Transition / Roaming and 5GHz issues probably caused by DSA implementation - #4 by davidrapan) fixes the issue (temporarily) for me.

OpenWrt 23.05.01, 3x Xiaomi AX3200 in DAP mode. I also ran into this problem. I have two configs. With the first one, FT does not work and the symptoms are the same as yours; "brctl setageing br-lan 3" does not solve the problem. With the other config, which does not create that bridge, FT works fine.

Config where FT does not work:

config interface 'loopback'
option device 'lo'
option proto 'static'
option ipaddr '127.0.0.1'
option netmask '255.0.0.0'

config globals 'globals'
option ula_prefix 'xxxxxxxxxx'

config device
option name 'wl0-ap0'
option macaddr 'xx:xx:xx:xx:xx:5F'
option ipv6 '0'

config device
option name 'wl1-ap0'
option ipv6 '0'
option macaddr 'xx:xx:xx:xx:xx:60'

config device
option type 'bridge'
option name 'br-lan'
list ports 'lan1'
list ports 'lan2'
list ports 'lan3'
list ports 'wan'
option ipv6 '0'
option macaddr 'xx:xx:xx:xx:xx:5E'

config interface 'Main'
option proto 'dhcp'
option device 'br-vlan10'
option delegate '0'

config interface 'IOT'
option proto 'none'
option device 'br-vlan20'
option defaultroute '0'
option delegate '0'

config interface 'Guest'
option proto 'none'
option device 'br-vlan30'
option defaultroute '0'
option delegate '0'

config device
option name 'br-lan.10'
option type '8021q'
option ifname 'br-lan'
option vid '10'
option ipv6 '0'

config device
option name 'br-lan.20'
option type '8021q'
option ifname 'br-lan'
option vid '20'
option ipv6 '0'

config device
option name 'br-lan.30'
option type '8021q'
option ifname 'br-lan'
option vid '30'
option ipv6 '0'

config device
option type 'bridge'
option name 'br-vlan10'
list ports 'br-lan.10'
option bridge_empty '1'
option ipv6 '0'
option stp '1'

config device
option type 'bridge'
option name 'br-vlan20'
list ports 'br-lan.20'
option bridge_empty '1'
option ipv6 '0'
option stp '1'

config device
option type 'bridge'
option name 'br-vlan30'
list ports 'br-lan.30'
option bridge_empty '1'
option ipv6 '0'
option stp '1'

config bridge-vlan
option device 'br-lan'
option vlan '10'
list ports 'lan1'
list ports 'lan2'
list ports 'wan:t'

config bridge-vlan
option device 'br-lan'
option vlan '20'
list ports 'wan:t'

config bridge-vlan
option device 'br-lan'
option vlan '30'
list ports 'wan:t'

Config where FT works without problems:

config interface 'loopback'
option device 'lo'
option proto 'static'
option ipaddr '127.0.0.1'
option netmask '255.0.0.0'

config globals 'globals'
option ula_prefix 'xxxxxxxxxxxxxxxxx'

config device
option name 'wl0-ap0'
option macaddr 'xx:xx:xx:xx:xx:5F'
option ipv6 '0'

config device
option name 'wl1-ap0'
option macaddr 'xx:xx:xx:xx:xx:60'
option ipv6 '0'

config interface 'Main'
option device 'br-vlan10'
option proto 'dhcp'
option delegate '0'

config interface 'IOT'
option proto 'none'
option device 'br-vlan20'
option defaultroute '0'
option delegate '0'

config interface 'Guest'
option device 'br-vlan30'
option proto 'none'
option delegate '0'
option defaultroute '0'

config device
option type '8021q'
option ifname 'wan'
option vid '10'
option name 'wan.10'
option ipv6 '0'

config device
option type '8021q'
option ifname 'wan'
option vid '20'
option name 'wan.20'
option ipv6 '0'

config device
option type '8021q'
option ifname 'wan'
option vid '30'
option name 'wan.30'
option ipv6 '0'

config device
option type 'bridge'
option name 'br-vlan10'
option bridge_empty '1'
option stp '1'
option macaddr 'xx:xx:xx:xx:xx:5E'
list ports 'lan1'
list ports 'lan2'
list ports 'wan.10'
option ipv6 '0'

config device
option type 'bridge'
option name 'br-vlan20'
option bridge_empty '1'
option stp '1'
option ipv6 '0'
list ports 'wan.20'

config device
option type 'bridge'
option name 'br-vlan30'
list ports 'wan.30'
option bridge_empty '1'
option stp '1'
option ipv6 '0'

Could you tell me whether my working config is incorrect? With the second config everything works, but I have another problem: I could not configure 802.1X dynamic VLAN.

Here is my config snippet:

Config snippet - OpenWrt 23.05.0, r23497-6637af95aa
root@OpenWrtB:/etc/config# cat network
config interface 'loopback'
        option device 'lo'
        option proto 'static'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'

config globals 'globals'
        option ula_prefix 'fd6d:ed78:b3a6::/48'

config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'lan1'
        list ports 'lan2'
        list ports 'lan3'
        list ports 'lan4'
        option ipv6 '0'

config device
        option type 'bridge'
        option name 'br-wan'
        list ports 'wan'
        option ipv6 '0'

config device
        option type 'bridge'
        option name 'br-vlan172'
        option mtu '1500'
        option ipv6 '0'
        option macaddr 'xx:xx:xx:xx:xx:xx'
        option txqueuelen '1000'
        list ports 'br-wan.172'

config interface 'wan'
        option device 'br-wan.1'
        option proto 'dhcp'

config interface 'lan'
        option device 'br-lan'
        option proto 'static'
        option ipaddr '192.168.2.1'
        option netmask '255.255.255.0'
        option ip6assign '60'

config interface 'wan6'
        option device 'wan'
        option proto 'dhcpv6'

config bridge-vlan
        option device 'br-wan'
        option vlan '1'
        list ports 'wan'

config bridge-vlan
        option device 'br-wan'
        option vlan '10'
        list ports 'wan:t'

config bridge-vlan
        option device 'br-wan'
        option vlan '172'
        list ports 'wan:t'

config bridge-vlan
        option device 'br-wan'
        option vlan '1723'
        list ports 'wan:t'

config interface 'vlan172'
        option proto 'none'
        option device 'br-wan.172'

config interface 'vlan1723'
        option proto 'none'
        option device 'br-wan.1723'

I also tried your second approach, but couldn't get that working well.

With my config, this is what I look for:

root@OpenWrtB:~#  bridge fdb show br-wan172 | grep 62
62:xx:xx:xx:xx:xx dev wl1-ap0.172 master br-vlan172
root@OpenWrtB:~# brctl show
bridge name     bridge id               STP enabled     interfaces
br-lan          7fff.xxxxxxxxxxxx       no              lan1
                                                        lan2
                                                        lan3
                                                        lan4
br-vlan172      8000.xxxxxxxxxxxx       no              br-wan.172
                                                        wl1-ap0.172
br-wan          7fff.xxxxxxxxxxxx       no              wan
                                                        wl0-ap0
                                                        wl1-ap0
                                                        wl1-ap1

... and when I see:

root@OpenWrtB:~#  bridge fdb show br-wan172 | grep 62
62:xx:xx:xx:xx:xx dev wan vlan 172 master br-wan
62:xx:xx:xx:xx:xx dev wan vlan 172 self
62:xx:xx:xx:xx:xx dev br-wan.172 master br-vlan172

... I apply:
root@OpenWrtB:~# brctl setageing br-wan 3

The issue persists in OpenWrt 23.05.2, r23630-842932a63d. The workaround continues to work.

@davidrapan, have you perhaps already created a bug/issue?

It's open here: Regression: DSA breaks roaming to WLAN bridged to VLAN · Issue #11650 · openwrt/openwrt (github.com)

Same issue here... It works with 22.03.5 but not with 23.05.X

Hi, did you find a fix or workaround in 23.05.2?