Roaming Issues Xiaomi AX3600

job4:~/git/jobawrt2/package/kernel/mac80211/patches/ath11k > fgrep 'u16 tcl_metadata;' *.patch
002-v5.12-ath11k-Update-tx-descriptor-search-index-properly.patch: u16 tcl_metadata;
801-ath11k-fix-4-addr-tx-failure-for-AP-and-STA-modes.patch:+ u16 tcl_metadata;
804-2-2-ath11k-fix-4addr-multicast-packet-tx.patch:+ u16 tcl_metadata;

I think your 801-ath11k-fix-4-addr-tx-failure-for-AP-and-STA-modes.patch is a copy of 804-2-2-ath11k-fix-4addr-multicast-packet-tx.patch then.

The original 801-ath11k-fix-4-addr-tx-failure-for-AP-and-STA-modes.patch doesn't contain "u16 tcl_metadata;".
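
A quick way to confirm whether the two files really are the same is a byte-for-byte comparison. This is only a minimal sketch; the helper name `same_patch` is made up for illustration, and it assumes you run it from the patches/ath11k directory shown above.

```shell
# Hypothetical helper: report whether two patch files have identical content.
# cmp -s compares byte for byte and signals the result via its exit status.
same_patch() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}

# Usage (from package/kernel/mac80211/patches/ath11k):
# same_patch 801-ath11k-fix-4-addr-tx-failure-for-AP-and-STA-modes.patch \
#            804-2-2-ath11k-fix-4addr-multicast-packet-tx.patch
```

If it prints "different", a plain `diff` of the two files shows where they diverge.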

Guys, I think this roaming issue is not ath11k related; it looks FDB (forwarding database) related.
There is probably a bug in MAC learning, so the switch does not flush the old FDB
entry for that MAC, which leaves a duplicate.
After a timeout the old entry is removed and, voila, traffic starts passing.

However, I don't know of a solution, given the weird and hacky nature of the ethernet + switch driver: switch ports are presented as netdevs without actually being netdevs, since DSA is not used.

Now, that should be easy to check; so easy I'm a little embarrassed I haven't done it yet. Stand by... :)

Same issue with the ath10k interface.

With 2 different APs, each connected to a different port of the AX3600, the problem is still present when roaming between these APs, i.e. the MAC address is now moving between the external ports on the AX3600. The AX3600's IP is pingable from the moving client after a few seconds, as is another established wifi client of the AX3600, but traffic that should be flowing between the external ports is not working.

"brctl showmacs" and "bridge fdb" both correctly update with the new port when the client moves to a new port, yet still traffic doesn't flow for a long time.

Also, when I physically move one of the APs to a different port on the AX3600, the moved AP can ping the 2nd AP via the switch instantly. A client that newly connected to that AP before the move, and is still in its no-connectivity period, stays in the no-connectivity period.

Once the no-connectivity client gets connectivity via the moved AP, if I move that AP back to its original port on the AX3600, then it keeps connectivity. Very weird!

Hm, then it could actually be ath11k

You will not get this working while there is no proper build of OpenWrt for these routers and ath11k is not ready. It still leaks memory all over the place, never mind handling roaming in a proper way. DAWN is like a black box as there are no docs for it. I have asked how it works and got no answers.

Can confirm that the roaming delay also happens on the ath10k interface.

I've built an image without ath10k & ath11k and the problem roaming between the switch ports persists, so I guess we can rule those out as suspects in this case.

Repeating the roaming between 2 external APs (one ASUS, one TP-Link 703n running OpenWrt), I noted that when an AP is moved between ports on the AX3600, it regains connectivity instantly, but the status of clients connected to that AP (either they have normal connectivity or they are in the non-connectivity period) is unchanged.

An explanation could be that the switch, seeing the port go down, scrubs MAC addresses associated with that port. A client connected to the AP which is in the non-connectivity period, doesn't correctly have its MAC address assigned to the now-offline port, hence it doesn't get scrubbed. A client that does have connectivity has its MAC address correctly assigned to the now-offline port, so it is scrubbed and comes back up instantly on the new port.

Both "brctl showmacs" and "bridge fdb" update with the new port of a moved client within a few seconds. At the point the client finally gets connectivity after a few minutes, there's no further change to these, since they were correct already.
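
To put a number on the no-connectivity period rather than eyeballing it, the client could log one ping result per second and summarise the log afterwards. A rough sketch, assuming a log of `<epoch-seconds> ok|fail` lines; that log format and the helper name `longest_outage` are made up here, not something used in this thread.

```shell
# Hypothetical helper: read "<epoch-seconds> ok|fail" lines on stdin and
# print the longest outage in seconds, measured from the first failed ping
# to the next successful one. An outage that never recovers is not counted.
longest_outage() {
    awk '
        $2 == "fail" && !down { down = 1; start = $1 }
        $2 == "ok"   &&  down { down = 0; d = $1 - start; if (d > max) max = d }
        END { print max + 0 }
    '
}

# One possible way to produce the log on a Linux client (assumes a ping
# that accepts -c and -W in seconds; 192.168.1.1 is a placeholder target):
# while :; do
#     if ping -c 1 -W 1 192.168.1.1 >/dev/null 2>&1; then s=ok; else s=fail; fi
#     echo "$(date +%s) $s"
#     sleep 1
# done > ping.log
#
# longest_outage < ping.log
```

Running the same loop against a host behind each switch port would show whether the outage length depends on which port the target sits on.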

And just something else I noticed. When booting, issuing "tftpboot ....." (before bootm) wakes up the switch (I guess it would be hard to tftpboot with no switch!), and at this point it's no problem to roam between APs connected to the switch.

This likely explains why all lan-lan traffic is tracked by nf_conntrack, even when configured as a dumb AP with no masquerading zones.
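
For anyone who wants to check the conntrack observation on their own box, counting the tracked flows whose source and destination are both on the LAN is one option. A minimal sketch, assuming conntrack-style text (as printed by `conntrack -L` or found in /proc/net/nf_conntrack, where available) on stdin; the helper name `count_lan_to_lan` and the 192.168.1. prefix are placeholders.

```shell
# Hypothetical helper: count conntrack entries whose original-direction
# src= and dst= addresses both start with the given prefix.
# Pass the prefix with a trailing dot so "192.168.1." does not match 192.168.10.x.
count_lan_to_lan() {
    awk -v p="$1" '
        {
            src = ""; dst = ""
            # The first src=/dst= pair on a line is the original direction.
            for (i = 1; i <= NF; i++) {
                if (src == "" && index($i, "src=") == 1) src = substr($i, 5)
                if (dst == "" && index($i, "dst=") == 1) dst = substr($i, 5)
            }
            if (index(src, p) == 1 && index(dst, p) == 1) n++
        }
        END { print n + 0 }
    '
}

# Usage on the router (needs the conntrack tool installed):
# conntrack -L 2>/dev/null | count_lan_to_lan 192.168.1.
```

A non-zero count on a dumb AP would be consistent with bridged lan-lan frames passing through the netfilter path.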

@psi-c

Does the old entry persist, or does it get removed and a new one added?

Old entries in brctl and bridge disappear as soon as they appear on the new interface, which happens as expected within seconds.
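
For repeating this check while roaming, the lookup of which port a MAC is currently learned on can be scripted. A minimal sketch that filters `bridge fdb show` style output; the helper name `fdb_port_for_mac` and the example MAC are placeholders, and it only shows the software bridge's view, not what the switch hardware has learned.

```shell
# Hypothetical helper: print the port(s) on which a MAC address is currently
# learned, reading `bridge fdb show` style lines ("<mac> dev <port> ...") on
# stdin. Permanent (static/local) entries are skipped.
fdb_port_for_mac() {
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    awk -v mac="$mac" '
        tolower($1) == mac && $2 == "dev" && $0 !~ /permanent/ { print $3 }
    ' | sort -u
}

# Usage on the router, once per second while the client roams:
# while :; do
#     echo "$(date +%s) $(bridge fdb show | fdb_port_for_mac AA:BB:CC:DD:EE:FF)"
#     sleep 1
# done
```

If the MAC ever shows up on two ports at once, both are printed, which would point back at a stale software entry.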

I actually checked the status after disconnecting the AP with the non-connectivity client and before reconnecting it: both their MAC addresses correctly disappeared completely, so in software at least things look correct. Both MAC addresses reappeared, assigned to the new port, when the AP was connected to a different port.

Ok, then I am out of ideas.

No worries, thanks for checking in and enjoy the rest of your holidays!

Seriously, no one was testing this?
@joba-1 @foxium2 @psi-c

Still on the todo list.

I've tried that now. I ran tcpdump on the 703n (with no clients), which is connected to eth3 of the AX3600. eth2 of the AX is connected to the Asus, which has the rest of the network hanging off it.

Pinging between the AX & devices on the Asus network never appears on the 703n. When moving a wifi client between the Asus & Ath11k, still no ICMP appears on the 703n, even when the client can't reach anything on the Asus network.

Somewhat unexpectedly, when moving the client from the Asus network to Ath11k, the client has no connectivity to the Asus and beyond during its purgatory period, but has no problem talking to the 703n hanging off eth3 or to the AX itself. Once connectivity to the Asus network is restored, if I move the client to the 703n's wifi, it can still see the Asus network without problem.

Jumping from the 703n to the Ath11k, still no problems contacting the Asus net.

In short, the roaming issue I'm seeing only occurs when the client moves between the Asus and another port on the AX. No problem when moving between the AX Ath11k & the 703n. To the client, it's like the link between the Asus and AX is broken.

Time to try some different topologies when I'm home...

Please run it on a PC which is directly connected to the AX3600, as the internal switch of the 703n may drop those frames.