Roaming Issues Xiaomi AX3600

I have now plugged a notebook and a Linux sat-box into two ports of one AX3600.

Running tcpdump icmp on the notebook, I can see pings from the sat-box to the notebook, but not pings to the AX3600 or pings to other devices (WLAN or LAN).

Looks good to me.

Now that this looks more like a switch issue:

Is there a way to make the hardware switch unlearn a mac?

I'd like to test if that speeds up things.

I tried bridge fdb del MAC dev DEV master br-lan with DEV being wlan0 or eth1 (lan port where the other ax3600 is connected). But that did not help...

Do you have any other APs you can hang off one of your AX3600s to try roaming to? I was going to try a Fritzbox 7390 tonight but the PSU seems to have gone walkies...

Running tcpdump on the Asus and pinging it from the Pi after it moves to the AX, I can see ARP requests from the Pi to the Asus (the requests go out of every active port on the AX). The Asus replies on the correct port, but the replies are not seen by tcpdump listening on the AX's port connected to the Asus, nor do they appear on any other port of the AX. Similarly, pings from the Asus to the Pi are sent by the Asus but are never seen on the AX. Then all of a sudden a single ARP reply from the Asus appears on the AX and the Pi has connectivity.

Manually adding the Asus's MAC address to the Pi's ARP table allows the Pi to see the Asus quicker after moving to the AX, but the Pi still can't see anything else beyond the Asus for a long time. In this case there was no ARP request or reply from the Pi to the Asus, traffic just started flowing.

Yes, I have.

A D-Link DIR-825 B with current OpenWrt, and Fritzboxes.
Currently one Fritzbox (the internet router) is already directly connected to a LAN port of one AX3600, like so:

[AX2] - lan - [Fritz 5530] - lan - [dumb 5xSwitch] - lan - [AX1]

I have linux devices with tcpdump attached to AX2, dumb Switch and AX1.
The wlan of the fritzbox is currently off, but I can switch it on for tests.

What is the goal of the test and what exactly should I do?

It's more of a curiosity to see what happens.

As I've seen, an Asus connected to the AX has roaming troubles, yet a 703n connected to the AX works fine. I don't see that it's an issue with the Asus as packets are correctly sent from the Asus, and when the AX is in uboot, it's possible to roam between the Asus and 703n, both connected via the AX's switch.

You've seen a problem with roaming between a pair of AX3600s, so I'm wondering is the 703n a lucky exception, or do other devices also seem to get better treatment by the AX.

Finally did some testing again, this time with the other (Fritz) box. No tcpdumps, but combined syslog entries of AX1, AX2 and the roaming ESP32. Of course there is no syslog directly from the Fritzbox...

Spoiler: Roaming to Fritzbox is reasonably fast, roaming to an AX is slow (no matter from where)

Summary
Switch on ESP32 at AX1


2021-07-30T16:25:24+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: authenticated
2021-07-30T16:25:24+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: associated (aid 11)
2021-07-30T16:25:24+02:00 ax1 hostapd: wlan2: AP-STA-CONNECTED 24:a1:60:54:37:d4
2021-07-30T16:25:24+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 RADIUS: starting accounting session 70BC1C9F1E3A33E5
2021-07-30T16:25:24+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 WPA: pairwise key handshake completed (RSN)
2021-07-30T16:25:24+02:00 ax1 hostapd: wlan2: EAPOL-4WAY-HS-COMPLETED 24:a1:60:54:37:d4

2021-07-30T16:25:24+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:25:24+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4

2021-07-30T16:25:24+02:00 ax1 hostapd: wlan2: AP-STA-DISCONNECTED 24:a1:60:54:37:d4
2021-07-30T16:25:24+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: disassociated
2021-07-30T16:25:25+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)

2021-07-30T16:25:26+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: authenticated
2021-07-30T16:25:26+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: associated (aid 11)
2021-07-30T16:25:26+02:00 ax1 hostapd: wlan2: AP-STA-CONNECTED 24:a1:60:54:37:d4
2021-07-30T16:25:26+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 RADIUS: starting accounting session D1E1F22639F23AA6
2021-07-30T16:25:26+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 WPA: pairwise key handshake completed (RSN)
2021-07-30T16:25:26+02:00 ax1 hostapd: wlan2: EAPOL-4WAY-HS-COMPLETED 24:a1:60:54:37:d4

2021-07-30T16:25:26+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:25:26+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4

2021-07-30T16:25:28.487839+02:00 Roamer-5437D4 Roamer Host Roamer-5437D4 3.1 started with IP 192.168.1.179
2021-07-30T16:25:28.494833+02:00 Roamer-5437D4 Roamer Connect job.fritz.ssid with MAC 24:A1:60:54:37:D4 at 2021-07-30 16:25:28
2021-07-30T16:25:28.502871+02:00 Roamer-5437D4 Roamer RSSI -> -63 at 2021-07-30 16:25:28
2021-07-30T16:25:28.554805+02:00 Roamer-5437D4 Roamer Echo disconnect at 1900-01-00 00:00:00
2021-07-30T16:25:28.561484+02:00 Roamer-5437D4 Roamer Echo ok at 2021-07-30 16:25:28
2021-07-30T16:25:29.555533+02:00 Roamer-5437D4 Roamer No ping at 1900-01-00 00:00:00
2021-07-30T16:25:29.560853+02:00 Roamer-5437D4 Roamer Ping ok at 2021-07-30 16:25:29


Connected to AX1


2021-07-30T16:25:31.177024+02:00 Roamer-5437D4 Roamer RSSI -> -57 at 2021-07-30 16:25:31
2021-07-30T16:31:11.663239+02:00 Roamer-5437D4 Roamer RSSI -> -67 at 2021-07-30 16:31:11


Roaming to AX2


2021-07-30T16:31:15.244544+02:00 Roamer-5437D4 Roamer RSSI -> -81 at 2021-07-30 16:31:15
2021-07-30T16:31:15.467126+02:00 Roamer-5437D4 Roamer RSSI -> -86 at 2021-07-30 16:31:15
2021-07-30T16:31:16.552009+02:00 Roamer-5437D4 Roamer Reconnect low RSSI at 2021-07-30 16:31:16

2021-07-30T16:31:23+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: authenticated
2021-07-30T16:31:23+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: associated (aid 2)
2021-07-30T16:31:23+02:00 ax2 hostapd: wlan2: AP-STA-CONNECTED 24:a1:60:54:37:d4
2021-07-30T16:31:23+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 RADIUS: starting accounting session 3A6D819144D7298C
2021-07-30T16:31:23+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 WPA: pairwise key handshake completed (RSN)
2021-07-30T16:31:23+02:00 ax2 hostapd: wlan2: EAPOL-4WAY-HS-COMPLETED 24:a1:60:54:37:d4

2021-07-30T16:31:23+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:31:23+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4
2021-07-30T16:31:23+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:31:23+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4

2021-07-30T16:31:23+02:00 ax2 hostapd: wlan2: AP-STA-DISCONNECTED 24:a1:60:54:37:d4
2021-07-30T16:31:23+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: disassociated
2021-07-30T16:31:24+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)

2021-07-30T16:31:25+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: authenticated
2021-07-30T16:31:25+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: associated (aid 2)
2021-07-30T16:31:25+02:00 ax2 hostapd: wlan2: AP-STA-CONNECTED 24:a1:60:54:37:d4
2021-07-30T16:31:25+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 RADIUS: starting accounting session 51B6B1A7A65FA03F
2021-07-30T16:31:25+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 WPA: pairwise key handshake completed (RSN)
2021-07-30T16:31:25+02:00 ax2 hostapd: wlan2: EAPOL-4WAY-HS-COMPLETED 24:a1:60:54:37:d4

cut repeating retries, until...

2021-07-30T16:34:09+02:00 ax2 hostapd: wlan2: EAPOL-4WAY-HS-COMPLETED 24:a1:60:54:37:d4

2021-07-30T16:34:09+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:34:09+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4
2021-07-30T16:34:09+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:34:09+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4
2021-07-30T16:34:09+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:34:09+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4
2021-07-30T16:34:09+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:34:09+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4


Connected to AX2 (175s roam)


2021-07-30T16:34:09.876077+02:00 Roamer-5437D4 Roamer Disconnect at 2021-07-30 16:31:23
2021-07-30T16:34:09.879184+02:00 Roamer-5437D4 Roamer Connect job.fritz.ssid with MAC 24:A1:60:54:37:D4 at 2021-07-30 16:34:09
2021-07-30T16:34:09.888462+02:00 Roamer-5437D4 Roamer RSSI -> -55 at 2021-07-30 16:34:09
2021-07-30T16:34:10.871745+02:00 Roamer-5437D4 Roamer Echo disconnect at 2021-07-30 16:34:09
2021-07-30T16:34:10.879081+02:00 Roamer-5437D4 Roamer Echo ok at 2021-07-30 16:34:10
2021-07-30T16:34:11.573749+02:00 Roamer-5437D4 Roamer RSSI -> -50 at 2021-07-30 16:34:11
2021-07-30T16:34:11.852444+02:00 Roamer-5437D4 Roamer No ping at 2021-07-30 16:34:09
2021-07-30T16:34:11.859853+02:00 Roamer-5437D4 Roamer Ping ok at 2021-07-30 16:34:11
2021-07-30T16:34:13.314585+02:00 Roamer-5437D4 Roamer RSSI -> -40 at 2021-07-30 16:34:13
2021-07-30T16:34:13.831473+02:00 Roamer-5437D4 Roamer RSSI -> -45 at 2021-07-30 16:34:13


Roaming to Fritz


2021-07-30T16:34:51.615375+02:00 Roamer-5437D4 Roamer RSSI -> -80 at 2021-07-30 16:34:51
2021-07-30T16:34:52.329314+02:00 Roamer-5437D4 Roamer RSSI -> -86 at 2021-07-30 16:34:52
2021-07-30T16:34:53.406415+02:00 Roamer-5437D4 Roamer Reconnect low RSSI at 2021-07-30 16:34:53

2021-07-30T16:34:53+02:00 ax2 hostapd: wlan2: AP-STA-DISCONNECTED 24:a1:60:54:37:d4
2021-07-30T16:34:53+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: disassociated
2021-07-30T16:34:54+02:00 ax2 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)

2021-07-30T16:34:56+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:34:56+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4

2021-07-30T16:34:57.525779+02:00 Roamer-5437D4 Roamer Reconnect at 2021-07-30 16:34:53

2021-07-30T16:34:59+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:34:59+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4

2021-07-30T16:35:01.547722+02:00 Roamer-5437D4 Roamer RSSI -> 0 at 2021-07-30 16:34:57
2021-07-30T16:35:01.564940+02:00 Roamer-5437D4 Roamer RSSI -> -68 at 2021-07-30 16:35:01
2021-07-30T16:35:01.589248+02:00 Roamer-5437D4 Roamer RSSI -> -60 at 2021-07-30 16:35:01
2021-07-30T16:35:01.977883+02:00 Roamer-5437D4 Roamer RSSI -> -53 at 2021-07-30 16:35:01
2021-07-30T16:35:02.585276+02:00 Roamer-5437D4 Roamer RSSI -> -47 at 2021-07-30 16:35:02
2021-07-30T16:35:02.619454+02:00 Roamer-5437D4 Roamer No ping at 2021-07-30 16:35:01
2021-07-30T16:35:02.627440+02:00 Roamer-5437D4 Roamer Ping ok at 2021-07-30 16:35:02


Connected to Fritz (9s roam)


2021-07-30T16:35:02.688738+02:00 Roamer-5437D4 Roamer RSSI -> -57 at 2021-07-30 16:35:02
2021-07-30T16:35:02.786457+02:00 Roamer-5437D4 Roamer RSSI -> -50 at 2021-07-30 16:35:02


Roam to AX1


2021-07-30T16:35:29.617432+02:00 Roamer-5437D4 Roamer RSSI -> -78 at 2021-07-30 16:35:29
2021-07-30T16:35:39.552082+02:00 Roamer-5437D4 Roamer RSSI -> -90 at 2021-07-30 16:35:39

2021-07-30T16:36:24+02:00 ax1 hostapd: wlan2: AP-STA-DISCONNECTED 24:a1:60:54:37:d4
2021-07-30T16:36:24+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: authenticated
2021-07-30T16:36:24+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: associated (aid 11)
2021-07-30T16:36:24+02:00 ax1 hostapd: wlan2: AP-STA-CONNECTED 24:a1:60:54:37:d4
2021-07-30T16:36:24+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 RADIUS: starting accounting session D1E1F22639F23AA6
2021-07-30T16:36:24+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 WPA: pairwise key handshake completed (RSN)
2021-07-30T16:36:24+02:00 ax1 hostapd: wlan2: EAPOL-4WAY-HS-COMPLETED 24:a1:60:54:37:d4

2021-07-30T16:36:24+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:36:24+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4

2021-07-30T16:36:25+02:00 ax1 hostapd: wlan2: AP-STA-DISCONNECTED 24:a1:60:54:37:d4
2021-07-30T16:36:25+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: disassociated
2021-07-30T16:36:26+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)

2021-07-30T16:36:27+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: authenticated
2021-07-30T16:36:27+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 IEEE 802.11: associated (aid 11)
2021-07-30T16:36:27+02:00 ax1 hostapd: wlan2: AP-STA-CONNECTED 24:a1:60:54:37:d4
2021-07-30T16:36:27+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 RADIUS: starting accounting session C427CD10225EA4E0
2021-07-30T16:36:27+02:00 ax1 hostapd: wlan2: STA 24:a1:60:54:37:d4 WPA: pairwise key handshake completed (RSN)
2021-07-30T16:36:27+02:00 ax1 hostapd: wlan2: EAPOL-4WAY-HS-COMPLETED 24:a1:60:54:37:d4

2021-07-30T16:36:27+02:00 ax1 dnsmasq-dhcp[5312]: DHCPREQUEST(br-lan) 192.168.1.179 24:a1:60:54:37:d4
2021-07-30T16:36:27+02:00 ax1 dnsmasq-dhcp[5312]: DHCPACK(br-lan) 192.168.1.179 24:a1:60:54:37:d4 Roamer-5437D4

2021-07-30T16:38:53.003965+02:00 Roamer-5437D4 Roamer Echo disconnect at 2021-07-30 16:35:33
2021-07-30T16:38:53.011063+02:00 Roamer-5437D4 Roamer Echo ok at 2021-07-30 16:38:52
2021-07-30T16:38:53.039640+02:00 Roamer-5437D4 Roamer No ping at 2021-07-30 16:35:39
2021-07-30T16:38:53.046487+02:00 Roamer-5437D4 Roamer Ping ok at 2021-07-30 16:38:53


Connected to AX1 (149s roam)
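
The roam durations in the headings can be recomputed from the syslog timestamps. A minimal Python sketch, assuming a roam is measured from the ESP32's "Reconnect low RSSI" entry to its next "Ping ok" entry. That choice of endpoints is an assumption on my part: it reproduces the 175s and 9s figures, while the 149s figure for AX1 appears to use a different starting point.

```python
from datetime import datetime


def roam_seconds(start_stamp: str, end_stamp: str) -> float:
    """Seconds between two ISO 8601 syslog timestamps."""
    start = datetime.fromisoformat(start_stamp)
    end = datetime.fromisoformat(end_stamp)
    return (end - start).total_seconds()


# Timestamps copied from the log above:
# first argument is "Reconnect low RSSI", second is the next "Ping ok".
ax2 = roam_seconds("2021-07-30T16:31:16.552009+02:00",
                   "2021-07-30T16:34:11.859853+02:00")
fritz = roam_seconds("2021-07-30T16:34:53.406415+02:00",
                     "2021-07-30T16:35:02.627440+02:00")

print(f"AX1 -> AX2:   {ax2:.0f}s")
print(f"AX2 -> Fritz: {fritz:.0f}s")
```

Parsing the full timestamps, rather than subtracting clock times by hand, keeps the calculation correct across minute boundaries.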

OK, testing without test details is pretty useless, so...

.config diff from default

CONFIG_TARGET_ipq807x=y
CONFIG_TARGET_ipq807x_generic=y
CONFIG_TARGET_ipq807x_generic_DEVICE_xiaomi_ax3600=y
CONFIG_DEVEL=y
CONFIG_BUILD_LOG=y
CONFIG_KERNEL_BUILD_DOMAIN="patch-mem-roam-1"
CONFIG_KERNEL_BUILD_USER="joba-1"
CONFIG_MACTELNET_PLAIN_SUPPORT=y
CONFIG_OPENVPN_wolfssl=y
CONFIG_OPENVPN_wolfssl_ENABLE_DEF_AUTH=y
CONFIG_OPENVPN_wolfssl_ENABLE_FRAGMENT=y
CONFIG_OPENVPN_wolfssl_ENABLE_LZ4=y
CONFIG_OPENVPN_wolfssl_ENABLE_MULTIHOME=y
CONFIG_OPENVPN_wolfssl_ENABLE_PF=y
CONFIG_OPENVPN_wolfssl_ENABLE_PORT_SHARE=y
CONFIG_OPENVPN_wolfssl_ENABLE_SMALL=y
# CONFIG_PACKAGE_ath10k-firmware-qca9887-ct is not set
CONFIG_PACKAGE_ath10k-firmware-qca9887-ct-full-htt=y
CONFIG_PACKAGE_atop=y
CONFIG_PACKAGE_bridge=y
CONFIG_PACKAGE_cJSON=y
CONFIG_PACKAGE_cgi-io=y
CONFIG_PACKAGE_dawn=y
CONFIG_PACKAGE_hostapd-utils=y
CONFIG_PACKAGE_htop=y
CONFIG_PACKAGE_ip-bridge=y
CONFIG_PACKAGE_iperf3=y
CONFIG_PACKAGE_iptables-mod-conntrack-extra=y
CONFIG_PACKAGE_iptables-mod-extra=y
CONFIG_PACKAGE_iptables-mod-ipopt=y
CONFIG_PACKAGE_iptables-mod-physdev=y
CONFIG_PACKAGE_irqbalance=y
# CONFIG_PACKAGE_kmod-ath10k-ct is not set
CONFIG_PACKAGE_kmod-ath10k-ct-smallbuffers=y
CONFIG_PACKAGE_kmod-br-netfilter=y
CONFIG_PACKAGE_kmod-ifb=y
CONFIG_PACKAGE_kmod-ipt-conntrack-extra=y
CONFIG_PACKAGE_kmod-ipt-extra=y
CONFIG_PACKAGE_kmod-ipt-ipopt=y
CONFIG_PACKAGE_kmod-ipt-physdev=y
CONFIG_PACKAGE_kmod-ipt-raw=y
CONFIG_PACKAGE_kmod-netatop=y
CONFIG_PACKAGE_kmod-netlink-diag=y
CONFIG_PACKAGE_kmod-qca-nss-drv=y
CONFIG_PACKAGE_kmod-qca-nss-ecm=y
CONFIG_PACKAGE_kmod-sched-connmark=y
CONFIG_PACKAGE_kmod-sched-core=y
CONFIG_PACKAGE_kmod-tun=y
CONFIG_PACKAGE_libcares=y
CONFIG_PACKAGE_libgcrypt=y
CONFIG_PACKAGE_libgpg-error=y
CONFIG_PACKAGE_libiwinfo-lua=y
CONFIG_PACKAGE_liblua=y
CONFIG_PACKAGE_liblucihttp=y
CONFIG_PACKAGE_liblucihttp-lua=y
CONFIG_PACKAGE_libmosquitto-nossl=y
CONFIG_PACKAGE_libncurses=y
CONFIG_PACKAGE_libpcap=y
CONFIG_PACKAGE_libpcre=y
CONFIG_PACKAGE_librt=y
CONFIG_PACKAGE_libstdcpp=y
CONFIG_PACKAGE_libtirpc=y
CONFIG_PACKAGE_libubus-lua=y
CONFIG_PACKAGE_libuci-lua=y
CONFIG_PACKAGE_logger=y
CONFIG_PACKAGE_lsof=y
CONFIG_PACKAGE_lua=y
CONFIG_PACKAGE_lua-bit32=y
CONFIG_PACKAGE_luabitop=y
CONFIG_PACKAGE_luasocket=y
CONFIG_PACKAGE_luci=y
CONFIG_PACKAGE_luci-app-dawn=y
CONFIG_PACKAGE_luci-app-firewall=y
CONFIG_PACKAGE_luci-app-openvpn=y
CONFIG_PACKAGE_luci-app-opkg=y
CONFIG_PACKAGE_luci-app-qos=y
CONFIG_PACKAGE_luci-base=y
CONFIG_PACKAGE_luci-compat=y
CONFIG_PACKAGE_luci-lib-base=y
CONFIG_PACKAGE_luci-lib-ip=y
CONFIG_PACKAGE_luci-lib-json=y
CONFIG_PACKAGE_luci-lib-jsonc=y
CONFIG_PACKAGE_luci-lib-nixio=y
CONFIG_PACKAGE_luci-mod-admin-full=y
CONFIG_PACKAGE_luci-mod-dashboard=y
CONFIG_PACKAGE_luci-mod-network=y
CONFIG_PACKAGE_luci-mod-status=y
CONFIG_PACKAGE_luci-mod-system=y
CONFIG_PACKAGE_luci-proto-ipv6=y
CONFIG_PACKAGE_luci-proto-ppp=y
CONFIG_PACKAGE_luci-ssl=y
CONFIG_PACKAGE_luci-theme-bootstrap=y
CONFIG_PACKAGE_luci-theme-openwrt-2020=y
CONFIG_PACKAGE_mac-telnet-client=y
CONFIG_PACKAGE_mac-telnet-discover=y
CONFIG_PACKAGE_mac-telnet-ping=y
CONFIG_PACKAGE_mac-telnet-server=y
CONFIG_PACKAGE_mii-tool=y
CONFIG_PACKAGE_mosquitto-client-nossl=y
CONFIG_PACKAGE_ncat=y
CONFIG_PACKAGE_netatop=y
CONFIG_PACKAGE_nmap=y
CONFIG_PACKAGE_nping=y
CONFIG_PACKAGE_openvpn-wolfssl=y
CONFIG_PACKAGE_prometheus-node-exporter-lua=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-dawn=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-hostapd_stations=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-hostapd_ubus_stations=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-nat_traffic=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-netstat=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-openwrt=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-snmp6=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-textfile=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-uci_dhcp_host=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-wifi=y
CONFIG_PACKAGE_prometheus-node-exporter-lua-wifi_stations=y
CONFIG_PACKAGE_prometheus-statsd-exporter=y
CONFIG_PACKAGE_px5g-wolfssl=y
CONFIG_PACKAGE_qos-scripts=y
CONFIG_PACKAGE_rpcd=y
CONFIG_PACKAGE_rpcd-mod-file=y
CONFIG_PACKAGE_rpcd-mod-iwinfo=y
CONFIG_PACKAGE_rpcd-mod-luci=y
CONFIG_PACKAGE_rpcd-mod-rrdns=y
CONFIG_PACKAGE_ss=y
CONFIG_PACKAGE_tc-mod-iptables=y
CONFIG_PACKAGE_tc-tiny=y
CONFIG_PACKAGE_tcpdump=y
CONFIG_PACKAGE_terminfo=y
CONFIG_PACKAGE_uhttpd=y
CONFIG_PACKAGE_uhttpd-mod-ubus=y
CONFIG_PACKAGE_umdns=y
# CONFIG_PACKAGE_wpad-basic-wolfssl is not set
CONFIG_PACKAGE_wpad-wolfssl=y
CONFIG_PACKAGE_zlib=y
CONFIG_WOLFSSL_HAS_OPENVPN=y

If you like, compare my fork with robimarko's to see the patches used (lower memory use, and failed attempts at faster roams):

And just something else I noticed. When booting, issuing "tftpboot ....." (before bootm) wakes up the switch (I guess it would be hard to tftpboot with no switch!), and at this point it's no problem to roam between APs connected to the switch.

I can't begin to help debug here, but the fact that the switch works OK in dumb(?) mode in uboot does seem to isolate things to Linux/firmware/config on the AX3600 - rather than any other devices on the test setup.

Digging around, I see that the switch is implemented in the main IPQ8071A SoC NSS, while the device physically connected to the ethernet ports seems to be just a transceiver array without packet routing.


Well that's interesting. I was expecting roaming to/from the Fritz to either be instant or have the delay, whereas you can roam fast to the Fritz but slow back, which isn't what I've seen. Still no idea what's happened to my Fritz's PSU; the Fritzbox is exactly where it's been for the last 11 years :)

Meanwhile I've tried to make a QSDK build to see if that has the issue. I finally got the build to complete, but none of the images created even start booting the kernel... I'm wondering if the NSS acceleration could be causing it; it's the sort of place things could easily go wrong. The hunt goes on!

I have now taken TCP/IP problems out of the equation.

For that I wrote a program that sends and receives raw ethernet frames (https://github.com/joba-1/Ethic) and let it run on a lan connected server (job4) and on a wlan roaming client (job2).

job4 -LAN- AX2 -WLAN- job2
        \- AX1 -/
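
For readers who want to try a similar test, here is a minimal Python sketch of sending a raw Ethernet frame on Linux. It is not how Ethic is implemented (I have not checked that code); the EtherType 0x88B5 (an IEEE "local experimental" value) and the function names are assumptions chosen for illustration. The padding of the payload to 46 bytes with dots simply mirrors the frames printed further down in this post.

```python
import socket
import struct

ETHERTYPE_EXPERIMENTAL = 0x88B5  # assumed choice, not taken from Ethic
MIN_PAYLOAD = 46                 # minimum Ethernet payload size in bytes


def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_frame(dst: str, src: str, text: str) -> bytes:
    """Ethernet II header plus an ASCII payload padded with '.' to 46 bytes."""
    payload = text.encode("ascii").ljust(MIN_PAYLOAD, b".")
    header = struct.pack("!6s6sH", mac_bytes(dst), mac_bytes(src),
                         ETHERTYPE_EXPERIMENTAL)
    return header + payload


def send_frame(ifname: str, frame: bytes) -> None:
    """Send one frame on a Linux interface; needs root (CAP_NET_RAW)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        sock.send(frame)


frame = build_frame("00:26:c7:3c:e8:1c", "44:8a:5b:5b:fe:de",
                    "job4: Mon Aug  2 19:29:55 CEST 2021")
print(len(frame), frame[14:])
```

Calling send_frame("eth0", frame) as root would put the frame on the wire; the interface name is a placeholder. Because no IP stack is involved, ARP, DHCP and routing cannot be the cause of any gap seen with such frames.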

Receiving packets from job4 on job2 near AX2 and walking to AX1...

44:8a:5b:5b:fe:de -> 00:26:c7:3c:e8:1c [46]: 'job4: Mon Aug  2 19:29:55 CEST 2021...........'
44:8a:5b:5b:fe:de -> 00:26:c7:3c:e8:1c [46]: 'job4: Mon Aug  2 19:29:56 CEST 2021...........'

Gap after roaming to AX1

44:8a:5b:5b:fe:de -> 00:26:c7:3c:e8:1c [46]: 'job4: Mon Aug  2 19:33:45 CEST 2021...........'
44:8a:5b:5b:fe:de -> 00:26:c7:3c:e8:1c [46]: 'job4: Mon Aug  2 19:33:46 CEST 2021...........'

now walking back to AX2...

44:8a:5b:5b:fe:de -> 00:26:c7:3c:e8:1c [46]: 'job4: Mon Aug  2 19:37:54 CEST 2021...........'
44:8a:5b:5b:fe:de -> 00:26:c7:3c:e8:1c [46]: 'job4: Mon Aug  2 19:37:55 CEST 2021...........'

Gap after roaming back to AX2

44:8a:5b:5b:fe:de -> 00:26:c7:3c:e8:1c [46]: 'job4: Mon Aug  2 19:39:26 CEST 2021...........'
44:8a:5b:5b:fe:de -> 00:26:c7:3c:e8:1c [46]: 'job4: Mon Aug  2 19:39:27 CEST 2021...........'

The gaps are always in the direction from job4 to job2. Frames in the other direction go through immediately.

EDIT:

  • a broadcast frame from job4 during the gap time is received on job2 but does not end the gap for frames sent directly
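
The length of each gap can be read off the timestamps embedded in the frame payloads. A small Python sketch, assuming the lines shown above are the last frame received before each gap and the first one after it (the output was trimmed, so this is an assumption):

```python
import re
from datetime import datetime

# Matches e.g. "Aug  2 19:29:56 CEST 2021" inside a payload.
STAMP_RE = re.compile(r"(\w{3})\s+(\d{1,2}) (\d\d:\d\d:\d\d) \w+ (\d{4})")


def frame_time(payload: str) -> datetime:
    """Extract the sender's timestamp from a frame payload (timezone ignored)."""
    mon, day, clock, year = STAMP_RE.search(payload).groups()
    return datetime.strptime(f"{day} {mon} {year} {clock}", "%d %b %Y %H:%M:%S")


# (last frame before the gap, first frame after the gap)
pairs = [
    ("job4: Mon Aug  2 19:29:56 CEST 2021", "job4: Mon Aug  2 19:33:45 CEST 2021"),
    ("job4: Mon Aug  2 19:37:55 CEST 2021", "job4: Mon Aug  2 19:39:26 CEST 2021"),
]

gaps = [int((frame_time(after) - frame_time(before)).total_seconds())
        for before, after in pairs]
print(gaps)
```

Under that assumption the gap is roughly 229 seconds after roaming to AX1 and roughly 91 seconds after roaming back to AX2.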

I've built what I think is a QSDK 10.0 image from

and it seems to work fine. There's no wifi, but I can switch between the Asus & 703n constantly with just a few seconds downtime. I tried getting creative with that to enable ath11k but ended up with error after error and didn't think it was worth any more effort since I can test the switch issue without it.

Tried again to build QSDK 11.3 from the above instructions and got a bootable image this time, but the switch wasn't passing any traffic at all, and whilst ifconfig shows traffic counters going up, tcpdump sees nothing. The bootlog showed qca-nss complaining about firmware, and looking in /lib/firmware the files were dead links. I copied /lib/firmware from a recent robimarko image, rebuilt the image and booted again, and the switch was fine.

Again I can switch between the Asus & 703n without issue, which I guess rules out it being an 11.3 firmware problem, so 'just' something in the source that needs tweaking.


another observation that hopefully narrows down the code paths to check:

If I send frames both ways job4 -LAN- AX2 -WLAN- job2 and switch wlan of job2 from AX2s ath11 to ath10 (without using an IP), I get a delay of 3-4 min receiving frames of job4 on job2. Switching back to ath11 there is no delay.

Rebuilding the QSDK 11.3 image, excluding a curiously named module "qca_nss_bridge_mgr" results in the exact roaming issues we are seeing.

Rebuilding again just adding back in "qca_nss_bridge_mgr" has no issue roaming.

Inside a working build, rmmoding the module has no immediate effect, roaming still works fine. The first time I tried this, destroying and recreating br-lan resulted in broken roaming. I rebooted to try and see what was needed exactly after rmmod to break roaming, but now I can't break it by destroying & recreating br-lan.

Anyway, it seems like this module (or lack of) could well be the culprit.


Hm, that is a good hint.
I gotta see what that module does.

OK, so that is part of the NSS clients. The module handles FDB and other L2 offloading, and uses the PPE through the NSS FW, thus providing bridge offloading.
But it's not easy to port at all.

https://source.codeaurora.org/quic/cc-qrdk/oss/lklm/nss-clients/tree/bridge/nss_bridge_mgr.c?h=NHSS.QSDK.11.4


Perhaps somewhat naively, I've built qca_nss_bridge_mgr as well as qca_nss_vlan, which it seems to depend on.

I actually spent more time preparing the Makefile to get it to attempt to build the modules than fixing the one issue the modules had. Good learning experience :)

It sort of works... in terms of 1 step forward 2 steps back.

The Pi and Macbook can roam freely between the Asus & 703n.

That's where the good news ends.

4 or 5 times in a row it panicked at pretty much the same uptime +- 5s. As I'm writing this it's somehow survived longer and panicked at 848s. Always at br_fdb_bridge_dev_get_and_hold, which I'll get to later.

[  496.607273] Mem abort info:
[  496.614055]   ESR = 0x96000004
[  496.616742]   EC = 0x25: DABT (current EL), IL = 32 bits
[  496.619904]   SET = 0, FnV = 0
[  496.625330]   EA = 0, S1PTW = 0
[  496.628229] Data abort info:
[  496.631231]   ISV = 0, ISS = 0x00000004
[  496.634349]   CM = 0, WnR = 0
[  496.637955] [0000ffffffc0101c] address between user and kernel address ranges
[  496.641053] Internal error: Oops: 96000004 [#1] SMP
[  496.648155] Modules linked in: iptable_nat ath11k_ahb ath11k ath10k_pci ath10k_core ath xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD nf_nat nf_flow_table nf_conntrack mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG ppp_async nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_filter ip_tables hwmon crc_ccitt compat qca_nss_pppoe pppoe pppox ppp_generic slhc qca_nss_bridge_mgr qca_nss_vlan nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 qca_nss_drv qca_nss_dp qca_ssdk seqiv jitterentropy_rng drbg michael_mic hmac cmac leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom gpio_button_hotplug
[  496.698321] CPU: 0 PID: 280 Comm: kworker/0:2 Tainted: G        W         5.10.54 #0
[  496.720557] Hardware name: Xiaomi AX3600 (DT)
[  496.728374] Workqueue: events_long br_fdb_cleanup
[  496.732619] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[  496.737311] pc : br_fdb_bridge_dev_get_and_hold+0x4/0x38
[  496.743395] lr : nss_bridge_mgr_find_instance+0x100/0x1f0 [qca_nss_bridge_mgr]
[  496.748680] sp : ffffffc013063c70
[  496.755703] x29: ffffffc013063c70 x28: 0000000000000000 
[  496.759093] x27: ffffff8003424d50 x26: ffffffc01004c930 
[  496.764475] x25: ffffffc01238b9b0 x24: ffffff80034248c0 
[  496.769771] x23: ffffffc013063d8a x22: 0000000000000000 
[  496.775066] x21: 00000000000124f8 x20: 00000000fffffffe 
[  496.780361] x19: ffffffc013063d8a x18: 0000000000000000 
[  496.785655] x17: 0000000000000000 x16: 0000000000000000 
[  496.790951] x15: 0000000000000000 x14: 0000000000000040 
[  496.796247] x13: 0000000000000228 x12: 0000000000000000 
[  496.801541] x11: 0000000000000000 x10: 0000000000000000 
[  496.806837] x9 : 0000000000000000 x8 : ffffff801f6cde00 
[  496.812132] x7 : ffffffc01232ecd8 x6 : 0000000000000000 
[  496.817427] x5 : ffffffc013063b48 x4 : ffffffc0089fd8c8 
[  496.822722] x3 : 0000000000000000 x2 : ffffffc013063d8a 
[  496.828017] x1 : 0000000000000000 x0 : 4300ffffffc01004 
[  496.833313] Call trace:
[  496.838604]  br_fdb_bridge_dev_get_and_hold+0x4/0x38
[  496.840784]  atomic_notifier_call_chain+0x58/0x88
[  496.845986]  br_fdb_cleanup+0x1a4/0x1e8
[  496.850587]  process_one_work+0x200/0x3b0
[  496.854231]  worker_thread+0x54/0x4e8
[  496.858396]  kthread+0x124/0x128
[  496.862043]  ret_from_fork+0x10/0x30
[  496.865345] Code: 9400bab0 d4210000 17ffffd2 d503233f (f9400c01) 
[  496.868905] ---[ end trace 04421dadb8445493 ]---
[  496.874893] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  496.879587] SMP: stopping secondary CPUs
[  496.886265] Kernel Offset: disabled
[  496.890341] CPU features: 0x0040002,20002000
[  496.893553] Memory Limit: none
[  497.098090] Rebooting in 3 seconds..

Similarly connecting to the Ath11 network causes an instant panic at the same place.

Now to the build... All that's missing is the function br_fdb_bridge_dev_get_and_hold(), which as noted above just happens to cause the panic. Curiously, this function is nothing more than a wrapper:

struct net_device *br_fdb_bridge_dev_get_and_hold(struct net_bridge *br)
{
	dev_hold(br->dev);
	return br->dev;
}
EXPORT_SYMBOL_GPL(br_fdb_bridge_dev_get_and_hold);

And is called just once. It might make more sense if there was an accompanying br_fdb_bridge_dev_put() which was a wrapper for dev_put() but alas, inside nss_bridge_mgr_fdb_update_callback() they call dev_put() directly. I'm sure QC must have big plans for br_fdb_bridge_dev_put() in the future!

That snippet lives in net/bridge/br_fdb.c

The following declaration goes in include/linux/if_bridge.h

extern struct net_device *br_fdb_bridge_dev_get_and_hold(struct net_bridge *br);

The qca-nss-clients Makefile:

include $(TOPDIR)/rules.mk

PKG_NAME:=qca-nss-clients
PKG_RELEASE:=$(AUTORELEASE)

PKG_SOURCE_URL:=https://source.codeaurora.org/quic/cc-qrdk/oss/lklm/nss-clients
PKG_SOURCE_PROTO:=git
PKG_SOURCE_DATE:=2021-04-29
PKG_SOURCE_VERSION:=b93c72c1b72c591c2ddc2f0b24f0e2b457720118
PKG_MIRROR_HASH:=9fab23da994bfbac9a3cef32cdfec31a87a03ed415f36bc926da32b7b0934259

include $(INCLUDE_DIR)/kernel.mk
include $(INCLUDE_DIR)/package.mk

define KernelPackage/qca-nss-drv-pppoe
  SECTION:=kernel
  CATEGORY:=Kernel modules
  SUBMENU:=Network Devices
  TITLE:=Kernel driver for NSS (connection manager) - PPPoE
  DEPENDS:=@TARGET_ipq807x +kmod-qca-nss-drv +kmod-ppp +kmod-pppoe
  FILES:=$(PKG_BUILD_DIR)/pppoe/qca-nss-pppoe.ko
  AUTOLOAD:=$(call AutoLoad,51,qca-nss-pppoe)
endef

define KernelPackage/qca-nss-drv-pppoe/Description
Kernel modules for NSS connection manager - Support for PPPoE
endef

define KernelPackage/qca-nss-drv-bridge-mgr
  SECTION:=kernel
  CATEGORY:=Kernel modules
  SUBMENU:=Network Devices
  TITLE:=Kernel driver for NSS bridge manager
  DEPENDS:=@TARGET_ipq807x +kmod-qca-nss-drv +kmod-qca-nss-drv-vlan-mgr
  FILES:=$(PKG_BUILD_DIR)/bridge/qca-nss-bridge-mgr.ko
  AUTOLOAD:=$(call AutoLoad,51,qca-nss-bridge-mgr)
endef

define KernelPackage/qca-nss-drv-bridge-mgr/Description
Kernel modules for NSS bridge manager
endef

define KernelPackage/qca-nss-drv-vlan-mgr
  SECTION:=kernel
  CATEGORY:=Kernel modules
  SUBMENU:=Network Devices
  TITLE:=Kernel driver for NSS vlan manager
  DEPENDS:=@TARGET_ipq807x +kmod-qca-nss-drv
  FILES:=$(PKG_BUILD_DIR)/vlan/qca-nss-vlan.ko
  AUTOLOAD:=$(call AutoLoad,51,qca-nss-vlan)
endef

define KernelPackage/qca-nss-drv-vlan-mgr/Description
Kernel modules for NSS vlan manager
endef

EXTRA_CFLAGS+= \
	-I$(STAGING_DIR)/usr/include/qca-nss-drv \
	-I$(STAGING_DIR)/usr/include/qca-nss-crypto \
	-I$(STAGING_DIR)/usr/include/qca-nss-cfi \
	-I$(STAGING_DIR)/usr/include/qca-nss-gmac \
	-I$(STAGING_DIR)/usr/include/qca-ssdk \
	-I$(STAGING_DIR)/usr/include/qca-ssdk/fal \
	-I$(STAGING_DIR)/usr/include/nat46

ifneq ($(CONFIG_PACKAGE_kmod-qca-nss-drv-pppoe),)
NSS_CLIENTS_MAKE_OPTS+=pppoe=y
endif

ifneq ($(CONFIG_PACKAGE_kmod-qca-nss-drv-bridge-mgr),)
NSS_CLIENTS_MAKE_OPTS+=bridge-mgr=y
#enable OVS bridge if ovsmgr is enabled
ifneq ($(CONFIG_PACKAGE_kmod-qca-ovsmgr),)
NSS_CLIENTS_MAKE_OPTS+= NSS_BRIDGE_MGR_OVS_ENABLE=y
EXTRA_CFLAGS+= -I$(STAGING_DIR)/usr/include/qca-ovsmgr
endif
endif

ifneq ($(CONFIG_PACKAGE_kmod-qca-nss-drv-vlan-mgr),)
NSS_CLIENTS_MAKE_OPTS+=vlan-mgr=y
endif

ifeq ($(CONFIG_TARGET_BOARD), "ipq807x")
    SOC="ipq807x_64"
else ifeq ($(CONFIG_TARGET_BOARD), "ipq60xx")
    SOC="ipq60xx_64"
endif

define Build/Compile
	$(MAKE) -C "$(LINUX_DIR)" $(strip $(NSS_CLIENTS_MAKE_OPTS)) \
		CROSS_COMPILE="$(TARGET_CROSS)" \
		ARCH="$(LINUX_KARCH)" \
		M="$(PKG_BUILD_DIR)" \
		EXTRA_CFLAGS="$(EXTRA_CFLAGS)" \
		SoC=$(SOC) \
		$(KERNEL_MAKE_FLAGS) \
		modules
endef


$(eval $(call KernelPackage,qca-nss-drv-pppoe))
$(eval $(call KernelPackage,qca-nss-drv-bridge-mgr))
$(eval $(call KernelPackage,qca-nss-drv-vlan-mgr))

1.5 steps forward... Think I've fixed the crash, roaming between the external ports works but still roaming to the Ath11 interface takes a long time. Need to double check but I'm sure previously roaming between the 703n & Ath11 worked fine, but now it also has a delay. 703n to Asus and back is fine.

A diff of br_fdb.c from vanilla 4.4.60 and 4.4.60 QSDK 11.3 has amongst other things

@@ -308,10 +337,16 @@ void br_fdb_cleanup(unsigned long _data)
 			if (f->added_by_external_learn)
 				continue;
 			this_timer = f->updated + delay;
-			if (time_before_eq(this_timer, jiffies))
+			if (time_before_eq(this_timer, jiffies)) {
+				memset(&fdb_event, 0, sizeof(fdb_event));
+				ether_addr_copy(fdb_event.addr, f->addr.addr);
 				fdb_delete(br, f);
-			else if (time_before(this_timer, next_timer))
+				atomic_notifier_call_chain(
+					&br_fdb_update_notifier_list, 0,
+					(void *)&fdb_event);
+			} else if (time_before(this_timer, next_timer)) {
 				next_timer = this_timer;
+			}
 		}
 	}
 	spin_unlock(&br->hash_lock);

so the final parameter of atomic_notifier_call_chain() is an empty struct br_fdb_event with just fdb_event.addr populated, and this is what the callback function nss_bridge_mgr_fdb_update_callback() in nss_bridge_mgr is expecting.

Whilst in robimarko's build, it's subtly different

                                ether_addr_copy(mac_addr, f->key.addr.addr);
                                fdb_delete(br, f, true);
                                /* QCA NSS ECM support - Start */
                                atomic_notifier_call_chain(
                                        &br_fdb_update_notifier_list, 0,
                                        (void *)mac_addr);
                                /* QCA NSS ECM support - End */

Here atomic_notifier_call_chain() is supplied with the address directly, not within an otherwise empty struct br_fdb_event. Is there a reason for this change? I've modified nss_bridge_mgr_fdb_update_callback() to work with the address and put that back in a struct for the rest of the function.

static int nss_bridge_mgr_fdb_update_callback(struct notifier_block *notifier,
                                              unsigned long val, void *ctx)
{
        struct br_fdb_event _event;
        struct br_fdb_event *event = &_event;
        struct nss_bridge_pvt *b_pvt = NULL;
        struct net_device *br_dev = NULL;
        fal_fdb_entry_t entry;

        memset(&_event, 0, sizeof(_event));
        ether_addr_copy(_event.addr, ctx);

Uptime 45 minutes and no crashes connecting to the Ath11 so I think this is a small improvement at least now.
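
To illustrate why that mismatch matters, here is a small Python sketch of what happens when a receiver interprets a bare MAC address as if it were a struct. The layout used here (one 8-byte pointer followed by a 6-byte address) is hypothetical and only for illustration; it is not copied from the QSDK definition of struct br_fdb_event. It shows one plausible reading of the earlier crash, not a confirmed diagnosis.

```python
import struct

# Hypothetical layout for illustration only: 8-byte pointer, then 6-byte MAC.
EVENT_LAYOUT = "<Q6s"


def read_as_event(memory: bytes):
    """Interpret the start of `memory` the way a callback expecting the struct would."""
    pointer, addr = struct.unpack_from(EVENT_LAYOUT, memory)
    return pointer, addr


mac = bytes.fromhex("24a1605437d4")

# Sender wraps the MAC in a zeroed struct: pointer field is 0, addr is the MAC.
wrapped = struct.pack(EVENT_LAYOUT, 0, mac)

# Sender passes the bare MAC: the receiver reads past it into neighbouring memory.
# The zero bytes here stand in for whatever unrelated data happens to follow.
bare = mac + bytes(8)

print("wrapped:", read_as_event(wrapped))
print("bare:   ", read_as_event(bare))
```

In the second case the "pointer" is really the MAC bytes reinterpreted as an integer, so dereferencing it in kernel code would fault, and the address field no longer contains the MAC at all.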


Inspired by avalentin's idea on the main AX3600 thread I've taken a peek at ssdk-shell. Setting the fdb ageTime to 30s (from 150s) reduces the roaming delay on Ath11 to... 30s ish. Setting to 300s took 490s!

Monitoring the FDB inside ssdk-shell shows that the entry isn't updated when roaming, during the dead period. In fact, once the device regains connectivity it's no longer listed in that FDB at all; it just disappears, which seems a bit odd.
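
One way to picture this is a toy model of a forwarding table whose stale entry is never relearned on the new port, so unicast frames keep going to the old port until the entry ages out and frames are flooded again. The Python sketch below is only that: a toy model with made-up port names, not a description of what the NSS/PPE actually does. It matches the 30s observation, but it does not explain why an ageTime of 300s produced a 490s delay.

```python
class ToyFdb:
    """Forwarding table with ageing but, deliberately, no relearning on a move."""

    def __init__(self, age_time: int):
        self.age_time = age_time
        self.entries = {}  # mac -> (port, last_seen)

    def learn(self, mac: str, port: str, now: int) -> None:
        self.entries[mac] = (port, now)

    def lookup(self, mac: str, now: int):
        """Return the port for mac, or None (meaning: flood) if unknown or aged out."""
        entry = self.entries.get(mac)
        if entry is None or now - entry[1] >= self.age_time:
            self.entries.pop(mac, None)
            return None
        return entry[0]


def blackout_seconds(age_time: int, moved_at: int = 10) -> int:
    """Seconds after the move during which frames still follow the stale entry."""
    fdb = ToyFdb(age_time)
    mac = "00:26:c7:3c:e8:1c"
    fdb.learn(mac, "old-port", now=moved_at)  # last frame seen just before roaming
    t = moved_at
    while fdb.lookup(mac, t) is not None:     # stale entry still wins
        t += 1
    return t - moved_at


for age in (30, 150):
    print(f"ageTime {age}s -> blackout {blackout_seconds(age)}s")
```

In this model the entry also vanishes from the table at the moment connectivity returns, which is at least consistent with the device not being listed once it can be reached again.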

Ahh, that would suggest that the kernel has no control over the FDB in this driver setup as the default aging time should be 30s.
I gotta say once again that I hate this setup without a proper switch and ethernet driver but rather this mess with fake netdevs.

D'oh!

While I was investigating the crash yesterday, I neutered the br_fdb_bridge_dev_get_and_hold() function to just return 0... and of course forgot to put it back. It seems that with the original function restored, Ath11 roaming works fine too.


The roaming issue is kind of solved with a workaround by @avalentin.

See the thread "Adding OpenWrt support for Xiaomi AX3600" for details.

Of course I'm still interested in a clean solution...