IPQ807x Wifi Roaming with VLANs

Not the MR33, that's for sure.

Disabling learning and turning the switch into a hub did indeed fix the issue. I would like to figure out why entries can't be removed from the switch's FDB, so I added extra debugging output: whenever an entry is added or removed, that is now logged in dmesg.

The results are interesting:

[ 1240.964141] qca8k-ipq4019 c000000.switch wan: vlans aren't supported yet for dev_uc|mc_add()
[ 1249.717324] qca8k-ipq4019 c000000.switch wan: vlans aren't supported yet for dev_uc|mc_add()
[ 1353.034703] qca8k-ipq4019 c000000.switch wan: vlans aren't supported yet for dev_uc|mc_add()
[ 1469.754563] qca8k-ipq4019 c000000.switch wan: vlans aren't supported yet for dev_uc|mc_add()

after doing something like this:

# bridge fdb add 11:22:33:44:55:66 dev wan self vlan 1
RTNETLINK answers: Invalid argument

Doing

# bridge fdb add 11:22:33:44:55:66 dev wan self

doesn't log anything, but I believe it should be added to the switch's FDB, since I specified the self option? At this point I have seen my debug output during boot but never afterwards, not even when roaming, when the assisted learning should be doing its thing.
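
For context on why a stale FDB entry matters, and why hub-like flooding hides the problem, here is a toy user-space model. It is only a sketch under assumptions: the names (toy_switch, fdb_learn, fdb_del, forward) and the two-port layout are made up for illustration and are not taken from the qca8k driver. It assumes the usual learning-switch rule that a frame whose destination is believed to sit behind the ingress port is not forwarded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: hypothetical names, not the qca8k driver's data structures. */
enum port { PORT_CPU = 0, PORT_LAN = 1, PORT_COUNT = 2 };
enum verdict { FWD_DROP, FWD_UNICAST, FWD_FLOOD };

#define FDB_SIZE 8

struct fdb_entry {
	uint8_t mac[6];
	enum port port;
	bool used;
};

struct toy_switch {
	struct fdb_entry fdb[FDB_SIZE];
};

static struct fdb_entry *fdb_find(struct toy_switch *sw, const uint8_t mac[6])
{
	for (int i = 0; i < FDB_SIZE; i++)
		if (sw->fdb[i].used && memcmp(sw->fdb[i].mac, mac, 6) == 0)
			return &sw->fdb[i];
	return NULL;
}

/* Record that mac was seen behind port, as hardware learning would. */
static bool fdb_learn(struct toy_switch *sw, const uint8_t mac[6], enum port port)
{
	struct fdb_entry *e = fdb_find(sw, mac);

	assert(port < PORT_COUNT);
	/* No existing entry: take the first free slot. */
	for (int i = 0; !e && i < FDB_SIZE; i++)
		if (!sw->fdb[i].used)
			e = &sw->fdb[i];
	if (!e)
		return false; /* table full */
	memcpy(e->mac, mac, 6);
	e->port = port;
	e->used = true;
	return true;
}

/* Remove the entry for mac; returns false if there was none. */
static bool fdb_del(struct toy_switch *sw, const uint8_t mac[6])
{
	struct fdb_entry *e = fdb_find(sw, mac);

	if (!e)
		return false;
	e->used = false;
	return true;
}

/* Decide what happens to a frame for dst that arrived on ingress. */
static enum verdict forward(struct toy_switch *sw, enum port ingress,
			    const uint8_t dst[6])
{
	struct fdb_entry *e = fdb_find(sw, dst);

	if (!e)
		return FWD_FLOOD;   /* unknown destination: behave like a hub */
	if (e->port == ingress)
		return FWD_DROP;    /* dst is believed to be behind the ingress port */
	return FWD_UNICAST;
}
```

In this model, a client that was learned behind PORT_LAN (while associated to another AP) and then roams to the local radio stops receiving replies arriving on PORT_LAN, because forward() returns FWD_DROP until fdb_del() is called for its MAC. With no entry at all, every frame is flooded and reaches the CPU port, which matches the observation that the hub-like setup works.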

The error you are seeing is from the kernel FDB add netlink handler itself.

BTW, I checked and the port 0 learn enable bit is cleared so it doesn't do stuff on its own.

@Ansuel any thoughts?

Also, we should move any IPQ40xx discussion to the already existing thread:

FYI, here is the patch that allows MACs seen on non-physical interfaces to be removed from the FDB.
This is on top of Robimako's patch.
It appears to work OK in my limited testing.

```
--- a/nss_dp_switchdev.c	2023-08-09 16:21:57.523385184 +1200
+++ b/nss_dp_switchdev.c	2023-08-09 16:25:50.215009916 +1200
@@ -380,23 +380,37 @@
 	return notifier_from_errno(rv);
 }
 
+static void process_kids(struct net_device *dev, void *ptr) 
+{
+	struct net_device *lower;
+	struct list_head *iter;
+
+	netdev_for_each_lower_dev(dev, lower, iter) {
+		if ( lower != NULL ) {
+			netdev_dbg(lower, "  lower physical %d \n",nss_dp_is_phy_dev(lower));
+			if ( nss_dp_is_phy_dev(lower) ) {
+				 nss_dp_switchdev_fdb_del_event(lower, ptr);
+			} else {
+				process_kids(lower,ptr);	
+			}
+		}
+	}
+}
+
 static int nss_dp_fdb_switchdev_event(struct notifier_block *nb,
 				      unsigned long event, void *ptr)
 {
 	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
 
-	/*
-	 * Handle switchdev event only for physical devices
-	 */
-	if (!nss_dp_is_phy_dev(dev)) {
-		return NOTIFY_DONE;
-	}
-
-	switch (event) {
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		return nss_dp_switchdev_fdb_del_event(dev, ptr);
+	if ( event == SWITCHDEV_FDB_DEL_TO_DEVICE ) {
+		if (!nss_dp_is_phy_dev(dev)) {
+		        netdev_dbg(dev, "FDB event %ld Not Physical \n", event);
+			process_kids(dev,ptr);
+			return NOTIFY_DONE; 	
+		} else {
+			return nss_dp_switchdev_fdb_del_event(dev, ptr);
+		}
 	}
-
 	return NOTIFY_DONE;
 }
```
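
To make the control flow of the patch easier to follow, here is a small user-space model of the same idea. It is a sketch under assumptions, not driver code: struct toy_dev, flush_lower_phys and on_fdb_del are hypothetical stand-ins for struct net_device, process_kids and the patched notifier, and the is_phy flag stands in for nss_dp_is_phy_dev(). As far as I can tell from the patch, the point is that a delete event arriving for a stacked device (for example a VLAN interface on top of a physical port) is now passed down to the physical ports underneath instead of being ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOWER 4

/* Hypothetical stand-in for struct net_device; not the kernel type. */
struct toy_dev {
	const char *name;
	bool is_phy;                      /* models nss_dp_is_phy_dev() */
	struct toy_dev *lower[MAX_LOWER]; /* models the lower-device list */
	size_t n_lower;
	int fdb_del_calls;                /* counts simulated FDB deletions */
};

/*
 * Same shape as process_kids(): walk the lower devices and recurse
 * until physical devices are reached. Returns how many were flushed.
 */
static int flush_lower_phys(struct toy_dev *dev)
{
	int flushed = 0;

	for (size_t i = 0; i < dev->n_lower; i++) {
		struct toy_dev *lower = dev->lower[i];

		if (!lower)
			continue;
		if (lower->is_phy) {
			lower->fdb_del_calls++; /* simulated hardware FDB delete */
			flushed++;
		} else {
			flushed += flush_lower_phys(lower);
		}
	}
	return flushed;
}

/*
 * Same shape as the patched notifier: physical devices are handled
 * directly, anything else is resolved to the physical devices below it.
 */
static int on_fdb_del(struct toy_dev *dev)
{
	if (dev->is_phy) {
		dev->fdb_del_calls++;
		return 1;
	}
	return flush_lower_phys(dev);
}
```

With a VLAN device lan1.3 stacked on a physical lan1, calling on_fdb_del() on lan1.3 reaches lan1 once. A device with no lower devices (such as a wireless interface in this model) results in no deletion, which is the same outcome as before the patch.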

Would this issue likely still exist today, even for ipq60xx targets? I am seeing disconnects of 1-3 minutes when a Wi-Fi client roams over a VLAN to an ipq60xx device.

Yes, there’s a very long issue on GitHub, and this problem also exists on ipq40xx. I suspect it's something elsewhere in the kernel that affects all devices, and some devices are starting to add workarounds to their driver code, which will cause problems some day.

This patch still works, btw, after adapting it to the latest codebase. Unfortunately it still seems to be needed today. Maybe the netdev_for_each_lower_dev walk needs to be added at a higher level, since the problem apparently affects all CPUs, or at least ipq4xxx and the Qualcomm AX CPUs?

Would I be able to use the mentioned patch in an ipq6018 build? I’d like to test it out with my device, so any direction you could share would be appreciated.

This is the rebased patch:

$ cat package/kernel/qca-nss-dp/patches/999-fix-roaming.patch
Index: qca-nss-dp-2025.05.12~07b87bf5/nss_dp_switchdev.c
===================================================================
--- qca-nss-dp-2025.05.12~07b87bf5.orig/nss_dp_switchdev.c
+++ qca-nss-dp-2025.05.12~07b87bf5/nss_dp_switchdev.c
@@ -576,6 +576,23 @@ static int nss_dp_switchdev_fdb_del_even
        return notifier_from_errno(rv);
 }

+static void process_kids(struct net_device *dev, void *ptr)
+{
+       struct net_device *lower;
+       struct list_head *iter;
+
+       netdev_for_each_lower_dev(dev, lower, iter) {
+               if ( lower != NULL ) {
+                       netdev_dbg(lower, "  lower physical %d \n",nss_dp_is_phy_dev(lower));
+                       if ( nss_dp_is_phy_dev(lower) ) {
+                                nss_dp_switchdev_fdb_del_event(lower, ptr);
+                       } else {
+                               process_kids(lower,ptr);
+                       }
+               }
+       }
+}
+
 /*
  * nss_dp_switchdev_event_nb
  *
@@ -587,19 +604,15 @@ static int nss_dp_switchdev_event_nb(str
 {
        struct net_device *dev = switchdev_notifier_info_to_dev(ptr);

-       /*
-        * Handle switchdev event only for physical devices
-        */
-       if (!nss_dp_is_phy_dev(dev)) {
-               return NOTIFY_DONE;
-       }
-
-       switch (event) {
-       case SWITCHDEV_FDB_DEL_TO_DEVICE:
-               return nss_dp_switchdev_fdb_del_event(dev, ptr);
-       default:
-               netdev_dbg(dev, "Switchdev event %lu is not supported\n", event);
-       }
+       if ( event == SWITCHDEV_FDB_DEL_TO_DEVICE ) {
+               if (!nss_dp_is_phy_dev(dev)) {
+                       netdev_dbg(dev, "FDB event %ld Not Physical \n", event);
+                       process_kids(dev,ptr);
+                       return NOTIFY_DONE;
+               } else {
+                       return nss_dp_switchdev_fdb_del_event(dev, ptr);
+               }
+       }

        return NOTIFY_DONE;
 }


Thanks! I did need to update the patch slightly to make it work with the latest repo:

--- a/nss_dp_switchdev.c
+++ b/nss_dp_switchdev.c
@@ -576,6 +576,23 @@ static int nss_dp_switchdev_fdb_del_even
 	return notifier_from_errno(rv);
 }
 
+static void process_kids(struct net_device *dev, void *ptr)
+{
+	struct net_device *lower;
+	struct list_head *iter;
+
+	netdev_for_each_lower_dev(dev, lower, iter) {
+		if (lower != NULL) {
+			netdev_dbg(lower, "  lower physical %d \n", nss_dp_is_phy_dev(lower));
+			if (nss_dp_is_phy_dev(lower)) {
+				nss_dp_switchdev_fdb_del_event(lower, ptr);
+			} else {
+				process_kids(lower, ptr);
+			}
+		}
+	}
+}
+
 /*
  * nss_dp_switchdev_event_nb
  *
@@ -587,18 +604,14 @@ static int nss_dp_switchdev_event_nb(str
 {
 	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
 
-	/*
-	 * Handle switchdev event only for physical devices
-	 */
-	if (!nss_dp_is_phy_dev(dev)) {
-		return NOTIFY_DONE;
-	}
-
-	switch (event) {
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		return nss_dp_switchdev_fdb_del_event(dev, ptr);
-	default:
-		netdev_dbg(dev, "Switchdev event %lu is not supported\n", event);
+	if (event == SWITCHDEV_FDB_DEL_TO_DEVICE) {
+		if (!nss_dp_is_phy_dev(dev)) {
+			netdev_dbg(dev, "FDB event %ld Not Physical \n", event);
+			process_kids(dev, ptr);
+			return NOTIFY_DONE;
+		} else {
+			return nss_dp_switchdev_fdb_del_event(dev, ptr);
+		}
 	}
 
 	return NOTIFY_DONE;

But I can also confirm that the patch is working great for my ipq6018 device as well: iperf3 roams are now working as expected. So happy to have this device working as intended!

Well, it appears this patch is no longer working: a build from source on Aug 15 works, but a new build with the patch on Sept 1 shows the old broken roaming behavior. It would be interesting if someone else could confirm this as well (to rule out that I made a mistake somewhere).

I just did a build yesterday and I think for me it is still working.

Are you able to share the rebased patch used in this build? I’ve been trying to hack it together with LLMs, without success.

I haven't changed anything, just used the patch I posted above.

FWIW, https://github.com/openwrt/qca-nss-dp/pull/2 has been merged and the qca-nss-dp driver has been updated in the 25.12 branch, so I presume in main as well. My MX-4300v2 now works great in a VLAN’ed dumb AP config (VLAN 1 merged with “primary” Wifi, VLAN 3 with IoT Wifi, VLAN 5 with guest Wifi), roaming to a couple of other ipq806x/generic routers on the same SSID (Meraki MR-42s and MR-52s, running both 24.10-SNAPSHOT and now 25.12-SNAPSHOT).

Thanks, all!

Hmm, I also tried the 25.12 branch on my eap620hd-v2 (ipq60xx), and it looks like I am still getting timeouts when roaming on tagged VLANs.

Oddly, I had this resolved at some point in past snapshots. Is this fix only targeting ipq807x devices, with ipq60xx devices left out?

@rspierz the fix is the same one discussed upthread here, and is enshrined in the kmod-qca-nss-dp-6.12.63.2025.11.24~19c51af0-r1 package (note the 2025-11-24 date). If ipq60xx doesn’t use this driver, you might need to push it to the appropriate switch driver used by ipq60xx.

To identify whether you are using that driver, apk info kmod-qca-nss-dp is a starting place (can’t have the fix without the package ;)); lsmod | grep nss or an inspection of dmesg are the next steps.

EDIT: Also, I’m discussing stock OpenWrt, not the NSS builds. For those, I have roughly zero idea whether this fix works or not; it would also need to be in whoever’s custom NSS tree.

Yeah, my build contains the same package version of qca-nss-dp:

kmod-qca-nss-dp-6.12.63.2025.11.24~19c51af0-r1.

lsmod | grep nss
qca_nss_dp             45056  0
qca_ssdk             1142784  1 qca_nss_dp

Then I guess the next questions are:

  • What do your networking and wireless configs look like? (Please remove any passphrases.)
  • What are you trying to do?
  • How is it failing?

Sorry if this has been asked before.