IPQ807x Wifi Roaming with VLANs

jim8192 · August 9, 2023, 2:57am

I know IPQ807x WiFi roaming has been through a bunch of iterations: the syslog watcher, the QCA code, and finally @robimarko's patch to nss-dp.

However, I am having issues roaming when the wlan interfaces are bridged to a VLAN, and I have resorted to the syslog watcher script that does an ssdk_sh flush all when a STA joins the SSID.

I think a better solution is for the nss-dp patch to process non-physical interfaces, find the real physical interface(s), and perform an fdb_del. I have a working proof of concept.

Can anyone else verify that there is an issue with WiFi roaming when SSIDs are bridged to a VLAN on an IPQ807x device using 23.05?

Thanks

robimarko · August 9, 2023, 8:54am

I agree with you on this. Currently, we trigger on the switchdev event but then check whether the underlying device is physical.
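
To make the idea above concrete, here is a minimal user-space sketch in C of resolving a stacked (non-physical) interface down to the physical port it sits on. It is not the nss-dp patch and not the proof of concept mentioned above; `toy_netdev`, `toy_find_physical` and `toy_demo` are hypothetical names, and a real bridge with several ports would need to iterate over all lower devices rather than follow a single pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a network device. */
struct toy_netdev {
    const char *name;
    bool is_physical;           /* true for a real switch port */
    struct toy_netdev *lower;   /* device this one is stacked on, or NULL */
};

/*
 * Follow the "lower" chain until a physical device is found.
 * Returns NULL if the chain ends (or is suspiciously deep) without one.
 */
static struct toy_netdev *toy_find_physical(struct toy_netdev *dev)
{
    int depth = 0;

    while (dev != NULL && depth < 8) {
        if (dev->is_physical)
            return dev;
        dev = dev->lower;
        depth++;
    }
    return NULL;
}

/* Usage example: a VLAN interface stacked on a physical port. */
static void toy_demo(void)
{
    struct toy_netdev eth0 = { "eth0", true, NULL };
    struct toy_netdev vlan10 = { "eth0.10", false, &eth0 };
    struct toy_netdev wlan0 = { "wlan0", false, NULL };

    /* The VLAN device resolves to the port whose FDB entry must go. */
    assert(toy_find_physical(&vlan10) == &eth0);
    /* A purely virtual device has no physical port underneath. */
    assert(toy_find_physical(&wlan0) == NULL);
}
```

With a check that only accepts physical devices, an event arriving on `eth0.10` would be ignored; resolving it to `eth0` first is what would let the FDB delete happen.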

scr4tchy · August 11, 2023, 6:48am

Not sure if this is related, but I am doing roaming / VLANs / batman-adv on an AX9000 and continuously having 1-3 min periods of full timeouts to LAN/WAN every 1-3 min. It's a real nightmare, and at this point I am lost as to what the issue is. Not to mention that DHCP does not work most of the time, and most of my devices are now hardcoded.

I work in infra, so I am not totally dumb when it comes to networking: I have set up Juniper, Cisco, and Aruba gear, and all kinds of tunnels to carry market data across continents over hybrid wireless/fiber. And yet I have spent so many hours tweaking my AX9000 home mesh to try to solve this, to no avail: it sometimes works and sometimes does not. At this point I am almost ready to shop for another tri/quad-band mesh system, although not much of the prosumer stuff can actually reach nice speeds, support VLANs, and offer the overall flexibility of OpenWrt!

Catfriend1 · August 11, 2023, 7:34am

Also on ax3600

Flole · August 11, 2023, 2:11pm

This sounds a little like the FDB issue that is also present on IPQ40xx. If you can, try disabling learning on the switch completely to see if that helps you. On "dumb APs" with a single ethernet port in use (as in many multi-AP scenarios) there is usually no need for learning or a switch anyway.

robimarko · August 11, 2023, 2:44pm

On this one it's due to the lack of a proper switch driver, so the kernel is not aware of FDB duplicates at all, and we are using switchdev events to try to fix that.

Flole · August 11, 2023, 2:52pm

Ah, understood. For people who need stable roaming and only use a single ethernet port it's probably best to disable learning then somehow. There is a bit in a register for that, which should be possible to set to 0 using mdio-tools I believe. Maybe not a "solution", but for people who want to use IPQ807x devices as dumb APs (like me, I just still haven't done the work to get the Orbi port done) that's certainly an option.

Actually, for devices that only use a single CPU port, isn't it possible to just enable "normal" learning instead of assisted learning? I believe assisted learning was only needed for devices with multiple links to the switch?

robimarko · August 11, 2023, 5:48pm

Normal learning is just HW learning, and that does not work well with DSA cases, especially for roaming, which is why assisted learning is used on the CPU port.

But that has nothing to do with IPQ807x at all. Here it's the fault of stupid drivers that fake switch ports as individual netdevs and do not expose the real FDB, so it gets out of sync.
QCA "fixed" it by adding more kernel modifications to spawn a notification once an existing device moves to another interface, and then they remove the existing FDB entry, the same as what we are doing via switchdev notifiers.

Flole · August 11, 2023, 6:04pm

I don't fully understand why normal HW learning isn't working here: the client moves from LAN to WiFi, so a packet from the CPU port reaches the switch and the switch updates its FDB. At least that's how it should be.

I'm aware that it didn't work well when there are multiple connections between CPU and switch (link aggregation/load balancing) as the switch was constantly updating the FDB between the 2 ports, but that's only an issue on very few devices.

What you're describing seems similar to how ipq40xx started, they also did those fake netdevs there iirc.

robimarko · August 11, 2023, 7:02pm

Well, that is the thing: how is the switch supposed to update the FDB if you moved the client from a switch port to a virtual interface?
As far as it's aware, nothing happened, and the FDB entry will just expire eventually.

Again, the switch isn't modeled properly at all and the kernel doesn't even see the HW FDB table.
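
As a rough illustration of that point, here is a toy FDB in C. It is a user-space sketch under simplifying assumptions, not the IPQ807x switch or any kernel code: all `toy_*` names are hypothetical, ageing is left out, and learning only happens when `toy_fdb_learn` is called for a frame seen on a switch port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_FDB_SIZE  8
#define TOY_PORT_NONE (-1)

struct toy_fdb_entry {
    uint8_t mac[6];
    int port;       /* switch port the MAC was learned on */
    int in_use;
};

struct toy_fdb {
    struct toy_fdb_entry entries[TOY_FDB_SIZE];
};

/* Return the port a MAC is known on, or TOY_PORT_NONE (meaning: flood). */
static int toy_fdb_lookup(const struct toy_fdb *fdb, const uint8_t mac[6])
{
    assert(fdb != NULL && mac != NULL);
    for (size_t i = 0; i < TOY_FDB_SIZE; i++) {
        const struct toy_fdb_entry *e = &fdb->entries[i];
        if (e->in_use && memcmp(e->mac, mac, 6) == 0)
            return e->port;
    }
    return TOY_PORT_NONE;
}

/* Learn or refresh a MAC on a port. Returns 0 on success, -1 if full. */
static int toy_fdb_learn(struct toy_fdb *fdb, const uint8_t mac[6], int port)
{
    struct toy_fdb_entry *free_slot = NULL;

    assert(fdb != NULL && mac != NULL);
    for (size_t i = 0; i < TOY_FDB_SIZE; i++) {
        struct toy_fdb_entry *e = &fdb->entries[i];
        if (e->in_use && memcmp(e->mac, mac, 6) == 0) {
            e->port = port;     /* station moved between switch ports */
            return 0;
        }
        if (!e->in_use && free_slot == NULL)
            free_slot = e;
    }
    if (free_slot == NULL)
        return -1;
    memcpy(free_slot->mac, mac, 6);
    free_slot->port = port;
    free_slot->in_use = 1;
    return 0;
}

/* Explicit delete, the step the notifier-based fix has to trigger. */
static void toy_fdb_del(struct toy_fdb *fdb, const uint8_t mac[6])
{
    assert(fdb != NULL && mac != NULL);
    for (size_t i = 0; i < TOY_FDB_SIZE; i++) {
        struct toy_fdb_entry *e = &fdb->entries[i];
        if (e->in_use && memcmp(e->mac, mac, 6) == 0)
            e->in_use = 0;
    }
}
```

In this model, a station learned on port 1 that then roams to WiFi generates no further `toy_fdb_learn` call, so `toy_fdb_lookup` keeps returning port 1 until something calls `toy_fdb_del` or the entry ages out.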

For the ipq40xx, you can find the lengthy discussion about why assisted learning is used in the ipq40xx DSA PR; the same is used for MT7530 and plenty of other DSA drivers.
The issue you are having isn't supposed to happen, as the kernel is supposed to remove the FDB entry, so I highly suspect a buggy DSA driver. Like I said, it's quite old, and I have a much newer and cleaner ethernet driver, a new tagger, and an updated DSA driver.
I just have not found the time to move ipq40xx to 6.1 and use the updated drivers; I should probably work on it during the extended weekend due to a holiday.

Flole · August 11, 2023, 7:24pm

Ah, I thought the CPU port on the (IPQ40xx and IPQ807x integrated) switch is simply a "normal" port, just with a "hardwired" connection to the CPU.

But what would also help here temporarily is disabling learning completely, at least for devices which use a single port. The switch would just broadcast the packet to all ports, which isn't an issue if there's only a single port and the CPU in use. Someone who is affected and wants a workaround can install mdio-tools and then just turn off learning, right? @scr4tchy @jim8192 maybe you guys can try that and see if that helps? I know it's not a proper fix, but buying new equipment is worse...

robimarko · August 11, 2023, 8:11pm

Well, it is and it isn't a normal port.
You cannot use mdio-tools as switches inside both of these are MMIO.

Flole · August 11, 2023, 10:35pm

Understood, so in order to disable learning on all ports I would have to write a kernel module that maps the memory and then clears those bits. Or just modify the source code instead.

I read in the datasheet that it is possible to configure the switch using specially crafted ethernet frames with the "atheros-header". Is that an option here, or is that not available for this internal switch? Could a small C program send such a frame to the switch to write to a register? Or would the kernel always append an ethernet header? In that case something like pcap_inject could probably still be used. I believe this would at least provide a workaround on all ipq* devices; it just needs to write to different registers based on the CPU/switch used.

Thanks for the pointer in the (hopefully) right direction, it's always great to talk to you.

robimarko · August 12, 2023, 10:10am

It is way easier to just modify the source, learning is being set up there anyway.

The Atheros header is the DSA tag being used. IPQ40xx should support the same thing as the generic QCA8337N, where you can send specially crafted packets to read/write registers, but it's really not useful as the switch is MMIO mapped anyway, and it's faster to just access the register directly.
It would just overcomplicate things, and IPQ807x uses a completely new switch for which there are 0 docs anyway.

The thing is that we shouldn't have issues with learning, as CPU port learning is intentionally disabled as part of the setup in order to use assisted learning, for the same reason as in MT7530 and the rest:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/net/dsa/mt7530.c?h=next-20230809&id=0b69c54c74bcb60e834013ccaf596caf05156a8e

https://github.com/openwrt/openwrt/blob/main/target/linux/ipq40xx/files/drivers/net/dsa/qca/qca8k-ipq4019.c#L478C5-L478C5

		ret = qca8k_reg_clear(priv, QCA8K_PORT_LOOKUP_CTRL(QCA8K_CPU_PORT),

Flole · August 12, 2023, 10:50am

On IPQ807x most likely, on IPQ40xx there's still the issue with images ending up randomly broken/unbootable, so every change to the kernel is dangerous there. Especially since there's no way to know if an image is broken or not.

I assume the docs exist and simply aren't public?

Learning on all other ports is enabled though, so if a device roams, the switch still thinks it's on the upstream port, but it is now on the CPU port. Entirely disabling learning would mean that the switch is always flooding the packets, which isn't an issue on devices that only use a single ethernet port.
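
A small sketch of why flooding costs nothing extra in the single-port case, written as plain C over port bitmasks. This is a generic model of flooding (send to every active port except the ingress one), not the behaviour of any specific switch; `toy_flood_mask` and `toy_port_count` are hypothetical helpers, and port 0 standing for the CPU port is just a convention of this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Egress port mask for a frame with an unknown destination:
 * every active port except the one the frame came in on.
 */
static uint32_t toy_flood_mask(uint32_t active_ports, unsigned int ingress_port)
{
    assert(ingress_port < 32);
    return active_ports & ~(UINT32_C(1) << ingress_port);
}

/* Number of ports set in a mask. */
static unsigned int toy_port_count(uint32_t mask)
{
    unsigned int n = 0;

    while (mask != 0) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

With only the CPU port (bit 0) and one ethernet port (bit 1) active, `toy_flood_mask(0x3, 0)` is `0x2`: the "flood" reaches exactly one port, the same as a learned unicast would. With five active ports, `toy_flood_mask(0x1f, 0)` is `0x1e`, so every frame goes out of four ports.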

robimarko · August 12, 2023, 2:08pm

No, there is no issue when modifying the kernel. Whoever had that did something rather wrong, as in the worst-case scenario your networking would not work and that's it.

Docs exist but only QCA has access to them.

You can disable learning for the other ports by clearing the learn bit instead of setting it, per port, a few lines down from the CPU one; it's easy.

https://github.com/openwrt/openwrt/blob/main/target/linux/ipq40xx/files/drivers/net/dsa/qca/qca8k-ipq4019.c#L495

	if (dsa_is_user_port(ds, port)) {
		int shift = 16 * (port % 2);

		ret = qca8k_rmw(priv, QCA8K_PORT_LOOKUP_CTRL(port),
				QCA8K_PORT_LOOKUP_MEMBER,
				BIT(QCA8K_CPU_PORT));
		if (ret)
			return ret;

		/* Enable ARP Auto-learning by default */
		ret = qca8k_reg_set(priv, QCA8K_PORT_LOOKUP_CTRL(port),
				    QCA8K_PORT_LOOKUP_LEARN);
		if (ret)
			return ret;

		/* For port based vlans to work we need to set the
		 * default egress vid
		 */
		ret = qca8k_rmw(priv, QCA8K_EGRESS_VLAN(port),
				0xfff << shift,
				QCA8K_PORT_VID_DEF << shift);

But disabling learning is not the solution; you are just making a hub.
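
For anyone unsure what "clearing instead of setting" means at the bit level, here is a stand-alone C sketch on a simulated 32-bit register. It is not the qca8k driver: the `toy_*` helpers only imitate the set/clear/read-modify-write pattern visible in the excerpt, and the bit position used for `TOY_LOOKUP_LEARN` is an assumption for illustration, so check the driver header for the real definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define TOY_LOOKUP_LEARN (UINT32_C(1) << 20)

/* Read-modify-write: clear the bits in `mask`, then OR in `val`. */
static void toy_rmw(uint32_t *reg, uint32_t mask, uint32_t val)
{
    assert(reg != NULL);
    *reg = (*reg & ~mask) | val;
}

/* Set bits without touching the others. */
static void toy_reg_set(uint32_t *reg, uint32_t bits)
{
    toy_rmw(reg, 0, bits);
}

/* Clear bits without touching the others. */
static void toy_reg_clear(uint32_t *reg, uint32_t bits)
{
    toy_rmw(reg, bits, 0);
}

/* Nonzero if the (simulated) learn bit is set. */
static int toy_learning_enabled(uint32_t reg)
{
    return (reg & TOY_LOOKUP_LEARN) != 0;
}
```

Swapping a `toy_reg_set(&reg, TOY_LOOKUP_LEARN)` for a `toy_reg_clear(&reg, TOY_LOOKUP_LEARN)` flips only the learn bit and leaves the rest of the register, such as the port membership bits, as it was.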

Flole · August 12, 2023, 2:43pm

There is: ipq40xx builds are randomly broken and just get stuck at u-boot's "booting the kernel". See https://github.com/openwrt/openwrt/pull/12953 for more information and some guesses as to what the cause might be. It was first discovered on the MR33, but the issue appears to be happening on other units as well, just not that frequently.

I know, but it's not easy if the image is not bootable and I end up bricking devices again. Once that issue is identified and solved things will become easier on ipq40xx and at least rebuilding it would work.

I am currently trying to see if I can figure out how to disable learning on ipq807x as well.

I know, but on WiFi Access Points which only use a single ethernet port that isn't an issue at all. My Meraki MR33 only has a single ethernet port, and the RBR50 I have only has a single ethernet port connected to my network.

robimarko · August 12, 2023, 2:54pm

I am not seeing any issues on my boards, it's likely something specific to MR33 and MR74 boards.

Flole · August 12, 2023, 3:07pm

Someone in that pull request reported something similar for the Zyxel WRE6606 (could be just a size issue), and I believe (but I'm not sure) I've seen it on the Netgear RBR50 as well. That unit is easy to recover, so I'll give it another try on that one. Let's see how disabling learning changes things.

robimarko · August 12, 2023, 4:42pm

Most of the devices are probably hitting the kernel size limit.