Ipq40xx / dsa: switch issues

@Ansuel I just came back to this and had another quick look. Unfortunately it is still broken, and there are still no clues on how to debug this, how to follow the code, or how it is actually supposed to work. Additionally, because ipq40xx images randomly turn out unbootable, it's not even possible to simply compile in more debug messages: chances are high that a broken image will brick the device. Is there any chance you can help with this issue?


I just did additional research on this, and it seems that in my case there is an entry that simply can't be deleted. I associated a client to my WiFi, and I am now using a host on the LAN to broadcast spoofed UDP packets into my network. Once such a UDP packet with the source MAC of the WiFi client reaches the AP, two entries are created and the client loses connectivity immediately:

11:22:33:44:55:66 dev wan vlan 1 master br-lan
11:22:33:44:55:66 dev wan vlan 1 self
11:22:33:44:55:66 dev phy1-ap0 master lan 
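For anyone who wants to reproduce this test, here is a rough sketch of how such a spoofed broadcast could be generated from a LAN host. It is not the tool used above; the function names, the UDP port (9, discard) and the payload are made up for illustration, and 11:22:33:44:55:66 is only the placeholder from this post (use the real, unicast MAC of the WiFi client). Sending needs Linux and root.

```python
import socket
import struct


def ip_checksum(header: bytes) -> int:
    """Ones'-complement checksum over an IPv4 header (RFC 1071)."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_spoofed_frame(src_mac: str, src_ip: str,
                        payload: bytes = b"fdb-test",
                        dst_ip: str = "255.255.255.255",
                        port: int = 9) -> bytes:
    """Build an Ethernet broadcast frame carrying UDP with a chosen source MAC."""
    mac = bytes.fromhex(src_mac.replace(":", ""))
    eth = b"\xff" * 6 + mac + struct.pack("!H", 0x0800)  # dst, src, EtherType IPv4
    udp_len = 8 + len(payload)
    # UDP checksum 0 means "not computed", which is allowed for IPv4.
    udp = struct.pack("!HHHH", port, port, udp_len, 0)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + udp_len, 0, 0, 64, 17, 0,
                     socket.inet_aton(src_ip), socket.inet_aton(dst_ip))
    ip = ip[:10] + struct.pack("!H", ip_checksum(ip)) + ip[12:]
    return eth + ip + udp + payload


def send_frame(ifname: str, frame: bytes) -> None:
    """Send a raw frame on ifname (Linux only, requires CAP_NET_RAW)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        s.send(frame)
```

Usage would be something like `send_frame("eth0", build_spoofed_frame("<client MAC>", "192.168.1.50"))`, where the interface name and source IP are again just examples.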

I set the ageing of br-lan to 10 seconds, so after 10 seconds that entry disappears (or I can also manually delete it):

11:22:33:44:55:66 dev wan vlan 1 self
11:22:33:44:55:66 dev phy1-ap0 master lan 

However, that first entry prevents packets from reaching my WiFi client, and I cannot delete it manually:

bridge fdb del 11:22:33:44:55:66 dev wan vlan 1 self
RTNETLINK answers: No such file or directory

So all I can do is wait until that entry is gone, and then everything works again. Unfortunately that entire qca8k driver doesn't contain any debug output, so it's not really possible to debug it to see if it is at least trying to delete the entry there.

It's ridiculous that a single spoofed packet can cause a denial-of-service for 5 minutes or so, but that's another story I suppose.

@Flole Good stuff!

I was about to write a test like that myself.
It looks like the broadcast packets sent by the client after roaming (ARP and DHCP requests) reaching AP#1 from AP#2 do exactly that.

I assume the entry that can't be deleted lies on the switch and the one that can comes from the Linux bridge FDB directly - not sure, only just starting to look at the code.
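To make that distinction easier to spot in a long dump, here is a small sketch that parses `bridge fdb show` output and lists MACs that show up on more than one device. The function names are made up; the only thing it relies on is the iproute2 convention that `master <bridge>` marks an entry in the software bridge FDB and `self` marks an entry held by the device driver itself (which, on a DSA port, I assume means the switch).

```python
from collections import defaultdict


def parse_fdb(text):
    """Parse `bridge fdb show` lines into dicts; unknown flags are ignored."""
    entries = []
    for line in text.splitlines():
        tok = line.split()
        if len(tok) < 3 or tok[1] != "dev":
            continue
        entry = {"mac": tok[0].lower(), "dev": tok[2],
                 "vlan": None, "master": None, "self": False}
        i = 3
        while i < len(tok):
            if tok[i] == "vlan" and i + 1 < len(tok):
                entry["vlan"] = int(tok[i + 1])
                i += 2
            elif tok[i] == "master" and i + 1 < len(tok):
                entry["master"] = tok[i + 1]
                i += 2
            elif tok[i] == "self":
                entry["self"] = True
                i += 1
            else:
                i += 1  # e.g. "permanent", "offload", "static"
        entries.append(entry)
    return entries


def conflicting_macs(entries):
    """Return {mac: [devices]} for MACs seen on more than one device."""
    devs = defaultdict(set)
    for e in entries:
        devs[e["mac"]].add(e["dev"])
    return {mac: sorted(d) for mac, d in devs.items() if len(d) > 1}
```

Fed with the three lines from the post above, `conflicting_macs()` reports the client MAC on both `phy1-ap0` and `wan`, and the `self` flag tells you which of the `wan` entries is the one that is not in the bridge FDB.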

Well, it's hard to protect against spoofing while simultaneously allowing people to roam freely.
Even if the AP can tell the switch to update on association of a wifi client, now there are Apple devices that are associated with 2-3 access points at the same time...

Yes exactly, that entry appears to be part of the FDB within the switch. I would assume that you could add debug output in qca8k to get more insights there.

On normal switches the last packet received basically defines where the client is. Here a single packet defines it for 5 minutes, and there's no option to tell the switch that it changed in the meantime.
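A toy model of the difference, purely to illustrate the behaviour described here. It is not how the qca8k hardware or driver actually works; the port names, the 300-second ageing time and the assumption that frames coming in via the CPU port are not learned are all mine.

```python
class LearningTable:
    """Minimal MAC learning table: learns only on ports in learn_ports."""

    def __init__(self, ageing, learn_ports):
        self.ageing = ageing
        self.learn_ports = set(learn_ports)
        self.table = {}  # mac -> (port, last_seen)

    def rx(self, mac, port, now):
        """A frame with source `mac` arrives on `port` at time `now` (seconds)."""
        if port in self.learn_ports:
            self.table[mac] = (port, now)

    def lookup(self, mac, now):
        """Return the port frames for `mac` go to, or None (unknown, flood)."""
        entry = self.table.get(mac)
        if entry is None or now - entry[1] >= self.ageing:
            self.table.pop(mac, None)
            return None
        return entry[0]


MAC = "11:22:33:44:55:66"

# "Normal" switch: the last frame seen decides where the client is.
normal = LearningTable(ageing=300, learn_ports={"wan", "cpu"})
normal.rx(MAC, "wan", now=0)   # spoofed frame
normal.rx(MAC, "cpu", now=1)   # real client talks again
normal_port = normal.lookup(MAC, now=2)

# Behaviour seen here: traffic from the WiFi side never updates the entry.
stuck = LearningTable(ageing=300, learn_ports={"wan"})
stuck.rx(MAC, "wan", now=0)
stuck.rx(MAC, "cpu", now=1)
stuck_port = stuck.lookup(MAC, now=2)
stuck_after_ageing = stuck.lookup(MAC, now=300)
```

In the first table the client is reachable again after its next frame; in the second, a single spoofed frame pins the MAC to `wan` until the entry ages out.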

These lines from qca8k_fdb_search_and_insert have caught my attention - but I'm basically still poking around blindly.

  qca8k_fdb_write(priv, vid, 0, mac, 0);
  ret = qca8k_fdb_access(priv, QCA8K_FDB_SEARCH, -1);
  if (ret < 0)
    goto exit;

  ret = qca8k_fdb_read(priv, &fdb);
  if (ret < 0)
    goto exit;

  /* Rule exist. Delete first */
  if (!fdb.aging) {
    ret = qca8k_fdb_access(priv, QCA8K_FDB_PURGE, -1);
    if (ret)
      goto exit;
  }

  /* Add port to fdb portmask */
  fdb.port_mask |= port_mask;

  qca8k_fdb_write(priv, vid, fdb.port_mask, mac, fdb.aging);

But I'm beginning to wonder if this is the wrong end of the issue and how this relates to qca8k-ipq4019.
What I hope to see is the software side of things (BRCTL_GET_FDB_ENTRIES / /sys/devices/virtual/net/$bridge/brforward) holding the correct entry and the switch (RTM_GETNEIGH with PF_BRIDGE / bridge fdb show / ip -f bridge n, to read the switch's ALR) being faulty, assuming I'm interpreting the erroneous forwarding behaviour correctly.
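For the software side, the `brforward` file can be decoded without `brctl`. A minimal sketch, assuming the file is a plain array of `struct __fdb_entry` records as defined in the kernel's `linux/if_bridge.h` (16 bytes each, native byte order); the function names are mine, and I have not checked the ageing value's unit beyond assuming it is in clock ticks.

```python
import struct

# struct __fdb_entry: mac_addr[6], port_no, is_local,
#                     ageing_timer_value (u32), port_hi, pad0, unused (u16)
FDB_ENTRY = struct.Struct("=6sBBIBBH")  # 16 bytes


def parse_brforward(blob):
    """Decode the raw contents of a brforward file into a list of dicts."""
    entries = []
    for off in range(0, len(blob) - FDB_ENTRY.size + 1, FDB_ENTRY.size):
        mac, port_no, is_local, age, port_hi, _pad, _unused = \
            FDB_ENTRY.unpack_from(blob, off)
        entries.append({
            "mac": ":".join("%02x" % b for b in mac),
            "port": port_no | (port_hi << 8),  # bridge port number, not ifindex
            "is_local": bool(is_local),
            "age_ticks": age,                  # assumed to be clock ticks
        })
    return entries


def read_brforward(bridge):
    """Read and decode /sys/devices/virtual/net/<bridge>/brforward."""
    path = "/sys/devices/virtual/net/%s/brforward" % bridge
    with open(path, "rb") as f:
        return parse_brforward(f.read())
```

Comparing `read_brforward("br-lan")` with the `self` entries from `bridge fdb show` should show whether the bridge and the switch disagree about where the client is.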

I still haven't managed to get the FDB entries that you are seeing (only similar behaviour).
My test bridges are set up using tagging interfaces like lan.16 and wan.16 rather than creating and configuring one VLAN-aware bridge. That way, it should be possible to rule out issues related to VLAN awareness in the Linux bridge code.

What I would suggest is to dump every FDB insert/remove operation with all parameters before it's sent to the switch. Then you can have a look and see what's happening when you manually try to kick the entry out of the switch FDB, which doesn't even work.

Or what would also work is implementing "hub mode" by disabling learning completely. Or figure out why on earth reducing the ageing time of the bridge doesn't affect learning/ageing expiration on the switch at all. Something seems to be wrong there as well.

Wait, you are saying that manual FDB removal is not working?

It is probably time for me to update the DSA and ethernet drivers to the latest version I have, this one is quite old and is lacking features that generic qca8k has gained.
FDB code was reworked quite a bit later.

Either that or I am not doing it correctly. See this post above for what I tried: Ipq40xx / dsa: switch issues - #42 by Flole

BTW, setting the ageing time doesn't work, as the base driver is too old to have the ageing-time setting and the fast-age functions.

One more reason to finally move to 6.1 and use the newer driver.

Do you expect these issues to be gone then?

What would be the expected result right now when I roam/connect to the AP? Assisted learning should insert the MAC into the FDB, right? That is not what's happening, there is no add-event on the switch.

Well, it's my hope that it somehow fixes it.

You can try changing netdev_dbg to netdev_info here:

It would show all of the DSA FDB changes

I've set up VLANs on a Fritz 7530 and an Asus RT-AC58U.
Maybe the issue is the IDs you are using (1 and 2)?

I set up mine to have a "dsa-switch" underneath "br-lan"
Example (with the ids changed)

config device
        option type 'bridge'
        option name 'dsa-switch'
        list ports 'lan1'
        list ports 'lan2'
        list ports 'lan3'
        list ports 'lan4'

config device
        option type '8021q'
        option ifname 'dsa-switch'
        option vid '1'
        option name 'dsa-switch.1'

config bridge-vlan
        option device 'dsa-switch'
        option vlan '1'
        list ports 'lan1'
        list ports 'lan2'
        list ports 'lan3:t'
        list ports 'lan4:t'

config bridge-vlan
        option device 'dsa-switch'
        option vlan '2'
        option local '0'
        list ports 'lan3:t'
        list ports 'lan4:t'

config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'dsa-switch.1'

config interface 'lan'
        option device 'br-lan'
        option proto 'static'
        option ipaddr '192.168.1.1'
        option netmask '255.255.255.0'
        option ip6assign '60'
        option defaultroute '0'

The strange part is that this device might be the only one this approach works on. I can't say for sure, but I've previously tried this type of setup (i.e. the most logical approach as far as I'm concerned) on an mt7621-based device and had to set up the VLANs off the individual port interfaces instead.

I assume that if I wanted to do something like connect the wireless to VLAN 2, I'd need to remove the local 0 setting, create a VLAN interface dsa-switch.2, add it to a bridge, and then attach that bridge to an interface.

I added a log message to qca8k_fdb_add/del as well as to other qca8k functions that access the FDB entries. I also changed netdev_dbg to netdev_info, but that netdev_info is apparently never called. When I dump the FDB I can see that there is communication with the switch:

[ 4772.435352] Accessing FDB command 6...
[ 4772.439118] Writing FDB...

but nothing when roaming. Does assisted learning only do something when the client is already on another switch port? I would have expected assisted learning to create a new FDB entry on the CPU port, but that is not the case.

I will just wait until your migration to the new driver is ready for testing and be happy about my hub for now.

Have you tried flipping the promiscuous setting in the device settings for the bridge? With some devices this weirdly makes bridging with wireless work for me when it wasn't working before.

e.g

@Flole https://github.com/openwrt/openwrt/pull/13278

Backporting these was easy until finally getting 6.1 working.


@Flole Any chance you can give the 6.1 kernel a go?


Ok firstly, I can confirm vlan tagging works for me with the 5.15 kernel and driver
I'm typing this currently via a fritzbox 7530 that has two ports set up with two tagging and they both are working fine, i'm not using any wireless though

Secondly I tried your 6.1 branch on a fritzbox 7530, and it boots up ok, but it seems like certain network traffic to/from the switch ports causes spontaneous reboots so I couldn't really troubleshoot. I plan to test tomorrow with an asus rt-ac58u and see if I get the same issue.


Can you set this to false and then try again:

It's most likely that threaded NAPI is causing it.

I won't be able to test before the weekend most likely, but seems like there are already a few brave people finding issues, so I guess you are already getting some feedback.