Ipq40xx / dsa: switch issues

That is intentional, as an aging time of 0 makes no sense; it basically says you want to disable it.
Also, the IPQ40xx DSA code is a bit outdated at this point, but I am waiting for 6.1 to completely replace it with the updated code, so there is no aging OP in IPQ40xx currently.

Why does it not make sense? It is perfectly valid to disable aging, the "normal" bridges always offered this feature. I agree that due to the division by 7 it is necessary to avoid unintentional disabling, but intentional disabling by explicitly setting it to 0 should still be possible.

I don't think that 0 is even a valid value in this case. However, there is an AGE_EN bit in the same register at BIT(19) that should disable aging altogether.

In that case the function should handle 0 to set/clear that bit IMO. Just to confirm: We assume that clearing that AGE_EN bit should turn off the ageing and fix the issues, right? So if I add a quick and dirty patch to set/clear that bit based on msecs being 0 or != 0 everything should be alright again?
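A rough model of that "set/clear the bit based on msecs" idea, written in Python rather than driver C so it can be tried anywhere. It is only a sketch: the AGE_EN position (bit 19) and the 7-second unit come from this thread, while the 16-bit width of the age-time field and all names are assumptions, not taken from the driver or the datasheet.

```python
AGE_EN = 1 << 19          # AGE_EN bit, position as stated in this thread
AGE_TIME_MASK = 0xFFFF    # ASSUMED width of the age-time field (hypothetical)
AGE_UNIT_SECS = 7         # hardware counts in 7-second steps (per this thread)


def atu_ctrl_for_ageing(reg, msecs):
    """Return the new register value for a requested ageing time in ms."""
    if msecs == 0:
        # Explicit 0: clear AGE_EN and leave the age-time field untouched.
        return reg & ~AGE_EN

    units = (msecs // 1000) // AGE_UNIT_SECS
    # Never let the division round down to 0, which would disable
    # ageing unintentionally; also clamp to the field width.
    units = max(1, min(units, AGE_TIME_MASK))
    return (reg & ~AGE_TIME_MASK) | units | AGE_EN
```

With this, an explicit 0 only clears bit 19, while any non-zero request shorter than 7 seconds is rounded up to one unit instead of silently becoming 0. Whether clearing AGE_EN makes entries permanent on real hardware is exactly the open question here and is not something this model can answer.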

You are going to need to test it, disabling aging is at best just going to hide the issue.

Hiding is sufficient for now. I have IP phones that sometimes decide to roam to a different AP which will lead to the call being dropped now, that needs to be fixed ASAP. If I can fix the dropping calls by clearing that bit that's good enough for now. I just hope that turning it off doesn't just cause the entries to be permanent (as "expiring is disabled").

Do you have any idea what's the root cause for this? I assume it's somehow related to the nested bridges and changes at the "highest" bridge need to somehow be sent down to the bridges under it but that's not happening right now?

Well, the documentation on the bit is a bit vague as it says:

AGE_EN Enable age operation
1 = Lookup module can age the address in the address table.

Is there a bit to disable the address table completely or something like that? :joy: Port mirroring might be an option here as well...

I guess that could mean multiple things, and the way it actually works depends on whether the engineer who implemented that stuff in hardware actually used the feature before or just read some specs and did his best to fulfill them...

Well, yes, turning off aging will result in the entries becoming permanent... this is why it doesn't make sense. We have all the data; what we need is a reliable way to reproduce this and to pin down exactly what is wrong. Then it's just a matter of following the code to see where the logic problem is. Also, learning on CPU is disabled, so the FDB entries are generated by the kernel (this needs to be tested too, as it may also be that for some reason the kernel is learning something wrong).

And again, what you are suggesting is not a solution... I know you need something quick, but it's not the correct way to do things.

Okay so to reproduce it this config should be good:

config interface 'loopback'
        option device 'lo'
        option proto 'static'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'

config globals 'globals'

config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'lan'
        option ipv6 '0'

config interface 'lan'
        option device 'lanbridge'
        option proto 'dhcp'

config bridge-vlan
        option device 'br-lan'
        option vlan '1'
        list ports 'lan:u*'

config bridge-vlan
        option device 'br-lan'
        option vlan '2'
        list ports 'lan:t'

config device
        option type 'bridge'
        option name 'lanbridge'
        list ports 'br-lan.1'

config device
        option type 'bridge'
        option name 'lan2bridge'
        list ports 'br-lan.2'
        option ipv6 '0'

config interface 'LAN2'
        option proto 'none'
        option device 'lan2bridge'

The way this should work in my opinion: br-lan is the switch, it contains the untagged vlan (vlan 1) and the tagged vlan 2, those are br-lan.1 and br-lan.2. Then there are bridges that will be used to connect the wifi radios to those, they are lanbridge and lan2bridge. Now if a client sends a packet that reaches the LAN port on either one of the vlans an entry is created. If the client now appears on the wifi radio of this AP (for example because it roamed to that AP) apparently the entry for lanbridge is corrected, but not for br-lan. Traffic for that MAC address will not reach the client on the WiFi anymore.
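The failure described above can be pictured with a small toy model in Python. This is a sketch of my understanding only, not of how the driver or the hardware is implemented: the two dictionaries stand in for the switch FDB (br-lan) and the software bridge FDB (lanbridge), and the MAC address and the `wlan0` port name are made up for illustration.

```python
MAC = "aa:bb:cc:dd:ee:ff"   # hypothetical client MAC

switch_fdb = {}   # stand-in for the hardware FDB behind br-lan
bridge_fdb = {}   # stand-in for the software FDB of lanbridge


def deliver_from_wire(mac):
    """Follow a frame for `mac` that arrives on the LAN port."""
    hw_port = switch_fdb.get(mac)
    if hw_port is not None and hw_port != "cpu":
        # The switch believes the client sits behind a physical port,
        # so the frame is never handed to the CPU port.
        return "lost in switch (stale entry -> %s)" % hw_port
    # The frame reached the CPU; the software bridge decides from here.
    return bridge_fdb.get(mac, "flooded")


# 1. Client is seen on the wire first: both tables learn it.
switch_fdb[MAC] = "lan"
bridge_fdb[MAC] = "br-lan.1"

# 2. Client roams to this AP's radio: only the software bridge is corrected.
bridge_fdb[MAC] = "wlan0"
after_roam = deliver_from_wire(MAC)

# 3. The stale hardware entry finally ages out.
del switch_fdb[MAC]
after_expiry = deliver_from_wire(MAC)

print(after_roam)
print(after_expiry)
```

In this model the traffic only recovers at step 3, which matches the observation that things start working again once the entry on br-lan has expired.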

It would also be good to include an example FDB table and the expected one.

I don't really know what the expected one is, I have posted the relevant entries for the "good" and "bad" tables above in IPQ40xx Switch Config "Strangeness" - #108, that's all for a single MAC address. So if no entries exist for that MAC on br-lan it's alright (probably not what's expected though), once there are packets from that client on the ethernet interface entries on br-lan are created and until those expire no traffic will reach the CPU port apparently.

My idea to reduce the aging time to 7 seconds didn't work either, apparently it's still the same as before....

I'm not sure of the topology there, but there is a challenge with stock OpenWrt in that it doesn't assign unique MAC addresses to the virtual ports. This can be troublesome when used with devices that aren't directly connected to the ports, such as through switches.

https://github.com/openwrt/openwrt/pull/11693 is how I've successfully been using EA8300s here with trunked VLANs.

That's not related, this is about DSA messing up the forwarding database of the switch by not removing entries when a client associates to an AP in certain scenarios. The switch will not forward packets until that entry has expired. Also setting the expiration doesn't seem to work, so there is a super long default expiration of (I think) 3 minutes configured.

Hi again :slight_smile:

In hope of this not being an accidental derailment:
Could it be that this isn't caused by the IPQ40XX driver?

I thought so too since January when I got reports of similar behaviour to Flole's description (wifi clients not receiving responses to their packets). Now I have one customer doing some A/B test once in a while because they're fairly patient and they can reproduce (rare combo..).
The last firmware that works for them is based on 3835c92 - DSA/IPQ40xx is already merged in that firmware.
Initially, I confirmed, also using tcpdump (on the bridge, not the ports), that the returning traffic wasn't reaching the bridge interfaces. I confirmed that at least Linux had the right stuff in the FDB table (is there a way to do that without a tool?).

Since my previous post two things have happened:

  1. In some test-runs the tester reported the wifi to 'not work' despite the returning packets being visible on the bridge interface.
  2. Another bug report with a matching description came in from a tech-savvy user on a different platform (mt7621) who had the traffic drop in the same way as in 1. When he roams to a mesh AP, the client doesn't receive pings. The answers still go through the bridge.

Could it be that this 'switch strangeness' really is a 'queue strangeness' or something like that?
The descriptions just match so well, and in one case the packets seem to drop between switch -> kernel and in the other between bridge -> wifi client device (a stretch, I know).

In my case the packets don't even reach the bridge until the FDB entry is expired. The bridge command (if installed) can be used to dump the FDB.
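For comparing "good" and "bad" tables for a single MAC, something like the following Python helper could be run on saved `bridge fdb show` output. It is a minimal sketch: the sample dump below is invented for illustration (it is not taken from a real device), and the parser only assumes the usual `<mac> dev <name> [vlan <id>] [master <bridge>] [flags...]` line layout.

```python
KEYS = ("dev", "vlan", "master")

# Hypothetical dump, made up for this example.
SAMPLE = """\
aa:bb:cc:dd:ee:ff dev lan vlan 1 self
aa:bb:cc:dd:ee:ff dev wlan0 master lanbridge
11:22:33:44:55:66 dev lan vlan 2 self
"""


def fdb_entries_for(mac, fdb_dump):
    """Return one dict per FDB line in `fdb_dump` that belongs to `mac`."""
    found = []
    for line in fdb_dump.splitlines():
        fields = line.split()
        if not fields or fields[0].lower() != mac.lower():
            continue
        entry = {"flags": []}
        rest = iter(fields[1:])
        for word in rest:
            if word in KEYS:
                # Keyword followed by its value, e.g. "dev lan".
                entry[word] = next(rest, None)
            else:
                # Anything else ("self", "permanent", ...) is kept as a flag.
                entry["flags"].append(word)
        found.append(entry)
    return found


entries = fdb_entries_for("AA:BB:CC:DD:EE:FF", SAMPLE)
for e in entries:
    print(e)
```

On a machine that has both Python and the `bridge` tool, the dump could be captured with `subprocess.run(["bridge", "fdb", "show"], capture_output=True, text=True).stdout` instead of the sample string. On the router itself it is probably easier to save the output to a file and parse it elsewhere.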

@Ansuel Since you mentioned that this needs to be debugged and as it probably needs to be addressed before the next release anyways, what would be the next steps for that?

@moderators I would suggest to split off this year's posts (everything below IPQ40xx Switch Config "Strangeness" - #108) to this thread into a new one (e.g. called "ipq40xx/ dsa: switch issues"), as the DSA drivers (and their behaviour) are very different from the original issues with the old swconfig drivers.

I still wasn't able to reliably determine what role the FDB plays in it but I can confirm that I'm affected in the same way.
Packets go missing somewhere between the external switch and the Linux bridge (-> somewhere between the GMAC and CPU port of the internal switch).

The issue can be reproduced reliably by roaming from one AP (A) to another (B) and back (A). The (external) managed switch notices the port changes alright and forwards traffic accordingly but it doesn't reach the bridge (the other direction works). After a while, the client device gives up. When it reconnects to either A or B after 3-4 minutes, everything works again (also in alignment with @Flole's observation).
On both routers, the SSID interface is in a bridge with lan.16.

@slh If I'm not mistaken, this thread is already about DSA.

The FDB is the forwarding database of the integrated switch. It contains a wrong entry, so the switch will not forward the packets to the CPU port. Hence, the packets get lost and never reach the client until that entry expires.