Ipq40xx / dsa: switch issues

Unfortunately it seems like DSA did not fix things and even broke things on my RBR50. I updated to the latest master and I am seeing something really weird. I use all 4 ports as LAN ports, so I did this:

config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'lan1'
        list ports 'lan2'
        list ports 'lan3'
        list ports 'wan'
        option ipv6 '0'

Then I need 2 VLANs, one "normal" (untagged) and one tagged with VLAN ID 2:

config bridge-vlan
        option device 'br-lan'
        option vlan '1'
        list ports 'lan1:u*'
        list ports 'lan2:u*'
        list ports 'lan3:u*'
        list ports 'wan:u*'

config bridge-vlan
        option device 'br-lan'
        option vlan '2'
        list ports 'wan:t'

As only a single port needs the tagged packets, I configured it this way. I then created bridges using the resulting br-lan.1 and br-lan.2 interfaces. When I connect to the WiFi that is bridged with br-lan.1, I can see the DHCP request reaching the DHCP server, but according to tcpdump the response never arrives on wan (and also never at the bridge, let alone the device connected to the access point). After some time, though, it starts working, at least until I disconnect from WiFi, wait some time and reconnect.

Does anybody know what could cause such weird behaviour? It looks a little like some hidden switch takes some time to learn the MAC address/port to forward the packets to...

Never mind, I was wrong haha

That config was made by LuCI this way. I see br-lan there and can configure the VLANs there (and apparently only there). When I set the "local" checkmark it generates br-lan.1 and br-lan.2 for me.

I thought it is necessary to generate a bridge if I want to connect the WiFi radios to those later on?

There is yet another bridge named lan that contains br-lan.1 (and one named lan2 for br-lan.2 as well). So in total I have 3 bridges: one is basically the switch, and the other 2 are my VLANs, which contain the switch's VLAN and the WiFi networks.

Don't worry, when I tried to set this up initially it took me a bunch of attempts as well, as I thought "this and that" might not be necessary. That pseudo-bridge that represents the switch really caught me off-guard and confused me :wink:

I think I know what's going on: when the device connects to a different access point, this one learns the MAC on the WAN port. Apparently it then doesn't re-learn it when the device connects to the WiFi of this AP (the CPU isn't listed as a port, so apparently it doesn't exist for the bridge). I turned off the ageing completely on br-lan so it doesn't learn MACs, and that seemed to have resolved it. That matched my observation that it occurred after roaming, which I had thought might be a coincidence. Never mind, that's not it.

Check out the better dumb AP template thread; maybe there are some improvements there that will help you further.

Also, I noticed you have VLAN 2 tagged on your WAN port, so I assume you NAT traffic behind it. However, you should not need VLAN 1 untagged on the same port and should remove it.

No I don't, the wan port is simply used as upstream port connected to my switch. Those are 2 completely separate networks which are both LANs, just for different purposes. So one LAN is untagged and one is tagged, both come over the wan port. It's labeled wan but really isn't, it's used as fourth lan port in my case.

Makes sense. So I assume you have stripped out the firewall config and/or disabled the firewall too? The other elements outlined on the better dumb AP thread are quite beneficial. Where is DHCP coming from for VLAN 1 and VLAN 2? Are you using an ip-helper on the router, is the DHCP server directly connected to each VLAN, is the router the DHCP server for both VLANs? Noting your comment about not even seeing the packet leave the hardware. What are you seeing in the logs (same as console messages) that may also provide some clues?

I simply didn't compile in the firewall at all. But even if I did: tcpdump would show the traffic even if it's dropped by the firewall.

There's a DHCP server on each VLAN, and this config works great on all other (different model) access points; just this one behaves weirdly from time to time. The packets do leave the hardware and outgoing is all good: the request reaches the server, the server sends a response, and the response is on the wire but never visible on the AP's Ethernet interface. dmesg and logread show absolutely nothing when that happens, only the WiFi association but no error or warning about Ethernet.

So I just tested this on another IPQ40xx device (Meraki MR33) and it shows the same behavior with this config. So I assume this affects possibly all IPQ40xx devices.

Any idea how to debug this further? I guess running tcpdump on eth0 won't show those packets either (if the encapsulating stuff is removed)?

@Ansuel Since you are the one who committed it for the RBR50, any idea what might be going on here? Incoming Ethernet traffic for some MAC addresses is not forwarded initially, and only works after some time. So in dumb access point mode it takes a few minutes after association before the device can actually get an IP and communicate with the network. It doesn't always happen though.

Dump the FDB database and look into duplicates that shouldn't really exist as DSA is managing the FDB database.
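A minimal sketch of how that dump could be filtered, assuming the iproute2 bridge tool is installed on the AP. The helper name fdb_for_mac is made up for this example and is not part of iproute2. It reads the output of bridge fdb show on stdin, prints every entry for one MAC and then counts how many distinct devices that MAC shows up on.

```shell
# Print every FDB entry for one MAC address, followed by the number of
# distinct devices it appears on. Reads `bridge fdb show` output on stdin.
# fdb_for_mac is a hypothetical helper name, not an iproute2 command.
fdb_for_mac() {
    # Normalise the hex digits to lower case so the match is case-insensitive.
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    awk -v mac="$mac" '
        tolower($1) == mac {
            print
            # Field 2 is the literal "dev", field 3 is the device name.
            if ($2 == "dev" && !($3 in seen)) { seen[$3] = 1; n++ }
        }
        END { printf "devices: %d\n", n }
    '
}

# Usage on the AP (the MAC is a placeholder):
#   bridge fdb show | fdb_for_mac aa:bb:cc:dd:ee:ff
```

A count above 1 only means the same MAC is present on more than one device. Whether that is a stale entry or a legitimate one still has to be judged from the printed lines.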

So when I connect to the AP and everything works normally it looks like this:

...
xx:xx:xx:xx:xx:xx dev phy2-ap0 master lanbridge
...

When I connect to another AP it looks like this:

...
xx:xx:xx:xx:xx:xx dev lan vlan 1 master br-lan
xx:xx:xx:xx:xx:xx dev lan vlan 1 self
xx:xx:xx:xx:xx:xx dev br-lan.1 master lanbridge
...

After roaming back to this AP it looks like this:

...
xx:xx:xx:xx:xx:xx dev lan vlan 1 master br-lan
xx:xx:xx:xx:xx:xx dev lan vlan 1 self
xx:xx:xx:xx:xx:xx dev phy2-ap0 master lanbridge
...

Are those nested bridges the issue here?

As I am using a single Ethernet port device (and on the RBR50 I only use a single port at the moment as well), can I just disable learning completely using some command? I tried bridge link set dev lan learning off but only got Error: bridge: bridge flag offload is not supported.

Why are you hiding the MACs? They are what is important.
Are you seeing duplicate MAC FDB entries?

I filtered for the affected MAC, so it's all the same MAC above. Looks like those are duplicates.

That is not supposed to happen as far as I know

Seems like that's indeed the issue, once the entries expire and disappear it starts to work again. Any idea what could cause this? Or any idea for some workaround?

Well, this is the issue we had with HW learning on the CPU port and it was solved by switching to assisted software FDB learning.

I found the conversation in pull request #4721 and I am reading through it right now, but judging by all that, it should just work...

I'm investigating an issue where WiFi clients can no longer receive packets after roaming. Initially, I thought it had something to do with the WiFi itself, but traffic for affected clients was visible on the WiFi device. It just wouldn't pass through the software bridge.
The connection resumes after a few minutes, by which time the client has most likely already decided to disconnect.

In my setup there are multiple software bridges, each with a different VLAN, and two WiFi networks.
How could I check whether my issue is related to @Flole's problem?

99% sure it's the same issue. It's most likely affecting all IPQ40xx devices, I'm seeing it on the MR33 and RBR50. The MR33 is a single ethernet port device, the RBR50 is a 4 port device.

I just found this commit that enables setting the ageing time: https://github.com/torvalds/linux/commit/6a3bdc5209f45d2af83aa92433ab6e5cf2297aa4

For some reason the value 0 is not allowed, so it is apparently not possible to disable ageing completely. I would suggest changing the check to something like

if (!val && msecs)

so that an explicit 0 would be allowed, making it possible to turn off the ageing if wanted.

I have now turned the ageing time of all bridges down to 7 seconds, hoping this will greatly reduce the delay I am experiencing (for some reason, even though the default 30 second delay is set in the bridges, the switch uses a different default).
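For anyone who wants to try the same thing, here is a rough sketch of setting the ageing time of a bridge through sysfs. The function name set_bridge_ageing and the SYSFS_NET variable are made up for this example; SYSFS_NET only exists so the path can be pointed elsewhere for a dry run. The bridge's ageing_time file in sysfs takes the value in hundredths of a second, so 7 seconds is written as 700. Whether the switch hardware actually follows the value set on the bridge is an assumption I have not verified.

```shell
# Set the FDB ageing time of a Linux bridge through sysfs.
# set_bridge_ageing and SYSFS_NET are hypothetical names for this sketch.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

set_bridge_ageing() {
    bridge_dev="$1"   # e.g. br-lan
    seconds="$2"      # ageing time in whole seconds
    file="$SYSFS_NET/$bridge_dev/bridge/ageing_time"
    if [ ! -w "$file" ]; then
        echo "not a bridge or not writable: $bridge_dev" >&2
        return 1
    fi
    # The sysfs attribute is expressed in 1/100 s, so convert from seconds.
    echo $((seconds * 100)) > "$file"
}

# Usage on the AP (bridge names are the ones from this thread):
#   set_bridge_ageing br-lan 7
#   set_bridge_ageing lanbridge 7
```

This change does not survive a reboot or a network restart, so it is only useful for testing whether a shorter ageing time shortens the outage after roaming.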