Issue introduced in 22.03.0-rc4 (still there in rc5 and rc6)

Issue Description: Any OpenWrt router running 22.03.0-rc4, 22.03.0-rc5, or 22.03.0-rc6 on my network behaves as follows. When freshly flashed or rebooted, any client attached to the router drops ping packets for a while after connecting, but then pings successfully. I have 802.11r set up, so I am used to minimal packet loss when switching base stations, and this pause in routing packets immediately got my attention. After 12 to 24 hours, packets are not routable beyond the base station at all, no matter how long you wait. The base station's own IP is almost immediately pingable, but nothing beyond that address is. My main household router (pfSense) is not pingable, and obviously neither is anything past it.

My OpenWrt setup works on 22.03.0-rc1 and earlier. I have rolled back to 22.03.0-rc1 with no changes in setup to keep things running. I have experienced this issue on working 22.03.0-rc1 routers upgraded to 22.03.0-rc4, 22.03.0-rc5, and 22.03.0-rc6. To make sure there were no dangling incompatible settings, I did a fresh from-scratch install of 22.03.0-rc6 and then hand-configured it to match my normal setup, and I still see the same behavior. This does not seem to be hardware specific, as it is happening on at least three different sets of hardware (Netgear R6220, Totolink X5000R, Linksys E8450 (UBI)).
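For anyone trying to pin down when a client goes from "pings fine" to "only the base station answers", a small logger run on a wireless client can timestamp the transition. This is only a rough sketch: it assumes a Linux or macOS style `ping -c`, and the names `parse_loss`, `classify`, and `monitor` are made up for illustration. The two addresses you pass in are your own base station and pfSense IPs.

```python
import re
import subprocess
import time

# Matches the summary line printed by common ping implementations.
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_loss(ping_output):
    """Extract the packet-loss percentage from ping's summary, or None."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def classify(ap_loss, gw_loss):
    """Turn the two loss figures into a short state label for the log."""
    ap_ok = ap_loss is not None and ap_loss < 100
    gw_ok = gw_loss is not None and gw_loss < 100
    if ap_ok and gw_ok:
        return "ok"
    if ap_ok:
        return "stuck-at-ap"  # base station answers, nothing beyond it does
    return "ap-unreachable"


def ping(host, count=3):
    """Run the system ping and return its text output ("" if it cannot run)."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout


def monitor(ap_ip, gw_ip, interval=60):
    """Log one timestamped state line per interval until interrupted."""
    while True:
        state = classify(parse_loss(ping(ap_ip)), parse_loss(ping(gw_ip)))
        print(time.strftime("%Y-%m-%d %H:%M:%S"), state, flush=True)
        time.sleep(interval)
```

Calling `monitor(ap_ip, gw_ip)` with your own two addresses and leaving it running overnight should show roughly when the state flips from `ok` to `stuck-at-ap`.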

My Network: [cable modem] <-> [pfSense router] <-> 2* [OpenWrt 22.03.0-rc1 base stations]
All my network and routing services are handled by my pfSense router. My two OpenWrt base stations are just there for wireless connectivity. I have no WAN network on my OpenWrt boxes. I have a LAN network on VLAN br-lan.1, and two guest networks on br-lan.3 and br-lan.5 respectively. Wireless itself appears to be unaffected and working correctly. It is routing to my pfSense router and beyond that is the issue.

One difference I have noticed is on 22.03.0-rc1 the global target and gateway are on lan

on 22.03.0-rc6 the global target and gateway is on GUEST1

I don't know if this is a red herring or is significant...

Some additional screenshots that show my setup:

I apologize for this somewhat nebulous report. I am hoping I can get some help on what to look at next as I am now drawing a blank and am embarrassed at how long I have spent without success.

One more interesting piece of information, while clients cannot ping the router and anything beyond.... The router itself can see

cat /etc/board.json

Let's see the delta if any.

@pmagid but all the wireless devices you have are handled by the mt76 driver.
This could be an issue in the driver's common code. There have been many reports of problems, and many are not solved.

@lukasz92 I was worried about that... I have a Linksys WRT1200 (Marvell). I will try with that and report back here later...

Seriously? So if I get a router that uses the mt76 driver should I stay at 22.02.3 or go lower/higher?

@lukasz92 I can confirm that my Linksys WRT1200AC appears to be behaving correctly on the newer RCs. My Apple Mac client switches to the Marvell-based AP with only the drop of a packet or two (as expected with 802.11r). To be sure, we will have to wait until tomorrow morning. With an mt76 device nothing would be working after 12-24 hours. I am betting it will still be good with this Marvell box. I think there is possibly an issue with VLANs and mt76. Is there anything I can do to further chase down the issue, or should I just hang tight for an mt76 update and a newer RC?

@lukasz92 I can confirm 12 hours later the Linksys WRT1200AC is working correctly. It's definitely an issue with the mt76 driver (somewhere).

maybe interesting @nbd

@nbd is there anything I can do to help or track down and narrow in on the issue?

Hmmm.. - this is the faulty commit.

When I checked the 22.03 tags, 22.03.0-rc4 and 22.03.0-rc3 have the same version of the mt76 driver (22.03.0-rc5 introduced many changes). The commit I refer to is the only change to the mac80211 stack between 22.03.0-rc3 and 22.03.0-rc4.

There is an interesting comment below:

FWIW I am not using 802.11s, which this commit also breaks according to the comments...

I don't understand why you have all the ports tagged in VLAN 3 and 5, and yet they are also untagged in VLAN 1 and have PVID set to 1 (indicated by the asterisk next to each port in VLAN 1). This doesn't make any sense to me. Tagged ports are essentially trunk ports, for use to connect to other switches (including the switch in all-in-one devices).

Currently, if you connect a computer to one of the ports (for example, port 3), the computer will only send untagged data. As the port has PVID 1, it will be tagged with ID '1' on ingress, and the switch will send all data to VLAN 1 (i.e. br-lan.1). Data leaving port 3 will be tagged with VLAN tag 3 AND 5. Dumb devices like computers/end-devices can't understand tagged data, so it will be discarded.

You need to re-think the design of your VLANs.

Untagged ports are for use with end-devices.

Tagged ports are for use with other switches.
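To make the PVID / tagged / untagged rules concrete, here is a toy model of a single port configured the way described above (untagged with PVID in VLAN 1, tagged in VLANs 3 and 5). It is a simplified sketch of generic 802.1Q behaviour, not OpenWrt code, and the `Port`, `ingress_vlan`, and `egress_tag` names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    """One switch port: its PVID plus its untagged and tagged memberships."""
    pvid: int
    untagged: set = field(default_factory=set)
    tagged: set = field(default_factory=set)


def ingress_vlan(port, frame_tag):
    """Return the VLAN a received frame lands in, or None if it is dropped."""
    if frame_tag is None:
        # Untagged frames are classified into the port's PVID.
        return port.pvid
    # Tagged frames are accepted only if the port is a member of that VLAN.
    if frame_tag in port.tagged or frame_tag in port.untagged:
        return frame_tag
    return None


def egress_tag(port, vlan):
    """Describe how a frame belonging to `vlan` leaves this port."""
    if vlan in port.untagged:
        return "untagged"
    if vlan in port.tagged:
        return f"tagged {vlan}"
    return "not forwarded"


# Port 3 from the example: untagged + PVID in VLAN 1, tagged in 3 and 5.
port3 = Port(pvid=1, untagged={1}, tagged={3, 5})

print(ingress_vlan(port3, None))  # untagged frame from a plain end device
for vlan in (1, 3, 5):
    print(vlan, egress_tag(port3, vlan))
```

In this model an end device on port 3 only ever takes part in VLAN 1, while VLAN 3 and 5 traffic leaves the port with a tag that only a VLAN-aware device on the other end can use.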

How are your dumb access points connected to the router? Mesh? ethernet?

My dumb aps are all attached with ethernet. (I am not using mesh)

I set up my vlans like this because this is what has worked for what I need more or less from day one... Am I certain it is optimal / correct: no I am not.... I can look at the design for sure.... I am always up for improving things. But that vlan design is not the cause of the problem I am having, right?

Fix the VLAN problems - it will probably sort this out. Ports with the dumb APs connected should be tagged (on both the AP and the switch). Depending on how your WiFi is set up on the different APs, you may need to tag them on VLAN 1, 3, AND/OR 5 (for example, if you have 2 ssids on the same AP belonging to guest and Lan, they should be tagged on VLAN 3 and 1). They definitely shouldn't be untagged though.
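As a rough illustration of what "tagged on VLAN 1, 3, and 5" could look like in `/etc/config/network`, the sketch below prints `bridge-vlan` stanzas with every listed port marked tagged (`:t`). It assumes a DSA-style setup with bridge VLAN filtering on `br-lan`; the port name `lan1` and the helper `bridge_vlan_stanzas` are placeholders, so substitute whatever your board actually exposes.

```python
def bridge_vlan_stanzas(device, vlans, trunk_ports):
    """Render one `config bridge-vlan` stanza per VLAN, all ports tagged."""
    stanzas = []
    for vlan in vlans:
        lines = [
            "config bridge-vlan",
            f"\toption device '{device}'",
            f"\toption vlan '{vlan}'",
        ]
        # The ':t' suffix marks the port as a tagged member of this VLAN.
        lines += [f"\tlist ports '{port}:t'" for port in trunk_ports]
        stanzas.append("\n".join(lines))
    return "\n\n".join(stanzas)


# 'lan1' is a hypothetical uplink port name, used only for illustration.
print(bridge_vlan_stanzas("br-lan", [1, 3, 5], ["lan1"]))
```

Note that whatever sits on the other end of that port has to send and accept tags for all three VLANs for this layout to pass traffic.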

I had/have similar issues

@kramsac I think you are jumping to conclusions and misunderstanding my topology. And assuming it is a problem when it may or may not be. When responding and displaying shortness please remember I am trying to help improve the quality/robustness of openwrt...

I have a pfSense router that handles all my networking, including the setup of the VLANs. The VLANs are used to segregate and dish out guest network class C addresses that are different from each other and different from my main network. Attached to that pfSense router via ethernet is an unmanaged switch, and attached to the unmanaged switch are two dumb APs. Those dumb APs need to allow connections on the three separate networks (guest, guest1, and lan).

I will gladly try switching the untagged ports to tagged and report back here.

Please remember two things: 1) this has worked on 22.03.0-rc1 and prior, and for years at that; 2) it continues to work with radios based on other chipsets (Marvell).

@kramsac Please carefully read everything I have posted. You have jumped to conclusions and shown you have missed key things I have said in multiple instances. For example, you asked if I was using mesh when just one post before I said I was not using 802.11s. You seem not to have noticed that I stated repeatedly this was a configuration that had worked for years and continues to work on other chipsets. Yet you approach the situation like I am a dumbass with a wrong VLAN configuration who can't get it to work (been there, done that, but that is not the case here). If that were the case it would never have worked and would not currently work on the problematic RCs with other chipsets.

That all being said I tried both:

As I expected neither worked and neither solved the issue.

@PolynomialDivision thank you for referencing it. It looks like this is what I am struggling with...

Wow, it's been open since May 26...