Also, found a better way to fix it in the mac80211 framework, so it will span all wireless devices, not just mt76.
That is great news! Are you hoping to get the patch for this into OpenWRT (via https://github.com/openwrt/openwrt/tree/master/package/kernel/mac80211/patches/subsys) as an interim patch as well? Sometimes it can take a while for kernel patches to make it into kernel releases, and this could be a good way to get the fix in quickly, and hopefully also into the 22.03.x release stream as well.
Looks like Felix just pushed a series of mtk ppe / WED related patches into master. I'm going to get those running on my APs and see if the issue I'm seeing persists.
Update: Hmm... not off to a good start:
2023-03-23 15:42 kernel [ 5320.784522] Unable to handle kernel paging request at virtual address 00ffffff80057ad9
I'm not 100% sure yet how the patch process works. As far as I can tell, it uses git send-email to mail patch files to the repo maintainers. Felix is listed on the upstream Linux one as well as on the OpenWrt mt76 one, so I think he'll merge that commit over once it's upstream?
Get the latest from the Brain2000/mt76 repo. I brought the changes closer to the patches we kept. The sta_remove and sta_pre_rcu_remove need to be separated, so I want to make sure that's not what is causing your crash. (Let me know if it fails to compile; I didn't test that it builds.)
This is very similar to the crash I had, but yours is in the WED poll.
Felix sent me an addition about an hour ago. It's on my latest repo commit if you want to try it out.
EDIT: I might be wrong about that. Where is this mtk_poll function located?
EDIT 2: Found it, looking at it now.
EDIT 3: This might be the time to try aarch64-openwrt-linux-gdb from the toolchain bin folder.
In an effort to test this I've done my first full build of OpenWRT.
I followed the standard environment setup instructions, checked out the v22.03.3 tag, and then cherry-picked the patch commit. From there I continued with the "quick" build instructions, using the build configuration from the relevant 22.03.3 target.
I've flashed the firmware and everything seems fine, but is there any way to verify that the firmware I built actually contains the patch I cherry-picked?
I just installed the current snapshot (OpenWrt SNAPSHOT r22392-e7c399bee6 / LuCI Master git-23.074.82619-6ad6a24), and one of my APs froze within an hour.
It was not just Wi-Fi: I was unable to log in and had to restart.
This matches your observation.
I've disabled WED for now.
I'm testing with it now and while my APs are not crashing in the same way as before, I am noticing decreased throughput at the moment. Curious if others are seeing the same as well.
FWIW, I still have WED enabled, but am also testing with /proc/sys/net/core/backlog_threaded enabled (set to 1) per this:
Update 1: Yeah, something odd is going on for me. I disabled /proc/sys/net/core/backlog_threaded for now so that it isn't an extra variable. When I see flows offloaded, performance seems pretty good, on par with the other WED testing I've been doing. But sometimes the same device, running the same test, will not have its flows offloaded, and the CPU load on my RT3200 reflects that heavily. In those cases, I'm seeing lower throughput than before the commits I tested yesterday (07b5508, fbcfb7f, d0a0696, and aa27771). Back-to-back tests are inconsistent about which flows get offloaded.
I feel like I'm probably going to have to bisect again and see if there was another regression in between. sigh. If anyone else is NOT seeing what I'm describing, it could be something else going on in my test environment and knowing that might save me some bisecting time.
Update 2: To put some numbers behind my earlier statements...
I found that the throughput "regression" I was noting is not necessarily new to this build. In fact, I went back and confirmed the same behavior was happening on d0a0690 (and maybe even before that) as well.
Testing with iperf3:
From a well-connected 1 Gb Ethernet host (192.168.xx.5)
To the same wireless AX client (192.168.xx.109), kept in the same location from test to test
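For anyone who wants to compare back-to-back runs the same way, here is a minimal sketch of collecting the numbers from iperf3's JSON output (the -J flag). It assumes iperf3 is installed on the wired host, that an iperf3 server is already running on the wireless client, and that the test is plain TCP. The function names (run_iperf3, received_mbps, summarize) are made up for illustration and are not from any of the posts above.

```python
import json
import statistics
import subprocess


def run_iperf3(server, seconds=10):
    """Run the iperf3 client against `server` and return the parsed JSON report."""
    result = subprocess.run(
        ["iperf3", "-c", server, "-t", str(seconds), "-J"],
        capture_output=True,
        text=True,
        check=True,  # raise if iperf3 exits with an error
    )
    return json.loads(result.stdout)


def received_mbps(report):
    """Receiver-side throughput in Mbit/s from a TCP iperf3 JSON report."""
    return report["end"]["sum_received"]["bits_per_second"] / 1e6


def summarize(samples_mbps):
    """Min, mean and max of several runs, to make run-to-run spread visible."""
    return {
        "min": min(samples_mbps),
        "mean": statistics.mean(samples_mbps),
        "max": max(samples_mbps),
    }
```

Calling run_iperf3 a few times in a row with the client's address, feeding each report through received_mbps, and passing the list to summarize gives one line per test session. A large gap between min and max on an otherwise unchanged setup would match the "sometimes offloaded, sometimes not" behavior described above.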
I found it too, but I'd probably need to spend hours on the documentation to figure out how gdb works. There isn't a simple example; instead there are hundreds of pages of technical documentation.
Thank you - that was helpful. The answer is no, the patch didn't apply. I have no idea why not though. Do patches need to be listed somewhere in order to get applied to a particular target?
I haven't read OpenWRT documentation on how to apply a specific patch, so I can't answer that. But I'm well versed with git manipulation, so I just do my own patching.
If you run the patch again, you can check that file before you compile. If it's not there, then the patch maybe didn't run? If it did and you run "make world", check it again afterwards and the patch should still be there. If it downloads over it... well, it shouldn't, because each package has a git repo and SHA256 checksum, and packages do not re-download/unpack as long as they haven't changed, leaving source files as is.
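The "check the file before and after" step can also be scripted. Below is a rough sketch, not an OpenWrt tool: it reads a unified diff, collects the lines it adds per file, and reports which of those lines are missing from an unpacked source tree. The function names are hypothetical, and the check is only a heuristic, since a trivial added line such as a lone closing brace will match almost any file.

```python
from pathlib import Path


def added_lines_by_file(patch_text):
    """Map each target file in a unified diff to the non-blank lines it adds."""
    added = {}
    current = None
    prev = ""
    for line in patch_text.splitlines():
        if line.startswith("+++ ") and prev.startswith("--- "):
            # File header such as "+++ b/net/mac80211/sta_info.c".
            target = line[4:].split("\t")[0].strip()
            if target == "/dev/null":
                current = None  # the patch deletes this file
            else:
                if target.startswith(("a/", "b/")):
                    target = target[2:]
                current = target
                added.setdefault(current, [])
        elif line.startswith("+") and current is not None:
            text = line[1:].strip()
            if text:  # blank added lines carry no information
                added[current].append(text)
        prev = line
    return added


def patch_looks_applied(patch_text, source_root):
    """Return {file: [added lines not found]}; an empty dict means all were found."""
    root = Path(source_root)
    missing = {}
    for rel_path, lines in added_lines_by_file(patch_text).items():
        path = root / rel_path
        present = set()
        if path.is_file():
            content = path.read_text(errors="replace")
            present = {src.strip() for src in content.splitlines()}
        absent = [text for text in lines if text not in present]
        if absent:
            missing[rel_path] = absent
    return missing
```

Point source_root at the unpacked package source under build_dir (the exact directory depends on the target and package version, so that part is left to the reader) and pass in the text of the patch file. Running it once after the prepare step and again after "make world" shows whether the change is still in the tree.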