Quick update on this. After a lot of code review (along with @Brain2000), we found nothing that stood out as the cause of the WED flow stoppages I was experiencing.
I ended up "grasping at straws" and disabled Usteer. I had been running Usteer for quite a while and generally had good success with it for band steering.
But for some reason, something about Usteer's interaction with hostapd (??) can break WED-offloaded flows. Since disabling Usteer, I have gone over 27 hours without losing offloaded flows on any of my three APs. With Usteer enabled, I never made it this long without losing flows on at least one of the APs.
Am I saying Usteer is broken? Not necessarily. It's still a fine tool as far as I am concerned. But I am weighing its benefit vs the benefit of WED, and the significant CPU reduction that WED brings on the TX path is hard to give up.
If anyone feels adventurous and wants to help figure out why Usteer triggers this, I'd be interested to see where the troubleshooting leads.
I am now using static-neighbor-reports to provide neighbor reports for 802.11k-capable clients, since Usteer is no longer filling that role for me.
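In case it helps anyone checking their own setup: hostapd's ubus interface exposes the neighbor report entries directly, which is how I sanity-check mine (wlan0 below is a placeholder for your actual BSS interface):

```sh
# List the neighbor report entries this hostapd instance is advertising:
ubus call hostapd.wlan0 rrm_nr_list

# Show this AP's own neighbor report entry (the value you would add
# to the static configuration on the other APs):
ubus call hostapd.wlan0 rrm_nr_get_own
```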
Would you mind sharing what you did to make this work, and any tests to verify that it is working as intended? I have read the documentation, but it doesn't seem entirely clear to me how this tool should be used.
Well, as luck would have it, offloaded flows stopped on one of my three APs. I definitely made it longer with offloading functional than when Usteer was running, but I'm back to being pretty perplexed as to the root cause.
So the issue most likely occurs when Wi-Fi devices roam between access points.
I haven't had a chance to jump back on this yet, but we will get it. Later this evening I should be able to dig in again.
I've been working with Felix on splitting up the patches and submitting them upstream. I also found a better way to fix it in the mac80211 framework, so it will apply to all wireless drivers, not just mt76.
Well, you suggested a workaround for the issue you experienced, and this is an elaboration of that workaround, so it doesn't seem all that off topic. In any case, for those interested, whether arriving from this thread or otherwise, perhaps you might instead post on this thread:
> I also found a better way to fix it in the mac80211 framework, so it will apply to all wireless drivers, not just mt76.
That is great news! Are you hoping to get the patch for this into OpenWrt (via https://github.com/openwrt/openwrt/tree/master/package/kernel/mac80211/patches/subsys) as an interim patch as well? Sometimes it can take a while for kernel patches to make it into kernel releases, and this could be a good way to get the fix in quickly, and hopefully also into the 22.03.x release stream.
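For anyone wanting to carry such a fix locally in the meantime, the mechanics would roughly be what's sketched below; the file name and number prefix are illustrative, so follow the numbering of the existing patches in that directory:

```sh
# Drop the patch into the mac80211 subsys patch queue (the 390- prefix
# here is made up; match the numbering scheme already in the directory):
cp my-mac80211-fix.patch package/kernel/mac80211/patches/subsys/390-my-mac80211-fix.patch

# Rebuild just the mac80211 package to confirm the patch applies:
make package/kernel/mac80211/{clean,compile} V=s
```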
Looks like Felix just pushed a series of mtk ppe / WED related patches into master. I'm going to get those running on my APs and see if the issue I'm seeing persists.
**Update** Hmmm... not off to a good start:
```
2023-03-23 15:42 kernel [ 5320.784522] Unable to handle kernel paging request at virtual address 00ffffff80057ad9
```
I'm not 100% sure yet how the patch process works. It uses git send-email to mail the patch files to the maintainers. Felix is listed as a maintainer for both the upstream Linux side and OpenWrt's mt76, so I think he'll merge that commit over once it's accepted upstream?
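For reference, my understanding is the mechanics look roughly like this (the list address below is the usual one for mac80211 work, but double-check recipients with get_maintainer.pl for the files you actually touch):

```sh
# Turn the top commit into a mailable patch file:
git format-patch -1 HEAD

# Ask the kernel tree who should receive it:
./scripts/get_maintainer.pl 0001-*.patch

# Mail it out for review:
git send-email --to=linux-wireless@vger.kernel.org 0001-*.patch
```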
Get the latest from the Brain2000/mt76 repo. I've reworked the commits to sit closer to the patches we kept. The sta_remove and sta_pre_rcu_remove callbacks need to be separated, so I want to make sure that's not what's causing your crash (let me know if it compiles with any errors; I didn't test that it builds).
This is very similar to the crash I had, but yours is in the WED poll.
Felix sent me an addition about an hour ago. It's on my latest repo commit if you want to try it out.
**EDIT** I might be wrong about that. Where is this mtk_poll function located?
**EDIT2** Found it; looking at it now.
**EDIT3** This might be the time to try aarch64-openwrt-linux-gdb from the toolchain bin folder.
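A sketch of what I have in mind, assuming the kernel was built with debug info; the toolchain/target paths and the +0x30 offset are placeholders, and the real offset would come from the symbol+offset in the oops backtrace:

```sh
# From the buildroot, point the toolchain gdb at the unstripped kernel:
./staging_dir/toolchain-aarch64_cortex-a53_gcc-*/bin/aarch64-openwrt-linux-gdb \
    build_dir/target-aarch64_cortex-a53_*/linux-mediatek_*/vmlinux

# Inside gdb, translate the crash location into a source line:
(gdb) list *(mtk_poll+0x30)
```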
In an effort to test this, I've done my first full build of OpenWrt.
I followed the standard environment setup instructions, checked out the v22.03.3 tag, and then cherry-picked the patch commit. From there I continued following the "quick" build instructions, using the build configuration for the relevant 22.03.3 target.
I've flashed the firmware and everything seems fine, but is there any way to verify that the firmware I built actually contains the patch I cherry-picked?
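(The one idea I've had is to compare revision strings, since the build embeds git metadata from the checked-out tree; would something like this be a reliable check?)

```sh
# Before building: confirm the cherry-picked commit is on HEAD, and note
# the revision string the build will embed:
git log --oneline -3
./scripts/getver.sh

# After flashing: the device should report the same rNNNNN-<hash> revision:
ubus call system board | grep revision
```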
I just installed the current snapshot (OpenWrt SNAPSHOT r22392-e7c399bee6 / LuCI Master git-23.074.82619-6ad6a24), and one of my APs froze within an hour. It was not just Wi-Fi; I was unable to log in and had to restart.
This matches your observation.
I've disabled WED for now.
I'm testing with it now, and while my APs are not crashing the way they did before, I am noticing decreased throughput. Curious whether others are seeing the same.
FWIW, I still have WED enabled, but am also testing with /proc/sys/net/core/backlog_threaded enabled (set to 1) per this:
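In case anyone else wants to try the same, flipping it on is just a sysctl; note this knob comes from an OpenWrt-specific patch, so it only exists on builds that carry it:

```sh
# Enable threaded backlog processing at runtime:
echo 1 > /proc/sys/net/core/backlog_threaded

# Persist it across reboots:
echo 'net.core.backlog_threaded=1' > /etc/sysctl.d/90-backlog-threaded.conf
```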
**Update 1** Yeah, something odd is going on for me. I disabled /proc/sys/net/core/backlog_threaded for now so as not to introduce that new variable. When I see flows offloaded, performance seems pretty good, on par with other testing I've been doing with WED thus far. But sometimes the same device, running the same test, will not have its flows offloaded, and my RT3200 CPUs reflect that heavily. In those cases, I'm seeing lower throughput than before the commits I tested yesterday (07b5508, fbcfb7f, d0a0696, and aa27771). Back-to-back tests show inconsistency in which flows get offloaded.
I feel like I'm probably going to have to bisect again and see whether there was another regression in between. *sigh*. If anyone else is NOT seeing what I'm describing, it could be something specific to my test environment, and knowing that might save me some bisecting time.
**Update 2** To put some numbers behind my earlier statements...
I found that the throughput "regression" I was noting is not necessarily new to this build. In fact, I went back and confirmed the same behavior was happening on d0a0690 (and maybe even prior to that) as well.
Testing with iperf3:
- From a well-connected 1 Gbit Ethernet host (192.168.xx.5)
- To the same wireless AX client (192.168.xx.109), kept in the same physical location from test to test
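The runs look like the sketch below; the -t/-P values are just what I tend to use, nothing magic. Pushing traffic from the wired host toward the wireless client exercises the AP's TX path, which is where WED's CPU savings show up:

```sh
# On the wireless client (192.168.xx.109):
iperf3 -s

# On the wired host (192.168.xx.5), send toward the client for 30s
# with 4 parallel streams:
iperf3 -c 192.168.xx.109 -t 30 -P 4
```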