mt76 wireless driver debugging

Quick update on this. After a lot of code review (along with @Brain2000), we found nothing that stood out as a cause of the WED flow stoppages I was experiencing.

I ended up "grasping at straws" and disabled Usteer, which I had been running for quite a while with generally good success for band steering.

But for some reason, something about Usteer's interaction with hostapd (??) can break WED offloaded flows. Since disabling Usteer, I have gone over 27 hours without losing offloaded flows on any of my three APs. With Usteer enabled, I never made it this long without losing flows on at least one of them.
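For anyone wanting to run the same experiment, taking Usteer out of the picture amounts to something like this (a sketch, assuming the standard init script installed by the usteer package):

# Stop Usteer now and keep it from starting on boot:
/etc/init.d/usteer stop
/etc/init.d/usteer disable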

Am I saying Usteer is broken? Not necessarily. It's still a fine tool as far as I am concerned. But I am weighing its benefit vs the benefit of WED, and the significant CPU reduction that WED brings on the TX path is hard to give up.

If anyone feels adventurous and wants to help figure out why Usteer leads to this effect, I'll be interested to see where the troubleshooting leads.

I am using static-neighbor-reports now to provide neighbor reports to 802.11k-capable clients, since Usteer is no longer filling that role for me.
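As a quick sanity check that clients are actually being offered neighbor data, something like this should work (a sketch; the interface name is an example, using the rrm ubus methods OpenWrt's hostapd exposes):

# Dump this AP's own neighbor report element:
ubus call hostapd.wlan0 rrm_nr_get_own
# List the neighbor entries the AP will report to 802.11k clients:
ubus call hostapd.wlan0 rrm_nr_list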


Would you mind sharing what you did to make this work, and any tests to verify that it is working as intended? I have read the documentation, but it doesn't seem entirely clear to me how this tool should be used.

Not on this thread. Such a topic is outside the scope of this fine thread.


Well, as luck would have it, offloaded flows stopped on one of the three APs. Offloading definitely stayed functional longer than it did with Usteer running, but I'm back to being pretty perplexed about the root cause.
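For reference, this is how I watch for the failure (a sketch; the debugfs path is what I see on my units and may vary by kernel version):

# Bound (hardware-offloaded) PPE entries; when the problem hits,
# active flows stop showing up here:
cat /sys/kernel/debug/ppe0/bind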

So the issue is most likely triggered when Wi-Fi devices roam between access points.

I haven't had a chance to jump back on this yet, but we will get it. Later this evening I should be able to dig in again.

I've been working with Felix on splitting up the patches and submitting them upstream. I also found a better way to fix it in the mac80211 framework, so the fix will cover all wireless drivers, not just mt76.


Well, you suggested a workaround for the issue you've experienced, and this is a request to elaborate on that workaround, so it doesn't seem so off topic. But in any case, for those interested - whether arriving from this thread or otherwise - perhaps you might instead post on this thread:

I also found a better way to fix it in the mac80211 framework, so the fix will cover all wireless drivers, not just mt76.

That is great news! Are you hoping to get the patch into OpenWrt (via https://github.com/openwrt/openwrt/tree/master/package/kernel/mac80211/patches/subsys) as an interim patch as well? Kernel patches can take a while to land in kernel releases, so this could be a good way to get the fix in quickly, and hopefully into the 22.03.x release stream too.
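For anyone unfamiliar with that mechanism, carrying an interim fix looks roughly like this (a sketch; the patch filename is hypothetical, and the numeric prefix only controls apply order):

# Drop the backported fix into the mac80211 subsys patch queue:
cp 9xx-mac80211-fix-wed-flows.patch package/kernel/mac80211/patches/subsys/
# Rebuild just the mac80211 package to confirm the patch applies cleanly:
make package/kernel/mac80211/{clean,compile} V=s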


Looks like Felix just pushed a series of mtk_ppe / WED-related patches to master. I'm going to get those running on my APs and see if the issue I'm seeing persists.

Update: Hmmm... not off to a good start:

2023-03-23 15:42 kernel [ 5320.784522] Unable to handle kernel paging request at virtual address 00ffffff80057ad9

Gotta go power cycle that AP.

I'm not 100% sure yet how the patch process works. It uses git send-email to send patch files to the repo maintainers. Felix is listed as a maintainer both in upstream Linux and for openwrt mt76, so I think he'll merge that commit over once it's accepted upstream?
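For the curious, the usual upstream flow looks roughly like this (a sketch; the addresses are examples, not the actual recipients):

# Turn the top commit into a mail-ready patch file:
git format-patch -1
# Mail it to the maintainer and the relevant mailing list:
git send-email --to=maintainer@example.org --cc=linux-wireless@vger.kernel.org 0001-*.patch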

Get the latest from the Brain2000/mt76 repo. I reworked the commits there to match the patches we kept more closely. The sta_remove and sta_pre_rcu_remove handling needs to be separated, so I want to make sure that's not what is causing your crash. (Let me know if it compiles with any errors; I didn't test that it builds.)
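In case it helps anyone following along, pointing a build at that fork amounts to something like this (a sketch; the commit hash is a placeholder to fill in):

# Rewrite the mt76 package source to the Brain2000 fork:
sed -i \
  -e 's|^PKG_SOURCE_URL:=.*|PKG_SOURCE_URL:=https://github.com/Brain2000/mt76|' \
  -e 's|^PKG_SOURCE_VERSION:=.*|PKG_SOURCE_VERSION:=<commit-hash-to-test>|' \
  -e 's|^PKG_MIRROR_HASH:=.*|PKG_MIRROR_HASH:=skip|' \
  package/kernel/mt76/Makefile
# Rebuild just mt76 against the new source:
make package/kernel/mt76/{clean,compile} V=s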

Got it! Compiled fine, so I'm building a new image now. I'll put it through the wringer as usual and see how it goes.

Your fix is pushed into openwrt/master. Congrats. :slight_smile:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=9779ee021d30508eb9e7ebf1ec0a28a4be3c4c19


Such awesome progress via this thread! Thanks to @Brain2000 for his trailblazing to help get some things addressed :clap:

For anyone running WED, have you noticed a significant regression with this commit as I described here?

I just got a crash dump from this issue:

<1>[   84.515285] Unable to handle kernel paging request at virtual address 00010102464c457f
<1>[   84.523230] Mem abort info:
<1>[   84.526015]   ESR = 0x0000000096000004
<1>[   84.529766]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[   84.535072]   SET = 0, FnV = 0
<1>[   84.538136]   EA = 0, S1PTW = 0
<1>[   84.541274]   FSC = 0x04: level 0 translation fault
<1>[   84.546143] Data abort info:
<1>[   84.549033]   ISV = 0, ISS = 0x00000004
<1>[   84.552863]   CM = 0, WnR = 0
<1>[   84.555822] [00010102464c457f] address between user and kernel address ranges
<0>[   84.562967] Internal error: Oops: 96000004 [#1] SMP
<7>[   84.567843] Modules linked in: nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mac80211 cfg80211 nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c hwmon compat cls_flower act_vlan cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact cryptodev autofs4 seqiv authencesn authenc leds_gpio gpio_button_hotplug
<7>[   84.637105] CPU: 1 PID: 343 Comm: napi/mtk_eth-4 Tainted: G S                5.15.102 #0
<7>[   84.645192] Hardware name: Linksys E8450 (UBI) (DT)
<7>[   84.650060] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
<7>[   84.657015] pc : __mtk_ppe_check_skb+0x10c/0x55c
<7>[   84.661634] lr : __mtk_ppe_check_skb+0x154/0x55c
<7>[   84.666244] sp : ffffffc00942bb30
<7>[   84.669549] x29: ffffffc00942bb30 x28: 0000000000000000 x27: ffffff80065ffd00
<7>[   84.676682] x26: 000000000000ffff x25: 0000000000000028 x24: ffffff80056b4780
<7>[   84.683814] x23: ffffffc0085cf7e0 x22: 00010102464c457f x21: ffffffc008c05000
<7>[   84.690944] x20: 00000000000008f4 x19: ffffff8000d00080 x18: 0000000000000001
<7>[   84.698075] x17: 0000000000003300 x16: ffffffc0091f1000 x15: ffffff80027f0886
<7>[   84.705205] x14: 00000000428684c4 x13: 01bbed2234865a94 x12: 1c3d0b440994d43c
<7>[   84.712335] x11: 2600170000000186 x10: 0000000020632063 x9 : 1c3d0b440994d43c
<7>[   84.719466] x8 : 2600170000000186 x7 : 0000000020632063 x6 : 2a0086c01a001e53
<7>[   84.726597] x5 : c086002a63206320 x4 : 0000000000068808 x3 : 000000000000ffff
<7>[   84.733727] x2 : 0000000000000001 x1 : 0000000000000003 x0 : 00000000ffffffff
<7>[   84.740859] Call trace:
<7>[   84.743296]  __mtk_ppe_check_skb+0x10c/0x55c
<7>[   84.747560]  mtk_poll_rx+0x93c/0xa5c
<7>[   84.751128]  mtk_napi_rx+0x90/0x180
<7>[   84.754610]  __napi_poll+0x54/0x1b0
<7>[   84.758096]  napi_threaded_poll+0x84/0xe4
<7>[   84.762098]  kthread+0x11c/0x130
<7>[   84.765320]  ret_from_fork+0x10/0x20
<0>[   84.768893] Code: f8767b16 529ffffa b40002d6 d503201f (f94002d8)
<4>[   84.774978] ---[ end trace 8222ee3554dbe808 ]---
<0>[   84.786025] Kernel panic - not syncing: Oops: Fatal exception in interrupt
<2>[   84.792896] SMP: stopping secondary CPUs
<0>[   84.796814] Kernel Offset: disabled
<0>[   84.800291] CPU features: 0x0,00006000,00000802
<0>[   84.804814] Memory Limit: none

This is very similar to the crash I had, but yours is in the WED poll.
Felix sent me an addition about an hour ago. It's in my latest repo commit if you want to try it out.

EDIT: I might be wrong about that. Where is this mtk_poll_rx function located?
EDIT2: Found it; looking at it now.
EDIT3: This might be the time to try aarch64-openwrt-linux-gdb from the toolchain bin folder:

(gdb) l *__mtk_ppe_check_skb+0x10c
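Expanded a bit, the workflow would look something like this (a sketch; the toolchain and build_dir paths are examples and vary by target and toolchain version):

# Load the unstripped kernel into the cross-gdb that ships with the toolchain:
./staging_dir/toolchain-aarch64_cortex-a53_gcc-12.3.0_musl/bin/aarch64-openwrt-linux-gdb \
    build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/vmlinux
# Map the faulting PC from the oops back to a source line:
(gdb) l *__mtk_ppe_check_skb+0x10c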

Thanks for the heads up. We have one more crash fix that needs to get in before I think we're in the clear...

Did it work for you? I'm having to get back up to speed on gdb, so I haven't uncovered any useful details yet. Did find the gdb binary, though!

In an effort to test this, I've done my first full build of OpenWrt.

I followed the standard environment setup instructions, checked out the v22.03.3 tag, and then cherry-picked the patch commit. From there I continued following the "quick" build instructions, using the build configuration from the relevant 22.03.3 target.

I've flashed the firmware and everything seems fine, but is there any way to verify that the firmware I built actually contains the patch that I cherry-picked?
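The best I've come up with so far is the following (a sketch; I'm not sure how conclusive it is):

# In the build tree: confirm the cherry-pick sits on top of the release tag:
git log --oneline v22.03.3..HEAD
# On the flashed device: the revision string should differ from the stock
# release, since extra local commits change the computed revision:
cat /etc/openwrt_release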


I just installed current snapshot
OpenWrt SNAPSHOT r22392-e7c399bee6 / LuCI Master git-23.074.82619-6ad6a24
and one of my APs froze within an hour.
It was not just wifi; I was unable to log in and had to restart.
This matches your observation.
I've disabled WED for now.
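For reference, a sketch of what disabling WED means here, assuming it was enabled via the hardware flow-offloading switch in the firewall config:

# Turn off hardware flow offloading (which WED rides on) and reload the firewall:
uci set firewall.@defaults[0].flow_offloading_hw='0'
uci commit firewall
service firewall restart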

Felix pushed a proposed fix for this regression:

I'm testing with it now, and while my APs are not crashing in the same way as before, I am noticing decreased throughput at the moment. Curious if others are seeing the same as well.

FWIW, I still have WED enabled, but am also testing with /proc/sys/net/core/backlog_threaded enabled (set to 1) per this:
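(Aside, for anyone wanting to toggle the same knob: this sysctl comes from a patch carried in these builds, so it may not exist on unpatched kernels.)

# Enable threaded backlog processing; write 0 to revert to the default:
echo 1 > /proc/sys/net/core/backlog_threaded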

Update 1: Yeah, something odd is going on for me. I disabled /proc/sys/net/core/backlog_threaded for now so as not to introduce that new variable. When I see flows offloaded, performance seems pretty good, on par with the other testing I've been doing with WED thus far. But sometimes the same device, running the same test, will not have its flows offloaded, and my RT3200 CPUs reflect that heavily. In those cases, I'm seeing lower throughput than before the commits I tested yesterday (07b5508, fbcfb7f, d0a0696, and aa27771). Back-to-back tests show inconsistency in which flows get offloaded.

Unfortunately there were a lot of other changes between the bad commit (aa27771) and the fix he pushed for that commit, which is https://github.com/openwrt/openwrt/commit/95cae498b6ab4613a0b4915393c798c750dac396.

I feel like I'm probably going to have to bisect again and see whether there was another regression in between. Sigh. If anyone else is NOT seeing what I'm describing, it could be something else going on in my test environment, and knowing that might save me some bisecting time. :slight_smile:
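The loop I have in mind, sketched out (the good/bad refs are placeholders to fill in from the commits mentioned above):

# Mark a known-bad and a known-good build, then let git walk the range:
git bisect start
git bisect bad 95cae49               # crash fix applied, but offload still inconsistent
git bisect good <last-good-commit>   # placeholder: a build without the symptom
# At each step: build, flash, rerun the iperf3 tests below, then mark the result:
git bisect good    # or: git bisect bad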


Update 2: To put some numbers behind my earlier statements...

I found that the throughput "regression" I was noting is not necessarily new to this build. In fact, I went back and confirmed the same behavior was happening on d0a0690 (and possibly even earlier) as well.

Testing with iperf3:

  • From a well-connected 1 Gbit Ethernet host (192.168.xx.5)
  • To the same wireless AX client (192.168.xx.109), located at the same location from test to test
  • Through the same AP

Test 1 - Offloaded flow appeared in ppe0:

❯ iperf3 -c 192.168.xx.5 -i1 -t30 -R
Connecting to host 192.168.xx.5, port 5201
Reverse mode, remote host 192.168.xx.5 is sending
[  5] local 192.168.xx.109 port 53371 connected to 192.168.xx.5 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.01   sec  45.3 MBytes   378 Mbits/sec
[  5]   1.01-2.01   sec  55.5 MBytes   465 Mbits/sec
[  5]   2.01-3.00   sec  41.7 MBytes   352 Mbits/sec
[  5]   3.00-4.00   sec  34.3 MBytes   288 Mbits/sec
[  5]   4.00-5.00   sec  54.6 MBytes   458 Mbits/sec
[  5]   5.00-6.00   sec  57.6 MBytes   483 Mbits/sec
[  5]   6.00-7.00   sec  58.5 MBytes   492 Mbits/sec
[  5]   7.00-8.00   sec  57.6 MBytes   483 Mbits/sec
[  5]   8.00-9.01   sec  58.6 MBytes   489 Mbits/sec
[  5]   9.01-10.00  sec  61.1 MBytes   513 Mbits/sec
[  5]  10.00-11.01  sec  43.2 MBytes   362 Mbits/sec
[  5]  11.01-12.00  sec  38.4 MBytes   324 Mbits/sec
[  5]  12.00-13.00  sec  38.2 MBytes   321 Mbits/sec
[  5]  13.00-14.00  sec  55.3 MBytes   464 Mbits/sec
[  5]  14.00-15.00  sec  53.4 MBytes   448 Mbits/sec
[  5]  15.00-16.00  sec  57.0 MBytes   478 Mbits/sec
[  5]  16.00-17.00  sec  54.1 MBytes   452 Mbits/sec
[  5]  17.00-18.00  sec  55.0 MBytes   462 Mbits/sec
[  5]  18.00-19.00  sec  58.5 MBytes   490 Mbits/sec
[  5]  19.00-20.01  sec  52.7 MBytes   440 Mbits/sec
[  5]  20.01-21.00  sec  42.2 MBytes   354 Mbits/sec
[  5]  21.00-22.00  sec   546 KBytes  4.47 Mbits/sec
[  5]  22.00-23.00  sec  29.1 MBytes   245 Mbits/sec
[  5]  23.00-24.00  sec  58.8 MBytes   493 Mbits/sec
[  5]  24.00-25.00  sec  54.7 MBytes   460 Mbits/sec
[  5]  25.00-26.00  sec  54.8 MBytes   459 Mbits/sec
[  5]  26.00-27.00  sec  53.1 MBytes   446 Mbits/sec
[  5]  27.00-28.00  sec  58.4 MBytes   490 Mbits/sec
[  5]  28.00-29.00  sec  51.3 MBytes   430 Mbits/sec
[  5]  29.00-30.01  sec  35.4 MBytes   296 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.18  sec  1.44 GBytes   409 Mbits/sec  59354             sender
[  5]   0.00-30.01  sec  1.43 GBytes   411 Mbits/sec                  receiver

iperf Done.

Test 2 - Offloaded flow did not appear in ppe0:

❯ iperf3 -c 192.168.xx.5 -i1 -t30 -R
Connecting to host 192.168.xx.5, port 5201
Reverse mode, remote host 192.168.xx.5 is sending
[  5] local 192.168.xx.109 port 53377 connected to 192.168.xx.5 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  44.5 MBytes   373 Mbits/sec
[  5]   1.00-2.00   sec  57.8 MBytes   485 Mbits/sec
[  5]   2.00-3.00   sec  57.4 MBytes   481 Mbits/sec
[  5]   3.00-4.01   sec  63.4 MBytes   529 Mbits/sec
[  5]   4.01-5.00   sec  61.6 MBytes   519 Mbits/sec
[  5]   5.00-6.00   sec  61.5 MBytes   514 Mbits/sec
[  5]   6.00-7.01   sec  63.9 MBytes   535 Mbits/sec
[  5]   7.01-8.00   sec  55.3 MBytes   466 Mbits/sec
[  5]   8.00-9.00   sec  14.8 MBytes   124 Mbits/sec
[  5]   9.00-10.00  sec  60.1 MBytes   503 Mbits/sec
[  5]  10.00-11.00  sec  43.6 MBytes   366 Mbits/sec
[  5]  11.00-12.00  sec  61.8 MBytes   519 Mbits/sec
[  5]  12.00-13.00  sec  57.3 MBytes   481 Mbits/sec
[  5]  13.00-14.00  sec  58.2 MBytes   488 Mbits/sec
[  5]  14.00-15.00  sec  60.8 MBytes   508 Mbits/sec
[  5]  15.00-16.00  sec  60.3 MBytes   506 Mbits/sec
[  5]  16.00-17.00  sec  63.2 MBytes   531 Mbits/sec
[  5]  17.00-18.00  sec  18.8 MBytes   158 Mbits/sec
[  5]  18.00-19.00  sec  54.1 MBytes   454 Mbits/sec
[  5]  19.00-20.00  sec  58.3 MBytes   489 Mbits/sec
[  5]  20.00-21.00  sec  38.0 MBytes   319 Mbits/sec
[  5]  21.00-22.00  sec  61.0 MBytes   512 Mbits/sec
[  5]  22.00-23.00  sec  60.0 MBytes   503 Mbits/sec
[  5]  23.00-24.00  sec  55.0 MBytes   461 Mbits/sec
[  5]  24.00-25.01  sec  57.2 MBytes   477 Mbits/sec
[  5]  25.01-26.00  sec  58.3 MBytes   490 Mbits/sec
[  5]  26.00-27.00  sec  33.1 MBytes   278 Mbits/sec
[  5]  27.00-28.00  sec  53.8 MBytes   450 Mbits/sec
[  5]  28.00-29.00  sec  48.9 MBytes   410 Mbits/sec
[  5]  29.00-30.00  sec  49.1 MBytes   413 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  1.56 GBytes   446 Mbits/sec    3             sender
[  5]   0.00-30.00  sec  1.55 GBytes   445 Mbits/sec                  receiver

iperf Done.

Something odd happens around 21-22 seconds in Test 1. And look at the number of TCP retransmits (!!!) when packets traverse ppe0 versus when they don't:
59354 vs 3

@nbd I suspect this is not the intended behavior for data traversing ppe0. Could you confirm this?

Also, this is still very true in my testing: