Mt76 wireless driver debugging

Also, found a better way to fix it in the mac80211 framework, so it will span all wireless devices, not just mt76.

That is great news! Are you hoping to get the patch for this into OpenWRT (via https://github.com/openwrt/openwrt/tree/master/package/kernel/mac80211/patches/subsys) as an interim patch as well? Sometimes it can take a while for kernel patches to make it into kernel releases, and this could be a good way to get the fix in quickly, and hopefully also into the 22.03.x release stream as well.

1 Like

Looks like Felix just pushed a series of mtk ppe / WED related patches into master. I'm going to get those running on my APs and see if the issue I'm seeing persists.

Update Hmmm.... not off to a good start:

2023-03-23 15:42 kernel [ 5320.784522] Unable to handle kernel paging request at virtual address 00ffffff80057ad9

Gotta go power cycle that AP.

I'm not 100% sure yet how the patch process works. It uses git email to send patch files to repo managers. Felix is listed on the one in the Linux upstream as well as openwrt mt76, so I think he'll be merging that commit over once it's in the upstream?

Get the latest from the Brain2000/mt76 repo. I put them closer to what the patches are that we kept. The sta_remove and sta_pre_rcu_remove need to be separated, so I want to make sure that's not causing your crash (let me know if it compiles with any errors, I didn't test to make sure it builds)

Got it! Compiled fine, so I'm building a new image now. I'll put it through the wringer as usual and see how it goes.

Your fix is pushed into openwrt/master. Congrats. :slight_smile:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=9779ee021d30508eb9e7ebf1ec0a28a4be3c4c19

8 Likes

Such awesome progress via this thread! Thanks to @Brain2000 for his trailblazing to help get some things addressed :clap:

For anyone running WED, have you noticed a significant regression with this commit as I described here?

I just got a crash dump from this issue:

<1>[   84.515285] Unable to handle kernel paging request at virtual address 00010102464c457f
<1>[   84.523230] Mem abort info:
<1>[   84.526015]   ESR = 0x0000000096000004
<1>[   84.529766]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[   84.535072]   SET = 0, FnV = 0
<1>[   84.538136]   EA = 0, S1PTW = 0
<1>[   84.541274]   FSC = 0x04: level 0 translation fault
<1>[   84.546143] Data abort info:
<1>[   84.549033]   ISV = 0, ISS = 0x00000004
<1>[   84.552863]   CM = 0, WnR = 0
<1>[   84.555822] [00010102464c457f] address between user and kernel address ranges
<0>[   84.562967] Internal error: Oops: 96000004 [#1] SMP
<7>[   84.567843] Modules linked in: nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mac80211 cfg80211 nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c hwmon compat cls_flower act_vlan cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact cryptodev autofs4 seqiv authencesn authenc leds_gpio gpio_button_hotplug
<7>[   84.637105] CPU: 1 PID: 343 Comm: napi/mtk_eth-4 Tainted: G S                5.15.102 #0
<7>[   84.645192] Hardware name: Linksys E8450 (UBI) (DT)
<7>[   84.650060] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
<7>[   84.657015] pc : __mtk_ppe_check_skb+0x10c/0x55c
<7>[   84.661634] lr : __mtk_ppe_check_skb+0x154/0x55c
<7>[   84.666244] sp : ffffffc00942bb30
<7>[   84.669549] x29: ffffffc00942bb30 x28: 0000000000000000 x27: ffffff80065ffd00
<7>[   84.676682] x26: 000000000000ffff x25: 0000000000000028 x24: ffffff80056b4780
<7>[   84.683814] x23: ffffffc0085cf7e0 x22: 00010102464c457f x21: ffffffc008c05000
<7>[   84.690944] x20: 00000000000008f4 x19: ffffff8000d00080 x18: 0000000000000001
<7>[   84.698075] x17: 0000000000003300 x16: ffffffc0091f1000 x15: ffffff80027f0886
<7>[   84.705205] x14: 00000000428684c4 x13: 01bbed2234865a94 x12: 1c3d0b440994d43c
<7>[   84.712335] x11: 2600170000000186 x10: 0000000020632063 x9 : 1c3d0b440994d43c
<7>[   84.719466] x8 : 2600170000000186 x7 : 0000000020632063 x6 : 2a0086c01a001e53
<7>[   84.726597] x5 : c086002a63206320 x4 : 0000000000068808 x3 : 000000000000ffff
<7>[   84.733727] x2 : 0000000000000001 x1 : 0000000000000003 x0 : 00000000ffffffff
<7>[   84.740859] Call trace:
<7>[   84.743296]  __mtk_ppe_check_skb+0x10c/0x55c
<7>[   84.747560]  mtk_poll_rx+0x93c/0xa5c
<7>[   84.751128]  mtk_napi_rx+0x90/0x180
<7>[   84.754610]  __napi_poll+0x54/0x1b0
<7>[   84.758096]  napi_threaded_poll+0x84/0xe4
<7>[   84.762098]  kthread+0x11c/0x130
<7>[   84.765320]  ret_from_fork+0x10/0x20
<0>[   84.768893] Code: f8767b16 529ffffa b40002d6 d503201f (f94002d8)
<4>[   84.774978] ---[ end trace 8222ee3554dbe808 ]---
<0>[   84.786025] Kernel panic - not syncing: Oops: Fatal exception in interrupt
<2>[   84.792896] SMP: stopping secondary CPUs
<0>[   84.796814] Kernel Offset: disabled
<0>[   84.800291] CPU features: 0x0,00006000,00000802
<0>[   84.804814] Memory Limit: none

This is very similar to the crash I had, but yours is in the WED poll.
Felix sent me an addition about an hour ago. It's on my latest repo commit if you want to try it out.

EDIT I might be wrong about that. Where is this mtk_poll function located?
EDIT2 found it, looking at it
EDIT3 this might be the time we try to use aarch64-openwrt-linux-gdb in the toolchain bin folder

(gdb) l *__mtk_ppe_check_skb+0x10c

Thanks for the heads up, we have one more crash that needs to get in there before I think we're in the clear...

Did it work for you? I'm having to get back up to speed on gdb, so I haven't uncovered any useful details yet. Did find the gdb binary, though!

In an effort to test this I've done my first full build of OpenWRT.

I followed the standard environment setup instructions, checked out the v23.03.3 tag and then cherry-picked the patch commit. From there I continued following the "quick" build instructions, using the build configuration from the relevant 23.03.3 target.

I've flashed the firmware and everything seems fine, but Is there any way to verify that the firmware I built actually contains the patch that I cherry-picked?

2 Likes

I just installed current snapshot
OpenWrt SNAPSHOT r22392-e7c399bee6 / LuCI Master git-23.074.82619-6ad6a24
and one of my APs froze within an hour.
It was not just wifi, I was unable to log in and had to restart.
This matches your observation.
I've disabled WED for now.

Felix pushed a proposed fix for this regression:

I'm testing with it now and while my APs are not crashing in the same way as before, I am noticing decreased throughput at the moment. Curious if others are seeing the same as well.

FWIW, I still have WED enabled, but am also testing with /proc/sys/net/core/backlog_threaded enabled (set to 1) per this:

Update 1 Yeah, something odd is going on for me. I disabled /proc/sys/net/core/backlog_threaded for now to not have that new variable introduced. When I see flows offloaded, performance seems pretty good--on par with other testing I've been doing with WED thus far. But sometimes the same device, running the same test, will not have its flows offloaded and my RT3200 CPUs reflect that heavily. In those cases, I'm seeing lower throughput than before the commits I tested yesterday (07b5508, fbcfb7f, d0a0696, and aa27771). Back-to-back tests indicate inconsistencies with which flows are offloaded or not.

Unfortunately there were a lot of other changes between the bad commit (aa27771) and the fix he pushed for that commit, which is https://github.com/openwrt/openwrt/commit/95cae498b6ab4613a0b4915393c798c750dac396.

I feel like I'm probably going to have to bisect again and see if there was another regression in between. sigh. If anyone else is NOT seeing what I'm describing, it could be something else going on in my test environment and knowing that might save me some bisecting time. :slight_smile:

1 Like

Update 2 To put some numbers behind my statements before...

I found the throughput "regression" I was noting is not necessarily new to this build. In fact, I went back and confirmed this same behavior was happening on d0a0690 (and maybe prior even to that) as well.

Testing with iperf3:

  • From a well-connected 1gb ethernet host (192.168.xx.5)
  • To the same wireless AX client (192.168.xx.109), located at the same location from test to test
  • Through the same AP

Test 1 - Offloaded flow appeared in ppe0:

❯ iperf3 -c 192.168.xx.5 -i1 -t30 -R
Connecting to host 192.168.xx.5, port 5201
Reverse mode, remote host 192.168.xx.5 is sending
[  5] local 192.168.xx.109 port 53371 connected to 192.168.xx.5 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.01   sec  45.3 MBytes   378 Mbits/sec
[  5]   1.01-2.01   sec  55.5 MBytes   465 Mbits/sec
[  5]   2.01-3.00   sec  41.7 MBytes   352 Mbits/sec
[  5]   3.00-4.00   sec  34.3 MBytes   288 Mbits/sec
[  5]   4.00-5.00   sec  54.6 MBytes   458 Mbits/sec
[  5]   5.00-6.00   sec  57.6 MBytes   483 Mbits/sec
[  5]   6.00-7.00   sec  58.5 MBytes   492 Mbits/sec
[  5]   7.00-8.00   sec  57.6 MBytes   483 Mbits/sec
[  5]   8.00-9.01   sec  58.6 MBytes   489 Mbits/sec
[  5]   9.01-10.00  sec  61.1 MBytes   513 Mbits/sec
[  5]  10.00-11.01  sec  43.2 MBytes   362 Mbits/sec
[  5]  11.01-12.00  sec  38.4 MBytes   324 Mbits/sec
[  5]  12.00-13.00  sec  38.2 MBytes   321 Mbits/sec
[  5]  13.00-14.00  sec  55.3 MBytes   464 Mbits/sec
[  5]  14.00-15.00  sec  53.4 MBytes   448 Mbits/sec
[  5]  15.00-16.00  sec  57.0 MBytes   478 Mbits/sec
[  5]  16.00-17.00  sec  54.1 MBytes   452 Mbits/sec
[  5]  17.00-18.00  sec  55.0 MBytes   462 Mbits/sec
[  5]  18.00-19.00  sec  58.5 MBytes   490 Mbits/sec
[  5]  19.00-20.01  sec  52.7 MBytes   440 Mbits/sec
[  5]  20.01-21.00  sec  42.2 MBytes   354 Mbits/sec
[  5]  21.00-22.00  sec   546 KBytes  4.47 Mbits/sec
[  5]  22.00-23.00  sec  29.1 MBytes   245 Mbits/sec
[  5]  23.00-24.00  sec  58.8 MBytes   493 Mbits/sec
[  5]  24.00-25.00  sec  54.7 MBytes   460 Mbits/sec
[  5]  25.00-26.00  sec  54.8 MBytes   459 Mbits/sec
[  5]  26.00-27.00  sec  53.1 MBytes   446 Mbits/sec
[  5]  27.00-28.00  sec  58.4 MBytes   490 Mbits/sec
[  5]  28.00-29.00  sec  51.3 MBytes   430 Mbits/sec
[  5]  29.00-30.01  sec  35.4 MBytes   296 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.18  sec  1.44 GBytes   409 Mbits/sec  59354             sender
[  5]   0.00-30.01  sec  1.43 GBytes   411 Mbits/sec                  receiver

iperf Done.

Test 2 - Offloaded flow did not appear in ppe0:

❯ iperf3 -c 192.168.xx.5 -i1 -t30 -R
Connecting to host 192.168.xx.5, port 5201
Reverse mode, remote host 192.168.xx.5 is sending
[  5] local 192.168.xx.109 port 53377 connected to 192.168.xx.5 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  44.5 MBytes   373 Mbits/sec
[  5]   1.00-2.00   sec  57.8 MBytes   485 Mbits/sec
[  5]   2.00-3.00   sec  57.4 MBytes   481 Mbits/sec
[  5]   3.00-4.01   sec  63.4 MBytes   529 Mbits/sec
[  5]   4.01-5.00   sec  61.6 MBytes   519 Mbits/sec
[  5]   5.00-6.00   sec  61.5 MBytes   514 Mbits/sec
[  5]   6.00-7.01   sec  63.9 MBytes   535 Mbits/sec
[  5]   7.01-8.00   sec  55.3 MBytes   466 Mbits/sec
[  5]   8.00-9.00   sec  14.8 MBytes   124 Mbits/sec
[  5]   9.00-10.00  sec  60.1 MBytes   503 Mbits/sec
[  5]  10.00-11.00  sec  43.6 MBytes   366 Mbits/sec
[  5]  11.00-12.00  sec  61.8 MBytes   519 Mbits/sec
[  5]  12.00-13.00  sec  57.3 MBytes   481 Mbits/sec
[  5]  13.00-14.00  sec  58.2 MBytes   488 Mbits/sec
[  5]  14.00-15.00  sec  60.8 MBytes   508 Mbits/sec
[  5]  15.00-16.00  sec  60.3 MBytes   506 Mbits/sec
[  5]  16.00-17.00  sec  63.2 MBytes   531 Mbits/sec
[  5]  17.00-18.00  sec  18.8 MBytes   158 Mbits/sec
[  5]  18.00-19.00  sec  54.1 MBytes   454 Mbits/sec
[  5]  19.00-20.00  sec  58.3 MBytes   489 Mbits/sec
[  5]  20.00-21.00  sec  38.0 MBytes   319 Mbits/sec
[  5]  21.00-22.00  sec  61.0 MBytes   512 Mbits/sec
[  5]  22.00-23.00  sec  60.0 MBytes   503 Mbits/sec
[  5]  23.00-24.00  sec  55.0 MBytes   461 Mbits/sec
[  5]  24.00-25.01  sec  57.2 MBytes   477 Mbits/sec
[  5]  25.01-26.00  sec  58.3 MBytes   490 Mbits/sec
[  5]  26.00-27.00  sec  33.1 MBytes   278 Mbits/sec
[  5]  27.00-28.00  sec  53.8 MBytes   450 Mbits/sec
[  5]  28.00-29.00  sec  48.9 MBytes   410 Mbits/sec
[  5]  29.00-30.00  sec  49.1 MBytes   413 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  1.56 GBytes   446 Mbits/sec    3             sender
[  5]   0.00-30.00  sec  1.55 GBytes   445 Mbits/sec                  receiver

iperf Done.

Something odd happens around 21-22seconds. Look at the number of retries (!!!) when packets are traversing ppe0 vs not:
59354 vs 3

@nbd I suspect this is not the intended behavior for data traversing ppe0. Could you confirm this?

Also, this is still very true in my testing:

d0a0690

Please provide full commit key. Cannot find d0a0690, as seen here: https://github.com/openwrt/openwrt/search?q=d0a0690

commit d0a06965e8ab0a92a4813fddbb8e045407897310
Author: Felix Fietkau <nbd@nbd.name>
Date:   Mon Mar 20 18:51:06 2023 +0100
2 Likes

I found it too, but probably need to spend hours on the documentation to figure out how gdb works. There isn't just an example, instead there are hundreds of pages of technical documentation.

1 Like

You can cd into the folder the gedit the file to see if the line is there:

osboxes@osboxes:/$ cd ~/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211
osboxes@osboxes:~/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211$ gedit sta_info.c

Although your backports might not be 6.1-rc8, so you'll have to find the folder.

Thank you - that was helpful. The answer is no, the patch didn't apply. I have no idea why not though. Do patches need to be listed somewhere in order to get applied to a particular target?

I haven't read OpenWRT documentation on how to apply a specific patch, so I can't answer that. But I'm well versed with git manipulation, so I just do my own patching.

If you run the patch again, you can check that file before you compile. If it's not there, then the patch maybe didn't run? If it did and you run "make world", check it again afterwards and the patch should still be there. If it downloads over it... well, it shouldn't, because each package has a git repo and SHA256 checksum, and packages do not re-download/unpack as long as they haven't changed, leaving source files as is.