Mt76 wireless driver debugging

Oh dear... :rofl:

I moved this to a new thread, as this is not really E8450/RT3200 specific.

Your nice efforts are more related to the mt76 driver itself.

So continuing on this new thread... here's the progress I made today:

First, the sta pointers are stored in the drv->wcid[wcid->idx] array. There is a bitmap which keeps track of which idx entries are empty/used.
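
To make that bookkeeping concrete, here is a minimal user-space sketch of a pointer table paired with an occupancy bitmap. The names (`slot_table`, `slot_alloc`, `slot_free`, `MAX_SLOTS`) are made up for illustration and are not the mt76 API; the real driver has its own helpers and publishes the pointers under RCU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 64  /* hypothetical table size, not the driver's value */

struct slot_table {
    uint64_t used;            /* bit i set => entries[i] is in use */
    void *entries[MAX_SLOTS]; /* stand-in for the drv->wcid[] array */
};

/* Claim the lowest free index and store ptr there; -1 if the table is full. */
static int slot_alloc(struct slot_table *t, void *ptr)
{
    for (int i = 0; i < MAX_SLOTS; i++) {
        uint64_t bit = UINT64_C(1) << i;

        if (!(t->used & bit)) {
            t->used |= bit;
            t->entries[i] = ptr;
            return i;
        }
    }
    return -1;
}

/* Clear the entry first, then mark the index as reusable. */
static void slot_free(struct slot_table *t, int idx)
{
    assert(idx >= 0 && idx < MAX_SLOTS);
    t->entries[idx] = NULL;
    t->used &= ~(UINT64_C(1) << idx);
}
```

The point of the sketch is that an index can be handed out again as soon as its bit is cleared, so anything still holding the old index (or the old pointer) after that moment is looking at someone else's entry.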

I found that sta cleanup has two separate stages: callbacks to sta_pre_rcu_remove and sta_remove. It calls sta_pre_rcu_remove first, which I believe is where you are supposed to swap out your pointer and perform a synchronize_rcu. And no, it does not call synchronize_rcu for you, which is where I think things are falling apart.

Then it cleans up a few items, stops transmissions, and calls sta_remove, which is the absolute cutoff: after that point the sta must not be accessed at all.

Since some items are cleaned up between the two, I feel it might be safer to move the sta_remove code to the sta_pre_rcu_remove function.

At the same time, there are places that access the drv->wcid[idx] array without an rcu read lock around it, which could lead to a crash, so I've added a couple of those.
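
As a rough illustration of the "unpublish, wait for readers, then free" ordering described above, here is a user-space toy. It is not RCU and not driver code: `reader_lock`, `reader_unlock`, `wait_for_readers`, and `sta_unpublish_and_free` are hypothetical names, and a single global reader counter stands in for `rcu_read_lock()` / `synchronize_rcu()`. Real RCU is far cheaper for readers and only waits for readers that started before the update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct sta { int id; };

#define TABLE_SIZE 8
static _Atomic(struct sta *) table[TABLE_SIZE]; /* stand-in for wcid[] */
static atomic_int active_readers;               /* toy replacement for RCU */

/* Readers bracket every table access with these two calls. */
static void reader_lock(void)   { atomic_fetch_add(&active_readers, 1); }
static void reader_unlock(void) { atomic_fetch_sub(&active_readers, 1); }

/* Only valid between reader_lock() and reader_unlock(). */
static struct sta *sta_lookup(int idx)
{
    assert(idx >= 0 && idx < TABLE_SIZE);
    return atomic_load(&table[idx]);
}

/* Toy grace period: spin until no reader is inside a locked section. */
static void wait_for_readers(void)
{
    while (atomic_load(&active_readers) != 0)
        ;
}

/* Step 1: unpublish. Step 2: wait. Step 3: only now is freeing safe. */
static void sta_unpublish_and_free(int idx)
{
    struct sta *old = atomic_exchange(&table[idx], (struct sta *)NULL);

    wait_for_readers();
    free(old);
}
```

A lookup done without `reader_lock()` is the analogue of reading the array without an RCU read lock: nothing stops the writer from freeing the entry between the load and the use.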

I'm currently testing these changes, and so far today they have not had any adverse effects, so I may push them to the GitHub repo later this evening if you want to give them a try as well.

I'm not really following what issues these patches are addressing. Is there some particular featureset (besides WED, which I haven't tried) that is causing these to crash?

I was reading about those callbacks this AM thanks to your link to the mac80211 docs on kernel.org.

Nice find! Sounds quite promising, indeed :+1:

Yes--please do! I will be happy to pull those in and roll another build. I'm not the sharpest tack in the box around the inner workings of this driver, but I can definitely be a test dummy and kick the tires on any code changes you wish to try :slight_smile:

Largely working toward resolving this one:

The issue where I see WED flows stop offloading is an edge case at this point, and I'm just poking around trying to get a better understanding of how that works within the driver.

Ok, I updated the repo.

So a couple things I found.

First, most of the rcu_read_lock() calls that I added ended up doing nothing useful, because if you follow the code back up to the top they were already nested inside an outer read lock. However, one or two really were missing. I'm starting to back out the unnecessary ones, but I'm also leaving comments on functions that need to be called with the RCU read lock held.

Second, I think the part that helped your RT3200 not crash was me disabling AMSDU_OFFLOAD, which helped avoid the issue with sta_remove. But ultimately that is not a proper fix.

Third, the spinlock added for the 80211_tx() and 80211_rx() functions probably needs to stay so that those functions are never called simultaneously, as the mac80211 documentation requires. I don't think that has any bearing on what we are doing here, but it probably should stay.

Finally, the actual attempted fix. I moved the sta_remove code into sta_pre_rcu_remove and added a synchronize_rcu at the end of it.

FINGERS CROSSED!

Hi VA1DER! A fellow HAM operator, nice to meet you.

So I have a number of the Belkin RT3200s, and it turns out that they work great at my house. But an environment with dozens of people roaming gets them to crash pretty quickly. I've never dabbled with a Linux driver, but I have decided to take on this issue. Given that, I've been doing a lot of reading and studying the MT76 code.

This week I also learned that a TTL serial cable is not an RS232 serial cable, so now I have one and can finally pull stack traces. I think I'm finally starting to narrow in on the problem.

Well, not going too smoothly at the moment. Got two of my APs updated to your latest commit and both are crashing pretty quickly. But this is what progress looks like :slight_smile:

Here's what I'm seeing. This is the crash dump from one of the two APs:

root@OpenWrt:/sys/fs/pstore# cat dmesg-ramoops-0
Oops#1 Part1
...
<1>[  113.323540] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
<1>[  113.332361] Mem abort info:
<1>[  113.335157]   ESR = 0x0000000096000045
<1>[  113.338908]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[  113.344230]   SET = 0, FnV = 0
<1>[  113.347284]   EA = 0, S1PTW = 0
<1>[  113.350420]   FSC = 0x05: level 1 translation fault
<1>[  113.355301] Data abort info:
<1>[  113.358180]   ISV = 0, ISS = 0x00000045
<1>[  113.362011]   CM = 0, WnR = 1
<1>[  113.364981] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000044e0f000
<1>[  113.371423] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
<0>[  113.380138] Internal error: Oops: 96000045 [#1] SMP
<7>[  113.385013] Modules linked in: nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mac80211 cfg80211 nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c hwmon compat cls_flower act_vlan cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact cryptodev autofs4 seqiv authencesn authenc leds_gpio gpio_button_hotplug
<7>[  113.455317] CPU: 0 PID: 1545 Comm: hostapd Tainted: G S                5.15.98 #0
<7>[  113.462796] Hardware name: Linksys E8450 (UBI) (DT)
<7>[  113.467665] pstate: a0000005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
<7>[  113.474620] pc : mt7615_mac_sta_remove+0x100/0x930 [mt7615_common]
<7>[  113.480814] lr : mt7615_mac_sta_remove+0x9c/0x930 [mt7615_common]
<7>[  113.486903] sp : ffffffc00909b760
<7>[  113.490208] x29: ffffffc00909b760 x28: ffffff8000ad5800 x27: ffffffc00909bdb0
<7>[  113.497341] x26: ffffff8001db7410 x25: ffffff8005e85cc0 x24: ffffffc000a2bf58
<7>[  113.504471] x23: ffffff8001db7650 x22: ffffff8001db7410 x21: ffffff8005e85aa0
<7>[  113.511603] x20: ffffff8005eb58b8 x19: ffffff8001db75f8 x18: 0000000000000000
<7>[  113.518732] x17: 0000000000000de0 x16: ffffffc008e8b000 x15: 00000000000006f0
<7>[  113.525863] x14: 0000000000000000 x13: 0000000000000000 x12: ffffffc0088186b0
<7>[  113.532993] x11: 000000000000020b x10: 0000000000000840 x9 : ffffffc00909b530
<7>[  113.540124] x8 : ffffff8000ad60a0 x7 : 0000000000000001 x6 : 0000000000000000
<7>[  113.547254] x5 : 00000000000001f4 x4 : 0000000000000000 x3 : 0000000000000000
<7>[  113.554385] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff8005e85f88
<7>[  113.561515] Call trace:
<7>[  113.563953]  mt7615_mac_sta_remove+0x100/0x930 [mt7615_common]
<7>[  113.569783]  __mt76_sta_remove+0x68/0xe4 [mt76]
<7>[  113.574316]  mt76_sta_pre_rcu_remove+0x44/0x10c [mt76]
<7>[  113.579452]  ieee80211_find_sta_by_link_addrs+0x430/0x540 [mac80211]
<7>[  113.585829]  sta_info_destroy_addr_bss+0x38/0x70 [mac80211]
<7>[  113.591415]  ieee80211_color_change_finish+0x1278/0x1500 [mac80211]
<7>[  113.597694]  cfg80211_check_station_change+0x1384/0x4720 [cfg80211]
<7>[  113.603970]  genl_family_rcv_msg_doit+0xb4/0x110
<7>[  113.608587]  genl_rcv_msg+0xd0/0x1c0
<7>[  113.612156]  netlink_rcv_skb+0x58/0x120
<7>[  113.615986]  genl_rcv+0x34/0x50
<7>[  113.619121]  netlink_unicast+0x1f0/0x2ec
<7>[  113.623037]  netlink_sendmsg+0x19c/0x3d0
<7>[  113.626954]  ____sys_sendmsg+0x258/0x2a0
<7>[  113.630873]  ___sys_sendmsg+0x78/0xc0
<7>[  113.634528]  __sys_sendmsg+0x54/0xb0
<7>[  113.638096]  __arm64_sys_sendmsg+0x20/0x30
<7>[  113.642183]  invoke_syscall+0x44/0x110
<7>[  113.645927]  el0_svc_common.constprop.0+0x48/0xf0
<7>[  113.650623]  do_el0_svc+0x18/0x20
<7>[  113.653930]  el0_svc+0x14/0x50
<7>[  113.656979]  el0t_64_sync_handler+0xe0/0x110
<7>[  113.661240]  el0t_64_sync+0x158/0x15c
<0>[  113.664899] Code: eb02001f 540000e0 f94276a2 f9427aa1 (f9000441)
<4>[  113.670983] ---[ end trace 6e69b3da17b58560 ]---

Going to roll back to my previous image for now. But happy to have another go at this once we identify the crash in this case!

Thank you! Wow, you got that to crash fast. Maybe I should enable WED.
The next thing I need to learn is how to map a function+offset, such as mt7615_mac_sta_remove+0x100, back to a line of code.

Interesting that the dereference address is 8 instead of all zeros.
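
A fault address of 8 rather than 0 usually means the code went through a NULL struct pointer and touched a member that sits 8 bytes into the struct. The sketch below shows that arithmetic with a user-space copy of the kernel's `struct list_head` (whose `prev` follows `next`), and decodes the faulting instruction word from the `Code:` line, assuming the A64 `STR (immediate, unsigned offset)` encoding. `fault_address` and `decode_str_imm64` are illustrative helpers, and whether a list node is really the culprit here is only a guess at this point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space copy of the kernel's doubly linked list node layout. */
struct list_head {
    struct list_head *next, *prev;
};

/* Address touched when a member at member_offset is accessed via base. */
static uintptr_t fault_address(uintptr_t base, size_t member_offset)
{
    return base + member_offset;
}

/* Fields of an A64 "STR Xt, [Xn, #imm]" (64-bit, unsigned offset). */
struct str_imm {
    unsigned rt;          /* register being stored */
    unsigned rn;          /* base address register */
    unsigned byte_offset; /* imm12 scaled by 8 */
};

/* Returns 1 and fills *out if insn is that STR form, otherwise 0. */
static int decode_str_imm64(uint32_t insn, struct str_imm *out)
{
    assert(out != NULL);
    if ((insn & 0xFFC00000u) != 0xF9000000u)
        return 0;
    out->rt = insn & 0x1F;
    out->rn = (insn >> 5) & 0x1F;
    out->byte_offset = ((insn >> 10) & 0xFFF) * 8;
    return 1;
}
```

Fed the word in parentheses from the dump (`f9000441`), the decoder reports a store of x1 through x2 at offset 8, and the register dump shows x2 as zero. That lines up with `WnR = 1` (a write) and a fault at address 8.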

I had a script to pull out file:line_number references to match each function+offset against the build_dir. But it is failing me at the moment. I'll keep banging on that script to see if I can get it back to working, but probably won't mess with it anymore tonight (it's 12:50am).
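
For reference, the first step of such a script is splitting each call-trace entry into symbol, offset, function size, and module. Here is a small sketch of that step; `struct frame` and `parse_frame` are hypothetical names, not part of any existing tool. The resulting offset would then be handed to a symbolizer built for the target architecture (for example the cross toolchain's addr2line or gdb) together with the unstripped module.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

struct frame {
    char func[64];        /* symbol name, e.g. mt7615_mac_sta_remove */
    unsigned long offset; /* offset into the function */
    unsigned long size;   /* total size of the function */
    char module[32];      /* module name, empty for built-in code */
};

/* Parse "[prefix] func+0xOFF/0xSIZE [module]"; returns 1 on success. */
static int parse_frame(const char *line, struct frame *f)
{
    const char *plus, *start;

    assert(line != NULL && f != NULL);
    memset(f, 0, sizeof(*f));

    plus = strchr(line, '+');
    if (!plus)
        return 0;

    /* Walk back from '+' over the symbol, skipping any log prefix. */
    start = plus;
    while (start > line &&
           (isalnum((unsigned char)start[-1]) ||
            start[-1] == '_' || start[-1] == '.'))
        start--;

    /* The module part is optional, so three matched fields is enough. */
    return sscanf(start, "%63[^+]+0x%lx/0x%lx [%31[^]]]",
                  f->func, &f->offset, &f->size, f->module) >= 3;
}
```

Walking backwards from the `+` means the same function works on raw pstore lines (with the `<7>[ timestamp]` prefix) and on lines that have already been trimmed.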

That said, any clue why mt7615_mac_sta_remove() was called and not mt7915_mac_sta_remove()?

My RT3200s have mt7622 and mt7915e radios, so does the mt7615e driver control the mt7622 device? :face_with_monocle:

mac80211              495616  5 mt7915e,mt7615e,mt7615_common,mt76_connac_lib,mt76
mt76                   65536  4 mt7915e,mt7615e,mt7615_common,mt76_connac_lib
mt76_connac_lib        45056  3 mt7915e,mt7615e,mt7615_common
mt7615_common          81920  1 mt7615e
mt7615e                24576  0

Yes, that's for the mt7622. It's the same with the mt7915 driver, which is used for other mt79 devices.

I'm actually trying with the opposite approach now. I pushed your latest commit back to just one of my RT3200s, but specifically disabled WED on it this time. Going to see what happens this time around :slight_smile:

Update 1 -- Hasn't crashed yet, so that's definitely an improvement over my testing last night. I am starting to see output from your dev_info() call:

[  269.593191] mt7915e 0000:01:00.0: done sta_remove=0x00000000eea854c3
[  272.553166] mt7915e 0000:01:00.0: done sta_remove=0x00000000054c4333
[  444.351924] mt7915e 0000:01:00.0: done sta_remove=0x0000000033048b82
[  769.559443] mt7915e 0000:01:00.0: done sta_remove=0x00000000d73db3d1
[  796.979196] mt7915e 0000:01:00.0: done sta_remove=0x00000000cdc79f6f
[  797.179192] mt7622-wmac 18000000.wmac: done sta_remove=0x000000002429b81a
[  799.129176] mt7622-wmac 18000000.wmac: done sta_remove=0x000000002e7285df

Update 2 -- @Brain2000 I was getting pretty optimistic, but hit another crash:

...
<6>[  269.593191] mt7915e 0000:01:00.0: done sta_remove=0x00000000eea854c3
<6>[  272.553166] mt7915e 0000:01:00.0: done sta_remove=0x00000000054c4333
<6>[  444.351924] mt7915e 0000:01:00.0: done sta_remove=0x0000000033048b82
<6>[  769.559443] mt7915e 0000:01:00.0: done sta_remove=0x00000000d73db3d1
<6>[  796.979196] mt7915e 0000:01:00.0: done sta_remove=0x00000000cdc79f6f
<6>[  797.179192] mt7622-wmac 18000000.wmac: done sta_remove=0x000000002429b81a
<6>[  799.129176] mt7622-wmac 18000000.wmac: done sta_remove=0x000000002e7285df
<6>[ 1557.372269] mt7622-wmac 18000000.wmac: done sta_remove=0x000000001db9611f
<1>[ 1560.086582] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
<1>[ 1560.095396] Mem abort info:
<1>[ 1560.098180]   ESR = 0x0000000096000045
<1>[ 1560.101920]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[ 1560.107239]   SET = 0, FnV = 0
<1>[ 1560.110295]   EA = 0, S1PTW = 0
<1>[ 1560.113444]   FSC = 0x05: level 1 translation fault
<1>[ 1560.118323] Data abort info:
<1>[ 1560.121198]   ISV = 0, ISS = 0x00000045
<1>[ 1560.125035]   CM = 0, WnR = 1
<1>[ 1560.128002] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000044c31000
<1>[ 1560.134452] [0000000000000008] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
<0>[ 1560.143175] Internal error: Oops: 96000045 [#1] SMP
<7>[ 1560.148054] Modules linked in: nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mac80211 cfg80211 nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c hwmon compat cls_flower act_vlan cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact cryptodev autofs4 seqiv authencesn authenc leds_gpio gpio_button_hotplug
<7>[ 1560.218355] CPU: 0 PID: 1557 Comm: hostapd Tainted: G S                5.15.98 #0
<7>[ 1560.225831] Hardware name: Linksys E8450 (UBI) (DT)
<7>[ 1560.230699] pstate: a0000005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
<7>[ 1560.237653] pc : mt7915_mac_sta_remove+0xb8/0x160 [mt7915e]
<7>[ 1560.243235] lr : mt7915_mac_sta_remove+0x58/0x160 [mt7915e]
<7>[ 1560.248805] sp : ffffffc0090bb770
<7>[ 1560.252109] x29: ffffffc0090bb770 x28: ffffff8001cec200 x27: ffffffc0090bbdb0
<7>[ 1560.259241] x26: ffffff8000066880 x25: ffffffc008bbf1c0 x24: 0000000000000000
<7>[ 1560.266372] x23: ffffffc000a1ff58 x22: ffffff80027caaa0 x21: ffffff80027cacc0
<7>[ 1560.273503] x20: ffffff8002799e00 x19: 0000000000000008 x18: 0000000000000000
<7>[ 1560.280633] x17: 0000000000000000 x16: 0000000000000000 x15: ffffff8002790886
<7>[ 1560.287763] x14: 0000000000000000 x13: 00003d106ea68d44 x12: ffffffc0088186b0
<7>[ 1560.294894] x11: 000000000000001d x10: 0000000000000840 x9 : ffffffc0090bb530
<7>[ 1560.302024] x8 : ffffff8001cecaa0 x7 : ffffff8002796020 x6 : 000000000000000c
<7>[ 1560.309154] x5 : 000000000000002a x4 : 0000000000000000 x3 : 0000000000000000
<7>[ 1560.316285] x2 : 0000000000000000 x1 : ffffff80027caf88 x0 : ffffff80027caea0
<7>[ 1560.323417] Call trace:
<7>[ 1560.325854]  mt7915_mac_sta_remove+0xb8/0x160 [mt7915e]
<7>[ 1560.331077]  __mt76_sta_remove+0x68/0xe4 [mt76]
<7>[ 1560.335609]  mt76_sta_pre_rcu_remove+0x44/0x10c [mt76]
<7>[ 1560.340756]  ieee80211_find_sta_by_link_addrs+0x430/0x540 [mac80211]
<7>[ 1560.347131]  sta_info_destroy_addr_bss+0x38/0x70 [mac80211]
<7>[ 1560.352717]  ieee80211_color_change_finish+0x1278/0x1500 [mac80211]
<7>[ 1560.358996]  cfg80211_check_station_change+0x1384/0x4720 [cfg80211]
<7>[ 1560.365272]  genl_family_rcv_msg_doit+0xb4/0x110
<7>[ 1560.369888]  genl_rcv_msg+0xd0/0x1c0
<7>[ 1560.373458]  netlink_rcv_skb+0x58/0x120
<7>[ 1560.377287]  genl_rcv+0x34/0x50
<7>[ 1560.380422]  netlink_unicast+0x1f0/0x2ec
<7>[ 1560.384338]  netlink_sendmsg+0x19c/0x3d0
<7>[ 1560.388254]  ____sys_sendmsg+0x258/0x2a0
<7>[ 1560.392174]  ___sys_sendmsg+0x78/0xc0
<7>[ 1560.395828]  __sys_sendmsg+0x54/0xb0
<7>[ 1560.399395]  __arm64_sys_sendmsg+0x20/0x30
<7>[ 1560.403483]  invoke_syscall+0x44/0x110
<7>[ 1560.407226]  el0_svc_common.constprop.0+0x48/0xf0
<7>[ 1560.411922]  do_el0_svc+0x18/0x20
<7>[ 1560.415228]  el0_svc+0x14/0x50
<7>[ 1560.418275]  el0t_64_sync_handler+0xe0/0x110
<7>[ 1560.422537]  el0t_64_sync+0x158/0x15c
<0>[ 1560.426195] Code: 911002c0 eb02003f 540000c0 a94e8803 (f9000462)
<4>[ 1560.432281] ---[ end trace edae0d8c7a14bcef ]---

Again, this was with WED disabled this time.

I just hit the same crash this morning. Turning back on the AMSDU offload is making it crash more, so that's helpful for this cause.

I notice that it only crashes when the driver shows "mt7622-wmac" instead of "mt7915e".

Our first crashes were both in mt7615_mac_sta_remove, but I see yours also died in mt7915_mac_sta_remove. So I have a feeling there is a null pointer passed by "mt7622-wmac" that crashes either sta_remove function.

I couldn't spot which line it was on last night, and I couldn't get addr2line working because the architecture is different. So many obstacles.

So my next step is to go at this the old fashioned way and add debugging output on every line in the mac_sta_remove functions. If I can't spot the pointer, at least I'll know the last line that ran before it crashes.

Agreed--same issue I was banging my head against last night while trying to decode my stack trace. I ran across this and am sure it probably has some significance to the architecture issue, but I'm not fully wrapping my head around how to apply the concept to addr2line.

Ok, one step closer, here's the crash:

<6>[ 3081.387841] mt7915e 0000:01:00.0: start sta_remove=0x00000000460fcad2
<6>[ 3081.394305] mt7915e 0000:01:00.0: mt7915_mac_sta_remove mdev=0x00000000cb2fc8d8 vif=0x0000000059d98cbf
<6>[ 3081.403702] mt7915e 0000:01:00.0: mt7915_mac_sta_remove dev=0x00000000cb2fc8d8
<6>[ 3081.410954] mt7915e 0000:01:00.0: mt7915_mac_sta_remove msta=0x00000000a7895a6c
<6>[ 3081.418909] mt7915e 0000:01:00.0: mt7915_mac_sta_remove done mcu_add_sta
<6>[ 3081.425676] mt7915e 0000:01:00.0: mt7915_mac_sta_remove done mac_wtlb_update
<6>[ 3081.432730] mt7915e 0000:01:00.0: mt7915_mac_sta_remove done mac_twt_teardown_flow
<1>[ 3081.440340] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008

And here's the code:

	for (i = 0; i < ARRAY_SIZE(msta->twt.flow); i++)
		mt7915_mac_twt_teardown_flow(dev, msta, i);
	dev_info(mdev->dev, "mt7915_mac_sta_remove done mac_twt_teardown_flow\n");

	spin_lock_bh(&dev->sta_poll_lock);
	/* I think we've narrowed it down to this block; I'm going to output the poll list and rc list next */
	if (!list_empty(&msta->poll_list))
		list_del_init(&msta->poll_list);
	if (!list_empty(&msta->rc_list))
		list_del_init(&msta->rc_list);
	/* end of suspect block */
	spin_unlock_bh(&dev->sta_poll_lock);

	dev_info(mdev->dev, "mt7915_mac_sta_remove done list_del_init\n");
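
One way a `list_empty()` / `list_del_init()` pair like this can write to address 8 (an assumption, not something confirmed in this thread) is a node whose `next` pointer is NULL, either because it was never initialized or because an unlocked update elsewhere clobbered it. `list_empty()` then reports "not empty", and the unlink writes through `next->prev`. The user-space sketch below re-implements the list helpers to show the check; `list_node_is_corrupt` is a hypothetical debugging aid, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An initialized, unlinked node points at itself in both directions. */
static void INIT_LIST_HEAD(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* "Empty" means next points back at the node itself, not NULL. */
static int list_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Insert n right after head. */
static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink e and re-initialize it; dereferences e->next and e->prev. */
static void list_del_init(struct list_head *e)
{
    e->next->prev = e->prev; /* NULL next => write to address 0 + 8 */
    e->prev->next = e->next;
    INIT_LIST_HEAD(e);
}

/* Hypothetical guard: true when unlinking e would go through NULL. */
static int list_node_is_corrupt(const struct list_head *e)
{
    return e->next == NULL || e->prev == NULL;
}
```

A zero-filled node passes the `!list_empty()` test in the snippet above, so a debug print of `msta->poll_list.next` and `msta->rc_list.next` just before the unlink would show whether that is what happens here.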

@_FailSafe I found a missing spinlock in mt7615/mac.c that should be protecting msta->poll_list in mt7615_mac_sta_poll().

From what I've observed, mt7615 is shown when you disconnect a 2.4 GHz device, whereas mt7915 is shown when you disconnect a 5 GHz device.

Also, that spinlock that I thought we didn't need in mt7915/mac.c... well, we do need it. It's the same spinlock that was missing in mt7615.

I updated the repo, this new test version has both of them...

Awesome progress--you're crushing it! I'm building from your 50f011e commit now and will let you know in a little while if it looks like you struck gold with this one.

Updates to follow...

If you pinpoint a clear bug and identify the fix, please highlight it quickly to @nbd in the mt76 repo, so that it can be implemented in the driver.

Once I've confirmed a fix I will submit a PR and tag Felix on it. He's aware I'm knee deep in this.
I just don't want to submit anything until I've verified for sure this is the right fix.