Mt76 wireless driver debugging

Mine crashed. I verified all poll_list and rc_list manipulation is surrounded by the same spin lock across the board. I'm adding more debug output about the list's next/previous pointers.

I'm wondering if something that goes through an adjacent array in the mt7915 sta structure is going one too far, wiping out one of the next/prev list pointers....

Got it!!!
<6>[ 689.901686] mt7915e 0000:01:00.0: mt7915_mac_sta_remove poll_list next=0x0000000000000000 prev=0x0000000000000000
<1>[ 689.911994] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008

I have no idea why next/prev are 0; something has come along and wiped them out. I can check for that condition and skip trying to clear the list. The faulting address is 8 because the next/prev list pointers are 64 bits each and I believe it's accessing the prev pointer (the second field, at offset 8) first... thus 8.

Masking this is probably not the ideal fix, but it might stop the crashing until we can find out what is zeroing those.
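
For anyone following along, here is a minimal userspace sketch of why the fault address comes out as 8 and what the "check and skip" workaround amounts to. It assumes 64-bit pointers and redeclares struct list_head with the same two-pointer layout the kernel uses; null_next_fault_offset( ) and safe_unlink( ) are hypothetical names for illustration, not functions from mt76.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace copy of the kernel's two-pointer list node layout. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Unlinking an entry stores into next->prev. If next is NULL, that store
 * lands at address 0 + offsetof(struct list_head, prev), which is 8 when
 * pointers are 64 bits wide.
 */
static inline size_t null_next_fault_offset(void)
{
	return offsetof(struct list_head, prev);
}

/*
 * Hypothetical guard: unlink only when both pointers are set.
 * Returns true if the entry was unlinked, false if it was skipped.
 */
static inline bool safe_unlink(struct list_head *e)
{
	assert(e != NULL);

	if (!e->next || !e->prev)
		return false;	/* pointers were wiped, nothing safe to do */

	e->next->prev = e->prev;
	e->prev->next = e->next;
	e->next = e;		/* leave the entry self-linked, like list_del_init */
	e->prev = e;
	return true;
}
```

As noted above, a guard like this only masks the symptom; it does not explain what zeroed the pointers.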

Repo updated.

Awesome progress, again! I’ll be glad to build and test in the morning!

One of the recent snapshot commits caused a breakage in my build machine, so I spent tonight just trying to get my RT3200 image to build successfully again. Thankfully, I’m back in business, so I will pull any of your additional commits and start testing again in a few hours from now.

Thanks for your persistence and resolve to figure this out, and hopefully get to a solid fix soon :pray:

Woahhh... I just noticed something in the logs. When it crashes, it is trying to remove station idx 0. Umm... I don't think that's supposed to happen. This is getting meaty!!

start sta_remove=0x00000000a70616bd idx=0  <------ IDX 0 ???
mt7915_mac_sta_remove poll_list next=0x0000000000000000 prev=0x0000000000000000
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008

mt7915/init.c allocates idx 0 on purpose, with a comment:

	/* Beacon and mgmt frames should occupy wcid 0 */
	idx = mt76_wcid_alloc(dev->mt76.wcid_mask, MT7915_WTBL_STA);
	if (idx)
		return -ENOSPC;

So if sta_remove tries to remove wcid idx 0.... um.... the world might explode
I'm thinking the logic if (!wcid->idx || !wcid->sta) skip might be needed in sta_remove.
This might be a kernel bug.

I will also note, sta pointers are sometimes the same addresses (just the nature of kmalloc/kfree). The address referenced above, 0x00000000a70616bd, was used three seconds prior in my log for a valid STA pointer, so this is definitely not the real STA idx 0, which is a pointer to global_wcid that never changes. I did confirm that mac80211 zeroes the memory for the drv_priv allocation, which is where the wcid structure lives. So if the wcid idx was not filled out in the sta_add( ) function, it WILL be 0 in here. In fact, everything will be 0 here. Which is kinda what I'm seeing.
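
To make the collision concrete, here is a small userspace sketch under stated assumptions: a simplified first-free-bit allocator stands in for mt76_wcid_alloc (it is not the real implementation), and struct sta_priv is a hypothetical stand-in for the driver's private data that lives in drv_priv. The point is only that the first allocation hands out index 0, and that a zero-filled private struct also reads as index 0.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define WTBL_SIZE 32	/* hypothetical table size for this sketch */

/* Hypothetical stand-in for the per-station data stored in drv_priv. */
struct sta_priv {
	uint16_t idx;	/* wcid index, normally filled in by sta_add */
	uint8_t sta;	/* nonzero for a real station entry */
};

/* Simplified allocator: claim the lowest clear bit, or return -1 if full. */
static inline int wcid_alloc(uint32_t *mask)
{
	assert(mask != NULL);

	for (int i = 0; i < WTBL_SIZE; i++) {
		if (!(*mask & (UINT32_C(1) << i))) {
			*mask |= UINT32_C(1) << i;
			return i;
		}
	}
	return -1;
}

/* Zero-filled allocation, mirroring how drv_priv is described above. */
static inline struct sta_priv *sta_priv_alloc(void)
{
	return calloc(1, sizeof(struct sta_priv));
}
```

In this sketch the init-time allocation gets index 0, real stations get 1 and up, and any private struct that never went through sta_add is indistinguishable from "index 0" unless something else (such as the sta flag) is checked as well.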

Repo has been updated.

I've been running a build with commit 0c6c60a since early this AM and I'm not sure if this is related to the interesting discoveries you just posted about, but I did hit a different crash:

Crash Details
[ 2817.012567] mt7622-wmac 18000000.wmac: done sta_remove=0x000000005802f5bb
[ 2921.111546] mt7915e 0000:01:00.0: done sta_remove=0x000000005d2bc111
[ 2921.241551] mt7915e 0000:01:00.0: done sta_remove=0x000000005d2bc111
[ 2962.491082] mt7915e 0000:01:00.0: done sta_remove=0x000000005d2bc111
[ 2979.760870] mt7622-wmac 18000000.wmac: done sta_remove=0x00000000c75dc093
[ 4765.322943] mt7915e 0000:01:00.0: done sta_remove=0x00000000047f5a9d
[ 4882.512034] mt7915e 0000:01:00.0: done sta_remove=0x000000005e7458fc
[ 4893.971947] mt7622-wmac 18000000.wmac: done sta_remove=0x00000000a6871dcc
[ 5004.941074] mt7915e 0000:01:00.0: done sta_remove=0x000000006998914d
[ 5008.451038] mt7915e 0000:01:00.0: done sta_remove=0x000000006998914d
[ 5026.630837] mt7622-wmac 18000000.wmac: done sta_remove=0x000000005802f5bb
[ 5248.058751] mt7915e 0000:01:00.0: done sta_remove=0x000000000d1ef23d
[ 5402.947382] mt7915e 0000:01:00.0: done sta_remove=0x000000005d2bc111
[ 5403.167383] mt7915e 0000:01:00.0: done sta_remove=0x000000005d2bc111
[ 5430.967151] mt7915e 0000:01:00.0: done sta_remove=0x00000000037fc14e
[ 5452.646971] mt7915e 0000:01:00.0: Message 000025ed (seq 15) timeout
[ 5452.686977] mt7915e 0000:01:00.0: done sta_remove=0x000000000d1ef23d
[ 5473.126800] mt7915e 0000:01:00.0: Message 00005aed (seq 1) timeout
[ 5493.606629] mt7915e 0000:01:00.0: Message 0000aded (seq 2) timeout
[ 5514.086422] mt7915e 0000:01:00.0: Message 00005aed (seq 3) timeout
[ 5534.566179] mt7915e 0000:01:00.0: Message 00005aed (seq 4) timeout
[ 5555.045963] mt7915e 0000:01:00.0: Message 00005aed (seq 5) timeout
[ 5575.525746] mt7915e 0000:01:00.0: Message 000032ed (seq 6) timeout
[ 5596.005540] mt7915e 0000:01:00.0: Message 00005aed (seq 7) timeout
[ 5616.485319] mt7915e 0000:01:00.0: Message 000036ed (seq 8) timeout
[ 5636.965114] mt7915e 0000:01:00.0: Message 00005aed (seq 9) timeout
[ 5657.444908] mt7915e 0000:01:00.0: Message 00005aed (seq 10) timeout
[ 5677.924698] mt7915e 0000:01:00.0: Message 00005aed (seq 11) timeout
[ 5698.404491] mt7915e 0000:01:00.0: Message 00005aed (seq 12) timeout
[ 5718.884286] mt7915e 0000:01:00.0: Message 00005aed (seq 13) timeout
[ 5739.364091] mt7915e 0000:01:00.0: Message 00005aed (seq 14) timeout
[ 5759.843895] mt7915e 0000:01:00.0: Message 00005aed (seq 15) timeout
[ 5780.323695] mt7915e 0000:01:00.0: Message 00005aed (seq 1) timeout
[ 5800.803493] mt7915e 0000:01:00.0: Message 00005aed (seq 2) timeout
[ 5821.283304] mt7915e 0000:01:00.0: Message 0000aded (seq 3) timeout
[ 5841.763114] mt7915e 0000:01:00.0: Message 00005aed (seq 4) timeout
[ 5862.242926] mt7915e 0000:01:00.0: Message 0000aded (seq 5) timeout
[ 5882.722734] mt7915e 0000:01:00.0: Message 00005aed (seq 6) timeout
[ 5903.202549] mt7915e 0000:01:00.0: Message 00005aed (seq 7) timeout
[ 5923.682364] mt7915e 0000:01:00.0: Message 00005aed (seq 8) timeout
[ 5944.162173] mt7915e 0000:01:00.0: Message 0000aded (seq 9) timeout
[ 5964.641994] mt7915e 0000:01:00.0: Message 00005aed (seq 10) timeout
[ 5985.121812] mt7915e 0000:01:00.0: Message 000025ed (seq 11) timeout
[ 5985.128102] wl1-ap1: HW problem - can not stop rx aggregation for a4:83:e7:xx:xx:xx tid 0
[ 6005.601634] mt7915e 0000:01:00.0: Message 00005aed (seq 12) timeout
[ 6026.081439] mt7915e 0000:01:00.0: Message 000025ed (seq 13) timeout
[ 6026.087727] wl1-ap1: HW problem - can not stop rx aggregation for a4:83:e7:xx:xx:xx tid 1
[ 6046.561213] mt7915e 0000:01:00.0: Message 00005aed (seq 14) timeout
[ 6067.040983] mt7915e 0000:01:00.0: Message 000025ed (seq 15) timeout
[ 6067.047269] wl1-ap1: HW problem - can not stop rx aggregation for a4:83:e7:xx:xx:xx tid 6
[ 6087.520759] mt7915e 0000:01:00.0: Message 00005aed (seq 1) timeout
[ 6108.000535] mt7915e 0000:01:00.0: Message 000025ed (seq 2) timeout
[ 6108.006779] ------------[ cut here ]------------
[ 6108.011388] WARNING: CPU: 1 PID: 1553 at ___ieee80211_stop_tx_ba_session+0x348/0x3ac [mac80211]
[ 6108.020124] Modules linked in: nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mac80211 cfg80211 nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c hwmon compat cls_flower act_vlan cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact cryptodev autofs4 seqiv authencesn authenc leds_gpio gpio_button_hotplug
[ 6108.090416] CPU: 1 PID: 1553 Comm: hostapd Tainted: G S                5.15.98 #0
[ 6108.097893] Hardware name: Linksys E8450 (UBI) (DT)
[ 6108.102761] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6108.109715] pc : ___ieee80211_stop_tx_ba_session+0x348/0x3ac [mac80211]
[ 6108.116344] lr : ___ieee80211_stop_tx_ba_session+0x210/0x3ac [mac80211]
[ 6108.122970] sp : ffffffc00917b7a0
[ 6108.126276] x29: ffffffc00917b7a0 x28: ffffff8000c42c00 x27: ffffffc00917bdb0
[ 6108.133407] x26: ffffff8000066880 x25: ffffff80027a0880 x24: ffffffc008bc5000
[ 6108.140538] x23: ffffffc0009c5344 x22: ffffff8001da00e8 x21: 0000000000000003
[ 6108.147669] x20: ffffff80061df400 x19: ffffff8001da6000 x18: ffffffc008aea320
[ 6108.154799] x17: 00000000000025a0 x16: ffffffc008ed7000 x15: 0000000000000519
[ 6108.161929] x14: 00000000000001b3 x13: ffffffc00917b2a8 x12: ffffffc008b42320
[ 6108.169060] x11: 6465353230303030 x10: ffffffc008b42320 x9 : 0000000000000000
[ 6108.176190] x8 : 0000000000000002 x7 : 0000000000000000 x6 : 0000000cd2f5ebe9
[ 6108.183319] x5 : 0000000000000000 x4 : ffffffc008ad8000 x3 : 0000000000000000
[ 6108.190449] x2 : 0000000000000001 x1 : 0000000000000002 x0 : 00000000ffffff92
[ 6108.197579] Call trace:
[ 6108.200017]  ___ieee80211_stop_tx_ba_session+0x348/0x3ac [mac80211]
[ 6108.206297]  ieee80211_sta_tear_down_BA_sessions+0x74/0x130 [mac80211]
[ 6108.212836]  ieee80211_find_sta_by_link_addrs+0x370/0x540 [mac80211]
[ 6108.219202]  sta_info_destroy_addr_bss+0x38/0x70 [mac80211]
[ 6108.224785]  ieee80211_color_change_finish+0x1278/0x1500 [mac80211]
[ 6108.231063]  cfg80211_check_station_change+0x1384/0x4720 [cfg80211]
[ 6108.237339]  genl_family_rcv_msg_doit+0xb4/0x110
[ 6108.241951]  genl_rcv_msg+0xd0/0x1c0
[ 6108.245518]  netlink_rcv_skb+0x58/0x120
[ 6108.249350]  genl_rcv+0x34/0x50
[ 6108.252484]  netlink_unicast+0x1f0/0x2ec
[ 6108.256401]  netlink_sendmsg+0x19c/0x3d0
[ 6108.260317]  ____sys_sendmsg+0x258/0x2a0
[ 6108.264236]  ___sys_sendmsg+0x78/0xc0
[ 6108.267890]  __sys_sendmsg+0x54/0xb0
[ 6108.271457]  __arm64_sys_sendmsg+0x20/0x30
[ 6108.275546]  invoke_syscall+0x44/0x110
[ 6108.279289]  el0_svc_common.constprop.0+0x48/0xf0
[ 6108.283985]  do_el0_svc+0x18/0x20
[ 6108.287292]  el0_svc+0x14/0x50
[ 6108.290341]  el0t_64_sync_handler+0xe0/0x110
[ 6108.294604]  el0t_64_sync+0x158/0x15c
[ 6108.298259] ---[ end trace 47e6052115c7256a ]---
[ 6128.490312] mt7915e 0000:01:00.0: Message 00005aed (seq 3) timeout
[ 6148.960098] mt7915e 0000:01:00.0: Message 000025ed (seq 4) timeout
[ 6149.010090] mt7915e 0000:01:00.0: done sta_remove=0x00000000041df84e
[ 6169.439878] mt7915e 0000:01:00.0: Message 00005aed (seq 5) timeout
[ 6189.919664] mt7915e 0000:01:00.0: Message 000025ed (seq 6) timeout
[ 6189.925862] wl1-ap1: failed to remove key (0, a4:83:e7:xx:xx:xx) from hardware (-110)
...
[10879.792431] mt7915e 0000:01:00.0: Message 00005aed (seq 10) timeout
[10900.272254] mt7915e 0000:01:00.0: Message 00005aed (seq 11) timeout
[10920.752082] mt7915e 0000:01:00.0: Message 00005aed (seq 12) timeout
[10920.862363] wl1-ap0: failed to remove key (0, 90:81:58:xx:xx:87) from hardware (-12)
[10920.942100] wl1-ap1: failed to remove key (0, 20:69:80:xx:xx:b4) from hardware (-12)
[10921.082125] mt7915e 0000:01:00.0: done sta_remove=0x000000006998914d
[10921.172139] wl1-ap0: HW problem - can not stop rx aggregation for 90:81:58:xx:xx:87 tid 0
[10921.232089] wl1-ap0: HW problem - can not stop rx aggregation for 90:81:58:xx:xx:87 tid 1
[10921.272080] wl1-ap0: HW problem - can not stop rx aggregation for 90:81:58:xx:xx:87 tid 6
[10921.312113] ------------[ cut here ]------------
[10921.316742] WARNING: CPU: 0 PID: 1553 at ___ieee80211_stop_tx_ba_session+0x348/0x3ac [mac80211]
[10921.325479] Modules linked in: nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mac80211 cfg80211 nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c hwmon compat cls_flower act_vlan cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact cryptodev autofs4 seqiv authencesn authenc leds_gpio gpio_button_hotplug
[10921.395771] CPU: 0 PID: 1553 Comm: hostapd Tainted: G S      W         5.15.98 #0
[10921.403249] Hardware name: Linksys E8450 (UBI) (DT)
[10921.408119] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[10921.415074] pc : ___ieee80211_stop_tx_ba_session+0x348/0x3ac [mac80211]
[10921.421717] lr : ___ieee80211_stop_tx_ba_session+0x210/0x3ac [mac80211]
[10921.428342] sp : ffffffc00917b7a0
[10921.431648] x29: ffffffc00917b7a0 x28: ffffff8000c42c00 x27: ffffffc00917bdb0
[10921.438779] x26: ffffff8000066880 x25: ffffff80027a0880 x24: ffffffc008bc5000
[10921.445911] x23: ffffffc0009c5344 x22: ffffff80035600e8 x21: 0000000000000003
[10921.453041] x20: ffffff8006200300 x19: ffffff8003566000 x18: ffffff80061d6358
[10921.460171] x17: 0000000000000000 x16: 0000000000000000 x15: ffffff80027a0886
[10921.467301] x14: 0000000000000000 x13: 0000000000000030 x12: 0101010101010101
[10921.474431] x11: 0000000000000200 x10: 0000000000000840 x9 : 0000000000000000
[10921.481561] x8 : ffffff8009e6be4a x7 : 0000000000000000 x6 : 0000000000000000
[10921.488692] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffff8000c42c00
[10921.495821] x2 : 0000000000000000 x1 : ffffff8000c42c00 x0 : 00000000fffffff4
[10921.502953] Call trace:
[10921.505391]  ___ieee80211_stop_tx_ba_session+0x348/0x3ac [mac80211]
[10921.511673]  ieee80211_sta_tear_down_BA_sessions+0x74/0x130 [mac80211]
[10921.518213]  ieee80211_find_sta_by_link_addrs+0x370/0x540 [mac80211]
[10921.524578]  sta_info_destroy_addr_bss+0x38/0x70 [mac80211]
[10921.530162]  ieee80211_color_change_finish+0x1278/0x1500 [mac80211]
[10921.536442]  cfg80211_check_station_change+0x1384/0x4720 [cfg80211]
[10921.542717]  genl_family_rcv_msg_doit+0xb4/0x110
[10921.547330]  genl_rcv_msg+0xd0/0x1c0
[10921.550896]  netlink_rcv_skb+0x58/0x120
[10921.554727]  genl_rcv+0x34/0x50
[10921.557862]  netlink_unicast+0x1f0/0x2ec
[10921.561778]  netlink_sendmsg+0x19c/0x3d0
[10921.565695]  ____sys_sendmsg+0x258/0x2a0
[10921.569615]  ___sys_sendmsg+0x78/0xc0
[10921.573270]  __sys_sendmsg+0x54/0xb0
[10921.576837]  __arm64_sys_sendmsg+0x20/0x30
[10921.580925]  invoke_syscall+0x44/0x110
[10921.584669]  el0_svc_common.constprop.0+0x48/0xf0
[10921.589364]  do_el0_svc+0x18/0x20
[10921.592671]  el0_svc+0x14/0x50
[10921.595719]  el0t_64_sync_handler+0xe0/0x110
[10921.599982]  el0t_64_sync+0x158/0x15c
[10921.603638] ---[ end trace 47e6052115c7256b ]---
[10921.642152] ------------[ cut here ]------------
[10921.646780] WARNING: CPU: 0 PID: 1553 at ___ieee80211_stop_tx_ba_session+0x348/0x3ac [mac80211]
[10921.655519] Modules linked in: nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mac80211 cfg80211 nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c hwmon compat cls_flower act_vlan cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact cryptodev autofs4 seqiv authencesn authenc leds_gpio gpio_button_hotplug
[10921.725812] CPU: 0 PID: 1553 Comm: hostapd Tainted: G S      W         5.15.98 #0
[10921.733289] Hardware name: Linksys E8450 (UBI) (DT)
[10921.738158] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[10921.745112] pc : ___ieee80211_stop_tx_ba_session+0x348/0x3ac [mac80211]
[10921.751752] lr : ___ieee80211_stop_tx_ba_session+0x210/0x3ac [mac80211]
[10921.758377] sp : ffffffc00917b7a0
[10921.761682] x29: ffffffc00917b7a0 x28: ffffff8000c42c00 x27: ffffffc00917bdb0
[10921.768814] x26: ffffff8000066880 x25: ffffff80027a0880 x24: ffffffc008bc5000
[10921.775944] x23: ffffffc0009c5344 x22: ffffff8003560508 x21: 0000000000000003
[10921.783076] x20: ffffff8002612100 x19: ffffff8003566000 x18: 0000000000000000
[10921.790206] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffaa0f8b6991
[10921.797336] x14: 0cbf18044c90001a x13: 500018dd00001c00 x12: 010218100009dd00
[10921.804466] x11: 00000000000000bc x10: 0000000000000840 x9 : 0000000000000000
[10921.811597] x8 : ffffff8009e69e4a x7 : 0000000000000000 x6 : 0000000000000000
[10921.818729] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffff8000c42c00
[10921.825858] x2 : 0000000000000000 x1 : ffffff8000c42c00 x0 : 00000000fffffff4
[10921.832989] Call trace:
[10921.835427]  ___ieee80211_stop_tx_ba_session+0x348/0x3ac [mac80211]
[10921.841716]  ieee80211_sta_tear_down_BA_sessions+0x74/0x130 [mac80211]
[10921.848256]  ieee80211_find_sta_by_link_addrs+0x370/0x540 [mac80211]
[10921.854620]  sta_info_destroy_addr_bss+0x38/0x70 [mac80211]
[10921.860205]  ieee80211_color_change_finish+0x1278/0x1500 [mac80211]
[10921.866483]  cfg80211_check_station_change+0x1384/0x4720 [cfg80211]
[10921.872758]  genl_family_rcv_msg_doit+0xb4/0x110
[10921.877369]  genl_rcv_msg+0xd0/0x1c0
[10921.880936]  netlink_rcv_skb+0x58/0x120
[10921.884768]  genl_rcv+0x34/0x50
[10921.887903]  netlink_unicast+0x1f0/0x2ec
[10921.891819]  netlink_sendmsg+0x19c/0x3d0
[10921.895736]  ____sys_sendmsg+0x258/0x2a0
[10921.899655]  ___sys_sendmsg+0x78/0xc0
[10921.903310]  __sys_sendmsg+0x54/0xb0
[10921.906877]  __arm64_sys_sendmsg+0x20/0x30
[10921.910965]  invoke_syscall+0x44/0x110
[10921.914708]  el0_svc_common.constprop.0+0x48/0xf0
[10921.919404]  do_el0_svc+0x18/0x20
[10921.922710]  el0_svc+0x14/0x50
[10921.925758]  el0t_64_sync_handler+0xe0/0x110
[10921.930021]  el0t_64_sync+0x158/0x15c
[10921.933677] ---[ end trace 47e6052115c7256c ]---
[10921.972263] mt7915e 0000:01:00.0: done sta_remove=0x000000005d2bc111
[10922.052116] mt7915e 0000:01:00.0: done sta_remove=0x0000000084c10d2c
[10922.132070] wl1-ap1: HW problem - can not stop rx aggregation for 20:69:80:xx:xx:b4 tid 0
[10922.172113] wl1-ap1: HW problem - can not stop rx aggregation for 20:69:80:xx:xx:b4 tid 1
[10922.222093] wl1-ap1: HW problem - can not stop rx aggregation for 20:69:80:xx:xx:b4 tid 6

I'm going to go ahead and pull your latest commit and start running again with it for a while. Thanks!

Oh yeah, that previous one will crash because idx 0 was getting demolished.
However, that's not to say there aren't other crashes lurking beyond this one...
I saw a very similar stacktrace on another site for a different device running OpenWrt.

Understood! I'm running with your latest commit now and will report back here as anything interesting happens :slight_smile:

My RT3200 has gotten past THREE would-be crashes so far!!
According to the logs, here is what I believe is happening:
sta_remove( ) is called with an already-removed ieee80211_sta pointer after a wifi device rapidly connects and disconnects. What is interesting is that this sta_remove( ) call comes about 5.5 minutes after the rapid connect/disconnect event. Because the drv_priv portion of the ieee80211_sta structure is zeroed out, things go terribly wrong. First there is a kernel oops, because the linked-list pointers are 0 and get dereferenced. If that case is tested for and skipped, then idx 0 ends up being released instead, and idx 0 is needed for beacon and management frames. So I believe the correct fix for the time being is to check for !wcid->idx or !wcid->sta and exit the sta_remove( ) function immediately.

Mar 16 14:01:25 [ 6907.783912] mt7915e start sta_add=0x00000000f53bf1c3 idx=2
Mar 16 14:01:25 [ 6907.795647] mt7915e start sta_remove=0x00000000f53bf1c3 idx=2
Mar 16 14:01:26 [ 6907.902339] mt7915e start sta_add=0x00000000f53bf1c3 idx=2
Mar 16 14:01:26 [ 6907.913964] mt7915e start sta_remove=0x00000000f53bf1c3 idx=2
Mar 16 14:07:10 [ 7251.911771] mt7915e start sta_remove=0x00000000f53bf1c3 idx=0 <--- oops

Right on! This is looking a LOT better now! No crashes for me so far and I've run through my normal test routine a few times:

[   33.176865] br-lan: port 10(wl1-ap1) entered forwarding state
[   33.182640] br-lan: topology change detected, propagating
[ 1231.228062] mt7915e 0000:01:00.0: done sta_remove=0x000000002cafe5b9
[ 4484.935937] mt7915e 0000:01:00.0: done sta_remove=0x00000000b312a7c3
[ 4490.265879] mt7915e 0000:01:00.0: done sta_remove=0x00000000e65e7c99
[ 4766.752732] mt7915e 0000:01:00.0: done sta_remove=0x000000002cafe5b9
[ 4780.112595] mt7915e 0000:01:00.0: done sta_remove=0x0000000032034ead
[ 4787.662534] mt7915e 0000:01:00.0: done sta_remove=0x00000000b312a7c3
[ 4852.741933] mt7915e 0000:01:00.0: done sta_remove=0x000000006b2ceb2f
[ 4853.561950] mt7915e 0000:01:00.0: done sta_remove=0x00000000b312a7c3
[ 4861.601874] mt7915e 0000:01:00.0: done sta_remove=0x00000000e65e7c99
[ 4863.401850] mt7915e 0000:01:00.0: done sta_remove=0x0000000032034ead
[ 4865.131817] mt7915e 0000:01:00.0: done sta_remove=0x00000000e65e7c99
[ 4866.571802] mt7915e 0000:01:00.0: done sta_remove=0x00000000569f78b8
[ 4870.211762] mt7915e 0000:01:00.0: done sta_remove=0x00000000569f78b8
[ 4871.841767] mt7915e 0000:01:00.0: done sta_remove=0x0000000093763e25
[ 4876.131712] mt7915e 0000:01:00.0: done sta_remove=0x00000000e65e7c99
[ 4876.511706] mt7915e 0000:01:00.0: done sta_remove=0x0000000093763e25
[ 4878.601708] mt7915e 0000:01:00.0: done sta_remove=0x00000000e65e7c99
[ 4878.891731] mt7915e 0000:01:00.0: done sta_remove=0x0000000032034ead
[ 4880.001692] mt7915e 0000:01:00.0: done sta_remove=0x00000000569f78b8
[ 4882.321676] mt7915e 0000:01:00.0: done sta_remove=0x0000000032034ead
[ 4979.650829] mt7915e 0000:01:00.0: done sta_remove=0x00000000e65e7c99
[ 4980.370818] mt7915e 0000:01:00.0: done sta_remove=0x00000000569f78b8
[ 4982.520799] mt7915e 0000:01:00.0: done sta_remove=0x0000000032034ead
[ 4987.290788] mt7915e 0000:01:00.0: done sta_remove=0x00000000569f78b8
[ 4989.910744] mt7915e 0000:01:00.0: done sta_remove=0x00000000e65e7c99
[ 4991.190722] mt7915e 0000:01:00.0: done sta_remove=0x00000000569f78b8
[ 4993.870721] mt7915e 0000:01:00.0: done sta_remove=0x00000000e65e7c99
[ 4997.800667] mt7915e 0000:01:00.0: done sta_remove=0x00000000e65e7c99
[ 4998.810650] mt7915e 0000:01:00.0: done sta_remove=0x0000000093763e25
[ 4999.960636] mt7915e 0000:01:00.0: done sta_remove=0x0000000032034ead
[ 5002.640648] mt7915e 0000:01:00.0: done sta_remove=0x0000000093763e25
[ 5003.440617] mt7915e 0000:01:00.0: done sta_remove=0x0000000032034ead
[ 5014.540511] mt7915e 0000:01:00.0: done sta_remove=0x0000000093763e25
[ 5015.470511] mt7915e 0000:01:00.0: done sta_remove=0x0000000032034ead
[ 5060.630100] mt7915e 0000:01:00.0: done sta_remove=0x00000000569f78b8
[ 5297.168021] mt7622-wmac 18000000.wmac: done sta_remove=0x000000009ef6bff3
[ 5825.913234] mt7915e 0000:01:00.0: done sta_remove=0x0000000032034ead
[ 5898.272596] mt7622-wmac 18000000.wmac: done sta_remove=0x00000000c09feabd
[ 6011.941624] mt7915e 0000:01:00.0: done sta_remove=0x0000000093763e25
[ 6031.461455] mt7915e 0000:01:00.0: done sta_remove=0x000000002cafe5b9
[ 7523.985879] mt7915e 0000:01:00.0: done sta_remove=0x0000000093763e25
[ 7524.935867] mt7915e 0000:01:00.0: done sta_remove=0x000000005c5f4425
[ 7539.895731] mt7915e 0000:01:00.0: done sta_remove=0x0000000032034ead
[ 7548.387793] mt7915e 0000:01:00.0: idx==0, skipping standard sta_remove procedure
[ 7563.065496] mt7622-wmac 18000000.wmac: done sta_remove=0x000000000bcf00e5

Update 1: Given the promising progress with this latest build, I'm going to go ahead and throw it on a second RT3200. :+1:

Update 2: I've got this running now on all three of my RT3200s. I've enabled WED again on a couple of them as well to see what the behavior looks like now. Fingers crossed.

I have this weird error where mesh suddenly stops working and devices stop responding to multicast pings:

I will give your branch a try as well. It looks like you added some locks and so on. Maybe it will help. ^^

I'm curious, let me know!

It did not help. :confused:

If you have WED enabled, try disabling that. It's an extra layer and something that we are slated to look into further.

Isn't it disabled by default? I already looked in the modules.d folder to see whether wed is set as a module parameter, but it is not.

Any idea what it could be or where I could look? It seems that some state machine is freezing or something like that. It does not crash, it just stops working. Sometimes I see these wpa_supplicant notifications:

wpa_supplicant[1652]: wlan5-mesh: mesh plink with b8:ec:a3:e1:22:05 closed with reason 55
wpa_supplicant[1652]: wlan5-mesh: MESH-PEER-DISCONNECTED b8:ec:a3:e1:22:05

Am I correct in thinking that these fixes may help with instability on other devices? I have a misbehaving Archer C6 v3.2 which exhibits lockups when heavily loaded (see Archer C6 on 22.03.2 crashes every ~3 days)

I believe you would be correct. Any mac80211 driver that doesn't check to see if the drv_priv variable has been wiped in the sta_remove function could have issues.

static int ath9k_htc_sta_remove(struct ieee80211_hw *hw,
				struct ieee80211_vif *vif,
				struct ieee80211_sta *sta)
{
	struct ath9k_htc_priv *priv = hw->priv;
	struct ath9k_htc_sta *ista = (struct ath9k_htc_sta *) sta->drv_priv;
	int ret;

	/* Example guard I added for illustration; I don't know if this is the correct check. */
	if (!ista->index)
		return 0;

	cancel_work_sync(&ista->rc_update_work);

	mutex_lock(&priv->mutex);
	ath9k_htc_ps_wakeup(priv);
	htc_sta_drain(priv->htc, ista->index);
	ret = ath9k_htc_remove_station(priv, vif, sta);
	ath9k_htc_ps_restore(priv);
	mutex_unlock(&priv->mutex);

	return ret;
}

Can you paste your mesh config portion in the /etc/config/wireless file?

Ooh WED on top of this. I have a feeling it's still going to stop offloading at some point, as I think that's a different issue. The 4096 really has me baffled, I need to read more about how WED is supposed to work.

config wifi-iface 'radio0_if0'
	option device 'radio0'
	option ifname 'wlan5-mesh'
	option mode 'mesh'
	option mesh_id 'Mesh-Freifunk-Berlin'
	option mesh_fwding '0'
	option mcast_rate '12000'
	option network 'mesh_11s_5ghz'

Yeah, it finally stopped offloading on one of my APs, but... NO CRASHES! I was hopeful that the crashes were leading to WED flows stopping, but apparently something else is causing it, as you expected.

I am thrilled about the clean kernel logs with this latest commit--kudos for the progress on it :+1: