Mt76 wireless driver debugging

So I have some possible good news. I have been making modifications to the mt76 driver and compiling the openwrt kernel. I've had the RT3200 running for around 10 days now without a crash.

I found a number of potential race conditions, so filled in some missing rcu_read_locks/synchronize_rcu() as well as making sure some of the ieee80211 kernel functions marked as "do not call these concurrently" can no longer be called concurrently.

I believe the worse that can happen here is slowing things down ever so slightly, but I've been able to push over a gigabit (500 each direction simultaneously) through this thing sustained for an entire day.

(It might be worth mentioning that, since the beginning, I have had both the hardware and software acceleration disabled as well as WED disabled)

I'm going to give this another week, and if there still is not a crash, offer a pull request to the mt76 repository.

Here are the code changes in case you're interested in trying it out:

19 Likes

I'm running with your fork & latest commit on my three RT3200s now. I've been seeing the crash daily--sometimes multiple times per day. I do have WED enabled, so I'll be sure to report back on any crashes (or hopefully the absence thereof!) in a few days.

Many thanks for your work on this!

3 Likes

I had that branch crash yesterday after almost 10 days! So while there were some missing locks (especially around the ieee80211_tx/rx( ) functions) that may cause unrelated issues, the core crash has not been found yet.

HOWEVER, I have two other RT3200's running with slight code modifications, and one of them is NOT crashing. In fact, it hit 23 simultaneously active roaming stations yesterday! Double the max I've ever had on one of these, as they usually crash between 12-15 actively roaming stations.

Here's the line that seems to make the difference. It disables the AMSDU offload. Clearly it's not ideal because it disables an optimization, but it is a significant clue if it continues to remain solid.

mt76\mt7915\mmio.c::mt7915_mmio_probe (around line 1014):

		.drv_flags = MT_DRV_TXWI_NO_FREE | MT_DRV_HW_MGMT_TXQ |
			     MT_DRV_AMSDU_OFFLOAD,

change to:

		.drv_flags = MT_DRV_TXWI_NO_FREE | MT_DRV_HW_MGMT_TXQ,

I'll add this to the github branch later today.

5 Likes

Fantastic headway! FWIW, I woke up to find that none of my RT3200s crashed overnight (!!!). First time in a while.

I still don't know why, but for some reason I continue to notice that with WED enabled, offloading seems to stop after some time. Wireless connectivity continues to work, but I can definitely see the lack of offloaded flows resulting in significantly higher CPU usage than when the flow offloading is functional.

Anyway, I will keep an eye on your repo for any changes you push and will roll another build and test. Many thanks!

I would like to try this repo. What is the best way to pull this into my existing build system?

1 Like

If you're pretty comfortable with the build system, you should be able to apply this patch. You can adjust the PKG_SOURCE_VERSION and PKG_MIRROR_HASH as needed based on the commit from @Brain2000 you wish to test.

1 Like

That's good news!

I was looking at the WED code, and it looks like it only changes the wed token counter it if the token value remains between token_start and token_size. However, the idr_alloc( ) function creates a token value between 0 and token_size. So if token_start is greater than 0, then it's possible that "wake_up(&dev->tx_wait);" will not be called when the token count hits 0. I'm not sure what this wake_up call does, but I have a feeling that if it isn't called, WED will stop working.

This could also be an issue with the firmware bin file that lives directly in the MT7915 controller.
Unfortunately I am not completely familiar with all this code yet, so I can't say for sure.

	token = idr_alloc(&dev->token, *ptxwi, 0, dev->token_size, GFP_ATOMIC);
	if (token >= 0)
		dev->token_count++;

#ifdef CONFIG_NET_MEDIATEK_SOC_WED
	if (mtk_wed_device_active(&dev->mmio.wed) &&
	    token >= dev->mmio.wed.wlan.token_start)
		dev->wed_token_count++;
#endif

then later on

	txwi = idr_remove(&dev->token, token);
	if (txwi) {
		dev->token_count--;

#ifdef CONFIG_NET_MEDIATEK_SOC_WED
		if (mtk_wed_device_active(&dev->mmio.wed) &&
		    token >= dev->mmio.wed.wlan.token_start &&
		    --dev->wed_token_count == 0)
			wake_up(&dev->tx_wait);
#endif

From some testing I performed, dev->mmio.wed.wlan.token_start is 0x1000 at this point in the code. Under my normal usage (and even stress testing), even when I was seeing flows offloaded, token never came close to >= 0x1000. I never saw the wake_up(&dev->tx_wait) invoked in my testing.

Example (where my values are output as hex):

Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.675316] MT76-1: token=0 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.681200] MT76-1: token=1 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.687483] MT76-1: token=2 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.693361] MT76-1: token=3 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.699221] MT76-1: token=4 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.705093] MT76-1: token=5 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.710947] MT76-1: token=6 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.716812] MT76-1: token=7 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.722679] MT76-1: token=8 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.728527] MT76-1: token=9 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.734379] MT76-1: token=f token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.740213] MT76-1: token=10 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.746057] MT76-1: token=11 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.752007] MT76-1: token=12 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.757932] MT76-1: token=13 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.763861] MT76-1: token=14 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.769788] MT76-1: token=15 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.775725] MT76-1: token=16 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.781651] MT76-1: token=17 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.787587] MT76-1: token=18 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.793529] MT76-1: token=19 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.799454] MT76-1: token=1a token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.805440] MT76-1: token=0 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.811374] MT76-1: token=1b token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.817250] MT76-1: token=1c token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.823211] MT76-1: token=1d token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.829145] MT76-1: token=1e token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.835130] MT76-1: token=1 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.841097] MT76-1: token=2 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.847015] MT76-1: token=3 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.852912] MT76-1: token=4 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.858750] MT76-1: token=5 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.864658] MT76-1: token=6 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.870529] MT76-1: token=7 token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.876412] MT76-1: token=a token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.882296] MT76-1: token=b token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.888145] MT76-1: token=c token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.893996] MT76-1: token=d token_start=1000 wed_token_count=0
Fri Mar 10 16:14:18 2023 kern.info kernel: [  727.899853] MT76-1: token=0 token_start=1000 wed_token_count=0

But, I'm pretty rusty with C these days and I've never been a driver developer. So take my feedback lightly here. :slight_smile:

Awesome, you just took all the guesswork out of it!
I'm going to find out why token_start = 1000.

It looks like MT7915_TOKEN_SIZE = 8192 and MT7915_HW_TOKEN_SIZE = 4096, so wed->wlan.token_start = 4096 (0x1000):

You probably were already ahead of me on that, but in case it helps others as they're following along, there it is. :slight_smile:

1 Like

I'm presently mulling over this as I'm puzzling through the stoppage of WED flow offloading I encounter often:
https://patchwork.kernel.org/project/linux-wireless/patch/329444b7decc3853d17488117d5df22cf4c39bad.1673963962.git.lorenzo@kernel.org/

This sits within mac.c: (ref: https://github.com/Brain2000/mt76/blob/master/mt7915/mac.c#L1620)

/* system error recovery */
void mt7915_mac_reset_work(struct work_struct *work)
{
	...

	/* chip full reset */
	if (dev->recovery.restart) {
		/* disable WA/WM WDT */
		mt76_clear(dev, MT_WFDMA0_MCU_HOST_INT_ENA,
			   MT_MCU_CMD_WDT_MASK);

		if (READ_ONCE(dev->recovery.state) & MT_MCU_CMD_WA_WDT)
			dev->recovery.wa_reset_count++;
		else
			dev->recovery.wm_reset_count++;

		mt7915_mac_full_reset(dev);

		/* enable mcu irq */
		mt7915_irq_enable(dev, MT_INT_MCU_CMD);
		mt7915_irq_disable(dev, 0);

		/* enable WA/WM WDT */
		mt76_set(dev, MT_WFDMA0_MCU_HOST_INT_ENA, MT_MCU_CMD_WDT_MASK);

		dev->recovery.state = MT_MCU_CMD_NORMAL_STATE;
		dev->recovery.restart = false;
		return;
	}

	/* chip partial reset */
	if (!(READ_ONCE(dev->recovery.state) & MT_MCU_CMD_STOP_DMA))
		return;

	if (mtk_wed_device_active(&dev->mt76.mmio.wed)) {
		mtk_wed_device_stop(&dev->mt76.mmio.wed);
		if (!is_mt7986(&dev->mt76))
			mt76_wr(dev, MT_INT_WED_MASK_CSR, 0);
	}
...

I'm wondering if there is some other error that is occurring that lands me in this recovery from system error where the call to mtk_wed_device_stop(&dev->mt76.mmio.wed); is placed, yet the recovery is successful. The only location in the mt76 code where I see a call to start WED is here: https://github.com/Brain2000/mt76/blob/master/mt7915/dma.c#L382

I'm trying to see if mt7915_dma_enable() is invoked again after recovery from system error.

Hmm, good question. I still need to see what the idle function does, but what you found could be the root cause. You can always find out by adding a drv_info(dev->mt76.dev, "Hit mt7915_dma_enable\n"); then seeing if after that message appears in the logs, if WED offloading stops.

Keeping in mind I'm not a driver dev, I was literally in the process of compiling a release where I have a couple printk() calls sprinkled around the start/stop WED functions.

e.g. printk(KERN_INFO "MT76-1: mtk_wed_device_active=%s", mtk_wed_device_active(&dev->mt76.mmio.wed) ? "true" : "false");

However, is dev_info() functionally equivalent to printk()? Or is printk() deprecated these days?

This is also my first adventure with driver programming! Though I'm fairly well versed with multi-threading and race conditions, and low level code. I just learned what TTL serial is, and that RS232 is not compatible. Who knew?! So I picked one up and was finally able to pull a stack trace from my crash, I'm hot on the trail !

I just recently read all about printk, TP_printk, tp_info, dev_info.... so many different ways to output information. The article I read seemed to indicate that everyone is leaning away from printk and to use tp_xxxx, but one of the best alternatives to printk for device drivers was the dev_xxxx functions. So I just started using those today and they show up in the logread, so I'm happy with it.

1 Like

@Brain2000 Two of my three RT3200s have experienced a crash after several days, but a different crash than I was originally having.

Crash dump
[407711.630282] ------------[ cut here ]------------
[407711.634997] WARNING: CPU: 0 PID: 0 at idr_alloc_u32+0x9c/0xc0
[407711.640833] Modules linked in: nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mac80211 cfg80211 nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c hwmon compat cls_flower act_vlan cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact cryptodev autofs4 seqiv authencesn authenc leds_gpio gpio_button_hotplug
[407711.711223] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G S 5.15.98 #0
[407711.718699] Hardware name: Linksys E8450 (UBI) (DT)
[407711.723654] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[407711.730695] pc : idr_alloc_u32+0x9c/0xc0
[407711.734698] lr : idr_alloc+0x2c/0x50
[407711.738354] sp : ffffffc008003150
[407711.741745] x29: ffffffc008003150 x28: ffffffc008b7d000 x27: 0000000000000002
[407711.748964] x26: ffffff8001ceeaa0 x25: ffffffc000a2ab10 x24: ffffff8002424a40
[407711.756182] x23: ffffffc000a72cb4 x22: ffffff8006c1b500 x21: ffffffc0080031bc
[407711.763399] x20: 0000000000000000 x19: ffffff8001ceedf0 x18: 0000000000000000
[407711.770617] x17: ffffffc017419000 x16: 0000000000000620 x15: ffffff8001ceefd3
[407711.777835] x14: 0000000000000001 x13: ffffff8001ceefd6 x12: ffffff8006c1b530
[407711.785052] x11: ffffff8002422020 x10: ffffffc0080031f2 x9 : 0000000000000001
[407711.792268] x8 : 0000000000000228 x7 : 0000000000000220 x6 : ffffff8006829468
[407711.799486] x5 : 000000007fffffff x4 : 0000000000000a20 x3 : 000000000000007e
[407711.806703] x2 : 0000000000000a20 x1 : 0000000000000000 x0 : 0000000000000003
[407711.813921] Call trace:
[407711.816444] idr_alloc_u32+0x9c/0xc0
[407711.820100] idr_alloc+0x2c/0x50
[407711.823409] mt76_tx_status_skb_add+0xac/0x190 [mt76]
[407711.828555] mt7615_tx_prepare_skb+0x114/0x350 [mt7615e]
[407711.833950] mt76_dma_wed_setup+0x810/0x970 [mt76]
[407711.838825] mt76_tx_check_agg_ssn+0xa8/0x110 [mt76]
[407711.843874] mt76_tx+0xe8/0x1ec [mt76]
[407711.847707] mt7615_mac_sta_remove+0x820/0x930 [mt7615_common]
[407711.853626] __ieee80211_schedule_txq+0xbe0/0x1980 [mac80211]
[407711.859480] __ieee80211_xmit_fast+0x798/0x934 [mac80211]
[407711.864978] __ieee80211_subif_start_xmit+0x1cc/0x31c [mac80211]
[407711.871083] ieee80211_subif_start_xmit+0x3c/0x3a0 [mac80211]
[407711.876928] dev_hard_start_xmit+0xd0/0x150
[407711.881194] __dev_queue_xmit+0x960/0xc24
[407711.885283] dev_queue_xmit+0x10/0x1c
[407711.889025] br_dev_queue_push_xmit+0x70/0x154
[407711.893552] br_forward_finish+0x98/0xb0
[407711.897555] __br_forward+0xcc/0x140
[407711.901210] br_forward+0x124/0x144
[407711.904779] br_handle_frame_finish+0x330/0x524
[407711.909390] br_handle_frame+0x408/0x46c
[407711.913394] __netif_receive_skb_core.constprop.0+0x254/0xca4
[407711.919219] __netif_receive_skb_list_core+0xd4/0x1ec
[407711.924350] netif_receive_skb_list_internal+0x1b8/0x280
[407711.929741] napi_complete_done+0x64/0x1a0
[407711.933917] gro_cell_poll+0x80/0xa0
[407711.937576] __napi_poll+0x54/0x1b0
[407711.941144] net_rx_action+0xd8/0x26c
[407711.944886] _stext+0x10c/0x28c
[407711.948108] __irq_exit_rcu+0xdc/0xfc
[407711.951849] irq_exit+0xc/0x1c
[407711.954981] handle_domain_irq+0x60/0x8c
[407711.958984] gic_handle_irq+0x64/0x8c
[407711.962725] call_on_irq_stack+0x28/0x44
[407711.966728] do_interrupt_handler+0x4c/0x54
[407711.970991] el1_interrupt+0x2c/0x4c
[407711.974647] el1h_64_irq_handler+0x14/0x20
[407711.978822] el1h_64_irq+0x74/0x78
[407711.982302] arch_cpu_idle+0x14/0x20
[407711.985957] do_idle+0xc0/0x150
[407711.989181] cpu_startup_entry+0x24/0x60
[407711.993184] rest_init+0xc4/0xd0
[407711.996493] arch_call_rest_init+0xc/0x14
[407712.000582] start_kernel+0x5b4/0x5d4
[407712.004323] __primary_switched+0xa0/0xa8
[407712.008412] ---[ end trace 7b4c1b9ce5312380 ]---
[407712.013485] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
[407712.022407] Mem abort info:
[407712.025292] ESR = 0x0000000096000046
[407712.029122] EC = 0x25: DABT (current EL), IL = 32 bits
[407712.034527] SET = 0, FnV = 0
[407712.037667] EA = 0, S1PTW = 0
[407712.040907] FSC = 0x06: level 2 translation fault
[407712.045866] Data abort info:
[407712.048824] ISV = 0, ISS = 0x00000046
[407712.052751] CM = 0, WnR = 1
[407712.055800] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000044f9d000
[407712.062335] [0000000000000008] pgd=0800000044d5d003, p4d=0800000044d5d003, pud=0800000044d5d003, pmd=0000000000000000

One of my crashes happened after ~3.5 days of uptime. The second one (a different RT3200) happened today after 4 days 17 hours of uptime. When the crash occurs, the RT3200 is down hard. Literally unresponsive to network, even ethernet. Has to be manually powered off and back on.

I'll be curious to hear if this resembles the crash you are analyzing.

Mine happens in a different place, but also due to the sta_remove functionality.
Interesting that yours has dma_wed_setup here as well... it maybe that sometimes it crashes, and other times it takes the WED out.... depending on the timing.

It is imperative to not used sta pointers after sta_remove( ) is called from the mac80211 framework! And I believe that may be the core of all this weirdness.

Earlier today I added synchronize_rcu( ) at the end of every sta_remove( ) function in mt76 (there's several) as a test to see if it stops the crash. I've caught the attention of the mt76 repository overseer, who may also be looking at these stack traces.

1 Like

I just thought of something. Was it your dmesg that showed the NIC card going up and down over and over?? That also causes an sta_remove ( ) to be called, as that function is for different types of connections, not just wifi.

If so, is there any possibility that you have a bad ethernet cable?

I don't recall that being something I've seen in my logs, but your point around sta_remove() is well noted. I will definitely keep an eye out for that as I'm observing any additional crashes.

Hopefully with some more attention on all this in the mt76 repo, help will be on the way soon :slight_smile:

So the "sta_remove" is a callback from the mac80211 framework. It says very specifically that you must not use an ieee80211_sta pointer after sta_remove is called. I believe there may be a storage list somewhere with sta pointers that need to be extracted when this happens, otherwise all kinds of crashes and side effects are possible.

The sta_remove callback is allowed to sleep, which means they are probably called concurrently so it won't slow down other connections. So worse case scenario, we might be able to do something stupid like sleep(60 * 60 * 1000) to wait an hour before returning from this function, to make ABSOLUTELY sure all sta pointers have been flushed out.

Also, sta_state( ) is an alternative to sta_add and sta_remove. Keep that in mind, should you see that in the stacktrace.

Here's the documentation in case you find anything further that I might have missed:

https://www.kernel.org/doc/html/v4.9/80211/mac80211.html

Learned something.
Note to self: do not msleep( ) for an hour in a kernel driver.....