So I have some possible good news. I have been making modifications to the mt76 driver and compiling the openwrt kernel. I've had the RT3200 running for around 10 days now without a crash.
I found a number of potential race conditions, so I filled in some missing rcu_read_lock()/synchronize_rcu() calls and made sure that some of the ieee80211 kernel functions marked as "do not call these concurrently" can no longer be called concurrently.
I believe the worst that can happen here is slowing things down ever so slightly, but I've been able to push over a gigabit (500 Mbps each direction simultaneously) through this thing sustained for an entire day.
(It might be worth mentioning that, since the beginning, I have had both the hardware and software acceleration disabled as well as WED disabled)
I'm going to give this another week, and if there still is not a crash, offer a pull request to the mt76 repository.
Here are the code changes in case you're interested in trying it out:
I'm running with your fork & latest commit on my three RT3200s now. I've been seeing the crash daily--sometimes multiple times per day. I do have WED enabled, so I'll be sure to report back on any crashes (or hopefully the absence thereof!) in a few days.
I had that branch crash yesterday after almost 10 days! So while there were some missing locks (especially around the ieee80211_tx/rx() functions) that may cause unrelated issues, the cause of the core crash has not been found yet.
HOWEVER, I have two other RT3200s running with slight code modifications, and one of them is NOT crashing. In fact, it hit 23 simultaneously active roaming stations yesterday! That's double the max I've ever had on one of these, as they usually crash with between 12 and 15 actively roaming stations.
Here's the line that seems to make the difference. It disables the AMSDU offload. Clearly it's not ideal because it disables an optimization, but it is a significant clue if it continues to remain solid.
mt76\mt7915\mmio.c::mt7915_mmio_probe (around line 1014):
Fantastic headway! FWIW, I woke up to find that none of my RT3200s crashed overnight (!!!). First time in a while.
I still don't know why, but for some reason I continue to notice that with WED enabled, offloading seems to stop after some time. Wireless connectivity continues to work, but I can definitely see the lack of offloaded flows resulting in significantly higher CPU usage than when the flow offloading is functional.
Anyway, I will keep an eye on your repo for any changes you push and will roll another build and test. Many thanks!
If you're pretty comfortable with the build system, you should be able to apply this patch. You can adjust the PKG_SOURCE_VERSION and PKG_MIRROR_HASH as needed based on the commit from @Brain2000 you wish to test.
I was looking at the WED code, and it looks like it only changes the WED token counter if the token value is between token_start and token_size. However, the idr_alloc() function creates a token value between 0 and token_size. So if token_start is greater than 0, then it's possible that "wake_up(&dev->tx_wait);" will not be called when the token count hits 0. I'm not sure what this wake_up call does, but I have a feeling that if it isn't called, WED will stop working.
This could also be an issue with the firmware bin file that lives directly in the MT7915 controller.
Unfortunately I am not completely familiar with all this code yet, so I can't say for sure.
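To make the token-range concern above a bit more concrete, here is a small userspace model of the behavior as described in this post. It is only a sketch, not the actual mt76 code: the struct, the function names model_token_get/model_token_put, and the wakeups counter (standing in for wake_up(&dev->tx_wait)) are all made up for illustration, and it assumes the counter is only touched for tokens in [token_start, token_size).

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: field names echo the discussion, not the real mt76 structs. */
struct token_model {
    int token_start;      /* first token value counted as a WED token */
    int token_size;       /* exclusive upper bound of the token space */
    int wed_token_count;  /* outstanding tokens inside the WED range */
    int wakeups;          /* stands in for wake_up(&dev->tx_wait) */
};

/* True when the token falls in [token_start, token_size). */
static bool token_in_wed_range(const struct token_model *m, int token)
{
    return token >= m->token_start && token < m->token_size;
}

/* Account for a newly allocated token (allocated anywhere in [0, token_size)). */
void model_token_get(struct token_model *m, int token)
{
    assert(token >= 0 && token < m->token_size);
    if (token_in_wed_range(m, token))
        m->wed_token_count++;
}

/* Release a token; the wakeup fires only if an in-range release empties the counter. */
void model_token_put(struct token_model *m, int token)
{
    assert(token >= 0 && token < m->token_size);
    if (!token_in_wed_range(m, token))
        return; /* out-of-range tokens never touch the counter or the wakeup */
    assert(m->wed_token_count > 0);
    if (--m->wed_token_count == 0)
        m->wakeups++;
}
```

In this model, if token_start is above zero and every token handed out is below it, the counter never moves and the wakeup never fires, which is the situation the post is worried about. Whether the real driver behaves this way is exactly the open question.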
From some testing I performed, dev->mmio.wed.wlan.token_start is 0x1000 at this point in the code. Under my normal usage (and even stress testing), even when I was seeing flows offloaded, token never came close to >= 0x1000. I never saw the wake_up(&dev->tx_wait) invoked in my testing.
I'm wondering if some other error is occurring that lands me in the recovery-from-system-error path, where the call to mtk_wed_device_stop(&dev->mt76.mmio.wed); is placed, yet the recovery itself is successful. The only location in the mt76 code where I see a call to start WED is here: https://github.com/Brain2000/mt76/blob/master/mt7915/dma.c#L382
I'm trying to see if mt7915_dma_enable() is invoked again after recovery from system error.
Hmm, good question. I still need to see what the idle function does, but what you found could be the root cause. You can always find out by adding a dev_info(dev->mt76.dev, "Hit mt7915_dma_enable\n"); and then checking whether WED offloading stops after that message appears in the logs.
Keeping in mind I'm not a driver dev, I was literally in the process of compiling a release where I have a couple printk() calls sprinkled around the start/stop WED functions.
e.g. printk(KERN_INFO "MT76-1: mtk_wed_device_active=%s", mtk_wed_device_active(&dev->mt76.mmio.wed) ? "true" : "false");
However, is dev_info() functionally equivalent to printk()? Or is printk() deprecated these days?
This is also my first adventure with driver programming! Though I'm fairly well versed in multi-threading, race conditions, and low-level code. I just learned what TTL serial is, and that RS232 is not compatible. Who knew?! So I picked one up and was finally able to pull a stack trace from my crash. I'm hot on the trail!
I just recently read all about printk, TP_printk, tp_info, dev_info.... so many different ways to output information. The article I read seemed to indicate that everyone is leaning away from printk and to use tp_xxxx, but one of the best alternatives to printk for device drivers was the dev_xxxx functions. So I just started using those today and they show up in the logread, so I'm happy with it.
One of my crashes happened after ~3.5 days of uptime. The second one (a different RT3200) happened today after 4 days 17 hours of uptime. When the crash occurs, the RT3200 is down hard. Literally unresponsive to network, even ethernet. Has to be manually powered off and back on.
I'll be curious to hear if this resembles the crash you are analyzing.
Mine happens in a different place, but also due to the sta_remove functionality.
Interesting that yours has dma_wed_setup here as well. It may be that sometimes it crashes, and other times it takes the WED out, depending on the timing.
It is imperative not to use sta pointers after sta_remove() is called from the mac80211 framework! And I believe that may be the core of all this weirdness.
Earlier today I added synchronize_rcu() at the end of every sta_remove() function in mt76 (there are several) as a test to see if it stops the crash. I've caught the attention of the mt76 repository overseer, who may also be looking at these stack traces.
I just thought of something. Was it your dmesg that showed the NIC going up and down over and over? That also causes sta_remove() to be called, as that function is for different types of connections, not just wifi.
If so, is there any possibility that you have a bad ethernet cable?
I don't recall that being something I've seen in my logs, but your point around sta_remove() is well noted. I will definitely keep an eye out for that as I'm observing any additional crashes.
Hopefully, with some more attention on all this in the mt76 repo, help will be on the way soon.
So the "sta_remove" is a callback from the mac80211 framework. It says very specifically that you must not use an ieee80211_sta pointer after sta_remove is called. I believe there may be a storage list somewhere with sta pointers that need to be extracted when this happens, otherwise all kinds of crashes and side effects are possible.
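As a rough illustration of the "storage list" idea above, here is a userspace sketch of a driver-side table that caches station pointers and purges every reference when a station is removed. Everything in it is hypothetical: fake_sta stands in for ieee80211_sta, and sta_table, sta_table_add, and sta_table_purge are invented names, not anything from mt76 or mac80211. In the real driver the readers are RCU-protected, so clearing the pointers would presumably be paired with something like synchronize_rcu() before the callback returns.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED 8

/* Stand-in for struct ieee80211_sta; the real struct is owned by mac80211. */
struct fake_sta {
    int id;
};

/* Hypothetical driver-side table that caches station pointers. */
struct sta_table {
    struct fake_sta *slots[MAX_TRACKED];
};

/* Cache a station pointer; returns the slot used, or -1 if the table is full. */
int sta_table_add(struct sta_table *t, struct fake_sta *sta)
{
    assert(sta != NULL);
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (t->slots[i] == NULL) {
            t->slots[i] = sta;
            return i;
        }
    }
    return -1;
}

/*
 * What a sta_remove-style handler would need to do before returning:
 * drop every cached reference to sta. Returns how many slots were cleared.
 */
size_t sta_table_purge(struct sta_table *t, const struct fake_sta *sta)
{
    size_t cleared = 0;

    assert(sta != NULL);
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (t->slots[i] == sta) {
            t->slots[i] = NULL;
            cleared++;
        }
    }
    return cleared;
}
```

The point of the sketch is just that one station can be referenced from more than one place, so a single missed slot is enough to leave a dangling pointer behind after removal.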
The sta_remove callback is allowed to sleep, which means it is probably called concurrently so it won't slow down other connections. So, worst-case scenario, we might be able to do something stupid like sleep(60 * 60 * 1000) to wait an hour before returning from this function, to make ABSOLUTELY sure all sta pointers have been flushed out.
Also, sta_state() is an alternative to sta_add and sta_remove. Keep that in mind, should you see it in the stack trace.
Here's the documentation in case you find anything further that I might have missed: