How do you know that they know and not care?
I don't for certain yet, but I'm basing it off of an IRC conversation in #kernelnewbies. I haven't received a response to my message to the wireless-linux mailing list yet, so I'm probably going to post again with a more pointed subject like "is this race in mac80211 intentional?" I'm completely new to the mac80211 code and I haven't done a lot in the net subsystem in general before, so I'm still learning a lot.
There are at least two different code paths that I've seen thus far to arrive at the race, one as a tasklet and one on behalf of a userland process calling sendmsg(). I don't yet know the code path that leads to this function when the transmission is due to routing from another interface -- that's next on my list.
Did you try your fix? How is the router behaving from a stability and speed perspective?
Well yes, I tried a few alterations and have had no problems, but I'm not routing. I think I need to set up IP masquerading and get a few more wifi adapters to do proper testing.
I really need to be able to reproduce the original error. I was hoping to hear back from a customer that's trying the temporary fix by now.
There were 2 main problems with the MT7620 driver: dropped frames, which led to very slow speeds, and the fact that sometimes, once it drops frames, it also gets stuck forever. It seems you are saying that you have a solution for the latter.
Well I'm saying that, first of all, I believe I have discovered the cause of the dropped frames -- I have definitely discovered a cause of the problem, but I wouldn't rule out the possibility of another. I have an unrefined solution for that: the patch in my message to the list.
Second, I suspect that the "stuck forever" problem is a self-inflicted injury -- a condition caused by emitting a printk every time a frame is dropped unexpectedly. It's probably better, if we accept this race condition, to just increment tx_dropped (that you see in ifconfig) and not print anything.
I suspect that this condition is far more likely once the load average becomes > 1 (or the number of CPU cores). But once it occurs, emitting the printk perpetuates the problem keeping load average > cpu cores and increasing the likelihood that either a process or delayed work is preempted while in this race area. It's just a theory at the moment.
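To make the idea concrete, here is a minimal userspace sketch of "count the drop, don't log every one". It is not the driver's code: struct tx_stats, note_tx_drop() and LOG_INTERVAL_TICKS are hypothetical names, and the once-per-interval message is an addition of this sketch (the suggestion above is to print nothing at all). In the kernel the same effect would more likely come from an existing helper such as net_ratelimit() or printk_ratelimited().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for per-interface statistics. Only tx_dropped
 * corresponds to something named in the post; the rest is illustration. */
struct tx_stats {
    unsigned long tx_dropped;  /* total dropped frames, as shown by ifconfig */
    unsigned long suppressed;  /* drops counted since the last message */
    unsigned long last_log;    /* tick at which the last message was emitted */
    bool logged_once;          /* false until the first message */
};

/* Assumed minimum spacing between messages, in arbitrary ticks. */
#define LOG_INTERVAL_TICKS 1000UL

/* Record one dropped frame. The counter is always incremented; a message
 * is emitted at most once per LOG_INTERVAL_TICKS. Returns true if a
 * message was emitted for this drop. */
bool note_tx_drop(struct tx_stats *s, unsigned long now)
{
    s->tx_dropped++;

    if (s->logged_once && now - s->last_log < LOG_INTERVAL_TICKS) {
        s->suppressed++;       /* cheap path: no I/O at all */
        return false;
    }

    fprintf(stderr, "tx drop: total %lu, %lu suppressed since last message\n",
            s->tx_dropped, s->suppressed);
    s->last_log = now;
    s->suppressed = 0;
    s->logged_once = true;
    return true;
}
```

The point of the design is that the common path does nothing but bump a counter, so a burst of drops cannot itself generate the load that, per the theory above, makes further drops more likely.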
Do you use Openwrt 18 or LEDE as base?
Using 18.06.1 as my base.
Did you look at master and the commits around September about that driver?
Yes I did and I honestly didn't see a single thing that would help this. They look like good commits, don't get me wrong! I am very much in favor of removing duplicate code!!!!! (Which is another way of saying that I loathe programmers who copy and paste.) I didn't understand the full purpose of the timing tweaks, but I presume they are for good cause.
EDIT: Correction. I did not look at the patches you're talking about; I was referring to Stanilshaw's patches that have been mainlined for 4.20. You'll see them if you run "gitk torvalds/master drivers/net/wireless/ralink", where "torvalds" is a remote for https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux. I do not yet know the hardware (digital interface or radio) well enough to understand the calibration stuff.