NEXX 3020 (MT7620) Wi-Fi issues

Killbrum · October 22, 2018, 5:57am

I thought the same when I tried ~20 different OpenWrt images :))) but actually no, it is not better. There are two different bugs (or, who knows, it might be the same bug with different behavior). The first bug leads to an unstable connection, low speed, problems with TX/RX power above 10 and possible random router crashes; yep, exactly, OpenWrt crashes. The second bug leads to accidental (or not?) disconnects from the AP. I can't recall the exact message, but in general OpenWrt says that something is wrong with the connected client and disconnects it. No speed drop, no packet drop, everything is OK, but OpenWrt disconnects that client. Recent patches fixed the first bug but not the second. Even now, with the patches from September, I still have random disconnects from the AP. I have them ONLY with one device (my Galaxy Note 5). When I use my laptop (Intel Centrino blah-blah), it works normally.

P.S. I found one untouched NEXX 3020 with OpenWRT on board. Will upgrade it, and we will see

Killbrum · October 22, 2018, 7:51am

I don't care about the kernel version. There are thousands of old devices with an outdated core, but they work like a charm. I think that we need to have, as a first thing, a working device, and only then an updated kernel. Nobody cares about the kernel if the device works almost like a brick

xeros · October 22, 2018, 8:00pm

@Killbrum, thanks. Yes, I think it's the 2nd bug which I've experienced.
I have now connected the 3020F to a different router (well, a Banana Pro on the latest Armbian in master mode), also via WiFi as a client, and for almost 24h I haven't had this disconnect problem.
Connectivity is not fast (the Banana can only do 802.11n at 75 Mbit/s max, and measured transfer rates are a stable 22 Mbit/s through 2 walls now), but so far this connection seems to be rock stable.
It's strange, as none of my other routers (I have nearly 10 of them) has had such problems, and neither has any client device (laptops, phones, tablets, ...) with the router that was problematic for this 3020F.
It's all on yesterday's master snapshot.

xeros · October 23, 2018, 7:03pm

I've run a few more tests with builds from both:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=shortlog;h=refs/heads/openwrt-18.06
and
https://git.openwrt.org/?p=openwrt/openwrt.git;a=shortlog;h=refs/heads/master

Results are that:

The 18.06 code tree still has both bugs, and in every case bug #1 affects WiFi performance on the WT3020F with every other router I tested.
The master code tree has bug #1 fixed, and WiFi performance is quite good and stable for realtime video transfer from a webcam. Only bug #2 remains (as in the 18.06 code tree), and it affects connectivity to only a few routers.

Would someone port the WiFi driver fixes from master to the openwrt-18.06 tree?

daniel-santos · November 21, 2018, 7:00am

Hello!

So I'm here for my bounty!

Seriously, the dropped frame issue is caused by a race in the ieee80211 -- but it's apparently a race that they know about and don't care about? I've sent a message to linux-wireless with a patch that is probably going nowhere. Apparently, despite the lacking API documentation, we're supposed to just drop them on the floor when our struct ieee80211_ops.tx is called and our queue is full. ieee80211_tx_frags is perfectly capable of queuing the damn things up and we could even shove them back in the 80211 queue when the driver queue is full, but noooo...

I guess it's a performance thing of not wanting to hold another lock -- I'm sure there's a way to mitigate that too, but whatever -- we just need to get it properly documented.

So more specifically, the driver sometimes continues to receive frames even though it has stopped the queue, because multiple threads have been preempted after they already checked that the queue was running, but then blocked when they hit spin_lock(&queue->tx_lock); in rt2x00queue_write_tx_frame(). So probably the ideal solution would be to not lock anything in the .tx code path if that is at all possible, which it probably isn't.
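To make that interleaving concrete, here is a minimal userspace sketch of it. It is not rt2x00 code: `struct txq`, `QUEUE_CAP`, `queue_running()`, `write_tx_frame_locked()` and `replay_race()` are all hypothetical names invented for illustration, and the lock itself is only implied by running the "locked" step for one sender at a time.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_CAP 2  /* hypothetical, deliberately tiny driver queue */

struct txq {
    int  len;      /* frames currently queued */
    bool stopped;  /* set once the queue is full */
    int  dropped;  /* frames dropped because the queue was full */
};

/* Step 1 of a sender: the unlocked "is the queue running?" check. */
bool queue_running(const struct txq *q)
{
    return !q->stopped;
}

/* Step 2: what a sender does once it finally owns the (implied) tx lock.
 * Returns true if the frame was queued, false if it was dropped. */
bool write_tx_frame_locked(struct txq *q)
{
    if (q->len >= QUEUE_CAP) {  /* re-check under the lock */
        q->dropped++;
        return false;
    }
    q->len++;
    if (q->len == QUEUE_CAP)
        q->stopped = true;      /* stop the queue for later senders */
    assert(q->len <= QUEUE_CAP);
    return true;
}

/* Replays one unlucky interleaving of three senders and returns the
 * number of frames that ended up dropped. */
int replay_race(void)
{
    struct txq q = { .len = QUEUE_CAP - 1, .stopped = false, .dropped = 0 };

    /* All three pass the unlocked check before any of them takes the lock. */
    bool a_ok = queue_running(&q);
    bool b_ok = queue_running(&q);
    bool c_ok = queue_running(&q);

    /* The lock then serializes them, one after another. */
    if (a_ok) write_tx_frame_locked(&q);  /* takes the last slot, stops the queue */
    if (b_ok) write_tx_frame_locked(&q);  /* its check is stale: dropped */
    if (c_ok) write_tx_frame_locked(&q);  /* its check is stale: dropped */

    return q.dropped;
}
```

With one free slot and three senders that all saw a running queue, two frames get dropped even though the queue was stopped correctly. The check and the enqueue are simply not atomic with respect to each other.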

Either way, the solution to the slight problem of the system going to hell in a hand-basket is to just not scream "I HAVE AN ERROR EVERYBODY, STOP AND LOOK!" when the system is already saturated. Don't worry, I've done it too, lol! We just snip this:

		rt2x00_err(queue->rt2x00dev, "Dropping frame due to full tx queue %d\n",
			   queue->qid);

This should keep the driver from dying, but not from lying. I think mac80211 needs a:

static inline void ieee80211_tx_mic_drop(struct net_device *dev, u32 len)
{
	struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);

	u64_stats_update_begin(&tstats->syncp);
	tstats->tx_dropped++;
	tstats->tx_packets--;
	tstats->tx_bytes -= len;
	u64_stats_update_end(&tstats->syncp);
}

Because at present the stats are pretending that everything was sent just fine.

We'll see what happens tomorrow.

Camicia · November 23, 2018, 5:06am

@daniel-santos why do you say:

How do you know that they know and not care?

Did you try your fix? How is the router behaving from a stability and speed perspective?

There were 2 main problems with the driver of the MT7620: dropped frames that led to a very slow speed. And the fact that sometimes once it drops the frames it also gets stuck forever. It seems that you are saying that you have a solution for the latter.

Do you use Openwrt 18 or LEDE as base?
LEDE unfortunately is not being updated at all. Last time that I checked I think that was already solved on OpenWrt 18. Did you look at master and the commits around September about that driver?

I wrote something about my point of view about the situation here: Xiaomi mi mini wifi: from pandorabox to lede - #11 by Camicia

If you think you can make the situation better a lot of people will be very thankful to you!

daniel-santos · November 23, 2018, 1:33pm

Hello Camicia

How do you know that they know and not care?

I don't for certain yet, but I'm basing it off of an IRC conversation in #kernelnewbies. I haven't received a response to my message to the linux-wireless mailing list yet, so I'm probably going to post again with a more pointed subject like "is this race in mac80211 intentional?" I'm completely new to the mac80211 code and I haven't done a lot in the net subsystem in general before, so I'm still learning a lot.

There are at least two different code paths that I've seen thus far to arrive at the race, one as a tasklet and one on behalf of a userland process calling sendmsg(). I don't yet know the code path that leads to this function when the transmission is due to routing from another interface -- that's next on my list.

Did you try your fix? How is the router behaving from a stability and speed perspective?

Well yes, I tried a few alterations and have had no problem, but I'm not routing. I think that I need to set up IP masquerading and get a few more wifi adapters to do proper testing.

I really need to be able to reproduce the original error. I was hoping to hear back from a customer that's trying the temporary fix by now.

There were 2 main problems with the driver of the MT7620: dropped frames that led to a very slow speed. And the fact that sometimes once it drops the frames it also gets stuck forever. It seems that you are saying that you have a solution for the latter.

Well I'm saying that, first of all, I believe I have discovered the cause of the dropped frames -- I have definitely discovered a cause to the problem, but I wouldn't rule out the possibility of another. I have an unrefined solution for that, the patch in my message to the list.

Second, I suspect that the "stuck forever" problem is a self-inflicted injury -- a condition caused by emitting a printk every time a frame is dropped unexpectedly. It's probably better, if we accept this race condition, to just increment tx_dropped (that you see in ifconfig) and not print anything.

I suspect that this condition is far more likely once the load average becomes > 1 (or the number of CPU cores). But once it occurs, emitting the printk perpetuates the problem keeping load average > cpu cores and increasing the likelihood that either a process or delayed work is preempted while in this race area. It's just a theory at the moment.
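A middle ground between printing on every drop and printing nothing is to always count the drop but cap how often a message is emitted. The kernel has its own helpers for this (`printk_ratelimited()` and `net_ratelimit()`); the sketch below is not those, just a small userspace model of the idea. `struct drop_log`, `record_drop()`, `LOG_WINDOW_MS` and `LOG_BURST` are hypothetical names and values chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOG_WINDOW_MS 5000u  /* hypothetical window length */
#define LOG_BURST     3u     /* hypothetical messages allowed per window */

/* Per-queue drop accounting with a budget for log messages. */
struct drop_log {
    uint64_t dropped;       /* total dropped frames, always counted */
    uint64_t window_start;  /* start of the current window, in ms */
    unsigned logged;        /* messages emitted in the current window */
};

/* Records one dropped frame at time now_ms.
 * Returns true if the caller may print a message for this drop,
 * false if it should stay silent and rely on the counter. */
bool record_drop(struct drop_log *dl, uint64_t now_ms)
{
    dl->dropped++;

    /* Start a fresh budget once the window has elapsed. */
    if (now_ms - dl->window_start >= LOG_WINDOW_MS) {
        dl->window_start = now_ms;
        dl->logged = 0;
    }

    if (dl->logged < LOG_BURST) {
        dl->logged++;
        return true;
    }
    return false;
}
```

Under a burst of drops the counter still reflects every frame, but the logging cost per window is bounded, so the message itself cannot keep the system saturated.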

Do you use Openwrt 18 or LEDE as base?

Using 18.06.1 as my base.

Did you look at master and the commits around September about that driver?

Yes I did and I honestly didn't see a single thing that would help this. They look like good commits, don't get me wrong! I am very much in favor of removing duplicate code!!!!! (Which is another way of saying that I loathe programmers who copy and paste.) I didn't understand the full purpose of the timing tweaks, but I presume they are for good cause.

EDIT: Correction. I did not look at the patches you're talking about, I was referring to Stanislaw's patches that have been mainlined for 4.20. You'll see them if you run gitk torvalds/master drivers/net/wireless/ralink where "torvalds" is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux. I do not know the hardware (digital interface or radio) well enough yet to understand the calibration stuff.

psyborg · November 23, 2018, 1:40pm

hi

there are two ways to verify if your patch solves the problem:

tool described in this post https://lists.openwrt.org/pipermail/openwrt-devel/2018-May/012270.html
use android app netxpro http://lists.infradead.org/pipermail/lede-bugs/2017-November/006345.html

daniel-santos · November 23, 2018, 1:59pm

@psyborg Thank you. Who is the author of this post? It would appear that the From: header is just the LEDE list itself. I would definitely prefer more details on what was specifically done to produce the error.

psyborg · November 23, 2018, 2:04pm

seems like it was scraped off the bugtracker. i was able to reproduce it: install app on android phone, connect to the network and initiate scan with the tool (the phone had mtk wifi chipset!)

daniel-santos · November 23, 2018, 2:06pm

Oh, the error was produced on the phone using the rt2x00 wifi driver, or on a router running openwrt after using the phone to scan it?

EDIT: Oh cool! I hadn't discovered this bug report yet, thanks again! Gotta read through it now.

EDIT 2: I'M STARTING A PATREON TO GET PERMISSION TO REPLY TO A TOPIC MORE THAN 3 GODDAM TIMES. Please support me by emailing the forum admin and sending them the most confusing pictures you can find along with suggestive themes from single-celled organisms.

psyborg · November 23, 2018, 2:28pm

the error has been reproduced on openwrt router with rt2x00 driver using android phone with mtk chipset and driver that android ships with

Camicia · November 24, 2018, 12:14am

I have not seen those conversations. I doubt people are deliberately keeping the bug around, but there has been some drama around this driver, so some people may be a little worn out by conversations and work around it. Being polite and assuming the best intentions will help to get some sympathy and support. As a suggestion, I would avoid accusing people. I also think that posting results AFTER you verify that your fix works would solicit more responses.

There is also a problem with a huge slowdown in the WiFi when the USB port is in use. I wonder if they are related. Of course, if the CPU is actively managing the data transfer, that can be the problem.

It sounds like you have at least one device to run the tests. I would look into the tools that @psyborg suggested. Personally I used this one in the past: https://github.com/adolfintel/speedtest. I ran that on a laptop and used another device to run the test.

Yes, you should be able to do it with those tools.

The good thing about a debug statement is that it clearly connects a dropped frame to that part of the code. A tx_dropped++ may have other causes. That said, it also requires more resources and it is slower. If your fix solves the problem, I think you can remove that printk.

Just to make sure we are on the same page. I was talking about these commits:

I am not sure if there are other more recent commits related to the MT7620 driver. Potentially, new commits may have fixed something but broken something else. So I would also try to apply your patch to older commits as a base. For example, I would try to apply it directly after this commit: https://github.com/openwrt/openwrt/pull/626/files

I would love to see somebody reverse engineer the original Mediatek driver and maybe create a new folder just for the MT7620N and maybe another for the MT7620A. There are still a ton of devices out there using these chips. Because of the lack of documentation and wanting to support too many devices with the same piece of code, I think the community ended up with a hodgepodge of code and no device working well.

Keep up the good work and let us know what you find out!

Camicia · November 24, 2018, 12:59am

@daniel-santos: It looks like somebody answered you here: https://www.spinics.net/lists/linux-wireless/msg180385.html

I just looked closer at what you posted on linux-wireless. mac80211 is used for a lot of devices, not only the MT7620.
A problem in the mac80211 general Linux code would affect a lot of devices and would be quite obvious.
However, if OpenWrt applies one or more patches to mac80211 when building for a device, a bug may well be introduced.

I am not sure where you are coming from and what your programming experience is.
Usually a stuck thread is the result of a deadlock. I usually try to avoid locks as much as possible. The best solution is writing code that does not require locks. When you really have to, you should make sure you are locking everywhere in the same order. The same is true for unlocking. See also https://www.codeguru.com/cpp/misc/misc/threadsprocesses/article.php/c15545/Deadlock-the-Problem-and-a-Solution.htm for a quick summary of strategies or https://sites.ualberta.ca/~smartynk/Resources/CMPUT%20379/beck%20notes/deadlock.pdf and https://courses.engr.illinois.edu/cs241/sp2014/lecture/25-deadlock-solutions.pdf for a more in-depth explanation.

I find it difficult to reason about static code and patches until they are applied. It is a lot easier when I can see the resulting code that is being compiled. Ideally I like to run a diff between the original Linux code and the code after the patches are applied.

With the proper tools it should not be too difficult to debug it while it is running and detect where the deadlock is, but I have never done that on a router.

reinerotto · November 25, 2018, 10:47pm

1) is correct, 2) is not. The old principle of "locking order" (I used it already about 40 years ago) does not require unlocks to be ordered. Unlocking can be done as soon as convenient.
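As a small illustration of that point, here is a userspace sketch using POSIX mutexes. It assumes a made-up `struct account` and `transfer()` helper (neither comes from the driver or from mac80211) and picks the acquisition order by comparing addresses, which is just one common way to get a consistent order.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shared object protected by its own mutex. */
struct account {
    pthread_mutex_t lock;
    long balance;
};

/* Moves amount from src to dst. Both locks are always acquired in
 * address order, so two concurrent transfers in opposite directions
 * cannot each hold one lock while waiting for the other. */
void transfer(struct account *src, struct account *dst, long amount)
{
    if (src == dst)
        return;  /* nothing to move, and locking twice would self-deadlock */

    struct account *first  = ((uintptr_t)src < (uintptr_t)dst) ? src : dst;
    struct account *second = (first == src) ? dst : src;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);

    src->balance -= amount;
    dst->balance += amount;

    /* The release order is not constrained: the lock taken first is
     * released first here, and the reverse would be just as safe. */
    pthread_mutex_unlock(&first->lock);
    pthread_mutex_unlock(&second->lock);
}
```

Only acquisition needs a global order; releasing a mutex never blocks, so it cannot contribute to a wait cycle.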