NEXX 3020 (MT7620) Wi-Fi issues

Sorry guys... but you're asking for a performance improvement, right? Do you have a stable Wi-Fi connection? Even with 18.06 (on an MT7620 chip, of course) it is VERY unstable for me.

@Killbrum Did you try a build with this patch https://github.com/openwrt/openwrt/pull/626 ?

No, I didn't. Should I try it? It might be difficult for me to compile OpenWrt from source.

It is worth trying.

@Killbrum A lot of changes relevant to the MT7620 went in on Sep 26 (including the ones from @psyborg):

patches

You could try the latest snapshot http://downloads.openwrt.org/snapshots/targets

@Camicia, I just tested a freshly built snapshot from the 18.06 branch: https://git.openwrt.org/?p=openwrt/openwrt.git;a=shortlog;h=refs/heads/openwrt-18.06
It made no difference compared with 18.06.1: heavy packet loss a few times per minute.
In my configuration the WT3020F is set up as a wireless client, connected to another 802.11n router (which has worked reliably with 10+ different devices for years) only about 2-2.5 meters away.
On stock firmware, the WT3020F ran in Wi-Fi master mode, connected to the other router via its WAN Ethernet port, and had stable Wi-Fi performance without packet loss (iperf from a laptop connected over Wi-Fi through the WT3020F, with the router on WAN, showed 80-95.5 Mbit/s). Now it's hard to even measure.

With the latest precompiled snapshot:
wt3020-8M-squashfs-sysupgrade.bin 23b65e43f96607d97dae2942679daed7620c013e73d2b6bfaa46de71e5ea7098 3840.7 KB Sun Oct 21 04:47:02 2018
it seems to be a bit better, but there is still packet loss.

Just flashed four different NEXX 3020 devices with a slightly different firmware (not OpenWrt). Works perfectly, no issues. So I think the problem is only with OpenWrt.

If you meant padavan-ng: yes, it works perfectly. But what about the kernel? It's probably not maintained.

I thought the same when I tried about 20 different OpenWrt images :))) but actually no, it is not better. There are two different bugs (or, who knows, it might be the same bug with different behavior). The first bug leads to an unstable connection, low speed, problems with TX/RX power above 10, and possibly random router crashes. Yep, exactly: OpenWrt crashes. The second bug leads to accidental (or not?) disconnects from the AP. I can't recall the exact message, but in general OpenWrt says that something is wrong with the connected client and disconnects it. No speed drop, no packet drop, everything is OK, but OpenWrt disconnects that client. Recent patches fixed the first bug, but not the second. Even now, with the patches from September, I still get random disconnects from the AP. I have them ONLY with one device (my Galaxy Note 5). When I use my laptop (Intel Centrino blah-blah), it works normally.

P.S. I found one untouched NEXX 3020 with OpenWrt on board. I will upgrade it, and we will see.


I don't care about the kernel version. There are thousands of old devices with an outdated core, but they work like a charm. I think that we need a working device first, and only then an updated kernel. Nobody cares about the kernel if the device works almost like a brick.

@Killbrum, thanks. Yes, I think it's the second bug that I've been experiencing.
I have now connected the 3020F to a different router (well, a Banana Pro on the latest Armbian in master mode, with the 3020F again connecting over Wi-Fi as a client), and for almost 24 hours I haven't had the disconnect problem.
Connectivity is not fast (the Banana can only do 802.11n at 75 Mbit/s max, and measured transfer rates are a stable 22 Mbit/s through 2 walls now), but so far this connection seems to be rock stable.
It's strange, as none of my other routers (I have nearly 10 of them) have had such problems, and neither have any of the client devices (laptops, phones, tablets, ...) connected to the router that was problematic for this 3020F.
All of this is on yesterday's master snapshot.

I've run a few more tests with builds from both:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=shortlog;h=refs/heads/openwrt-18.06
and
https://git.openwrt.org/?p=openwrt/openwrt.git;a=shortlog;h=refs/heads/master

The results:

  • The 18.06 code tree still has both bugs, and in every case bug #1 affects Wi-Fi performance on the WT3020F, whichever other router I tested it against.
  • The master code tree has bug #1 fixed, and Wi-Fi performance is quite good and stable for real-time video transfer from a webcam. Only bug #2 remains (as in the 18.06 code tree), and it affects connectivity to only a few routers.

Would someone port the Wi-Fi driver fixes from master to the openwrt-18.06 tree?


Hello!

So I'm here for my bounty! :face_with_symbols_over_mouth:

Seriously, the dropped frame issue is caused by a race in the ieee80211 code -- but it's apparently a race that they know about and don't care about? I've sent a message to linux-wireless with a patch that is probably going nowhere. Apparently, despite the lack of API documentation saying so, we're supposed to just drop frames on the floor when our struct ieee80211_ops.tx is called and our queue is full. ieee80211_tx_frags is perfectly capable of queuing the damn things up and we could even shove them back in the 80211 queue when the driver queue is full, but noooo...

I guess it's a performance thing of not wanting to hold another lock -- I'm sure there's a way to mitigate that too, but whatever -- we just need to get it properly documented.

So more specifically, the driver sometimes continues to receive frames even though it has stopped the queue, because multiple threads were preempted after they had already checked that the queue was running, and then blocked when they hit spin_lock(&queue->tx_lock); in rt2x00queue_write_tx_frame(). So probably the most ideal solution would be to not lock anything in the .tx code path if that is at all possible, which it probably isn't :slight_smile:
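To make the interleaving concrete, here is a minimal single-threaded sketch in plain userspace C that replays the check-then-lock sequence described above. It is only a toy model, not rt2x00 code: `toy_queue`, `toy_queue_running`, `toy_write_frame` and `toy_replay_race` are all hypothetical names, and where the real driver takes a spinlock this sketch simply runs the senders one after another.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of a driver TX queue; not the rt2x00 structures. */
struct toy_queue {
    size_t len;            /* frames currently queued */
    size_t limit;          /* queue capacity */
    bool stopped;          /* set once the queue fills up */
    unsigned long dropped; /* frames rejected because the queue was full */
};

/* Step 1: the unlocked "is the queue running?" check made by each sender. */
static bool toy_queue_running(const struct toy_queue *q)
{
    return !q->stopped;
}

/* Step 2: the part that would run under the queue lock in a real driver. */
static bool toy_write_frame(struct toy_queue *q)
{
    if (q->len >= q->limit) {
        q->dropped++;      /* the check in step 1 is stale by now */
        return false;
    }
    q->len++;
    if (q->len == q->limit)
        q->stopped = true; /* too late for senders already past step 1 */
    return true;
}

/*
 * Replay the problematic interleaving: every sender performs step 1 first
 * (as if each were preempted right after the check), and only then do they
 * take turns at step 2. Returns the total number of dropped frames.
 */
static unsigned long toy_replay_race(struct toy_queue *q, unsigned senders)
{
    unsigned passed = 0;

    for (unsigned i = 0; i < senders; i++)
        if (toy_queue_running(q))
            passed++;

    for (unsigned i = 0; i < passed; i++)
        toy_write_frame(q);

    return q->dropped;
}
```

With one free slot and three senders that all passed the unlocked check, one frame is queued and the other two are dropped even though the queue was "running" when each sender looked.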

Either way, the solution to the slight problem of the system going to hell in a hand-basket is to just not scream "I HAVE AN ERROR EVERYBODY, STOP AND LOOK!" when the system is already saturated. Don't worry, I've done it too, lol! We just snip this:

		rt2x00_err(queue->rt2x00dev, "Dropping frame due to full tx queue %d\n",
			   queue->qid);

This should keep the driver from dying, but not from lying. I think mac80211 needs a:

static inline void ieee80211_tx_mic_drop(struct net_device *dev, u32 len)
{
	struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);

	u64_stats_update_begin(&tstats->syncp);
	tstats->tx_dropped++;
	tstats->tx_packets--;
	tstats->tx_bytes -= len;
	u64_stats_update_end(&tstats->syncp);
}

Because at present the stats are pretending that everything was sent just fine.
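As a rough illustration of that accounting gap, the sketch below models the counters in plain userspace C. It assumes, as the proposed helper implies, that a frame is counted as sent before the driver gets a chance to drop it. `toy_netstats`, `toy_account_tx` and `toy_account_drop` are hypothetical names, not kernel APIs, and the per-CPU and u64_stats synchronisation of the real code is left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-device TX counters. */
struct toy_netstats {
    uint64_t tx_packets;
    uint64_t tx_bytes;
    uint64_t tx_dropped;
};

/* Optimistic accounting: the frame is counted as sent up front. */
static void toy_account_tx(struct toy_netstats *s, uint32_t len)
{
    s->tx_packets++;
    s->tx_bytes += len;
}

/* Correction applied when the driver ends up dropping that frame. */
static void toy_account_drop(struct toy_netstats *s, uint32_t len)
{
    s->tx_dropped++;
    s->tx_packets--;
    s->tx_bytes -= len;
}
```

Without the second call, a dropped 100-byte frame would stay in tx_packets and tx_bytes, and tx_dropped would remain at zero.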

We'll see what happens tomorrow.

@daniel-santos why do you say:

How do you know that they know and not care?

Did you try your fix? How is the router behaving from a stability and speed perspective?

There were 2 main problems with the driver of the MT7620: dropped frames that led to a very slow speed, and the fact that sometimes, once it drops the frames, it also gets stuck forever. It seems that you are saying that you have a solution for the latter.

Do you use OpenWrt 18 or LEDE as base?
LEDE unfortunately is not being updated at all. Last time I checked, I think that was already solved on OpenWrt 18. Did you look at master and the commits around September about that driver?

I wrote something about my point of view about the situation here: Xiaomi mi mini wifi: from pandorabox to lede - #11 by Camicia

If you think you can make the situation better a lot of people will be very thankful to you! :grinning:

Hello Camicia

How do you know that they know and not care?

I don't for certain yet, but I'm basing it off of an IRC conversation in #kernelnewbies. I haven't received a response to my message to the linux-wireless mailing list yet, so I'm probably going to post again with a more pointed subject like "is this race in mac80211 intentional?" I'm completely new to the mac80211 code and I haven't done a lot in the net subsystem in general before, so I'm still learning a lot.

There are at least two different code paths that I've seen thus far to arrive at the race, one as a tasklet and one on behalf of a userland process calling sendmsg(). I don't yet know the code path that leads to this function when the transmission is due to routing from another interface -- that's next on my list.

Did you try your fix? How is the router behaving from a stability and speed perspective?

Well yes, I tried a few alterations and have had no problem, but I'm not routing. I think that I need to set up IP masquerading and get a few more Wi-Fi adapters to do proper testing.

I really need to be able to reproduce the original error. I was hoping to hear back from a customer that's trying the temporary fix by now.

There were 2 main problems with the driver of the MT7620: dropped frames that led to a very slow speed, and the fact that sometimes, once it drops the frames, it also gets stuck forever. It seems that you are saying that you have a solution for the latter.

Well I'm saying that, first of all, I believe I have discovered the cause of the dropped frames -- I have definitely discovered a cause to the problem, but I wouldn't rule out the possibility of another. I have an unrefined solution for that, the patch in my message to the list.

Second, I suspect that the "stuck forever" problem is a self-inflicted injury -- a condition caused by emitting a printk every time a frame is dropped unexpectedly. It's probably better, if we accept this race condition, to just increment tx_dropped (that you see in ifconfig) and not print anything.

I suspect that this condition is far more likely once the load average becomes > 1 (or the number of CPU cores). But once it occurs, emitting the printk perpetuates the problem keeping load average > cpu cores and increasing the likelihood that either a process or delayed work is preempted while in this race area. It's just a theory at the moment.
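The "count it, don't shout about it" idea can be sketched as follows, in plain userspace C. This is only an illustration under assumptions: `drop_report` and `drop_report_note` are hypothetical names, the caller passes in the time explicitly, and the interval is arbitrary. In-kernel code would more likely lean on existing rate-limiting helpers such as printk_ratelimited() or net_ratelimit() than on a hand-rolled timer like this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for dropped frames with a rate-limited log. */
struct drop_report {
    uint64_t dropped;     /* every drop is counted, logged or not */
    uint64_t last_log_ms; /* time of the last log line */
    uint64_t interval_ms; /* minimum gap between log lines */
    bool logged_once;     /* false until the first log line */
};

/*
 * Record one dropped frame. Returns true only when the caller should emit
 * a log line now, so a burst of drops costs one message per interval
 * instead of one message per frame.
 */
static bool drop_report_note(struct drop_report *r, uint64_t now_ms)
{
    r->dropped++;

    if (r->logged_once && now_ms - r->last_log_ms < r->interval_ms)
        return false;

    r->logged_once = true;
    r->last_log_ms = now_ms;
    return true;
}
```

The counter still records every drop, so nothing is hidden from the stats; only the expensive logging is throttled, which is the part suspected of keeping the load high.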

Do you use Openwrt 18 or LEDE as base?

Using 18.06.1 as my base.

Did you look at master and the commits around September about that driver?

Yes I did and I honestly didn't see a single thing that would help this. They look like good commits, don't get me wrong! I am very much in favor of removing duplicate code!!!!! (Which is another way of saying that I loathe programmers who copy and paste.) I didn't understand the full purpose of the timing tweaks, but I presume they are for good cause.

EDIT: Correction. I did not look at the patches you're talking about; I was referring to Stanislaw's patches that have been mainlined for 4.20. You'll see them if you run gitk torvalds/master drivers/net/wireless/ralink where "torvalds" is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux. I do not yet know the hardware (digital interface or radio) well enough to understand the calibration stuff.

hi

there are two ways to verify if your patch solves the problem:

  1. tool described in this post https://lists.openwrt.org/pipermail/openwrt-devel/2018-May/012270.html

  2. use android app netxpro http://lists.infradead.org/pipermail/lede-bugs/2017-November/006345.html

@psyborg Thank you. Who is the author of this post? It would appear that the From: header is just the LEDE list itself. :confused: I would definitely prefer more details on what was specifically done to produce the error.

seems like it was scraped off the bugtracker. i was able to reproduce it: install app on android phone, connect to the network and initiate scan with the tool (the phone had mtk wifi chipset!)

Oh, the error was produced on the phone using the rt2x00 wifi driver, or on a router running openwrt after using the phone to scan it?

EDIT: Oh cool! I hadn't discovered this bug report yet, thanks again! :slight_smile: Gotta read through it now.

EDIT 2: I'M STARTING A PATREON TO GET PERMISSION TO REPLY TO A TOPIC MORE THAN 3 GODDAM TIMES. Please support me by emailing the forum admin and sending them the most confusing pictures you can find along with suggestive themes from single-celled organisms.