Netgear R7800 exploration (IPQ8065, QCA9984)

@quarky
With your setup you cannot trigger the bug.
As I understand it, the iPad is connected via WiFi, right? WiFi is not affected by this bug. It only affects LAN ports that carry WAN and LAN traffic at the same time, when at least one 100Mbps device is connected to any LAN port and is also transferring data over the LAN.

First, run an iperf3 server on a PC (1) connected at 1Gbps; this is the LAN server. Then run iperf3 as a client with -R (the reverse option) on a client device (2) connected at 100Mbps (that's what triggers it). It doesn't matter whether device 2 is connected directly to a router LAN port or to the same port via a second switch (Gbit or not). Finally, run a WAN speed test on the PC (1) connected at 1Gbps. The PC (1) WAN down/up speed drops significantly.
The best scenario is when you also have a third (3) and a fourth (4) client. Both (3 and 4) should be 100Mbps clients.
I run network traffic from PC server (1) to all three (2, 3 and 4) 100Mbps clients. Total around 300Mbps going from the PC (1) to clients 2, 3 and 4 at the same time.
And when I run a WAN speed test on the PC (1) I get ridiculous WAN performance, varying from 20-30Mbps to a maximum of 40-50Mbps, with ping increasing to 50-500ms (normally 1ms).
With this setup, if a client device (2) is connected at 1Gbps to LAN port 2 (I use a laptop and just swap the cable for a 100Mbps one) and it doesn't send any data to any 100Mbps device over the LAN, then the laptop (2) can download/upload from/to the WAN at full Gigabit speeds, even while the PC (1) cannot because it is sending data to a 100Mbps device.
Obviously the router LAN port through which LAN data is sent to a 100Mbps device on another port cannot receive/send data to the WAN at Gbps full-duplex speeds in this case.
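
The reproduction steps above could be scripted roughly as follows. This is only a sketch: the address 192.168.1.10 for PC (1), the 60-second duration and the helper names are assumptions, not taken from the posts. Because an iperf3 server runs only one test at a time, each 100Mbps client gets its own server port here.

```shell
#!/bin/sh
# Sketch of the reproduction steps. The helpers only print commands;
# nothing is executed. SERVER_IP and DURATION are assumed values.
SERVER_IP="${SERVER_IP:-192.168.1.10}"   # assumed LAN address of PC (1)
DURATION="${DURATION:-60}"               # assumed test length in seconds

# PC (1), linked at 1Gbps: one iperf3 server per client, each on its own port.
server_cmd() {
    echo "iperf3 -s -p $1"
}

# 100Mbps client: -R reverses the direction so the server sends to the client.
client_cmd() {
    echo "iperf3 -c $SERVER_IP -p $1 -R -t $DURATION"
}

# Print the commands for N clients (default 3), using ports 5201, 5202, ...
print_plan() {
    n="${1:-3}"
    i=0
    while [ "$i" -lt "$n" ]; do
        port=$((5201 + i))
        echo "PC (1):     $(server_cmd "$port")"
        echo "client $((i + 2)):   $(client_cmd "$port")"
        i=$((i + 1))
    done
    echo "PC (1):     start the WAN speed test while the transfers run"
}
```

Calling `print_plan 3` lists the commands for clients 2, 3 and 4; the WAN speed test on PC (1) is then started by hand while all transfers are active.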

In the network setup above if all devices 1, 2, 3 and 4 are connected at 1Gbps the WAN and LAN performance are great. Even if the PC server (1) sends data at 930Mbps through the LAN to 2, 3 and 4 (now all are connected at 1Gbps) at the same time it's still possible to download from WAN on a PC (1) at 930Mbps. We have expected Gigabit Full Duplex performance - 930Mbps download (with 1 ms ping!) from WAN on PC (1) and simultaneous data sending over LAN from PC (1) at 930Mbps to other LAN clients.

In the original Netgear firmware (and Voxel's firmware too) this bug is even harder to notice: if only one 100Mbps client is present and there is LAN traffic as in the setup above, the PC (1) can still achieve 900Mbps download from the WAN, though at a much higher ping. The upload speed to the WAN is noticeably affected: in my case I have 700Mbps upload from my ISP but can only get somewhere between 100 and 200Mbps.
With the stock firmware the LAN performance is hit harder, and noticeably so. The LAN traffic between the PC (1) and the 100Mbps client is interrupted during the WAN speed test. This cannot be considered a more desirable defect, because the LAN data transfer is interrupted.

If I run the tests with 4 devices (one 1Gbps and three 100Mbps), the stock firmware starts to suffer from the same WAN download slowdowns, but the degradation is distributed differently: it alternates between WAN and LAN performance loss.

Got it. I'll try to test it again.

So if I'm understanding it right, we can simulate this with just 2 PCs connected to 2 LAN ports, one of them linked at 1000mbps and the other at 100mbps. As long as both nodes are doing iperf3 transfers, the PC linked at 1000mbps will suffer throughput degradation? At the moment, most of my network nodes are WiFi clients, so I don't have that many ethernet nodes that I can try ... heh heh.

Incidentally, did you configure any qdisc on the ethernet interfaces, e.g. eth0 and/or eth1? I have fq_codel configured for both eth0 and eth1 on my R7800, although it is the NSS firmware qdisc instead of the Linux kernel's.


That is the right setup. Even better if you have 2 or more 100Mbps devices all receiving data over the LAN from the LAN server at the same time. Then the WAN performance on the 1Gbps server becomes a nightmare.
I didn't configure or use fq_codel or any other qdisc, but if that were the cause, wouldn't it also affect the speeds when I had only Gbps devices connected to the router?
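
To check which qdisc is actually attached, `tc qdisc show dev eth0` lists it. The sketch below pulls the root qdisc name out of that output; the helper name `root_qdisc` is made up for illustration, and it assumes the usual `qdisc <kind> <handle> root ...` line layout.

```shell
#!/bin/sh
# Print the kind of the root qdisc from `tc qdisc show` output on stdin.
# Assumes lines of the form: qdisc <kind> <handle> root ...
root_qdisc() {
    awk '$1 == "qdisc" && $4 == "root" { print $2; exit }'
}

# Usage on the router (interface names as used in this thread):
#   tc qdisc show dev eth0 | root_qdisc
#   tc qdisc show dev eth1 | root_qdisc
```

Comparing the result on both routers would at least rule the queueing discipline in or out as a difference between the two setups.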

Interesting. I just did another test with two computers: a 1000mbps client PC (A) and a 100mbps server PC (B). B is running iperf3 as the server.

So when A starts an iperf3 session in the download direction, i.e. B sending data to A, I get the following:

  • no change in ping times when I ping the router (i.e. less than 1 ms)
  • Speedtest (done on A) downloads are 100mbps less than max, i.e. I get around 840mbps
  • Speedtest (done on A) uploads are at max, i.e. I get around 930mbps

When A starts an iperf3 session in the upload direction, i.e. A sending data to B, I get the following:

  • Ping time increased to around 16ms
  • Speedtest (done on A) downloads are at max, around 930mbps
  • Speedtest (done on A) uploads are cut to half, around 470 mbps

I think this looks more like a Linux kernel issue rather than the switch hardware.

The iperf3 runs are both at max for up and down, i.e. 95mbps.
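
For comparing runs, the loss relative to the roughly 930mbps maximum can be worked out with a small helper. This is just a sketch; `loss_pct` is a made-up name and the figures are the ones quoted above.

```shell
#!/bin/sh
# Percentage lost relative to a baseline, rounded to one decimal place.
# Usage: loss_pct <baseline_mbps> <measured_mbps>
loss_pct() {
    awk -v base="$1" -v got="$2" 'BEGIN { printf "%.1f\n", (base - got) / base * 100 }'
}

# loss_pct 930 840   prints 9.7   (download while B sends to A)
# loss_pct 930 470   prints 49.5  (upload while A sends to B)
```

So the download case loses roughly a tenth of the WAN throughput, while the upload case loses about half.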


I think you've spotted the issue now.
Thanks for trying to catch this. That is exactly what I get: unexpected performance degradation that hits the WAN or the LAN side of the network differently depending on the kind of traffic running over the LAN.

I suppose most folks do not notice this issue as they do not run a fileserver and don't normally have 100mbps clients nowadays.

You are right here
In my setup I have a Gigabit switch with even more 100Mbps devices connected, and then things go really bad: 20-30Mbps WAN download with ping over 250-300ms, and none of the 100Mbps devices can download from the WAN at more than 1-2Mbps, even if only one of them is involved in LAN traffic (sending data to or receiving data from the PC server connected at 1Gbps).
Unfortunately all current TVs and other AndroidTV boxes only have 100Mbps NICs.
None of this happens with another really cheap Gigabit router for 20 bucks.
The paradox is that a 100Mbps device shouldn't put even a light load on the Gigabit device but in this case the result is awful.

@quarky
As I understand you are using your own OpenWRT build with NSS offloading.
Can I try your version to see if I get the same results? From what you posted I see that your download speed is less affected. I don't know why.

Unfortunately my builds contain the NSS 11.2 firmware, which I have no rights to redistribute. I would not be able to legally share my builds with you.

Have you tried the builds from @ACwifidude based on the 21.02 tree (Linux kernel 5.4.182 as of my build)? It should be similar.

The issue is unlikely to be caused or even improved by NSS 11.2 as far as I can see (though I may be wrong). I do have custom startup scripts (see below) for my R7800, but it shouldn't be affecting the issue much (well, maybe for the receive and transmit buffers increase.)

# Work-around for CPU freq mux crash?
echo schedutil > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo schedutil > /sys/devices/system/cpu/cpufreq/policy1/scaling_governor
echo 800000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
echo 800000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq

# Increase rcv & send buffer
echo 262144 > /proc/sys/net/core/rmem_default
echo 262144 > /proc/sys/net/core/wmem_default
echo 1048576 > /proc/sys/net/core/rmem_max
echo 1048576 > /proc/sys/net/core/wmem_max

# Configure NSS Qdisc - should reduce CPU usage
tc qdisc add dev eth0 handle 8000: root nssfq_codel interval 100ms target 5ms flows 1024 quantum 1514 limit 10240 set_default
tc qdisc add dev eth1 handle 8000: root nssfq_codel interval 100ms target 5ms flows 1024 quantum 1514 limit 10240 set_default

# Disable NSS ECM multicast acceleration
echo 1 > /sys/kernel/debug/ecm/ecm_nss_ipv6/ecm_nss_multicast_ipv6_stop
echo 1 > /sys/kernel/debug/ecm/ecm_nss_ipv4/ecm_nss_multicast_ipv4_stop

# Enable NSS acceleration only for connections with packets from both ends
echo 1 > /sys/kernel/debug/ecm/ecm_classifier_default/accel_delay_pkts
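
Some of these paths presumably exist only on particular builds (the `ecm` debugfs entries belong to the NSS setup, for instance), so a startup script may want to skip missing ones instead of failing. A minimal sketch, with a hypothetical helper name `write_if_present` that is not part of my actual script:

```shell
#!/bin/sh
# Write a value to a sysfs/procfs/debugfs file only if the file exists
# and is writable. Returns 0 when written, 1 when the path was skipped.
write_if_present() {
    value="$1"
    path="$2"
    if [ -w "$path" ]; then
        echo "$value" > "$path"
    else
        echo "skip: $path not present or not writable" >&2
        return 1
    fi
}

# Example calls, using paths from the script above:
#   write_if_present schedutil /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
#   write_if_present 1 /sys/kernel/debug/ecm/ecm_classifier_default/accel_delay_pkts
```

That way the same script can run on a non-NSS build and simply report which settings it could not apply.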

If you compile your own builds and you somehow managed to get a copy of the NSS 11.2 firmware, you can try to clone my Github repo here and do a custom build.


@quarky
I see and thank you.
This issue was reported on GitHub but may not have been completely understood. What else can we do to inform someone who can take a look at this and hopefully fix it?
Can you try the same tests with a second and possibly a third 100Mbps device (if you have them), just to see whether the WAN performance drops even more, as it does in my configuration?

I suppose this issue is also present with the standard OpenWrt snapshot builds? If so, opening an issue on GitHub would be the official way to track it. I guess the only thing to do is to provide as much detailed information as possible so the problem can be reproduced. Pointing the issue back to this discussion would be good for details, I guess.

I've tried versions without NSS, but without it the speed test results were a lot lower, and it was difficult for me to see a clear difference. Now that I have the right setup to replicate the issue I should try again and compare the results.


To see output caused by some crashes, ramoops console logging is required, like the crash I had with the NSS build. Just having ramoops crash logging wasn't enough.


Hi guys, I'm new here; this is my first rant about this router with OpenWrt.
For the past few days I've been testing this router with OpenWrt, and the performance is really bad and not very stable.
I decided to put the stock firmware back on and it's working amazingly, speed and everything.
I've been searching a lot but haven't found a good solution. I really want to run the NextDNS CLI on it and that's about it.
NextDNS works for a few hours and then stops. I installed a Samba 4 server on it, which works great, but then NextDNS stopped working, and 5GHz is really spotty.
Does anyone have the same router working well? Can you share any tips? Thanks,
and sorry for my English.

The R7800 on stock firmware will utilize proprietary software and NSS hardware offloading. Offers performance advantages but you have the limited features of stock firmware + terrible stock user interface.

There are tons of tweaks to get the performance you are looking for.

  1. cpu governor
  2. irqbalance
  3. packet steering
  4. NSS hardware offloading
  5. There are two drivers you can try, ath10k vs ath10k-ct - some clients like one vs the other.
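
As a rough sketch of tweaks 1 to 3 (NSS offloading needs a custom build and the driver swap is a package change, so neither is covered here): the helper names and the `DRY_RUN` switch are made up for illustration, the `schedutil` default is just one possible governor, and the uci keys assume a reasonably recent OpenWrt release with the irqbalance package installed.

```shell
#!/bin/sh
# Sketch of tweaks 1-3 on OpenWrt. With DRY_RUN=1 (the default here) the
# commands are only printed, so they can be reviewed before running.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. CPU governor: written per cpufreq policy. The default is an assumption.
set_governor() {
    gov="${1:-schedutil}"
    for policy in /sys/devices/system/cpu/cpufreq/policy*; do
        [ -e "$policy/scaling_governor" ] || continue
        run sh -c "echo $gov > $policy/scaling_governor"
    done
}

# 2. irqbalance: assumes the irqbalance package is installed.
enable_irqbalance() {
    run uci set irqbalance.irqbalance.enabled=1
    run uci commit irqbalance
    run /etc/init.d/irqbalance restart
}

# 3. Packet steering: a global network option on recent OpenWrt releases.
enable_packet_steering() {
    run uci set network.globals.packet_steering=1
    run uci commit network
    run /etc/init.d/network reload
}
```

Running the functions with `DRY_RUN=0` applies the changes; trying them one at a time makes it easier to tell which tweak actually helps.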

I've tried all except for the CPU governor. With ath10k, iOS devices don't work until I revert to ath10k-ct. Everything else worked for a few hours and then crapped out. I switched to DD-WRT, which works really close to the original firmware, but NextDNS is not supported yet. At this point DD-WRT is really not worth my time; I'm going to give OpenWrt another try.

I have two weird problems in a recent master build. I created multiple VLANs, one named 'LAN' and the other named 'Manage', and created an interface named 'Manage' to isolate the routers that are not used for WLAN. I put wireless and wired into the 'LAN' interface. Now I can't access LuCI via wireless, but I can via wired. And I can access one of the routers through another router's wireless. I have removed all firewall rules and allowed forwarding. I have tried every combination in the Interfaces->Devices configuration but still no luck. Sometimes it suddenly works for a moment, but most of the time it just can't be accessed.

The other problem is that my iOS device seems to disconnect continuously while asleep and reconnects when it wakes up. However, when it reconnects it can't reach any network; only the wireless link is up (the WiFi logo shows and hostapd logs it as connected), and I have to wait 10 seconds or more before the network really works. My Android device seems to just give up, staying disconnected until I manually connect to WiFi in settings. My WiFi uses EAP-TLS with WPA2. I have tried disabling 802.11r and 802.11w; I only left the KRACK countermeasure enabled. The WPA2-PSK WiFi doesn't seem to have this problem. I changed ath10k-ct to ath10k and set DTIM to 3, but the problem still exists.

When I was using an old build I built myself (very, very old, from 2020) I didn't have this problem.

Weird, iOS devices work fine for me with ath10k (and ath10k-ct).
But I have to say ath10k also seems more stable in general.

I've tried that version yesterday just to confirm absolutely the same bad results with WAN performance as described in my previous posts.

I wanted to report the issue to Netgear, because their firmware suffers the same performance degradation, but unfortunately I cannot open a support case with Netgear Support.
I just get a message stating that complimentary technical support expired on 2021-05-02.

@hnyman very old question... I'm checking the interconnect driver upstream and I notice this...

https://patchwork.ozlabs.org/project/lede/patch/1524029637-12890-1-git-send-email-rjangir@codeaurora.org/

310-msm-adhoc-bus-support.patch

Did we totally miss this driver?