Netgear R7800 exploration (IPQ8065, QCA9984)

After looking for a new device in the "Next router? ath10k or mwlwifi based? something else?" thread, I decided on the Netgear R7800. I started exploring the new device a week ago and wrote my first comments to that thread, but it is probably better to have a new device-specific thread.

Netgear R7800 is a dual-core 1.7 GHz AC2600 router based on the IPQ8065 SoC and QCA9984 wifi (ipq806x target and arm_cortex-a15_neon-vfpv4 packages in LEDE).

Some notes on the current support in OpenWrt (updated in Aug 2018, valid in Apr 2020):

Most normal functionality is ok:

  • Flashing via OEM GUI works (questionable, you might need TFTP)
  • Sysupgrade works
  • USB ports work
  • LEDs work, also ath10k wifi LEDs in 18.06 and master
  • CPU frequency scaling works. Both cores scale independently between 384 - 1700 MHz according to CPU load.
  • Software NAT flow offloading with kernel 4.14 (master, 18.06)
  • Note: eth0 and eth1 are swapped compared to earlier OpenWrt practice, so SQM etc. configs need checking.

What does not work / work in progress:

  • Hardware NAT (and hardware offloading)
  • Native qca8k hardware switch driver. Work in progress (by blogic)
  • Note that all network (eth0, eth1, wlan0, wlan1) IRQs are assigned to core0 by default. With gigabit-class connections, performance benefits can be obtained by distributing the IRQs more evenly, so that core1 also handles some of them. See also irqbalance
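As a rough illustration of distributing the IRQs by hand, here is a minimal sketch. It assumes the IRQs can be found by name in /proc/interrupts and that writing a CPU bitmask to /proc/irq/N/smp_affinity is accepted for them. The function names are made up for this example, and the wifi IRQs are typically listed under the driver name rather than wlan0/wlan1, so check the file on your own device first.

```shell
#!/bin/sh
# Hex CPU bitmask for a single core: core0 -> 1, core1 -> 2.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# Print the IRQ numbers whose /proc/interrupts line mentions the given name.
# An optional second argument points at a saved copy of the file.
irqs_for() {
    awk -v name="$1" '$0 ~ name { sub(":", "", $1); print $1 }' "${2:-/proc/interrupts}"
}

# Pin every IRQ of one device to one core (needs root on the router).
pin_irqs() {
    for irq in $(irqs_for "$1"); do
        cpu_mask "$2" > "/proc/irq/$irq/smp_affinity"
    done
}

# Example (not run automatically): move the eth1 IRQs to core1.
#   pin_irqs eth1 1
```

Which device goes to which core is a judgment call; the split above is only an example, and irqbalance may be the simpler route.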

EDIT: May 2018
kernel 4.14 in master and 18.06 branches

Kernel 4.14 requires a larger flash space for kernel than earlier. Due to this, TFTP flashing is needed once when you move into kernel 4.14 based firmwares. Sysupgrade works normally after that. TFTP flashing needs to be used to move between these two groups:

  • kernel 4.4 or 4.9: 17.01, old master builds (<r7000), plus the very earliest 18.06 builds
  • kernel 4.14: new master builds (>r7000), new 18.06 builds

You need TFTP flash once here, if you upgrade/downgrade from one group to the other group. Sysupgrade works normally inside each of those groups.
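The rule above can be written down as a small helper. This is only a sketch of the decision as described in this post: the function names are invented for the example, and any kernel version outside 4.4, 4.9 and 4.14 is reported as unknown rather than guessed.

```shell
#!/bin/sh
# Map a kernel version to the flash-layout group described above:
# 4.4 and 4.9 share the old layout, 4.14 needs the larger kernel space.
layout_group() {
    case "$1" in
        4.4|4.4.*|4.9|4.9.*) echo old ;;
        4.14|4.14.*)         echo new ;;
        *)                   echo unknown ;;
    esac
}

# Print the flashing method for moving from kernel $1 to kernel $2.
flash_method() {
    from=$(layout_group "$1")
    to=$(layout_group "$2")
    if [ "$from" = unknown ] || [ "$to" = unknown ]; then
        echo unknown
    elif [ "$from" = "$to" ]; then
        echo sysupgrade
    else
        echo tftp
    fi
}
```

On the router the first argument could come from `uname -r`; the kernel version of the target image has to be looked up separately.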

R7800 has a TFTP recovery mode in the u-boot bootloader.
Details at message 5 of this thread: Netgear R7800 exploration (IPQ8065, QCA9984)


Advice on opening Netgear R7800 case and using the serial console

The case bottom and cover are attached to each other pretty tightly, but the case can be opened once you figure out all necessary steps:

  • The circuitboard is attached to a heatsink that is attached to the top cover. So the bottom panel (with sides) is the one to separate from the others. The heatsink is likely attached with thermal paste or similar (I never removed it).
  • There are five screws (Torx T-10) on the bottom that need to be opened. One screw is visible and four are hidden under the rubber feet (which are attached with adhesive tape).
  • There are small clips on the sides keeping the top cover tight even when the screws have been opened. These clips need to be carefully pried open with a flat screwdriver or something similar.
  • Additionally, there are fangs in the vertical backpanel protecting the connectors' holes (by preventing vertical lifting). In practice the front side can be lifted up first and the backside last by pulling forward so that the connectors slide free from the backpanel holes.

The serial header is right at the edge of the circuitboard, so it can be accessed even while the cover is attached to the circuitboard. I bent the three needed pins (ground, rx, tx), which are luckily nearest to the circuitboard's edge, sideways so that I could easily attach the serial cable. I left the farthest +3.3V pin untouched.

Using the markings in my USB-ttl converter, the pin order starting from the circuitboard's edge is: ground, rx, tx, +3.3V
Serial connection is the usual 115200,8,N,1

(The serial header can be seen through the left ventilation grill, so it might be possible to get access to it without opening the case, by cutting a hole in the ventilation grill on the side panel.)

Bootlog starting with u-boot bootloader looks like this (LEDE master r3676-edda26dc4f):
https://gist.github.com/hnyman/8a68f6280ef5d267bba2ee80601fe8f0

EDIT:
I led the wire through the side grill so that I can access the serial header even when the cover is closed.


I got a nice suggestion from tohojo at the SQM scripts development repo, where I opened an issue about this. It may actually be down to how the HTB (hierarchical token bucket) qdisc performs on this CPU...

Well in that case it sounds like it has something to do with the way HTB runs on that CPU. You could try if TBF has the same behaviour; enable sqm, then issue the following commands to replace the configured qdiscs with a TBF-based one (provided TBF is in LEDE; not sure if it is):

tc qdisc del dev eth0 root
tc qdisc del dev ifb4eth0 root
tc qdisc add dev eth0 handle 1: root tbf rate 8Mbit burst 15140 latency 100ms
tc qdisc add dev ifb4eth0 handle 1: root tbf rate 90Mbit burst 15140 latency 100ms
tc qdisc add dev eth0 handle 2: parent 1:1 fq_codel
tc qdisc add dev ifb4eth0 handle 2: parent 1:1 fq_codel

I tested that suggestion and did a quick and dirty Ookla speedtest and the resulting change is impressive:

  • simple with original HTB: 77 / 7 Mbit/s
  • simple with TBF (above): 85 / 8 Mbit/s

So, an immediate jump of 8 Mbit/s in the download speed, and now the throughput is much closer to the set limit of 90 / 8. HTB might be the culprit (but I have no idea yet what the ultimate reason for the bad performance is).
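For repeating the test with other interfaces or rates, the commands above can be generated instead of typed. This is a minimal sketch that only prints the tc commands; the function name is made up for the example, and the burst and latency values are simply the ones quoted in the suggestion, not tuned for any particular line.

```shell
#!/bin/sh
# Print (not run) the tc commands that replace the root qdisc of one
# interface with TBF plus an fq_codel child, as in the suggestion above.
# $1 = interface, $2 = rate (e.g. 90Mbit)
tbf_commands() {
    echo "tc qdisc del dev $1 root"
    echo "tc qdisc add dev $1 handle 1: root tbf rate $2 burst 15140 latency 100ms"
    echo "tc qdisc add dev $1 handle 2: parent 1:1 fq_codel"
}

# Review the output first, then pipe it to sh as root, e.g.:
#   tbf_commands eth0 8Mbit | sh
#   tbf_commands ifb4eth0 90Mbit | sh
```

Printing first keeps the sketch harmless on a machine without SQM enabled, where the ifb4eth0 device would not exist.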


After realising that the HTB qdisc is part of the problem for R7800 I looked into the original sources of HTB at Linux upstream and noticed that there are a bunch of performance patches that have been implemented in Linux stable after 4.4 that we use here.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/net/sched/sch_htb.c?h=linux-4.8.y

Everything after Feb 2016 is not in 4.4, about 10 commits in total. I looked at the commit list and one patch looks especially interesting, as it promises to fix a major performance issue on multi-core processors (like the dual-core ipq8065 in R7800).
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.8.y&id=a9efad8b24bd22616f6c749a6c029957dc76542b

I found a serious performance bug in packet schedulers using hrtimers. sch_htb and sch_fq are definitely impacted by this problem. ... This issue is particularly visible when multiple cpus can queue/dequeue packets on the same qdisc, as hrtimer code has to lock a remote base.

I compiled a firmware with that patch backported to 4.4, and SQM simple's download throughput improved by 1.5-2.0 Mbit/s. So, a partial fix, but not yet perfect.

I will look also into the other htb patches, but good to know that at least a partial reason has been found.


R7800 has an easy-to-use TFTP recovery mode in the bootloader, similar to e.g. the WNDR3700.

@dissent1 gave me the info, which was really useful as I managed to brick the router a few minutes ago when testing thermal drivers. I recovered from a boot loop with this process. Works like a charm.

Prerequisites for TFTP recovery

  • A TFTP client for your computer. There are both command-line tools and GUI versions available.
    • (I currently use the tftpd64 GUI tool from jounin. The tftp2 tool (from dd-wrt) used to work earlier, but for some reason does not work with the current master images.)
  • Your computer must have an IP address from the 192.168.1.x network, as the router bootloader's TFTP recovery mode defaults to 192.168.1.1. You might need to configure the address manually, as some operating systems change the IP rather quickly to a link-local 169.254.x.x address if there is no DHCP server. Verify that your PC still has 192.168.1.x before trying to TFTP.
  • A new firmware to flash in. Either an original Netgear firmware or an OpenWrt "factory.img" firmware. The "sysupgrade" version will not work.
  • Access to router's reset button (on the back panel)

Recovery process

  1. Turn off the power, push and hold the reset button with a pin
  2. Turn on the power and wait till power led starts flashing white (after it first flashes orange for a while)
  3. Release the pin and use tftp in binary mode to send the factory img to the router. The power led will stop flashing if you succeeded in transferring the image, and the router reboots rather quickly with the new firmware.
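On a Linux PC the steps above might look roughly like the sketch below. The interface name eth0, the address 192.168.1.10 and the file name factory.img are placeholders, and the tftp options shown are those of the tftp-hpa command-line client; other clients use different syntax. The helper function is invented for this example and only checks the address prefix.

```shell
#!/bin/sh
# Succeeds only if the address is in 192.168.1.x and is not the router's
# own 192.168.1.1. A link-local 169.254.x.x address fails the check.
in_recovery_subnet() {
    case "$1" in
        192.168.1.1)  return 1 ;;
        192.168.1.*)  return 0 ;;
        *)            return 1 ;;
    esac
}

# Example session (commented out; adjust the placeholders before use):
#   ip addr add 192.168.1.10/24 dev eth0
#   in_recovery_subnet 192.168.1.10 && \
#       tftp -m binary 192.168.1.1 -c put factory.img
```

Re-running the address check right before the transfer guards against the operating system having silently switched to a link-local address in the meantime.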

Note that this recovery mode is in the bootloader (u-boot?), so it works before the actual firmware gets started.

The process is quite similar to that of e.g. the WNDR3700, which is documented in the OpenWrt wiki: https://wiki.openwrt.org/toh/netgear/wndr3700#recovery_flash_in_failsafe_mode


The newest commit removed the "workaround" to use wifi & wps button LEDs for wifi 2G/5G status indication.
https://git.lede-project.org/?p=source.git;a=commitdiff;h=3228c2a682f287337db3dda880bc1e35ebaa7ce9

We have to manually add the workaround back, either directly to /etc/config/system or to the LED detection script in /etc/board.d. Note that the "rfkill" LED has also been renamed "wifi", so the previous workaround does not quite work.

I added this to my own build:

 r7800)
 	ucidef_set_led_usbport "usb1" "USB 1" "${board}:white:usb1" "usb1-port1" "usb2-port1"
 	ucidef_set_led_usbport "usb2" "USB 2" "${board}:white:usb2" "usb3-port1" "usb4-port1"
 	ucidef_set_led_netdev "wan" "WAN" "${board}:white:wan" "eth0"
 	ucidef_set_led_ide "esata" "eSATA" "${board}:white:esata"
+	ucidef_set_led_wlan "wlan2g" "WLAN 2G" "${board}:white:wifi" "phy1tpt"
+	ucidef_set_led_wlan "wlan5g" "WLAN 5G" "${board}:white:wps" "phy0tpt"
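For the other route, editing /etc/config/system on a running router, a rough equivalent could look like the sketch below. It assumes that ${board} expands to "r7800" on this device and that the LED sections use the usual name/sysfs/trigger options; the helper function and the section names are made up for this example.

```shell
#!/bin/sh
# Print "uci batch" input that defines one LED section in /etc/config/system.
# $1 = section suffix, $2 = label, $3 = sysfs LED name, $4 = trigger
led_batch() {
    echo "set system.led_$1=led"
    echo "set system.led_$1.name='$2'"
    echo "set system.led_$1.sysfs='$3'"
    echo "set system.led_$1.trigger='$4'"
}

# On the router (commented out; run as root):
#   { led_batch wlan2g 'WLAN 2G' r7800:white:wifi phy1tpt
#     led_batch wlan5g 'WLAN 5G' r7800:white:wps  phy0tpt
#     echo 'commit system'; } | uci batch
#   /etc/init.d/led restart
```

Unlike the board.d change, this only affects the running configuration and does not survive a flash that resets the settings.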

Does anybody know any specifics about the hardware random number generator in R7800? (likely in the ipq8065 itself)

I noticed that there is /dev/hwrng and it seems to produce random numbers with good speed. And the random numbers pass FIPS tests quite well. Based on ipq806x patch 307, the hwrng is still a pseudo-rng.

So I am considering using rng-tools to load quality entropy from /dev/hwrng to kernel, but it would be great to hear first if anybody has experimented with the built-in rng in ipq806x series.

root@lede:~# rngtest < /dev/hwrng
rngtest 5
Copyright (c) 2004 by Henrique de Moraes Holschuh
This is free software; see the source for copying conditions.  There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

rngtest: starting FIPS tests...
^Crngtest: bits received from input: 78800032
rngtest: FIPS 140-2 successes: 3933
rngtest: FIPS 140-2 failures: 7
rngtest: FIPS 140-2(2001-10-10) Monobit: 2
rngtest: FIPS 140-2(2001-10-10) Poker: 2
rngtest: FIPS 140-2(2001-10-10) Runs: 2
rngtest: FIPS 140-2(2001-10-10) Long run: 1
rngtest: FIPS 140-2(2001-10-10) Continuous run: 0
rngtest: input channel speed: (min=110.841; avg=10080.600; max=13322.817)Kibits/s
rngtest: FIPS tests speed: (min=471.416; avg=67176.523; max=78755.040)Kibits/s
rngtest: Program run time: 8797221 microseconds
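To put the numbers above in proportion, the share of failed FIPS blocks can be computed from the two counters. This is a small sketch with an invented function name; the rngd line in the comment is how rng-tools would typically be pointed at the device, but check the options of the version you install.

```shell
#!/bin/sh
# Percentage of failed FIPS 140-2 blocks, given successes and failures.
fips_failure_pct() {
    awk -v ok="$1" -v bad="$2" 'BEGIN { printf "%.3f\n", 100 * bad / (ok + bad) }'
}

# Counters from the run above: 3933 successes, 7 failures.
fips_failure_pct 3933 7    # prints 0.178

# Feeding the kernel pool would then be something like (not run here):
#   rngd -r /dev/hwrng
```

So about 0.18 % of the 3940 tested blocks failed in this run.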

I've had exactly the same thoughts on the prng device and using rng-tools, but I have found no value in it for my use case, so I haven't gone any further than just thinking about it, although I did enable it in the device tree for others.
What could be the usecase?

Well, entropy is needed for anything that needs good randomness, starting from SSH key generation for dropbear and SSL key generation for uhttpd (LuCI web server) after the initial flash. Five years ago it was an actual problem for embedded devices, as Linux had removed network drivers as entropy sources, so there was a real danger that devices would start with rather identical random seed values. That was fixed to an extent with an ath9k patch in 2013 to again get some real-world randomness from the wifi driver. For reference: https://dev.openwrt.org/ticket/9631 and https://dev.openwrt.org/changeset/38486

Currently I see about 800 bits of available entropy in a quiet router and about 2000 bits in a more live one (R7800). That can get consumed pretty quickly if you e.g. frequently generate 1024/2048 bit keys.

root@lede:~# cat /proc/sys/kernel/random/entropy_avail
1939

I used haveged for a while in 2012-2014, but have dropped it since the wifi randomness addition.

I have not verified that there is a similar "use real world randomness" patch also for ath10k, but I suppose that there is one... (@nbd likely knows that best)

I agree that there is no urgent use case for a home router, but a router in an environment that frequently requires good randomness for VPN etc. keys/encryption might have one. If ipq8065's built-in prng /dev/hwrng is good, it might be utilized.

I've merged the performance related commit that hnyman posted to my staging tree at https://git.lede-project.org/?p=lede/nbd/staging.git;a=summary, along with some other commits. Please test if it improves HTB performance


I have been using that HTB commit in my build for three weeks and it seems to help a bit, but it is no magic bullet that fixes everything. There are several HTB related changes in Linux 4.8, but I guess we have to wait until LEDE gets to 4.8, as those commits would touch all/most qdiscs and seem too burdensome to backport (without understanding the possible related changes in the kernel itself).

One other interesting observation: the CPU frequency scaling has some peculiarities. Usually the router runs both cores at 384 MHz when there is no load, but sometimes the cores seem to get stuck running at 600-1000 MHz. They do scale up if needed, but do not scale back down to 384. That has happened to me several times when I have monitored the CPU after a firmware flash. But so far it has never happened with a plain reboot. So, the behaviour is likely somehow connected to the initial boot after flashing.
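One way to watch for the stuck-frequency behaviour is to log the per-core frequency from sysfs. This is a minimal sketch that assumes the standard cpufreq files (scaling_cur_freq, in kHz) are present; the function name is made up, and the optional argument exists only so the function can be tried against a copy of the directory tree.

```shell
#!/bin/sh
# Print the current frequency of every core in MHz.
# $1 = sysfs CPU directory (defaults to the real one).
show_cpu_freqs() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue
        cpu=${f#"$root"/}      # strip the leading directory
        cpu=${cpu%%/*}         # keep only "cpuN"
        echo "$cpu $(( $(cat "$f") / 1000 )) MHz"
    done
}

# Log once a minute after a flash to catch cores that stay above 384 MHz:
#   while sleep 60; do date; show_cpu_freqs; done
```

Comparing such a log after a flash and after a plain reboot would show whether the cores really never drop back to 384 MHz in the first case.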


Could you apply these changes from my tree https://github.com/dissent1/r7800/commit/097bb6ee1733a32fb62ac7df5b092c00bcf0a669 and check if you see the same stuck-frequency behaviour again?
This commit adds a generic Linux frequency table instead of the one built into the krait driver. The reason I'm not creating a PR is that I cannot verify if there's any benefit from it.
You can go even further and try to remove the pvs tables from the device tree while applying the above commit and see how it goes.


I applied the CPU frequency patch, but at first glance it looks like the system is not quite happy about it, as there are new warnings in the kernel log. The processor started at 384 MHz, so it seems to work ok in principle. The CPU frequency scales up and down according to the load quite normally.

Kernel log contains errors / info messages due to "duplicate OPPs detected". The new values also seem to match the old ones that are replaced, so based on that message my initial verdict is that the patch does not have much effect and is likely unnecessary.

[    3.646651] L2 @ QSB rate. Forcing new rate.
[    3.650482] L2 @ 384000 KHz
[    3.654679] CPU0 @ 800000 KHz
[    3.657078] CPU1 @ QSB rate. Forcing new rate.
[    3.660343] CPU1 @ 384000 KHz
...
[    3.697124] cpu cpu0: _opp_add: duplicate OPPs detected. Existing: freq: 384000000, volt: 875000, enabled: 1. New: freq: 384000000, volt: 875000, enabled: 1
[    3.697914] cpu cpu0: _opp_add: duplicate OPPs detected. Existing: freq: 600000000, volt: 900000, enabled: 1. New: freq: 600000000, volt: 900000, enabled: 1
[    3.712172] cpu cpu0: _opp_add: duplicate OPPs detected. Existing: freq: 800000000, volt: 950000, enabled: 1. New: freq: 800000000, volt: 950000, enabled: 1
[    3.726114] cpu cpu0: _opp_add: duplicate OPPs detected. Existing: freq: 1000000000, volt: 1000000, enabled: 1. New: freq: 1000000000, volt: 1000000, enabled: 1
[    3.740079] cpu cpu0: _opp_add: duplicate OPPs detected. Existing: freq: 1400000000, volt: 1075000, enabled: 1. New: freq: 1400000000, volt: 1075000, enabled: 1
[    3.754418] cpu cpu0: _opp_add: duplicate OPPs detected. Existing: freq: 1725000000, volt: 1150000, enabled: 1. New: freq: 1725000000, volt: 1150000, enabled: 1
[    3.769388] cpu cpu1: _opp_add: duplicate OPPs detected. Existing: freq: 384000000, volt: 875000, enabled: 1. New: freq: 384000000, volt: 875000, enabled: 1
[    3.783015] cpu cpu1: _opp_add: duplicate OPPs detected. Existing: freq: 600000000, volt: 900000, enabled: 1. New: freq: 600000000, volt: 900000, enabled: 1
[    3.797010] cpu cpu1: _opp_add: duplicate OPPs detected. Existing: freq: 800000000, volt: 950000, enabled: 1. New: freq: 800000000, volt: 950000, enabled: 1
[    3.810977] cpu cpu1: _opp_add: duplicate OPPs detected. Existing: freq: 1000000000, volt: 1000000, enabled: 1. New: freq: 1000000000, volt: 1000000, enabled: 1
[    3.824938] cpu cpu1: _opp_add: duplicate OPPs detected. Existing: freq: 1400000000, volt: 1075000, enabled: 1. New: freq: 1400000000, volt: 1075000, enabled: 1
[    3.839284] cpu cpu1: _opp_add: duplicate OPPs detected. Existing: freq: 1725000000, volt: 1150000, enabled: 1. New: freq: 1725000000, volt: 1150000, enabled: 1

Those warnings are expected, that's why I suggested "You can go even further and try to remove pvs tables from device tree while applying the above commit and see how it goes."

The point is that I don't know how the krait driver propagates the frequency parameters (pvs bins in the device tree) compared to the above commit, which uses the generic and expected way (but is not quite tested on krait), so it may still have an effect that's not noticed at first glance.


I have a question about your forays into sqm. I have a c2600, so pretty close to your hardware, and sqm seems to work fine, except for one glitch. Youtube goes absolutely stupid haywire - it will always drop down to 360p and will refuse to stream at any speed that resembles fast. Disabling sqm fixes it immediately, and no, it doesn't matter if I use fq_codel, or cake, or any of the shipped presets. What's even stupider is that transfers and benchmarks work perfectly fine. Can you replicate this in any way on your end?

[quote="npr, post:15, topic:285"]
Youtube goes absolutely stupid haywire - it will always drop down to 360p and will refuse to stream in any speed that resembles fast
[/quote]

I have not noticed anything like that. With R7800 and SQM (simple/fq_codel) I can watch 1080p videos from youtube just fine. Works ok with Firefox, IE11 and Edge.

I've found a frequency table in qsdk repository that differs from ours.
https://source.codeaurora.org/quic/qsdk/system/openwrt/tree/arch/arm/boot/dts/qcom-ipq8064-v3.0.dtsi?h=korg/linux-3.4.y/release/arugula_bb_cs

With the upstream kernel driver, the device seems to choose pvs bin 4 (parsed from the efuses, a kind of nvmem), which is absent from that table. But according to the driver in that qsdk repo, the default for ipq8065 is pvs bin 0.

So for testing purposes I've adjusted pvs bin 4 in lede to reflect pvs bin 0 in qsdk.


I'm running it right now, but I don't use qos, so I cannot test whether its behaviour is fixed. This is the last difference from ipq8064 devices that may affect performance. On ipq8064, pvs bin 0 is chosen in lede automatically.

Please test this.
The point is that ipq8064 units do not suffer from that qos problem and I'm trying to figure out what can cause it by comparing differences between device trees.
Ipq8064 devices use those voltage values, adjusted for a top frequency of 1.4 GHz instead of 1.7 GHz.
I've run it for 3 days without issues, and it should be safe as it's within the accepted range for ipq8065.

I'm running the latest lede snapshot on an r7800 now. Seems a formidable device.
only thing i'm seeing in the logs currently is this:

[62665.535785] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[62665.535858] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[62665.543055] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[62665.551125] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[67835.085170] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode

quick google shows me: http://lists.infradead.org/pipermail/ath10k/2016-September/008448.html
and:
http://svn.dd-wrt.com/ticket/5666

But if i check my firmware build, the system already seems to be using this firmware version, so the issue is not yet properly resolved?

anyone any ideas?

To answer the question by @Tusc in Build for Netgear R7800 I ran the OpenSSL benchmarks with my LEDE 17.01 release branch build:

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5               7653.87k    28352.66k    79638.44k   142638.42k   185739.95k
sha1              8427.71k    33833.88k    97973.85k   188711.90k   258015.12k
des cbc          23428.22k    24085.74k    24288.17k    24244.22k    24199.17k
des ede3          8499.86k     8619.03k     8676.17k     8640.17k     8634.37k
aes-128 cbc      51520.66k    58131.61k    60423.08k    60828.33k    60738.22k
aes-192 cbc      46107.37k    48218.67k    52193.02k    52509.70k    52532.57k
aes-256 cbc      43055.24k    46332.63k    47803.65k    47899.31k    47934.12k
sha256           19601.95k    51452.35k    99644.33k   130418.35k   143253.50k
sha512            7794.49k    31499.69k    47606.87k    66959.70k    76054.53k
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.007564s 0.000168s    132.2   5944.7
                  sign    verify    sign/s verify/s
dsa 2048 bits 0.001635s 0.001682s    611.7    594.7

Wiki format:

| r3006 | ARMv7 Processor rev 0 (v7l) | 6.00 | ARMv7 Processor rev 0 (v7l) | 56.15 | Qualcomm (Flattened Device Tree) | 1.0.2j | 142638420 | 188711900 | 130418350 | 66959700 | 24244220 | 8640170 | 60828330 | 52509700 | 47899310 | 132.2 | 5944.7 | 611.7 | 594.7 |
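The wiki row uses plain bytes per second, taken from the 1024 bytes column of the table above. The conversion from the "k" figures (1000s of bytes per second) can be sketched as below; the function name is made up for this example.

```shell
#!/bin/sh
# Convert an "openssl speed" figure such as 142638.42k (1000s of bytes/s)
# into plain bytes per second, as used in the wiki row.
k_to_bytes() {
    awk -v v="${1%k}" 'BEGIN { printf "%.0f\n", v * 1000 }'
}

k_to_bytes 142638.42k    # md5, 1024-byte blocks: prints 142638420
```

Rounding with %.0f rather than truncating avoids an off-by-one from floating-point representation of the two-decimal input.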