I've posted a few benchmarks to the GitHub PR: https://github.com/openwrt/openwrt/pull/3701#issuecomment-756947314
I recently received my NanoPi R4S and was able to get it running quickly with this code.
I made a few small changes, though, to get the most out of it:
- Use the Google OP1 overclock to 2.0GHz:
  edit arch/arm64/boot/dts/rockchip/rk3399-nanopi4.dtsi and use
  #include "rk3399-op1-opp.dtsi" instead of #include "rk3399-opp.dtsi"
- Optimize with -O3 instead of -Os
- Optimize for RK3399:
  -march=armv8-a+crypto+crc -mcpu=cortex-a73.cortex-a53+crypto+crc -mtune=cortex-a73.cortex-a53
This can hopefully be done better/cleaner than this:
diff --git a/include/target.mk b/include/target.mk
index edc6a146de..e7217c6181 100644
--- a/include/target.mk
+++ b/include/target.mk
@@ -194,7 +194,7 @@ LINUX_RECONF_DIFF = $(SCRIPT_DIR)/kconfig.pl - '>' $(call __linux_confcmd,$(filt
ifeq ($(DUMP),1)
BuildTarget=$(BuildTargets/DumpCurrent)
- CPU_CFLAGS = -Os -pipe
+ CPU_CFLAGS = -O3 -pipe
ifneq ($(findstring mips,$(ARCH)),)
ifneq ($(findstring mips64,$(ARCH)),)
CPU_TYPE ?= mips64
@@ -235,7 +235,7 @@ ifeq ($(DUMP),1)
endif
ifeq ($(ARCH),aarch64)
CPU_TYPE ?= generic
- CPU_CFLAGS_generic = -mcpu=generic
+ CPU_CFLAGS_generic = -march=armv8-a+crypto+crc -mcpu=cortex-a73.cortex-a53+crypto+crc -mtune=cortex-a73.cortex-a53
CPU_CFLAGS_cortex-a53 = -mcpu=cortex-a53
endif
ifeq ($(ARCH),arc)
diff --git a/target/linux/rockchip/patches-5.4/202-rockchip-rk3399-Overclock-and-Undervolt-from-Google-OP1.patch b/target/linux/rockchip/patches-5.4/202-rockchip-rk3399-Overclock-and-Undervolt-from-Google-OP1.patch
new file mode 100644
index 0000000000..d0fc1d1a0f
--- /dev/null
+++ b/target/linux/rockchip/patches-5.4/202-rockchip-rk3399-Overclock-and-Undervolt-from-Google-OP1.patch
@@ -0,0 +1,13 @@
+Index: linux-5.4.86/arch/arm64/boot/dts/rockchip/rk3399-nanopi4.dtsi
+===================================================================
+--- linux-5.4.86.orig/arch/arm64/boot/dts/rockchip/rk3399-nanopi4.dtsi
++++ linux-5.4.86/arch/arm64/boot/dts/rockchip/rk3399-nanopi4.dtsi
+@@ -14,7 +14,7 @@
+ /dts-v1/;
+ #include <dt-bindings/input/linux-event-codes.h>
+ #include "rk3399.dtsi"
+-#include "rk3399-opp.dtsi"
++#include "rk3399-op1-opp.dtsi"
+
+ / {
+ chosen {
I'm not sure whether OpenWRT would even consider this, though.
I also didn't directly benchmark its impact.
Now for some testing:
FriendlyWRT 5.10.2 1.8GHz
# echo 10 > /proc/irq/35/smp_affinity
# echo 20 > /proc/irq/90/smp_affinity
# echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
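For anyone wondering about the numbers written to smp_affinity: they are hexadecimal CPU bitmasks, so 10 selects CPU 4 and 20 selects CPU 5 (the two big cores on the RK3399). A minimal sketch of how to derive the mask; the helper name cpu_mask is made up for illustration:

```shell
# Hypothetical helper: print the hex smp_affinity mask for a single CPU index.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

cpu_mask 4   # prints 10
cpu_mask 5   # prints 20
```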
Idle power usage:
Conservative = 3.1-3.2W | Performance = 3.2-3.3W
LAN INTERFACE TX
root@FriendlyWrt:~# iperf3 -c HOST
[ 5] 0.00-10.00 sec 1.10 GBytes 943 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver
CPU 5: 90%
Power usage: 5.6 Watt
LAN INTERFACE RX
root@FriendlyWrt:~# iperf3 -c HOST -R
[ 5] 0.00-10.00 sec 1.10 GBytes 944 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver
CPU 4: 80% from iperf3 | CPU 5: 100%
Power usage: 6.8 Watt
LAN INTERFACE BIDIR
root@FriendlyWrt:~# iperf3 -c HOST --bidir
[ 5][TX-C] 0.00-10.00 sec 1.07 GBytes 919 Mbits/sec 0 sender
[ 5][TX-C] 0.00-10.00 sec 1.07 GBytes 918 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1.09 GBytes 940 Mbits/sec 0 sender
[ 7][RX-C] 0.00-10.00 sec 1.09 GBytes 937 Mbits/sec receiver
CPU 4: 70% (mostly iperf) | CPU 5: 100%
Power usage: 7.0 Watt
WAN INTERFACE TX
root@FriendlyWrt:~# iperf3 -c HOST
[ 5] 0.00-10.00 sec 1.10 GBytes 944 Mbits/sec 0 sender
[ 5] 0.00-10.01 sec 1.10 GBytes 941 Mbits/sec receiver
CPU 4: 50%
Power usage: 5.3 Watt
WAN INTERFACE RX
root@FriendlyWrt:~# iperf3 -c HOST -R
[ 5] 0.00-10.00 sec 1.10 GBytes 942 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver
CPU 4: 28%
Power usage: 5.2 Watt
WAN INTERFACE BIDIR
$ iperf3 -c HOST --bidir
[ 5][TX-C] 0.00-10.00 sec 1.08 GBytes 932 Mbits/sec 0 sender
[ 5][TX-C] 0.00-10.00 sec 1.08 GBytes 929 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1.07 GBytes 921 Mbits/sec 1 sender
[ 7][RX-C] 0.00-10.00 sec 1.07 GBytes 918 Mbits/sec receiver
CPU 4: 90%
Power usage: 6.1 Watt
ROUTING LAN -> WAN
$ iperf3 -c HOST
[ 5] 0.00-10.00 sec 1.10 GBytes 942 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.09 GBytes 939 Mbits/sec receiver
CPU 4: 80% | CPU 5: 80%
Power usage: 7.3 Watt
ROUTING LAN <-> WAN BIDIR
$ iperf3 -c HOST --bidir
[ 5][TX-C] 0.00-10.00 sec 1.08 GBytes 928 Mbits/sec 0 sender
[ 5][TX-C] 0.00-10.02 sec 1.08 GBytes 925 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 491 MBytes 412 Mbits/sec 150 sender
[ 7][RX-C] 0.00-10.02 sec 487 MBytes 408 Mbits/sec receiver
CPU 4: 80% | CPU 5: 100%
Power usage: 8.2 Watt
(This result is very erratic and can be much worse.)
SMP affinity seems to be broken in this firmware.
OpenWRT 5.4.86 2GHz
# echo 10 > /proc/irq/35/smp_affinity
# echo 20 > /proc/irq/90/smp_affinity
# echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Idle power usage: 3.1W
r8169 kernel driver
LAN INTERFACE TX
root@OpenWRT:~# iperf3 -c HOST
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.09 GBytes 939 Mbits/sec receiver
CPU 5: 83%
Power usage: 5.7 Watt
LAN INTERFACE RX
root@OpenWRT:~# iperf3 -c HOST -R
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.09 GBytes 939 Mbits/sec receiver
CPU 4: 60% from iperf3 | CPU 5: 96%
Power usage: 7.4 Watt
LAN INTERFACE BIDIR
root@OpenWRT:~# iperf3 -c HOST --bidir
[ 5][TX-C] 0.00-10.00 sec 1.03 GBytes 882 Mbits/sec 0 sender
[ 5][TX-C] 0.00-10.00 sec 1.02 GBytes 880 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1.08 GBytes 927 Mbits/sec 0 sender
[ 7][RX-C] 0.00-10.00 sec 1.08 GBytes 924 Mbits/sec receiver
CPU 4: 70% (mostly iperf) | CPU 5: 100%
Power usage: 7.6 Watt
r8168-8.048.03 realtek kernel module
LAN INTERFACE TX
root@OpenWRT:~# iperf3 -c HOST
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.09 GBytes 940 Mbits/sec receiver
CPU 5: 20%
Power usage: 4.6 Watt
LAN INTERFACE RX
root@OpenWRT:~# iperf3 -c HOST -R
[ 5] 0.00-10.00 sec 1.10 GBytes 943 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver
CPU 4: 15% from iperf3 | CPU 5: 30%
Power usage: 5.2 Watt
LAN INTERFACE BIDIR
root@OpenWRT:~# iperf3 -c HOST --bidir
[ 5][TX-C] 0.00-10.00 sec 1.07 GBytes 918 Mbits/sec 0 sender
[ 5][TX-C] 0.00-10.00 sec 1.07 GBytes 916 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec 0 sender
[ 7][RX-C] 0.00-10.00 sec 1.09 GBytes 937 Mbits/sec receiver
CPU 4: 20% from iperf3 | CPU 5: 30%
Power usage: 5.4 Watt
Using the r8168-8.048.03 Realtek kernel module for the next tests, because it performs much better than the in-kernel r8169:
(You'll also see that the WAN / eth0 / SoC integrated / st_gmac / mdio / rgmii / RTL8211E? NIC or its driver is crap compared to the PCIe R8111H LAN / eth1 NIC + r8168.)
WAN INTERFACE TX
root@FriendlyWrt:~# iperf3 -c HOST
[ 5] 0.00-10.00 sec 1.10 GBytes 944 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver
CPU 4: 86%
Power usage: 5.5 Watt
WAN INTERFACE RX
root@FriendlyWrt:~# iperf3 -c HOST -R
[ 5] 0.00-10.00 sec 1.10 GBytes 943 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver
CPU 4: 96% | CPU 5: 55% from iperf3
Power usage: 6.8 Watt
WAN INTERFACE BIDIR
root@FriendlyWrt:~# iperf3 -c HOST --bidir
[ 5][TX-C] 0.00-10.00 sec 1.08 GBytes 925 Mbits/sec 0 sender
[ 5][TX-C] 0.00-10.00 sec 1.08 GBytes 923 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1.07 GBytes 920 Mbits/sec 0 sender
[ 7][RX-C] 0.00-10.00 sec 1.07 GBytes 918 Mbits/sec receiver
CPU 4: 100% | CPU 5: 30% from iperf3
Power usage: 6.3 Watt
ROUTING WAN -> LAN
$ iperf3 -c HOST
[ 5] 0.00-10.00 sec 1.10 GBytes 943 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.09 GBytes 940 Mbits/sec receiver
CPU 4: 90% | CPU 5: 30%
Power usage: 6.6 Watt
ROUTING LAN -> WAN
$ iperf3 -c HOST -R
[ 5] 0.00-10.00 sec 1.10 GBytes 943 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver
CPU 4: 80% | CPU 5: 20%
Power usage: 6.6 Watt
ROUTING LAN <-> WAN BIDIR
$ iperf3 -c HOST --bidir
[ 5][TX-C] 0.00-10.00 sec 1.08 GBytes 931 Mbits/sec 27 sender
[ 5][TX-C] 0.00-10.00 sec 1.08 GBytes 929 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1007 MBytes 845 Mbits/sec 194 sender
[ 7][RX-C] 0.00-10.00 sec 1003 MBytes 842 Mbits/sec receiver
CPU 4: 100% | CPU 5: 45%
Power usage: 7.1 Watt
ROUTING WAN -> LAN with SQM 1 000 000 kbit/s
$ iperf3 -c HOST
[ 5] 0.00-10.00 sec 1.10 GBytes 943 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver
CPU 4: 97% | CPU 5: 88%
Power usage: 7.8 Watt
+0ms RTT on LAN
ROUTING LAN -> WAN with SQM 1 000 000 kbit/s
$ iperf3 -c HOST -R
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec 0 sender
[ 5] 0.00-10.00 sec 1.09 GBytes 939 Mbits/sec receiver
CPU 4: 100% | CPU 5: 20%
Power usage: 6.8 Watt
+7-8ms RTT on LAN
ROUTING LAN <-> WAN BIDIR with SQM 1 000 000 kbit/s
$ iperf3 -c HOST --bidir
[ 5][TX-C] 0.00-10.00 sec 974 MBytes 817 Mbits/sec 2 sender
[ 5][TX-C] 0.00-10.00 sec 971 MBytes 814 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 1004 MBytes 843 Mbits/sec 3 sender
[ 7][RX-C] 0.00-10.00 sec 1001 MBytes 839 Mbits/sec receiver
CPU 4: 100% | CPU 5: 100%
Power usage: 7.9 Watt
+10ms RTT on LAN
ROUTING LAN <-> WAN BIDIR with SQM 800 000 kbit/s
$ iperf3 -c HOST --bidir
[ 5][TX-C] 0.00-10.00 sec 911 MBytes 764 Mbits/sec 6 sender
[ 5][TX-C] 0.00-10.00 sec 907 MBytes 761 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 902 MBytes 757 Mbits/sec 7 sender
[ 7][RX-C] 0.00-10.00 sec 900 MBytes 755 Mbits/sec receiver
CPU 4: 99% | CPU 5: 100%
Power usage: 7.9 Watt
+0.5-1ms RTT on LAN
ROUTING LAN <-> WAN BIDIR 4 (total 8) PARALLEL with SQM 800 000 kbit/s
$ iperf3 -c HOST --bidir -P 4
[SUM][TX-C] 0.00-10.00 sec 844 MBytes 708 Mbits/sec 68 sender
[SUM][TX-C] 0.00-10.00 sec 840 MBytes 705 Mbits/sec receiver
[SUM][RX-C] 0.00-10.00 sec 902 MBytes 757 Mbits/sec 61 sender
[SUM][RX-C] 0.00-10.00 sec 897 MBytes 753 Mbits/sec receiver
CPU 4: 100% | CPU 5: 100%
Power usage: 7.9 Watt
+0.5-1ms RTT on LAN
Conclusion
- Definitely use 2.0GHz: the device runs stable and cool with it, and it needs the extra oomph to handle gigabit.
  Max recorded temperature was 49°C with the included metal case (solid aluminium block with thermal pad and fins).
- I didn't benchmark the CFLAGS, but adding them surely doesn't hurt.
  Installing software from the repo that is optimized for generic armv8 works fine nonetheless.
- Forget about the in-tree r8169 kernel module. Use the Realtek r8168 one:
  https://github.com/BROBIRD/openwrt-r8168
- Investigate the performance issues of the WAN NIC.
  It looks like it performs somewhat better on FriendlyWRT with Linux kernel 5.10.
  Is there a technical explanation for why it performs so much worse than the PCIe NIC?
Helpful debug stuff
perf top -C 4
while iperf over WAN interface is running: (CPU4 86%)
echo $((-$(awk '$11 == "eth0" { print $6 }' /proc/interrupts)+$(sleep 1 ; awk '$11 == "eth0" { print $6 }' /proc/interrupts))) irq/s
60800 irq/s
perf top -C 5 -F 50000
while iperf over LAN interface is running: (CPU5 20%)
echo $((-$(awk '$11 == "eth1" { print $7 }' /proc/interrupts)+$(sleep 1 ; awk '$11 == "eth1" { print $7 }' /proc/interrupts))) irq/s
9004 irq/s
So we see that eth0 (WAN) causes about 7x as many interrupts as eth1 (LAN).
That does explain the roughly 4x higher CPU usage, but why does it do that?
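The two one-liners above can be wrapped into a small reusable function. This is only a sketch: the function names are made up, and it assumes the same /proc/interrupts layout as on my board (interface name in field 11, CPU4 in field 6, CPU5 in field 7), which may differ on other kernels.

```shell
# Sketch: read the interrupt counter for one interface and one CPU column.
# $1 = interface name, $2 = awk field number of the CPU column,
# $3 = optional file to read instead of /proc/interrupts.
irq_count() {
    awk -v ifc="$1" -v col="$2" '$11 == ifc { print $col }' "${3:-/proc/interrupts}"
}

# Print the number of interrupts counted over a one-second window.
irq_rate() {
    before=$(irq_count "$1" "$2")
    sleep 1
    after=$(irq_count "$1" "$2")
    echo "$((after - before)) irq/s"
}

# Usage on the router while iperf3 is running:
#   irq_rate eth0 6   # WAN, interrupts handled on CPU4
#   irq_rate eth1 7   # LAN, interrupts handled on CPU5
```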
Edit: Okay, so the WAN NIC eth0 is not from Realtek, only the PHY is.
The WAN NIC is some STMicroelectronics GMAC crap.
The driver can be found in drivers/net/ethernet/stmicro/stmmac
Edit2: Figured out how to tune the WAN NIC eth0 to be less of a CPU hog.
# ethtool -C eth0 rx-usecs 1000 rx-frames 25
# ethtool -C eth0 tx-usecs 100 tx-frames 25
Reduces CPU load:
- WAN TX from 86% to 57%
- WAN RX from 96% to 40%
- WAN BIDIR from 100% to 90%
But throughput under full load didn't actually improve,
so this change was basically useless.
Let's see what Linux 5.10 brings to the table in the future.
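For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes every frame is 1500 bytes and treats the iperf3 rate as the wire rate, neither of which I measured, and the helper name est_pps is made up. Under those assumptions gigabit is roughly 78k packets/s, so the 60800 irq/s seen on eth0 before tuning is close to one interrupt per packet, and 25 frames per interrupt would land around 3k irq/s if the driver followed that setting exactly.

```shell
# Rough estimate only. Assumptions (not measured): 1500 bytes per frame,
# and the iperf3 throughput figure is used as the wire rate.
est_pps() {
    # $1 = throughput in Mbit/s, $2 = bytes per frame
    echo $(( $1 * 1000000 / ($2 * 8) ))
}

pps=$(est_pps 941 1500)
echo "$pps packets/s"                                   # 78416 packets/s
echo "$((pps / 25)) irq/s at 25 frames per interrupt"   # 3136 irq/s
```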