Over the past few weeks I have been fortunate enough to run some tests with early OpenWRT builds on the NanoPi R2S, provided to me by @jayanta525.
This is a dual Gigabit Ethernet board with a quad-core Rockchip SoC (RK3328), manufactured by FriendlyARM.
Performance of the board is great, especially considering its price point (€32 incl shipping to Europe in my case). In summary (more details below):
- Hardware-accelerated AES in OpenSSL, up to 12x the OpenSSL performance of the RPi4
- WAN-to-LAN/ingress (NAT + firewall) TCP throughput of 940 Mbps
- With SQM enabled: ingress up to 465 Mbps, egress up to 750 Mbps
@jayanta525 is working on other similar RK3328 boards as well, like the Rock Pi E, and is currently preparing a PR. Once it is ready, please consider helping us get the PR merged into mainline OpenWRT!
OpenSSL performance
root@nanopi-r2s:~# openssl speed -evp aes-128-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 24232739 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 18557185 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 8901311 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2985544 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 414423 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 208278 aes-128-cbc's in 3.00s
OpenSSL 1.1.1 11 Sep 2018
built on: Tue Nov 12 16:58:35 2019 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-J6qvxk/openssl-1.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 129241.27k 395886.61k 759578.54k 1019065.69k 1131651.07k 1137475.58k
root@nanopi-r2s:~# openssl speed aes-128-cbc
Doing aes-128 cbc for 3s on 16 size blocks: 8962695 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 64 size blocks: 2534616 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 256 size blocks: 652684 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 164408 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 20592 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 16384 size blocks: 10291 aes-128 cbc's in 3.00s
OpenSSL 1.1.1 11 Sep 2018
built on: Tue Nov 12 16:58:35 2019 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-J6qvxk/openssl-1.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128 cbc 47801.04k 54071.81k 55695.70k 56117.93k 56229.89k 56202.58k
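To make the two runs easier to compare, here is a small sketch that recomputes the summary figures from the "Doing ..." lines and takes the ratio between the -evp run and the plain run. The helper names throughput_k and speedup are made up for this post. Keep in mind that the first run was timed with -elapsed and the second with user CPU time, so the ratio is indicative rather than exact.

```shell
# throughput_k BLOCKS BLOCK_SIZE SECONDS
# Prints throughput in 1000s of bytes per second, the unit of the summary table.
# (hypothetical helper, not part of openssl)
throughput_k() {
    awk -v n="$1" -v size="$2" -v secs="$3" \
        'BEGIN { printf "%.2f\n", n * size / secs / 1000 }'
}

# speedup FAST SLOW
# Prints the ratio of two throughput figures with one decimal.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

throughput_k 208278 16384 3.00   # -evp run, 16384-byte blocks: 1137475.58
throughput_k 10291 16384 3.00    # plain run, 16384-byte blocks: 56202.58
speedup 1137475.58 56202.58      # 16384-byte blocks: 20.2
speedup 129241.27 47801.04       # 16-byte blocks: 2.7
```

So on this board the gap between the two runs grows from under 3x at 16-byte blocks to about 20x at 16384-byte blocks.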
TCP routing performance
FriendlyARM states 941 Mbps throughput on both eth interfaces. I have confirmed this. Furthermore I have tested routing performance as follows. Test setup: Windows laptop on LAN port, Linux machine on WAN port, both running iperf3.
- iperf3 ~890 Mbps (LAN-to-WAN)
- iperf3 -R ~910 Mbps (WAN-to-LAN)
This packet steering trick from the RPi4 thread actually slows down iperf3 performance (in both directions) by around 1-2%:
- echo 2 > /sys/class/net/eth1/queues/rx-0/rps_cpus
- echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
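For anyone wondering where the values 2 and 1 come from: rps_cpus takes a hexadecimal CPU bitmask in which bit n stands for CPU n. A minimal sketch for building such a mask follows; cpu_mask is a made-up helper name, and it only prints the mask without writing anything to /sys.

```shell
# cpu_mask CPU...
# Prints the hex bitmask with one bit set per listed CPU index (0-based).
# (hypothetical helper for illustration)
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 1       # prints 2, the value written for eth1 above (CPU1 only)
cpu_mask 0       # prints 1, the value written for eth0 above (CPU0 only)
cpu_mask 2 3     # prints c (CPU2 and CPU3)
```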
Resetting that and enabling software flow offloading in LuCI increases performance by a few percent (roughly 1% LAN-to-WAN and 3% WAN-to-LAN):
- iperf3 ~898 Mbps
- iperf3 -R ~941 Mbps (line speed!)
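As a sanity check on the "line speed" remark, the TCP payload rate that fits on a Gigabit link can be estimated from the per-frame overhead. The sketch below assumes a standard 1500-byte MTU, IPv4, and 12 bytes of TCP options (timestamps); none of this was verified on the test machines, and goodput is a made-up helper name.

```shell
# goodput LINK_MBPS MTU TCP_OPTION_BYTES
# Estimates the maximum TCP payload rate in Mbps on an Ethernet link.
# (hypothetical helper; assumes IPv4 and no VLAN tag)
goodput() {
    awk -v link="$1" -v mtu="$2" -v opts="$3" 'BEGIN {
        payload = mtu - 20 - 20 - opts   # minus IPv4 header, TCP header, TCP options
        wire = mtu + 14 + 4 + 8 + 12     # plus Ethernet header, FCS, preamble, inter-frame gap
        printf "%.1f\n", link * payload / wire
    }'
}

goodput 1000 1500 12   # with TCP timestamps: 941.5
goodput 1000 1500 0    # without TCP options: 949.3
```

Under those assumptions about 941 Mbps is the ceiling for a single TCP stream, which matches both the FriendlyARM figure and the iperf3 -R result.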
Note: to reach these numbers, the IRQ of eth1 needs to be moved to a different CPU core. Since I could not get irqbalance to work, I set the affinity by hand:
root@OpenWrt:/etc/rc.d# cat /proc/irq/166/smp_affinity
f
root@OpenWrt:/etc/rc.d# echo "e" > /proc/irq/166/smp_affinity
root@OpenWrt:/etc/rc.d# cat /proc/irq/166/smp_affinity
e
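The smp_affinity value is again a hexadecimal CPU bitmask, so "f" allows all four cores and "e" allows every core except CPU0. A small sketch for decoding such a mask is below; mask_to_cpus is a made-up helper name, and it assumes a single hex word without the comma-separated groups that larger systems print.

```shell
# mask_to_cpus HEXMASK
# Prints the 0-based CPU indices whose bits are set, separated by spaces.
# (hypothetical helper for illustration)
mask_to_cpus() {
    mask=$(( 0x$1 ))
    cpu=0
    out=""
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="${out:+$out }$cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$out"
}

mask_to_cpus f   # prints 0 1 2 3
mask_to_cpus e   # prints 1 2 3
```

Note that a value written to /proc/irq/.../smp_affinity this way does not survive a reboot, so it has to be reapplied at boot.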
TCP routing performance with SQM
Using the default SQM setup as described in the OpenWRT wiki ("piece of cake"), I was able to get the following impressive numbers, which should cover the vast majority of SOHO broadband installations.
- ingress (WAN-to-LAN) shaping seems to saturate one CPU core and is limited to around 465 Mbps TCP. Moving IRQs around does not help.
- egress shaping works fine at 750 Mbps, with two cores loaded up to 90%.