It is interesting to note that despite having a 1.35GHz clock frequency, compared to the 1.6-1.8GHz clock frequency of the Marvell Armada benchmarks posted above by @anomeome and @cybrnook, you have the highest benchmarks for CHACHA20-POLY1305. This suggests that having ARMv8-A/AES-NI is not only very beneficial for OpenVPN, but could also benefit Wireguard.
openssl speed -elapsed -evp CHACHA20-POLY1305
You have chosen to measure elapsed time instead of user CPU time.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
chacha20-poly1305 108986.44k 235554.69k 327910.40k 355238.57k 364726.95k 365379.58k
sorry, couldn't resist this game of one-upmanship lol
-Os is the default and means optimise for size. I override my build with -O2 for speed at the cost of somewhat larger generated code; -O3 is faster still, but enables some optimisations that yield even larger code.
openssl overrides the defaults in its makefile with -O3.
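A quick way to see which level a given flag string ends up with: as far as I know, GCC honours only the last -O option on the command line. A minimal sketch, where effective_opt is a made-up helper name and the sample string is shortened from the compiler line in the OpenSSL build info posted further down in this thread:

```shell
# effective_opt FLAGS: print the last -O* option in a compiler flag string.
# (Hypothetical helper; assumes GCC's "last -O option wins" behaviour.)
effective_opt() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -E '^-O' | tail -n 1
}

effective_opt "-fPIC -pthread -Wall -O3 -Os -pipe -mcpu=cortex-a53"   # prints -Os
```

Hand-written assembly routines (the build info lists defines such as -DPOLY1305_ASM) should not depend on the -O level, so the optimisation flag mostly matters for the C parts of the library.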
You know, given all this benchmarking, I was just curious to see what a 6 year old core i7-4790s would do by comparison.
Some of the AES numbers in this thread are pretty impressive for low power CPUs: even better than the core i7 on big buffers, particularly the rockpi... although with chacha the i7 kicks ass.
Quite impressive performance for low power CPUs...
I meant highest amongst ARM core benchmarks, but you are indeed right that your x86 setup exceeds the other benchmarks.
On the other hand, looking at it from a different perspective, it is quite creditable that a dual-core 1.35GHz ARMv8 can even be considered in the same league as an octa-core 2.2GHz x86 setup for Wireguard, even knowing that these are single-core benchmarks.
Exactly, to my earlier point.
BTW, great benchmark to share on the i7, especially aes-128-cbc. Just to confirm, the chacha20 benchmark on i7 above as well as the chacha20 benchmark earlier on your octa-core x86 setup were single core benchmarks, right?
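As far as I know, openssl speed runs as a single process unless -multi is given, so the commands quoted in this thread measure one core. A small sketch for building both variants (speed_cmd is a hypothetical helper name; it only prints the command line so it can be inspected before running it):

```shell
# speed_cmd CIPHER [PROCS]: print an openssl speed command line.
# With PROCS > 1 it adds -multi, which runs that many benchmarks in parallel.
speed_cmd() {
    cipher=$1
    procs=${2:-1}
    if [ "$procs" -gt 1 ]; then
        echo "openssl speed -elapsed -multi $procs -evp $cipher"
    else
        echo "openssl speed -elapsed -evp $cipher"
    fi
}

speed_cmd chacha20-poly1305      # single process, as posted in this thread
speed_cmd chacha20-poly1305 4    # four processes; 4 is only an example count
```

Running both printed commands on the same box shows how far the single-core figure is from what the whole chip can do.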
Wireguard gains nothing from AES-NI. ARMv8-A also includes "Neon", the ARM SIMD implementation, as a standard feature, and Wireguard makes extensive use of SIMD where available to accelerate its algorithm.
That explains why the ARMv8-A showed improved performance for Wireguard, which I was not expecting. Thanks for sharing.
Do you know if Wireguard natively takes advantage of SIMD when running on ARMv8-A without any additional compilation options, similar to how OpenSSL natively uses AES-NI when running on CPUs supporting that instruction set with no additional work required?
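This does not show how Wireguard picks its implementation, but it is easy to check which SIMD and crypto extensions the kernel reports for the CPU. A minimal sketch, assuming a Linux /proc/cpuinfo with a "Features" line (ARM) or "flags" line (x86); has_feature is a made-up helper name, and on 64-bit ARM the Neon unit is normally listed as asimd:

```shell
# has_feature NAME [FILE]: succeed if NAME appears as a whole word on the
# first "Features" (ARM) or "flags" (x86) line of FILE (default /proc/cpuinfo).
has_feature() {
    grep -E -m1 '^(Features|flags)[[:space:]]*:' "${2:-/proc/cpuinfo}" 2>/dev/null |
        tr ' \t' '\n\n' | grep -qx "$1"
}

# asimd = Neon on AArch64; aes, pmull and sha2 are the ARMv8 crypto extensions
for f in asimd aes pmull sha2; do
    if has_feature "$f"; then echo "$f: yes"; else echo "$f: no"; fi
done
```

If asimd shows up, the hardware side is there; whether a particular kernel or module build actually uses it is a separate question that this check cannot answer.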
Hello everybody, this is a new test, this time with a UBI image.
root@OpenWrt:~# openssl speed -elapsed -evp aes-128-gcm
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-gcm for 3s on 16 size blocks: 7002290 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 64 size blocks: 2293045 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 256 size blocks: 632551 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 1024 size blocks: 162890 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 8192 size blocks: 20547 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 16384 size blocks: 10281 aes-128-gcm's in 3.00s
OpenSSL 1.1.1k 25 Mar 2021
built on: Sun Apr 4 09:51:25 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -DPIC -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-gcm 37345.55k 48918.29k 53977.69k 55599.79k 56107.01k 56147.97k
root@OpenWrt:~# openssl speed -elapsed -evp AES-128-CBC
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 18884607 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 14584264 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 7414392 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2604928 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 367482 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 186079 aes-128-cbc's in 3.00s
OpenSSL 1.1.1k 25 Mar 2021
built on: Sun Apr 4 09:51:25 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -DPIC -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 100717.90k 311130.97k 632694.78k 889148.76k 1003470.85k 1016239.45k
root@OpenWrt:~# openssl speed -elapsed -evp AES-256-CBC
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 17599572 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 12210670 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 5358006 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1694678 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 229414 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 115320 aes-256-cbc's in 3.00s
OpenSSL 1.1.1k 25 Mar 2021
built on: Sun Apr 4 09:51:25 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -DPIC -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 93864.38k 260494.29k 457216.51k 578450.09k 626453.16k 629800.96k
root@OpenWrt:~# openssl speed -elapsed -evp CHACHA20-POLY1305
You have chosen to measure elapsed time instead of user CPU time.
Doing chacha20-poly1305 for 3s on 16 size blocks: 6956996 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 64 size blocks: 4007395 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 256 size blocks: 2045398 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 1024 size blocks: 586258 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 8192 size blocks: 79114 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 16384 size blocks: 39723 chacha20-poly1305's in 3.00s
OpenSSL 1.1.1k 25 Mar 2021
built on: Sun Apr 4 09:51:25 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -DPIC -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
chacha20-poly1305 37103.98k 85491.09k 174540.63k 200109.40k 216033.96k 216940.54k
root@OpenWrt:~#
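For anyone reading these dumps: the summary rows follow directly from the "Doing ..." lines, as operations * block size / seconds / 1000. A minimal sketch (rate_k is a made-up helper name) that reproduces two of the 16384-byte figures above:

```shell
# rate_k OPS BLOCK_BYTES SECONDS: thousands of bytes processed per second,
# the unit openssl speed uses in its summary table.
rate_k() {
    awk -v ops="$1" -v size="$2" -v secs="$3" \
        'BEGIN { printf "%.2f\n", ops * size / secs / 1000 }'
}

rate_k 10281 16384 3.00   # aes-128-gcm:       prints 56147.97
rate_k 39723 16384 3.00   # chacha20-poly1305: prints 216940.54
```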
Has anyone tested what OpenVPN speeds the Linksys E8450 (aka Belkin RT3200) can achieve? I'm wondering if it could replace my Asus RT-AC86U, which has hardware AES acceleration (but no OpenWRT support). Just did a speedtest and it reported 125Mbps while connected to my VPN provider with AES-256-GCM encryption. I'm not quite sure how to interpret the SSL benchmarks reported in this thread...
Hey guys, can anyone explain why the GCM performance is so low, and why unaccelerated chacha20-poly1305 wins on CPUs with AES support?
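On interpreting the numbers: openssl speed reports thousands of bytes per second for one process, while speedtests report megabits per second, so multiply by 8 and divide by 1000. A rough sketch (k_to_mbit is a hypothetical helper name), using the 16384-byte aes-128-gcm and chacha20-poly1305 figures from the log posted above:

```shell
# k_to_mbit RATE_K: convert openssl speed's "1000s of bytes per second"
# into megabits per second (1 Mbit = 1,000,000 bits).
k_to_mbit() {
    awk -v k="$1" 'BEGIN { printf "%.2f\n", k * 1000 * 8 / 1000000 }'
}

k_to_mbit 56147.97     # aes-128-gcm:       prints 449.18
k_to_mbit 216940.54    # chacha20-poly1305: prints 1735.52
```

Treat the result as a cipher-only ceiling for one core. A VPN daemon also has to move packets in and out of the kernel, so real tunnel throughput should be expected to land below it.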
Am I right in thinking that CBC benchmarks won't actually show how fast OpenVPN will go, as it needs to combine them with SHA1 (or other auth) and that GCM benchmarks are the real number we will see with OpenVPN?
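One rough way to reason about it, assuming the cipher and the HMAC run one after the other on the same core: every byte costs the time of both stages, so the rates combine as 1 / (1/A + 1/B). A sketch under that assumption (combined_k is a made-up helper name, and the SHA1 figure below is a placeholder, not a measurement from this thread):

```shell
# combined_k A B: throughput when every byte passes through two stages in
# series on one core (e.g. cipher, then HMAC): 1 / (1/A + 1/B).
combined_k() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 1 / (1 / a + 1 / b) }'
}

cbc=629800.96   # aes-256-cbc, 16384-byte blocks, from the log in this thread
sha=300000      # HYPOTHETICAL placeholder; measure yours with:
                #   openssl speed -elapsed -evp sha1
combined_k "$cbc" "$sha"
```

A GCM figure already includes the authentication work, which is why it is closer to what an AEAD-configured tunnel has to do per byte than a bare CBC figure.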
I have a NanoPi R2S on 21.02 with aes showing in cpuinfo
My OpenVPN seems to cap out around 120Mbps with 100% CPU load from OpenVPN on one core, so I'm trying to find out how to make it better.
Let me ask another question. Given OpenVPN 2.5 now supports "data-ciphers CHACHA20-POLY1305", should we ever use AES if CHACHA20-POLY1305 benchmarks faster in all the ARM results people above have posted? Isn't AES acceleration broken if it's slower than software CHACHA?