Broadcom is… 'difficult'…
…but there is quite some work in progress (but not finished yet), https://git.openwrt.org/?p=openwrt/openwrt.git;a=tree;f=target/linux/bcm4908/image;hb=HEAD
openssl from opkg install openssl-util:
root@OpenWrt:~# openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 4852594 aes-256 cbc's in 2.97s
Doing aes-256 cbc for 3s on 64 size blocks: 1306710 aes-256 cbc's in 2.99s
Doing aes-256 cbc for 3s on 256 size blocks: 333596 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 83857 aes-256 cbc's in 2.99s
Doing aes-256 cbc for 3s on 8192 size blocks: 10495 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16384 size blocks: 5249 aes-256 cbc's in 3.00s
OpenSSL 1.1.1j 16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256 cbc 26141.92k 27969.71k 28466.86k 28718.92k 28658.35k 28666.54k
Manually compiled from another arm64 machine (w/ VPAES_ASM):
root@OpenWrt:~# ./openssl-static speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 6532145 aes-256 cbc's in 2.98s
Doing aes-256 cbc for 3s on 64 size blocks: 1740426 aes-256 cbc's in 2.99s
Doing aes-256 cbc for 3s on 256 size blocks: 445092 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 111835 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 14007 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16384 size blocks: 7000 aes-256 cbc's in 2.99s
OpenSSL 1.1.1j 16 Feb 2021
built on: Thu Mar 18 08:16:31 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: gcc -pthread -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256 cbc 35071.92k 37253.27k 37981.18k 38173.01k 38248.45k 38357.19k
That's less than 2% throughput of my Apple M1 MacBookAir, by the way.
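For anyone wondering how the summary rows relate to the "Doing ... for 3s" lines, here is a rough sketch of the arithmetic: operations times block size, divided by the measured time, in units of 1000 bytes. The function name throughput_k is made up for illustration; the figures are the 16-byte rows of the two runs above.

```python
def throughput_k(operations: int, block_size: int, seconds: float) -> float:
    """Return throughput in units of 1000 bytes per second,
    the same unit openssl speed prints in its summary table."""
    return operations * block_size / seconds / 1000.0


# 16-byte rows copied from the two aes-256-cbc runs above.
stock = throughput_k(4852594, 16, 2.97)   # opkg openssl-util build
static = throughput_k(6532145, 16, 2.98)  # manually compiled static build

print(f"stock:  {stock:.2f}k")
print(f"static: {static:.2f}k")
print(f"static/stock: {static / stock:.2f}x")
```

Both values should line up with the first column of the respective summary tables, which is a quick sanity check that the table is derived from the raw counts and nothing else.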
Thanks for the prompt feedback.
I am not an expert, but the results look positively anemic. I am not sure the AES instruction set is being used.
How does one verify that AES instruction set is indeed being used by OpenSSL? Would 'openssl engine -t -c' reveal it, as shown in the OpenWrt documentation for crypto engines?
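As a first step, one can at least check whether the CPU advertises AES instructions at all. Below is a minimal sketch that scans /proc/cpuinfo-style text for an "aes" flag. The helper names are hypothetical, and the sample lines are copied from the /proc/cpuinfo dumps posted in this thread. Note that this only shows what the CPU offers, not whether a given openssl build actually uses it.

```python
def cpu_features(cpuinfo_text: str) -> set:
    """Collect flags from every 'Features' (ARM) or 'flags' (x86) line."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("features", "flags"):
            features.update(value.split())
    return features


def has_aes_instructions(cpuinfo_text: str) -> bool:
    """True if the CPU advertises an 'aes' feature flag."""
    return "aes" in cpu_features(cpuinfo_text)


# Sample lines taken from the cpuinfo dumps in this thread.
SAMPLE_ARMV8 = "Features : fp asimd evtstrm aes pmull sha1 sha2 crc32"
SAMPLE_ARMV7 = ("Features : half thumb fastmult vfp edsp neon vfpv3 tls "
                "vfpv4 idiva idivt vfpd32")

print("ARMv8 sample has aes:", has_aes_instructions(SAMPLE_ARMV8))
print("ARMv7 sample has aes:", has_aes_instructions(SAMPLE_ARMV7))
```

On a real device the same functions could be fed the contents of /proc/cpuinfo, e.g. open("/proc/cpuinfo").read().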
Just for comparison (Xiaomi Mi AIoT Router AX3600, ipq8071a, 4*1.38GHz, cortex a53/ ARMv8, with the rather unusable and unoptimized OEM firmware):
# cat /proc/cpuinfo
processor : 0
BogoMIPS : 38.40
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 1
BogoMIPS : 38.40
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 2
BogoMIPS : 38.40
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 3
BogoMIPS : 38.40
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
1017600
# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1382400
# openssl engine -t -c
(dynamic) Dynamic engine loading support
[ unavailable ]
# openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 4630455 aes-256 cbc's in 2.86s
Doing aes-256 cbc for 3s on 64 size blocks: 1313578 aes-256 cbc's in 2.73s
Doing aes-256 cbc for 3s on 256 size blocks: 332944 aes-256 cbc's in 2.59s
Doing aes-256 cbc for 3s on 1024 size blocks: 83464 aes-256 cbc's in 2.70s
Doing aes-256 cbc for 3s on 8192 size blocks: 10427 aes-256 cbc's in 2.75s
OpenSSL 1.0.2q 20 Nov 2018
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,2,int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-gcc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -I/home/jenkins/romdaily_new_openwrt/system/staging_dir/target-aarch64-openwrt-linux_musl/usr/include -I/home/jenkins/romdaily_new_openwrt/system/staging_dir/target-aarch64-openwrt-linux_musl/include -I/home/jenkins/Xiaoqiangtoolchain/toolchain/external_toolchain/toolchain-aarch64_cortex-a53_gcc-5.5.0_musl//usr/include -I/home/jenkins/Xiaoqiangtoolchain/toolchain/external_toolchain/toolchain-aarch64_cortex-a53_gcc-5.5.0_musl//include -specs=/home/jenkins/romdaily_new_openwrt/system/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_SMALL_FOOTPRINT -DMIWIFI_FEATURE -DHAVE_CRYPTODEV -DOPENSSL_NO_ERR -DTERMIOS -Os -pipe -march=armv8-a -mcpu=cortex-a53+crypto -fno-caller-saves -Wformat -fpic -fstack-protector -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -fpic -I/home/jenkins/romdaily_new_openwrt/system/package/libs/openssl/include -ffunction-sections -fdata-sections -fomit-frame-pointer -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256 cbc 25904.64k 30794.50k 32908.75k 31654.49k 31061.09k
ipq8074a would be clocked at 4*2.2 GHz, cortex a53/ ARMv8
Here are the results from IPQ8065 (1.7 GHz, 2 cores), with no NSS support and no AES instructions, since it is 32-bit ARMv7-A:
openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 6794589 aes-256-cbc's in 2.95s
Doing aes-256-cbc for 3s on 64 size blocks: 2381327 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 256 size blocks: 634833 aes-256-cbc's in 2.97s
Doing aes-256-cbc for 3s on 1024 size blocks: 161890 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 8192 size blocks: 20269 aes-256-cbc's in 2.98s
Doing aes-256-cbc for 3s on 16384 size blocks: 10131 aes-256-cbc's in 3.00s
OpenSSL 1.1.1i 8 Dec 2020
built on: Fri Jan 22 23:53:44 2021 UTC
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -mfloat-abi=hard -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_PREFER_CHACHA_OVER_GCM -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 36852.01k 50971.55k 54719.61k 55443.26k 55719.34k 55328.77k
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
384000
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1725000
cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 0 (v7l)
BogoMIPS : 12.50
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x2
CPU part : 0x04d
CPU revision : 0
processor : 1
model name : ARMv7 Processor rev 0 (v7l)
BogoMIPS : 26.04
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x2
CPU part : 0x04d
CPU revision : 0
Based on the above, I am not confident that the previous benchmarks from the E8450 and the Xiaomi AX3600 are using AES.
Yes, I am recommending that you go for something with AES-NI, which means either x86_64 or ARMv8. The penalty for shunting the data over the bus to the crypto silicon makes it perform worse than AES-NI for small, synchronous operations.
Correct. It's also a pita to integrate - it required a lot of work to port the code, including a large number of patches, most of it to kernel code, along with crypto drivers, contiguous memory drivers and others, as well as a port of a patched openssl version designed to work with the hardware.
As @slh pointed out, actually using it is also non-trivial and the configuration of the hardware itself, while complicated, is the least of the issues.
Using Intel QuickAssist requires a patched asynchronous version of openssl, which was also a gigantic pita to compile and get working (for some reason Intel likes to write software designed for embedded systems that simply cannot be cross-compiled; pretty bizarre and yes, I pointed this out to the Intel folk responsible for maintaining the software). Using QuickAssist in nginx requires significant patches to nginx as well. It's not an "out of the box" experience by any means.
If you're curious to look at what it takes to get hardware like this working, the code is here
For typical OpenWrt synchronous workloads on smallish buffers (something like OpenVPN), performance using the crypto hardware on AES-CBC was about 70% - 80% of the performance of AES-NI. For larger buffers, the performance started to approach parity. For multiple (36+ threads) asynchronous operations, the speed was about 10x as fast as you'd get using AES-NI.
It would be a real hassle to give you benchmarks, as I compiled the AES acceleration out of the Intel QuickAssist drivers. I'd need to recompile half a dozen kernel modules to be able to get you a benchmark.
The performance on RSA is very good, particularly for signing operations, which perform much better than software regardless of whether it's synchronous or not.
On core AES-NI definitely, no doubt in my mind.
@slh and @jiegec, can you run "cat /proc/crypto" and share what is the priority you get under aes section?
For the benchmark I posted above, the priority is 100, which indicates no AES and no crypto engines, as below.
cat /proc/crypto
name : aes
driver : aes-generic
module : kernel
priority : 100
refcnt : 7
selftest : passed
internal : no
type : cipher
blocksize : 16
min keysize : 16
max keysize : 32
If the AES instruction set were being used on your router, one would expect the priority to be > 100.
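To make the priority comparison less error-prone than eyeballing a long dump, here is a small sketch that parses /proc/crypto-style text and lists the drivers registered for one algorithm name, highest priority first. The function names are hypothetical and the sample text is abridged from the dumps in this thread. Keep in mind that /proc/crypto lists what is registered with the kernel crypto API; it does not by itself show which code path a userspace openssl run takes.

```python
def parse_proc_crypto(text: str) -> list:
    """Split /proc/crypto-style text into one dict per entry.
    A new entry starts at every 'name' line."""
    entries = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "name":
            entries.append({})
        if entries:
            entries[-1][key] = value
    return entries


def drivers_for(entries: list, name: str) -> list:
    """Return entries for one algorithm, highest priority first."""
    matches = [e for e in entries if e.get("name") == name]
    return sorted(matches, key=lambda e: int(e.get("priority", 0)),
                  reverse=True)


# Abridged sample in the same layout as the dumps in this thread.
SAMPLE = """\
name : cbc(aes)
driver : cbc(aes-generic)
module : cbc
priority : 100
name : cbc(aes)
driver : nss-cbc-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
"""

for entry in drivers_for(parse_proc_crypto(SAMPLE), "cbc(aes)"):
    print(entry["priority"], entry["driver"], entry["module"])
```

On a router the same functions could be applied to open("/proc/crypto").read() instead of the sample string.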
ax3600/ ipq8071a:
# cat /proc/crypto
name : hmac(sha512)
driver : nss-hmac-sha512
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 128
digestsize : 64
name : hmac(sha384)
driver : nss-hmac-sha384
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 128
digestsize : 48
name : hmac(sha256)
driver : nss-hmac-sha256
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 64
digestsize : 32
name : hmac(sha1)
driver : nss-hmac-sha1
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 64
digestsize : 20
name : hmac(md5)
driver : nss-hmac-md5
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 64
digestsize : 16
name : sha512
driver : nss-sha512
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 128
digestsize : 64
name : sha384
driver : nss-sha384
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 128
digestsize : 48
name : sha256
driver : nss-sha256
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 64
digestsize : 32
name : sha224
driver : nss-sha224
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 64
digestsize : 28
name : sha1
driver : nss-sha1
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 64
digestsize : 20
name : md5
driver : nss-md5
module : qca_nss_cfi_cryptoapi
priority : 1000
refcnt : 1
selftest : passed
internal : no
type : ahash
async : yes
blocksize : 64
digestsize : 16
name : gcm(aes)
driver : nss-gcm
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 12
maxauthsize : 16
geniv : <none>
name : seqiv(rfc4106(gcm(aes)))
driver : nss-rfc4106-gcm
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 8
maxauthsize : 16
geniv : <none>
name : rfc4106(gcm(aes))
driver : nss-rfc4106-gcm
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 8
maxauthsize : 16
geniv : <none>
name : authenc(hmac(sha256),cbc(des3_ede))
driver : nss-hmac-sha256-cbc-3des
module : qca_nss_cfi_cryptoapi
priority : 300
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 8
ivsize : 8
maxauthsize : 32
geniv : <none>
name : authenc(hmac(sha1),cbc(des3_ede))
driver : nss-hmac-sha1-cbc-3des
module : qca_nss_cfi_cryptoapi
priority : 300
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 8
ivsize : 8
maxauthsize : 20
geniv : <none>
name : authenc(hmac(sha256),cbc(aes))
driver : nss-hmac-sha256-cbc-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 16
maxauthsize : 32
geniv : <none>
name : authenc(hmac(sha1),cbc(aes))
driver : nss-hmac-sha1-cbc-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 16
maxauthsize : 20
geniv : <none>
name : echainiv(authenc(hmac(sha256),cbc(des3_ede)))
driver : nss-hmac-sha256-cbc-3des
module : qca_nss_cfi_cryptoapi
priority : 300
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 8
ivsize : 8
maxauthsize : 32
geniv : <none>
name : echainiv(authenc(hmac(sha1),cbc(des3_ede)))
driver : nss-hmac-sha1-cbc-3des
module : qca_nss_cfi_cryptoapi
priority : 300
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 8
ivsize : 8
maxauthsize : 20
geniv : <none>
name : echainiv(authenc(hmac(md5),cbc(des3_ede)))
driver : nss-hmac-md5-cbc-3des
module : qca_nss_cfi_cryptoapi
priority : 300
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 8
ivsize : 8
maxauthsize : 16
geniv : <none>
name : seqiv(authenc(hmac(sha256),rfc3686(ctr(aes))))
driver : nss-hmac-sha256-rfc3686-ctr-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 8
maxauthsize : 32
geniv : <none>
name : echainiv(authenc(hmac(sha256),cbc(aes)))
driver : nss-hmac-sha256-cbc-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 16
maxauthsize : 32
geniv : <none>
name : seqiv(authenc(hmac(sha1),rfc3686(ctr(aes))))
driver : nss-hmac-sha1-rfc3686-ctr-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 8
maxauthsize : 20
geniv : <none>
name : seqiv(authenc(hmac(md5),rfc3686(ctr(aes))))
driver : nss-hmac-md5-rfc3686-ctr-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 8
maxauthsize : 16
geniv : <none>
name : echainiv(authenc(hmac(sha1),cbc(aes)))
driver : nss-hmac-sha1-cbc-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 16
maxauthsize : 20
geniv : <none>
name : echainiv(authenc(hmac(md5),cbc(aes)))
driver : nss-hmac-md5-cbc-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : aead
async : yes
blocksize : 16
ivsize : 16
maxauthsize : 16
geniv : <none>
name : cbc(des3_ede)
driver : nss-cbc-des-ede
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : ablkcipher
async : yes
blocksize : 8
min keysize : 24
max keysize : 24
ivsize : 8
geniv : <default>
name : ecb(aes)
driver : nss-ecb-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : ablkcipher
async : yes
blocksize : 16
min keysize : 16
max keysize : 32
ivsize : 0
geniv : <default>
name : rfc3686(ctr(aes))
driver : nss-rfc3686-ctr-aes
module : qca_nss_cfi_cryptoapi
priority : 30000
refcnt : 1
selftest : passed
internal : no
type : ablkcipher
async : yes
blocksize : 16
min keysize : 20
max keysize : 36
ivsize : 8
geniv : seqiv
name : cbc(aes)
driver : nss-cbc-aes
module : qca_nss_cfi_cryptoapi
priority : 10000
refcnt : 1
selftest : passed
internal : no
type : ablkcipher
async : yes
blocksize : 16
min keysize : 16
max keysize : 32
ivsize : 16
geniv : <default>
name : hmac(sha512)
driver : hmac(sha512-generic)
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 128
digestsize : 64
name : hmac(sha384)
driver : hmac(sha384-generic)
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 128
digestsize : 48
name : hmac(sha256)
driver : hmac(sha256-generic)
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 64
digestsize : 32
name : cbc(cipher_null)
driver : cbc(cipher_null-generic)
module : cbc
priority : 0
refcnt : 1
selftest : passed
internal : no
type : blkcipher
blocksize : 1
min keysize : 0
max keysize : 0
ivsize : 1
geniv : <default>
name : cbc(aes)
driver : cbc(aes-generic)
module : cbc
priority : 100
refcnt : 1
selftest : passed
internal : no
type : blkcipher
blocksize : 16
min keysize : 16
max keysize : 32
ivsize : 16
geniv : <default>
name : hmac(sha1)
driver : hmac(sha1-generic)
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 64
digestsize : 20
name : sha1
driver : sha1-generic
module : sha1_generic
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 64
digestsize : 20
name : hmac(md5)
driver : hmac(md5-generic)
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 64
digestsize : 16
name : cbc(des3_ede)
driver : cbc(des3_ede-generic)
module : cbc
priority : 100
refcnt : 1
selftest : passed
internal : no
type : blkcipher
blocksize : 8
min keysize : 24
max keysize : 24
ivsize : 8
geniv : <default>
name : cbc(des)
driver : cbc(des-generic)
module : cbc
priority : 100
refcnt : 1
selftest : passed
internal : no
type : blkcipher
blocksize : 8
min keysize : 8
max keysize : 8
ivsize : 8
geniv : <default>
name : md5
driver : md5-generic
module : md5
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 64
digestsize : 16
name : des3_ede
driver : des3_ede-generic
module : des_generic
priority : 100
refcnt : 1
selftest : passed
internal : no
type : cipher
blocksize : 8
min keysize : 24
max keysize : 24
name : des
driver : des-generic
module : des_generic
priority : 100
refcnt : 1
selftest : passed
internal : no
type : cipher
blocksize : 8
min keysize : 8
max keysize : 8
name : ghash
driver : ghash-generic
module : kernel
priority : 100
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 16
digestsize : 16
name : jitterentropy_rng
driver : jitterentropy_rng
module : kernel
priority : 100
refcnt : 1
selftest : passed
internal : no
type : rng
seedsize : 0
name : stdrng
driver : drbg_nopr_hmac_sha256
module : kernel
priority : 207
refcnt : 1
selftest : passed
internal : no
type : rng
seedsize : 0
name : stdrng
driver : drbg_nopr_hmac_sha512
module : kernel
priority : 206
refcnt : 1
selftest : passed
internal : no
type : rng
seedsize : 0
name : stdrng
driver : drbg_nopr_hmac_sha384
module : kernel
priority : 205
refcnt : 1
selftest : passed
internal : no
type : rng
seedsize : 0
name : stdrng
driver : drbg_nopr_hmac_sha1
module : kernel
priority : 204
refcnt : 1
selftest : passed
internal : no
type : rng
seedsize : 0
name : stdrng
driver : drbg_pr_hmac_sha256
module : kernel
priority : 203
refcnt : 1
selftest : passed
internal : no
type : rng
seedsize : 0
name : stdrng
driver : drbg_pr_hmac_sha512
module : kernel
priority : 202
refcnt : 1
selftest : passed
internal : no
type : rng
seedsize : 0
name : stdrng
driver : drbg_pr_hmac_sha384
module : kernel
priority : 201
refcnt : 1
selftest : passed
internal : no
type : rng
seedsize : 0
name : stdrng
driver : drbg_pr_hmac_sha1
module : kernel
priority : 200
refcnt : 1
selftest : passed
internal : no
type : rng
seedsize : 0
name : xz
driver : xz-generic
module : kernel
priority : 0
refcnt : 2
selftest : passed
internal : no
type : compression
name : lzo
driver : lzo-generic
module : kernel
priority : 0
refcnt : 2
selftest : passed
internal : no
type : compression
name : crc32c
driver : crc32c-generic
module : kernel
priority : 100
refcnt : 2
selftest : passed
internal : no
type : shash
blocksize : 1
digestsize : 4
name : deflate
driver : deflate-generic
module : kernel
priority : 0
refcnt : 2
selftest : passed
internal : no
type : compression
name : ecb(arc4)
driver : ecb(arc4)-generic
module : kernel
priority : 100
refcnt : 1
selftest : passed
internal : no
type : blkcipher
blocksize : 1
min keysize : 1
max keysize : 256
ivsize : 0
geniv : <default>
name : arc4
driver : arc4-generic
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : cipher
blocksize : 1
min keysize : 1
max keysize : 256
name : aes
driver : aes-generic
module : kernel
priority : 100
refcnt : 2
selftest : passed
internal : no
type : cipher
blocksize : 16
min keysize : 16
max keysize : 32
name : sha384
driver : sha384-generic
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 128
digestsize : 48
name : sha512
driver : sha512-generic
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 128
digestsize : 64
name : sha224
driver : sha224-generic
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 64
digestsize : 28
name : sha256
driver : sha256-generic
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 64
digestsize : 32
name : digest_null
driver : digest_null-generic
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 1
digestsize : 0
name : compress_null
driver : compress_null-generic
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : compression
name : ecb(cipher_null)
driver : ecb-cipher_null
module : kernel
priority : 100
refcnt : 1
selftest : passed
internal : no
type : blkcipher
blocksize : 1
min keysize : 0
max keysize : 0
ivsize : 0
geniv : <default>
name : cipher_null
driver : cipher_null-generic
module : kernel
priority : 0
refcnt : 1
selftest : passed
internal : no
type : cipher
blocksize : 1
min keysize : 0
max keysize : 0
There are some benchmarks in this thread
Those benchmarks are for the C2000 SoCs. I updated the code for the C3000 SoCs, which perform better than the benchmarks in that thread
Thanks a ton for sharing your detailed insights as well as the rough benchmarks above! You have certainly enlightened me today. Very kind of you. Appreciate it.
Of course, please ignore. I was just hoping to get whatever you had off the top of your mind, which you have already done above.
Thanks for confirming it.
Do you think you might have any insight on how to measure these benchmarks of on-core AES-NI performance impact on openssl? If you'll see in the thread above, I am struggling a bit to do so, since the benchmarks with AES-NI support seem to be poorer than those without AES-NI. It feels to me that either we have a measurement issue, or AES is somehow not being invoked. It is not clear how to figure it out.
openssl speed -elapsed -evp aes-128-cbc-hmac-sha1
Or with AES-NI enabled
openssl speed -elapsed -evp aes-128-cbc
With AES-NI disabled
OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-cbc
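In case the magic number looks opaque: as far as I understand the OpenSSL documentation for OPENSSL_ia32cap, bit 57 stands for AES-NI and bit 33 for PCLMULQDQ, and a leading "~" means "clear these bits". The sketch below just rebuilds the mask from those two bit positions; the helper name is made up. This variable only applies to x86 builds, so it would not affect the ARM routers discussed in this thread.

```python
# Bit positions in the OPENSSL_ia32cap vector (x86 only).
AESNI_BIT = 57      # CPUID.1:ECX bit 25, stored at 32 + 25
PCLMULQDQ_BIT = 33  # CPUID.1:ECX bit 1, stored at 32 + 1


def ia32cap_disable_mask(*bits: int) -> str:
    """Build a '~0x...' value that asks OpenSSL to clear the given bits."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    return "~" + hex(mask)


mask = ia32cap_disable_mask(AESNI_BIT, PCLMULQDQ_BIT)
print(f'OPENSSL_ia32cap="{mask}" openssl speed -elapsed -evp aes-128-cbc')
```

The printed command should be identical to the "AES-NI disabled" line above, with plain ASCII double quotes around the value.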
The priority of 10000 certainly seems to confirm that hardware AES is being invoked, but the module of qca_nss_cfi_cryptoapi seems to suggest that it is not the AES-NI but rather the on-silicon crypto engine which is being used. Based on what @dl12345 has shared above, for small buffers, performance of crypto engine is not good, so that could explain why your benchmark is lower than the one I posted for ipq8065.
Is there any way for you to disable nss (perhaps rename the nss driver?), check if that changes /proc/crypto priority and module, and then re-run the openssl benchmark?
You also have to keep in mind that ipq8065 ~= KRAIT300 ~= cortex a15 <-- out of order execution, while cortex a53 is in-order.
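One way to take clock speed out of the comparison is to convert throughput into CPU cycles per byte. This is only a rough sketch: it assumes each core really ran at its reported cpuinfo_max_freq during the benchmark, and it ignores that the two runs used different openssl versions and that only one of them used -evp. The function name is hypothetical; the figures are the 8192-byte rows and the max frequencies posted above.

```python
def cycles_per_byte(max_freq_khz: int, throughput_k: float) -> float:
    """CPU cycles spent per processed byte.
    max_freq_khz: value of cpuinfo_max_freq (kHz).
    throughput_k: openssl speed figure (1000s of bytes per second)."""
    hz = max_freq_khz * 1000
    bytes_per_second = throughput_k * 1000
    return hz / bytes_per_second


# 8192-byte rows and cpuinfo_max_freq values from the posts above.
ax3600 = cycles_per_byte(1382400, 31061.09)   # Cortex-A53 (in-order)
ipq8065 = cycles_per_byte(1725000, 55719.34)  # Krait (out-of-order)

print(f"AX3600 / ipq8071a: {ax3600:.1f} cycles/byte")
print(f"IPQ8065:           {ipq8065:.1f} cycles/byte")
```

Both results land in the tens of cycles per byte, so a good part of the gap between the two devices is clock speed and core design rather than anything AES-specific.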
No, I cannot do a whole lot on the OEM firmware. Only /etc/ is writable, / is not, and I'm not that deep into NSS or openssl. This device is mostly useless in its current state without official OpenWrt support.
Thanks for sharing that. It can be useful to @jiegec for comparing benchmarks with and without AES-NI, assuming he finds that his aes priority in /proc/crypto is greater than 100. In his case, he does not have the complication @slh is facing of the nss engine potentially overriding AES-NI.
On Linksys E8450:
cat /proc/crypto says:
name : aes
driver : aes-generic
module : kernel
priority : 100
refcnt : 4
selftest : passed
internal : no
type : cipher
blocksize : 16
min keysize : 16
max keysize : 32
root@OpenWrt:~# openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 18251273 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 64 size blocks: 14115514 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 7278493 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2547200 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 363367 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 183837 aes-128-cbc's in 3.00s
OpenSSL 1.1.1j 16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 97665.67k 301130.97k 621098.07k 869444.27k 992234.15k 1003995.14k
Thanks for sharing. I guess that might explain the low benchmark. The priority being 100 suggests that AES-NI is not being used. Just to confirm that there are no other aes entries in that list, can you post the full output of /proc/crypto, similar to what @slh posted for his AX3600 above?
That looks like a gigantic bump for aes-128! I am not even sure how to interpret this. Can you run it again and then immediately run the same command, just replacing aes-128 with aes-256? Thanks.
I've done some benchmarks for you with RSA and ECDH. While it's not AES, it does illustrate what I am talking about, which is to say, the vast gulf between synchronous and asynchronous operation of the crypto hardware.
And the killer is that the application has to be specifically coded to take advantage of the asynchronous mode. It's not transparent and not all workloads map well to it.
Note also how on ECDH, the pure software implementation is faster than the synchronous one where smaller buffers are concerned. It only gets to parity on the synchronous vs software at 571 bits. The asynchronous version on the other hand, is fully 18x faster on 571 bits and 11x faster on 160 bits
With RSA, the sign operations are the most expensive and even synchronous mode beats software, although for verify operations, software beats synchronous mode.
The poorer performance of synchronous mode has everything to do with the comparative inefficiency of shunting data over a bus and using off-die contiguous main memory. And this is the only mode that can be used for applications that are not explicitly recoded to take advantage of asynchronous mode.
So if you have a web server that uses openssl and it has high traffic, you will benefit greatly from using the accelerator. For a single Openvpn tunnel? No, your performance will be worse.
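To put numbers on the sync/async/software comparison, here is a small sketch that computes speedups relative to pure software. The sign/s and verify/s figures are copied from the RSA 2048 results listed further down in this post; the dictionary layout and function name are just for illustration.

```python
# sign/s and verify/s from the RSA 2048 runs in this post.
RSA2048 = {
    "async":    {"sign": 9058.7, "verify": 54292.2},
    "sync":     {"sign": 1206.0, "verify": 7009.2},
    "software": {"sign": 371.9,  "verify": 12874.0},
}


def speedup(mode: str, op: str, baseline: str = "software") -> float:
    """Ratio of operations per second for mode versus the baseline."""
    return RSA2048[mode][op] / RSA2048[baseline][op]


for mode in ("async", "sync"):
    for op in ("sign", "verify"):
        print(f"{mode:5s} {op:6s} {speedup(mode, op):.2f}x vs software")
```

A ratio above 1 means the accelerator wins for that operation and below 1 means software wins, which is the pattern described above for synchronous verify.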
For reference, this is the board that the benchmarks are run on. It's an Intel C3758 8-core x86_64
# RSA 2K
# asynchronous
# openssl speed -engine qat -elapsed -async_jobs 72 rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 90678 2048 bits private RSA's in 10.01s
Doing 2048 bits public rsa's for 10s: 542922 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1i 8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
sign verify sign/s verify/s
rsa 2048 bits 0.000110s 0.000018s 9058.7 54292.2
# synchronous
# openssl speed -engine qat -elapsed rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 12060 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 70092 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1i 8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
sign verify sign/s verify/s
rsa 2048 bits 0.000829s 0.000143s 1206.0 7009.2
# Software
# openssl speed -elapsed rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 3719 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 128740 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1i 8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
sign verify sign/s verify/s
rsa 2048 bits 0.002689s 0.000078s 371.9 12874.0
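The sign/s and verify/s columns are simply the operation count divided by the elapsed time. As a rough sketch (the dictionary keys and the helper name are my own, only the counts come from the output above), the synchronous QAT run and the software run can be compared like this:

```python
# Counts and elapsed seconds copied from the "Doing 2048 bits ... rsa's" lines above.
# The labels "qat_sync" and "software" are illustrative names, not openssl output.
runs = {
    "qat_sync": {"sign": (12060, 10.00), "verify": (70092, 10.00)},
    "software": {"sign": (3719, 10.00), "verify": (128740, 10.00)},
}


def ops_per_second(count, seconds):
    """Operations per second, as shown in the sign/s and verify/s columns."""
    return count / seconds


rates = {
    name: {op: ops_per_second(*sample) for op, sample in ops.items()}
    for name, ops in runs.items()
}

# Ratio > 1 means the QAT engine is faster than plain software.
sign_ratio = rates["qat_sync"]["sign"] / rates["software"]["sign"]
verify_ratio = rates["qat_sync"]["verify"] / rates["software"]["verify"]

for name, r in rates.items():
    print(f"{name:9s} sign/s={r['sign']:8.1f} verify/s={r['verify']:9.1f}")
print(f"QAT sync vs software: sign x{sign_ratio:.2f}, verify x{verify_ratio:.2f}")
```

With these figures the synchronous engine signs roughly three times faster than software, but verifies at only about half the software rate.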
# ECDH Compute Key
# Asynchronous
# openssl speed -engine qat -elapsed -async_jobs 36 ecdh
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 160 bits ecdh's for 10s: 233455 160-bits ECDH ops in 10.01s
Doing 192 bits ecdh's for 10s: 206296 192-bits ECDH ops in 10.00s
Doing 224 bits ecdh's for 10s: 172498 224-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 163355 256-bits ECDH ops in 10.01s
Doing 384 bits ecdh's for 10s: 95264 384-bits ECDH ops in 10.00s
Doing 521 bits ecdh's for 10s: 70993 521-bits ECDH ops in 10.00s
Doing 163 bits ecdh's for 10s: 180585 163-bits ECDH ops in 10.00s
Doing 233 bits ecdh's for 10s: 134987 233-bits ECDH ops in 10.00s
Doing 283 bits ecdh's for 10s: 64415 283-bits ECDH ops in 10.00s
Doing 409 bits ecdh's for 10s: 42718 409-bits ECDH ops in 10.01s
Doing 571 bits ecdh's for 10s: 35187 571-bits ECDH ops in 10.02s
Doing 163 bits ecdh's for 10s: 180784 163-bits ECDH ops in 10.01s
Doing 233 bits ecdh's for 10s: 134922 233-bits ECDH ops in 10.00s
Doing 283 bits ecdh's for 10s: 60481 283-bits ECDH ops in 10.01s
Doing 409 bits ecdh's for 10s: 45157 409-bits ECDH ops in 10.01s
Doing 571 bits ecdh's for 10s: 35105 571-bits ECDH ops in 10.01s
Doing 256 bits ecdh's for 10s: 163370 256-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 163626 256-bits ECDH ops in 10.01s
Doing 384 bits ecdh's for 10s: 92456 384-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 91757 384-bits ECDH ops in 10.01s
Doing 512 bits ecdh's for 10s: 72111 512-bits ECDH ops in 10.00s
Doing 512 bits ecdh's for 10s: 72434 512-bits ECDH ops in 10.01s
Doing 253 bits ecdh's for 10s: 79920 253-bits ECDH ops in 10.00s
Doing 448 bits ecdh's for 10s: 6412 448-bits ECDH ops in 10.00s
OpenSSL 1.1.1i 8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
op op/s
160 bits ecdh (secp160r1) 0.0000s 23322.2
192 bits ecdh (nistp192) 0.0000s 20629.6
224 bits ecdh (nistp224) 0.0001s 17249.8
256 bits ecdh (nistp256) 0.0001s 16319.2
384 bits ecdh (nistp384) 0.0001s 9526.4
521 bits ecdh (nistp521) 0.0001s 7099.3
163 bits ecdh (nistk163) 0.0001s 18058.5
233 bits ecdh (nistk233) 0.0001s 13498.7
283 bits ecdh (nistk283) 0.0002s 6441.5
409 bits ecdh (nistk409) 0.0002s 4267.5
571 bits ecdh (nistk571) 0.0003s 3511.7
163 bits ecdh (nistb163) 0.0001s 18060.3
233 bits ecdh (nistb233) 0.0001s 13492.2
283 bits ecdh (nistb283) 0.0002s 6042.1
409 bits ecdh (nistb409) 0.0002s 4511.2
571 bits ecdh (nistb571) 0.0003s 3507.0
256 bits ecdh (brainpoolP256r1) 0.0001s 16337.0
256 bits ecdh (brainpoolP256t1) 0.0001s 16346.3
384 bits ecdh (brainpoolP384r1) 0.0001s 9245.6
384 bits ecdh (brainpoolP384t1) 0.0001s 9166.5
512 bits ecdh (brainpoolP512r1) 0.0001s 7211.1
512 bits ecdh (brainpoolP512t1) 0.0001s 7236.2
253 bits ecdh (X25519) 0.0001s 7992.0
448 bits ecdh (X448) 0.0016s 641.2
# Synchronous
# openssl speed -engine qat -elapsed ecdh
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 160 bits ecdh's for 10s: 14411 160-bits ECDH ops in 10.01s
Doing 192 bits ecdh's for 10s: 13037 192-bits ECDH ops in 10.00s
Doing 224 bits ecdh's for 10s: 11008 224-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 10276 256-bits ECDH ops in 10.01s
Doing 384 bits ecdh's for 10s: 5958 384-bits ECDH ops in 10.00s
Doing 521 bits ecdh's for 10s: 4639 521-bits ECDH ops in 10.00s
Doing 163 bits ecdh's for 10s: 11409 163-bits ECDH ops in 10.00s
Doing 233 bits ecdh's for 10s: 8442 233-bits ECDH ops in 10.00s
Doing 283 bits ecdh's for 10s: 4126 283-bits ECDH ops in 10.00s
Doing 409 bits ecdh's for 10s: 2749 409-bits ECDH ops in 10.00s
Doing 571 bits ecdh's for 10s: 2312 571-bits ECDH ops in 10.00s
Doing 163 bits ecdh's for 10s: 11020 163-bits ECDH ops in 10.00s
Doing 233 bits ecdh's for 10s: 8207 233-bits ECDH ops in 10.00s
Doing 283 bits ecdh's for 10s: 3906 283-bits ECDH ops in 10.00s
Doing 409 bits ecdh's for 10s: 2923 409-bits ECDH ops in 10.00s
Doing 571 bits ecdh's for 10s: 2302 571-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 10479 256-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 10427 256-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 5776 384-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 5759 384-bits ECDH ops in 10.00s
Doing 512 bits ecdh's for 10s: 4729 512-bits ECDH ops in 10.00s
Doing 512 bits ecdh's for 10s: 4570 512-bits ECDH ops in 10.00s
Doing 253 bits ecdh's for 10s: 79924 253-bits ECDH ops in 10.00s
Doing 448 bits ecdh's for 10s: 6417 448-bits ECDH ops in 10.00s
OpenSSL 1.1.1i 8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
op op/s
160 bits ecdh (secp160r1) 0.0007s 1439.7
192 bits ecdh (nistp192) 0.0008s 1303.7
224 bits ecdh (nistp224) 0.0009s 1100.8
256 bits ecdh (nistp256) 0.0010s 1026.6
384 bits ecdh (nistp384) 0.0017s 595.8
521 bits ecdh (nistp521) 0.0022s 463.9
163 bits ecdh (nistk163) 0.0009s 1140.9
233 bits ecdh (nistk233) 0.0012s 844.2
283 bits ecdh (nistk283) 0.0024s 412.6
409 bits ecdh (nistk409) 0.0036s 274.9
571 bits ecdh (nistk571) 0.0043s 231.2
163 bits ecdh (nistb163) 0.0009s 1102.0
233 bits ecdh (nistb233) 0.0012s 820.7
283 bits ecdh (nistb283) 0.0026s 390.6
409 bits ecdh (nistb409) 0.0034s 292.3
571 bits ecdh (nistb571) 0.0043s 230.2
256 bits ecdh (brainpoolP256r1) 0.0010s 1047.9
256 bits ecdh (brainpoolP256t1) 0.0010s 1042.7
384 bits ecdh (brainpoolP384r1) 0.0017s 577.6
384 bits ecdh (brainpoolP384t1) 0.0017s 575.9
512 bits ecdh (brainpoolP512r1) 0.0021s 472.9
512 bits ecdh (brainpoolP512t1) 0.0022s 457.0
253 bits ecdh (X25519) 0.0001s 7992.4
448 bits ecdh (X448) 0.0016s 641.7
# Software
# openssl speed -elapsed ecdh
You have chosen to measure elapsed time instead of user CPU time.
Doing 160 bits ecdh's for 10s: 19934 160-bits ECDH ops in 10.00s
Doing 192 bits ecdh's for 10s: 16298 192-bits ECDH ops in 10.00s
Doing 224 bits ecdh's for 10s: 10878 224-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 54929 256-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 3968 384-bits ECDH ops in 10.00s
Doing 521 bits ecdh's for 10s: 1634 521-bits ECDH ops in 10.00s
Doing 163 bits ecdh's for 10s: 16957 163-bits ECDH ops in 10.00s
Doing 233 bits ecdh's for 10s: 12276 233-bits ECDH ops in 10.01s
Doing 283 bits ecdh's for 10s: 7125 283-bits ECDH ops in 10.00s
Doing 409 bits ecdh's for 10s: 4239 409-bits ECDH ops in 10.00s
Doing 571 bits ecdh's for 10s: 1940 571-bits ECDH ops in 10.00s
Doing 163 bits ecdh's for 10s: 16276 163-bits ECDH ops in 10.00s
Doing 233 bits ecdh's for 10s: 11936 233-bits ECDH ops in 10.00s
Doing 283 bits ecdh's for 10s: 6797 283-bits ECDH ops in 10.00s
Doing 409 bits ecdh's for 10s: 4020 409-bits ECDH ops in 10.00s
Doing 571 bits ecdh's for 10s: 1809 571-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 9784 256-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 9779 256-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 3976 384-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 4026 384-bits ECDH ops in 10.00s
Doing 512 bits ecdh's for 10s: 2290 512-bits ECDH ops in 10.00s
Doing 512 bits ecdh's for 10s: 2192 512-bits ECDH ops in 10.01s
Doing 253 bits ecdh's for 10s: 79917 253-bits ECDH ops in 10.00s
Doing 448 bits ecdh's for 10s: 6411 448-bits ECDH ops in 10.00s
OpenSSL 1.1.1i 8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
op op/s
160 bits ecdh (secp160r1) 0.0005s 1993.4
192 bits ecdh (nistp192) 0.0006s 1629.8
224 bits ecdh (nistp224) 0.0009s 1087.8
256 bits ecdh (nistp256) 0.0002s 5492.9
384 bits ecdh (nistp384) 0.0025s 396.8
521 bits ecdh (nistp521) 0.0061s 163.4
163 bits ecdh (nistk163) 0.0006s 1695.7
233 bits ecdh (nistk233) 0.0008s 1226.4
283 bits ecdh (nistk283) 0.0014s 712.5
409 bits ecdh (nistk409) 0.0024s 423.9
571 bits ecdh (nistk571) 0.0052s 194.0
163 bits ecdh (nistb163) 0.0006s 1627.6
233 bits ecdh (nistb233) 0.0008s 1193.6
283 bits ecdh (nistb283) 0.0015s 679.7
409 bits ecdh (nistb409) 0.0025s 402.0
571 bits ecdh (nistb571) 0.0055s 180.9
256 bits ecdh (brainpoolP256r1) 0.0010s 978.4
256 bits ecdh (brainpoolP256t1) 0.0010s 977.9
384 bits ecdh (brainpoolP384r1) 0.0025s 397.6
384 bits ecdh (brainpoolP384t1) 0.0025s 402.6
512 bits ecdh (brainpoolP512r1) 0.0044s 229.0
512 bits ecdh (brainpoolP512t1) 0.0046s 219.0
253 bits ecdh (X25519) 0.0001s 7991.7
448 bits ecdh (X448) 0.0016s 641.1
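To make the three ECDH tables easier to compare, here is a small sketch that puts a few curves side by side. The op/s values are copied from the tables above; the selection of curves and the helper name are my own choices.

```python
# op/s copied from the three ECDH result tables above.
# Tuple order: (QAT with -async_jobs 36, QAT synchronous, software only)
ecdh_ops = {
    "nistp256": (16319.2, 1026.6, 5492.9),
    "nistp384": (9526.4, 595.8, 396.8),
    "nistp521": (7099.3, 463.9, 163.4),
    "brainpoolP256r1": (16337.0, 1047.9, 978.4),
    "X25519": (7992.0, 7992.4, 7991.7),
    "X448": (641.2, 641.7, 641.1),
}


def ratios_vs_software(qat_async, qat_sync, software):
    """Return (async/software, sync/software) throughput ratios."""
    return qat_async / software, qat_sync / software


ratios = {curve: ratios_vs_software(*ops) for curve, ops in ecdh_ops.items()}

for curve, (async_ratio, sync_ratio) in ratios.items():
    print(f"{curve:16s} async x{async_ratio:6.2f}  sync x{sync_ratio:5.2f}")
```

Two things stand out in these numbers. For nistp256 the software path beats the synchronous engine, possibly because of the ECP_NISTZ256_ASM code listed in the compiler flags (that is my guess, not something the output states). X25519 and X448 are practically identical in all three runs, which suggests they are not offloaded to the engine at all.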
I've copied this from the other thread. The benchmark was run on the less capable C2758.
It compares AES on the crypto accelerator (-engine qat) with AES-NI.
Note how AES-NI is faster than the crypto hardware for every buffer smaller than 8K. Only the really large buffers benefit from the crypto hardware.
root@OpenWrt:~# openssl speed -elapsed -engine qat -async_jobs 32 -multi 2 -evp aes-128-cbc-hmac-sha1
90743.27k 199864.47k 298705.41k 353766.06k 524913.32k 613946.71k
root@OpenWrt:~# openssl speed -elapsed -async_jobs 32 -multi 2 -evp aes-128-cbc-hmac-sha1
153131.80k 257124.37k 329312.17k 364508.16k 376572.59k 377416.36k
Takeaway from all this: don't bother with crypto hardware for OpenWrt. The only thing that will make a real difference is AES-NI.
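The crossover point can be read straight off the two result rows. A minimal sketch, assuming the six columns are the usual openssl speed block sizes of 16 to 16384 bytes (the quoted output has no header row, so that mapping is an assumption):

```python
# Assumed column order: the default openssl speed block sizes in bytes.
block_sizes = [16, 64, 256, 1024, 8192, 16384]

# Throughput in 1000s of bytes per second, copied from the two rows above.
qat = [90743.27, 199864.47, 298705.41, 353766.06, 524913.32, 613946.71]
aesni = [153131.80, 257124.37, 329312.17, 364508.16, 376572.59, 377416.36]

# Block sizes at which the QAT engine is faster than the AES-NI run.
qat_wins = [size for size, q, a in zip(block_sizes, qat, aesni) if q > a]

for size, q, a in zip(block_sizes, qat, aesni):
    winner = "qat" if q > a else "aes-ni"
    print(f"{size:6d} bytes: qat/aes-ni = {q / a:4.2f} ({winner} faster)")
print("QAT faster at:", qat_wins)
```

Under that assumption the engine only wins at 8192 and 16384 bytes, and at 16 bytes it reaches roughly 60% of the AES-NI throughput.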
AES 128 vs AES 256:
root@OpenWrt:~# openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 18805661 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 14574331 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 7468735 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2609682 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 368604 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 186014 aes-128-cbc's in 3.00s
OpenSSL 1.1.1j 16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 100296.86k 310919.06k 637332.05k 890771.46k 1006534.66k 1015884.46k
root@OpenWrt:~# openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 17664041 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 12219316 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 5379637 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1696935 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 229447 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 115396 aes-256-cbc's in 3.00s
OpenSSL 1.1.1j 16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 94208.22k 260678.74k 459062.36k 579220.48k 626543.27k 630216.02k
root@OpenWrt:~#
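For anyone wondering how the "k" figures relate to the "Doing ..." lines: they are the operation count times the block size, divided by the elapsed seconds and then by 1000. A short sketch (the function name is mine) that reproduces two of the values above and compares AES-128 with AES-256 at the largest block size:

```python
def throughput_k(count, block_size, seconds):
    """Throughput in 1000s of bytes per second, as printed by openssl speed."""
    return count * block_size / seconds / 1000.0


# Counts taken from the "Doing aes-...-cbc for 3s" lines above.
aes128_16 = throughput_k(18805661, 16, 3.00)
aes128_16k = throughput_k(186014, 16384, 3.00)
aes256_16k = throughput_k(115396, 16384, 3.00)

# How much faster AES-128 is than AES-256 on 16384-byte blocks.
ratio_16k = aes128_16k / aes256_16k

print(f"aes-128-cbc, 16 bytes:    {aes128_16:.2f}k")
print(f"aes-128-cbc, 16384 bytes: {aes128_16k:.2f}k")
print(f"aes-256-cbc, 16384 bytes: {aes256_16k:.2f}k")
print(f"aes-128 / aes-256 at 16384 bytes: {ratio_16k:.2f}")
```

On this device AES-128-CBC comes out roughly 1.6 times faster than AES-256-CBC for large blocks, while at 16 bytes the two are much closer.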
With and without EVP:
root@OpenWrt:~# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 17608820 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 64 size blocks: 12195151 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 256 size blocks: 5379602 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1696386 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 229437 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 16384 size blocks: 115389 aes-256-cbc's in 3.00s
OpenSSL 1.1.1j 16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 94227.80k 261033.33k 459059.37k 579033.09k 628611.34k 630177.79k
root@OpenWrt:~# openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 4905394 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 1306013 aes-256 cbc's in 2.99s
Doing aes-256 cbc for 3s on 256 size blocks: 333727 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 83839 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 10493 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16384 size blocks: 5243 aes-256 cbc's in 2.99s
OpenSSL 1.1.1j 16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256 cbc 26162.10k 27954.79k 28478.04k 28617.05k 28652.89k 28729.54k
root@OpenWrt:~#
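So it seems that the -evp runs use the optimised (hardware-accelerated) code path, while the plain aes-256-cbc test uses a much slower low-level software implementation: https://security.stackexchange.com/questions/35036/different-performance-of-openssl-speed-on-the-same-hardware-with-aes-256-evp-an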
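To put a number on the difference, here is a quick sketch that divides the -evp row by the non-EVP row for each block size. The values are copied from the two result tables above; the variable names are mine.

```python
# Block sizes in bytes, from the header row of the tables above.
sizes = [16, 64, 256, 1024, 8192, 16384]

# Throughput in 1000s of bytes per second.
evp = [94227.80, 261033.33, 459059.37, 579033.09, 628611.34, 630177.79]
non_evp = [26162.10, 27954.79, 28478.04, 28617.05, 28652.89, 28729.54]

# Speedup of the -evp run over the plain run, per block size.
speedup = {size: e / n for size, e, n in zip(sizes, evp, non_evp)}

for size, factor in speedup.items():
    print(f"{size:6d} bytes: -evp is x{factor:5.2f} faster")
```

The factor grows from roughly 3.6 at 16 bytes to roughly 22 at 16384 bytes. The non-EVP run stays at about 26 to 29 MB/s regardless of block size, whereas the -evp run keeps scaling.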
Manually built openssl vs. opkg openssl-util:
root@OpenWrt:~# ./openssl-static speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 21175430 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 64 size blocks: 13804150 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 5672902 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1724868 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 8192 size blocks: 230127 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 115459 aes-256-cbc's in 3.00s
OpenSSL 1.1.1j 16 Feb 2021
built on: Thu Mar 18 11:09:33 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: gcc -pthread -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 113313.34k 294488.53k 484087.64k 590724.02k 628400.13k 630560.09k
root@OpenWrt:~# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 17430854 aes-256-cbc's in 2.94s
Doing aes-256-cbc for 3s on 64 size blocks: 12043668 aes-256-cbc's in 2.95s
Doing aes-256-cbc for 3s on 256 size blocks: 5256099 aes-256-cbc's in 2.91s
Doing aes-256-cbc for 3s on 1024 size blocks: 1670277 aes-256-cbc's in 2.96s
Doing aes-256-cbc for 3s on 8192 size blocks: 224852 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 16384 size blocks: 112725 aes-256-cbc's in 2.94s
OpenSSL 1.1.1j 16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 94861.79k 261286.36k 462392.21k 577825.56k 630817.67k 628192.65k
root@OpenWrt:~#
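A small sketch of the percentage difference between the two builds per block size, using the two result rows above (the helper name is my own):

```python
# Block sizes in bytes, from the header row of the tables above.
sizes = [16, 64, 256, 1024, 8192, 16384]

# Throughput in 1000s of bytes per second.
static_k = [113313.34, 294488.53, 484087.64, 590724.02, 628400.13, 630560.09]
opkg_k = [94861.79, 261286.36, 462392.21, 577825.56, 630817.67, 628192.65]


def percent_gain(new, old):
    """Relative difference of new over old, in percent."""
    return (new - old) / old * 100.0


gains = {s: percent_gain(a, b) for s, a, b in zip(sizes, static_k, opkg_k)}

for size, gain in gains.items():
    print(f"{size:6d} bytes: static build {gain:+6.2f}% vs opkg")
```

The static build is about 19% ahead at 16 bytes, but the gap shrinks to well under 1% from 8192 bytes upward. My guess is that the difference is per-call overhead (the opkg build uses -Os and OPENSSL_SMALL_FOOTPRINT), not the AES code itself, but the output alone does not prove that.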