Which WiFi routers have hardware AES encryption support?

Just for comparison (Xiaomi Mi AIoT Router AX3600, ipq8071a, 4*1.38GHz, cortex a53/ ARMv8, with the rather unusable and unoptimized OEM firmware):

# cat /proc/cpuinfo 
processor       : 0
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 1
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 2
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 3
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4
# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq 
1017600

# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq 
1382400
# openssl engine -t -c
(dynamic) Dynamic engine loading support
     [ unavailable ]
# openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 4630455 aes-256 cbc's in 2.86s
Doing aes-256 cbc for 3s on 64 size blocks: 1313578 aes-256 cbc's in 2.73s
Doing aes-256 cbc for 3s on 256 size blocks: 332944 aes-256 cbc's in 2.59s
Doing aes-256 cbc for 3s on 1024 size blocks: 83464 aes-256 cbc's in 2.70s
Doing aes-256 cbc for 3s on 8192 size blocks: 10427 aes-256 cbc's in 2.75s
OpenSSL 1.0.2q  20 Nov 2018
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,2,int) aes(partial) blowfish(ptr) 
compiler: aarch64-openwrt-linux-gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -I/home/jenkins/romdaily_new_openwrt/system/staging_dir/target-aarch64-openwrt-linux_musl/usr/include -I/home/jenkins/romdaily_new_openwrt/system/staging_dir/target-aarch64-openwrt-linux_musl/include -I/home/jenkins/Xiaoqiangtoolchain/toolchain/external_toolchain/toolchain-aarch64_cortex-a53_gcc-5.5.0_musl//usr/include -I/home/jenkins/Xiaoqiangtoolchain/toolchain/external_toolchain/toolchain-aarch64_cortex-a53_gcc-5.5.0_musl//include -specs=/home/jenkins/romdaily_new_openwrt/system/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_SMALL_FOOTPRINT -DMIWIFI_FEATURE -DHAVE_CRYPTODEV -DOPENSSL_NO_ERR -DTERMIOS -Os -pipe -march=armv8-a -mcpu=cortex-a53+crypto -fno-caller-saves  -Wformat -fpic -fstack-protector -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -fpic -I/home/jenkins/romdaily_new_openwrt/system/package/libs/openssl/include -ffunction-sections -fdata-sections -fomit-frame-pointer -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc      25904.64k    30794.50k    32908.75k    31654.49k    31061.09k
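
For anyone comparing these tables by hand: each figure in the summary row can be recomputed from the corresponding "Doing ..." line as block size times operation count, divided by the measured seconds and then by 1000. Below is a minimal sketch of that arithmetic; the function name `throughput_k` and the regular expression are illustrative, not part of openssl.

```python
import re

# Matches progress lines such as:
# "Doing aes-256 cbc for 3s on 16 size blocks: 4630455 aes-256 cbc's in 2.86s"
LINE = re.compile(r"on (\d+) size blocks: (\d+) .* in ([\d.]+)s")


def throughput_k(line):
    """Return throughput in 1000s of bytes per second for one progress line."""
    m = LINE.search(line)
    if m is None:
        raise ValueError("not an openssl speed progress line")
    block_size = int(m.group(1))   # bytes per operation
    count = int(m.group(2))        # operations completed
    seconds = float(m.group(3))    # measured time
    return block_size * count / seconds / 1000.0


sample = "Doing aes-256 cbc for 3s on 16 size blocks: 4630455 aes-256 cbc's in 2.86s"
print(f"{throughput_k(sample):.2f}k")  # compare with the 16 bytes column above
```

Because the divisor is the measured time rather than the nominal 3s, runs with and without `-elapsed` are not directly comparable.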

ipq8074a would be clocked at 4*2.2 GHz, cortex a53/ ARMv8

Here are the results from IPQ8065 (1.7 GHz, 2 cores), with no NSS support and no AES, since it is 32-bit ARM-v7A

openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 6794589 aes-256-cbc's in 2.95s
Doing aes-256-cbc for 3s on 64 size blocks: 2381327 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 256 size blocks: 634833 aes-256-cbc's in 2.97s
Doing aes-256-cbc for 3s on 1024 size blocks: 161890 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 8192 size blocks: 20269 aes-256-cbc's in 2.98s
Doing aes-256-cbc for 3s on 16384 size blocks: 10131 aes-256-cbc's in 3.00s
OpenSSL 1.1.1i  8 Dec 2020
built on: Fri Jan 22 23:53:44 2021 UTC
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -mfloat-abi=hard -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_PREFER_CHACHA_OVER_GCM -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      36852.01k    50971.55k    54719.61k    55443.26k    55719.34k    55328.77k

cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
384000

cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1725000

cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 0 (v7l)
BogoMIPS        : 12.50
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
CPU implementer : 0x51
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0x04d
CPU revision    : 0

processor       : 1
model name      : ARMv7 Processor rev 0 (v7l)
BogoMIPS        : 26.04
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
CPU implementer : 0x51
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0x04d
CPU revision    : 0



Based on the above, I am not confident that the previous benchmarks from the E8450 and Xiaomi AX3200 are using AES.
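
A quick first check is whether the cores advertise an `aes` flag at all in /proc/cpuinfo, as the AX3600 output above does and the IPQ8065 output does not. Here is a small sketch of that check, assuming the per-processor layout shown in this thread ("Features" on ARM, "flags" on x86); the function name `cores_with_aes` is made up for illustration. It only shows that the instruction is available, not that a given openssl build uses it.

```python
def cores_with_aes(cpuinfo_text):
    """Map each processor index to True/False for an 'aes' feature flag."""
    result = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
            result[current] = False
        elif key in ("Features", "flags") and current is not None:
            # Compare whole tokens so that e.g. "vaes" does not count as "aes".
            result[current] = "aes" in value.split()
    return result


a53 = """processor       : 0
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
"""
krait = """processor       : 0
model name      : ARMv7 Processor rev 0 (v7l)
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32
"""
print(cores_with_aes(a53), cores_with_aes(krait))
```

On a router, the input would be the contents of /proc/cpuinfo, for example `cores_with_aes(open("/proc/cpuinfo").read())`.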

Yes, I am recommending that you go for something with on-core AES instructions (AES-NI on x86_64, or the equivalent crypto extensions on ARMv8), because the penalty for shunting the data over the bus to the crypto silicon makes it perform worse for small, synchronous operations than AES-NI.

Correct. It's also a pita to integrate - it required a lot of work to port the code, including a large number of patches, most of it to kernel code, along with crypto drivers, contiguous memory drivers and others, as well as a port of a patched openssl version designed to work with the hardware.

As @slh pointed out, actually using it is also non-trivial and the configuration of the hardware itself, while complicated, is the least of the issues.

Using Intel QuickAssist requires a patched asynchronous version of openssl, which was also a gigantic pita to compile and get working (for some reason Intel likes to write software designed for embedded systems that simply cannot be cross-compiled; pretty bizarre and yes, I pointed this out to the Intel folk responsible for maintaining the software). Using QuickAssist in nginx requires significant patches to nginx as well. It's not an "out of the box" experience by any means.

If you're curious to look at what it takes to get hardware like this working, the code is here

For typical OpenWrt synchronous workloads on smallish buffers (something like OpenVPN), performance using the crypto hardware on AES-CBC was about 70% - 80% of the performance of AES-NI. For larger buffers, the performance started to approach parity. For multiple (36+ threads) asynchronous operations, the speed was about 10x what you'd get using AES-NI.

It would be a real hassle to give you benchmarks, as I compiled the AES acceleration out of the Intel QuickAssist drivers. I'd need to recompile a half dozen kernel modules to be able to get you a benchmark.

The performance on RSA is very good, particularly for signing operations, which perform much better than software regardless of whether they are synchronous or not.

On-core AES-NI, definitely. No doubt in my mind.

@slh and @jiegec, can you run "cat /proc/crypto" and share the priority you get in the aes section?

For the benchmark I posted above, the priority is 100, which indicates no AES and no crypto engines, as below.

cat /proc/crypto

name         : aes
driver       : aes-generic
module       : kernel
priority     : 100
refcnt       : 7
selftest     : passed
internal     : no
type         : cipher
blocksize    : 16
min keysize  : 16
max keysize  : 32

If the AES instruction set were being used on your router, one would expect the priority to be > 100.
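
Since a full /proc/crypto dump runs to hundreds of lines, it can help to pull out only the AES entries with their driver and priority. The sketch below does that, assuming the blank-line-separated "key : value" layout shown in this thread; the function names are illustrative. Note that this table describes the kernel's crypto API, and an `openssl speed` run only goes through it when an engine such as afalg or devcrypto is loaded, so treat the priority as one hint among several.

```python
import re


def parse_proc_crypto(text):
    """Split /proc/crypto text into one dict per blank-line-separated entry."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def aes_implementations(text):
    """Return (priority, driver, name) for AES entries, highest priority first."""
    rows = [
        (int(e["priority"]), e["driver"], e["name"])
        for e in parse_proc_crypto(text)
        if "aes" in e.get("name", "") and "priority" in e and "driver" in e
    ]
    return sorted(rows, reverse=True)


sample = """name         : cbc(aes)
driver       : nss-cbc-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000

name         : cbc(aes)
driver       : cbc(aes-generic)
module       : cbc
priority     : 100

name         : sha1
driver       : sha1-generic
module       : sha1_generic
priority     : 0
"""
for priority, driver, name in aes_implementations(sample):
    print(f"{priority:>6}  {driver:<20} {name}")
```

On a device, the input would be `open("/proc/crypto").read()`.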

ax3600/ ipq8071a:

# cat /proc/crypto 
name         : hmac(sha512)
driver       : nss-hmac-sha512
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 128
digestsize   : 64

name         : hmac(sha384)
driver       : nss-hmac-sha384
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 128
digestsize   : 48

name         : hmac(sha256)
driver       : nss-hmac-sha256
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 64
digestsize   : 32

name         : hmac(sha1)
driver       : nss-hmac-sha1
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 64
digestsize   : 20

name         : hmac(md5)
driver       : nss-hmac-md5
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 64
digestsize   : 16

name         : sha512
driver       : nss-sha512
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 128
digestsize   : 64

name         : sha384
driver       : nss-sha384
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 128
digestsize   : 48

name         : sha256
driver       : nss-sha256
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 64
digestsize   : 32

name         : sha224
driver       : nss-sha224
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 64
digestsize   : 28

name         : sha1
driver       : nss-sha1
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 64
digestsize   : 20

name         : md5
driver       : nss-md5
module       : qca_nss_cfi_cryptoapi
priority     : 1000
refcnt       : 1
selftest     : passed
internal     : no
type         : ahash
async        : yes
blocksize    : 64
digestsize   : 16

name         : gcm(aes)
driver       : nss-gcm
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 12
maxauthsize  : 16
geniv        : <none>

name         : seqiv(rfc4106(gcm(aes)))
driver       : nss-rfc4106-gcm
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 8
maxauthsize  : 16
geniv        : <none>

name         : rfc4106(gcm(aes))
driver       : nss-rfc4106-gcm
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 8
maxauthsize  : 16
geniv        : <none>

name         : authenc(hmac(sha256),cbc(des3_ede))
driver       : nss-hmac-sha256-cbc-3des
module       : qca_nss_cfi_cryptoapi
priority     : 300
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 8
ivsize       : 8
maxauthsize  : 32
geniv        : <none>

name         : authenc(hmac(sha1),cbc(des3_ede))
driver       : nss-hmac-sha1-cbc-3des
module       : qca_nss_cfi_cryptoapi
priority     : 300
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 8
ivsize       : 8
maxauthsize  : 20
geniv        : <none>

name         : authenc(hmac(sha256),cbc(aes))
driver       : nss-hmac-sha256-cbc-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 16
maxauthsize  : 32
geniv        : <none>

name         : authenc(hmac(sha1),cbc(aes))
driver       : nss-hmac-sha1-cbc-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 16
maxauthsize  : 20
geniv        : <none>

name         : echainiv(authenc(hmac(sha256),cbc(des3_ede)))
driver       : nss-hmac-sha256-cbc-3des
module       : qca_nss_cfi_cryptoapi
priority     : 300
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 8
ivsize       : 8
maxauthsize  : 32
geniv        : <none>

name         : echainiv(authenc(hmac(sha1),cbc(des3_ede)))
driver       : nss-hmac-sha1-cbc-3des
module       : qca_nss_cfi_cryptoapi
priority     : 300
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 8
ivsize       : 8
maxauthsize  : 20
geniv        : <none>

name         : echainiv(authenc(hmac(md5),cbc(des3_ede)))
driver       : nss-hmac-md5-cbc-3des
module       : qca_nss_cfi_cryptoapi
priority     : 300
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 8
ivsize       : 8
maxauthsize  : 16
geniv        : <none>

name         : seqiv(authenc(hmac(sha256),rfc3686(ctr(aes))))
driver       : nss-hmac-sha256-rfc3686-ctr-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 8
maxauthsize  : 32
geniv        : <none>

name         : echainiv(authenc(hmac(sha256),cbc(aes)))
driver       : nss-hmac-sha256-cbc-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 16
maxauthsize  : 32
geniv        : <none>

name         : seqiv(authenc(hmac(sha1),rfc3686(ctr(aes))))
driver       : nss-hmac-sha1-rfc3686-ctr-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 8
maxauthsize  : 20
geniv        : <none>

name         : seqiv(authenc(hmac(md5),rfc3686(ctr(aes))))
driver       : nss-hmac-md5-rfc3686-ctr-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 8
maxauthsize  : 16
geniv        : <none>

name         : echainiv(authenc(hmac(sha1),cbc(aes)))
driver       : nss-hmac-sha1-cbc-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 16
maxauthsize  : 20
geniv        : <none>

name         : echainiv(authenc(hmac(md5),cbc(aes)))
driver       : nss-hmac-md5-cbc-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : aead
async        : yes
blocksize    : 16
ivsize       : 16
maxauthsize  : 16
geniv        : <none>

name         : cbc(des3_ede)
driver       : nss-cbc-des-ede
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : ablkcipher
async        : yes
blocksize    : 8
min keysize  : 24
max keysize  : 24
ivsize       : 8
geniv        : <default>

name         : ecb(aes)
driver       : nss-ecb-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : ablkcipher
async        : yes
blocksize    : 16
min keysize  : 16
max keysize  : 32
ivsize       : 0
geniv        : <default>

name         : rfc3686(ctr(aes))
driver       : nss-rfc3686-ctr-aes
module       : qca_nss_cfi_cryptoapi
priority     : 30000
refcnt       : 1
selftest     : passed
internal     : no
type         : ablkcipher
async        : yes
blocksize    : 16
min keysize  : 20
max keysize  : 36
ivsize       : 8
geniv        : seqiv

name         : cbc(aes)
driver       : nss-cbc-aes
module       : qca_nss_cfi_cryptoapi
priority     : 10000
refcnt       : 1
selftest     : passed
internal     : no
type         : ablkcipher
async        : yes
blocksize    : 16
min keysize  : 16
max keysize  : 32
ivsize       : 16
geniv        : <default>

name         : hmac(sha512)
driver       : hmac(sha512-generic)
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 128
digestsize   : 64

name         : hmac(sha384)
driver       : hmac(sha384-generic)
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 128
digestsize   : 48

name         : hmac(sha256)
driver       : hmac(sha256-generic)
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 64
digestsize   : 32

name         : cbc(cipher_null)
driver       : cbc(cipher_null-generic)
module       : cbc
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : blkcipher
blocksize    : 1
min keysize  : 0
max keysize  : 0
ivsize       : 1
geniv        : <default>

name         : cbc(aes)
driver       : cbc(aes-generic)
module       : cbc
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : blkcipher
blocksize    : 16
min keysize  : 16
max keysize  : 32
ivsize       : 16
geniv        : <default>

name         : hmac(sha1)
driver       : hmac(sha1-generic)
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 64
digestsize   : 20

name         : sha1
driver       : sha1-generic
module       : sha1_generic
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 64
digestsize   : 20

name         : hmac(md5)
driver       : hmac(md5-generic)
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 64
digestsize   : 16

name         : cbc(des3_ede)
driver       : cbc(des3_ede-generic)
module       : cbc
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : blkcipher
blocksize    : 8
min keysize  : 24
max keysize  : 24
ivsize       : 8
geniv        : <default>

name         : cbc(des)
driver       : cbc(des-generic)
module       : cbc
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : blkcipher
blocksize    : 8
min keysize  : 8
max keysize  : 8
ivsize       : 8
geniv        : <default>

name         : md5
driver       : md5-generic
module       : md5
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 64
digestsize   : 16

name         : des3_ede
driver       : des3_ede-generic
module       : des_generic
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : cipher
blocksize    : 8
min keysize  : 24
max keysize  : 24

name         : des
driver       : des-generic
module       : des_generic
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : cipher
blocksize    : 8
min keysize  : 8
max keysize  : 8

name         : ghash
driver       : ghash-generic
module       : kernel
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 16
digestsize   : 16

name         : jitterentropy_rng
driver       : jitterentropy_rng
module       : kernel
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : rng
seedsize     : 0

name         : stdrng
driver       : drbg_nopr_hmac_sha256
module       : kernel
priority     : 207
refcnt       : 1
selftest     : passed
internal     : no
type         : rng
seedsize     : 0

name         : stdrng
driver       : drbg_nopr_hmac_sha512
module       : kernel
priority     : 206
refcnt       : 1
selftest     : passed
internal     : no
type         : rng
seedsize     : 0

name         : stdrng
driver       : drbg_nopr_hmac_sha384
module       : kernel
priority     : 205
refcnt       : 1
selftest     : passed
internal     : no
type         : rng
seedsize     : 0

name         : stdrng
driver       : drbg_nopr_hmac_sha1
module       : kernel
priority     : 204
refcnt       : 1
selftest     : passed
internal     : no
type         : rng
seedsize     : 0

name         : stdrng
driver       : drbg_pr_hmac_sha256
module       : kernel
priority     : 203
refcnt       : 1
selftest     : passed
internal     : no
type         : rng
seedsize     : 0

name         : stdrng
driver       : drbg_pr_hmac_sha512
module       : kernel
priority     : 202
refcnt       : 1
selftest     : passed
internal     : no
type         : rng
seedsize     : 0

name         : stdrng
driver       : drbg_pr_hmac_sha384
module       : kernel
priority     : 201
refcnt       : 1
selftest     : passed
internal     : no
type         : rng
seedsize     : 0

name         : stdrng
driver       : drbg_pr_hmac_sha1
module       : kernel
priority     : 200
refcnt       : 1
selftest     : passed
internal     : no
type         : rng
seedsize     : 0

name         : xz
driver       : xz-generic
module       : kernel
priority     : 0
refcnt       : 2
selftest     : passed
internal     : no
type         : compression

name         : lzo
driver       : lzo-generic
module       : kernel
priority     : 0
refcnt       : 2
selftest     : passed
internal     : no
type         : compression

name         : crc32c
driver       : crc32c-generic
module       : kernel
priority     : 100
refcnt       : 2
selftest     : passed
internal     : no
type         : shash
blocksize    : 1
digestsize   : 4

name         : deflate
driver       : deflate-generic
module       : kernel
priority     : 0
refcnt       : 2
selftest     : passed
internal     : no
type         : compression

name         : ecb(arc4)
driver       : ecb(arc4)-generic
module       : kernel
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : blkcipher
blocksize    : 1
min keysize  : 1
max keysize  : 256
ivsize       : 0
geniv        : <default>

name         : arc4
driver       : arc4-generic
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : cipher
blocksize    : 1
min keysize  : 1
max keysize  : 256

name         : aes
driver       : aes-generic
module       : kernel
priority     : 100
refcnt       : 2
selftest     : passed
internal     : no
type         : cipher
blocksize    : 16
min keysize  : 16
max keysize  : 32

name         : sha384
driver       : sha384-generic
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 128
digestsize   : 48

name         : sha512
driver       : sha512-generic
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 128
digestsize   : 64

name         : sha224
driver       : sha224-generic
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 64
digestsize   : 28

name         : sha256
driver       : sha256-generic
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 64
digestsize   : 32

name         : digest_null
driver       : digest_null-generic
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 1
digestsize   : 0

name         : compress_null
driver       : compress_null-generic
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : compression

name         : ecb(cipher_null)
driver       : ecb-cipher_null
module       : kernel
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : blkcipher
blocksize    : 1
min keysize  : 0
max keysize  : 0
ivsize       : 0
geniv        : <default>

name         : cipher_null
driver       : cipher_null-generic
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : cipher
blocksize    : 1
min keysize  : 0
max keysize  : 0

There are some benchmarks in this thread

Those benchmarks are for the C2000 SoCs. I updated the code for the C3000 SoCs, which perform better than the benchmarks in that thread.

Thanks a ton for sharing your detailed insights as well as the rough benchmarks above! You have certainly enlightened me today. Very kind of you. Appreciate it.

Of course, please ignore. I was just hoping to get whatever you had off the top of your mind, which you have already done above.

Thanks for confirming it.

Do you have any insight on how to measure the impact of on-core AES-NI on openssl performance? As you'll see in the thread above, I am struggling a bit to do so, since the benchmarks with AES-NI support seem to be poorer than those without AES-NI. It feels to me that either we have a measurement issue, or AES is somehow not being invoked. It is not clear how to figure out which.

openssl speed -elapsed -evp aes-128-cbc-hmac-sha1
Or with AES-NI enabled
openssl speed -elapsed -evp aes-128-cbc

With AES-NI disabled
OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-cbc
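
For context on that mask: `OPENSSL_ia32cap` is an x86-only capability vector, and the leading `~` asks OpenSSL to clear the listed bits. The sketch below decodes which bits the value sets, using the bit positions given in the OpenSSL documentation for this variable (bit 57 for AES-NI, bit 33 for PCLMULQDQ); the `BIT_NAMES` table and function name are only for illustration and cover just these two bits. On ARM builds the analogous variable is `OPENSSL_armcap`, which has a different bit layout, so this particular mask does nothing useful on the routers discussed here.

```python
# Bit positions in the 64-bit OPENSSL_ia32cap vector; the upper 32 bits
# mirror CPUID.1:ECX, so ECX bit 25 (AES-NI) lands at bit 57 and
# ECX bit 1 (PCLMULQDQ) lands at bit 33.
BIT_NAMES = {33: "PCLMULQDQ", 57: "AES-NI"}


def masked_features(mask):
    """List the known feature names whose bits are set in mask."""
    return [name for bit, name in sorted(BIT_NAMES.items()) if (mask >> bit) & 1]


mask = 0x200000200000000
print(masked_features(mask))
```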

The priority of 10000 certainly seems to confirm that hardware AES is being invoked, but the module of qca_nss_cfi_cryptoapi seems to suggest that it is not the AES-NI but rather the on-silicon crypto engine which is being used. Based on what @dl12345 has shared above, for small buffers, performance of crypto engine is not good, so that could explain why your benchmark is lower than the one I posted for ipq8065.

Is there any way for you to disable nss (perhaps rename the nss driver?), check if that changes /proc/crypto priority and module, and then re-run the openssl benchmark?

You also have to keep in mind that ipq8065 ~= KRAIT300 ~= cortex a15 <-- out of order execution, while cortex a53 is in-order.

No, I cannot do a whole lot on the OEM firmware. Only /etc/ is writable, / is not, and I'm not that deep into NSS or openssl. This device is mostly useless in its current state without official OpenWrt support.

Understood.

Let's see if @jiegec has better luck on getting the above information for the E8450.

Thanks for sharing that. It can be useful to @jiegec for comparing benchmarks with and without AES-NI, assuming he finds that his aes priority in /proc/crypto is greater than 100. In his case, he does not have the complication @slh is facing of the nss engine potentially overriding AES-NI.

On Linksys E8450:

cat /proc/crypto says:

name         : aes
driver       : aes-generic
module       : kernel
priority     : 100
refcnt       : 4
selftest     : passed
internal     : no
type         : cipher
blocksize    : 16
min keysize  : 16
max keysize  : 32
root@OpenWrt:~# openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 18251273 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 64 size blocks: 14115514 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 7278493 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2547200 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 363367 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 183837 aes-128-cbc's in 3.00s
OpenSSL 1.1.1j  16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc      97665.67k   301130.97k   621098.07k   869444.27k   992234.15k  1003995.14k

Thanks for sharing. I guess that might explain the low benchmark. The priority being 100 suggests that AES-NI is not being used. Just to confirm that there are no other aes entries in that list, can you post the full output of /proc/crypto, similar to what @slh posted for his AX3600 above?

That looks like a gigantic bump for aes-128! I am not even sure how to interpret this. Can you run it again and then immediately run the same command, just replacing aes-128 with aes-256? Thanks.

I've done some benchmarks for you with RSA and ECDH. While it's not AES, it does illustrate what I am talking about, which is to say, the vast gulf between synchronous and asynchronous operation of the crypto hardware.

And the killer is that the application has to be specifically coded to take advantage of the asynchronous mode. It's not transparent and not all workloads map well to it.

Note also how on ECDH, the pure software implementation is faster than the synchronous one where smaller buffers are concerned. Synchronous mode only reaches parity with software at 571 bits. The asynchronous version, on the other hand, is fully 18x faster at 571 bits and 11x faster at 160 bits.

With RSA, the sign operations are the most expensive and even synchronous mode beats software, although for verify operations, software beats synchronous mode.

The poorer performance of synchronous mode has everything to do with the comparative inefficiency of shunting data over a bus and using off-die contiguous main memory. And this is the only mode that can be used for applications that are not explicitly recoded to take advantage of asynchronous mode.

So if you have a web server that uses openssl and it has high traffic, you will benefit greatly from using the accelerator. For a single OpenVPN tunnel? No, your performance will be worse.
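
To put a number on the synchronous versus asynchronous gap, the raw counts from the RSA 2048 runs further down can be converted to operations per second. This is a small sketch using the two "private" (sign) lines copied from those runs; the regular expression and function name are illustrative only.

```python
import re

# Matches e.g.
# "Doing 2048 bits private rsa's for 10s: 90678 2048 bits private RSA's in 10.01s"
RSA_LINE = re.compile(r": (\d+) \d+ bits (private|public) RSA's in ([\d.]+)s")


def ops_per_second(line):
    """Return (kind, operations per second) for one RSA progress line."""
    m = RSA_LINE.search(line)
    if m is None:
        raise ValueError("not an openssl speed RSA progress line")
    return m.group(2), int(m.group(1)) / float(m.group(3))


async_sign = "Doing 2048 bits private rsa's for 10s: 90678 2048 bits private RSA's in 10.01s"
sync_sign = "Doing 2048 bits private rsa's for 10s: 12060 2048 bits private RSA's in 10.00s"

_, async_rate = ops_per_second(async_sign)
_, sync_rate = ops_per_second(sync_sign)
print(f"async {async_rate:.1f}/s, sync {sync_rate:.1f}/s, "
      f"ratio {async_rate / sync_rate:.1f}x")
```

With `-async_jobs 72`, signing comes out roughly 7.5 times faster than the synchronous run on the same engine.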

For reference, this is the board that the benchmarks are run on. It's an Intel C3758 8-core x86_64.

# RSA 2K

# asynchronous
# openssl speed -engine qat -elapsed -async_jobs 72 rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 90678 2048 bits private RSA's in 10.01s
Doing 2048 bits public rsa's for 10s: 542922 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1i  8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000110s 0.000018s   9058.7  54292.2

# synchronous
# openssl speed -engine qat -elapsed rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 12060 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 70092 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1i  8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000829s 0.000143s   1206.0   7009.2

# software
# openssl speed -elapsed rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 3719 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 128740 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1i  8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.002689s 0.000078s    371.9  12874.0

# ECDH Compute Key

# Asynchronous
# openssl speed -engine qat -elapsed -async_jobs 36 ecdh
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 160 bits  ecdh's for 10s: 233455 160-bits ECDH ops in 10.01s
Doing 192 bits  ecdh's for 10s: 206296 192-bits ECDH ops in 10.00s
Doing 224 bits  ecdh's for 10s: 172498 224-bits ECDH ops in 10.00s
Doing 256 bits  ecdh's for 10s: 163355 256-bits ECDH ops in 10.01s
Doing 384 bits  ecdh's for 10s: 95264 384-bits ECDH ops in 10.00s
Doing 521 bits  ecdh's for 10s: 70993 521-bits ECDH ops in 10.00s
Doing 163 bits  ecdh's for 10s: 180585 163-bits ECDH ops in 10.00s
Doing 233 bits  ecdh's for 10s: 134987 233-bits ECDH ops in 10.00s
Doing 283 bits  ecdh's for 10s: 64415 283-bits ECDH ops in 10.00s
Doing 409 bits  ecdh's for 10s: 42718 409-bits ECDH ops in 10.01s
Doing 571 bits  ecdh's for 10s: 35187 571-bits ECDH ops in 10.02s
Doing 163 bits  ecdh's for 10s: 180784 163-bits ECDH ops in 10.01s
Doing 233 bits  ecdh's for 10s: 134922 233-bits ECDH ops in 10.00s
Doing 283 bits  ecdh's for 10s: 60481 283-bits ECDH ops in 10.01s
Doing 409 bits  ecdh's for 10s: 45157 409-bits ECDH ops in 10.01s
Doing 571 bits  ecdh's for 10s: 35105 571-bits ECDH ops in 10.01s
Doing 256 bits  ecdh's for 10s: 163370 256-bits ECDH ops in 10.00s
Doing 256 bits  ecdh's for 10s: 163626 256-bits ECDH ops in 10.01s
Doing 384 bits  ecdh's for 10s: 92456 384-bits ECDH ops in 10.00s
Doing 384 bits  ecdh's for 10s: 91757 384-bits ECDH ops in 10.01s
Doing 512 bits  ecdh's for 10s: 72111 512-bits ECDH ops in 10.00s
Doing 512 bits  ecdh's for 10s: 72434 512-bits ECDH ops in 10.01s
Doing 253 bits  ecdh's for 10s: 79920 253-bits ECDH ops in 10.00s
Doing 448 bits  ecdh's for 10s: 6412 448-bits ECDH ops in 10.00s
OpenSSL 1.1.1i  8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                              op      op/s
 160 bits ecdh (secp160r1)   0.0000s  23322.2
 192 bits ecdh (nistp192)   0.0000s  20629.6
 224 bits ecdh (nistp224)   0.0001s  17249.8
 256 bits ecdh (nistp256)   0.0001s  16319.2
 384 bits ecdh (nistp384)   0.0001s   9526.4
 521 bits ecdh (nistp521)   0.0001s   7099.3
 163 bits ecdh (nistk163)   0.0001s  18058.5
 233 bits ecdh (nistk233)   0.0001s  13498.7
 283 bits ecdh (nistk283)   0.0002s   6441.5
 409 bits ecdh (nistk409)   0.0002s   4267.5
 571 bits ecdh (nistk571)   0.0003s   3511.7
 163 bits ecdh (nistb163)   0.0001s  18060.3
 233 bits ecdh (nistb233)   0.0001s  13492.2
 283 bits ecdh (nistb283)   0.0002s   6042.1
 409 bits ecdh (nistb409)   0.0002s   4511.2
 571 bits ecdh (nistb571)   0.0003s   3507.0
 256 bits ecdh (brainpoolP256r1)   0.0001s  16337.0
 256 bits ecdh (brainpoolP256t1)   0.0001s  16346.3
 384 bits ecdh (brainpoolP384r1)   0.0001s   9245.6
 384 bits ecdh (brainpoolP384t1)   0.0001s   9166.5
 512 bits ecdh (brainpoolP512r1)   0.0001s   7211.1
 512 bits ecdh (brainpoolP512t1)   0.0001s   7236.2
 253 bits ecdh (X25519)   0.0001s   7992.0
 448 bits ecdh (X448)   0.0016s    641.2

 
# Synchronous
# openssl speed -engine qat -elapsed ecdh
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 160 bits  ecdh's for 10s: 14411 160-bits ECDH ops in 10.01s
Doing 192 bits  ecdh's for 10s: 13037 192-bits ECDH ops in 10.00s
Doing 224 bits  ecdh's for 10s: 11008 224-bits ECDH ops in 10.00s
Doing 256 bits  ecdh's for 10s: 10276 256-bits ECDH ops in 10.01s
Doing 384 bits  ecdh's for 10s: 5958 384-bits ECDH ops in 10.00s
Doing 521 bits  ecdh's for 10s: 4639 521-bits ECDH ops in 10.00s
Doing 163 bits  ecdh's for 10s: 11409 163-bits ECDH ops in 10.00s
Doing 233 bits  ecdh's for 10s: 8442 233-bits ECDH ops in 10.00s
Doing 283 bits  ecdh's for 10s: 4126 283-bits ECDH ops in 10.00s
Doing 409 bits  ecdh's for 10s: 2749 409-bits ECDH ops in 10.00s
Doing 571 bits  ecdh's for 10s: 2312 571-bits ECDH ops in 10.00s
Doing 163 bits  ecdh's for 10s: 11020 163-bits ECDH ops in 10.00s
Doing 233 bits  ecdh's for 10s: 8207 233-bits ECDH ops in 10.00s
Doing 283 bits  ecdh's for 10s: 3906 283-bits ECDH ops in 10.00s
Doing 409 bits  ecdh's for 10s: 2923 409-bits ECDH ops in 10.00s
Doing 571 bits  ecdh's for 10s: 2302 571-bits ECDH ops in 10.00s
Doing 256 bits  ecdh's for 10s: 10479 256-bits ECDH ops in 10.00s
Doing 256 bits  ecdh's for 10s: 10427 256-bits ECDH ops in 10.00s
Doing 384 bits  ecdh's for 10s: 5776 384-bits ECDH ops in 10.00s
Doing 384 bits  ecdh's for 10s: 5759 384-bits ECDH ops in 10.00s
Doing 512 bits  ecdh's for 10s: 4729 512-bits ECDH ops in 10.00s
Doing 512 bits  ecdh's for 10s: 4570 512-bits ECDH ops in 10.00s
Doing 253 bits  ecdh's for 10s: 79924 253-bits ECDH ops in 10.00s
Doing 448 bits  ecdh's for 10s: 6417 448-bits ECDH ops in 10.00s
OpenSSL 1.1.1i  8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                              op      op/s
 160 bits ecdh (secp160r1)   0.0007s   1439.7
 192 bits ecdh (nistp192)   0.0008s   1303.7
 224 bits ecdh (nistp224)   0.0009s   1100.8
 256 bits ecdh (nistp256)   0.0010s   1026.6
 384 bits ecdh (nistp384)   0.0017s    595.8
 521 bits ecdh (nistp521)   0.0022s    463.9
 163 bits ecdh (nistk163)   0.0009s   1140.9
 233 bits ecdh (nistk233)   0.0012s    844.2
 283 bits ecdh (nistk283)   0.0024s    412.6
 409 bits ecdh (nistk409)   0.0036s    274.9
 571 bits ecdh (nistk571)   0.0043s    231.2
 163 bits ecdh (nistb163)   0.0009s   1102.0
 233 bits ecdh (nistb233)   0.0012s    820.7
 283 bits ecdh (nistb283)   0.0026s    390.6
 409 bits ecdh (nistb409)   0.0034s    292.3
 571 bits ecdh (nistb571)   0.0043s    230.2
 256 bits ecdh (brainpoolP256r1)   0.0010s   1047.9
 256 bits ecdh (brainpoolP256t1)   0.0010s   1042.7
 384 bits ecdh (brainpoolP384r1)   0.0017s    577.6
 384 bits ecdh (brainpoolP384t1)   0.0017s    575.9
 512 bits ecdh (brainpoolP512r1)   0.0021s    472.9
 512 bits ecdh (brainpoolP512t1)   0.0022s    457.0
 253 bits ecdh (X25519)   0.0001s   7992.4
 448 bits ecdh (X448)   0.0016s    641.7

 
# Software
# openssl speed -elapsed ecdh
You have chosen to measure elapsed time instead of user CPU time.
Doing 160 bits  ecdh's for 10s: 19934 160-bits ECDH ops in 10.00s
Doing 192 bits  ecdh's for 10s: 16298 192-bits ECDH ops in 10.00s
Doing 224 bits  ecdh's for 10s: 10878 224-bits ECDH ops in 10.00s
Doing 256 bits  ecdh's for 10s: 54929 256-bits ECDH ops in 10.00s
Doing 384 bits  ecdh's for 10s: 3968 384-bits ECDH ops in 10.00s
Doing 521 bits  ecdh's for 10s: 1634 521-bits ECDH ops in 10.00s
Doing 163 bits  ecdh's for 10s: 16957 163-bits ECDH ops in 10.00s
Doing 233 bits  ecdh's for 10s: 12276 233-bits ECDH ops in 10.01s
Doing 283 bits  ecdh's for 10s: 7125 283-bits ECDH ops in 10.00s
Doing 409 bits  ecdh's for 10s: 4239 409-bits ECDH ops in 10.00s
Doing 571 bits  ecdh's for 10s: 1940 571-bits ECDH ops in 10.00s
Doing 163 bits  ecdh's for 10s: 16276 163-bits ECDH ops in 10.00s
Doing 233 bits  ecdh's for 10s: 11936 233-bits ECDH ops in 10.00s
Doing 283 bits  ecdh's for 10s: 6797 283-bits ECDH ops in 10.00s
Doing 409 bits  ecdh's for 10s: 4020 409-bits ECDH ops in 10.00s
Doing 571 bits  ecdh's for 10s: 1809 571-bits ECDH ops in 10.00s
Doing 256 bits  ecdh's for 10s: 9784 256-bits ECDH ops in 10.00s
Doing 256 bits  ecdh's for 10s: 9779 256-bits ECDH ops in 10.00s
Doing 384 bits  ecdh's for 10s: 3976 384-bits ECDH ops in 10.00s
Doing 384 bits  ecdh's for 10s: 4026 384-bits ECDH ops in 10.00s
Doing 512 bits  ecdh's for 10s: 2290 512-bits ECDH ops in 10.00s
Doing 512 bits  ecdh's for 10s: 2192 512-bits ECDH ops in 10.01s
Doing 253 bits  ecdh's for 10s: 79917 253-bits ECDH ops in 10.00s
Doing 448 bits  ecdh's for 10s: 6411 448-bits ECDH ops in 10.00s
OpenSSL 1.1.1i  8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                              op      op/s
 160 bits ecdh (secp160r1)   0.0005s   1993.4
 192 bits ecdh (nistp192)   0.0006s   1629.8
 224 bits ecdh (nistp224)   0.0009s   1087.8
 256 bits ecdh (nistp256)   0.0002s   5492.9
 384 bits ecdh (nistp384)   0.0025s    396.8
 521 bits ecdh (nistp521)   0.0061s    163.4
 163 bits ecdh (nistk163)   0.0006s   1695.7
 233 bits ecdh (nistk233)   0.0008s   1226.4
 283 bits ecdh (nistk283)   0.0014s    712.5
 409 bits ecdh (nistk409)   0.0024s    423.9
 571 bits ecdh (nistk571)   0.0052s    194.0
 163 bits ecdh (nistb163)   0.0006s   1627.6
 233 bits ecdh (nistb233)   0.0008s   1193.6
 283 bits ecdh (nistb283)   0.0015s    679.7
 409 bits ecdh (nistb409)   0.0025s    402.0
 571 bits ecdh (nistb571)   0.0055s    180.9
 256 bits ecdh (brainpoolP256r1)   0.0010s    978.4
 256 bits ecdh (brainpoolP256t1)   0.0010s    977.9
 384 bits ecdh (brainpoolP384r1)   0.0025s    397.6
 384 bits ecdh (brainpoolP384t1)   0.0025s    402.6
 512 bits ecdh (brainpoolP512r1)   0.0044s    229.0
 512 bits ecdh (brainpoolP512t1)   0.0046s    219.0
 253 bits ecdh (X25519)   0.0001s   7991.7
 448 bits ecdh (X448)   0.0016s    641.1

I've copied this from the other thread. This benchmark was run on the less capable C2758.

It's a crypto-accelerator AES vs AES-NI benchmark.

See how anything less than an 8K buffer is faster using the AES-NI version than using the crypto hardware. Only the really large buffers benefit from the crypto hardware.

root@OpenWrt:~# openssl speed -elapsed -engine qat -async_jobs 32 -multi 2 -evp aes-128-cbc-hmac-sha1
 90743.27k   199864.47k   298705.41k   353766.06k   524913.32k   613946.71k 
 
root@OpenWrt:~# openssl speed -elapsed -async_jobs 32 -multi 2 -evp aes-128-cbc-hmac-sha1
153131.80k   257124.37k   329312.17k   364508.16k   376572.59k   377416.36k
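The crossover point can be read off mechanically. The sketch below is only an illustration: it assumes the six columns are the usual openssl speed buffer sizes (16, 64, 256, 1024, 8192 and 16384 bytes), which the pasted output does not show, and the function name first_engine_win is made up.

```shell
# Print the first buffer size at which the engine run beats the CPU run.
# $1: buffer sizes, $2: engine throughputs, $3: CPU throughputs
# (each a space-separated list; a trailing "k" on the figures is ignored)
first_engine_win() {
    awk -v s="$1" -v e="$2" -v c="$3" 'BEGIN {
        n = split(s, size, " "); split(e, eng, " "); split(c, cpu, " ")
        for (i = 1; i <= n; i++) {
            # adding 0 makes awk parse the leading number and drop the "k"
            if (eng[i] + 0 > cpu[i] + 0) { print size[i]; exit }
        }
        print "none"
    }'
}

first_engine_win "16 64 256 1024 8192 16384" \
    "90743.27k 199864.47k 298705.41k 353766.06k 524913.32k 613946.71k" \
    "153131.80k 257124.37k 329312.17k 364508.16k 376572.59k 377416.36k"
# prints 8192
```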

Takeaway from all this: don't bother with crypto hardware for OpenWrt. The only thing that will make a real difference is AES-NI.

AES 128 vs AES 256:

root@OpenWrt:~# openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 18805661 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 14574331 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 7468735 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2609682 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 368604 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 186014 aes-128-cbc's in 3.00s
OpenSSL 1.1.1j  16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc     100296.86k   310919.06k   637332.05k   890771.46k  1006534.66k  1015884.46k
root@OpenWrt:~# openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 17664041 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 12219316 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 5379637 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1696935 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 229447 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 115396 aes-256-cbc's in 3.00s
OpenSSL 1.1.1j  16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      94208.22k   260678.74k   459062.36k   579220.48k   626543.27k   630216.02k
root@OpenWrt:~#
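In case it helps with reading these tables: the summary row can be recomputed from the "Doing ..." lines as blocks times block size divided by elapsed seconds, expressed in 1000s of bytes per second. A minimal sketch, with the helper name throughput_k made up for illustration:

```shell
# Recompute an openssl speed summary figure from one "Doing ..." line.
# $1: number of operations, $2: block size in bytes, $3: elapsed seconds
throughput_k() {
    awk -v n="$1" -v size="$2" -v secs="$3" \
        'BEGIN { printf "%.2fk\n", n * size / secs / 1000 }'
}

throughput_k 186014 16384 3.00   # aes-128-cbc, 16384-byte blocks -> 1015884.46k
```

Multiplying the result by 8 and dividing by 1000 gives Mbit/s, so that aes-128-cbc figure is roughly 8100 Mbit/s.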

w/ and w/o EVP:

root@OpenWrt:~# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 17608820 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 64 size blocks: 12195151 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 256 size blocks: 5379602 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1696386 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 229437 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 16384 size blocks: 115389 aes-256-cbc's in 3.00s
OpenSSL 1.1.1j  16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      94227.80k   261033.33k   459059.37k   579033.09k   628611.34k   630177.79k
root@OpenWrt:~# openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 4905394 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 1306013 aes-256 cbc's in 2.99s
Doing aes-256 cbc for 3s on 256 size blocks: 333727 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 83839 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 10493 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16384 size blocks: 5243 aes-256 cbc's in 2.99s
OpenSSL 1.1.1j  16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256 cbc      26162.10k    27954.79k    28478.04k    28617.05k    28652.89k    28729.54k
root@OpenWrt:~#

So it seems that the EVP ones are optimised: https://security.stackexchange.com/questions/35036/different-performance-of-openssl-speed-on-the-same-hardware-with-aes-256-evp-an

manually built openssl vs opkg openssl-util

root@OpenWrt:~# ./openssl-static speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 21175430 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 64 size blocks: 13804150 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 5672902 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1724868 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 8192 size blocks: 230127 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 115459 aes-256-cbc's in 3.00s
OpenSSL 1.1.1j  16 Feb 2021
built on: Thu Mar 18 11:09:33 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) idea(int) blowfish(ptr)
compiler: gcc -pthread -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc     113313.34k   294488.53k   484087.64k   590724.02k   628400.13k   630560.09k
root@OpenWrt:~# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 17430854 aes-256-cbc's in 2.94s
Doing aes-256-cbc for 3s on 64 size blocks: 12043668 aes-256-cbc's in 2.95s
Doing aes-256-cbc for 3s on 256 size blocks: 5256099 aes-256-cbc's in 2.91s
Doing aes-256-cbc for 3s on 1024 size blocks: 1670277 aes-256-cbc's in 2.96s
Doing aes-256-cbc for 3s on 8192 size blocks: 224852 aes-256-cbc's in 2.92s
Doing aes-256-cbc for 3s on 16384 size blocks: 112725 aes-256-cbc's in 2.94s
OpenSSL 1.1.1j  16 Feb 2021
built on: Tue Mar 16 11:27:55 2021 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=cortex-a53 -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_SMALL_FOOTPRINT
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      94861.79k   261286.36k   462392.21k   577825.56k   630817.67k   628192.65k
root@OpenWrt:~#


Sure - non evp does not use AES-NI. It's a pure software implementation

Aha, that makes more sense now. I scrolled up and just noticed that your very first benchmarks were without the evp parameter. My mistake, I should have noticed earlier and asked you to correct it, when I said that the latter benchmark did not make sense to me.

These look like AES-NI optimized numbers!

Thank you so much for all your efforts in running and sharing these benchmarks! Very helpful in understanding what is going on.

Can you also share the full output of 'cat /proc/crypto'? I am still a bit mystified as to why it showed a priority of 100 on aes in that single list you posted. As per the OpenWrt documentation, we should have seen at least one entry for aes with a priority above 100 due to AES-NI support.

I think your data is indeed quite compelling. Thank you again, for all your hard work in digging up the benchmarks and sharing them here.

In addition, the data posted by @jiegec for a router with no crypto engine but with AES support fits very nicely with your point. I think perhaps the only piece missing would be for somebody to post similar data for a Linksys WRT family router (WRT1200, WRT1900, WRT3200, WRT32X), which are Marvell Armada based routers with a crypto engine but with no AES support. It would be very interesting to see their numbers.
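A quick way to tell which camp a router falls into is to look for the aes flag in /proc/cpuinfo. The sketch below assumes the flag list is on a "Features" line (ARM) or a "flags" line (x86); the function name cpu_has_aes is made up for illustration.

```shell
# Succeed if the CPU feature list contains the aes flag.
# $1: file to read (optional, defaults to /proc/cpuinfo)
cpu_has_aes() {
    grep -E '^(Features|flags)[[:space:]]*:' "${1:-/proc/cpuinfo}" | grep -qw aes
}

# Usage on the router:
#   if cpu_has_aes; then echo "CPU has AES instructions"; else echo "no AES instructions"; fi
```

This only reports what the CPU advertises. Whether openssl or the kernel actually uses the instructions still depends on how they were built.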

If anybody reading this post on these forums has any of these Linksys routers with Marvell Armada CPU, and can run these two benchmarks of "openssl speed -elapsed -evp aes-256-cbc" and "openssl speed -elapsed -evp aes-128-cbc" as well as the output of their "cat /proc/crypto", I would appreciate it. Thanks.