I've done some benchmarks for you with RSA and ECDH. While it's not AES, it does illustrate what I am talking about, which is to say, the vast gulf between synchronous and asynchronous operation of the crypto hardware.
And the killer is that the application has to be specifically coded to take advantage of the asynchronous mode. It's not transparent and not all workloads map well to it.
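To make the "specifically coded" point concrete, here is a toy Python model. It is not the QAT engine or OpenSSL's async API: the `offload` coroutine and its 10 ms latency are made-up stand-ins for a round trip to an accelerator. The first caller waits for each result before submitting the next request; the second submits everything and then collects the results, which is the restructuring an application has to do itself.

```python
import asyncio
import time

LATENCY_S = 0.01  # hypothetical round-trip time to the device, not a measured value


async def offload(job):
    """Stand-in for one request to an accelerator: all of the cost is waiting."""
    await asyncio.sleep(LATENCY_S)
    return job * 2


async def one_at_a_time(jobs):
    # Synchronous style: the next request is not submitted until the previous one returns.
    return [await offload(j) for j in range(jobs)]


async def all_in_flight(jobs):
    # Asynchronous style: submit every request first, then wait for all of them.
    return list(await asyncio.gather(*(offload(j) for j in range(jobs))))


def timed(coro_fn, jobs):
    """Run one of the callers and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(jobs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for fn in (one_at_a_time, all_in_flight):
        _, elapsed = timed(fn, 20)
        print(f"{fn.__name__}: {elapsed:.3f}s for 20 jobs")
```

Both callers return the same results; only the second overlaps the waiting, and only because it was written to do so.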
Note also how on ECDH the pure software implementation is faster than the synchronous one at the smaller key sizes. On the binary curves, synchronous mode only catches up with software at 571 bits; on the prime curves it is about level by 224 bits, although software nistp256 is still far ahead. The asynchronous version, on the other hand, is fully 18x faster than software at 571 bits and 11x faster at 160 bits.
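The 11x and 18x figures can be checked against the op/s column of the three ECDH runs. A small sketch, with the numbers copied by hand from the tables further down; only a few curves are included, and the dictionary layout and function name are my own:

```python
# op/s from the `openssl speed ... ecdh` summaries: (async, sync, software)
ECDH_OPS = {
    "secp160r1": (23322.2, 1439.7, 1993.4),
    "nistp224": (17249.8, 1100.8, 1087.8),
    "nistp256": (16319.2, 1026.6, 5492.9),
    "nistk163": (18058.5, 1140.9, 1695.7),
    "nistk409": (4267.5, 274.9, 423.9),
    "nistk571": (3511.7, 231.2, 194.0),
}


def speedups(curve):
    """Return (async / software, sync / software) throughput ratios for a curve."""
    async_ops, sync_ops, software_ops = ECDH_OPS[curve]
    return async_ops / software_ops, sync_ops / software_ops


if __name__ == "__main__":
    for curve in ECDH_OPS:
        async_x, sync_x = speedups(curve)
        print(f"{curve:10s} async/software {async_x:5.1f}x  sync/software {sync_x:4.2f}x")
```

A sync/software ratio below 1 means software wins; for the binary curves listed here it only rises above 1 at nistk571.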
With RSA, the sign operations are the most expensive and even synchronous mode beats software, although for verify operations, software beats synchronous mode.
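The same comparison for RSA 2048, as a sketch using the sign/s and verify/s figures from the three runs below (the names are mine, not anything from OpenSSL):

```python
# (sign/s, verify/s) from the three `openssl speed ... rsa2048` summaries
RSA2048 = {
    "async": (9058.7, 54292.2),
    "sync": (1206.0, 7009.2),
    "software": (371.9, 12874.0),
}


def relative_to_software(mode):
    """Return (sign ratio, verify ratio) of the given mode against pure software."""
    sign, verify = RSA2048[mode]
    sw_sign, sw_verify = RSA2048["software"]
    return sign / sw_sign, verify / sw_verify


if __name__ == "__main__":
    for mode in ("async", "sync"):
        sign_x, verify_x = relative_to_software(mode)
        print(f"{mode:5s} sign {sign_x:5.2f}x  verify {verify_x:5.2f}x")
```

Synchronous mode comes out at a bit over 3x software for signing but only about half of software's rate for verification.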
The poorer performance of synchronous mode has everything to do with the comparative inefficiency of shunting data over a bus and using off-die contiguous main memory. And this is the only mode available to applications that have not been explicitly recoded to take advantage of asynchronous mode.
So if you have a high-traffic web server that uses OpenSSL, you will benefit greatly from using the accelerator. For a single OpenVPN tunnel? No, your performance will be worse.
For reference, this is the board that the benchmarks were run on. It's an Intel C3758, 8-core x86_64.
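One way to read the synchronous numbers is as a latency bound: with a single request in flight, throughput cannot exceed 1 / (time per operation). The sketch below applies that, plus Little's law (requests in flight = throughput x latency), to the RSA 2048 sign figures from the runs below. Treating the synchronous per-operation time as the round-trip latency under load is my assumption, so the in-flight figure is only a rough estimate.

```python
SYNC_SIGN_TIME_S = 0.000829  # seconds per RSA 2048 sign, synchronous run
SYNC_SIGN_RATE = 1206.0      # sign/s, synchronous run
ASYNC_SIGN_RATE = 9058.7     # sign/s, asynchronous run (-async_jobs 72)


def single_request_ceiling(latency_s):
    """Best possible ops/s when only one request is ever outstanding."""
    return 1.0 / latency_s


def implied_in_flight(throughput, latency_s):
    """Little's law: average number of requests overlapped at this latency."""
    return throughput * latency_s


if __name__ == "__main__":
    ceiling = single_request_ceiling(SYNC_SIGN_TIME_S)
    overlap = implied_in_flight(ASYNC_SIGN_RATE, SYNC_SIGN_TIME_S)
    print(f"one-at-a-time ceiling: {ceiling:.0f} sign/s (measured {SYNC_SIGN_RATE})")
    print(f"async rate corresponds to about {overlap:.1f} requests overlapped")
```

The ceiling lands on the measured synchronous rate, which is what you would expect if the synchronous caller spends essentially all of its time waiting on the round trip. The overlap estimate is well short of the 72 jobs requested; the benchmark output does not say whether that is the device saturating or latency growing under load.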
# RSA 2K
# Asynchronous
# openssl speed -engine qat -elapsed -async_jobs 72 rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 90678 2048 bits private RSA's in 10.01s
Doing 2048 bits public rsa's for 10s: 542922 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1i 8 Dec 2020
built on: Sat Jan 30 15:32:43 2021 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fpic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -fpic -specs=/opt/openwrt/x86/master/openwrt/include/hardened-ld-pie.specs -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000110s 0.000018s   9058.7  54292.2
# Synchronous
# openssl speed -engine qat -elapsed rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 12060 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 70092 2048 bits public RSA's in 10.00s
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000829s 0.000143s   1206.0   7009.2
# Software
# openssl speed -elapsed rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 3719 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 128740 2048 bits public RSA's in 10.00s
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.002689s 0.000078s    371.9  12874.0
# ECDH Compute Key
# Asynchronous
# openssl speed -engine qat -elapsed -async_jobs 36 ecdh
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 160 bits ecdh's for 10s: 233455 160-bits ECDH ops in 10.01s
Doing 192 bits ecdh's for 10s: 206296 192-bits ECDH ops in 10.00s
Doing 224 bits ecdh's for 10s: 172498 224-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 163355 256-bits ECDH ops in 10.01s
Doing 384 bits ecdh's for 10s: 95264 384-bits ECDH ops in 10.00s
Doing 521 bits ecdh's for 10s: 70993 521-bits ECDH ops in 10.00s
Doing 163 bits ecdh's for 10s: 180585 163-bits ECDH ops in 10.00s
Doing 233 bits ecdh's for 10s: 134987 233-bits ECDH ops in 10.00s
Doing 283 bits ecdh's for 10s: 64415 283-bits ECDH ops in 10.00s
Doing 409 bits ecdh's for 10s: 42718 409-bits ECDH ops in 10.01s
Doing 571 bits ecdh's for 10s: 35187 571-bits ECDH ops in 10.02s
Doing 163 bits ecdh's for 10s: 180784 163-bits ECDH ops in 10.01s
Doing 233 bits ecdh's for 10s: 134922 233-bits ECDH ops in 10.00s
Doing 283 bits ecdh's for 10s: 60481 283-bits ECDH ops in 10.01s
Doing 409 bits ecdh's for 10s: 45157 409-bits ECDH ops in 10.01s
Doing 571 bits ecdh's for 10s: 35105 571-bits ECDH ops in 10.01s
Doing 256 bits ecdh's for 10s: 163370 256-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 163626 256-bits ECDH ops in 10.01s
Doing 384 bits ecdh's for 10s: 92456 384-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 91757 384-bits ECDH ops in 10.01s
Doing 512 bits ecdh's for 10s: 72111 512-bits ECDH ops in 10.00s
Doing 512 bits ecdh's for 10s: 72434 512-bits ECDH ops in 10.01s
Doing 253 bits ecdh's for 10s: 79920 253-bits ECDH ops in 10.00s
Doing 448 bits ecdh's for 10s: 6412 448-bits ECDH ops in 10.00s
op op/s
160 bits ecdh (secp160r1) 0.0000s 23322.2
192 bits ecdh (nistp192) 0.0000s 20629.6
224 bits ecdh (nistp224) 0.0001s 17249.8
256 bits ecdh (nistp256) 0.0001s 16319.2
384 bits ecdh (nistp384) 0.0001s 9526.4
521 bits ecdh (nistp521) 0.0001s 7099.3
163 bits ecdh (nistk163) 0.0001s 18058.5
233 bits ecdh (nistk233) 0.0001s 13498.7
283 bits ecdh (nistk283) 0.0002s 6441.5
409 bits ecdh (nistk409) 0.0002s 4267.5
571 bits ecdh (nistk571) 0.0003s 3511.7
163 bits ecdh (nistb163) 0.0001s 18060.3
233 bits ecdh (nistb233) 0.0001s 13492.2
283 bits ecdh (nistb283) 0.0002s 6042.1
409 bits ecdh (nistb409) 0.0002s 4511.2
571 bits ecdh (nistb571) 0.0003s 3507.0
256 bits ecdh (brainpoolP256r1) 0.0001s 16337.0
256 bits ecdh (brainpoolP256t1) 0.0001s 16346.3
384 bits ecdh (brainpoolP384r1) 0.0001s 9245.6
384 bits ecdh (brainpoolP384t1) 0.0001s 9166.5
512 bits ecdh (brainpoolP512r1) 0.0001s 7211.1
512 bits ecdh (brainpoolP512t1) 0.0001s 7236.2
253 bits ecdh (X25519) 0.0001s 7992.0
448 bits ecdh (X448) 0.0016s 641.2
# Synchronous
# openssl speed -engine qat -elapsed ecdh
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 160 bits ecdh's for 10s: 14411 160-bits ECDH ops in 10.01s
Doing 192 bits ecdh's for 10s: 13037 192-bits ECDH ops in 10.00s
Doing 224 bits ecdh's for 10s: 11008 224-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 10276 256-bits ECDH ops in 10.01s
Doing 384 bits ecdh's for 10s: 5958 384-bits ECDH ops in 10.00s
Doing 521 bits ecdh's for 10s: 4639 521-bits ECDH ops in 10.00s
Doing 163 bits ecdh's for 10s: 11409 163-bits ECDH ops in 10.00s
Doing 233 bits ecdh's for 10s: 8442 233-bits ECDH ops in 10.00s
Doing 283 bits ecdh's for 10s: 4126 283-bits ECDH ops in 10.00s
Doing 409 bits ecdh's for 10s: 2749 409-bits ECDH ops in 10.00s
Doing 571 bits ecdh's for 10s: 2312 571-bits ECDH ops in 10.00s
Doing 163 bits ecdh's for 10s: 11020 163-bits ECDH ops in 10.00s
Doing 233 bits ecdh's for 10s: 8207 233-bits ECDH ops in 10.00s
Doing 283 bits ecdh's for 10s: 3906 283-bits ECDH ops in 10.00s
Doing 409 bits ecdh's for 10s: 2923 409-bits ECDH ops in 10.00s
Doing 571 bits ecdh's for 10s: 2302 571-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 10479 256-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 10427 256-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 5776 384-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 5759 384-bits ECDH ops in 10.00s
Doing 512 bits ecdh's for 10s: 4729 512-bits ECDH ops in 10.00s
Doing 512 bits ecdh's for 10s: 4570 512-bits ECDH ops in 10.00s
Doing 253 bits ecdh's for 10s: 79924 253-bits ECDH ops in 10.00s
Doing 448 bits ecdh's for 10s: 6417 448-bits ECDH ops in 10.00s
op op/s
160 bits ecdh (secp160r1) 0.0007s 1439.7
192 bits ecdh (nistp192) 0.0008s 1303.7
224 bits ecdh (nistp224) 0.0009s 1100.8
256 bits ecdh (nistp256) 0.0010s 1026.6
384 bits ecdh (nistp384) 0.0017s 595.8
521 bits ecdh (nistp521) 0.0022s 463.9
163 bits ecdh (nistk163) 0.0009s 1140.9
233 bits ecdh (nistk233) 0.0012s 844.2
283 bits ecdh (nistk283) 0.0024s 412.6
409 bits ecdh (nistk409) 0.0036s 274.9
571 bits ecdh (nistk571) 0.0043s 231.2
163 bits ecdh (nistb163) 0.0009s 1102.0
233 bits ecdh (nistb233) 0.0012s 820.7
283 bits ecdh (nistb283) 0.0026s 390.6
409 bits ecdh (nistb409) 0.0034s 292.3
571 bits ecdh (nistb571) 0.0043s 230.2
256 bits ecdh (brainpoolP256r1) 0.0010s 1047.9
256 bits ecdh (brainpoolP256t1) 0.0010s 1042.7
384 bits ecdh (brainpoolP384r1) 0.0017s 577.6
384 bits ecdh (brainpoolP384t1) 0.0017s 575.9
512 bits ecdh (brainpoolP512r1) 0.0021s 472.9
512 bits ecdh (brainpoolP512t1) 0.0022s 457.0
253 bits ecdh (X25519) 0.0001s 7992.4
448 bits ecdh (X448) 0.0016s 641.7
# Software
# openssl speed -elapsed ecdh
You have chosen to measure elapsed time instead of user CPU time.
Doing 160 bits ecdh's for 10s: 19934 160-bits ECDH ops in 10.00s
Doing 192 bits ecdh's for 10s: 16298 192-bits ECDH ops in 10.00s
Doing 224 bits ecdh's for 10s: 10878 224-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 54929 256-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 3968 384-bits ECDH ops in 10.00s
Doing 521 bits ecdh's for 10s: 1634 521-bits ECDH ops in 10.00s
Doing 163 bits ecdh's for 10s: 16957 163-bits ECDH ops in 10.00s
Doing 233 bits ecdh's for 10s: 12276 233-bits ECDH ops in 10.01s
Doing 283 bits ecdh's for 10s: 7125 283-bits ECDH ops in 10.00s
Doing 409 bits ecdh's for 10s: 4239 409-bits ECDH ops in 10.00s
Doing 571 bits ecdh's for 10s: 1940 571-bits ECDH ops in 10.00s
Doing 163 bits ecdh's for 10s: 16276 163-bits ECDH ops in 10.00s
Doing 233 bits ecdh's for 10s: 11936 233-bits ECDH ops in 10.00s
Doing 283 bits ecdh's for 10s: 6797 283-bits ECDH ops in 10.00s
Doing 409 bits ecdh's for 10s: 4020 409-bits ECDH ops in 10.00s
Doing 571 bits ecdh's for 10s: 1809 571-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 9784 256-bits ECDH ops in 10.00s
Doing 256 bits ecdh's for 10s: 9779 256-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 3976 384-bits ECDH ops in 10.00s
Doing 384 bits ecdh's for 10s: 4026 384-bits ECDH ops in 10.00s
Doing 512 bits ecdh's for 10s: 2290 512-bits ECDH ops in 10.00s
Doing 512 bits ecdh's for 10s: 2192 512-bits ECDH ops in 10.01s
Doing 253 bits ecdh's for 10s: 79917 253-bits ECDH ops in 10.00s
Doing 448 bits ecdh's for 10s: 6411 448-bits ECDH ops in 10.00s
op op/s
160 bits ecdh (secp160r1) 0.0005s 1993.4
192 bits ecdh (nistp192) 0.0006s 1629.8
224 bits ecdh (nistp224) 0.0009s 1087.8
256 bits ecdh (nistp256) 0.0002s 5492.9
384 bits ecdh (nistp384) 0.0025s 396.8
521 bits ecdh (nistp521) 0.0061s 163.4
163 bits ecdh (nistk163) 0.0006s 1695.7
233 bits ecdh (nistk233) 0.0008s 1226.4
283 bits ecdh (nistk283) 0.0014s 712.5
409 bits ecdh (nistk409) 0.0024s 423.9
571 bits ecdh (nistk571) 0.0052s 194.0
163 bits ecdh (nistb163) 0.0006s 1627.6
233 bits ecdh (nistb233) 0.0008s 1193.6
283 bits ecdh (nistb283) 0.0015s 679.7
409 bits ecdh (nistb409) 0.0025s 402.0
571 bits ecdh (nistb571) 0.0055s 180.9
256 bits ecdh (brainpoolP256r1) 0.0010s 978.4
256 bits ecdh (brainpoolP256t1) 0.0010s 977.9
384 bits ecdh (brainpoolP384r1) 0.0025s 397.6
384 bits ecdh (brainpoolP384t1) 0.0025s 402.6
512 bits ecdh (brainpoolP512r1) 0.0044s 229.0
512 bits ecdh (brainpoolP512t1) 0.0046s 219.0
253 bits ecdh (X25519) 0.0001s 7991.7
448 bits ecdh (X448) 0.0016s 641.1
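If you want to repeat the comparison on your own hardware, the summary rows are easy to pull apart with a regular expression. This is a rough sketch that assumes the row layout shown above; the sample strings are a few rows copied from the synchronous and software runs, and the function names are my own.

```python
import re

# Matches summary rows such as "160 bits ecdh (secp160r1) 0.0007s 1439.7".
# The "Doing ... ecdh's for 10s" progress lines do not match and are skipped.
ECDH_LINE = re.compile(r"^\s*(\d+) bits ecdh \(([^)]+)\)\s+([\d.]+)s\s+([\d.]+)\s*$")


def parse_ecdh_summary(text):
    """Return {curve_name: ops_per_second} from `openssl speed ecdh` summary rows."""
    rates = {}
    for line in text.splitlines():
        match = ECDH_LINE.match(line)
        if match:
            rates[match.group(2)] = float(match.group(4))
    return rates


def ratio_table(accelerated, baseline):
    """Throughput ratio per curve for curves present in both runs."""
    return {c: accelerated[c] / baseline[c] for c in accelerated if c in baseline}


SYNC_SAMPLE = """\
op op/s
160 bits ecdh (secp160r1) 0.0007s 1439.7
571 bits ecdh (nistk571) 0.0043s 231.2
253 bits ecdh (X25519) 0.0001s 7992.4
"""

SOFTWARE_SAMPLE = """\
op op/s
160 bits ecdh (secp160r1) 0.0005s 1993.4
571 bits ecdh (nistk571) 0.0052s 194.0
253 bits ecdh (X25519) 0.0001s 7991.7
"""

if __name__ == "__main__":
    table = ratio_table(parse_ecdh_summary(SYNC_SAMPLE), parse_ecdh_summary(SOFTWARE_SAMPLE))
    for curve, ratio in table.items():
        print(f"{curve:12s} sync/software {ratio:.2f}x")
```

The X25519 row comes out at essentially 1.00x, and the X25519 and X448 figures are the same in all three runs above, which suggests those two are not being offloaded in this setup.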