For those of you using boards based on Intel's Rangeley C2xxx SoCs, I've posted a feed with the Intel QuickAssist (QAT) kernel drivers, firmware and OpenSSL acceleration engine on my GitHub.
Quickassist
OpenWrt 19.07.2 packages for Intel QuickAssist on C2xxx SoCs
Functionality
- This repository contains the Intel QAT v1.5 kernel drivers and firmware for Rangeley SoCs (C2358, C2558, C2758), plus the Intel OpenSSL QAT engine, which enables hardware-accelerated OpenSSL
- The drivers are patched with an Intel-supplied unpublished patch to allow compilation against 4.x kernels
- The usdm_drv contiguous pinned memory driver is backported from QAT 1.7 (the version of QAT for Denverton SoCs). This provides a production-quality contiguous pinned memory driver in place of the example driver shipped with the QAT engine
- Configuration files and services to load and configure the drivers
Requirements
- OpenWrt 19.07 SNAPSHOT (OpenSSL 1.1.1e is required due to broken engine behaviour in 1.1.1d)
- Kernel configuration symbols: CONFIG_HUGETLBFS=y and CONFIG_HUGETLB_PAGE=y
- glibc
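The kernel symbol requirement can be checked mechanically before building. The following is only a minimal sketch: it assumes you already have the kernel configuration as text (for example the .config from your build tree), and the helper name `missing_symbols` is made up for illustration.

```python
import re

# Symbols listed under Requirements; both must be set to "y".
REQUIRED = ("CONFIG_HUGETLBFS", "CONFIG_HUGETLB_PAGE")


def missing_symbols(config_text, required=REQUIRED):
    """Return the required symbols that are not set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=y$", line.strip())
        if match:
            enabled.add(match.group(1))
    return [symbol for symbol in required if symbol not in enabled]


# Hypothetical config fragment, used only to exercise the helper.
sample = """\
CONFIG_HUGETLBFS=y
# CONFIG_HUGETLB_PAGE is not set
"""
print(missing_symbols(sample))
```

An empty list means both symbols are enabled. Anything else names the symbols that still need to be turned on in the kernel configuration.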
Notes
- Tested on a Supermicro A1SRi C2558F motherboard
- This feed supports QAT-accelerated OpenVPN if OpenVPN is compiled against OpenSSL (invoke it with --engine qat)
- Please refer to https://github.com/intel/QAT_Engine for information on how to use and test the engine
I have disabled symmetric cipher acceleration in the OpenSSL QAT engine. The QAT engine does not offload buffers smaller than 2KB to the hardware anyway, because the time taken to transfer data over the PCI bus makes offloading less efficient than using AES-NI.
In practice, QAT offloading is slower than AES-NI for any buffer smaller than 8KB. In addition, the software implementation slows down considerably when the engine is loaded but not offloading, because of engine-related polling: software-based encryption/decryption of small buffers takes a performance hit of up to 40%.
Since most use cases will be network acceleration such as OpenVPN, which uses small buffers, it is better to disable symmetric cipher acceleration completely and keep the higher-performance AES-NI path. Enabling symmetric ciphers in the OpenSSL QAT engine only makes sense when encrypting large blocks, for example some kind of file encryption. You can always turn it back on by removing the relevant switch in the Makefile.
Note that this is for Rangeley SoCs only. C3000 (Denverton) SoCs have an upstream driver. You could probably use this package as a base to create a similar level of functionality for Denverton boards: turn on the upstream driver in the kernel config and build only the usdm driver from the QAT 1.7 package. The configuration options for the QAT engine would also need to be modified.
See below for AES performance when using QAT offloading on small buffers. This explains why it has been disabled:
root@OpenWrt:~# openssl speed -elapsed -engine qat -async_jobs 32 -multi 2 -evp aes-128-cbc-hmac-sha1
90743.27k 199864.47k 298705.41k 353766.06k 524913.32k 613946.71k
And with symmetric ciphers disabled (i.e. using only the AES-NI instructions, with no QAT offloading). It's faster on anything smaller than an 8KB block:
root@OpenWrt:~# openssl speed -elapsed -async_jobs 32 -multi 2 -evp aes-128-cbc-hmac-sha1
153131.80k 257124.37k 329312.17k 364508.16k 376572.59k 377416.36k
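To make the comparison between the two result rows explicit, here is a small sketch that computes the per-column difference. The throughput figures are copied from the two runs above. The block sizes are an assumption: the column headers were not captured, so the sketch assumes the default `openssl speed` sizes of 16, 64, 256, 1024, 8192 and 16384 bytes.

```python
# Assumed column block sizes in bytes (headers were not captured in the post).
BLOCK_SIZES = (16, 64, 256, 1024, 8192, 16384)

# Throughput in thousands of bytes per second, copied from the runs above.
qat = (90743.27, 199864.47, 298705.41, 353766.06, 524913.32, 613946.71)
aesni = (153131.80, 257124.37, 329312.17, 364508.16, 376572.59, 377416.36)


def relative_change(engine, software):
    """Percent change of engine throughput relative to software, per column."""
    return [round((e - s) / s * 100, 1) for e, s in zip(engine, software)]


changes = relative_change(qat, aesni)
for size, pct in zip(BLOCK_SIZES, changes):
    print(f"{size:>6} bytes: {pct:+.1f}%")
```

Under that assumption the first four columns come out negative (QAT slower, by roughly 40% at the smallest size) and only the last two come out positive, which matches the 8KB crossover described above.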
Here's the RSA-2048 performance with the QAT engine:
root@OpenWrt:~# openssl speed -engine qat -elapsed rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 9120 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 100570 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1e 17 Mar 2020
built on: Wed Mar 25 21:03:53 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
sign verify sign/s verify/s
rsa 2048 bits 0.001096s 0.000099s 912.0 10057.0
Software only performance:
root@OpenWrt:~# openssl speed -elapsed rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 2067 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 70977 2048 bits public RSA's in 10.00s
sign verify sign/s verify/s
rsa 2048 bits 0.004838s 0.000141s 206.7 7097.7
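As a quick worked comparison of the single-process runs, the sketch below divides the engine rates by the software-only rates. The sign/s and verify/s figures are copied from the two summary rows above. The `speedup` helper is just illustrative arithmetic.

```python
def speedup(engine_ops, software_ops):
    """Ratio of engine operations per second to software operations per second."""
    return engine_ops / software_ops


# sign/s and verify/s from the summary rows of the two runs above.
qat = {"sign": 912.0, "verify": 10057.0}
software = {"sign": 206.7, "verify": 7097.7}

for op in ("sign", "verify"):
    print(f"{op}: {speedup(qat[op], software[op]):.2f}x")
```

On these numbers the engine is roughly 4.4 times faster for signing and roughly 1.4 times faster for verification, so the benefit is concentrated in the private-key operation.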
Engine performance with async jobs enabled:
root@OpenWrt:~# openssl speed -async_jobs 32 -multi 2 -engine qat -elapsed rsa2048
Forked child 0
Forked child 1
engine "qat" set.
engine "qat" set.
+DTP:2048:private:rsa:10
+DTP:2048:private:rsa:10
+R1:7053:2048:10.04
+R1:5903:2048:10.05
+DTP:2048:public:rsa:10
+DTP:2048:public:rsa:10
+R2:205988:2048:10.00
+R2:73227:2048:10.01
Got: +F2:2:2048:702.490040:20598.800000 from 0
Got: +F2:2:2048:587.363184:7315.384615 from 1
sign verify sign/s verify/s
rsa 2048 bits 0.000775s 0.000036s 1289.9 27914.2
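The summary row of the `-multi 2` run can be reproduced from the per-child lines in the transcript. The sketch below assumes that each `+R1`/`+R2` line has the form tag:count:bits:seconds, that `+R1` is the private (sign) result and `+R2` the public (verify) result, and that the final row is the sum of the per-child rates. That reading is inferred from the numbers shown here rather than taken from documentation.

```python
def ops_per_second(line):
    """Parse a '+R1:count:bits:seconds' style line into (tag, ops per second)."""
    tag, count, _bits, seconds = line.split(":")
    return tag, int(count) / float(seconds)


# Per-child result lines copied from the transcript above.
lines = [
    "+R1:7053:2048:10.04",
    "+R1:5903:2048:10.05",
    "+R2:205988:2048:10.00",
    "+R2:73227:2048:10.01",
]

# Sum the per-child rates for each operation type.
totals = {"+R1": 0.0, "+R2": 0.0}
for line in lines:
    tag, rate = ops_per_second(line)
    totals[tag] += rate

sign_per_s, verify_per_s = totals["+R1"], totals["+R2"]
print(f"sign/s   {sign_per_s:.1f}")
print(f"verify/s {verify_per_s:.1f}")
print(f"sign     {1 / sign_per_s:.6f}s")
print(f"verify   {1 / verify_per_s:.6f}s")
```

The totals round to the 1289.9 sign/s and 27914.2 verify/s shown in the summary row, and the per-operation times are their reciprocals. Against the software-only run (206.7 sign/s), that is a little over six times the signing throughput.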