Intel Quick Assist v1.5 drivers and openssl 1.1.1e acceleration engine for OpenWrt 19.07.2

For those of you using boards based on Intel's Rangeley C2xxx SoCs, I've posted a feed with the Intel Quick Assist kernel drivers, firmware and openssl acceleration engine on my GitHub.

Quickassist

OpenWrt 19.07.2 packages for Intel Quick Assist on C2xxx SoCs

Functionality

  • Kernel drivers and firmware for Intel QAT v1.5 on Rangeley SoCs (C2358, C2558, C2758), plus the Intel openssl QAT engine, which enables hardware-accelerated openssl
  • The drivers carry an unpublished Intel-supplied patch that allows compilation against 4.x kernels
  • The usdm_drv contiguous pinned memory driver is backported from QAT 1.7 (the version of QAT for Denverton SoCs). This provides a production-quality contiguous pinned memory driver in place of the example driver shipped with the QAT engine
  • Configuration files and services to load and configure the drivers

Requirements

  • OpenWrt 19.07 SNAPSHOT (openssl 1.1.1e is required due to broken engine behaviour in 1.1.1d)
  • Kernel configuration symbols: CONFIG_HUGETLBFS=y and CONFIG_HUGETLB_PAGE=y
  • glibc
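A quick way to confirm the two kernel symbols before installing the packages is to scan the running kernel's configuration. The following is only a sketch: the helper name check_hugetlb_config is made up for illustration, and the usage line assumes the kernel exposes its config at /proc/config.gz, which is not guaranteed on every build.

```shell
#!/bin/sh
# Read a kernel config from stdin and report whether the two required
# symbols are set to "y". Returns non-zero if either one is missing.
# (check_hugetlb_config is a hypothetical helper, not part of the feed.)
check_hugetlb_config() {
    config=$(cat)
    rc=0
    for sym in CONFIG_HUGETLBFS CONFIG_HUGETLB_PAGE; do
        if printf '%s\n' "$config" | grep -q "^${sym}=y\$"; then
            echo "$sym: ok"
        else
            echo "$sym: missing"
            rc=1
        fi
    done
    return $rc
}

# Usage on the target (assumes /proc/config.gz exists):
#   zcat /proc/config.gz | check_hugetlb_config
```

If /proc/config.gz is not available, the same function can be fed the .config file from the build tree instead.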

Notes

  • Tested on a Supermicro A1SRi C2558F motherboard
  • This feed supports QAT-accelerated openvpn, provided openvpn is compiled against openssl and invoked with --engine qat
  • Please refer to https://github.com/intel/QAT_Engine for information on how to use and test
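Before pointing openvpn at the engine, it can be worth checking that openssl is able to load it at all. The sketch below assumes that `openssl engine -t qat` reports "[ available ]" when the engine initialises. The helper name engine_available and the client.ovpn path are placeholders, not something shipped by the feed.

```shell
#!/bin/sh
# Succeeds if the text on stdin (the output of `openssl engine -t <id>`)
# reports the engine as available.
# (engine_available is a hypothetical helper name.)
engine_available() {
    grep -q '\[ available \]'
}

# Usage on the target; client.ovpn is a placeholder config path:
#   if openssl engine -t qat 2>/dev/null | engine_available; then
#       openvpn --config client.ovpn --engine qat
#   else
#       echo "qat engine not available, check drivers and services" >&2
#   fi
```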

I have disabled symmetric cipher acceleration in the openssl QAT engine. The QAT engine does not offload anything smaller than 2KB to the hardware anyway, because the time taken to transfer the data over the PCI bus makes it less efficient than using AES-NI.

In practice, QAT offloading is slower than AES-NI for any buffer smaller than 8KB. In addition, when the engine is loaded but not offloading, its polling slows the software implementation considerably: small-buffer software encryption/decryption takes a performance hit of up to 40%.

Most use cases for this feed will be network acceleration such as openvpn, which uses small buffers, so it is better to disable symmetric cipher acceleration completely and keep the higher-performing AES-NI path. Enabling symmetric ciphers in the openssl QAT engine only makes sense for encrypting large blocks, for example some form of file encryption. You can always turn it back on by removing the relevant switch in the Makefile.

Note that this is for Rangeley SoCs only. C3000 (Denverton) SoCs have an upstream driver. You could probably use this package as a base for a similar level of functionality on Denverton boards: turn on the upstream driver in the kernel config and build only the usdm driver from the QAT 1.7 package. The configuration options for the QAT engine would also need to be modified.

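See below for the performance of AES with QAT offloading on small buffers, which explains why it has been disabled. The six columns are throughput in thousands of bytes per second for the default openssl speed block sizes of 16, 64, 256, 1024, 8192 and 16384 bytes.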

root@OpenWrt:~# openssl speed -elapsed -engine qat -async_jobs 32 -multi 2 -evp aes-128-cbc-hmac-sha1
 90743.27k   199864.47k   298705.41k   353766.06k   524913.32k   613946.71k 

And with symmetric ciphers disabled (i.e., using the AES-NI instructions only, with no QAT offloading). It is faster for every block size below 8KB:

root@OpenWrt:~# openssl speed -elapsed -async_jobs 32 -multi 2 -evp aes-128-cbc-hmac-sha1
153131.80k   257124.37k   329312.17k   364508.16k   376572.59k   377416.36k
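To make the crossover easier to see, the two result rows can be divided column by column. This is a rough sketch: it assumes the columns are the default openssl speed block sizes (16, 64, 256, 1024, 8192, 16384 bytes), and compare_rows is a made-up helper name. It prints the block size, the QAT/AES-NI throughput ratio, and whichever of the two is faster.

```shell
#!/bin/sh
# Throughput in thousands of bytes per second, copied from the two runs above.
sizes="16 64 256 1024 8192 16384"
qat="90743.27 199864.47 298705.41 353766.06 524913.32 613946.71"
aesni="153131.80 257124.37 329312.17 364508.16 376572.59 377416.36"

# compare_rows SIZES QAT_ROW AESNI_ROW
# Prints one line per block size: "<size> <qat/aesni ratio> <faster path>".
compare_rows() {
    printf '%s\n%s\n%s\n' "$1" "$2" "$3" | awk '
        NR == 1 { n = split($0, size, " ") }
        NR == 2 { split($0, q, " ") }
        NR == 3 {
            split($0, a, " ")
            for (i = 1; i <= n; i++) {
                winner = (q[i] + 0 > a[i] + 0) ? "qat" : "aesni"
                printf "%d %.2f %s\n", size[i], q[i] / a[i], winner
            }
        }'
}

compare_rows "$sizes" "$qat" "$aesni"
```

A ratio below 1.00 means AES-NI wins for that block size. With the figures above, that is the case for the four sizes up to 1024 bytes, and QAT only pulls ahead at 8192 and 16384 bytes.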

Here's the performance on rsa2048 with the QAT engine:

root@OpenWrt:~# openssl speed -engine qat -elapsed rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 9120 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 100570 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1e  17 Mar 2020
built on: Wed Mar 25 21:03:53 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.001096s 0.000099s    912.0  10057.0

Software only performance:

root@OpenWrt:~# openssl speed -elapsed rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 2067 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 70977 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1e  17 Mar 2020
built on: Wed Mar 25 21:03:53 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.004838s 0.000141s    206.7   7097.7
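For a sense of scale, the sign/s and verify/s columns of the single-process engine run can be divided by the software-only figures. A minimal sketch, with ratio being an illustrative helper name:

```shell
#!/bin/sh
# ratio NEW OLD: print NEW/OLD rounded to two decimals.
ratio() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.2f\n", new / old }'
}

# Engine figures first, software-only figures second (from the runs above).
echo "sign speedup:   $(ratio 912.0 206.7)x"
echo "verify speedup: $(ratio 10057.0 7097.7)x"
```

On these numbers the engine signs roughly 4.4 times faster than software and verifies about 1.4 times faster, so the gain is concentrated in the expensive private-key operation.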

Engine performance with async jobs enabled (two processes, 32 async jobs each):

root@OpenWrt:~# openssl speed -async_jobs 32 -multi 2 -engine qat -elapsed rsa2048
Forked child 0
Forked child 1
engine "qat" set.
engine "qat" set.
+DTP:2048:private:rsa:10
+DTP:2048:private:rsa:10
+R1:7053:2048:10.04
+R1:5903:2048:10.05
+DTP:2048:public:rsa:10
+DTP:2048:public:rsa:10
+R2:205988:2048:10.00
+R2:73227:2048:10.01
Got: +F2:2:2048:702.490040:20598.800000 from 0
Got: +F2:2:2048:587.363184:7315.384615 from 1
OpenSSL 1.1.1e  17 Mar 2020
built on: Wed Mar 25 21:03:53 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000775s 0.000036s   1289.9  27914.2
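With -multi, the final summary row is the sum of what each forked child reported in its "+F2" line. The sketch below reproduces that sum from the two "Got:" lines. sum_children is a hypothetical helper, and it assumes the third and fourth fields after "+F2:" are sign/s and verify/s, which is consistent with the totals in the summary row.

```shell
#!/bin/sh
# Sum the per-child "+F2:<n>:<bits>:<sign/s>:<verify/s>" results that
# `openssl speed -multi` echoes, reading the transcript from stdin.
# (sum_children is an illustrative name, not an openssl tool.)
sum_children() {
    awk '/\+F2:/ {
             sub(/^.*\+F2:/, "")      # drop everything up to and including "+F2:"
             split($0, f, ":")        # f[3] = sign/s, f[4] = verify/s plus trailing text
             sign += f[3] + 0
             verify += f[4] + 0       # "+ 0" keeps only the leading number
         }
         END { printf "%.1f %.1f\n", sign, verify }'
}

sum_children <<'EOF'
Got: +F2:2:2048:702.490040:20598.800000 from 0
Got: +F2:2:2048:587.363184:7315.384615 from 1
EOF
```

Keep in mind that this run uses two processes while the software-only baseline above used one, so the combined figures are not a like-for-like comparison with the single-process results.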