This is entirely unsurprising.
I have some experience porting hardware crypto engines to OpenWrt, having done it for Intel's C2000-series SoCs (C2558) and C3000-series SoCs (C3758). These are the processors known as Rangeley and Denverton, and they are substantially more powerful than the one you refer to here.
There are a number of caveats with regard to performance of these accelerators.
To get the best performance, the application needs to be aware that it is offloading to an accelerator. Virtually nothing you find on OpenWrt is aware of this, so most software is unlikely to benefit from the accelerator.
The reason for this is simple: bus transfer time. Data in a memory buffer has to be transferred across the PCIe bus to the lookaside accelerator, the accelerator performs the operation on the buffer, and the results are then transferred back. Granted, this is typically done by DMA on contiguous pinned memory blocks, but there is still a substantial penalty for the bus round trip compared to simply operating on the buffer with CPU instructions.
For single, synchronous requests, offloading is slower than performing the operation in software using CPU instructions.
This is aggravated by smaller buffer sizes, where the data transfer time occupies a higher proportion of the overall execution time. The larger the buffer, the more an application benefits from the accelerator. Typical network applications use small buffers, and many cannot issue requests asynchronously, so they are likely to see worse performance from offloading.
Intel's QuickAssist devices do not even attempt to offload buffers smaller than 2K, and even with 4K buffers, performance with chained symmetric ciphers is barely at parity with a pure software implementation.
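The trade-off can be sketched with a toy cost model (all numbers here are invented for illustration; they are not measurements of any real QAT part):

```python
# Toy cost model for lookaside crypto offload. The constants are
# illustrative assumptions only, not measured QAT figures.

FIXED_OVERHEAD_US = 25.0   # per-request cost: DMA setup, bus transfer, completion
CPU_US_PER_KB = 3.0        # CPU (e.g. AES-NI) time per KiB of payload
ACCEL_US_PER_KB = 1.0      # accelerator engine time per KiB of payload

def sync_offload_wins(buf_kib: float) -> bool:
    """True if a single synchronous offload beats doing it on the CPU."""
    cpu = CPU_US_PER_KB * buf_kib
    accel = FIXED_OVERHEAD_US + ACCEL_US_PER_KB * buf_kib
    return accel < cpu

# Break-even buffer size: overhead / (cpu rate - accel rate) = 25/2 = 12.5 KiB.
# Below that, the fixed transfer cost swamps any gain from the faster engine.
for kib in (1, 4, 16, 64):
    print(kib, sync_offload_wins(kib))
```

With these made-up constants, small buffers lose and only buffers well past the break-even point win, which is the same shape of curve the QuickAssist 2K/4K behaviour reflects.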
OpenSSL 1.1.0 introduced the notion of asynchronous operations, where the application can submit multiple non-blocking requests to the acceleration device, allowing the device to process them all in parallel. This is where the performance gap between software/CPU implementations and a hardware accelerator becomes really substantial.
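A rough way to see why keeping many requests in flight matters (again with invented numbers, not QAT measurements):

```python
# Toy model of why async batching helps: with one request in flight,
# throughput is limited by the full round-trip latency; with many in
# flight, it approaches the engine's own service rate. The constants
# are illustrative assumptions only.

ROUND_TRIP_US = 200.0   # submit + DMA out + compute + DMA back, one request
SERVICE_US = 20.0       # engine busy time per request once the pipeline is full

def ops_per_sec(in_flight: int) -> float:
    # Little's law: throughput = concurrency / latency, capped by the
    # engine's maximum service rate.
    return min(in_flight / ROUND_TRIP_US, 1.0 / SERVICE_US) * 1e6

print(ops_per_sec(1))    # latency-bound
print(ops_per_sec(72))   # engine-bound
```

A synchronous caller pays the whole round trip per request; an asynchronous caller with enough jobs in flight hides the bus latency entirely, which is exactly the pattern in the benchmark runs below.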
For example, look at the RSA performance of the C3758's hardware crypto below. The first two runs are purely synchronous, one in software and one in hardware. You'll note that the offloaded run is actually slower than software on the faster verify operation, while being faster on the much more computationally expensive signing operation, which benefits more from the hardware offload.
root@OpenWrt:~# openssl speed -elapsed rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 3684 2048 bits private RSA's in 10.01s
Doing 2048 bits public rsa's for 10s: 127429 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1f 31 Mar 2020
built on: Mon Apr 13 15:35:50 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
sign verify sign/s verify/s
rsa 2048 bits 0.002717s 0.000078s 368.0 12742.9
root@OpenWrt:~# openssl speed -elapsed -engine qat rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 12105 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 71326 2048 bits public RSA's in 10.00s
sign verify sign/s verify/s
rsa 2048 bits 0.000826s 0.000140s 1210.5 7132.6
For chained symmetric ciphers, synchronous operation on this platform is always slower on the crypto accelerator than with AES-NI on the CPU. Even asynchronously, chained symmetric ciphers are only faster for buffer sizes above 8K, which is much larger than the typical packet size and really only likely to be encountered in file-based I/O on an encrypted filesystem or a NAS application.
Below are the results for asynchronous operation, so you can see what a difference it makes when the application is aware of the accelerator.
root@OpenWrt:~# openssl speed -elapsed -async_jobs 72 rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 3671 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 127503 2048 bits public RSA's in 10.00s
sign verify sign/s verify/s
rsa 2048 bits 0.002724s 0.000078s 367.1 12750.3
root@OpenWrt:~# openssl speed -elapsed -engine qat -async_jobs 72 rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 93879 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 743979 2048 bits public RSA's in 10.00s
sign verify sign/s verify/s
rsa 2048 bits 0.000107s 0.000013s 9387.9 74397.9
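Pulling the measured rates together, the async offloaded run works out to roughly a 25x speedup on signing and about 6x on verification over the synchronous software baseline:

```python
# Speedups computed from the openssl speed numbers above:
# software sync run vs. qat engine with -async_jobs 72.
sw_sign, sw_verify = 368.0, 12742.9        # sync, software
hw_sign, hw_verify = 9387.9, 74397.9       # async, qat engine

print(f"sign:   {hw_sign / sw_sign:.1f}x")     # sign:   25.5x
print(f"verify: {hw_verify / sw_verify:.1f}x") # verify: 5.8x
```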
Something like nginx, modified to use asynchronous operations, can benefit from an accelerator (Intel has patches for nginx that allow this). OpenVPN or IPsec most likely cannot yield a big performance improvement unless passing a high volume of data, even when doing so asynchronously.
So, in summary: unless the application is optimized for a hardware accelerator, either by making asynchronous OpenSSL calls or by driving the accelerator directly with asynchronous requests, it is likely to see worse performance than a pure software implementation.