4526.87k is actually 4.5 Mbps, not MB/s, so 9 Mbps looks like a CPU limitation.
I upgraded to 18.06-rc1 and repeated the benchmark to check the differences from the previous version, 17.04.1.
openssl speed -elapsed md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc rsa2048 dsa2048
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
md5                396.41k    1770.35k    3212.71k    9018.37k   15018.67k
des ede3          1223.26k    1394.79k    1397.33k    1404.59k    1401.62k
aes-192 cbc       5197.93k    5690.41k    5824.94k    5859.33k    5849.09k
aes-256 cbc       4642.82k    5023.49k    5124.18k    5153.79k    5147.31k
sha256             845.22k    2547.22k    3001.75k    5256.42k    3140.27k
sha512             143.48k     623.78k     842.84k    1657.45k    1723.05k
Then I followed the procedure of @mpa (I also had to install kmod-crypto-authenc_4.9.109-1_mips_24kc) and repeated the benchmarks.
rmmod cryptodev
openssl speed -elapsed md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc rsa2048 dsa2048
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
md5                993.93k    3667.86k   11772.25k   26215.08k   40949.08k
des ede3          1380.57k    1399.94k    1401.51k    1394.01k    1355.35k
aes-192 cbc       5157.39k    5663.87k    5775.96k    5835.09k    5860.01k
aes-256 cbc       4627.19k    5013.55k    5127.08k    5132.29k    5158.23k
sha256            2350.09k    5415.74k    9545.39k   11777.37k   12670.29k
sha512             495.86k    1976.55k    2748.25k    3709.27k    4134.23k
and
modprobe cryptodev
for a in md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc; do openssl speed -elapsed -evp $a; done
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
md5                158.66k     596.12k    2288.81k    7885.48k   27560.62k
des-ede3          1349.11k    1375.53k    1383.34k    1382.40k    1381.72k
aes-192-cbc       1334.17k    3452.44k   10609.83k   22038.19k   31812.27k
aes-256-cbc       1325.06k    3382.61k   10154.84k   20216.83k   28292.44k
sha256             547.17k    1854.81k    5163.35k    9316.35k   12119.26k
sha512             287.68k    1179.24k    2206.29k    3408.21k    4033.40k
As expected, the improvements appear only for data sizes of 256 bytes and above, and only for the supported algorithms (aes-192-cbc and aes-256-cbc).
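The crossover point can be read off mechanically. The Python sketch below (the helper names are hypothetical, not OpenSSL functions) parses two rows in the format of the tables above, divides the cryptodev figures by the software figures for each block size, and returns the first block size where the ratio exceeds 1. The rows are the aes-192-cbc results copied from the two runs above: software after rmmod cryptodev, and -evp with cryptodev loaded.

```python
from __future__ import annotations

# Block sizes (bytes) of the five columns printed by `openssl speed` above.
BLOCK_SIZES = [16, 64, 256, 1024, 8192]

# aes-192-cbc rows copied from the benchmarks above (1000s of bytes/s).
SOFTWARE = "aes-192 cbc 5157.39k 5663.87k 5775.96k 5835.09k 5860.01k"
CRYPTODEV = "aes-192-cbc 1334.17k 3452.44k 10609.83k 22038.19k 31812.27k"


def parse_row(row: str) -> list[float]:
    """Return the numeric columns of an `openssl speed` row.

    Algorithm names may contain spaces ("aes-192 cbc"), so instead of
    skipping a fixed number of tokens we keep only tokens like '5157.39k'.
    """
    values = []
    for token in row.split():
        number = token[:-1]
        if token.endswith("k") and number.replace(".", "", 1).isdigit():
            values.append(float(number))
    return values


def speedups(software: str, hardware: str) -> dict[int, float]:
    """Hardware/software throughput ratio for each block size."""
    sw, hw = parse_row(software), parse_row(hardware)
    return {size: h / s for size, s, h in zip(BLOCK_SIZES, sw, hw)}


def crossover(software: str, hardware: str) -> int | None:
    """Smallest block size at which hardware beats software, if any."""
    for size, ratio in speedups(software, hardware).items():
        if ratio > 1.0:
            return size
    return None


if __name__ == "__main__":
    for size, ratio in speedups(SOFTWARE, CRYPTODEV).items():
        print(f"{size:>5} bytes: {ratio:.2f}x")
    print("crossover at", crossover(SOFTWARE, CRYPTODEV), "bytes")
```

For these two rows the ratio climbs from about 0.26x at 16 bytes to about 5.4x at 8192 bytes, with the crossover at 256 bytes.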
4526.87k is actually 4.5 Mbps, not MB/s, so 9 Mbps looks like a CPU limitation.
No, the values are in bytes; you can read about it here.
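To put the two readings side by side: `openssl speed` prints thousands of bytes per second, so the value in MB/s is the number divided by 1000, and the value in Mbit/s is that times 8. A minimal Python sketch of the conversion (the function names are just for illustration):

```python
def openssl_k_to_mbyte_s(value: str) -> float:
    """'4526.87k' (1000s of bytes/s, as printed by openssl speed) -> MB/s."""
    return float(value.rstrip("k")) / 1000


def openssl_k_to_mbit_s(value: str) -> float:
    """The same figure in megabits per second (8 bits per byte)."""
    return openssl_k_to_mbyte_s(value) * 8


if __name__ == "__main__":
    figure = "4526.87k"
    print(f"{figure} = {openssl_k_to_mbyte_s(figure):.2f} MB/s"
          f" = {openssl_k_to_mbit_s(figure):.1f} Mbit/s")
```

For 4526.87k this gives about 4.53 MB/s, or roughly 36.2 Mbit/s.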
I didn't know that, but in practice I've noticed OpenVPN speeds to be around the openssl speed results, only in Mbps.
fwiw, I asked one of the devs, mkresin, to take a quick look at this interesting thread. He asked me to post this response:
First of all, cryptodev is a 3rd-party kernel module that wasn't accepted by the kernel devs. Instead, the Crypto API was added with a 2.6-ish Linux kernel [0]. I believe the corresponding kernel modules are already packaged for OpenWrt [1].
Support for the Crypto API was added to OpenSSL 1.1.0 [2].
I have no idea whether or not OpenSSL is compiled with Crypto API support by default. Perhaps further special kernel options need to be selected to enable the base Crypto API support. I don't know whether the Lantiq DEU (Data Encryption Unit) driver supports the Crypto API. In the best-case scenario, it might be a matter of loading the correct kernel modules to get hardware-accelerated cryptography working.
In my opinion, using the cryptodev approach + the 3rd-party module is the wrong way. I can only suggest that someone pick this up as a task and have a look at Crypto API-based acceleration.
Mathias
[0] https://en.wikipedia.org/wiki/Crypto_API_(Linux)
[1] https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/kernel/linux/modules/crypto.mk
[2] https://github.com/openssl/openssl/commit/7f458a48ff3a231d5841466525d2aacbcd4f6b77
The hardware driver natively uses the Linux crypto API. Via the AF_ALG socket, it will work from userspace. OpenSSL didn't officially support it until 1.1.? OpenWrt is now getting the updated OpenSSL version as soon as all the patches are reviewed/added to master.
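As an illustration of what using the kernel Crypto API "via the AF_ALG socket" from userspace looks like, here is a minimal Python sketch built on the standard library `socket` module, which exposes AF_ALG on Linux (Python 3.6+). It is not the OpenSSL afalg engine. It encrypts one AES-128-CBC block using the NIST SP 800-38A test vector, not data from this thread. Which `cbc(aes)` implementation the kernel picks (a hardware driver or the generic software one) depends on the loaded modules and their priorities; the code itself cannot tell.

```python
import socket

# NIST SP 800-38A, F.2.1 (CBC-AES128.Encrypt), first block.
KEY = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
IV = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
PLAINTEXT = bytes.fromhex("6bc1bee22e409f96e93d7e117393172a")


def afalg_available() -> bool:
    """True if this kernel lets us bind an AF_ALG socket to cbc(aes)."""
    try:
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as tfm:
            tfm.bind(("skcipher", "cbc(aes)"))
        return True
    except (AttributeError, OSError):
        # AttributeError: not Linux, or no AF_ALG in this Python build.
        # OSError: AF_ALG or the algorithm is unavailable in this kernel.
        return False


def afalg_aes_cbc_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    """Encrypt whole 16-byte blocks with cbc(aes) via the kernel Crypto API."""
    if len(plaintext) % 16:
        raise ValueError("no padding is applied: use whole 16-byte blocks")
    # The "transform" socket selects the algorithm and holds the key.
    with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as tfm:
        tfm.bind(("skcipher", "cbc(aes)"))
        tfm.setsockopt(socket.SOL_ALG, socket.ALG_SET_KEY, key)
        # accept() yields an "operation" socket for this request.
        op, _ = tfm.accept()
        with op:
            # Data is copied into the kernel here and copied back by recv();
            # these are the copies a splice()-based zero-copy path would avoid.
            op.sendmsg_afalg([plaintext], op=socket.ALG_OP_ENCRYPT, iv=iv)
            return op.recv(len(plaintext))


if __name__ == "__main__":
    if afalg_available():
        # Expected per SP 800-38A: 7649abac8119b246cee98e9b12e9197d
        print(afalg_aes_cbc_encrypt(KEY, IV, PLAINTEXT).hex())
    else:
        print("AF_ALG skcipher cbc(aes) not available on this system")
```

Every request crosses the user/kernel boundary at least twice, which is one reason small blocks fare poorly with kernel-based offload.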
Cryptodev is a third-party module, but for now it is the only way to get userspace apps to use the hardware driver together with OpenSSL. The (also third-party) AF_ALG engine for the older OpenSSL versions was never officially integrated; as said before, that only happened with version 1.1.
I never got the separate OpenSSL engine to work with OpenSSL (I didn't try very hard). But according to the older benchmarks done by the cryptodev people, the BSD approach of /dev/crypto was a lot faster than the AF_ALG socket.
As soon as OpenSSL 1.1 is officially in OpenWrt, both approaches should work, and switching between the two options is just a matter of changing the engine. The AF_ALG engine in OpenSSL is a little more flexible in terms of selecting which encryption method or hash uses the engine.
I did some searching on the Web. According to this, cryptodev has better performance than the AF_ALG API. However, according to more recent sources (here 2014, here 2017, and here), the software implementation outperforms both cryptodev and AF_ALG, especially for small data sizes (up to about 64 KB for a TCP/UDP/IP packet; between 1.5 KB for a standard Ethernet frame and 9 KB for a jumbo frame). Moreover, according to the latest benchmark here, the difference between cryptodev and AF_ALG is not as large as shown here.
In my opinion, using the cryptodev approach + the 3rd-party module is the wrong way. I can only suggest that someone pick this up as a task and have a look at Crypto API-based acceleration.
Since AF_ALG is in the kernel and OpenSSL 1.1 supports it, I agree that it is the right way to go.
As soon as OpenSSL 1.1 is officially in OpenWrt, both approaches should work, and switching between the two options is just a matter of changing the engine. The AF_ALG engine in OpenSSL is a little more flexible in terms of selecting which encryption method or hash uses the engine.
I hope the OpenSSL 1.1 package will come soon after the OpenWrt 18.06 release. I read your thread (Status of OpenSSL 1.1 Lede/OpenWrt? kmod-crypto-test).
Finally, the performance of various OpenSSL versions on my PC (i7-3537U) is reported below.
openssl speed -elapsed md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc
OpenSSL 1.0.2g 1 Mar 2016
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
md5              53154.38k  153379.90k  340930.30k  496694.95k  586377.90k
des ede3         23009.45k   23092.95k   23529.30k   23612.07k   23811.41k
aes-192 cbc      93006.03k   98322.52k   98695.85k  102235.14k  100177.24k
aes-256 cbc      80941.49k   83786.88k   83017.22k   77857.45k   82927.62k
sha256           55436.81k  124654.14k  219929.43k  266457.43k  277113.51k
sha512           38305.91k  155056.85k  249469.61k  344811.52k  395487.91k
OpenSSL 1.0.2o 27 Mar 2018
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
md5              36815.35k  115230.29k  304914.77k  481625.43k  585973.76k
des ede3         22890.91k   23685.12k   23256.23k   23674.88k   23568.38k
aes-192 cbc      88807.08k   95228.42k   99297.11k   95685.97k   98899.29k
aes-256 cbc      79682.97k   85615.96k   85707.69k   86874.79k   86876.16k
sha256           59590.05k  132226.45k  227302.66k  277670.23k  293915.31k
sha512           41993.43k  165328.75k  238422.27k  356699.48k  410555.73k
OpenSSL 1.1.0h 27 Mar 2018
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
md5             106596.49k  246759.36k  441293.65k  554538.33k  596413.10k  592942.42k
des ede3         22817.63k   23257.19k   22951.51k   23186.09k   23358.12k   23358.12k
aes-192 cbc      89113.88k   91224.68k   99162.88k  100212.05k   93547.18k   90030.08k
aes-256 cbc      77074.08k   80776.85k   84904.96k   82820.44k   87135.57k   87610.71k
sha256           57051.51k  130352.60k  226432.43k  276851.71k  291703.47k  297451.52k
sha512           39645.48k  160842.07k  248906.67k  362397.35k  416626.01k  414302.21k
Excluding md5, there is no great difference between the latest OpenSSL 1.0.2o and OpenSSL 1.1.0h, so I do not expect any performance improvement on 18.06 with OpenSSL 1.1.0h. However, only a benchmark can tell the truth.
[Charts: single-thread and multi-thread benchmark results]
First, an Intel i7 comes with AES-NI. I wouldn't call that a "software" solution, but a built-in hardware solution. Since it's part of the processor's instruction set, it can be used from userspace without any restrictions, with no need for expensive context switching. OpenSSL has some optimizations for this, and OpenVPN will use it via the EVP API.
Second, comparing an i7 with the Lantiq SoC is not realistic. It only shows that, when testing between the router and the i7, the bottleneck should be on the SoC side, so that's the side that should be improved to get better overall performance.
Third, a 128-thread benchmark is nice on paper, but OpenVPN is a single-threaded implementation. It would be nice if the OpenVPN people would update/upgrade their code, but until then, the benchmark is "meaningless".
I do hope as well that we get OpenSSL 1.1 soon. It means we can do without the cryptodev module; fewer steps should still help a little. But for the AF_ALG solution to do a better job, it should be combined with splice to get a zero-copy implementation. In some benchmarks on the MT7628, I noticed a 40-50% performance decrease on bigger blocks just because of the copying.
First, an Intel i7 comes with AES-NI. I wouldn't call that a "software" solution, but a built-in hardware solution. Since it's part of the processor's instruction set, it can be used from userspace without any restrictions, with no need for expensive context switching. OpenSSL has some optimizations for this, and OpenVPN will use it via the EVP API.
Of course, but for small data sizes (16-1024 bytes), software solutions are always better than hardware solutions. For the other sizes there is not a great difference. The exception is AES, which exploits the AES-NI instruction set.
lsmod | grep crypto
crypto_simd 16384 1 aesni_intel
cryptd 24576 3 crypto_simd,ghash_clmulni_intel,aesni_intel
openssl speed -elapsed md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc
OpenSSL 1.1.0h 27 Mar 2018
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
md5             105415.78k  247264.75k  431623.94k  551934.98k  591801.00k  597240.49k
des ede3         22963.75k   23107.31k   23262.38k   22519.13k   23052.29k   22500.69k
aes-192 cbc      86319.39k   95674.18k   97212.42k  101870.25k  102765.91k  101908.48k
aes-256 cbc      75856.45k   86094.14k   87158.95k   87806.63k   87812.78k   87435.95k
sha256           59242.62k  132046.87k  223446.95k  263048.53k  289035.61k  298467.33k
sha512           39790.47k  161326.63k  233818.71k  348111.87k  415061.33k  419790.85k
for a in md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc; do openssl speed -elapsed -evp $a; done
OpenSSL 1.1.0h 27 Mar 2018
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
md5              54150.99k  159543.49k  357452.54k  513594.71k  590310.06k  596940.12k
des-ede3         22604.08k   22867.03k   22960.81k   22320.47k   22544.38k   21653.01k
aes-192-cbc     435918.46k  480810.82k  499365.55k  500524.71k  502666.58k  502475.43k
aes-256-cbc     378546.90k  419011.65k  428491.86k  429825.71k  430929.24k  428184.92k
sha256           37964.67k  100282.84k  200767.91k  256311.64k  292585.47k  296452.10k
sha512           24929.65k  101383.83k  205816.49k  332540.93k  409840.30k  416956.42k
Second, comparing an i7 with the Lantiq SoC is not realistic. It only shows that, when testing between the router and the i7, the bottleneck should be on the SoC side, so that's the side that should be improved to get better overall performance.
Sure, that is why I added the benchmark in my previous message and said that only a benchmark can tell the truth, even if I do not expect a great difference.
Third, a 128-thread benchmark is nice on paper, but OpenVPN is a single-threaded implementation. It would be nice if the OpenVPN people would update/upgrade their code, but until then, the benchmark is "meaningless".
If you look at the benchmarks, even the single-thread version shows that the software implementation is better than the hardware one and that there is not a great difference between cryptodev and AF_ALG. As I said before, different hardware can achieve different results, and we need real benchmarks.
I do hope as well that we get OpenSSL 1.1 soon. It means we can do without the cryptodev module; fewer steps should still help a little. But for the AF_ALG solution to do a better job, it should be combined with splice to get a zero-copy implementation. In some benchmarks on the MT7628, I noticed a 40-50% performance decrease on bigger blocks just because of the copying.
This will be very interesting. Keep us updated if you have any news.
What about mbedtls? It should have had AES-NI support since back when it was still called PolarSSL. What would it take to make it AES-NI aware? Clearly, at the moment it isn't, at least not by default.
I'm sorry. I was under the impression that it would fit thematically, since most of this thread was spent talking about hardware acceleration in SSL libraries, and mbedtls is the de facto default on OpenWrt now. Disregard me then; I didn't intend to derail anything.
I managed to compile OpenSSL 1.1.0h on a 4.14.44 snapshot build for my MT7628. I don't know what else this breaks, so for now just for benchmarking.
root@OpenWrt:~# openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 356737 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 181242 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 61366 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 16816 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 2041 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 958 aes-256-cbc's in 3.00s
OpenSSL 1.1.0h 27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr)
compiler: mipsel-openwrt-linux-musl-gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DAES_ASM -DOPENSSL_SMALL_FOOTPRINT -DOPENSSL_NO_ASYNC -DHAVE_CRYPTODEV -DOPENSSLDIR="\"/etc/ssl\"" -DENGINESDIR="\"/usr/lib/engines-1.1\"" -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/usr/include -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/usr/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include/fortify -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include -znow -zrelro
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-256-cbc       1902.60k    3866.50k    5236.57k    5739.86k    5573.29k    5231.96k
root@OpenWrt:~# openssl speed -elapsed -evp aes-256-cbc -engine afalg
engine "afalg" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 84442 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 84307 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 82663 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 72903 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 30576 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 18678 aes-256-cbc's in 3.00s
OpenSSL 1.1.0h 27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr)
compiler: mipsel-openwrt-linux-musl-gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DAES_ASM -DOPENSSL_SMALL_FOOTPRINT -DOPENSSL_NO_ASYNC -DHAVE_CRYPTODEV -DOPENSSLDIR="\"/etc/ssl\"" -DENGINESDIR="\"/usr/lib/engines-1.1\"" -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/usr/include -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/usr/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include/fortify -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include -znow -zrelro
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-256-cbc        450.36k    1798.55k    7053.91k   24884.22k   83492.86k  102006.78k
root@OpenWrt:~# openssl speed -elapsed -evp aes-256-cbc -engine cryptodev
engine "cryptodev" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 82900 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 82806 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 81600 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 70219 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 30149 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 18408 aes-256-cbc's in 3.00s
OpenSSL 1.1.0h 27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr)
compiler: mipsel-openwrt-linux-musl-gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DAES_ASM -DOPENSSL_SMALL_FOOTPRINT -DOPENSSL_NO_ASYNC -DHAVE_CRYPTODEV -DOPENSSLDIR="\"/etc/ssl\"" -DENGINESDIR="\"/usr/lib/engines-1.1\"" -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/usr/include -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/usr/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include/fortify -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include -znow -zrelro
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-256-cbc        442.13k    1766.53k    6963.20k   23968.09k   82326.87k  100532.22k
root@OpenWrt:~#
Above is the difference between software only, my HW driver with the OpenSSL AFALG engine, and OpenSSL with cryptodev. As expected, software is faster for small blocks, but at 256 bytes the hardware is already a slight improvement over software. To my surprise, the AFALG engine is not slower than cryptodev.
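For reference, each table row is derived from the "Doing ..." lines: number of operations × block size ÷ elapsed seconds, reported in units of 1000 bytes/s. A small Python sketch (the regex and helper names are illustrative, not OpenSSL code) recomputes the software and AFALG rows from the raw lines above and finds where the engine overtakes software.

```python
from __future__ import annotations

import re

# e.g. "Doing aes-256-cbc for 3s on 16 size blocks: 356737 aes-256-cbc's in 3.00s"
DOING_RE = re.compile(
    r"Doing \S+ for \d+s on (?P<size>\d+) size blocks: "
    r"(?P<count>\d+) \S+ in (?P<secs>[\d.]+)s"
)

# Raw lines copied from the software-only and AFALG engine runs above.
SOFTWARE_RUN = """
Doing aes-256-cbc for 3s on 16 size blocks: 356737 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 181242 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 61366 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 16816 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 2041 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 958 aes-256-cbc's in 3.00s
"""
AFALG_RUN = """
Doing aes-256-cbc for 3s on 16 size blocks: 84442 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 84307 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 82663 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 72903 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 30576 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 18678 aes-256-cbc's in 3.00s
"""


def throughput_k(log: str) -> dict[int, float]:
    """Map block size -> throughput in 1000s of bytes/s, as openssl prints it."""
    rates = {}
    for match in DOING_RE.finditer(log):
        size = int(match["size"])
        rates[size] = int(match["count"]) * size / float(match["secs"]) / 1000
    return rates


if __name__ == "__main__":
    software, afalg = throughput_k(SOFTWARE_RUN), throughput_k(AFALG_RUN)
    for size in software:
        print(f"{size:>5} bytes: software {software[size]:10.2f}k, "
              f"afalg {afalg[size]:10.2f}k")
    faster = [size for size in software if afalg[size] > software[size]]
    print("afalg engine faster from", min(faster), "bytes")
```

The recomputed values match the printed rows (for example 1902.60k vs 450.36k at 16 bytes, and 5236.57k vs 7053.91k at 256 bytes), and the engine overtakes software from 256-byte blocks, as noted above.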
I managed to compile OpenSSL 1.1.0h on a 4.14.44 snapshot build for my MT7628. I don't know what else this breaks, so for now just for benchmarking.
Thank you. I saw that OpenWrt 18.06 comes with Linux kernel 4.9.109. Did you try with this kernel version? Since OpenSSL 1.1 should come soon, that would be great.
Above is the difference between software only, my HW driver with the OpenSSL AFALG engine, and OpenSSL with cryptodev. As expected, software is faster for small blocks, but at 256 bytes the hardware is already a slight improvement over software. To my surprise, the AFALG engine is not slower than cryptodev.
The results are in line with the previous link and the recent benchmark that I reported.
I didn't try with an older kernel, but that shouldn't make a difference. As in my other thread about OpenSSL 1.1, the reason why it's (still) not in OpenWrt is (as far as I understood) that it breaks a lot of stuff that depends on OpenSSL.
As I understand from the mailing lists, patches are in the works to get everything to play nicely with version 1.1.
As I understood the benchmarks done (a long time ago) by the cryptodev people, their implementation was much faster, and for a long time cryptodev was the only way to go.
The AFALG engine is not perfect yet: AES-192-CBC and AES-256-CBC worked perfectly, but AES-128-CBC generated some errors. I still have to look into that. The MT7628 only has an AES engine, so I haven't done any testing with other ciphers or digests yet.
No success on the HH5a:
root@OpenWrt:~# openssl speed -elapsed -evp aes-256-cbc -engine afalg
engine "afalg" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: ALG_PERR: engines/e_afalg.c(207): io_setup error : Function not implemented
...
It seems the kernel is missing the io_setup syscall. Any idea how to fix this? I built from OpenWrt git master.
$ nm vmlinux.debug | egrep "sys_(ni_syscall|io_setup|uname)"
8004f6e0 W compat_sys_io_setup
8004f6e0 W sys_io_setup
8004f6e0 T sys_ni_syscall
80042768 T sys_uname
Apparently, AFALG support for AES-192-CBC and AES-256-CBC was only added in the development version of OpenSSL 1.1.1, while AES-128-CBC is already supported in OpenSSL 1.1.0h. Did you apply any patches to change that?
I did some more testing. It seems that OpenSSL needs to be compiled with "HAVE_CRYPTODEV"; otherwise, no offloading to the hardware occurs. Even with this flag set at compile time, no hardware offloading occurs without the cryptodev module loaded, even when specifying AFALG as the engine. I will ask the OpenSSL people directly how/why.
root@OpenWrt:~# time -v openssl speed -elapsed -evp aes-256-cbc -engine afalg
engine "afalg" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 897941 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 264831 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 68595 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 17492 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 2150 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 16384 size blocks: 1036 aes-256-cbc's in 3.00s
OpenSSL 1.1.0h 27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr)
compiler: mipsel-openwrt-linux-musl-gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DAES_ASM -DOPENSSL_API_COMPAT=0x10100000L -DOPENSSL_SMALL_FOOTPRINT -DOPENSSL_NO_ASYNC -DHAVE_CRYPTODEV -DOPENSSLDIR="\"/etc/ssl\"" -DENGINESDIR="\"/usr/lib/engines-1.1\"" -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/usr/include -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/usr/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include/fortify -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include -znow -zrelro
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-256-cbc       4789.02k    5649.73k    5853.44k    5970.60k    5851.43k    5657.94k
Command being timed: "openssl speed -elapsed -evp aes-256-cbc -engine afalg"
User time (seconds): 17.53
System time (seconds): 0.12
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 18.04s
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 11648
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 117
Voluntary context switches: 1
Involuntary context switches: 2029
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
root@OpenWrt:~# cat /proc/interrupts
CPU0
4: 10394 MIPS 4 mt76x2e
5: 150 MIPS 5 10100000.ethernet
6: 33320 MIPS 6 mt7603e
7: 28042 MIPS 7 timer
21: 0 INTC 13 10004000.crypto
25: 2 INTC 17 esw
28: 14 INTC 20 ttyS0
40: 0 GPIO 38 gpio-keys
41: 0 GPIO 37 gpio-keys
ERR: 62
root@OpenWrt:~# opkg install /tmp/kmod-cryptodev_4.14.44\+1.9.git-2017-10-04-ramips-1_mipsel_24kc.ipk
Installing kmod-cryptodev (4.14.44+1.9.git-2017-10-04-ramips-1) to root...
Configuring kmod-cryptodev.
root@OpenWrt:~# time -v openssl speed -elapsed -evp aes-256-cbc -engine afalg
engine "afalg" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 87648 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 68664 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 85626 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 60228 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 28735 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 17696 aes-256-cbc's in 3.00s
OpenSSL 1.1.0h 27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr)
compiler: mipsel-openwrt-linux-musl-gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DAES_ASM -DOPENSSL_API_COMPAT=0x10100000L -DOPENSSL_SMALL_FOOTPRINT -DOPENSSL_NO_ASYNC -DHAVE_CRYPTODEV -DOPENSSLDIR="\"/etc/ssl\"" -DENGINESDIR="\"/usr/lib/engines-1.1\"" -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/usr/include -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/usr/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include/fortify -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include -znow -zrelro
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-256-cbc        467.46k    1464.83k    7306.75k   20557.82k   78465.71k   96643.75k
Command being timed: "openssl speed -elapsed -evp aes-256-cbc -engine afalg"
User time (seconds): 0.60
System time (seconds): 4.83
Percent of CPU this job got: 28%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 18.75s
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 11872
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 118
Voluntary context switches: 348622
Involuntary context switches: 363
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
root@OpenWrt:~#
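One way to read the two `time -v` reports: the "Percent of CPU this job got" figure is (user + system time) / wall-clock time. The short Python check below (the helper name is illustrative) recomputes it from the numbers above; truncated to a whole percent it reproduces the reported 97% and 28%. With the cryptodev module loaded, the process spends most of the run waiting, and voluntary context switches jump from 1 to 348622, which is consistent with the work being handed to the crypto engine rather than done on the CPU.

```python
import math


def cpu_percent(user_s: float, sys_s: float, wall_s: float) -> float:
    """Share of one CPU used: (user + system) / elapsed wall-clock time."""
    return 100.0 * (user_s + sys_s) / wall_s


# (user, system, elapsed) seconds copied from the two `time -v` reports above.
RUNS = {
    "afalg engine, cryptodev not loaded": (17.53, 0.12, 18.04),
    "afalg engine, cryptodev loaded": (0.60, 4.83, 18.75),
}

if __name__ == "__main__":
    for name, (user_s, sys_s, wall_s) in RUNS.items():
        pct = cpu_percent(user_s, sys_s, wall_s)
        # time -v prints a whole number; truncating reproduces 97% and 28%.
        print(f"{name}: {math.floor(pct)}% CPU ({pct:.2f}%)")
```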
If the conclusion is that cryptodev needs to be present anyway, then there is no need for the AFALG engine at all. Better to keep using cryptodev directly and save a few bytes of flash.
Some progress from my side with userspace crypto acceleration on the HH5a. I created an image with the OpenSSL 1.1.0h cryptodev and afalg engines, and benchmarked them against OpenSSL software crypto.
The OpenSSL afalg engine requires AIO support from the Linux kernel. Since the official images ship with AIO disabled, it is necessary to build OpenWrt from source. I used the OpenSSL 1.1.0h packaging provided by @cotequeiroz. Here is a log of my steps:
git clone git://git.openwrt.org/openwrt/openwrt.git
cd openwrt
git remote add github git://github.com/openwrt/openwrt.git
git fetch github pull/965/head:openssl-1.1-cotequeiroz
git checkout openssl-1.1-cotequeiroz
scripts/feeds update packages
scripts/feeds install cryptodev-linux libpam
make menuconfig
Target System (Lantiq)
Subtarget (XRX200)
Target Profile (BT Home Hub 5A)
<Exit>, save configuration
make defconfig
make menuconfig
Global build settings > Kernel build options >
[*] Compile the kernel with asynchronous IO support
Kernel modules > Cryptographic API modules >
<*> kmod-cryptodev
<*> kmod-crypto-user
<*> kmod-ltq-deu-vr9 # already selected
# optionally, for each crypto module, select <*>
Libraries > SSL >
<*> libopenssl >
[*] Enable engine support
[*] Enable acceleration support through AF_ALG engine
[*] Acceleration support through /dev/crypto
[*] Digest acceleration support
Utilities >
<*> openssl-util
<Exit>, save configuration
make download
make -j5
install firmware image from bin/targets/lantiq/xrx200/ to router:
root@OpenWrt:~# sysupgrade -n /tmp/openwrt-lantiq-xrx200-bt_homehub-v5a-squashfs-sysupgrade.bin
(automatic reboot)
check if installation succeeded:
root@OpenWrt:~# cat /etc/openwrt_version
r6952+4-5399de754dde
OpenSSL engine capabilities and benchmarks:
root@OpenWrt:~# openssl engine cryptodev afalg -c -t
(cryptodev) BSD cryptodev engine
[RSA, DSA, DH, DES-CBC, AES-128-CBC, AES-192-CBC, AES-256-CBC, hmacWithMD5, hmacWithSHA1, MD5, SHA1]
[ available ]
(afalg) AFALG engine support
[AES-128-CBC]
[ available ]
root@OpenWrt:~# openssl speed -elapsed aes-128-cbc
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-128 cbc       5844.07k    6527.51k    6730.33k    6782.63k    6793.90k    6777.51k
root@OpenWrt:~# openssl speed -elapsed -engine cryptodev -evp aes-128-cbc
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-128-cbc       1006.32k    3708.12k   11407.79k   23379.97k   32093.53k   33057.45k
root@OpenWrt:~# openssl speed -elapsed -engine afalg -evp aes-128-cbc
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-128-cbc        151.67k     600.19k    2247.51k    7308.29k   19901.10k   22446.08k
This confirms previous benchmarks showing that cryptodev is much faster than afalg, at least in the way OpenSSL uses them here.
I also tried aes-256-cbc even though it is not supported by the afalg engine:
root@OpenWrt:~# openssl speed -elapsed -engine afalg -evp aes-256-cbc
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-256-cbc        997.25k    3621.27k   10722.82k   20876.97k   28407.13k   29185.37k
This is faster than the previous aes-128-cbc on afalg. I'd be surprised if this was true.
Let's compare this to aes-256-cbc on the cryptodev engine:
root@OpenWrt:~# openssl speed -elapsed -engine cryptodev -evp aes-256-cbc
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-256-cbc        990.45k    3595.48k   10724.95k   20851.71k   28428.97k   29207.21k
For aes-256-cbc, the results are nearly identical between afalg and cryptodev.
Could it be that openssl speed silently switches to cryptodev when afalg doesn't support the requested algorithm? This would also explain the surprising cryptodev requirements when afalg was requested.
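To put a number on "nearly identical", this Python sketch (the helper names are illustrative) computes the largest relative difference between the afalg and cryptodev rows above, using cryptodev as the reference. For aes-256-cbc the rows differ by less than 1% at every block size, while for aes-128-cbc, which the afalg engine does list, afalg is about 32-85% slower. That is consistent with the fallback hypothesis, though it does not prove it.

```python
from __future__ import annotations

# Rows copied from the benchmarks above (1000s of bytes per second).
AFALG_256 = "aes-256-cbc 997.25k 3621.27k 10722.82k 20876.97k 28407.13k 29185.37k"
CRYPTODEV_256 = "aes-256-cbc 990.45k 3595.48k 10724.95k 20851.71k 28428.97k 29207.21k"
AFALG_128 = "aes-128-cbc 151.67k 600.19k 2247.51k 7308.29k 19901.10k 22446.08k"
CRYPTODEV_128 = "aes-128-cbc 1006.32k 3708.12k 11407.79k 23379.97k 32093.53k 33057.45k"


def values(row: str) -> list[float]:
    """Numeric columns of an `openssl speed -evp` row (single-token name)."""
    return [float(token.rstrip("k")) for token in row.split()[1:]]


def max_relative_diff(row: str, reference: str) -> float:
    """Largest |row - reference| / reference across all block sizes."""
    pairs = zip(values(row), values(reference))
    return max(abs(x - ref) / ref for x, ref in pairs)


if __name__ == "__main__":
    print(f"aes-256-cbc: {max_relative_diff(AFALG_256, CRYPTODEV_256):.2%}")
    print(f"aes-128-cbc: {max_relative_diff(AFALG_128, CRYPTODEV_128):.2%}")
```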
Even though I agree with the idea of using a natively provided API as much as possible, it needs to be practical and on par in performance.
After playing with OpenSSL 1.1.0h and the AF_ALG engine, the conclusion is that it would not be a good idea (in its current state) to use this implementation. The "official" AFALG engine is very limited compared to the original (unsupported) AFALG engine for 0.9.x.
The fact that we need to compile the kernel with AIO (which is disabled by default) makes it impossible to have this as an "opkg-loadable" module. This is a "no-no" for most users, who don't compile from source.
Performance for smaller blocks is not even close to cryptodev, and most users would want to use it for OpenVPN, which limits blocks to the MTU size (<1500 bytes). Unless you use "jumbo frames" inside the tunnels, the cryptodev option is realistically the only way to go.
After playing with OpenSSL 1.1.0h and the AF_ALG engine, the conclusion is that it would not be a good idea (in its current state) to use this implementation. The "official" AFALG engine is very limited compared to the original (unsupported) AFALG engine for 0.9.x.
I agree with this, even though I think you should say cryptodev instead of the AFALG engine for 0.9.x.
Performance for smaller blocks is not even close to cryptodev, and most users would want to use it for OpenVPN, which limits blocks to the MTU size (<1500 bytes). Unless you use "jumbo frames" inside the tunnels, the cryptodev option is realistically the only way to go.
Based on the latest benchmark by @mpa, I agree with this.
I hope to report soon on a benchmark using OpenWrt 18.06 with the latest OpenSSL 1.1.0h, compiled as described here by @mpa, on a fast 50-60 Mbit/s Internet connection instead of my 12-14 Mbit/s ADSL2. I tried ProtonVPN with their official client and a paid premium account, and achieved about 30-40 Mbit/s with my i7-3537U.
After this, we should have a clear picture. In my opinion, based on the various benchmarks, the CPU of the BT Home Hub 5A is not the limiting factor, since with AES-256 it can achieve at least 4526.87k ~ 4.5 MByte/s ~ 36 Mbit/s without offloading, and more with offloading via cryptodev.