BT Home Hub 5A: configuring protonVPN via openVPN

4526.87k is actually 4.5 mbps not MB/s so 9mbps looks like CPU limitation

I upgraded to 18.06-rc1 and I repeated the benchmark in order to check the differences with the previous version 17.04.1.

openssl speed -elapsed md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc rsa2048 dsa2048
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5                396.41k     1770.35k     3212.71k     9018.37k    15018.67k
des ede3          1223.26k     1394.79k     1397.33k     1404.59k     1401.62k
aes-192 cbc       5197.93k     5690.41k     5824.94k     5859.33k     5849.09k
aes-256 cbc       4642.82k     5023.49k     5124.18k     5153.79k     5147.31k
sha256             845.22k     2547.22k     3001.75k     5256.42k     3140.27k
sha512             143.48k      623.78k      842.84k     1657.45k     1723.05k

Then I followed the procedure of @mpa (I had to install also kmod-crypto-authenc_4.9.109-1_mips_24kc) and I repeated the benchmarks.

rmmod cryptodev
openssl speed -elapsed md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc rsa2048 dsa2048
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5                993.93k     3667.86k    11772.25k    26215.08k    40949.08k
des ede3          1380.57k     1399.94k     1401.51k     1394.01k     1355.35k
aes-192 cbc       5157.39k     5663.87k     5775.96k     5835.09k     5860.01k
aes-256 cbc       4627.19k     5013.55k     5127.08k     5132.29k     5158.23k
sha256            2350.09k     5415.74k     9545.39k    11777.37k    12670.29k
sha512             495.86k     1976.55k     2748.25k     3709.27k     4134.23k

and

modprobe cryptodev
for a in md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc; do openssl speed -elapsed -evp $a; done
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5                158.66k      596.12k     2288.81k     7885.48k    27560.62k
des-ede3          1349.11k     1375.53k     1383.34k     1382.40k     1381.72k
aes-192-cbc       1334.17k     3452.44k    10609.83k    22038.19k    31812.27k
aes-256-cbc       1325.06k     3382.61k    10154.84k    20216.83k    28292.44k
sha256             547.17k     1854.81k     5163.35k     9316.35k    12119.26k
sha512             287.68k     1179.24k     2206.29k     3408.21k     4033.40k

As expected, the improvements appear for data sizes of 256 bytes and above, and only for the algorithms the hardware supports (aes-192-cbc and aes-256-cbc).
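
To make the crossover explicit, here is a minimal sketch that divides the cryptodev figures by the software figures for aes-256-cbc. The numbers are copied from the two tables above; the helper names are made up for illustration.

```python
# Throughput in 1000s of bytes per second, copied from the aes-256-cbc rows above.
BLOCK_SIZES = [16, 64, 256, 1024, 8192]
SOFTWARE = [4627.19, 5013.55, 5127.08, 5132.29, 5158.23]      # after rmmod cryptodev
CRYPTODEV = [1325.06, 3382.61, 10154.84, 20216.83, 28292.44]  # after modprobe cryptodev, -evp


def speedups(baseline, accelerated):
    """Return accelerated/baseline for each block size."""
    return [acc / base for base, acc in zip(baseline, accelerated)]


def break_even_size(sizes, ratios):
    """Return the first block size where the accelerated run is faster, or None."""
    for size, ratio in zip(sizes, ratios):
        if ratio > 1.0:
            return size
    return None


ratios = speedups(SOFTWARE, CRYPTODEV)
for size, ratio in zip(BLOCK_SIZES, ratios):
    print(f"{size:>5} bytes: {ratio:.2f}x")
print("first size where offload wins:", break_even_size(BLOCK_SIZES, ratios))
```

Below 256 bytes the ratio is under 1, so the offload is actually slower than software there.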

4526.87k is actually 4.5 mbps not MB/s so 9mbps looks like CPU limitation

No the values are in bytes, you can read here.
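
For anyone else tripping over the units, a small sketch of the conversion, assuming the `k` suffix means 1000 bytes per second as the openssl header line says. The function names are only illustrative.

```python
def parse_k(value):
    """Convert an openssl speed figure such as '4526.87k' to bytes per second."""
    return float(value.rstrip("k")) * 1000


def to_mbit_per_s(value):
    """Convert an openssl speed figure to megabits per second (8 bits per byte)."""
    return parse_k(value) * 8 / 1_000_000


print(f"{to_mbit_per_s('4526.87k'):.1f} Mbit/s")
```

So 4526.87k is roughly 4.5 MByte/s, i.e. about 36 Mbit/s of raw cipher throughput, not 4.5 Mbit/s.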

I didn't know that, but in practice I've noticed OpenVPN speeds to be around the values of the openssl speed result, but in Mbps.

fwiw, I asked one of the devs, mkresin, to take a quick look at this interesting thread. He asked me to post this response:

First of all, cryptodev is a third-party kernel module which wasn't accepted by the kernel devs. Instead, the Crypto API was added with a 2.6-ish Linux kernel [0]. In my opinion, the corresponding kernel modules are already packaged for OpenWrt [1].

Support for the Crypto API was added to OpenSSL 1.1.0 [2].

I have no idea whether or not OpenSSL is compiled with Crypto API support by default. Perhaps further special kernel options need to be selected to enable the base Crypto API support. I don’t know whether the Lantiq DEU (Data Encryption Unit) driver supports the Crypto API. In the best-case scenario, it might just be a matter of loading the correct kernel modules to get hardware-accelerated cryptography working.

In my opinion using the cryptodev approach + the 3rd party module is the wrong way. I can only suggest perhaps someone picks this up as a task to have a look at Crypto API based acceleration.

Mathias

[0] https://en.wikipedia.org/wiki/Crypto_API_(Linux)
[1] https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/kernel/linux/modules/crypto.mk

[2] https://github.com/openssl/openssl/commit/7f458a48ff3a231d5841466525d2aacbcd4f6b77

The hardware driver natively uses the Linux crypto API. Via the AF_ALG socket it will work from userspace. OpenSSL didn’t officially support it until 1.1.? OpenWRT is now getting the updated OpenSSL version as soon as all the patches are reviewed/added to Master.
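
To illustrate what "via the AF_ALG socket" means from userspace, here is a minimal sketch using Python's `socket` module, which exposes AF_ALG on Linux. It uses a sha256 hash rather than AES only because the hash interface is shorter to show; the function name is made up, and whether the kernel routes the request to a hardware driver or to a software implementation depends on which drivers are registered on the system.

```python
import hashlib
import socket


def afalg_sha256(data):
    """Hash data through the kernel crypto API; return None if AF_ALG is unavailable."""
    try:
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as alg:
            # Select the algorithm type and name as registered in the kernel.
            alg.bind(("hash", "sha256"))
            # accept() returns an operation socket used for the actual data.
            op, _ = alg.accept()
            with op:
                op.sendall(data)
                return op.recv(32)  # a sha256 digest is 32 bytes
    except (AttributeError, OSError):
        # No AF_ALG in this Python build, or the kernel lacks the hash interface.
        return None


digest = afalg_sha256(b"abc")
if digest is None:
    print("AF_ALG not available here")
else:
    print("matches hashlib:", digest == hashlib.sha256(b"abc").digest())
```

Every request crosses into the kernel through socket calls, which is where the per-call overhead discussed in this thread comes from.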

Cryptodev is a third-party module, but for now it is the only way to get user-space apps to use the hardware driver together with OpenSSL. The (also third-party) AF_ALG engine for the older OpenSSL versions was never officially integrated. As said before, this happened with the 1.1 version.

I never got the separate OpenSSL engine to work with OpenSSL (didn’t try very hard). But according to the older benchmarks done by the cryptodev people, the BSD approach to the /dev/crypto was a lot faster compared to the AF_ALG socket.

As soon as OpenSSL 1.1 is officially in OpenWRT both approaches should work and switching between the two options is just a matter of changing engine. The AF_ALG engine in OpenSSL is a little more flexible in terms of selecting which encryption method or hash is using the engine.
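
As a sketch of what "changing engine" looks like in a benchmark script, the helper below only builds the `openssl speed` command line; it does not run anything. The helper name is hypothetical, and the engine names `afalg` and `cryptodev` assume an OpenSSL build that ships both engines.

```python
def speed_command(cipher, engine=None):
    """Build an 'openssl speed' command line, optionally selecting an engine."""
    cmd = ["openssl", "speed", "-elapsed", "-evp", cipher]
    if engine is not None:
        cmd += ["-engine", engine]
    return cmd


# Same cipher, three back ends: software, AF_ALG engine, cryptodev engine.
for engine in (None, "afalg", "cryptodev"):
    print(" ".join(speed_command("aes-256-cbc", engine)))
```

The list form can be passed to `subprocess.run` on a box where those engines are actually installed.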

I did some searching on the web. According to this, cryptodev performs better than the AF_ALG API. However, according to more recent sources (here 2014, here 2017 and here), a software implementation outperforms both cryptodev and AF_ALG, especially for small data sizes (TCP/UDP/IP packets up to about 64 kBytes, Ethernet between 1.5 kBytes for a standard frame and 9 kBytes for a jumbo frame). Moreover, according to the latest benchmark here, the difference between cryptodev and AF_ALG is not as large as shown here.

In my opinion using the cryptodev approach + the 3rd party module is the wrong way. I can only suggest perhaps someone picks this up as a task to have a look at Crypto API based acceleration.

Since AF_ALG is in the kernel and OpenSSL 1.1 supports it, I agree that it is the right way to go.

As soon as OpenSSL 1.1 is officially in OpenWRT both approaches should work and switching between the two options is just a matter of changing engine. The AF_ALG engine in OpenSSL is a little more flexible in terms of selecting which encryption method or hash is using the engine.

I hope that the OpenSSL 1.1 package will come soon after the OpenWrt 18.06 release. I read your thread "Status of OpenSSL 1.1 Lede/OpenWrt? kmod-crypto-test".
Finally, the performance of various OpenSSL versions on my PC (i7-3537U) is reported below.

openssl speed -elapsed md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc
OpenSSL 1.0.2g  1 Mar 2016
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              53154.38k   153379.90k   340930.30k   496694.95k   586377.90k
des ede3         23009.45k    23092.95k    23529.30k    23612.07k    23811.41k
aes-192 cbc      93006.03k    98322.52k    98695.85k   102235.14k   100177.24k
aes-256 cbc      80941.49k    83786.88k    83017.22k    77857.45k    82927.62k
sha256           55436.81k   124654.14k   219929.43k   266457.43k   277113.51k
sha512           38305.91k   155056.85k   249469.61k   344811.52k   395487.91k
OpenSSL 1.0.2o  27 Mar 2018
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              36815.35k   115230.29k   304914.77k   481625.43k   585973.76k
des ede3         22890.91k    23685.12k    23256.23k    23674.88k    23568.38k
aes-192 cbc      88807.08k    95228.42k    99297.11k    95685.97k    98899.29k
aes-256 cbc      79682.97k    85615.96k    85707.69k    86874.79k    86876.16k
sha256           59590.05k   132226.45k   227302.66k   277670.23k   293915.31k
sha512           41993.43k   165328.75k   238422.27k   356699.48k   410555.73k
OpenSSL 1.1.0h  27 Mar 2018
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5             106596.49k   246759.36k   441293.65k   554538.33k   596413.10k   592942.42k
des ede3         22817.63k    23257.19k    22951.51k    23186.09k    23358.12k    23358.12k
aes-192 cbc      89113.88k    91224.68k    99162.88k   100212.05k    93547.18k    90030.08k
aes-256 cbc      77074.08k    80776.85k    84904.96k    82820.44k    87135.57k    87610.71k
sha256           57051.51k   130352.60k   226432.43k   276851.71k   291703.47k   297451.52k
sha512           39645.48k   160842.07k   248906.67k   362397.35k   416626.01k   414302.21k

Excluding the md5 digest, there is no great difference between the latest OpenSSL 1.0.2o and OpenSSL 1.1.0h. So I do not expect any performance improvement from OpenSSL 1.1.0h on 18.06. However, only a benchmark can tell the truth.

[Figures: single-thread and multi-thread benchmark results]

First, an Intel i7 comes with AES-NI. I wouldn’t call that a “software” solution, but a built-in hardware solution. Since it’s part of the instruction set of the processor, it can be used from user space without any restrictions. No need for expensive context switching. OpenSSL has some optimization for this and OpenVPN will use it via the EVP API.

Second, comparing an i7 with the Lantiq SoC is not realistic. It will just show that, for testing purposes between the router and the i7, the bottleneck should be on the SoC side, so that’s the side which should be improved to get better overall performance.

Third, to do a 128 thread benchmark is nice on paper, but OpenVPN is a single thread implementation. It would be nice if the OpenVPN people would update/upgrade their code, but until then, the benchmark is “meaningless”.

I do hope as well that we get OpenSSL 1.1 soon. This means we can do without the cryptodev module. Less steps to do should still help a little. But to have the AF_ALG solution do a better job it should be combined with splice to get a zero copy implementation. From some benchmarks on the MT7628 I noticed 40-50% performance decrease on bigger blocks just because of the copy action.

First, an Intel i7 comes with AES-NI. I wouldn’t call that a “software” solution, but a built-in hardware solution. Since it’s part of the instruction set of the processor, it can be used from user space without any restrictions. No need for expensive context switching. OpenSSL has some optimization for this and OpenVPN will use it via the EVP API.

Of course, but for small data sizes (16-1024 bytes) software solutions are always better than hardware solutions. For the larger sizes there is not a great difference. The exception is the AES algorithm, which exploits the AES-NI instruction set.

lsmod | grep crypto
crypto_simd            16384  1 aesni_intel
cryptd                 24576  3 crypto_simd,ghash_clmulni_intel,aesni_intel
openssl speed -elapsed md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc
OpenSSL 1.1.0h  27 Mar 2018
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5             105415.78k   247264.75k   431623.94k   551934.98k   591801.00k   597240.49k
des ede3         22963.75k    23107.31k    23262.38k    22519.13k    23052.29k    22500.69k
aes-192 cbc      86319.39k    95674.18k    97212.42k   101870.25k   102765.91k   101908.48k
aes-256 cbc      75856.45k    86094.14k    87158.95k    87806.63k    87812.78k    87435.95k
sha256           59242.62k   132046.87k   223446.95k   263048.53k   289035.61k   298467.33k
sha512           39790.47k   161326.63k   233818.71k   348111.87k   415061.33k   419790.85k

for a in md5 sha256 sha512 des-ede3 aes-192-cbc aes-256-cbc; do openssl speed -elapsed -evp $a; done
OpenSSL 1.1.0h  27 Mar 2018
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5              54150.99k   159543.49k   357452.54k   513594.71k   590310.06k   596940.12k
des-ede3         22604.08k    22867.03k    22960.81k    22320.47k    22544.38k    21653.01k
aes-192-cbc     435918.46k   480810.82k   499365.55k   500524.71k   502666.58k   502475.43k
aes-256-cbc     378546.90k   419011.65k   428491.86k   429825.71k   430929.24k   428184.92k
sha256           37964.67k   100282.84k   200767.91k   256311.64k   292585.47k   296452.10k
sha512           24929.65k   101383.83k   205816.49k   332540.93k   409840.30k   416956.42k

Second, comparing an i7 with the Lantiq SoC is not realistic. It will just show that, for testing purposes between the router and the i7, the bottleneck should be on the SoC side, so that’s the side which should be improved to get better overall performance.

Sure; for this reason I added the benchmark in the previous message and said that only a benchmark can tell the truth, even if I do not expect a great difference.

Third, to do a 128 thread benchmark is nice on paper, but OpenVPN is a single thread implementation. It would be nice if the OpenVPN people would update/upgrade their code, but until then, the benchmark is “meaningless”.

If you look at the benchmarks, even the single-thread version shows that the software implementation is better than the hardware one and that there is not a great difference between cryptodev and AF_ALG. As I said before, different hardware can achieve different results and we need a real benchmark.

I do hope as well that we get OpenSSL 1.1 soon. This means we can do without the cryptodev module. Less steps to do should still help a little. But to have the AF_ALG solution do a better job it should be combined with splice to get a zero copy implementation. From some benchmarks on the MT7628 I noticed 40-50% performance decrease on bigger blocks just because of the copy action.

This will be very interesting. Keep us updated if you have any news.

What about mbedtls? It should have AES-NI support since it was still called PolarSSL. What would it take to make it AES-NI aware, because clearly at the moment it isn't, at least not by default.

That’s getting far off topic but it’s all here:
https://tls.mbed.org/aes-source-code

I'm sorry. I was under the impression that it would somehow fit thematically, since most of this thread was spent talking about hardware acceleration in SSL libraries, and mbedtls is the de facto default on OpenWrt now. Disregard me then, I didn't intend to derail anything.

I managed to compile OpenSSL 1.1.0h on a 4.14.44 snapshot build for my MT7628. I don't know what else this breaks, so for now just for benchmarking.

root@OpenWrt:~# openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 356737 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 181242 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 61366 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 16816 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 2041 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 958 aes-256-cbc's in 3.00s
OpenSSL 1.1.0h  27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr) 
compiler: mipsel-openwrt-linux-musl-gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DAES_ASM -DOPENSSL_SMALL_FOOTPRINT -DOPENSSL_NO_ASYNC -DHAVE_CRYPTODEV -DOPENSSLDIR="\"/etc/ssl\"" -DENGINESDIR="\"/usr/lib/engines-1.1\""  -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/usr/include -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/usr/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include/fortify -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include -znow -zrelro
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc       1902.60k     3866.50k     5236.57k     5739.86k     5573.29k     5231.96k

root@OpenWrt:~# openssl speed -elapsed -evp aes-256-cbc -engine afalg
engine "afalg" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 84442 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 84307 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 82663 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 72903 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 30576 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 18678 aes-256-cbc's in 3.00s
OpenSSL 1.1.0h  27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr) 
compiler: mipsel-openwrt-linux-musl-gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DAES_ASM -DOPENSSL_SMALL_FOOTPRINT -DOPENSSL_NO_ASYNC -DHAVE_CRYPTODEV -DOPENSSLDIR="\"/etc/ssl\"" -DENGINESDIR="\"/usr/lib/engines-1.1\""  -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/usr/include -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/usr/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include/fortify -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include -znow -zrelro
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc        450.36k     1798.55k     7053.91k    24884.22k    83492.86k   102006.78k

root@OpenWrt:~# openssl speed -elapsed -evp aes-256-cbc -engine cryptodev
engine "cryptodev" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 82900 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 82806 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 81600 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 70219 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 30149 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 18408 aes-256-cbc's in 3.00s
OpenSSL 1.1.0h  27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr) 
compiler: mipsel-openwrt-linux-musl-gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DAES_ASM -DOPENSSL_SMALL_FOOTPRINT -DOPENSSL_NO_ASYNC -DHAVE_CRYPTODEV -DOPENSSLDIR="\"/etc/ssl\"" -DENGINESDIR="\"/usr/lib/engines-1.1\""  -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/usr/include -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/usr/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include/fortify -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include -znow -zrelro
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc        442.13k     1766.53k     6963.20k    23968.09k    82326.87k   100532.22k
root@OpenWrt:~# 

The above shows the difference between software only, my HW driver with the OpenSSL AFALG engine, and OpenSSL with cryptodev. As was to be expected, software is faster for small blocks. But at 256 bytes, the hardware is already a slight improvement over software. To my surprise, the AFALG engine is not slower than cryptodev.
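
For reference, the summary figures can be reproduced from the "Doing ... for 3s" lines: operations times block size, divided by elapsed seconds and by 1000. A minimal sketch with two values taken from the runs above (the function name is only illustrative):

```python
def throughput_k(operations, block_size, seconds):
    """Return throughput in 1000s of bytes per second, as openssl speed reports it."""
    return operations * block_size / seconds / 1000


# Software run: 356737 operations on 16-byte blocks in 3.00s.
print(f"{throughput_k(356737, 16, 3.00):.2f}k")
# afalg run: 18678 operations on 16384-byte blocks in 3.00s.
print(f"{throughput_k(18678, 16384, 3.00):.2f}k")
```

Note that with either engine the operation count barely moves between 16 and 256 bytes (about 82-84 thousand in 3s), which suggests a fixed per-request cost dominates for small blocks.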

I managed to compile OpenSSL 1.1.0h on a 4.14.44 snapshot build for my MT7628. I don't know what else this breaks, so for now just for benchmarking.

Thank you. I saw that openWRT 18.06 comes with linux kernel 4.9.109. Did you try with this kernel version? Since OpenSSL 1.1 should come soon, it will be great.

The above shows the difference between software only, my HW driver with the OpenSSL AFALG engine, and OpenSSL with cryptodev. As was to be expected, software is faster for small blocks. But at 256 bytes, the hardware is already a slight improvement over software. To my surprise, the AFALG engine is not slower than cryptodev.

The results are in line with the previous link and the recent benchmark that I reported.

I didn’t try with an older kernel, but that shouldn’t make a difference. As in my other thread about OpenSSL 1.1, the reason why it’s (still) not in OpenWRT is (as far as I understood) that it breaks a lot of stuff that depends on OpenSSL.

As I understand from the mailing lists, the patches are in the making to get everything to play nicely with version 1.1.

The way I understood the benchmarks done (a long time ago) by the cryptodev people was that their implementation was much faster. And for a long time the only way to go was using cryptodev.

The AFALG engine is not perfect yet: AES-192-CBC and AES-256-CBC worked perfectly, but AES-128-CBC generated some errors. I still have to look into that. The MT7628 only has an AES engine, so I didn’t do any testing with other ciphers or digests yet.

No success on the HH5a:

root@OpenWrt:~# openssl speed -elapsed -evp aes-256-cbc -engine afalg
engine "afalg" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: ALG_PERR: engines/e_afalg.c(207): io_setup error : Function not implemented
...

It seems the Kernel is missing the io_setup syscall. Any idea how to fix this? I built from OpenWrt git master.

$ nm vmlinux.debug | egrep "sys_(ni_syscall|io_setup|uname)"
8004f6e0 W compat_sys_io_setup
8004f6e0 W sys_io_setup
8004f6e0 T sys_ni_syscall
80042768 T sys_uname
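
Reading the nm output: `sys_io_setup` is a weak symbol at the same address as `sys_ni_syscall`, which, as far as I understand, is the kernel's "not implemented" stub, so the syscall exists only as an alias to that stub. A small sketch that automates the check on the output pasted above (the function name is hypothetical):

```python
NM_OUTPUT = """\
8004f6e0 W compat_sys_io_setup
8004f6e0 W sys_io_setup
8004f6e0 T sys_ni_syscall
80042768 T sys_uname
"""


def unimplemented_syscalls(nm_output):
    """Return symbols that share their address with the sys_ni_syscall stub."""
    symbols = {}
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3:
            address, _symbol_type, name = parts
            symbols[name] = address
    stub_address = symbols.get("sys_ni_syscall")
    return sorted(
        name
        for name, address in symbols.items()
        if address == stub_address and name != "sys_ni_syscall"
    )


print(unimplemented_syscalls(NM_OUTPUT))
```

`sys_uname` has its own address, so it is a real implementation; the two io_setup entries are not.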

Apparently AFALG support for AES-192-CBC and 256 was only added in the development version of OpenSSL 1.1.1, while AES-128-CBC is already supported in OpenSSL 1.1.0h. Did you apply any patches to change that?

I did some more testing. It seems that OpenSSL needs to be compiled with "HAVE_CRYPTODEV" otherwise no offloading to the hardware occurs. Even with this flag during compilation, without the cryptodev module loaded, still no hardware offloading occurs, even when specifying AFALG as engine. I will ask how/why directly to the OpenSSL people.

root@OpenWrt:~# time -v openssl speed -elapsed -evp aes-256-cbc -engine afalg
engine "afalg" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 897941 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 264831 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 68595 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 17492 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 2150 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 16384 size blocks: 1036 aes-256-cbc's in 3.00s
OpenSSL 1.1.0h  27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr) 
compiler: mipsel-openwrt-linux-musl-gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DAES_ASM -DOPENSSL_API_COMPAT=0x10100000L -DOPENSSL_SMALL_FOOTPRINT -DOPENSSL_NO_ASYNC -DHAVE_CRYPTODEV -DOPENSSLDIR="\"/etc/ssl\"" -DENGINESDIR="\"/usr/lib/engines-1.1\""  -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/usr/include -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/usr/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include/fortify -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include -znow -zrelro
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc       4789.02k     5649.73k     5853.44k     5970.60k     5851.43k     5657.94k
	Command being timed: "openssl speed -elapsed -evp aes-256-cbc -engine afalg"
	User time (seconds): 17.53
	System time (seconds): 0.12
	Percent of CPU this job got: 97%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 18.04s
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 11648
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 117
	Voluntary context switches: 1
	Involuntary context switches: 2029
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
root@OpenWrt:~# cat /proc/interrupts 
           CPU0       
  4:      10394      MIPS   4  mt76x2e
  5:        150      MIPS   5  10100000.ethernet
  6:      33320      MIPS   6  mt7603e
  7:      28042      MIPS   7  timer
 21:          0      INTC  13  10004000.crypto
 25:          2      INTC  17  esw
 28:         14      INTC  20  ttyS0
 40:          0      GPIO  38  gpio-keys
 41:          0      GPIO  37  gpio-keys
ERR:         62
root@OpenWrt:~# opkg install /tmp/kmod-cryptodev_4.14.44\+1.9.git-2017-10-04-ramips-1_mipsel_24kc.ipk 
Installing kmod-cryptodev (4.14.44+1.9.git-2017-10-04-ramips-1) to root...
Configuring kmod-cryptodev.
root@OpenWrt:~# time -v openssl speed -elapsed -evp aes-256-cbc -engine afalg
engine "afalg" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 87648 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 68664 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 85626 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 60228 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 28735 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 17696 aes-256-cbc's in 3.00s
OpenSSL 1.1.0h  27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr) 
compiler: mipsel-openwrt-linux-musl-gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DAES_ASM -DOPENSSL_API_COMPAT=0x10100000L -DOPENSSL_SMALL_FOOTPRINT -DOPENSSL_NO_ASYNC -DHAVE_CRYPTODEV -DOPENSSLDIR="\"/etc/ssl\"" -DENGINESDIR="\"/usr/lib/engines-1.1\""  -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/usr/include -I/home/drbrains/source/staging_dir/target-mipsel_24kc_musl/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/usr/include -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include/fortify -I/home/drbrains/source/staging_dir/toolchain-mipsel_24kc_gcc-7.3.0_musl/include -znow -zrelro
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc        467.46k     1464.83k     7306.75k    20557.82k    78465.71k    96643.75k
	Command being timed: "openssl speed -elapsed -evp aes-256-cbc -engine afalg"
	User time (seconds): 0.60
	System time (seconds): 4.83
	Percent of CPU this job got: 28%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 18.75s
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 11872
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 118
	Voluntary context switches: 348622
	Involuntary context switches: 363
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
root@OpenWrt:~# 
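
The `time -v` figures are the quickest way to see whether offloading really happened. A minimal sketch that computes the CPU share and how much of it was spent in the kernel, using the user, system and elapsed times from the two runs above (names are only illustrative):

```python
def cpu_share(user_s, system_s, elapsed_s):
    """Fraction of wall-clock time the process spent on the CPU."""
    return (user_s + system_s) / elapsed_s


def kernel_fraction(user_s, system_s):
    """Fraction of the consumed CPU time that was spent in the kernel."""
    return system_s / (user_s + system_s)


# (user, system, elapsed) in seconds, copied from the time -v output above.
RUNS = {
    "without cryptodev module": (17.53, 0.12, 18.04),
    "with cryptodev module": (0.60, 4.83, 18.75),
}

for label, (user_s, system_s, elapsed_s) in RUNS.items():
    share = cpu_share(user_s, system_s, elapsed_s)
    kernel = kernel_fraction(user_s, system_s)
    print(f"{label}: {share:.0%} CPU, {kernel:.0%} of it in the kernel")
```

In the first run nearly all the time is user-space CPU, which is what pure software AES looks like; in the second the process mostly waits, and the little CPU it uses is system time.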

If the conclusion is that cryptodev needs to be present anyway, then there is no need to have the AFALG engine at all. Better to keep using cryptodev directly and save a few bytes in flash.

Some progress from my side with userspace crypto acceleration on the HH5a. I created an image with the OpenSSL 1.1.0h cryptodev and afalg engines, and benchmarked them against OpenSSL software crypto.

The OpenSSL afalg engine requires AIO support from the Linux kernel. Since the official images ship with AIO disabled, it is necessary to build OpenWrt from source. I used the OpenSSL 1.1.0h packaging provided by @cotequeiroz. Here is a log of my steps:

git clone git://git.openwrt.org/openwrt/openwrt.git
cd openwrt
git remote add github git://github.com/openwrt/openwrt.git
git fetch github pull/965/head:openssl-1.1-cotequeiroz
git checkout openssl-1.1-cotequeiroz
scripts/feeds update packages
scripts/feeds install cryptodev-linux libpam
make menuconfig
  Target System (Lantiq)
  Subtarget (XRX200)
  Target Profile (BT Home Hub 5A)
  <Exit>, save configuration
make defconfig
make menuconfig
  Global build settings > Kernel build options >
    [*] Compile the kernel with asynchronous IO support
  Kernel modules > Cryptographic API modules >
    <*> kmod-cryptodev
    <*> kmod-crypto-user
    <*> kmod-ltq-deu-vr9  # already selected
    # optionally, for each crypto module, select <*>
  Libraries > SSL >
    <*> libopenssl >
      [*]   Enable engine support
      [*]     Enable acceleration support through AF_ALG engine
      [*]   Acceleration support through /dev/crypto
      [*]   Digest acceleration support
  Utilities >
    <*> openssl-util
  <Exit>, save configuration
make download
make -j5

install firmware image from bin/targets/lantiq/xrx200/ to router:
root@OpenWrt:~# sysupgrade -n /tmp/openwrt-lantiq-xrx200-bt_homehub-v5a-squashfs-sysupgrade.bin
(automatic reboot)

check if installation succeeded:
root@OpenWrt:~# cat /etc/openwrt_version 
r6952+4-5399de754dde

OpenSSL engine capabilities and benchmarks:

root@OpenWrt:~# openssl engine cryptodev afalg -c -t
(cryptodev) BSD cryptodev engine
 [RSA, DSA, DH, DES-CBC, AES-128-CBC, AES-192-CBC, AES-256-CBC, hmacWithMD5, hmacWithSHA1, MD5, SHA1]
     [ available ]
(afalg) AFALG engine support
 [AES-128-CBC]
     [ available ]

root@OpenWrt:~# openssl speed -elapsed aes-128-cbc
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128 cbc       5844.07k     6527.51k     6730.33k     6782.63k     6793.90k     6777.51k

root@OpenWrt:~# openssl speed -elapsed -engine cryptodev -evp aes-128-cbc
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc       1006.32k     3708.12k    11407.79k    23379.97k    32093.53k    33057.45k

root@OpenWrt:~# openssl speed -elapsed -engine afalg -evp aes-128-cbc
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc        151.67k      600.19k     2247.51k     7308.29k    19901.10k    22446.08k

This confirms previous benchmarks that cryptodev is much faster than afalg, at least in the way OpenSSL uses them here.

I also tried aes-256-cbc even though it is not supported by the afalg engine:

root@OpenWrt:~# openssl speed -elapsed -engine afalg -evp aes-256-cbc
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc        997.25k     3621.27k    10722.82k    20876.97k    28407.13k    29185.37k

This is faster than the previous aes-128-cbc on afalg. I'd be surprised if this was true.
Let's compare this to aes-256-cbc on the cryptodev engine:

root@OpenWrt:~# openssl speed -elapsed -engine cryptodev -evp aes-256-cbc
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc        990.45k     3595.48k    10724.95k    20851.71k    28428.97k    29207.21k

For aes-256-cbc, the results are nearly identical between afalg and cryptodev.
Could it be that openssl speed silently switches to cryptodev when afalg doesn't support the requested algorithm? This would also explain the surprising cryptodev requirements when afalg was requested.
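
A quick way to put a number on "nearly identical" is the per-column relative gap between the two engines. This is only a sketch on the figures above and cannot prove the fallback theory, but it shows how different the aes-128 and aes-256 cases are (the function name is made up):

```python
# Throughput in 1000s of bytes per second, copied from the HH5a runs above.
AFALG_128 = [151.67, 600.19, 2247.51, 7308.29, 19901.10, 22446.08]
CRYPTODEV_128 = [1006.32, 3708.12, 11407.79, 23379.97, 32093.53, 33057.45]
AFALG_256 = [997.25, 3621.27, 10722.82, 20876.97, 28407.13, 29185.37]
CRYPTODEV_256 = [990.45, 3595.48, 10724.95, 20851.71, 28428.97, 29207.21]


def relative_gaps(a, b):
    """Per-column difference between two result rows, relative to the larger value."""
    return [abs(x - y) / max(x, y) for x, y in zip(a, b)]


print("aes-128-cbc: smallest gap "
      f"{min(relative_gaps(AFALG_128, CRYPTODEV_128)):.1%}")
print("aes-256-cbc: largest gap "
      f"{max(relative_gaps(AFALG_256, CRYPTODEV_256)):.1%}")
```

For aes-256-cbc every column differs by less than 1%, which is within run-to-run noise, while for aes-128-cbc the two engines are never closer than about 30%.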

Even though I agree with the idea of using a natively provided API as much as possible, it needs to be practical and on par in performance.

After playing with OpenSSL 1.1.0h and the AF_ALG engine the conclusion is that it will not be a good idea (in its current state) to use this implementation. The "official" AFALG engine is very limited compared to the original (unsupported) AFALG engine for 0.9.x

The fact that we need to compile the kernel with AIO (which is disabled by default) makes it impossible to have this as an "opkg loadable" module. This is a "no-no" for most users, who don't compile from source.

Performance is not even close for smaller blocks compared with cryptodev and most users would want to use it for OpenVPN which limits blocks to the MTU size (<1500). Unless you use "jumbo-frames" inside the tunnels, the cryptodev option is realistically the only way to go.
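
To put the MTU argument in numbers, the sketch below picks the largest benchmarked block size that fits in a packet and ranks the three back ends at that size. The figures are the aes-128-cbc results from the HH5a runs earlier in the thread; using the nearest smaller column as a stand-in for a real packet is a simplifying assumption, and the helper names are made up.

```python
BLOCK_SIZES = [16, 64, 256, 1024, 8192, 16384]
# aes-128-cbc throughput on the HH5a in 1000s of bytes per second.
RESULTS_K = {
    "software": [5844.07, 6527.51, 6730.33, 6782.63, 6793.90, 6777.51],
    "cryptodev": [1006.32, 3708.12, 11407.79, 23379.97, 32093.53, 33057.45],
    "afalg": [151.67, 600.19, 2247.51, 7308.29, 19901.10, 22446.08],
}


def column_for(packet_bytes):
    """Index of the largest benchmarked block size not exceeding the packet size."""
    eligible = [i for i, size in enumerate(BLOCK_SIZES) if size <= packet_bytes]
    if not eligible:
        raise ValueError("packet smaller than the smallest benchmarked block")
    return eligible[-1]


def rank_for_packet(packet_bytes):
    """Back ends sorted from fastest to slowest at the matching block size."""
    col = column_for(packet_bytes)
    rows = ((name, row[col]) for name, row in RESULTS_K.items())
    return sorted(rows, key=lambda item: item[1], reverse=True)


for name, value in rank_for_packet(1500):
    print(f"{name:>9}: {value:.2f}k at {BLOCK_SIZES[column_for(1500)]} bytes")
```

At MTU-sized packets cryptodev is roughly three times faster than afalg, and afalg is only slightly ahead of plain software.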

After playing with OpenSSL 1.1.0h and the AF_ALG engine the conclusion is that it will not be a good idea (in its current state) to use this implementation. The "official" AFALG engine is very limited compared to the original (unsupported) AFALG engine for 0.9.x

I agree with this, although I think you should say cryptodev instead of AFALG engine for 0.9.x.

Performance is not even close for smaller blocks compared with cryptodev and most users would want to use it for OpenVPN which limits blocks to the MTU size (<1500). Unless you use "jumbo-frames" inside the tunnels, the cryptodev option is realistically the only way to go.

According to the latest benchmark by @mpa, I agree with this.

I hope to report soon a benchmark using OpenWrt 18.06 with the latest OpenSSL 1.0.1g compiled as described here by @mpa, over a fast Internet connection (50-60 Mbit/s) instead of my ADSL2 (12-14 Mbit/s). I tried protonVPN with their official client and a paid premium account, and I achieved about 30-40 Mbit/s with my i7-3537U.
After this we should have a clear picture. In my opinion, after the various benchmarks, the CPU of the BT Home Hub 5A is not the limiting factor, since with AES-256 it could achieve at least 4526.87k ~ 4.5 MByte/s ~ 36 Mbit/s without offload, and more with offloading via cryptodev.