Hardware crypto support for PacketEngine-IP-93 (EIP-93) on MTK7621

I understand the complexity, but maybe someone can. My skills are too low to complete it at this point.

I'm not sure if this is the same thing, but @blogic seems to be working on this in his staging tree:

https://git.lede-project.org/?p=lede/blogic/staging.git;a=shortlog;h=refs/heads/mt7530-dsa-performance

That code from MediaTek is for the EIP-97. It targets the ARM-based MT7623 and others. We need ramips and the EIP-93.

In addition, this probably requires NDA documentation from MediaTek along with a skilled developer. In other words, it won't happen.

MediaTek has moved on to ARM for which they have upstream code for their new crypto engine.

Nonsense. All the documentation is available, along with plenty of source examples from MediaTek.

Here is the MediaTek EIP-93 crypto driver.

Maybe someone can merge it into OpenWrt?

Tons of docs

No.

That thing is not only incomplete, it currently doesn't work properly.

I'm bumping up this old topic!

Social Isolation is at least good for one thing: I have time to look at my code again.

I have it working for simple ciphers (DES / 3DES / AES) and full authentication: hmac(x), cipher(y).

The ciphers work from userspace with the OpenSSL devcrypto and afalg engines, but performance is actually worse than software, so it's not yet useful (I have to rethink the logic).

For IPsec I am now able to push up to 125 Mbps with aes-256-cbc-hmac-sha256 in my local test setup.

Using it for LUKS encryption improves performance by a factor of 2 for LUKS v1 (due to its 512-byte sector size) and a factor of 3 with LUKS v2 using a sector size of 4096 (vs 512).

Feedback is appreciated.

Disclaimer: it's still a work in progress, and even though I did a lot of tests myself, do proper testing with LUKS before storing valuable data.


This is entirely unsurprising.

I have some experience porting hardware crypto engines to OpenWrt, having done it for Intel's C2000 series SoCs (C2558) and C3000 series SoCs (C3758). These are the processors known as Rangeley and Denverton, and they are substantially more powerful than the one you refer to here.

There are a number of caveats with regard to performance of these accelerators.

To get the best performance, the application needs to be aware that it is offloading to an accelerator. Essentially nothing you find on OpenWrt is aware of this, so it is unlikely to benefit from the accelerator.

The reason for this is simple: bus transfer time. Data from a memory buffer needs to be transferred across the lookaside PCIe bus to the accelerator, the accelerator performs the operation on the buffer, and the results are then transferred back. Granted, this is typically accomplished by DMA on contiguous pinned memory blocks, but there is still a substantial penalty for this transfer over the bus compared to just operating on the buffer with CPU instructions.

For single, synchronous requests, it is slower than performing the operation in software using the CPU instructions.

This state of affairs is aggravated by smaller buffer sizes, where the data transfer time occupies a higher proportion of the overall execution time. The larger the buffer, the more the application benefits from the accelerator. Typical network applications use small buffer sizes, and many are unlikely to be able to operate asynchronously, so they are likely to get worse performance by offloading.

The Intel QuickAssist devices do not even attempt to offload buffers under 2K in size, and even at 4K buffers, performance with chained symmetric ciphers is barely on par with a pure software implementation.

OpenSSL 1.1.0 brought in the notion of asynchronous operations, where the application can offload multiple non-blocking requests to the acceleration device, allowing the device to process them all in parallel. This is where the performance gap between software/CPU implementations and a hardware accelerator becomes really substantial.

For example, look at the performance of the C3758 hardware crypto on RSA below. The first two are purely synchronous operations, one in software and one in hardware. You'll note that the offloaded one is actually slower than software on the faster verify operation, while being faster on the much more computationally expensive signing operation which benefits more from the hardware offload.

root@OpenWrt:~# openssl speed -elapsed rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 3684 2048 bits private RSA's in 10.01s
Doing 2048 bits public rsa's for 10s: 127429 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1f  31 Mar 2020
built on: Mon Apr 13 15:35:50 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.002717s 0.000078s    368.0  12742.9

root@OpenWrt:~# openssl speed -elapsed -engine qat rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 12105 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 71326 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1f  31 Mar 2020
built on: Mon Apr 13 15:35:50 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000826s 0.000140s   1210.5   7132.6

For chained symmetric ciphers, synchronous operation on this platform is always slower on the crypto accelerator than using AESNI on the CPU. Even asynchronously, the chained symmetric ciphers are only faster on buffer sizes > 8K, which is much larger than the typical packet size and really only something likely to be encountered in file-based IO on an encrypted file system or NAS application.

Below are the results for asynchronous operation, so you can see the difference it makes when the application is aware of the accelerator.

root@OpenWrt:~# openssl speed -elapsed -async_jobs 72 rsa2048
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 3671 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 127503 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1f  31 Mar 2020
built on: Mon Apr 13 15:35:50 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.002724s 0.000078s    367.1  12750.3


root@OpenWrt:~# openssl speed -elapsed -engine qat -async_jobs 72 rsa2048
engine "qat" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 93879 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 743979 2048 bits public RSA's in 10.00s
OpenSSL 1.1.1f  31 Mar 2020
built on: Mon Apr 13 15:35:50 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: ccache_cc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,now -Wl,-z,relro -O3 -fpic -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000107s 0.000013s   9387.9  74397.9

Something like nginx modified to be able to use asynchronous operations can benefit from an accelerator (Intel has patches for nginx that allow this). OpenVPN or IPsec most likely cannot yield a huge performance improvement unless passing a high volume of data, even when doing so asynchronously.

So, in summary: unless the application is optimized for a hardware accelerator by making asynchronous OpenSSL calls, or by using the accelerator directly with asynchronous calls, it's likely to see worse performance than a software implementation.

I am recompiling now to add the "async_jobs" feature to OpenSSL. It needs glibc instead of musl, so that will take a bit.

I can't claim a "nice" benchmark with "openssl speed -elapsed -evp" yet; the numbers below mostly just make me smile because I managed to get data through the hardware engine at all.

EIP-93:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc        426.45k     1662.25k     5944.05k    17304.58k    37937.15k    40867.47k

Software:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc       7293.05k     8612.87k     9069.23k     9151.36k     9207.81k     9207.81k

But I like this much better:
EIP-93:

IPSec AES-256-CBC HMAC(SHA256):
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-14.92  sec   221 MBytes   124 Mbits/sec    0             sender
[  5]   0.00-15.03  sec   222 MBytes   124 Mbits/sec                  receiver

Software:

IPSec AES-256-CBC HMAC(SHA256):
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-14.79  sec  67.3 MBytes  38.2 Mbits/sec    0             sender
[  5]   0.00-15.01  sec  68.3 MBytes  38.2 Mbits/sec                  receiver

As soon as I figure out how to make the interrupt handler use multiple cores, performance should increase. I might even drop interrupts altogether and just poll the engine, since I know a packet will come back as soon as I send one to the engine, especially since for network use the packets will be small anyway. That way we could use "pcrypt" to balance the work across all CPUs. Right now that only works for the "send-to-engine" part; the "receive-from-engine" part is still single-core (and performance-limiting) because of the way the interrupts are handled (not balanced).

The interrupts portion is thankfully quite easy. You can do it manually or automatically with irqbalance. Run cat /proc/interrupts to see which interrupts your device uses. For example, for my eth0:

root@OpenWrt:/etc/init.d# cat /proc/interrupts | grep eth0
 41:    2107299     415060     442496     269255      24707     254238     478668     625693   PCI-MSI 3145728-edge      eth0-TxRx-0
 42:     158626     834108     114461      94939      51730     104343     105596     260116   PCI-MSI 3145729-edge      eth0-TxRx-1
 43:     504040     103078    1438630     132312     218477     314344     155468     287855   PCI-MSI 3145730-edge      eth0-TxRx-2
 44:     335546     495946      86755     956176      42342     120709     228766     152050   PCI-MSI 3145731-edge      eth0-TxRx-3
 45:     600930     150127     111721      54430    1123279     160699     286036      75234   PCI-MSI 3145732-edge      eth0-TxRx-4
 46:     302096     128024     446581      81960     223405    1342997     285184      82005   PCI-MSI 3145733-edge      eth0-TxRx-5
 47:     243778     114572     259718      54426      40344      36679     862051     153350   PCI-MSI 3145734-edge      eth0-TxRx-6
 48:      62797      16831      20145      12909       3085      19370      54193     470426   PCI-MSI 3145735-edge      eth0-TxRx-7
 49:          3          1          2          1          0          0          1          0   PCI-MSI 3145736-edge      eth0

If you want to set the affinity manually, do something like the following:

echo 1 > /proc/irq/41/smp_affinity
echo 2 > /proc/irq/42/smp_affinity
echo 4 > /proc/irq/43/smp_affinity
echo 8 > /proc/irq/44/smp_affinity

echo 10 > /proc/irq/45/smp_affinity
echo 20 > /proc/irq/46/smp_affinity
echo 40 > /proc/irq/47/smp_affinity
echo 80 > /proc/irq/48/smp_affinity

The numbers you echo to /proc are just bit masks (in hex).

            Binary       Hex 
    CPU 0    0001         1 
    CPU 1    0010         2
    CPU 2    0100         4
    CPU 3    1000         8

That said, irqbalance does a fair job of balancing the load, so unless you can be sure your manual settings are better, you should just use irqbalance.

With regard to the QAT crypto engine implementation you did:

What percentage of the CPU is used?

e.g.

time -v openssl speed -elapsed -evp aes-256-cbc -engine qat

or

time -v openssl speed -elapsed -evp aes-256-cbc -engine qat -multi 8

I disabled symmetric cipher offload in the engine for the reasons mentioned above: the software implementation using AES-NI is always faster. Even when not offloading, the symmetric ciphers perform worse with them enabled in the engine, since its polling still adds overhead.

But regardless, if I do the test with rsa or anything else, in both software and engine cases it will max out the core it's running on.

If you don't specify a -multi n parameter, it will max out one core. If you specify -multi 4, it will max out 4 cores.

Since it's trying in both cases to find the maximum possible throughput, it follows that it will max out the cpu core it's running on.

OK, I see. The same happens with AES-NI (it maxes out the cores).

I used, for example, the Kirkwood crypto engine on the older Pogo E02 and Mobile 4, which did not max out the CPU. But obviously that crypto performance is very slow compared to today's standards.

I also got some older crypto cards I bought on eBay working with OpenSSL on Fedora Core 19:
Silicom PESC62 - Nitrox Engine

Would using just some of the cores on modern Intel CPUs, for example, be the way to go?

It really depends on your use case and what you want to achieve. I disabled symmetric cipher offload because a network router is encrypting mainly small buffers, which are more efficient to handle in software for the reasons above.

If I were doing this on a NAS box, I would enable symmetric cipher offload.

Can you describe what you're trying to achieve?

IPv6 Network Encryption via CJDNS and Yggdrasil

Unfortunately it's not as simple as setting smp_affinity, and the same goes for using irqbalance. By default smp_affinity is set to "F" on a 4-core CPU, meaning the IRQ should be able to use any of the 4 cores. In reality, leaving it like that means CPU-0 will always run the interrupt handler and, subsequently, any tasklet or workqueue. Setting the affinity to "2" shifts ALL the interrupts (for that IRQ) to CPU-1. Setting it to "6" will not spread them across CPU-1 and CPU-2; if it did, we could just leave everything at "F" and have the system balance it for us.

irqbalance is "just a script" that adjusts the affinity based on heavy usage, moving IRQs to different cores.

Anyway, I will experiment a little with the IRQF_PERCPU flag; I'm not sure how it works or whether it even does anything in regard to "queue_work". Otherwise I will have to find some way to call "queue_work_on" in a round-robin fashion from within the top half of the handler. I tried it in combination with "pcrypt", but that only works up until the interrupt, so I need to store the submitting CPU somewhere and do a queue_work_on(stored_cpu, ...).

It's interesting to see how your interrupts are distributed; I would expect something like TxRx-0 on CPU-0, TxRx-1 on CPU-1, etc. I would guess that the underlying hardware is a bit more sophisticated in terms of interrupt handling.

You're probably better off using AES-NI and making sure the interrupts on the network card are balanced properly across the different cores.

See the irqbalance comments above. Your use case is encryption of small buffers and the penalty for offloading it over the lookaside bus will probably be too high to justify. That is, if you're using a multi-core x86_64.

If the device has a much less capable CPU then the encryption offload may be viable.


You can use a mask that allows it to work across multiple cores. Generally, though, you'd want one interrupt to be processed on one core and not different cores. That's how irqbalance assigns it: one interrupt to a single core.

That was just a random example I wrote for the post. They're not actually distributed like that in practice.

I've been fiddling with irqbalance for the last few hours, so the /proc/interrupts output above is what irqbalance has been doing. In practice, my manual affinity assignment is just as you suggest.

Purely on performance, I'm not debating it: we should all use some little "NUC" kind of device based on Intel with at least the AES-NI instruction set.

This is about me developing a crypto driver for the EIP-93, which is incorporated in the MediaTek MT7621 (ramips target). And I know that the Ubiquiti ER-X and the MikroTik hEX RB750GR3 run their own proprietary OSes with IPsec-only hardware offload. Those devices, with the same MT7621 SoC, are doing 300+ Mbps with IPsec. According to some guy who did benchmarking, even the ER-X maxed out at around 125 Mbps, and according to him that was because it was only using one core for "upload".