IPQ806x NSS Drivers

I guess the enterprise firmware has more 'enterprise' features, like better QoS controls, SNMP, and support for more wireless clients with better roaming. The routers we buy off the shelf likely contain the retail firmware; the enterprise variant is probably meant for routers or access points sold for enterprise deployment.

Hi folks,

I have some good news, some bad news and an interesting observation. The good news is that I somehow managed to get the NSS core to accelerate OpenSSL functions. The bad news is that it is not stable, and the CFI driver throws a lot of errors; I had to disable it to get it to produce some meaningful results.

And the interesting observation is that the built-in Linux kernel crypto modules are surprisingly efficient!

Here are my results.

This one uses pure OpenSSL crypto code:

*** OpenSSL ***
root@LEDE:~# openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 9084467 aes-256 cbc's in 2.97s
Doing aes-256 cbc for 3s on 64 size blocks: 2651418 aes-256 cbc's in 2.98s
Doing aes-256 cbc for 3s on 256 size blocks: 693608 aes-256 cbc's in 2.98s
Doing aes-256 cbc for 3s on 1024 size blocks: 175730 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 21896 aes-256 cbc's in 3.00s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc      48939.89k    56943.21k    59585.12k    59982.51k    59790.68k
aes-256 cbc      50250.13k    56954.41k    59727.39k    60135.08k    60182.43k
aes-256 cbc      49473.95k    57152.47k    57030.01k    60355.38k    59995.48k

One CPU core @ 100% usr

This one uses the Linux kernel's built-in crypto code, with cryptodev loaded but without the qca-nss-cfi-cryptoapi module:

*** Linux Kernel Crypto ***
root@LEDE:~# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 268484 aes-256-cbc's in 0.31s
Doing aes-256-cbc for 3s on 64 size blocks: 252034 aes-256-cbc's in 0.15s
Doing aes-256-cbc for 3s on 256 size blocks: 172118 aes-256-cbc's in 0.14s
Doing aes-256-cbc for 3s on 1024 size blocks: 66565 aes-256-cbc's in 0.10s
Doing aes-256-cbc for 3s on 8192 size blocks: 11014 aes-256-cbc's in 0.01s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc      13857.24k   107534.51k   314730.06k   681625.60k  9022668.80k
aes-256-cbc      20899.89k   132489.60k   269924.80k   488579.66k  3000729.60k
aes-256-cbc      20820.04k    53892.80k   306914.74k   521436.55k  4491673.60k

One CPU core @ 100% sys
Throughput is higher if the second CPU core is used.

This one uses the NSS cores for crypto acceleration:

*** NSS Crypto ***
root@LEDE:~# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 12988 aes-256-cbc's in 0.09s
Doing aes-256-cbc for 3s on 64 size blocks: 13787 aes-256-cbc's in 0.05s
Doing aes-256-cbc for 3s on 256 size blocks: 11594 aes-256-cbc's in 0.09s
Doing aes-256-cbc for 3s on 1024 size blocks: 12825 aes-256-cbc's in 0.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 8507 aes-256-cbc's in 0.07s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc       2308.98k    17647.36k    32978.49k         infk   995562.06k
aes-256-cbc       2715.40k    16491.52k    63733.76k   153322.06k  9087385.60k
aes-256-cbc       3135.31k     8621.44k   118784.00k   274206.72k  4647321.60k

One CPU core @ 20% sys, 20% sirq

All tests were run three times.

Even when using the Linux crypto modules, OpenSSL sometimes gets no results for the 1024- and/or 8192-byte tests.

From the above results, it looks like the OpenSSL crypto code is really unoptimised.

The Linux kernel module results are the best, but they use far more CPU than the NSS cores. I guess this is to be expected: the Krait CPUs run at 1,700 MHz with the NEON instruction set, while the NSS cores run at 800 MHz, probably without any SIMD instructions, although the NSS cores have much less overhead.

It would be great if we had crypto experts here to help out, though. I'm way out of my depth here.

Meanwhile, I guess I have to study how these things work and hopefully make it stable.


Please add the -elapsed parameter when testing (e.g. openssl speed -evp aes-256-cbc -elapsed), or the test results could be inaccurate.

No wonder the figures from my tests look out of this world. Will test again and post results.

You've done it! :smiley:
It makes me happy to see the work being done. I'll try to carve out some family time and help with the testing.
Best!

Guess I’m lucky :grimacing:

Back to struggling with the crypto acceleration, which is still my primary objective. I think I’m close ... though this time I will probably have to write my own driver, referencing what I can find from Qualcomm and the Linux kernel drivers. The existing crypto drivers from CodeAurora look to me like proof-of-concept drivers that can only encrypt one ‘packet’ at a time.

The numbers I posted earlier are incorrect. At the moment there is no effective driver that uses the NSS cores to accelerate crypto functions for userland, or even for kernel functionality. If there are crypto experts interested in helping, that would really accelerate this effort.

I doubt that

Please check /proc/crypto and /proc/interrupts to make sure that hardware crypto is being used

I meant that in the context of using the NSS cores for crypto acceleration. I got the Qualcomm drivers working, but they are limited to a single scatterlist entry. The QCA driver takes over when encrypting payloads smaller than 2 KiB. Larger payloads make OpenSSL (which is what I'm using to test at the moment) or the Linux kernel chain the payload across multiple scatterlist entries (destination, source, or both), and the QCA driver refuses to process the request once either the destination or the source scatterlist has more than one entry.
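
Roughly, the limitation looks like the guard sketched below. This is my paraphrase for illustration, not the actual CodeAurora source, and nss_cfi_can_handle() is a made-up name.

/* Paraphrase of the kind of guard the CFI driver applies -- not the
 * actual CodeAurora source; hypothetical helper for illustration. */
#include <linux/crypto.h>
#include <linux/scatterlist.h>

static bool nss_cfi_can_handle(struct ablkcipher_request *req)
{
        /* refuse any request whose source or destination payload is
         * chained across more than one scatterlist entry */
        return sg_nents(req->src) == 1 && sg_nents(req->dst) == 1;
}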

I don't really understand how the scatterlist mechanism works at the moment. My understanding of AES encryption (concentrating on this algorithm first) is that the encrypted payload has the same size as the plaintext, but when I checked the scatterlist structures passed via the crypto APIs to the QCA driver, the source and destination have different sizes. That means I cannot cipher from source to destination in place, which means complicated memory manipulation.

I may be completely off-base here. If you can educate me further I'll be eternally grateful :sweat_smile:

Maybe you can ask @drbrains for help. He has written crypto drivers for the MT76x8 and MT7621 from scratch.


Thanks @LGA1150!

The link you shared to the work done by @drbrains gave me an idea of how to proceed now.

Will see if it works!


The aes-engine for the MT76x8 is more or less finished. The MT7621 driver on my GitHub is at a very early stage, but the cipher parts work, with very bad performance.

I'm not an expert myself, but maybe I have already run into some of the problems you are having. Maybe we can help each other.


More than happy to share learning points with you. At the moment, I'm still trying to learn both the Linux crypto subsystem and the IPQ806x NSS cores' crypto framework, so my progress will be rather slow.

On my side, the hard part has already been done by the Qualcomm team. I just need to figure out how to make it work with multiple scatterlist segments in the same request. Your code that merges all the segments showed me how it can be done for the NSS cores as well, so I'll be trying that out. I will share the results once I get it working correctly.

The way I'm planning to do it is to feed the scatterlist segments one at a time to the hardware crypto engine, probably much like what you did for the MT hardware crypto driver, if I read your code correctly. This will likely generate too many interrupts, which will likely kill throughput. Anyway, once I get multi-segment processing working correctly, I'll see whether the performance is good or bad.
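
In rough pseudocode, the plan looks like this. It is a sketch only; nss_submit_one() and cbc_carry_iv() are hypothetical placeholders for the engine submission and IV hand-over steps, not real driver calls.

/* Rough sketch of the plan, not real driver code: submit each
 * scatterlist segment as its own transform and wait for its
 * completion interrupt before moving on. */
#include <linux/completion.h>
#include <linux/scatterlist.h>
#include <linux/types.h>

static int sketch_walk_segments(struct scatterlist *src, int nents, u8 *iv)
{
        struct scatterlist *sg;
        int i, ret;

        for_each_sg(src, sg, nents, i) {
                DECLARE_COMPLETION_ONSTACK(done);

                ret = nss_submit_one(sg, iv, &done); /* placeholder */
                if (ret)
                        return ret;

                /* one interrupt per segment -- the overhead I expect
                 * to hurt throughput */
                wait_for_completion(&done);

                cbc_carry_iv(sg, iv); /* placeholder: last ciphertext
                                         block becomes the next IV */
        }
        return 0;
}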

Do you have a link to the code? This SoC seems more powerful and should very likely be able to do scatter/gather in hardware. The EIP93 inside the MT7621 can't do that. The MT76x8 can, and I used it as a testbed to see if I could make it work without hardware scatter/gather. That code will be replaced soon.

As for interrupts: the MT76x8, MT7621 and the newer (ARM rather than MIPS) MT7623 support "delayed" interrupts, meaning the hardware only raises an interrupt after a certain number of "packets" have been processed. They usually have a timer-based timeout as well, in case that number is not reached in time.

I'm referencing the drivers from the following repo, which contains the ablkcipher driver:

https://source.codeaurora.org/quic/qsdk/oss/lklm/nss-cfi/tree/cryptoapi/v2.1?h=NHSS.QSDK.6.1.r1

It depends on the NSS crypto core driver from:

https://source.codeaurora.org/quic/qsdk/oss/lklm/nss-crypto/tree/v1.0?h=NHSS.QSDK.6.1.r1

The benchmarking tool provided with the NSS crypto driver shows a peak AES-128 bandwidth of around 500 Mbps, averaging between 300 and 400 Mbps, all running within kernel memory space.

The problem at the moment is that the drivers I found only process one scatterlist segment at a time. Combing through the NSS crypto driver source code suggests that the IPQ806x NSS cores/firmware only process one buffer at a time.

Reading later versions of the QSDK sources suggests that the IPQ807x SoC can perform scatter/gather in hardware, or rather in the NSS firmware. Hopefully I'm wrong.

I had a (very) quick look. As stated in the comments in the code, only scatterlists with one segment are allowed. Possible workarounds:

Copy the entire scatterlist to a buffer, do the transformation, and copy the resulting buffer back to the scatterlist. The larger the buffer needed, the more overhead. Given that most packets are around 1,500 bytes (MTU), or maybe 4 KiB when using LUKS with large sectors, you have to think about what you want to achieve.

Alternatively, create a transformation for each scatterlist segment. Keep in mind that the "source" and "destination" scatterlists might not be segmented the same way; this is what I'm trying to correct in my code. The only question (and I didn't look long enough at your code for this) is whether you can use the IV from the previous transformation for the next one. This is not the same as programming a "new" IV.

Most solutions I've looked at for different hardware queue in software, meaning they wait for each full request to complete before dequeueing the next. In this case you could use the "operation complete" interrupt, which fires when the hardware has no more transformations to complete.

The first option is the easiest to implement: copy the scatterlist to a buffer, create a single-segment "scatterlist" from that buffer, and pass it to the rest of the driver. Combine that with allowing only one request in the engine at a time and you could even pre-allocate the buffer instead of dynamically allocating/freeing memory.
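
A minimal sketch of that first option, using the kernel's scatterlist helpers; qca_run_single_sg() is a stand-in for the existing single-segment driver path, not a real function.

/* Sketch of option 1: linearize, transform, scatter back. */
#include <linux/scatterlist.h>
#include <linux/slab.h>

static int sketch_linearize_and_run(struct scatterlist *src,
                                    struct scatterlist *dst,
                                    unsigned int len)
{
        struct scatterlist tmp;
        void *buf;
        int ret;

        buf = kmalloc(len, GFP_KERNEL); /* or a pre-allocated buffer */
        if (!buf)
                return -ENOMEM;

        /* gather all source segments into one linear buffer */
        sg_copy_to_buffer(src, sg_nents(src), buf, len);

        /* wrap the buffer in a single-entry scatterlist and hand it
         * to the single-segment driver path */
        sg_init_one(&tmp, buf, len);
        ret = qca_run_single_sg(&tmp, &tmp, len); /* stand-in */

        /* scatter the result back out to the destination list */
        if (!ret)
                sg_copy_from_buffer(dst, sg_nents(dst), buf, len);

        kfree(buf);
        return ret;
}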

I’m thinking along the same lines; I got the idea from reading your MT driver. Using the big-buffer approach would probably negate the speed advantage and would be a waste of resources. I’m aware that I need to take the last block of the cipher result as the IV for the next segment. Doing CBC first, then I’ll tackle CTR once I get CBC to work.
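
The IV hand-over I have in mind is roughly this. A sketch only: scatterwalk_map_and_copy() is the kernel helper, but the wrapper around it is made up.

/* Sketch: after a segment is encrypted, its last AES block becomes
 * the IV for the next segment (CBC chaining across segments). */
#include <crypto/aes.h>
#include <crypto/scatterwalk.h>
#include <linux/scatterlist.h>

static void sketch_next_cbc_iv(struct scatterlist *ct_sg,
                               unsigned int seg_len, u8 *iv)
{
        /* copy the final ciphertext block of this segment out of the
         * scatterlist into the IV buffer for the next submission */
        scatterwalk_map_and_copy(iv, ct_sg,
                                 seg_len - AES_BLOCK_SIZE,
                                 AES_BLOCK_SIZE, 0);
}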

Thanks!

I managed to get the driver to work, using the multi-segment scatterlist approach, but the performance is terrible.

The stats below are using the Krait CPU:

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      59750.90k    77232.73k    87382.27k    89440.60k    89877.16k

The stats below are using the NSS core:

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc         65.99k      261.01k     1038.68k     2138.11k    11141.12k

Tested using the following command:

openssl speed -evp aes-128-cbc -elapsed

I suspect the issue is due to my code allocating and freeing kernel memory on every pass. I'll try to pre-allocate the memory and see how it goes. Only doing AES-CBC encryption for now, though.
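
What I mean by pre-allocating, as a rough sketch; SKETCH_MAX_LEN and struct sketch_ctx are invented for illustration.

/* Sketch: allocate the bounce buffer once per tfm instead of
 * kmalloc/kfree on every request. */
#include <linux/crypto.h>
#include <linux/slab.h>

#define SKETCH_MAX_LEN 16384 /* largest request we expect to see */

struct sketch_ctx {
        void *bounce; /* reused by every request on this tfm */
};

static int sketch_cra_init(struct crypto_tfm *tfm)
{
        struct sketch_ctx *ctx = crypto_tfm_ctx(tfm);

        ctx->bounce = kmalloc(SKETCH_MAX_LEN, GFP_KERNEL);
        return ctx->bounce ? 0 : -ENOMEM;
}

static void sketch_cra_exit(struct crypto_tfm *tfm)
{
        struct sketch_ctx *ctx = crypto_tfm_ctx(tfm);

        kfree(ctx->bounce);
}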

You may refer to some already in-tree crypto drivers, such as the Marvell CESA driver.

My assumption about memory allocation slowing it down was incorrect. As a quick test, I used a pre-allocated buffer for a single user only, i.e. without kernel memory allocation. Same result.

Looks like I need to study the NSS crypto driver to understand in more detail where the bottleneck might be.

Check the frequency of the NSS core.