I guess the enterprise firmware has more 'enterprise' features, like better QoS controls, SNMP support, or support for more wireless clients with better roaming. The routers we buy off the shelf likely contain the retail firmware; the enterprise type is probably for routers or access points meant for enterprise deployment.
Hi folks,
I have some good news, some bad news, and an interesting observation. The good news is that I managed to somehow get the NSS core to accelerate OpenSSL functions. The bad news is that it is not stable, and the CFI driver is throwing a lot of errors, so I had to disable it to get meaningful results.
And the interesting observation is that the built-in Linux kernel crypto modules are surprisingly efficient!
Here are my results.
This one is using purely OpenSSL crypto code:
*** OpenSSL ***
root@LEDE:~# openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 9084467 aes-256 cbc's in 2.97s
Doing aes-256 cbc for 3s on 64 size blocks: 2651418 aes-256 cbc's in 2.98s
Doing aes-256 cbc for 3s on 256 size blocks: 693608 aes-256 cbc's in 2.98s
Doing aes-256 cbc for 3s on 1024 size blocks: 175730 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 21896 aes-256 cbc's in 3.00s
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256 cbc 48939.89k 56943.21k 59585.12k 59982.51k 59790.68k
aes-256 cbc 50250.13k 56954.41k 59727.39k 60135.08k 60182.43k
aes-256 cbc 49473.95k 57152.47k 57030.01k 60355.38k 59995.48k
One CPU core @ 100% usr
This one is using the Linux kernel's built-in crypto code via cryptodev, but without the qca-nss-cfi-cryptoapi module loaded:
*** Linux Kernel Crypto ***
root@LEDE:~# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 268484 aes-256-cbc's in 0.31s
Doing aes-256-cbc for 3s on 64 size blocks: 252034 aes-256-cbc's in 0.15s
Doing aes-256-cbc for 3s on 256 size blocks: 172118 aes-256-cbc's in 0.14s
Doing aes-256-cbc for 3s on 1024 size blocks: 66565 aes-256-cbc's in 0.10s
Doing aes-256-cbc for 3s on 8192 size blocks: 11014 aes-256-cbc's in 0.01s
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256-cbc 13857.24k 107534.51k 314730.06k 681625.60k 9022668.80k
aes-256-cbc 20899.89k 132489.60k 269924.80k 488579.66k 3000729.60k
aes-256-cbc 20820.04k 53892.80k 306914.74k 521436.55k 4491673.60k
One CPU core @ 100% sys
Higher throughput if CPU core 2 is used.
This one is using the NSS cores for crypto acceleration:
*** NSS Crypto ***
root@LEDE:~# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 12988 aes-256-cbc's in 0.09s
Doing aes-256-cbc for 3s on 64 size blocks: 13787 aes-256-cbc's in 0.05s
Doing aes-256-cbc for 3s on 256 size blocks: 11594 aes-256-cbc's in 0.09s
Doing aes-256-cbc for 3s on 1024 size blocks: 12825 aes-256-cbc's in 0.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 8507 aes-256-cbc's in 0.07s
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256-cbc 2308.98k 17647.36k 32978.49k infk 995562.06k
aes-256-cbc 2715.40k 16491.52k 63733.76k 153322.06k 9087385.60k
aes-256-cbc 3135.31k 8621.44k 118784.00k 274206.72k 4647321.60k
One CPU core @ 20% sys, 20% sirq
All tests were run 3 times.
Even when using the Linux crypto modules, OpenSSL sometimes doesn't produce any results for the 1024- and/or 8192-byte tests.
From the above results, it looks like the OpenSSL crypto code is really unoptimised.
The Linux kernel module results are the best, but use very high CPU resources compared to the NSS cores. I guess this is to be expected, as the Krait CPUs run at 1,700 MHz with the NEON instruction set while the NSS cores run at 800 MHz, probably without any SIMD instruction set, although the NSS cores have much less overhead.
It would be great if we had crypto experts here to help out, though. I'm way out of my depth here.
Meanwhile, I guess I have to study how these things work and hopefully make it stable.
Please add the -elapsed parameter when testing, or the test results could be inaccurate.
No wonder the figures from my tests look out of this world. Will test again and post results.
You've done it!
It makes me happy to see the work done. I'll try to cut some family time and help with the testing.
Best!
Guess I’m lucky
Back to struggling with the crypto acceleration, which is still my primary objective. Think I’m close ... though this time I’ll probably have to write my own driver, referencing what I can find from Qualcomm and the Linux kernel drivers. The existing crypto drivers from CodeAurora look to me like proof-of-concept drivers that can only encrypt one ‘packet’ at a time.
The numbers I posted earlier are incorrect. At the moment there’s no effective driver to accelerate crypto functions for userland, or even kernel functionality, using the NSS cores. If there are crypto experts interested in helping, that would really accelerate this effort.
I doubt that
Please check /proc/crypto and /proc/interrupts to make sure that hardware crypto is being used.
I meant that in the context of using the NSS cores for crypto acceleration. I got the Qualcomm drivers working, but they are limited to a single scatterlist page. The QCA driver takes over when encrypting payloads smaller than 2 KiB. For larger payloads, OpenSSL (which is what I'm using to test at the moment) or the Linux kernel chains the payload into multiple scatterlist pages (destination, source, or both), and the QCA driver refuses to process the request once either the destination or source scatterlist has more than one segment.
I don't really understand how the scatterlist mechanism works at the moment. My understanding of AES encryption (concentrating on this algorithm first) is that the encrypted payload has the same size as the plaintext, but when I check the scatterlist structures passed via the crypto APIs to the QCA driver, the source and destination have different segment sizes. That means I cannot cipher from source to destination in place, which means complicated memory manipulation.
I may be completely off-base here. If you can educate me further I'll be eternally grateful
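A way to reason about this: for AES-CBC the ciphertext is indeed the same size as the plaintext, so the source and destination scatterlists must describe the same *total* number of bytes even when their segment boundaries differ; it is only per-segment in-place processing that breaks down. A toy user-space sketch, with a plain struct standing in for `struct scatterlist` (which additionally carries a page pointer and offset):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for one scatterlist entry: just a segment length. */
struct seg { size_t len; };

/* Total payload described by a segment list. */
static size_t total_len(const struct seg *segs, size_t nsegs)
{
    size_t sum = 0;
    for (size_t i = 0; i < nsegs; i++)
        sum += segs[i].len;
    return sum;
}
```

So a source list of 1500 + 1500 + 1096 bytes and a destination list with a single 4096-byte segment carry the same payload; what differs is only where the boundaries fall.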
Maybe you can ask @drbrains for help. He has written crypto drivers for the MT76x8 and MT7621 from scratch.
Thanks @LGA1150!
The link you shared to the work done by @drbrains gave me an idea of how to proceed.
Will see if it works!
The aes-engine for the MT76x8 is more or less finished. The MT7621 code on my GitHub is at a very early stage, but the cipher parts work, with very bad performance.
I'm not an expert myself, but maybe I already ran into some problems you are having. Maybe we can help each other.
More than happy to share learning points with you. At the moment, I'm still trying to learn both the Linux crypto subsystem and the IPQ806x NSS cores' crypto framework, so my progress will be rather slow.
For my side, the hard part has already been done by the Qualcomm team. I just need to figure out how to make it work with multiple SGs in the same request. Your code that merged all the SGs showed me how it can be done for the NSS cores as well, so I'll be trying that out. I'll share the results once I get it working correctly.
The way I'm planning to do it is to feed the scatterlist segments one at a time to the hardware crypto engine, probably much like what you've done for the MediaTek hardware crypto driver, if I read your code correctly. This will likely generate too many interrupts, which will likely kill throughput. Anyway, once I get multi-segment processing working correctly, I'll see whether the performance is good or bad.
Do you have a link to the code? This SoC seems more powerful and very likely should be able to do scatter/gather in hardware. The EIP93 inside the MT7621 can't do that. The MT76x8 can; I used it as a testbed to see if I could make things work without hardware scatter/gather. That code will be replaced soon.
As for interrupts: the MT76x8, MT7621 and the newer (ARM vs MIPS) MT7623 can use "delayed" interrupts, meaning the hardware only raises an interrupt after a certain number of "packets" have been processed. They usually have a timer-based timeout as well, in case this number is not reached in time.
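That coalescing scheme can be modelled as simple bookkeeping, in case it helps when looking at the NSS side. A toy user-space sketch; in real hardware the threshold and timeout live in registers, and every name here is made up:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of "delayed" interrupts: raise one interrupt per THRESHOLD
 * completions, with a timer flushing any stragglers. */
#define THRESHOLD 8

struct coalescer {
    unsigned pending;   /* completions since the last interrupt */
};

/* Called per completed packet; returns true when an interrupt should fire. */
static bool on_complete(struct coalescer *c)
{
    if (++c->pending >= THRESHOLD) {
        c->pending = 0;
        return true;
    }
    return false;
}

/* Called when the timeout timer expires; fires only if work is pending. */
static bool on_timeout(struct coalescer *c)
{
    bool fire = c->pending > 0;
    c->pending = 0;
    return fire;
}
```

With a threshold of 8, a burst of 8 completions costs one interrupt instead of eight, which is exactly why this helps throughput on small packets.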
I'm referencing the drivers from the following repo that contains the ablkcipher driver:
https://source.codeaurora.org/quic/qsdk/oss/lklm/nss-cfi/tree/cryptoapi/v2.1?h=NHSS.QSDK.6.1.r1
It depends on the NSS crypto core driver from:
https://source.codeaurora.org/quic/qsdk/oss/lklm/nss-crypto/tree/v1.0?h=NHSS.QSDK.6.1.r1
The benchmarking tool provided with the NSS crypto driver shows a peak AES-128 bandwidth of around 500 Mbps, averaging between 300 and 400 Mbps, all running within kernel memory space.
The problem at the moment is that the drivers I found only process one SG at a time. Combing through the NSS crypto driver source code seems to suggest that the IPQ806x NSS cores/firmware only process one buffer at a time.
Reading later versions of the QSDK sources suggests that the IPQ807x SoC can perform scatter/gather in hardware, or rather in the NSS firmware. Hopefully I'm wrong about the IPQ806x.
I had a (very) quick look. As stated in the comments in the code, only scatterlists with 1 segment are allowed. A workaround could be:
copy the entire scatterlist to a buffer, do the transformation, and copy the resulting buffer back to the scatterlist. The larger the buffer needed, the more overhead. Given that most packets are around 1500 bytes (MTU), or maybe 4k when using LUKS with large sectors, you have to think about what you want to achieve.
Alternatively, create a transformation for each scatterlist segment. Keep in mind that the source and destination scatterlists might not be segmented the same way. This is what I'm trying to correct in my code. The only open question (and I didn't look long enough at your code for this) is whether you can use the IV from the previous transformation for the next. This is not the same as programming a "new" IV.
Most solutions I've looked at for different hardware queue in software, meaning they wait for each full request to complete before dequeueing the next. In this case you could use the "operation complete" interrupt, which is generated when the hardware has no more transformations to complete.
The first option is the easiest to implement: scatterlist to buffer, create a single-segment "scatterlist" from this buffer, and pass that to the rest of the driver. Combine that with one request in the engine at a time, and you could even pre-allocate the buffer instead of dynamically allocating/freeing memory.
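For reference, that gather/transform/scatter round trip is what `sg_copy_to_buffer()`/`sg_copy_from_buffer()` do against a real scatterlist in the kernel. A minimal user-space sketch of the idea, with plain structs standing in for scatterlists:

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a scatterlist entry: a pointer and a length. */
struct seg { unsigned char *buf; size_t len; };

/* Gather all segments into one linear (possibly pre-allocated) buffer,
 * the role sg_copy_to_buffer() plays in the kernel. Returns total bytes. */
static size_t gather(const struct seg *segs, size_t n, unsigned char *lin)
{
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        memcpy(lin + off, segs[i].buf, segs[i].len);
        off += segs[i].len;
    }
    return off;
}

/* Scatter the transformed linear buffer back out to the segments,
 * the role sg_copy_from_buffer() plays in the kernel. */
static void scatter(struct seg *segs, size_t n, const unsigned char *lin)
{
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        memcpy(segs[i].buf, lin + off, segs[i].len);
        off += segs[i].len;
    }
}
```

The cost is two full memcpy passes per request, which is why this only pays off when the hardware speedup outweighs the copy overhead.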
I’m thinking along those lines; I got the idea from reading your MediaTek driver. Using the big-buffer approach would probably negate the speed advantage and would be a waste of resources. I’m aware that I need to use the last block of the cipher result as the IV for the next segment. Doing CBC first, then tackling CTR once I get CBC to work.
Thanks!
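The IV-chaining property being discussed is worth pinning down: in CBC mode, the last ciphertext block of one segment is exactly the IV for the next, so segment-by-segment encryption matches one-shot encryption over the whole buffer. A toy sketch using XOR as a stand-in "block cipher" (not secure, purely to show the chaining):

```c
#include <string.h>
#include <stddef.h>

#define BLK 4  /* toy block size; AES uses 16 */

/* Toy "block cipher": XOR with a fixed key. Stands in for AES solely to
 * demonstrate CBC chaining. */
static void blk_enc(unsigned char *b, const unsigned char *key)
{
    for (int i = 0; i < BLK; i++)
        b[i] ^= key[i];
}

/* CBC-encrypt len bytes (a multiple of BLK) in place. On return, iv holds
 * the LAST ciphertext block, i.e. exactly the IV needed to continue CBC
 * over the next segment. */
static void cbc_enc(unsigned char *data, size_t len,
                    unsigned char *iv, const unsigned char *key)
{
    for (size_t off = 0; off < len; off += BLK) {
        for (int i = 0; i < BLK; i++)
            data[off + i] ^= iv[i];   /* chain with previous ciphertext */
        blk_enc(data + off, key);
        memcpy(iv, data + off, BLK);  /* carry forward for the next block */
    }
}
```

CTR is different: there the counter, not the ciphertext, must be carried across segments, incremented by the number of blocks already processed.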
I managed to get the driver to work, using the multi-segment scatterlist approach, but the performance is terrible.
The stats below are from the Krait CPU:
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 59750.90k 77232.73k 87382.27k 89440.60k 89877.16k
The stats below are from the NSS core:
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 65.99k 261.01k 1038.68k 2138.11k 11141.12k
Tested using the following command:
openssl speed -evp aes-128-cbc -elapsed
I suspect the issue is likely due to my code allocating and freeing kernel memory on every pass. I'll try pre-allocating the memory and see how it goes. Only doing AES-CBC encryption for now, though.
You may refer to some already in-tree crypto drivers, such as the Marvell CESA driver.
My assumption about memory allocation slowing it down was incorrect. As a quick test, I used a preallocated buffer for a single user, i.e. without kernel memory allocation. Same result.
Looks like I need to study the NSS crypto driver to understand in more detail where the bottleneck might be.
Check the frequency of the NSS core.