IPQ806x NSS Drivers


#162

Not with ath10k it isn't, but with proprietary drivers it is. If you get a chance to do some openssl benches with crypto accel that'd be nice.

Note you will need the full openssl lib (the default mbedtls will not have any acceleration).


#163

Tackling this now. It seems that the qca-nss-cfi driver hooks into the kernel to provide crypto services and offloads them to the NSS firmware. The /dev/cryptodev driver and its associated cipher and digest support have to be enabled in the kernel as well. OpenSSL can then use the cryptodev services once hardware support is turned on in OpenSSL.

Will post numbers when I get it working.


#164

Once the crypto acceleration is working for the kernel, maybe the ath10k can be patched to use the kernel crypto services? That should free up the Krait CPUs.


#165

Nice work @quarky .. following your progress.

Unfortunately documentation is scarce. After coming across this:
https://wiki.codeaurora.org/xwiki/bin/QSDK/
I realized that the NHSS 6.1 branch is good for 3.14 kernels, while NHSS 6.1.1 is for 4.4 kernels. Later QSDK revisions (8, 9) are available and their code mentions ipq806x, but they are intended for ipq807x.

Not sure this will be trivial. QCA always uses completely different drivers for their SDKs, and the proprietary drivers have higher throughput. E.g. on IPQ40xx targets they reach up to 600 Mbps, but with ath10k it's much slower. But hey, you seem to know what you're doing.

Cheers :slight_smile:


#166

Yeah. I've invested a fair bit of time reading through the source code and finally decided on 6.1.r1, without being sure that it would work. Guess I'm lucky. I briefly tried QSDK 8.1 but ran into too many issues and decided to restart with another branch that's not so bleeding edge.

My latest branch (lede-17.01-quarkysg-qca-nss) should be considered beta now, and I welcome anyone who's brave enough to try out beta software :grin: It has most features in, except for the kernel crypto acceleration. Note that it's still kernel 4.4 tho.

Well, I'll see what I can do in that department. Will probably be a long, long while before it bears any fruit tho. Wish me luck. :wink:


#167

Good luck!
Btw, looking at NSS related code, there is some logic to check for type of firmware.

Two types are available, "retail" and "enterprise", which likely suggests the underlying hardware is identical. Would be interesting to see what the "enterprise" firmware can do. But hey, let's not get ahead of ourselves.
Just mentioning this so if anyone comes across this thread and has access to hardware they could possibly do a firmware dump for us to experiment with.

Cheers


#168

I guess the enterprise firmware has more 'enterprise' features, like maybe better QoS controls, SNMP, support for more wireless clients with better roaming, etc. The routers we get to buy off the shelf likely contain the retail firmware. The enterprise type is likely for routers or access points meant for enterprise deployment.


#169

Hi folks,

I have some good news, some bad news and an interesting observation. The good news is that I managed to somehow get the NSS core to accelerate OpenSSL functions. The bad news is that it is not stable, and the CFI driver is throwing a lot of errors, which I had to disable to get some meaningful results.

And the interesting observation is that the built-in Linux kernel crypto modules are surprisingly efficient!

Here are my results.

This one is using purely OpenSSL crypto code:

*** OpenSSL ***
root@LEDE:~# openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 9084467 aes-256 cbc's in 2.97s
Doing aes-256 cbc for 3s on 64 size blocks: 2651418 aes-256 cbc's in 2.98s
Doing aes-256 cbc for 3s on 256 size blocks: 693608 aes-256 cbc's in 2.98s
Doing aes-256 cbc for 3s on 1024 size blocks: 175730 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 21896 aes-256 cbc's in 3.00s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc      48939.89k    56943.21k    59585.12k    59982.51k    59790.68k
aes-256 cbc      50250.13k    56954.41k    59727.39k    60135.08k    60182.43k
aes-256 cbc      49473.95k    57152.47k    57030.01k    60355.38k    59995.48k

One CPU core @ 100% usr

This one is using the Linux kernel built-in crypto code via /dev/cryptodev, with the cryptodev module loaded but without the qca-nss-cfi-cryptoapi module loaded:

*** Linux Kernel Crypto ***
root@LEDE:~# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 268484 aes-256-cbc's in 0.31s
Doing aes-256-cbc for 3s on 64 size blocks: 252034 aes-256-cbc's in 0.15s
Doing aes-256-cbc for 3s on 256 size blocks: 172118 aes-256-cbc's in 0.14s
Doing aes-256-cbc for 3s on 1024 size blocks: 66565 aes-256-cbc's in 0.10s
Doing aes-256-cbc for 3s on 8192 size blocks: 11014 aes-256-cbc's in 0.01s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc      13857.24k   107534.51k   314730.06k   681625.60k  9022668.80k
aes-256-cbc      20899.89k   132489.60k   269924.80k   488579.66k  3000729.60k
aes-256-cbc      20820.04k    53892.80k   306914.74k   521436.55k  4491673.60k

One CPU core @ 100% sys
Higher thruput if CPU 2 is used.

This one is using the NSS cores for crypto acceleration:

*** NSS Crypto ***
root@LEDE:~# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 12988 aes-256-cbc's in 0.09s
Doing aes-256-cbc for 3s on 64 size blocks: 13787 aes-256-cbc's in 0.05s
Doing aes-256-cbc for 3s on 256 size blocks: 11594 aes-256-cbc's in 0.09s
Doing aes-256-cbc for 3s on 1024 size blocks: 12825 aes-256-cbc's in 0.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 8507 aes-256-cbc's in 0.07s

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc       2308.98k    17647.36k    32978.49k         infk   995562.06k
aes-256-cbc       2715.40k    16491.52k    63733.76k   153322.06k  9087385.60k
aes-256-cbc       3135.31k     8621.44k   118784.00k   274206.72k  4647321.60k

One CPU core @ 20% sys, 20% sirq

All tests were run 3 times.

Even when using the Linux crypto modules, OpenSSL sometimes doesn't report any results for the 1024-byte and/or 8192-byte tests.

From the above results, it looks like the OpenSSL crypto code is really unoptimised.

The Linux kernel module results are the best, but they use a lot of CPU compared to the NSS cores. I guess this is to be expected, as the Krait CPUs are running at 1,700 MHz with NEON instructions while the NSS cores are running at 800 MHz, probably without any SIMD instructions, although the NSS cores have much less overhead.

Would be great if we have crypto experts here to help out tho. I'm way out of my depth here.

Meanwhile, I guess I have to study how these things work and hopefully make it stable.


#170

Please add the -elapsed parameter when testing, or the test results could be inaccurate.
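
For anyone wondering why this matters: the throughput that openssl speed prints is count x block size / measured time, and without -elapsed the measured time is user CPU time rather than wall-clock time. When the work happens in the kernel or on an offload engine, user CPU time is tiny, so the result balloons. A rough Python sketch using two rows from the kernel crypto run above; the 3.0 s wall-clock value is an assumption (the nominal run length), not a measurement, and `throughput_k` is just an illustrative helper name.

```python
def throughput_k(count, block_size, seconds):
    """Return throughput in 1000s of bytes per second, as openssl speed reports it."""
    if seconds == 0:
        return float("inf")  # corresponds to the "infk" entry in the NSS run
    return count * block_size / seconds / 1000.0

# (count, block size, reported user-CPU seconds) from the kernel crypto run
rows = [(268484, 16, 0.31), (11014, 8192, 0.01)]
ASSUMED_WALL_SECONDS = 3.0  # assumption: nominal run length, not measured

for count, size, cpu_s in rows:
    by_cpu = throughput_k(count, size, cpu_s)
    by_wall = throughput_k(count, size, ASSUMED_WALL_SECONDS)
    print(f"{size:5d} bytes: {by_cpu:12.2f}k by CPU time, {by_wall:10.2f}k by wall time")
```

Dividing by CPU time reproduces the posted 13857.24k and 9022668.80k figures, while dividing the same counts by 3 s gives numbers that are orders of magnitude smaller.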


#171

No wonder the figures from my tests look out of this world. Will test again and post results.


#172

You've done it! :smiley:
It makes me happy to see the work done. I'll try to cut some family time and help with the testing.
Best!


#173

Guess I’m lucky :grimacing:

Back to struggling with the crypto acceleration, which is still my primary objective. Think I’m close ... tho this time, I will probably have to write my own drivers, referencing what I can find from Qualcomm and the Linux kernel drivers. The existing crypto drivers from CodeAurora look to me like proof-of-concept drivers that can only encrypt one ‘packet’ at a time.

The numbers I posted earlier are incorrect. At the moment there’s no effective driver that uses the NSS cores to accelerate crypto functions for userland, or even for the kernel. If there’s a crypto expert who’s interested in helping, that would really accelerate this effort.


#174

I doubt that

Please check /proc/crypto and /proc/interrupts to make sure that hardware crypto is being used
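
A small sketch of how that check could be scripted. /proc/crypto lists one block per registered implementation, and the kernel prefers the implementation with the highest priority for a given algorithm name. The parser below is a minimal Python illustration; the sample text is made up for demonstration (in particular, `hypothetical-hw-cbc-aes` and `hypothetical_hw` are not real driver or module names), so treat it as a sketch rather than expected output on an IPQ806x.

```python
import os

def parse_proc_crypto(text):
    """Parse /proc/crypto-style text into a list of dicts, one per entry."""
    entries, current = [], {}
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current entry
            if current:
                entries.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries

def preferred_driver(entries, name):
    """Return the highest-priority driver registered for an algorithm name."""
    candidates = [e for e in entries if e.get("name") == name]
    if not candidates:
        return None
    return max(candidates, key=lambda e: int(e.get("priority", 0)))["driver"]

# Illustrative sample only; the second entry is hypothetical.
SAMPLE = """\
name         : cbc(aes)
driver       : cbc(aes-generic)
module       : kernel
priority     : 100
type         : blkcipher

name         : cbc(aes)
driver       : hypothetical-hw-cbc-aes
module       : hypothetical_hw
priority     : 300
type         : ablkcipher
"""

if __name__ == "__main__":
    path = "/proc/crypto"
    if os.path.exists(path):
        with open(path) as fh:
            text = fh.read()
    else:
        text = SAMPLE
    print(preferred_driver(parse_proc_crypto(text), "cbc(aes)"))
```

If the driver printed for the algorithm under test is a generic software one, the hardware path is not what the benchmark exercised.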


#175

I meant that in the context of using the NSS cores for crypto acceleration. I got the Qualcomm drivers working, but they are limited to a single scatterlist page. The QCA drivers take over when encrypting payloads smaller than 2 KiB. With larger payloads, OpenSSL (which is what I'm using to test atm) or the Linux kernel chains the payload across multiple scatterlist pages (destination, source, or both), and the QCA driver refuses to process the request once either the destination or source scatterlist has more than one entry.

I don't really understand how the scatterlist mechanism works at the moment. My understanding of AES encryption (concentrating on this algo first) is that the encrypted payload will have the same size as the plaintext, but when I checked the scatterlist structures passed via the crypto APIs to the QCA driver, they have different source and destination sizes. That means I cannot cipher the source to the destination in place, which means complicated memory manipulation.
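
One possible explanation, stated as an assumption rather than a diagnosis of the QCA driver: in the kernel crypto API the request carries its own byte count, and the source and destination scatterlists only need to cover at least that many bytes. They can be segmented differently and can describe more memory than the request uses. Under that assumption, a driver walks both lists in lockstep and stops at the request length. A toy Python model of such a walk (segment lengths and the `walk` helper are invented for illustration):

```python
def walk(src_lens, dst_lens, nbytes):
    """Yield (src_seg, src_off, dst_seg, dst_off, length) chunks covering nbytes.

    src_lens and dst_lens are the segment lengths of two differently
    segmented buffers. Each chunk lies inside exactly one source segment
    and one destination segment.
    """
    if sum(src_lens) < nbytes or sum(dst_lens) < nbytes:
        raise ValueError("segment lists are shorter than the request")
    si = di = so = do = 0
    remaining = nbytes
    while remaining > 0:
        step = min(src_lens[si] - so, dst_lens[di] - do, remaining)
        if step > 0:
            yield (si, so, di, do, step)
        so += step
        do += step
        remaining -= step
        if so == src_lens[si]:        # source segment exhausted
            si, so = si + 1, 0
        if do == dst_lens[di]:        # destination segment exhausted
            di, do = di + 1, 0

# Source describes 8192 bytes, destination 9216 bytes, request is 6016 bytes.
for chunk in walk([4096, 4096], [1024, 4096, 4096], 6016):
    print(chunk)
```

In this model the two lists have different total sizes, yet every chunk maps a source range onto an equally long destination range, so input and output stay the same length. Real hardware would additionally need each chunk to be a multiple of the AES block size.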

I may be completely off-base here. If you can educate me further I'll be eternally grateful :sweat_smile:


#176

Maybe you can ask @drbrains for help. He has written crypto drivers for the MT76x8 and MT7621 from scratch.


#177

Thanks @LGA1150!

The link you shared for the work done by @drbrains gave me an idea how to proceed now.

Will see if it works!


#178

The aes-engine for the MT76x8 is more or less finished. The MT7621 driver on my GitHub is at a very early stage, but the cipher part works, albeit with very bad performance.

I'm not an expert myself, but maybe I already ran into some problems you are having. Maybe we can help each other.


#179

More than happy to share learning points with you. At the moment, I'm still trying to learn both the Linux crypto subsystems and the IPQ806x NSS cores crypto framework, so my progress will be rather slow.

On my side, the hard part has already been done by the Qualcomm team. I just need to figure out how to make it work with multiple sgs in the same request. Your code that merges all the sgs showed me how it can be done for the NSS cores as well, so I'll be trying that out. Will share the results once I get it working correctly.

The way I'm planning to do it is to feed the multiple sg segments one at a time to the h/w crypto engine, probably much like what you did for the MT HW crypto driver, if I read your code correctly. This will likely generate too many interrupts, which will likely kill throughput. Anyway, once I get the multi-segment handling going correctly, I'll see if the performance is good or bad.
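
For CBC mode, feeding segments one at a time gives the same ciphertext as a single pass provided each segment is a whole number of blocks and the last ciphertext block of one segment is used as the IV of the next. A minimal Python sketch of that chaining follows. The block cipher here is a deliberately trivial stand-in (`toy_encrypt_block`, invented for this example and not secure) so the snippet runs without a crypto library; it only demonstrates the IV hand-off, not the NSS interface.

```python
BLOCK = 16

def toy_encrypt_block(key, block):
    """Stand-in for AES: XOR with the key, then rotate by one byte. NOT secure."""
    mixed = bytes(b ^ k for b, k in zip(block, key))
    return mixed[1:] + mixed[:1]

def cbc_encrypt(key, iv, data):
    """CBC-encrypt one contiguous buffer; returns (ciphertext, next_iv)."""
    if len(data) % BLOCK:
        raise ValueError("segment must be a whole number of blocks")
    out, prev = bytearray(), iv
    for i in range(0, len(data), BLOCK):
        block = bytes(a ^ b for a, b in zip(data[i:i + BLOCK], prev))
        prev = toy_encrypt_block(key, block)
        out += prev
    return bytes(out), prev

def cbc_encrypt_segments(key, iv, segments):
    """Submit segments one at a time, carrying the last ciphertext block as the next IV."""
    out = bytearray()
    for seg in segments:
        ct, iv = cbc_encrypt(key, iv, seg)
        out += ct
    return bytes(out)

key = bytes(range(16))
iv = bytes(16)
data = bytes(i % 251 for i in range(4096))
segments = [data[:1024], data[1024:3072], data[3072:]]

one_shot, _ = cbc_encrypt(key, iv, data)
print(cbc_encrypt_segments(key, iv, segments) == one_shot)
```

Segments that are not block-aligned would need the leftover bytes buffered and prepended to the next segment, which this sketch does not attempt.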


#180

Do you have a link to the code? It seems this SoC should be more powerful and very likely should be able to do scatter/gather in hardware. The EIP93 inside the MT7621 can't do that. The MT76x8 is able to do that, so I used it as a testbed to see if I could make it work without hardware scatter/gather. This code will be replaced soon.

As for interrupts: the MT76x8, MT7621 and the newer (arm vs mips) MT7623 can have "delayed" interrupts. Meaning it will only send an interrupt after a certain number of "packets" have been processed. They usually have some timer based timeout as well, in case this number is not reached in time.
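
To make the idea concrete, here is a toy Python model of delayed (coalesced) interrupts: one interrupt is raised either when a completion count is reached or when the oldest pending completion has waited too long. The class name, the `threshold` and `timeout` parameters and the tick-based clock are all invented for illustration and do not reflect actual MediaTek or NSS registers.

```python
class CoalescingEngine:
    """Toy model: one interrupt per `threshold` completions or after `timeout` ticks."""

    def __init__(self, threshold, timeout):
        self.threshold = threshold
        self.timeout = timeout
        self.pending = 0       # completions not yet signalled
        self.oldest = None     # tick at which the oldest pending completion arrived
        self.interrupts = 0

    def complete(self, now):
        """Record one finished packet at time `now`."""
        if self.pending == 0:
            self.oldest = now
        self.pending += 1
        if self.pending >= self.threshold:
            self._fire()

    def tick(self, now):
        """Periodic timer: flush completions that have waited too long."""
        if self.pending and now - self.oldest >= self.timeout:
            self._fire()

    def _fire(self):
        self.interrupts += 1
        self.pending = 0
        self.oldest = None

engine = CoalescingEngine(threshold=8, timeout=5)
for t in range(20):        # 20 packets complete, one per tick
    engine.complete(t)
engine.tick(25)            # the timer flushes the 4 leftover completions
print(engine.interrupts)   # far fewer interrupts than packets
```

The trade-off is latency: a lone packet waits up to the timeout before the CPU hears about it.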


#181

I'm referencing the drivers from the following repo that contains the ablkcipher driver:

https://source.codeaurora.org/quic/qsdk/oss/lklm/nss-cfi/tree/cryptoapi/v2.1?h=NHSS.QSDK.6.1.r1

It depends on the NSS crypto core driver from:

https://source.codeaurora.org/quic/qsdk/oss/lklm/nss-crypto/tree/v1.0?h=NHSS.QSDK.6.1.r1

The benchmarking tool provided by the NSS crypto driver shows a peak AES128 bandwidth of around 500 Mbps, averaging 300-400 Mbps, all running within kernel memory space.

The problem at the moment is that the drivers I found only process one sg at a time. Combing through the NSS crypto driver source code seems to suggest that the IPQ806x NSS cores/firmware only process one buffer at a time.

Reading later versions of the QSDK sources seems to suggest that the IPQ807x SoC can perform scatter/gather in hardware, or rather in the NSS firmware. Hopefully I'm wrong.