I had a (very) quick look. As the comments in the code state, only scatterlists with 1 segment are allowed. A workaround could be:
copy the entire scatterlist to a buffer, do the transformation, and copy the resulting buffer back to the scatterlist. The larger the buffer needed, the more overhead. Given that most packets are around 1500 bytes (the MTU), or maybe 4k when using LUKS with large sectors, you have to think about what you want to achieve.
Alternatively, create a transformation for each scatterlist segment. Keep in mind that the source and destination scatterlists might not be segmented the same way. This is what I'm trying to correct in my code. The only open question (and I didn't look at your code long enough to tell) is whether you can use the IV from the previous transformation for the next one. This is not the same as programming a "new" IV.
Most solutions I've looked at for different hardware queue in software, meaning they wait for each request to complete before dequeueing the next. In this case you could use the "operation complete" interrupt, which fires when the hardware has no more transformations to complete.
The first option is the easiest to implement: copy the scatterlist to a buffer, create a single-segment scatterlist from that buffer, and pass it to the rest of the driver. Combine that with allowing only one request in the engine at a time and you could even pre-allocate the buffer instead of dynamically allocating and freeing memory.
I'm thinking along this line; I got the idea from reading your MT driver. Using the big-buffer approach would probably negate the speed advantage and would be a waste of resources. I'm aware that I need to use the last block of the cipher result as the IV for the next segment. Doing CBC first, then tackling CTR once I get CBC to work.
I managed to get the driver to work using the multi-segment scatterlist approach, but the performance is terrible.
The stats below are from the Krait CPU:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-cbc      59750.90k    77232.73k    87382.27k    89440.60k    89877.16k
The stats below are from the NSS core:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-cbc      65.99k       261.01k      1038.68k     2138.11k     11141.12k
Tested using the following command:
openssl speed -evp aes-128-cbc -elapsed
I suspect the issue is likely my code allocating and freeing kernel memory on every pass. I'll try pre-allocating the memory and see how it goes. Only doing AES-CBC encryption for now, though.
My assumption about memory allocation slowing it down was incorrect. As a quick test, I used a preallocated buffer for a single user, i.e. without any kernel memory allocation. Same result.
Looks like I need to study the NSS crypto driver to understand in more detail where the bottleneck might be.
I'm confused. If you are using a multi-segment scatterlist approach, why do you need to allocate memory other than for some kind of command descriptor? It should do DMA directly from either kernel or userspace memory, "zero copy".
Judging by the performance, I would look into your interrupt handling. That is likely the biggest bottleneck to tackle. (I'm having similar problems with the MT driver; I'm rewriting my code now in the style of the EIP197 safexcel driver.)
I needed some memory to keep track of the computed segments and which segment each call is currently at; basically a data structure to keep track of the segment pointers.
At the moment I have no clue where the bottleneck could be. It could be the interrupts. Need to study the QCA drivers.
@drbrains I did some crude benchmarking on my hacked driver for the NSS crypto. Encrypting a 4096-byte payload split into 3 segments resulted in wildly erratic timings.
What I did was capture the time before starting to cipher the 3 segments and capture the time again when all 3 segments are done. The times vary from just over 400 µs to slightly over 1000 µs, which gives a throughput of between 77 Mbps and 30 Mbps. Each pass generates 3 interrupts, one for each completed segment.
So it looks to me like the interrupt-driven nature of the entire ciphering process is contributing to the bad performance.
Not too sure if I can improve this behaviour, as the interrupts are generated by the NSS firmware, which I have no control over. Judging from the results, I may get better consistency with the large-buffer approach: copy all segments into a large buffer and copy the results back to the scatterlist.
I'll try that and see if it results in better performance, though I'm not hopeful.
It would be great if the kernel folks could help out, but seeing that this involves proprietary firmware alongside open-source drivers, the chances of getting help are likely slim.
I've tried the big-buffer approach. Performance improved by 100%, but it's still bad.
Below are numbers from the NSS crypto with a consolidated buffer allocated:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-cbc      62.02k       250.05k      938.84k      3764.22k     26383.70k
aes-256-cbc      59.18k       235.33k      973.65k      3707.56k     24980.14k
Below are numbers from kernel crypto drivers running on the Krait CPU:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-cbc      1349.59k     5541.06k     14076.42k    25547.60k    35430.40k
aes-256-cbc      1449.99k     5312.90k     13398.10k    22709.93k    29892.61k
Finally, below are the ciphers from OpenSSL running on the Krait CPU:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128 cbc      70107.68k    80491.29k    86042.11k    88152.75k    88995.16k
aes-256 cbc      49108.50k    56065.37k    58578.77k    59422.38k    59714.22k
Looks like it may be difficult to improve the performance further.
I'll probably do more study of the existing drivers and see if I can learn new tricks to further improve the performance of the NSS crypto drivers.
It could be that the NSS core (in stock firmware) is used only for WiFi crypto and not for raw encryption (for VPN).
In every way the CPU will be better than the NSS core, as the CPU has a higher clock frequency.
We should grab some performance tests from the original firmware to check this.
Stock firmware does not have the necessary drivers to test OpenSSL, so I'm not sure we can get any meaningful numbers from it.
The NSS crypto driver comes with a benchmarking tool, which shows a peak bandwidth of over 500 Mbps doing AES-128-CBC with SHA1 on a 256-byte payload. So the NSS crypto is fast, likely because it does little besides networking and crypto tasks, compared to the Krait CPU cores doing a ton more work, despite their clock frequency being twice as high.
Bad news, folks. The inclusion of the NSS drivers has made the firmware unstable. I'm getting random reboots without any logs or kernel panics; the router just freezes and reboots. Looks like it'll be painful to isolate the issue.
Maybe it's due to me upgrading too many components at once, i.e. the kernel and ath10k.
Have you been using the ath10k-ct driver? Previous versions were unstable to the point where the router would reboot, but there have been some important fixes lately.
No, I'm not using the CT ath10k drivers. The reboot occurs even without using wireless. I'm trying to trace which part of the kernel I missed patching, so it's a trial-and-error process now.