I had a (very) quick look. As the comments in the code state, only scatterlists with 1 segment are allowed. A workaround could be:
copy the entire scatterlist to a buffer, do the transformation, and copy the resulting buffer back to the scatterlist. The larger the buffer needed, the more overhead. Given that most packets are around 1500 bytes (the MTU), or maybe 4k when using LUKS with large sectors, you have to think about what you want to achieve.
Alternatively, create a transformation for each scatterlist segment. Keep in mind that the source and destination scatterlists might not be segmented the same way. This is what I'm trying to correct in my code. The only open question (and I didn't look at your code long enough to tell) is whether you can use the IV from the previous transformation for the next one. This is not the same as programming a "new" IV.
Most solutions I've looked at for different hardware queue in software, meaning they wait for each request to complete before dequeueing the next. In this case you could use the "operation complete" interrupt, which fires when the hardware has no more transformations to complete.
The first option is the easiest to implement: copy the scatterlist to a buffer, create a single-segment scatterlist from that buffer, and pass it to the rest of the driver. Combine that with allowing only one request in the engine at a time and you could even pre-allocate the buffer instead of dynamically allocating and freeing memory.
I'm thinking along this line; I got the idea from reading your MT driver. Using the big-buffer approach would probably negate the speed advantage and would be a waste of resources. I'm aware that I need to use the last block of the cipher result as the IV for the next segment. Doing CBC first, then tackling CTR once I get CBC to work.
I managed to get the driver to work using the multi-segment scatterlist approach, but the performance is terrible.
The stats below are from the Krait CPU:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-cbc      59750.90k    77232.73k    87382.27k    89440.60k    89877.16k
The stats below are from the NSS core:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-cbc      65.99k       261.01k      1038.68k     2138.11k     11141.12k
Tested using the following command:
openssl speed -evp aes-128-cbc -elapsed
I suspect the issue is likely my code allocating and freeing kernel memory on every pass. I'll try pre-allocating the memory and see how it goes. Only doing AES-CBC encryption for now, though.
My assumption about memory allocation slowing it down was incorrect. As a quick test, I used a preallocated buffer for a single user, i.e. without any kernel memory allocation. Same result.
Looks like I need to study the NSS crypto driver to understand in more detail where the bottleneck might be.
I'm confused. If you are using a multi-segment scatterlist approach, why do you need to allocate memory other than for some kind of command descriptor? It should do DMA directly from either kernel or userspace memory, "zero copy".
Judging by the performance, I would look into your interrupt handling. That is likely the biggest bottleneck to tackle. (I'm having similar problems with the MT driver; I'm rewriting my code now in the style of the EIP197 safexcel driver.)
I needed some memory to keep track of the computed segments and which segment each call is currently at; basically a data structure to keep track of the segment pointers.
At the moment I have no clue where the bottleneck could be. It could be the interrupts. Need to study the QCA drivers.
@drbrains I did some crude benchmarking on my hacked driver for the NSS crypto. Encrypting a 4096-byte payload split into 3 segments resulted in wildly erratic timings.
What I did was capture the time before starting to cipher the 3 segments and capture the time again when all 3 segments are done. The times vary from just over 400 µs to slightly over 1000 µs, which gives a throughput of between 77 Mbps and 30 Mbps. Each pass generates 3 interrupts, one for each completed segment.
So it looks to me like the interrupt-driven nature of the entire ciphering process is contributing to the bad performance.
Not too sure if I can improve this behaviour, as the interrupts are generated by the NSS firmware, which I have no control over. Judging from the results, I may get better consistency with the large-buffer approach: copy all segments into a large buffer and copy the results back to the scatterlist.
I'll try that and see if it results in better performance, though I'm not hopeful.
It would be great if the kernel folks could help out, but seeing that this involves proprietary firmware alongside open-source drivers, the chances of getting help are likely slim.
I've tried the big-buffer approach. Performance improved by 100%, but it's still bad.
Below are numbers from the NSS crypto with a consolidated buffer allocated:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-cbc      62.02k       250.05k      938.84k      3764.22k     26383.70k
aes-256-cbc      59.18k       235.33k      973.65k      3707.56k     24980.14k
Below are numbers from kernel crypto drivers running on the Krait CPU:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-cbc      1349.59k     5541.06k     14076.42k    25547.60k    35430.40k
aes-256-cbc      1449.99k     5312.90k     13398.10k    22709.93k    29892.61k
Finally, below are the ciphers from OpenSSL running on the Krait CPU:
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128 cbc      70107.68k    80491.29k    86042.11k    88152.75k    88995.16k
aes-256 cbc      49108.50k    56065.37k    58578.77k    59422.38k    59714.22k
Looks like it may be difficult to improve the performance further.
I'll probably do more study of the existing drivers and see if I can learn new tricks to further improve the performance of the NSS crypto drivers.
It could be that the NSS core (in stock firmware) is used only for WiFi crypto and not for raw encryption (for VPN).
In every way the CPU will be better than the NSS core, as the CPU has a higher clock frequency.
We should grab some performance tests from the original firmware to check this.
Stock firmware does not have the necessary drivers to test OpenSSL, so I'm not sure we can get any meaningful numbers from it.
The NSS crypto driver comes with a benchmarking tool, which shows a peak bandwidth of over 500 Mbps doing AES-128-CBC with SHA1 on a 256-byte payload. So the NSS crypto is fast, likely because it does little besides networking and crypto tasks, compared to the Krait CPU cores doing a ton more work, despite their clock frequency being twice as high.
Bad news, folks. The inclusion of the NSS drivers has made the firmware unstable. I'm getting random reboots without any logs or kernel panics; the router just freezes and reboots. Looks like it'll be painful to isolate the issue.
Maybe it's due to me upgrading too many components at once, i.e. the kernel and ath10k.
Have you been using the ath10k-ct driver? Previous versions were unstable to the point where the router would reboot, but there have been some important fixes lately.
No, I'm not using the CT ath10k drivers. The reboot occurs even without using wireless. I'm trying to trace which part of the kernel I missed patching, so it's a trial-and-error process now.