Qualcommax NSS Build

cryptoapi/v2.0/nss_cryptoapi_skcipher.c is from the qca-nss-cfi patches.

Excellent news, but the main question is whether the performance is actually better and whether WireGuard or other VPNs could really benefit from this.

WireGuard uses ChaCha20-Poly1305, and WireGuard already runs at gigabit speed using software crypto, so what's the point? It maxes out the standard Ethernet port in my standard test scenarios. It's limited more by the physical link speed than by the software or the crypto.

nss-crypto and QCE qcrypto (both use the same hardware acceleration) give the most benefit for larger data. For packet encryption, software crypto using arm64 AES works best. But this has nothing to do with WireGuard, since WireGuard uses a different algorithm which is already lightning fast in software.

Good to know all of this. My point was that if the NSS crypto core were able to offload VPN network traffic too (or any other task currently handled only by the main CPU), that would free up even more CPU cycles for other tasks like ksmbd network shares, etc.
I don't care here about the pure VPN speed but I care if we can use the NSS crypto core for useful tasks too.
A really simple example is the regular OpenWrt build without NSS support at all. We cannot achieve a Gigabit VPN because only the main CPU does all the work.
Because, before your fix for the NSS crypto, the NSS core was just a piece of useless hardware.
So, I wonder in this case, what other tasks could benefit from the supported NSS crypto algorithms now?
Probably my previous post wasn't clear enough.
I really appreciate your NSS patches and great contribution here. I hope that now NSS crypto core will do something useful too.

If you use WireGuard you can achieve gigabit speed without using NSS, so why are you saying something else? If you cannot achieve this, you have another bug unrelated to NSS.

For VPN applications NSS crypto won't help you, since the encrypted data (typically an Ethernet packet of about 1500 bytes) is too small and the hardware crypto engine overhead is too high here. But ksmbd will benefit greatly if Samba encryption is enabled.
As for the argument about saving resources for other tasks: gigabit VPN speed together with great NAS performance at the same time is a little bit curious. Your router doesn't seem to have any clear priorities.
Now we have one problem. Let's say you are using OpenVPN DCO and you run ksmbd. ksmbd will gain performance from hardware crypto, but at the same time it decreases performance for OpenVPN DCO due to the small packet sizes. You cannot use it selectively: if the driver is loaded, all kernel services make use of it.

Actually, from my experience, hardware crypto will help even with small packet sizes. If memory serves, the ipq806x NSS crypto benchmark code (running in the kernel) could achieve 500 Mbps of throughput (I believe it uses 64-256 byte payloads; throughput should be higher with larger payloads), while OpenSSL probably tops out at 200-300 Mbps with one CPU core maxed out.

I believe OpenVPN DCO throughput will benefit, provided that NSS crypto (with its DMA overhead) is faster than the arm64-optimised AES-GCM kernel code.

It will not benefit. We are not talking about hardware crypto like AES-NI here. arm64 has a similar instruction set, which has nothing to do with NSS crypto and already works in place. But the NSS crypto engine works differently. Consider that all data you want to encrypt must be sent to a second chip and the encrypted result must be transferred back. This is the bottleneck here, and it costs performance on small data like Ethernet packets. So it's not in-place hardware encryption like special CPU instructions.

And finally: I have already benchmarked nss-crypto, qcrypto and arm64 AES with various sizes, so I know what I'm talking about.

I don't "just" say anything.
I was curious to see what WG speeds I can get compared to other platforms.
Here are my WG benchmark tests.
Although this test is mostly synthetic, I've tried to bring it closer to a real-world usage scenario by running wg-bench and Speedtest simultaneously.
The differences between NSS and Non-NSS builds are obvious unless something else was fundamentally wrong in my setup.

WireGuard will not benefit from any hardware crypto engine, so what's the point of it?

But do me a favor: use the in-kernel tcrypt benchmark (insmod tcrypt mode=500 sec=1).

Do it without loading the NSS crypto driver, and then do it with the NSS crypto driver loaded. tcrypt only uses small sizes up to 4096 anyway. You will quickly find out that the in-kernel crypto drivers for arm64 are faster than NSS. I did the same for QCE / qcrypto. In my benchmarks, the break-even point where you start to get some benefit is a data size of 16384.
The reason is that no matter how big the data is, the time needed to encrypt/decrypt it is almost identical, but the number of encryption/decryption runs is limited. So let's say you can do a maximum of 12000 encryptions per second. The result is very different depending on whether your block is 1024 or 16384 bytes, and this is what I see when benchmarking it.
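
The argument above boils down to a simple model: if the engine completes a roughly fixed number of requests per second regardless of request size, throughput grows linearly with the block size. A minimal Python sketch of that model, assuming the illustrative 12000 operations per second from the post (a "let's say" figure, not a measurement); the helper name is hypothetical:

```python
# Illustrative model only: the offload engine completes a fixed number of
# requests per second, no matter how large each request is.
OPS_PER_SECOND = 12_000  # the "let's say" figure from the post, not a measurement


def offload_throughput(block_size: int, ops_per_second: int = OPS_PER_SECOND) -> int:
    """Bytes per second when every request costs the same time (hypothetical helper)."""
    return block_size * ops_per_second


if __name__ == "__main__":
    # Block sizes chosen for illustration; 1500 stands in for an Ethernet packet.
    for size in (64, 1024, 1500, 4096, 16384):
        mb_per_s = offload_throughput(size) / 1e6
        print(f"{size:>6} bytes/request -> {mb_per_s:8.2f} MB/s")
```

Under this assumption a 16384-byte request moves 16 times as much data per second as a 1024-byte one, which is why an engine with a high per-request cost only pays off for large buffers.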

In addition, about ipq806x, since you mentioned it: here the situation is different. The ARMv7 architecture has no special AES instruction set like the Cortex-A53 has, so on a 32-bit ipq806x system the benefit will be bigger. Now comes the QCA joke: vendor firmwares using the ipq807x chipset run only in 32-bit mode and cannot benefit from the arm64 AES instructions, so NSS crypto makes more sense in 32-bit mode for them. But for us, since we run it in 64-bit mode, the situation is different.

Reading all of your posts is really useful.
So, are there any recommendations about these settings for ipq807x?


Screenshot is from my NSS ipq806x menuconfig.
Reading the help for both options,

Enable ChaCha20-Poly1305 ciphersuite support
Prefer ChaCha20-Poly1305 over AES-GCM by default 

Do I understand correctly that the option below

Prefer ChaCha20-Poly1305 over AES-GCM by default

is beneficial for ipq806x but not for ipq807x (because the latter has AES instructions)? Maybe it doesn't matter if the application is coded wisely, since the help says it's only a default that an application can always override.
Anyone benchmarked those?
@jkool702 You may want to chime in with the discussion about NSS crypto acceleration again considering the recent NSS crypto fix.

If you compile WireGuard with the kernel, the necessary components are selected anyway since they are dependencies. WireGuard requires chacha20poly1305, libcurve25519 and blake2s. Make sure that the arm64-optimized variants are compiled in, and please look into the kernel config. The OpenWrt menuconfig might not be complete.

I did some benchmark tests using openssl

openssl speed -seconds 10 -engine devcrypto aes

type               16 bytes       64 bytes      256 bytes     1024 bytes     8192 bytes    16384 bytes
aes-128-cbc       18715.49k      63367.38k     411619.20k    1463933.16k   13153177.60k   25966533.49k
aes-192-cbc       13718.83k      68621.87k     274562.13k    1012917.17k    8759432.53k   23302261.03k
aes-256-cbc       22865.07k     102972.00k     366014.58k    1879844.57k   11666591.29k   17451417.60k

openssl speed -seconds 10 aes

type               16 bytes       64 bytes      256 bytes     1024 bytes     8192 bytes    16384 bytes
aes-128-cbc      110452.12k     336755.46k     668732.72k     920181.04k    1033949.30k    1040359.42k
aes-192-cbc      107795.87k     306096.39k     552953.88k     712999.01k     778744.31k     781908.38k
aes-256-cbc      104773.77k     281387.54k     480802.56k     597366.17k     642897.34k     644995.48k

nss / non-nss in %

type               16 bytes       64 bytes      256 bytes     1024 bytes     8192 bytes    16384 bytes
aes-128-cbc          16.94%         18.82%         61.55%        159.09%       1272.13%       2495.92%
aes-192-cbc          12.73%         22.42%         49.65%        142.06%       1124.81%       2980.18%
aes-256-cbc          21.82%         36.59%         76.13%        314.69%       1814.69%       2705.67%
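
The percentages are the devcrypto (NSS) figure divided by the plain software figure for the same cipher and block size. A minimal Python sketch of that calculation, with the numbers copied from the two openssl tables above; the variable and function names are assumptions of this sketch, not part of any tool:

```python
BLOCK_SIZES = (16, 64, 256, 1024, 8192, 16384)

# openssl speed results in 1000s of bytes per second, copied from the post.
NSS = {
    "aes-128-cbc": (18715.49, 63367.38, 411619.20, 1463933.16, 13153177.60, 25966533.49),
    "aes-192-cbc": (13718.83, 68621.87, 274562.13, 1012917.17, 8759432.53, 23302261.03),
    "aes-256-cbc": (22865.07, 102972.00, 366014.58, 1879844.57, 11666591.29, 17451417.60),
}
SOFTWARE = {
    "aes-128-cbc": (110452.12, 336755.46, 668732.72, 920181.04, 1033949.30, 1040359.42),
    "aes-192-cbc": (107795.87, 306096.39, 552953.88, 712999.01, 778744.31, 781908.38),
    "aes-256-cbc": (104773.77, 281387.54, 480802.56, 597366.17, 642897.34, 644995.48),
}


def ratio_percent(nss, software):
    """NSS throughput as a percentage of software throughput, per block size."""
    return [100.0 * n / s for n, s in zip(nss, software)]


if __name__ == "__main__":
    print("type        " + "".join(f"{size:>11}" for size in BLOCK_SIZES))
    for cipher in NSS:
        row = ratio_percent(NSS[cipher], SOFTWARE[cipher])
        print(f"{cipher:<12}" + "".join(f"{value:>10.2f}%" for value in row))
```

Values below 100% mean the software path was faster at that block size; in this data set the crossover sits between 256 and 1024 bytes.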

Some observations: non-nss accelerated tests will occupy 25% of the CPU (single core), while nss accelerated tests use 4% CPU. Besides, I cannot get similar results for aes-gcm. It seems nss will be bypassed and the result will be the same as non-nss tests.

Your results don't look like they were made without any hardware crypto like QCE.

Here are my results without any hardware crypto involved, using your parameters:

aes-128 cbc      81911.97k    91870.11k    94795.93k    95669.00k    95494.02k    95876.07k
aes-192 cbc      73744.70k    79307.03k    81643.58k    81960.06k    82279.79k    82146.55k
aes-256 cbc      65395.20k    69744.83k    71300.04k    72152.72k    71928.57k    72174.09k

It gets more interesting with -evp:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc     139651.85k   516102.61k  1053835.94k  1461820.34k  1646942.30k  1668023.60k

What I don't understand is why tcrypt results in such poor speed.

Sometimes the 5 GHz radio goes down with the error below, and even with 802.11r the error spams the system log like hell.
It is also mostly caused by an iPhone client.

peer ast idx 965 can't be found

But after I reapplied the iPhone fix, the error disappeared.
Edit: I spoke too soon, the error showed up again.

This is not an error. It only means your client disconnected and the driver is unable to update the client statistics since the connection was lost, so the client left without notifying the driver of the disconnect. But I will do some research in case there is another reason for it (I have some ideas).

@AgustinLorenzo

Could you include the Netgear WAX218 in your builds? I tried building it myself, but all clients disconnect after a couple of seconds.

Here are the benchmarks with my latest code using NSS crypto:

The 'numbers' are in 1000s of bytes per second processed.
type               16 bytes       64 bytes      256 bytes     1024 bytes     8192 bytes    16384 bytes
aes-128-cbc       25711.00k      91355.73k     365448.53k    1519859.20k   34978747.73k   31416866.13k
aes-192-cbc       25697.40k     274026.67k     298968.44k    1315891.20k  104862515.20k   30430549.33k
aes-256-cbc       51407.60k     411312.00k     411305.60k    1195836.51k   17406225.07k   22687470.93k

The numbers have doubled. Can you publish the code?