cryptoapi/v2.0/nss_cryptoapi_skcipher.c is qca-nss-cfi patches.
Excellent news, but the main question is whether the performance is actually better and whether WireGuard or other VPNs could really benefit from this.
WireGuard uses ChaCha20-Poly1305, and WireGuard already runs at gigabit speed using software crypto, so what's the point? It maxes out the standard Ethernet port in my standard test scenarios. It's limited more by the physical link speed than by the software or the crypto.
nss-crypto and qce qcrypt (both use the same hw acceleration) give the most benefit for larger data. For packet encryption, software crypto using arm64 AES works best. But this has nothing to do with WireGuard, since WireGuard uses a different algorithm which is already lightning fast in software.
Good to know all of this. My point was that if the NSS crypto core were able to offload VPN network traffic too (or any other tasks currently handled only by the main CPU), that would free up even more CPU cycles for other tasks like ksmbd network shares, etc.
I don't care here about the pure VPN speed but I care if we can use the NSS crypto core for useful tasks too.
A really simple example is the regular OpenWrt build without NSS support at all. We cannot achieve a Gigabit VPN because only the main CPU does all the work.
Because, before your fix for the NSS crypto, the NSS core was just a piece of useless hardware.
So, I wonder in this case, what other tasks could benefit from the supported NSS crypto algorithms now?
Probably my previous post wasn't clear enough.
I really appreciate your NSS patches and great contribution here. I hope that now NSS crypto core will do something useful too.
If you use WireGuard you can achieve gigabit speed without using NSS, so why are you saying something else? If you cannot achieve this, you have another bug unrelated to NSS.
For VPN applications NSS crypto won't help you, since the encrypted data (typically an Ethernet packet of about 1500 bytes) is too small and the hw crypto engine overhead is too high here. But for ksmbd, if Samba encryption is enabled, it will be a great benefit.
But as for the argument about saving resources for other tasks: wanting gigabit VPN speed and great NAS performance at the same time is a little curious. Your router doesn't seem to have any focus on priorities.
Now we have one problem. Let's say you are using OpenVPN DCO and you run ksmbd. ksmbd will gain performance from hw crypto, but at the same time it decreases performance for OpenVPN DCO due to the small packet sizes. You cannot use it selectively: if the driver is loaded, all kernel services make use of it.
Actually, in my experience, hardware crypto helps even with small packet sizes. If memory serves, the ipq806x NSS crypto benchmark code (running in the kernel) could achieve 500 Mbps of throughput (I believe it uses 64-256 byte payloads; throughput should be higher with larger payloads), while OpenSSL probably tops out at 200-300 Mbps with one CPU core maxed out.
I believe OpenVPN DCO throughput will benefit, provided that NSS crypto (with DMA overhead) is faster than the arm64 AES-GCM optimised kernel code.
It will not benefit. We are not talking about hw crypto like AES-NI here. arm64 has a similar instruction set, which has nothing to do with NSS crypto and already works in place. The NSS crypto engine works differently. Consider that all data you want to encrypt must be sent to a second chip and the encrypted result must be transferred back. This is the bottleneck here, and it loses performance on small data like Ethernet packets. So it is not in-place hw encryption like special CPU instructions.
And finally, I have already benchmarked NSS crypto, qcrypt and arm64 AES with various sizes, so I know what I'm talking about.
I don't "just" say anything.
I was curious to see what WG speeds I can get compared to other platforms.
Here are my WG benchmark tests.
Although this test is mostly synthetic, I've tried to bring it closer to a real-world usage scenario by running wg-bench and Speedtest simultaneously.
The differences between NSS and Non-NSS builds are obvious unless something else was fundamentally wrong in my setup.
WireGuard will not benefit from any hw crypto engine, so what's the point of it?
But do me a favor: use the in-kernel tcrypt benchmark (insmod tcrypt mode=500 sec=1).
Run it without loading the NSS crypto driver, and then run it with the NSS crypto driver loaded. tcrypt only uses small sizes anyway, up to 4096. You will quickly find that the in-kernel crypto drivers for arm64 are faster than NSS. I did the same for qce / qcrypto. From my benchmarks, the break-even point where you start to see some benefit is a data size of 16384.
The reason is that no matter how big the data is, the time needed to encrypt/decrypt it is almost identical, but the number of encryption/decryption runs per second is limited. So let's say you can do a maximum of 12000 encryptions per second. The resulting throughput is very different if your block is 1024 bytes or 16384 bytes, and this is what I see when benchmarking it.
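To make the argument above concrete, here is a minimal Python sketch of that model. It assumes, as the post does, that the engine completes a fixed number of requests per second regardless of request size; the 12000 figure is the example number from the post, not a measurement, and the function name is made up for illustration.

```python
# Fixed-rate model: the engine finishes about the same number of requests per
# second whatever the request size, so throughput grows with the block size.
# 12000 is the illustrative figure from the post, not a measured value.
OPS_PER_SECOND = 12_000


def throughput_mbit(block_size_bytes: int, ops_per_second: int = OPS_PER_SECOND) -> float:
    """Throughput in Mbit/s when every request takes the same time."""
    return block_size_bytes * ops_per_second * 8 / 1_000_000


for size in (1024, 4096, 16384):
    print(f"{size:>6} bytes/request -> {throughput_mbit(size):8.1f} Mbit/s")
```

Under this assumption a 1024-byte block yields roughly 98 Mbit/s while a 16384-byte block yields roughly 1573 Mbit/s, which is why the packet size matters so much for an offload engine.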
In addition, about ipq806x, since you mentioned it: there the situation is different. The armv7 architecture has no special AES instruction set, unlike the Cortex-A53, so on a 32-bit ipq806x system the benefit will be bigger. Now comes the QCA joke: vendor firmwares using the ipq807x chipset run only in 32-bit mode and cannot benefit from the arm64 AES instructions, so NSS crypto makes more sense for them in 32-bit mode. But for us, since we run it in 64-bit mode, the situation is different.
Reading all of your posts is really useful.
So, are there any recommendations about these settings for ipq807x?
Screenshot is from my NSS ipq806x menuconfig.
Reading the help for both options,
Enable ChaCha20-Poly1305 ciphersuite support
Prefer ChaCha20-Poly1305 over AES-GCM by default
Do I understand correctly that the option below
Prefer ChaCha20-Poly1305 over AES-GCM by default
is beneficial for ipq806x but not for ipq807x (because the latter has AES instructions)? Maybe it doesn't matter if the application is sensibly coded, since the help says it is only a default that the application can always override.
Has anyone benchmarked those?
@jkool702 You may want to chime in with the discussion about NSS crypto acceleration again considering the recent NSS crypto fix.
If you compile WireGuard with the kernel, the necessary components are selected anyway since they are dependencies. WireGuard requires chacha20poly1305, libcurve25519 and blake2s. Make sure that the arm64-optimized variants are compiled in, and please look into the kernel config itself. The OpenWrt menuconfig might not be complete.
I did some benchmark tests using openssl
openssl speed -seconds 10 -engine devcrypto aes
type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes | 16384 bytes |
---|---|---|---|---|---|---|
aes-128-cbc | 18715.49k | 63367.38k | 411619.20k | 1463933.16k | 13153177.60k | 25966533.49k |
aes-192-cbc | 13718.83k | 68621.87k | 274562.13k | 1012917.17k | 8759432.53k | 23302261.03k |
aes-256-cbc | 22865.07k | 102972.00k | 366014.58k | 1879844.57k | 11666591.29k | 17451417.60k |
openssl speed -seconds 10 aes
type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes | 16384 bytes |
---|---|---|---|---|---|---|
aes-128-cbc | 110452.12k | 336755.46k | 668732.72k | 920181.04k | 1033949.30k | 1040359.42k |
aes-192-cbc | 107795.87k | 306096.39k | 552953.88k | 712999.01k | 778744.31k | 781908.38k |
aes-256-cbc | 104773.77k | 281387.54k | 480802.56k | 597366.17k | 642897.34k | 644995.48k |
nss / non-nss in %
type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes | 16384 bytes |
---|---|---|---|---|---|---|
aes-128-cbc | 16.94% | 18.82% | 61.55% | 159.09% | 1272.13% | 2495.92% |
aes-192-cbc | 12.73% | 22.42% | 49.65% | 142.06% | 1124.81% | 2980.18% |
aes-256-cbc | 21.82% | 36.59% | 76.13% | 314.69% | 1814.69% | 2705.67% |
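As a cross-check of the percentage table, the following Python sketch recomputes the NSS / non-NSS ratio for the aes-128-cbc row from the two `openssl speed` tables above. The variable and function names are made up for illustration; only the numbers come from the post.

```python
# Throughput in 1000s of bytes/s, copied from the aes-128-cbc rows above.
SIZES = (16, 64, 256, 1024, 8192, 16384)
NSS = (18715.49, 63367.38, 411619.20, 1463933.16, 13153177.60, 25966533.49)
NON_NSS = (110452.12, 336755.46, 668732.72, 920181.04, 1033949.30, 1040359.42)


def ratio_percent(nss, non_nss):
    """Return NSS throughput as a percentage of non-NSS throughput per size."""
    return [100.0 * a / b for a, b in zip(nss, non_nss)]


ratios = ratio_percent(NSS, NON_NSS)
for size, r in zip(SIZES, ratios):
    print(f"{size:>6} bytes: {r:8.2f}%")

# Smallest tested block size at which the NSS run beats the non-NSS run.
break_even = next(size for size, r in zip(SIZES, ratios) if r > 100.0)
print("NSS faster from", break_even, "bytes")
```

With these figures the devcrypto run only overtakes the plain run from the 1024-byte column onwards; below that the software path is faster.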
Some observations: non-nss accelerated tests will occupy 25% of the CPU (single core), while nss accelerated tests use 4% CPU. Besides, I cannot get similar results for aes-gcm. It seems nss will be bypassed and the result will be the same as non-nss tests.
Your non-NSS results don't look like they were made without any hw crypto such as qce.
Here are my results without any hw crypto involved, using your parameters:
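type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes | 16384 bytes |
---|---|---|---|---|---|---|
aes-128 cbc | 81911.97k | 91870.11k | 94795.93k | 95669.00k | 95494.02k | 95876.07k |
aes-192 cbc | 73744.70k | 79307.03k | 81643.58k | 81960.06k | 82279.79k | 82146.55k |
aes-256 cbc | 65395.20k | 69744.83k | 71300.04k | 72152.72k | 71928.57k | 72174.09k |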
It gets more interesting with -evp:
type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes | 16384 bytes |
---|---|---|---|---|---|---|
aes-128-cbc | 139651.85k | 516102.61k | 1053835.94k | 1461820.34k | 1646942.30k | 1668023.60k |
What I don't understand is why tcrypt results in such poor speed.
Sometimes the 5 GHz radio goes down with the error below, and even with 802.11r the error spams the system log like hell.
It also seems to be caused mostly by an iPhone client.
peer ast idx 965 can't be found
But after I reapplied the iPhone fix, the error disappeared.
Edit: I spoke too soon, the error shows up again.
This is not an error. It only means your client disconnected and the driver could not update the client statistics because the connection was lost, so the client left without notifying the driver of the disconnect. But I will do some research in case there is another reason for it (I have some ideas).
Could you include the Netgear WAX218 in your builds? I tried building it myself, but all clients disconnect after a couple of seconds.
here are the benchmarks with my latest code using nss crypto
The 'numbers' are in 1000s of bytes per second processed.
type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes | 16384 bytes |
---|---|---|---|---|---|---|
aes-128-cbc | 25711.00k | 91355.73k | 365448.53k | 1519859.20k | 34978747.73k | 31416866.13k |
aes-192-cbc | 25697.40k | 274026.67k | 298968.44k | 1315891.20k | 104862515.20k | 30430549.33k |
aes-256-cbc | 51407.60k | 411312.00k | 411305.60k | 1195836.51k | 17406225.07k | 22687470.93k |
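Since the thread keeps comparing these figures against gigabit line rate, a small Python sketch of the unit conversion may help. It relies only on the note above that the numbers are in 1000s of bytes per second; the helper name is hypothetical, and the result is raw cipher throughput, not VPN or network throughput.

```python
def openssl_k_to_mbit(k_value: float) -> float:
    """Convert an `openssl speed` figure (1000s of bytes/s) to Mbit/s."""
    return k_value * 1000 * 8 / 1_000_000


# aes-128-cbc at 1024 bytes, taken from the table above.
aes128_1024 = 1519859.20
print(f"{openssl_k_to_mbit(aes128_1024):.1f} Mbit/s")

# For reference: 125000.00k corresponds to 1000 Mbit/s, i.e. gigabit line rate.
print(f"{openssl_k_to_mbit(125000.0):.1f} Mbit/s")
```

By this conversion, any column showing more than 125000k is already above what a single gigabit port can carry.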
The numbers have doubled. Can you release the code?