@Ansuel @quarky I did some work on the NSS commits, trying to evaluate the cleanest way to "add NSS to master" (...as clean as possible). The NBG6817, C2600, EA7500v1, and EA8500 are all working per user reports after some dts changes, so it seems like NSS is ready for some inclusion into master.
Good things:
All the main packages are in package/qca - tons of the commits are small edits due to minor compile errors. Just loading the packages into master would be a huge step forward and seems like the first major step.
iproute2 has two easy patches, 500-add-nssmirred.patch and 400-add-nss-qdisc.patch - seems like they could make it into master pretty easily.
The wifi offloading patches could likely make it into master - they are straightforward: an addition to the mac80211 Makefile, several patches to the subsys/ath folders, and the ath10-ct folder. I included @facboy's 5.4=>5.8 updates. I couldn't compile mac80211 for some reason when building from scratch... so I might be missing something.
Patches under target/linux/ipq806x/patches-5.4 look organized and, while invasive, could make it into master since they are ipq806x specific. As mentioned, each patch could be included with its driver as a commit to give it an organized appearance for master, ex: the ecm-support patches with the ecm package.
The bad:
The NSS voltage and crypto drivers are still in target/linux/ipq806x/files-5.4 (this folder was removed in master). Is there a better location for these - do they make sense in package/qca or in the new "files" folder?
The L2 scaling driver is spread throughout the build. It would be best if the latest version was integrated into master, or if we pulled it out for the purposes of getting some of NSS into master. I can go back 30 commits to pull files out - is there a suggested cleaner option to revert the changes?
The dts and dtsi files need multiple changes. Several of the patches and commits conflict. It might be easiest to commit the change straight from master to the final product and delete several of the historic intermediate changes - ex: the 990-00-add-required-entries-in-dts-files patch.
Is there a suggested way to make the hardware offloading on/off switch in the firewall usable for this? It would be helpful if there were NSS issues with particular features (ex: SQM with cake), so that you could choose between hardware offloading and the plain path.
I just uploaded all my qca-nss-cfi and qca-nss-crypto changes. I started from a "clean" source and tried to add my changes in a logical way via separate commits. All commits are needed to make it actually work again.
No patches after this are needed and the 999-04-qca-nss-cfi-support.patch can and should be removed. Exception is the core/crypto clock patch.
I'm recompiling now from make distclean to make sure I didn't forget anything and it works as expected. Test device is an NBG6817.
Maybe I should start a separate thread for this to get more people to see it, but:
Does anyone still know how OpenSwan was using their own IPSECx devices via KLIPS? I know they had their own tool to add "ipsec0:eth0" and this could be intercepted / handed to the nss ipsec driver/manager. Was an IPSEC device always connected to just one inbound and one outbound SA and policy?
This will help me make more sense of the code that is used to add ipsec offload to the nss cores.
I backported the patch from spf11.3 to spf11.2 and tried it with a delay of 20 packets (planning to reduce it to 10). So far the results have been good.
Keep in mind that the qca-nss-cfi like this still contains some cheating code to limit the cryptlen to 65520 to get performance estimates using cryptsetup (which uses 64KB).
This should be removed, and either a failure should be returned or the request should be split into more parts. (This is only relevant for CBC encrypt; all other modes work without cryptlen limitations, as they can be split into smaller parts and sent to the engine straight away.)
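To picture the "split into more parts" option, here is a minimal sketch in userspace Python (not driver code) that cuts a request length into pieces that stay under the 65520-byte limit and fall on 16-byte AES block boundaries. The function name and parameters are hypothetical and are not taken from qca-nss-cfi.

```python
AES_BLOCK = 16          # AES block size in bytes
MAX_CRYPTLEN = 65520    # limit mentioned above (a multiple of 16)

def split_lengths(total, limit=MAX_CRYPTLEN, block=AES_BLOCK):
    """Return chunk lengths that sum to `total`, each <= limit.

    Every chunk except possibly the last is a multiple of `block`,
    so a request can be cut on cipher block boundaries.
    """
    if total < 0 or limit < block:
        raise ValueError("invalid sizes")
    step = limit - (limit % block)   # largest block-aligned chunk
    chunks = []
    remaining = total
    while remaining > 0:
        n = min(step, remaining)
        chunks.append(n)
        remaining -= n
    return chunks

# A 64 KB request, as issued by cryptsetup in the test described above.
print(split_lengths(64 * 1024))
```

For a 64 KB request this gives one 65520-byte part and one 16-byte part. For CBC encryption, each part after the first also needs the last ciphertext block of the previous part as its IV.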
I am also not getting the performance I want (expect) from the EIP93, see my github, and try the "split" branch if you want to have a look.
I created a dummy engine, which does absolutely nothing except run all the code without actual hardware or transformation, to see where in the driver code performance could be improved the most. So far the conclusion is: don't queue any requests; put them directly into the hardware and poll for the result without returning, without any work-queue, tasklet or interrupt, for best performance.
I will benchmark my version against your patches later.
From what I can piece together, NSS firmware generates interrupts for all task completions. The qca-nss-drv handles the generic interrupt and calls the callback function that’s registered, be it network or crypto. ‘Polling’ in the NSS context basically made the code synchronous, but it is still polling a semaphore that’s set by the interrupt routine.
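The pattern described here (an asynchronous completion callback, wrapped so the caller appears synchronous by waiting on a semaphore) can be sketched in userspace Python. This is only an illustration of the pattern: `FakeEngine`, `submit` and `submit_and_wait` are made-up names, not the qca-nss-drv API, and a thread stands in for the interrupt.

```python
import threading

class FakeEngine:
    """Stand-in for an asynchronous engine (hypothetical, not the NSS API)."""

    def submit(self, data, callback):
        # Completion arrives on another thread, like an interrupt handler
        # invoking a registered callback. The "work" is just a byte reversal.
        worker = threading.Thread(target=callback, args=(data[::-1],))
        worker.start()

def submit_and_wait(engine, data, timeout=1.0):
    """Make the asynchronous submit look synchronous to the caller."""
    done = threading.Semaphore(0)
    result = {}

    def on_complete(output):
        result["output"] = output
        done.release()                      # what the "interrupt" side does

    engine.submit(data, on_complete)
    if not done.acquire(timeout=timeout):   # the caller blocks here
        raise TimeoutError("engine did not complete in time")
    return result["output"]

print(submit_and_wait(FakeEngine(), b"abc"))
```

The caller never sees the callback; it only waits on the semaphore, which is why this style behaves synchronously even though completion is still signalled from elsewhere.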
NSS has 4 crypto engines that can operate concurrently. For CBC mode, you’re constrained to using only one engine due to the IV. For ECB you can likely increase throughput by issuing encryption/decryption requests asynchronously 4 at a time and waiting for completion.
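A rough sketch of why ECB lends itself to "4 at a time": every block is transformed independently, so the data can be cut into block-aligned parts and handed to separate workers, then joined in order. The example below uses a toy XOR "cipher" and a thread pool purely as stand-ins; it is not AES and not the NSS interface, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16
ENGINES = 4   # number of concurrent engines mentioned above

def toy_encrypt_block(key, block):
    """Toy stand-in for a block cipher: XOR with the key. NOT secure, NOT AES."""
    return bytes(b ^ k for b, k in zip(block, key))

def ecb_encrypt(key, data):
    """Encrypt each 16-byte block independently (ECB)."""
    if len(data) % BLOCK:
        raise ValueError("data must be a multiple of the block size")
    return b"".join(toy_encrypt_block(key, data[i:i + BLOCK])
                    for i in range(0, len(data), BLOCK))

def ecb_encrypt_parallel(key, data, engines=ENGINES):
    """Split the data into block-aligned parts and encrypt them concurrently."""
    if len(data) % BLOCK:
        raise ValueError("data must be a multiple of the block size")
    blocks = len(data) // BLOCK
    blocks_per_part = max(1, -(-blocks // engines))   # ceiling division
    per_part = blocks_per_part * BLOCK
    parts = [data[i:i + per_part] for i in range(0, len(data), per_part)]
    with ThreadPoolExecutor(max_workers=engines) as pool:
        # map() preserves input order, so the parts can simply be joined.
        return b"".join(pool.map(lambda p: ecb_encrypt(key, p), parts))

key = bytes(range(16))
data = bytes(range(256)) * 4
print(ecb_encrypt_parallel(key, data) == ecb_encrypt(key, data))
```

CBC encryption cannot be cut up this way, because the input to each block depends on the ciphertext of the block before it.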
I had some time to do some comparing between my almost refactored crypto-cfi and the patched version that we started with.
The bad news:
I can't do a cryptsetup comparison because the original version is not able to do that. Just adding the multi-buffer patch is not enough to make that work.
Because other parts needed to be reworked I can't even do a "simple" tcrypt comparison because it will fail on setkey().
Tomorrow I should have more time to do an ipsec / iperf3 comparison.
So I ran a few openssl speed tests to compare:
All the added code to make the driver pass all the extended fuzz tests didn't affect performance, but it did make the driver scatter/gather (multi buffer) capable, except for CBC-encrypt... for which I'm almost sure there must be a way to reuse the IV from the previous operation.
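On the idea of reusing the IV from the previous operation: in textbook CBC, the last ciphertext block of one request is exactly the IV the next request needs, so a large request can be issued as several smaller ones. The sketch below checks that property with a toy block function (not AES, not secure). Whether the NSS engine lets the driver supply the IV per request in this way is not established in this thread, and all names here are hypothetical.

```python
BLOCK = 16

def toy_encrypt_block(key, block):
    """Toy keyed permutation standing in for AES. NOT secure."""
    mixed = bytes(b ^ k for b, k in zip(block, key))
    return mixed[1:] + mixed[:1]     # rotate so the output is not a plain XOR

def cbc_encrypt(key, iv, plaintext):
    """Textbook CBC: C[i] = E(P[i] XOR C[i-1]), with C[-1] = IV."""
    if len(plaintext) % BLOCK:
        raise ValueError("plaintext must be a multiple of the block size")
    prev, out = iv, []
    for i in range(0, len(plaintext), BLOCK):
        x = bytes(p ^ c for p, c in zip(plaintext[i:i + BLOCK], prev))
        prev = toy_encrypt_block(key, x)
        out.append(prev)
    return b"".join(out)

def cbc_encrypt_chunked(key, iv, plaintext, chunk):
    """Encrypt in separate requests, chaining the IV between them."""
    if chunk <= 0 or chunk % BLOCK:
        raise ValueError("chunk must be a positive multiple of the block size")
    out = []
    for i in range(0, len(plaintext), chunk):
        part = cbc_encrypt(key, iv, plaintext[i:i + chunk])
        out.append(part)
        iv = part[-BLOCK:]           # last ciphertext block becomes the next IV
    return b"".join(out)

key = bytes(range(16))
iv = bytes(16)
data = bytes(i % 251 for i in range(4096))
print(cbc_encrypt_chunked(key, iv, data, 1024) == cbc_encrypt(key, iv, data))
```

The chunked and one-shot results match for any block function, since the chaining rule is all that matters here.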
The other bad: neither version is good enough to even come close to software crypto when it's needed from userspace (openssl / ssh / openvpn etc.).
Let's see tomorrow how ipsec holds up: authenc(hmac(sha x), cbc/ctr aes). I did find OpenSwan code from 2017 with all the KLIPS code, so maybe there is an "easy" way to add ipsec offload to the nss core via a virtual interface.
There are two ways to get GCM to work in combination with the nss-crypto:
1: Use a bounce buffer to the destination scatterlist: this kills performance to the point that it is only faster with blocks greater than, let's say, 4096 bytes.
2: Hack (patch) the mainline gcm.c to add padding to the internal structures to get them cache-line aligned. This option is shown below, and it also seems not very useful, even from kernel space, as the tcrypt results show:
[ 154.174293]
[ 154.174293] testing speed of rfc4106(gcm(aes)) (rfc4106(gcm_base(ctr(aes-generic),ghash-generic))) encryption
[ 154.174402] test 0 (160 bit key, 16 byte blocks):
[ 154.174679] 1 operation in 120 cycles (16 bytes)
[ 154.189402] test 1 (160 bit key, 64 byte blocks):
[ 154.189739] 1 operation in 162 cycles (64 bytes)
[ 154.198775] test 2 (160 bit key, 256 byte blocks):
[ 154.199434] 1 operation in 330 cycles (256 bytes)
[ 154.208149] test 3 (160 bit key, 512 byte blocks):
[ 154.209217] 1 operation in 544 cycles (512 bytes)
[ 154.217698] test 4 (160 bit key, 1024 byte blocks):
[ 154.219593] 1 operation in 974 cycles (1024 bytes)
[ 154.227720] test 5 (160 bit key, 2048 byte blocks):
[ 154.231257] 1 operation in 1830 cycles (2048 bytes)
[ 154.237429] test 6 (160 bit key, 4096 byte blocks):
[ 154.244375] 1 operation in 3635 cycles (4096 bytes)
[ 154.248445] test 7 (160 bit key, 8192 byte blocks):
[ 154.262009] 1 operation in 6998 cycles (8192 bytes)
[ 154.266229] testing speed of gcm(aes) (gcm_base(ctr(aes-generic),ghash-generic)) encryption
[ 154.270687] test 0 (128 bit key, 16 byte blocks):
[ 154.270912] 1 operation in 103 cycles (16 bytes)
[ 154.285445] test 1 (128 bit key, 64 byte blocks):
[ 154.285753] 1 operation in 147 cycles (64 bytes)
[ 154.294816] test 2 (128 bit key, 256 byte blocks):
[ 154.295444] 1 operation in 315 cycles (256 bytes)
[ 154.304220] test 3 (128 bit key, 512 byte blocks):
[ 154.305261] 1 operation in 530 cycles (512 bytes)
[ 154.313763] test 4 (128 bit key, 1024 byte blocks):
[ 154.315630] 1 operation in 960 cycles (1024 bytes)
[ 154.323651] test 5 (128 bit key, 2048 byte blocks):
[ 154.327156] 1 operation in 1812 cycles (2048 bytes)
[ 154.333366] test 6 (128 bit key, 4096 byte blocks):
[ 154.340166] 1 operation in 3531 cycles (4096 bytes)
[ 154.344341] test 7 (128 bit key, 8192 byte blocks):
..
[ 99.141587]
[ 99.141587] testing speed of rfc4106(gcm(aes)) (rfc4106(gcm_base(nss-ctr-aes,ghash-generic))) encryption
[ 99.142372] test 0 (160 bit key, 16 byte blocks):
[ 99.146590] 1 operation in 2048 cycles (16 bytes)
[ 99.156794] test 1 (160 bit key, 64 byte blocks):
[ 99.159406] 1 operation in 1152 cycles (64 bytes)
[ 99.166308] test 2 (160 bit key, 256 byte blocks):
[ 99.168400] 1 operation in 1022 cycles (256 bytes)
[ 99.175449] test 3 (160 bit key, 512 byte blocks):
[ 99.177448] 1 operation in 1023 cycles (512 bytes)
[ 99.185116] test 4 (160 bit key, 1024 byte blocks):
[ 99.187634] 1 operation in 1138 cycles (1024 bytes)
[ 99.195108] test 5 (160 bit key, 2048 byte blocks):
[ 99.198131] 1 operation in 1370 cycles (2048 bytes)
[ 99.204935] test 6 (160 bit key, 4096 byte blocks):
[ 99.211432] 1 operation in 3931 cycles (4096 bytes)
[ 99.215910] test 7 (160 bit key, 8192 byte blocks):
[ 99.223960] 1 operation in 4246 cycles (8192 bytes)
[ 99.228474]
[ 99.228474] testing speed of gcm(aes) (gcm_base(nss-ctr-aes,ghash-generic)) encryption
[ 99.233280] test 0 (128 bit key, 16 byte blocks):
[ 99.237224] 1 operation in 2057 cycles (16 bytes)
[ 99.247206] test 1 (128 bit key, 64 byte blocks):
[ 99.249496] 1 operation in 1275 cycles (64 bytes)
[ 99.256738] test 2 (128 bit key, 256 byte blocks):
[ 99.259189] 1 operation in 1023 cycles (256 bytes)
[ 99.266072] test 3 (128 bit key, 512 byte blocks):
[ 99.268215] 1 operation in 1023 cycles (512 bytes)
[ 99.275712] test 4 (128 bit key, 1024 byte blocks):
[ 99.277882] 1 operation in 1151 cycles (1024 bytes)
[ 99.285707] test 5 (128 bit key, 2048 byte blocks):
[ 99.289550] 1 operation in 2044 cycles (2048 bytes)
[ 99.295538] test 6 (128 bit key, 4096 byte blocks):
[ 99.299460] 1 operation in 2047 cycles (4096 bytes)
[ 99.305204] test 7 (128 bit key, 8192 byte blocks):
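To make the two logs easier to compare, here is a small script that turns the rfc4106(gcm(aes)) figures above into cycles per byte and a ratio. The numbers are copied from the logs; the script itself is only a convenience, and since each log line reports a single timed operation, small differences should be read with caution.

```python
# (block size in bytes) -> cycles for one operation, copied from the
# rfc4106(gcm(aes)) sections of the two logs above.
generic = {16: 120, 64: 162, 256: 330, 512: 544,
           1024: 974, 2048: 1830, 4096: 3635, 8192: 6998}
nss = {16: 2048, 64: 1152, 256: 1022, 512: 1023,
       1024: 1138, 2048: 1370, 4096: 3931, 8192: 4246}

def cycles_per_byte(table):
    """Convert a {size: cycles} table into {size: cycles per byte}."""
    return {size: cycles / size for size, cycles in table.items()}

generic_cpb = cycles_per_byte(generic)
nss_cpb = cycles_per_byte(nss)

print(f"{'bytes':>6} {'generic c/B':>12} {'nss c/B':>9} {'nss/generic':>12}")
for size in sorted(generic):
    ratio = nss[size] / generic[size]
    print(f"{size:>6} {generic_cpb[size]:>12.2f} {nss_cpb[size]:>9.2f} {ratio:>12.2f}")

# Block sizes where the nss-ctr-aes variant needed fewer cycles.
nss_faster = [size for size in sorted(generic) if nss[size] < generic[size]]
print("nss-ctr-aes needed fewer cycles at:", nss_faster)
```

With these figures the nss-ctr-aes variant only comes out ahead at 2048 and 8192 bytes, and is far behind for small blocks.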
I added some ipsec performance numbers to my github. This is without "tricks" like adding "echainiv" or "seqiv" templates and still without the use of any NSS Virtual Interface. I also removed the "hack" to make the skb linear within e.g. the esp4.c file.
I will test later with the "original" version, which uses the IV templates. The reason I removed them is that there is no IV generation at all in the NSS core or the driver, which means it always uses the same IV based on the raw sequence number.
Having issues with the nss-volt-ipq806x.c driver: target/linux is failing to build. It built fine on the December 31st master - I don't know what has changed in master since that date that would cause the issue. Is anyone able to read this log and help guide me to a fix?
make[5]: Entering directory '/home/HTPC/OpenWRT/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/linux-5.4.89'
CALL scripts/atomic/check-atomics.sh
CALL scripts/checksyscalls.sh
CHK include/generated/compile.h
CC drivers/regulator/nss-volt-ipq806x.o
drivers/regulator/nss-volt-ipq806x.c:184:5: error: redefinition of 'nss_ramp_voltage'
int nss_ramp_voltage(unsigned long rate, bool ramp_up)
^~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:34:5: note: previous definition of 'nss_ramp_voltage' was here
int nss_ramp_voltage(unsigned long rate, bool ramp_up)
^~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:215:34: error: redefinition of 'nss_ipq806x_match_table'
static const struct of_device_id nss_ipq806x_match_table[] = {
^~~~~~~~~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:73:34: note: previous definition of 'nss_ipq806x_match_table' was here
static const struct of_device_id nss_ipq806x_match_table[] = {
^~~~~~~~~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:220:12: error: redefinition of 'nss_volt_ipq806x_probe'
static int nss_volt_ipq806x_probe(struct platform_device *pdev)
^~~~~~~~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:78:12: note: previous definition of 'nss_volt_ipq806x_probe' was here
static int nss_volt_ipq806x_probe(struct platform_device *pdev)
^~~~~~~~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:266:31: error: redefinition of 'nss_ipq806x_driver'
static struct platform_driver nss_ipq806x_driver = {
^~~~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:126:31: note: previous definition of 'nss_ipq806x_driver' was here
static struct platform_driver nss_ipq806x_driver = {
^~~~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:275:19: error: redefinition of 'nss_ipq806x_init'
static int __init nss_ipq806x_init(void)
^~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:135:19: note: previous definition of 'nss_ipq806x_init' was here
static int __init nss_ipq806x_init(void)
^~~~~~~~~~~~~~~~
In file included from ./include/linux/printk.h:6,
from ./include/linux/kernel.h:15,
from drivers/regulator/nss-volt-ipq806x.c:17:
./include/linux/init.h:196:20: error: redefinition of '__initcall_nss_ipq806x_init7'
static initcall_t __initcall_##fn##id __used \
^~~~~~~~~~~
./include/linux/init.h:200:35: note: in expansion of macro '___define_initcall'
#define __define_initcall(fn, id) ___define_initcall(fn, id, .initcall##id)
^~~~~~~~~~~~~~~~~~
./include/linux/init.h:231:28: note: in expansion of macro '__define_initcall'
#define late_initcall(fn) __define_initcall(fn, 7)
^~~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:279:1: note: in expansion of macro 'late_initcall'
late_initcall(nss_ipq806x_init);
^~~~~~~~~~~~~
./include/linux/init.h:196:20: note: previous definition of '__initcall_nss_ipq806x_init7' was here
static initcall_t __initcall_##fn##id __used \
^~~~~~~~~~~
./include/linux/init.h:200:35: note: in expansion of macro '___define_initcall'
#define __define_initcall(fn, id) ___define_initcall(fn, id, .initcall##id)
^~~~~~~~~~~~~~~~~~
./include/linux/init.h:231:28: note: in expansion of macro '__define_initcall'
#define late_initcall(fn) __define_initcall(fn, 7)
^~~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:139:1: note: in expansion of macro 'late_initcall'
late_initcall(nss_ipq806x_init);
^~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:281:20: error: redefinition of 'nss_ipq806x_exit'
static void __exit nss_ipq806x_exit(void)
^~~~~~~~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:141:20: note: previous definition of 'nss_ipq806x_exit' was here
static void __exit nss_ipq806x_exit(void)
^~~~~~~~~~~~~~~~
In file included from ./include/linux/printk.h:6,
from ./include/linux/kernel.h:15,
from drivers/regulator/nss-volt-ipq806x.c:17:
./include/linux/init.h:237:20: error: redefinition of '__exitcall_nss_ipq806x_exit'
static exitcall_t __exitcall_##fn __exit_call = fn
^~~~~~~~~~~
./include/linux/module.h:97:24: note: in expansion of macro '__exitcall'
#define module_exit(x) __exitcall(x);
^~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:285:1: note: in expansion of macro 'module_exit'
module_exit(nss_ipq806x_exit);
^~~~~~~~~~~
./include/linux/init.h:237:20: note: previous definition of '__exitcall_nss_ipq806x_exit' was here
static exitcall_t __exitcall_##fn __exit_call = fn
^~~~~~~~~~~
./include/linux/module.h:97:24: note: in expansion of macro '__exitcall'
#define module_exit(x) __exitcall(x);
^~~~~~~~~~
drivers/regulator/nss-volt-ipq806x.c:145:1: note: in expansion of macro 'module_exit'
module_exit(nss_ipq806x_exit);
^~~~~~~~~~~
make[7]: *** [scripts/Makefile.build:262: drivers/regulator/nss-volt-ipq806x.o] Error 1
make[6]: *** [scripts/Makefile.build:496: drivers/regulator] Error 2
make[6]: *** Waiting for unfinished jobs....
make[5]: *** [Makefile:1732: drivers] Error 2
make[5]: Leaving directory '/home/HTPC/OpenWRT/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/linux-5.4.89'
make[4]: *** [Makefile:27: /home/HTPC/OpenWRT/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/linux-5.4.89/.image] Error 2
make[4]: Leaving directory '/home/HTPC/OpenWRT/openwrt/target/linux/ipq806x'
make[3]: *** [Makefile:13: install] Error 2
make[3]: Leaving directory '/home/HTPC/OpenWRT/openwrt/target/linux'
time: target/linux/install#42.65#23.77#20.86
ERROR: target/linux failed to build.
I can't judge the quality of your code, but I can tell you that I successfully built an image using my standard diffconfig_for_nss. I did not flash the image; I'm waiting for more feedback before doing that.
I’m using it for NAT and wireless, and I have VPN in my build (I’m not using VPN, but both OpenVPN-OpenSSL & WireGuard are in my build). Is anyone seeing any performance advantages from using NSS crypto, shortcut-fe, or any of the additional packages over my current packages?