IPQ806x NSS Drivers

Yes, all working now. Thanks!

It should register an NSS virtual interface per configured wireless interface, e.g. wlan0, wlan1, etc. Are you seeing multiple NSS virtual interfaces registered for a single wireless interface?

If you’re referring to the number reported in the qca-nss-drv debugfs output, i.e. 2 created per wireless interface, my understanding is that this is correct. An NSS virtual interface has a host-to-NSS and an NSS-to-host sub-interface. These are the sub-interfaces I made use of to create the nss-ifb driver.
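
If anyone wants to double-check what got registered, the qca-nss-drv debugfs directory is the place to look. Something along these lines should work (treat the exact paths and file names as an example only; the layout differs between driver versions):

# debugfs is normally already mounted on OpenWrt
mount -t debugfs none /sys/kernel/debug 2>/dev/null
# list what qca-nss-drv exposes; the stats layout varies between
# driver versions, so explore from here
ls /sys/kernel/debug/qca-nss-drv/
ls /sys/kernel/debug/qca-nss-drv/stats/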

Hope I’m not confusing anyone with the explanation above.

Yep, I'm seeing the 2 registrations per interface... For some reason Wi-Fi offload is not working on my repo.
Did you test Wi-Fi offload with the latest version of the OpenWrt mac80211 package?

Not yet. My repo is branched from master in June 2020. Have yet to pull updates. Still messing around with OpenVPN NSS crypto.

What symptoms are you seeing?

I was told Wi-Fi offload isn't working.
The patch needed some changes, but they were not related to the NSS functions.
The logs show that the virtual interfaces register correctly.

I plan to relook into the mac80211 stack after I get bored with the crypto stuff, heh heh.

I suspect ECM is not picking up the network flow correctly and it's still going through the slow path. Maybe try enabling Linux flow offload as a temporary workaround.
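
In case anyone wants to try that workaround, software flow offload can be switched on from the standard OpenWrt firewall config, e.g.:

# enable the kernel's software flow-offload fast path (temporary workaround)
uci set firewall.@defaults[0].flow_offloading='1'
uci commit firewall
/etc/init.d/firewall restart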

Doesn't enabling Linux flow offload bypass ECM?

Have not tried, but from what I understand of the NSS architecture, they can co-exist. The NSS firmware will always take precedence once activated by ECM. Flows not handled by NSS will fall through to the Linux network stack.

NSS takes over the GMAC data plane, so all GMAC packets must flow into NSS first.

Any news there?

Not too sure what you meant there, but I'm guessing you mean testing with software flow offload and NSS both activated? Have not tried, since my R7800 seems to be working OK with part of the Wi-Fi traffic offloaded to NSS.

When I last tried, sirq went down when I enabled the mac80211 offload, so it appears to be working for me. But I only have 2x2 wireless clients to test with, so the most I can achieve is 600 Mbps over Wi-Fi, with about 60% sirq on one CPU core. At this point, I don't know whether the load is caused by mac80211, netfilter, or something else. It'll need some effort to look into.
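
A quick way to narrow that down is to watch the per-CPU softirq counters while a Wi-Fi iperf run is going, e.g.:

# snapshot the NET_RX/NET_TX softirq counters per CPU, wait, then compare
grep -E 'NET_RX|NET_TX' /proc/softirqs; sleep 10; grep -E 'NET_RX|NET_TX' /proc/softirqs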

@robimarko Yes, I tested on the QSDK; OpenVPN is offloaded to NSS there.

@quarky I just tested again using two PCs, because a single PC is a bottleneck since OpenVPN is single-threaded. I got about 383 + 478 = 861 Mbps. It's excellent.

root@OpenWrt:/tmp/etc# cat run-two-iperf.sh 
iperf -c 10.8.0.10 -t 30 & 
iperf -c 10.8.0.14 -t 30

root@OpenWrt:/tmp/etc# ./run-two-iperf.sh                                                                                                                                                                          
------------------------------------------------------------                                                                                                                                                       
Client connecting to 10.8.0.14, TCP port 5001                                                                                                                                                                      
TCP window size: 45.0 KByte (default)                                                                                                                                                                              
------------------------------------------------------------                                                                                                                                                       
[  3] local 10.8.0.1 port 60504 connected with 10.8.0.14 port 5001                                                                                                                                                 
------------------------------------------------------------                                                                                                                                                       
Client connecting to 10.8.0.10, TCP port 5001                                                                                                                                                                      
TCP window size: 45.0 KByte (default)                                                                                                                                                                              
------------------------------------------------------------                                                                                                                                                       
[  3] local 10.8.0.1 port 34274 connected with 10.8.0.10 port 5001                                                                                                                                                 
[ ID] Interval       Transfer     Bandwidth                                                                                                                                                                        
[  3]  0.0-30.0 sec  1.34 GBytes   383 Mbits/sec                                                                                                                                                                   
[ ID] Interval       Transfer     Bandwidth                                                                                                                                                                        
[  3]  0.0-30.0 sec  1.67 GBytes   478 Mbits/sec        

If I disable NSS offload (remove "enable-dca" from the configuration file), it's only about 111 Mbps.

root@OpenWrt:/tmp/etc# iperf -c 10.8.0.6 -t 30
------------------------------------------------------------
Client connecting to 10.8.0.6, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.8.0.1 port 37982 connected with 10.8.0.6 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec   397 MBytes   111 Mbits/sec
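
For clarity, the toggle I mean is just the "enable-dca" line in the OpenVPN config file of the NSS-accelerated build (it's an option added by those patches, not stock OpenVPN); the config path below is only an example:

# NSS-accelerated OpenVPN build only: "enable-dca" turns on the NSS offload,
# removing or commenting it out falls back to the software path
grep enable-dca /etc/openvpn/*.conf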

@quarky Do you think it's possible to port the IPQ807x OpenVPN NSS offload patches to IPQ806x?

Nope, or at least we would need to use something else, as our NSS core and firmware don't support OpenVPN acceleration.
(For example, ipq807x has Wi-Fi acceleration deeply integrated in the Wi-Fi firmware; our implementation is different and uses a virtual interface.)

I'm trying out something now, but like what @Ansuel pointed out it'll be tough.

From reading the QSDK NSS drivers, the qvpn interfaces are built into the NSS firmware for the ipq807x SoCs, and they should be able to directly intercept OpenVPN packets for encryption/decryption and send them on their way into the network, provided netfilter clears them first.

To emulate this for the ipq806x SoCs, we would have to build a virtual interface driver that intercepts OpenVPN packets for ciphering and routes them directly, without sending them back to the user space where the OpenVPN app is running. We can probably reuse most of the work done by the QCA folks, but the qvpn NSS interface needs to be reworked using the ipq806x virtual interface mechanism.

Of course, all of this is theoretical at the moment. I'm at the stage where I'm trying to understand how to shoot OpenVPN's packets into the NSS crypto instead of OpenSSL. Currently OpenVPN does a two-step process if the connection is using AES-CBC with SHA HMAC, so I'm trying to shoot this straight into the NSS crypto engine and do the ciphering and authentication in a single step. The hope is to improve performance.
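
To make the two-step vs. one-step point concrete: with a classic cipher/auth pair OpenVPN encrypts and then HMACs each packet in two separate operations, while an AEAD cipher does both in one pass. In plain OpenVPN config terms (standard options, nothing NSS-specific here):

# two-step: separate encryption and authentication per packet
cipher AES-128-CBC
auth SHA1

# one-step: AEAD cipher, encryption + authentication together
cipher AES-128-GCM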

If it works, then we can probably think about emulating the qvpn interfaces in the ipq806x SoCs.

In the end, it may all be for nothing as the performance may not be good at all, but it'll be an interesting project to mess with. Even if we somehow manage to improve performance, I'm thinking it'll probably not be by a lot. The ipq807x SoC is quite a bit faster compared to the ipq806x. The ipq807x NSS core clock speed is already nearing the ipq806x Krait CPU clock.

Even if we only manage to keep performance the same but drastically reduce the Krait CPU usage, I would still consider that a win. Heh heh.

@quarky Would you share the commands for the NSS crypto bench tool on ipq806x, for reference?

@devs: I'm getting this log message every five minutes. It is related to an access point connected via WDS to my R7800 running the NSS build:

wlan0.sta1: NSS TX failed with error[5]: NSS_TX_FAILURE_TOO_SHORT

I don't see any connection problems related to it and just wanted to give you smart guys a heads-up.

@quarky, OK, after checking the source code and testing, I have worked out the commands. The performance is good on ipq806x too. The problem is how to use the crypto engine from user space, e.g. from OpenVPN.

root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 1024 > bam_len
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 128 > cipher_len
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 50 > loops
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 3 > print
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo bench > cmd
[  809.571868] auth algo SHA1_HMAC
[  809.571897] cipher algo AES
[  809.575482] preparing crypto bench
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo start > cmd
[  814.204889] #bench: completed (reqs = 128, size = 1024, time = 815, mbps = 1286)
[  814.214990] #bench: completed (reqs = 128, size = 1024, time = 991, mbps = 1058)
[  814.222318] #bench: completed (reqs = 128, size = 1024, time = 878, mbps = 1194)
[  814.229676] #bench: completed (reqs = 128, size = 1024, time = 896, mbps = 1170)
[  814.237131] #bench: completed (reqs = 128, size = 1024, time = 1035, mbps = 1013)
[  814.244525] #bench: completed (reqs = 128, size = 1024, time = 913, mbps = 1148)
[  814.252106] #bench: completed (reqs = 128, size = 1024, time = 778, mbps = 1347)
[  814.259252] #bench: completed (reqs = 128, size = 1024, time = 989, mbps = 1060)
[  814.266758] #bench: completed (reqs = 128, size = 1024, time = 955, mbps = 1097)
[  814.274141] #bench: completed (reqs = 128, size = 1024, time = 894, mbps = 1172)
[  814.281387] #bench: completed (reqs = 128, size = 1024, time = 858, mbps = 1222)
[  814.288767] #bench: completed (reqs = 128, size = 1024, time = 810, mbps = 1294)
[  814.296269] #bench: completed (reqs = 128, size = 1024, time = 928, mbps = 1129)
[  814.303620] #bench: completed (reqs = 128, size = 1024, time = 896, mbps = 1170)
[  814.310902] #bench: completed (reqs = 128, size = 1024, time = 839, mbps = 1249)
[  814.318317] #bench: completed (reqs = 128, size = 1024, time = 972, mbps = 1078)
[  814.325770] #bench: completed (reqs = 128, size = 1024, time = 932, mbps = 1125)
[  814.333122] #bench: completed (reqs = 128, size = 1024, time = 880, mbps = 1191)
[  814.340445] #bench: completed (reqs = 128, size = 1024, time = 1016, mbps = 1032)
[  814.347927] #bench: completed (reqs = 128, size = 1024, time = 887, mbps = 1182)
[  814.355332] #bench: completed (reqs = 128, size = 1024, time = 996, mbps = 1052)
[  814.362754] #bench: completed (reqs = 128, size = 1024, time = 963, mbps = 1088)
[  814.370053] #bench: completed (reqs = 128, size = 1024, time = 733, mbps = 1430)
[  814.377484] #bench: completed (reqs = 128, size = 1024, time = 860, mbps = 1219)
[  814.384860] #bench: completed (reqs = 128, size = 1024, time = 828, mbps = 1266)
[  814.392193] #bench: completed (reqs = 128, size = 1024, time = 720, mbps = 1456)
[  814.399555] #bench: completed (reqs = 128, size = 1024, time = 906, mbps = 1157)
[  814.407004] #bench: completed (reqs = 128, size = 1024, time = 867, mbps = 1209)
[  814.414421] #bench: completed (reqs = 128, size = 1024, time = 978, mbps = 1072)
[  814.421686] #bench: completed (reqs = 128, size = 1024, time = 940, mbps = 1115)
[  814.429044] #bench: completed (reqs = 128, size = 1024, time = 906, mbps = 1157)
[  814.436556] #bench: completed (reqs = 128, size = 1024, time = 1013, mbps = 1035)
[  814.443933] #bench: completed (reqs = 128, size = 1024, time = 898, mbps = 1167)
[  814.451266] #bench: completed (reqs = 128, size = 1024, time = 841, mbps = 1246)
[  814.458667] #bench: completed (reqs = 128, size = 1024, time = 967, mbps = 1084)
[  814.466139] #bench: completed (reqs = 128, size = 1024, time = 733, mbps = 1430)
[  814.473484] #bench: completed (reqs = 128, size = 1024, time = 876, mbps = 1197)
[  814.480780] #bench: completed (reqs = 128, size = 1024, time = 845, mbps = 1240)
[  814.488336] #bench: completed (reqs = 128, size = 1024, time = 801, mbps = 1309)
[  814.503004] #bench: completed (reqs = 128, size = 1024, time = 882, mbps = 1188)
[  814.510332] #bench: completed (reqs = 128, size = 1024, time = 820, mbps = 1278)
[  814.517775] #bench: completed (reqs = 128, size = 1024, time = 789, mbps = 1328)
[  814.525122] #bench: completed (reqs = 128, size = 1024, time = 823, mbps = 1274)
[  814.532554] #bench: completed (reqs = 128, size = 1024, time = 860, mbps = 1219)
[  814.539841] #bench: completed (reqs = 128, size = 1024, time = 990, mbps = 1059)
[  814.547271] #bench: completed (reqs = 128, size = 1024, time = 762, mbps = 1376)
[  814.554665] #bench: completed (reqs = 128, size = 1024, time = 896, mbps = 1170)
[  814.561978] #bench: completed (reqs = 128, size = 1024, time = 855, mbps = 1226)
[  814.569323] #bench: completed (reqs = 128, size = 1024, time = 812, mbps = 1291)
[  814.576797] crypto bench is done
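
For reference, the per-loop mbps figures above can be averaged straight from the kernel log with something like:

# average the per-loop throughput reported by crypto_bench
dmesg | grep 'bench: completed' | sed 's/.*mbps = //; s/)//' | \
  awk '{ sum += $1; n++ } END { if (n) printf "avg %d mbps over %d loops\n", sum/n, n }'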

Performance is good, but as soon as we go through userspace... performance is bad.

@rog do you see a lot of these, or only sporadic logs? The driver code will send packets back to the Linux kernel if they fail to go into the NSS layer, so it should not affect the flow. Likely some edge cases that need to be handled.

The changes I pushed into my GitHub repo already allow the use of the NSS crypto engine via the OpenSSL engine mechanism with OpenVPN. Performance is no good though. At the moment it's using the AES-CBC cipher only, so the HMAC is still software based.
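
If anyone wants to poke at it, the standard OpenSSL/OpenVPN plumbing is enough to select the engine once it's installed (the engine name below is just a placeholder for whatever the NSS engine registers itself as):

# list the engines OpenSSL can see and test their capabilities
openssl engine -t -c
# point OpenVPN at it (replace <engine-name> with the name listed above)
openvpn --engine <engine-name> --config /etc/openvpn/client.conf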

Let’s see the results when I manage to get the AEAD cipher working.

They appear exactly every 5 minutes. I thought it could be related to "inactivity polling", whose default time limit seems to be set at 300 seconds, but I don't know if that applies to WDS connections. I tested disabling it, but nothing changed.