IPQ806x NSS Drivers

I plan to take another look at the mac80211 stack after I get bored with the crypto stuff, heh heh.

I suspect ECM is not picking up the network flow correctly and it is still going through the slow path. Maybe try enabling Linux flow offload as a temporary workaround.
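
In case it helps, software flow offloading can be toggled from the shell like this (the standard OpenWrt firewall option; adjust if your build differs):

uci set firewall.@defaults[0].flow_offloading='1'
uci commit firewall
/etc/init.d/firewall restart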

Doesn't enabling Linux flow offload bypass ECM?

Have not tried it, but from what I understand of the NSS architecture, they can co-exist. NSS firmware will always take precedence once activated by ECM. Flows not handled by NSS will fall through to the Linux network stack.

NSS takes over the GMAC data plane, so all GMAC packets must flow into NSS first.

Any news there?

Not too sure what you meant there, but I'm guessing testing with software flow offload and NSS both active? I have not tried that, since my R7800 seems to be working OK with part of the Wi-Fi traffic offloaded to NSS.

When I last tried, sirq went down when I enabled the mac80211 offload, so it appears to be working for me. But I only have 2x2 wireless clients to test with, so the most I can achieve over Wi-Fi is about 600 Mbps, with roughly 60% sirq on one CPU core. At this point, I don't know whether the load is caused by mac80211, netfilter, or something else. It'll need some effort to look into.
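
One quick first check would be the per-core network softirq counters (plain procfs, nothing NSS-specific); if NET_RX on the loaded core climbs in step with the traffic, the time is going into network receive processing:

grep -E 'NET_RX|NET_TX' /proc/softirqs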

@robimarko Yes, I tested on the QSDK; OpenVPN is offloaded to NSS.

@quarky I tested again with two PCs, because a single PC is a bottleneck given that OpenVPN is single-threaded. I got about 383 + 478 = 861 Mbps. It's excellent.

root@OpenWrt:/tmp/etc# cat run-two-iperf.sh 
iperf -c 10.8.0.10 -t 30 & 
iperf -c 10.8.0.14 -t 30

root@OpenWrt:/tmp/etc# ./run-two-iperf.sh                                                                                                                                                                          
------------------------------------------------------------                                                                                                                                                       
Client connecting to 10.8.0.14, TCP port 5001                                                                                                                                                                      
TCP window size: 45.0 KByte (default)                                                                                                                                                                              
------------------------------------------------------------                                                                                                                                                       
[  3] local 10.8.0.1 port 60504 connected with 10.8.0.14 port 5001                                                                                                                                                 
------------------------------------------------------------                                                                                                                                                       
Client connecting to 10.8.0.10, TCP port 5001                                                                                                                                                                      
TCP window size: 45.0 KByte (default)                                                                                                                                                                              
------------------------------------------------------------                                                                                                                                                       
[  3] local 10.8.0.1 port 34274 connected with 10.8.0.10 port 5001                                                                                                                                                 
[ ID] Interval       Transfer     Bandwidth                                                                                                                                                                        
[  3]  0.0-30.0 sec  1.34 GBytes   383 Mbits/sec                                                                                                                                                                   
[ ID] Interval       Transfer     Bandwidth                                                                                                                                                                        
[  3]  0.0-30.0 sec  1.67 GBytes   478 Mbits/sec        

If I disable NSS offload (by removing "enable-dca" from the configuration file), it's only about 111 Mbps.
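
For reference, the offload is controlled by a single line in the OpenVPN configuration file; the surrounding lines below are only illustrative context, not my full config:

# excerpt from an OpenVPN server config on the QSDK build
proto udp
dev tun
cipher AES-128-CBC
auth SHA1
# QSDK-specific option: enables NSS offload (removing it gives the slow-path result below)
enable-dca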

root@OpenWrt:/tmp/etc# iperf -c 10.8.0.6 -t 30
------------------------------------------------------------
Client connecting to 10.8.0.6, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.8.0.1 port 37982 connected with 10.8.0.6 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec   397 MBytes   111 Mbits/sec

@quarky Do you think it's possible to port the IPQ807x OpenVPN NSS offload patches to IPQ806x?

Nope, or at least we'd need to use something else, as our NSS core and firmware don't support OpenVPN acceleration
(for example, ipq807x has Wi-Fi acceleration deeply integrated into the Wi-Fi firmware; our implementation is different and uses a virtual interface).

I'm trying out something now, but as @Ansuel pointed out, it'll be tough.

From reading the QSDK NSS drivers, the qvpn interfaces are built into the NSS firmware for the ipq807x SoCs, so the firmware can directly intercept OpenVPN packets for encryption/decryption and send them on their way into the network, provided netfilter clears them first.

To emulate this for the ipq806x SoCs, we would have to build a virtual interface driver to intercept OpenVPN packets for ciphering and route them directly, without sending them back to user space where the OpenVPN app is running. We can probably reuse most of the work done by the QCA folks, but the qvpn NSS interface needs to be reworked using the ipq806x virtual interface mechanism.

Of course, all of this is theoretical at the moment. I'm at the stage where I'm trying to work out how to shoot OpenVPN's packets into the NSS crypto engine instead of OpenSSL. Currently OpenVPN does a two-step process when the connection uses AES-CBC with SHA-HMAC, so I'm trying to push this straight into the NSS crypto engine and do the ciphering and authentication in a single step. The hope is to improve performance.
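
For what it's worth, OpenSSL exposes the one-step construction as the aes-128-cbc-hmac-sha1 EVP cipher, so the two-step vs. one-step software cost can be compared with openssl speed (assuming a build that provides the stitched cipher; on some platforms it is unavailable):

# two-step: cipher and hash measured separately (HMAC-SHA1 costs a little more than raw SHA1)
openssl speed -evp aes-128-cbc
openssl speed sha1

# one-step: combined cipher+HMAC in a single pass, where the build provides it
openssl speed -evp aes-128-cbc-hmac-sha1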

If it works, then we can probably think about emulating the qvpn interfaces in the ipq806x SoCs.

In the end it may all be for nothing, as the performance may not be good at all, but it'll be an interesting project to mess with. Even if we somehow manage to improve performance, I suspect it won't be by a lot. The ipq807x SoC is quite a bit faster than the ipq806x; the ipq807x NSS core clock is already nearing the ipq806x Krait CPU clock.

If we're able to keep the performance the same but drastically reduce the Krait CPU usage, I would still consider it a win. Heh heh.

@quarky Would you share the commands for the NSS crypto bench tool on ipq806x, for reference?

@devs: I'm getting this log message every five minutes. It is related to an access point connected via WDS to my R7800 running the NSS build:

wlan0.sta1: NSS TX failed with error[5]: NSS_TX_FAILURE_TOO_SHORT

I don't see any connection problems related to it and just wanted to give you smart guys a heads-up.

@quarky, OK, after checking the source code and testing, I've worked out the commands. Performance is good on ipq806x too. The problem is how to use the crypto engine from user space, e.g. from OpenVPN.

root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 1024 > bam_len
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 128 > cipher_len
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 50 > loops
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 3 > print
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo bench > cmd
[  809.571868] auth algo SHA1_HMAC
[  809.571897] cipher algo AES
[  809.575482] preparing crypto bench
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo start > cmd
[  814.204889] #bench: completed (reqs = 128, size = 1024, time = 815, mbps = 1286)
[  814.214990] #bench: completed (reqs = 128, size = 1024, time = 991, mbps = 1058)
[  814.222318] #bench: completed (reqs = 128, size = 1024, time = 878, mbps = 1194)
[  814.229676] #bench: completed (reqs = 128, size = 1024, time = 896, mbps = 1170)
[  814.237131] #bench: completed (reqs = 128, size = 1024, time = 1035, mbps = 1013)
[  814.244525] #bench: completed (reqs = 128, size = 1024, time = 913, mbps = 1148)
[  814.252106] #bench: completed (reqs = 128, size = 1024, time = 778, mbps = 1347)
[  814.259252] #bench: completed (reqs = 128, size = 1024, time = 989, mbps = 1060)
[  814.266758] #bench: completed (reqs = 128, size = 1024, time = 955, mbps = 1097)
[  814.274141] #bench: completed (reqs = 128, size = 1024, time = 894, mbps = 1172)
[  814.281387] #bench: completed (reqs = 128, size = 1024, time = 858, mbps = 1222)
[  814.288767] #bench: completed (reqs = 128, size = 1024, time = 810, mbps = 1294)
[  814.296269] #bench: completed (reqs = 128, size = 1024, time = 928, mbps = 1129)
[  814.303620] #bench: completed (reqs = 128, size = 1024, time = 896, mbps = 1170)
[  814.310902] #bench: completed (reqs = 128, size = 1024, time = 839, mbps = 1249)
[  814.318317] #bench: completed (reqs = 128, size = 1024, time = 972, mbps = 1078)
[  814.325770] #bench: completed (reqs = 128, size = 1024, time = 932, mbps = 1125)
[  814.333122] #bench: completed (reqs = 128, size = 1024, time = 880, mbps = 1191)
[  814.340445] #bench: completed (reqs = 128, size = 1024, time = 1016, mbps = 1032)
[  814.347927] #bench: completed (reqs = 128, size = 1024, time = 887, mbps = 1182)
[  814.355332] #bench: completed (reqs = 128, size = 1024, time = 996, mbps = 1052)
[  814.362754] #bench: completed (reqs = 128, size = 1024, time = 963, mbps = 1088)
[  814.370053] #bench: completed (reqs = 128, size = 1024, time = 733, mbps = 1430)
[  814.377484] #bench: completed (reqs = 128, size = 1024, time = 860, mbps = 1219)
[  814.384860] #bench: completed (reqs = 128, size = 1024, time = 828, mbps = 1266)
[  814.392193] #bench: completed (reqs = 128, size = 1024, time = 720, mbps = 1456)
[  814.399555] #bench: completed (reqs = 128, size = 1024, time = 906, mbps = 1157)
[  814.407004] #bench: completed (reqs = 128, size = 1024, time = 867, mbps = 1209)
[  814.414421] #bench: completed (reqs = 128, size = 1024, time = 978, mbps = 1072)
[  814.421686] #bench: completed (reqs = 128, size = 1024, time = 940, mbps = 1115)
[  814.429044] #bench: completed (reqs = 128, size = 1024, time = 906, mbps = 1157)
[  814.436556] #bench: completed (reqs = 128, size = 1024, time = 1013, mbps = 1035)
[  814.443933] #bench: completed (reqs = 128, size = 1024, time = 898, mbps = 1167)
[  814.451266] #bench: completed (reqs = 128, size = 1024, time = 841, mbps = 1246)
[  814.458667] #bench: completed (reqs = 128, size = 1024, time = 967, mbps = 1084)
[  814.466139] #bench: completed (reqs = 128, size = 1024, time = 733, mbps = 1430)
[  814.473484] #bench: completed (reqs = 128, size = 1024, time = 876, mbps = 1197)
[  814.480780] #bench: completed (reqs = 128, size = 1024, time = 845, mbps = 1240)
[  814.488336] #bench: completed (reqs = 128, size = 1024, time = 801, mbps = 1309)
[  814.503004] #bench: completed (reqs = 128, size = 1024, time = 882, mbps = 1188)
[  814.510332] #bench: completed (reqs = 128, size = 1024, time = 820, mbps = 1278)
[  814.517775] #bench: completed (reqs = 128, size = 1024, time = 789, mbps = 1328)
[  814.525122] #bench: completed (reqs = 128, size = 1024, time = 823, mbps = 1274)
[  814.532554] #bench: completed (reqs = 128, size = 1024, time = 860, mbps = 1219)
[  814.539841] #bench: completed (reqs = 128, size = 1024, time = 990, mbps = 1059)
[  814.547271] #bench: completed (reqs = 128, size = 1024, time = 762, mbps = 1376)
[  814.554665] #bench: completed (reqs = 128, size = 1024, time = 896, mbps = 1170)
[  814.561978] #bench: completed (reqs = 128, size = 1024, time = 855, mbps = 1226)
[  814.569323] #bench: completed (reqs = 128, size = 1024, time = 812, mbps = 1291)
[  814.576797] crypto bench is done

Performance is good, but as soon as we go through user space... performance is bad.

@rog do you see a lot of these, or only sporadic logs? The driver code sends packets back to the Linux kernel if they fail to go into the NSS layer, so it should not affect the traffic flow. Likely some edge case that needs to be handled.

The changes I pushed to my GitHub repo already allow OpenVPN to use the NSS crypto engine via the OpenSSL engine mechanism. Performance is no good though. At the moment it's only using the AES-CBC cipher, so the HMAC is still software based.
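
If anyone wants to try it: OpenVPN loads an OpenSSL engine via its standard engine directive. The engine id below is only a placeholder; check what the engine actually registers itself as:

# list the engines OpenSSL can see, and whether they are usable
openssl engine -t

# excerpt from the OpenVPN config ('nss' is a placeholder engine id)
engine nss
cipher AES-128-CBC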

Let’s see the results when I manage to get the AEAD cipher working.

They appear exactly every 5 minutes. I thought it could be related to "inactivity polling", whose default time limit seems to be 300 seconds, but I don't know if that applies to WDS connections. I tried disabling it, but nothing changed.

Setting the timeout on the router will not do anything. The error is generated when the router receives a network packet from a wireless client and fails when trying to send it into the NSS firmware. So it appears that every 5 minutes the client(s) connected to your router send what is probably an empty keep-alive packet, which the NSS firmware disagrees with. It is probably a packet with no payload.
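
If you want to confirm it, capturing just the small frames on that interface around the five-minute mark should show the culprit (tcpdump is installable via opkg; the interface name is taken from your log):

opkg update && opkg install tcpdump
# print link-level headers for frames of 64 bytes or less
tcpdump -i wlan0.sta1 -e -n less 64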

Well folks, I did a quick test using the NSS crypto AEAD cipher (i.e. aes-128-cbc-hmac-sha1) with OpenVPN. As I suspected, performance is no good. Below are the results, tested with iperf3 over the following path:

iPad <-- Wi-Fi --> R7800 <-- OpenVPN tunnel/LAN --> iMac

Results:

Without OVPN       : 500Mbps
With OVPN-OpenSSL  : 50Mbps - CPU 60-70% loaded
With OVPN-NSS-AEAD : 20Mbps - CPU 30-40% loaded
With OVPN-NSS-CBC  : 15Mbps - CPU 30-40% loaded

It appears that transferring buffers between user space and kernel space is the limiting factor. When using NSS crypto, this penalty is doubled: once to send the buffer to the NSS crypto engine, and a second time to send the encrypted/decrypted buffer to the network socket for routing.

The next step would probably be to write a virtual interface driver that performs encryption/decryption in the kernel before handing off to the OpenVPN application in user space. This should keep throughput comparable to OpenSSL software crypto, but bring down the CPU load.

The ideal solution is to bring everything into kernel space, with the OpenVPN application managing only the control plane.

@quarky Not sure if you have done a test with no encryption. Such a test would tell us whether performance is limited by the split between kernel-space and user-space processing, or by the encryption and decryption routines.

@quarky, sorry, I didn't read your post very carefully. The test below should be the one with no encryption ("cipher none" and "auth none" in the configuration file). Yeah, I got a similar result on another router.
I'm surprised that performance is limited by the split between kernel-space and user-space processing. Could the OpenVPN team optimize this, or have they already done so in the latest version? I am using a very old version, OpenVPN 2.3.6. What's your version?

With OVPN-OpenSSL  : 50Mbps - CPU 60-70% loaded

Actually, all my tests were done with aes-128-cbc and hmac-sha1 authentication. Without encryption, throughput will definitely be higher, but it's of no practical use. The line you quoted in your post uses the OpenSSL cipher.
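
In OpenVPN config terms, my tests used the standard options below; the no-encryption baseline you described swaps in the "none" values:

# what my tests used
cipher AES-128-CBC
auth SHA1

# the no-encryption baseline instead uses:
# cipher none
# auth none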

I re-read the drivers from the QSDK. It looks like I can use their techniques with the ipq806x crypto engine, so there are a couple of tricks I can still try. My hope is that we can achieve the same throughput as the OpenSSL cipher, but with drastically reduced Krait CPU load. Then we can make full use of the SoC instead of letting the crypto engine sit idle.