IPQ806x NSS Drivers

@robimarko Yes, I tested on the QSDK; OpenVPN is offloaded to NSS.

@quarky I just used two PCs to test again, because a single PC is a bottleneck since OpenVPN is single-threaded. I got about 383+478=861 Mbps. It's excellent.

root@OpenWrt:/tmp/etc# cat run-two-iperf.sh 
iperf -c 10.8.0.10 -t 30 & 
iperf -c 10.8.0.14 -t 30

root@OpenWrt:/tmp/etc# ./run-two-iperf.sh                                                                                                                                                                          
------------------------------------------------------------                                                                                                                                                       
Client connecting to 10.8.0.14, TCP port 5001                                                                                                                                                                      
TCP window size: 45.0 KByte (default)                                                                                                                                                                              
------------------------------------------------------------                                                                                                                                                       
[  3] local 10.8.0.1 port 60504 connected with 10.8.0.14 port 5001                                                                                                                                                 
------------------------------------------------------------                                                                                                                                                       
Client connecting to 10.8.0.10, TCP port 5001                                                                                                                                                                      
TCP window size: 45.0 KByte (default)                                                                                                                                                                              
------------------------------------------------------------                                                                                                                                                       
[  3] local 10.8.0.1 port 34274 connected with 10.8.0.10 port 5001                                                                                                                                                 
[ ID] Interval       Transfer     Bandwidth                                                                                                                                                                        
[  3]  0.0-30.0 sec  1.34 GBytes   383 Mbits/sec                                                                                                                                                                   
[ ID] Interval       Transfer     Bandwidth                                                                                                                                                                        
[  3]  0.0-30.0 sec  1.67 GBytes   478 Mbits/sec        

If I disable NSS offload (by removing "enable-dca" from the configuration file), it's only about 111 Mbps.
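For reference, the toggle is just that one line in the OpenVPN config. Sketch below; enable-dca is a QSDK-specific option and its exact placement depends on your setup:

```shell
# QSDK OpenVPN config fragment (sketch):
# with the line present, OpenVPN is offloaded to NSS;
# remove it to fall back to software crypto (~111 Mbps in the test above).
enable-dca
```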

root@OpenWrt:/tmp/etc# iperf -c 10.8.0.6 -t 30
------------------------------------------------------------
Client connecting to 10.8.0.6, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.8.0.1 port 37982 connected with 10.8.0.6 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec   397 MBytes   111 Mbits/sec

@quarky Do you think it's possible to port the IPQ807x OpenVPN NSS offload patches to IPQ806x?

Nope, or at least we'd need to use something else, as our NSS core and firmware don't support OpenVPN acceleration
(for example, ipq807x has WiFi acceleration deeply integrated into the WiFi firmware; our implementation is different and uses a virtual interface)

I'm trying out something now, but like what @Ansuel pointed out it'll be tough.

From reading the QSDK NSS drivers, the qvpn interfaces are built into the NSS firmware for the ipq807x SoCs, and they should be able to directly intercept OpenVPN packets for encryption/decryption and send them on their way into the network, provided netfilter clears them first.

To emulate this for the ipq806x SoCs, we would have to build a virtual interface driver to intercept OpenVPN packets for ciphering and route them directly, without sending them back to the user space where the OpenVPN app is running. We can probably reuse most of the work done by the QCA folks, but the qvpn NSS interface needs to be reworked to use the ipq806x virtual interface mechanism.

Of course, all this is theoretical at the moment. I'm at the stage where I'm trying to understand how to shoot OpenVPN's packets into the NSS crypto engine instead of OpenSSL. Currently OpenVPN does a two-step process if the connection uses AES-CBC with SHA-HMAC, so I'm trying to shoot this straight into the NSS crypto engine and do the ciphering and authentication in a single step. The hope is to improve performance.

If it works, then we can probably think about emulating the qvpn interfaces in the ipq806x SoCs.

In the end, it may all be for nothing as the performance may not be good at all, but it'll be an interesting project to mess with. Even if we somehow manage to improve performance, I'm thinking it'll probably not be by a lot. The ipq807x SoC is quite a bit faster compared to the ipq806x. The ipq807x NSS core clock speed is already nearing the ipq806x Krait CPU clock.

If we're able to even keep performance the same but drastically reduce the Krait CPU usage, I would still consider this a win. Heh heh.

@quarky Would you share the commands for the NSS crypto bench tool, for reference on ipq806x?

@devs: I'm getting this log message every five minutes. It is related to an access point connected via WDS to my R7800 running the NSS version:

wlan0.sta1: NSS TX failed with error[5]: NSS_TX_FAILURE_TOO_SHORT

I don't see any connection problems related to it and just wanted to give you smart guys a heads-up.

@quarky, OK, after checking the source code and testing, I have gotten the commands. The performance is good on ipq806x too. The problem is how to use the crypto engine from user space, e.g. from OpenVPN.

root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 1024 > bam_len
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 128 > cipher_len
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 50 > loops
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo 3 > print
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo bench > cmd
[  809.571868] auth algo SHA1_HMAC
[  809.571897] cipher algo AES
[  809.575482] preparing crypto bench
root@OpenWrt:/sys/kernel/debug/crypto_bench# echo start > cmd
[  814.204889] #bench: completed (reqs = 128, size = 1024, time = 815, mbps = 1286)
[  814.214990] #bench: completed (reqs = 128, size = 1024, time = 991, mbps = 1058)
[  814.222318] #bench: completed (reqs = 128, size = 1024, time = 878, mbps = 1194)
[  814.229676] #bench: completed (reqs = 128, size = 1024, time = 896, mbps = 1170)
[  814.237131] #bench: completed (reqs = 128, size = 1024, time = 1035, mbps = 1013)
[  814.244525] #bench: completed (reqs = 128, size = 1024, time = 913, mbps = 1148)
[  814.252106] #bench: completed (reqs = 128, size = 1024, time = 778, mbps = 1347)
[  814.259252] #bench: completed (reqs = 128, size = 1024, time = 989, mbps = 1060)
[  814.266758] #bench: completed (reqs = 128, size = 1024, time = 955, mbps = 1097)
[  814.274141] #bench: completed (reqs = 128, size = 1024, time = 894, mbps = 1172)
[  814.281387] #bench: completed (reqs = 128, size = 1024, time = 858, mbps = 1222)
[  814.288767] #bench: completed (reqs = 128, size = 1024, time = 810, mbps = 1294)
[  814.296269] #bench: completed (reqs = 128, size = 1024, time = 928, mbps = 1129)
[  814.303620] #bench: completed (reqs = 128, size = 1024, time = 896, mbps = 1170)
[  814.310902] #bench: completed (reqs = 128, size = 1024, time = 839, mbps = 1249)
[  814.318317] #bench: completed (reqs = 128, size = 1024, time = 972, mbps = 1078)
[  814.325770] #bench: completed (reqs = 128, size = 1024, time = 932, mbps = 1125)
[  814.333122] #bench: completed (reqs = 128, size = 1024, time = 880, mbps = 1191)
[  814.340445] #bench: completed (reqs = 128, size = 1024, time = 1016, mbps = 1032)
[  814.347927] #bench: completed (reqs = 128, size = 1024, time = 887, mbps = 1182)
[  814.355332] #bench: completed (reqs = 128, size = 1024, time = 996, mbps = 1052)
[  814.362754] #bench: completed (reqs = 128, size = 1024, time = 963, mbps = 1088)
[  814.370053] #bench: completed (reqs = 128, size = 1024, time = 733, mbps = 1430)
[  814.377484] #bench: completed (reqs = 128, size = 1024, time = 860, mbps = 1219)
[  814.384860] #bench: completed (reqs = 128, size = 1024, time = 828, mbps = 1266)
[  814.392193] #bench: completed (reqs = 128, size = 1024, time = 720, mbps = 1456)
[  814.399555] #bench: completed (reqs = 128, size = 1024, time = 906, mbps = 1157)
[  814.407004] #bench: completed (reqs = 128, size = 1024, time = 867, mbps = 1209)
[  814.414421] #bench: completed (reqs = 128, size = 1024, time = 978, mbps = 1072)
[  814.421686] #bench: completed (reqs = 128, size = 1024, time = 940, mbps = 1115)
[  814.429044] #bench: completed (reqs = 128, size = 1024, time = 906, mbps = 1157)
[  814.436556] #bench: completed (reqs = 128, size = 1024, time = 1013, mbps = 1035)
[  814.443933] #bench: completed (reqs = 128, size = 1024, time = 898, mbps = 1167)
[  814.451266] #bench: completed (reqs = 128, size = 1024, time = 841, mbps = 1246)
[  814.458667] #bench: completed (reqs = 128, size = 1024, time = 967, mbps = 1084)
[  814.466139] #bench: completed (reqs = 128, size = 1024, time = 733, mbps = 1430)
[  814.473484] #bench: completed (reqs = 128, size = 1024, time = 876, mbps = 1197)
[  814.480780] #bench: completed (reqs = 128, size = 1024, time = 845, mbps = 1240)
[  814.488336] #bench: completed (reqs = 128, size = 1024, time = 801, mbps = 1309)
[  814.503004] #bench: completed (reqs = 128, size = 1024, time = 882, mbps = 1188)
[  814.510332] #bench: completed (reqs = 128, size = 1024, time = 820, mbps = 1278)
[  814.517775] #bench: completed (reqs = 128, size = 1024, time = 789, mbps = 1328)
[  814.525122] #bench: completed (reqs = 128, size = 1024, time = 823, mbps = 1274)
[  814.532554] #bench: completed (reqs = 128, size = 1024, time = 860, mbps = 1219)
[  814.539841] #bench: completed (reqs = 128, size = 1024, time = 990, mbps = 1059)
[  814.547271] #bench: completed (reqs = 128, size = 1024, time = 762, mbps = 1376)
[  814.554665] #bench: completed (reqs = 128, size = 1024, time = 896, mbps = 1170)
[  814.561978] #bench: completed (reqs = 128, size = 1024, time = 855, mbps = 1226)
[  814.569323] #bench: completed (reqs = 128, size = 1024, time = 812, mbps = 1291)
[  814.576797] crypto bench is done
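As a sanity check on those numbers: if time is reported in microseconds (an assumption, but it makes every line in the log self-consistent), then bits-per-microsecond equals Mbit/s, so each mbps figure is just reqs × size × 8 / time. Shell arithmetic for the first line:

```shell
# bench line: reqs = 128, size = 1024 bytes, time = 815 (assumed microseconds)
# bits / microseconds == Mbit/s, so integer math reproduces the logged figure
echo $(( 128 * 1024 * 8 / 815 ))   # prints 1286, matching "mbps = 1286"
```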


Performance is good, but as soon as we use userspace... performance is bad.

@rog do you see a lot of these, or only sporadic logs? The driver code will send packets back to the Linux kernel if they fail to go into the NSS layer, so it should not affect the flow. Likely some edge case that needs to be handled.

The changes I pushed to my Github repo already allow the use of the NSS crypto engine with OpenVPN via the OpenSSL engine mechanism. Performance is no good tho. At the moment it's only using the AES-CBC cipher, so HMAC is still software based.

Let’s see the results when I manage to get the AEAD cipher working.

They appear exactly every 5 minutes. I thought it could be related to "inactivity polling", whose default time limit seems to be 300 seconds, but I don't know if that applies to WDS connections. I tested disabling it but nothing changed.

Setting the timeout at the router will not do anything. The error is generated when the router receives a network packet from a wireless client and fails when trying to send it into the NSS firmware. So it appears that every 5 minutes the client(s) connected to your router send an empty keep-alive packet, probably one with no payload, which the NSS firmware disagrees with.

Well folks, I did a quick test using the NSS crypto AEAD cipher (i.e. aes-128-cbc-hmac-sha1) with OpenVPN. As I suspected, performance is no good. Below are the results, tested with iperf3 using the following setup:

iPad <-- WiFi--> R7800 <-- OpenVPN tunnel/LAN --> iMac

Results:

Without OVPN       : 500Mbps
With OVPN-OpenSSL  : 50Mbps - CPU 60-70% loaded
With OVPN-NSS-AEAD : 20Mbps - CPU 30-40% loaded
With OVPN-NSS-CBC  : 15Mbps - CPU 30-40% loaded

It appears that transferring buffers between user space and kernel space is the limiting factor. When using NSS crypto, this penalty is doubled: once sending the buffer to the NSS crypto engine, and again sending the encrypted/decrypted buffer to the network socket for routing.

The next step would probably be to write a virtual interface driver to perform encryption/decryption in the kernel before sending packets over to the OpenVPN application in user space. This should keep thruput comparable to using OpenSSL software crypto, but should bring down the CPU load.

The ideal solution is to bring everything into kernel space, with OpenVPN application managing the control plane.


@quarky Not sure if you have done a test with no encryption. Such a test would tell us whether the performance is limited by the split between kernel-space and user-space processing or by the encryption and decryption routines.

@quarky, sorry, I didn't read your post very carefully. The test below should be the one with no encryption ("cipher none" and "auth none" in the configuration file). Yeah, I got a similar result on another router.
I'm surprised to see the performance limited by the split between kernel-space and user-space processing. Is it possible the OpenVPN team should optimize this, or have they already done so in the latest version? I am using a very old version, OpenVPN 2.3.6. What's your version?

With OVPN-OpenSSL  : 50Mbps - CPU 60-70% loaded

Actually, all my tests are done with aes-128-cbc with hmac-sha1 authentication. Without encryption, thruput will definitely be higher, but it's of no practical use. The line you quoted in your post uses the OpenSSL cipher.

I re-read the drivers from the QSDK. It looks like I can use their techniques with the ipq806x crypto engine, so there's a couple of tricks I can still try. My hope is that we can achieve the same thruput as the OpenSSL cipher but with drastically reduced Krait CPU load. Then we can make full use of the SoC instead of letting the crypto engine idle.

So the QSDK driver actually uses both the CPU and the NSS for crypto?

Crypto operations are done solely by the NSS firmware on the NSS cores.

From what I can gather so far, the QSDK drivers for OpenVPN (which unfortunately are only for the Hawkeye SoCs) patch the OpenVPN internals extensively so that the QSDK drivers are able to:

  1. Take over the socket operations when sending and receiving encrypted packets to external OpenVPN servers/clients. OpenVPN sends packets in clear text to the QSDK drivers, which then shoot them into the NSS crypto engine for ciphering. For the receive path, it's the reverse.

  2. Hook into ECM to have the NSS firmware take over routing function for tun/tap packets after decryption.

I'll be trying out (1.) first and see if it's going to be useful for the Akronite SoCs. From what I can gather, it should work, and it should reduce Krait CPU load. This technique will also be applicable to other SoCs with a crypto engine (e.g. MT7621A), which I'll definitely be trying on my Linksys EA7500v-2 if it works.

Will see if (2.) can be achieved once (1.) is successful.

Without encryption, thruput will definitely be higher, but it's of no practical use.

I just tested without encryption on the R7800. It's only about 110 Mbps in tun mode (VPN IP range is 192.168.6.0/24). That doesn't seem like good performance.

root@OpenWrt:/tmp/openvpn# iperf -c 192.168.6.2 -t 15 &                                                                                                                                                            
root@OpenWrt:/tmp/openvpn# ------------------------------------------------------------                                                                                                                            
Client connecting to 192.168.6.2, TCP port 5001                                                                                                                                                                    
TCP window size: 43.8 KByte (default)                                                                                                                                                                              
------------------------------------------------------------                                                                                                                                                       
[  3] local 192.168.6.1 port 48828 connected with 192.168.6.2 port 5001                                                                                                                                            
                                                                                                                                                                                                                   
root@OpenWrt:/tmp/openvpn# mpstat  -P ALL 2                                                                                                                                                                        
Linux 4.4.30 (OpenWrt)  08/17/20        _armv7l_        (2 CPU)                                                                                                                                                    
                                                                                                                                                                                                                   
03:28:51     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle                                                                                                                   
03:28:53     all    8.29    0.00   48.99    0.00    0.00   22.86    0.00    0.00    0.00   19.85                                                                                                                   
03:28:53       0    1.02    0.00   33.50    0.00    0.00   25.89    0.00    0.00    0.00   39.59                                                                                                                   
03:28:53       1   15.42    0.00   64.18    0.00    0.00   19.90    0.00    0.00    0.00    0.50                                                                                                                   

03:28:53     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:28:55     all    8.64    0.00   51.83    0.00    0.00   19.11    0.00    0.00    0.00   20.42
03:28:55       0    1.64    0.00   32.24    0.00    0.00   24.04    0.00    0.00    0.00   42.08
03:28:55       1   15.08    0.00   69.85    0.00    0.00   14.57    0.00    0.00    0.00    0.50

03:28:55     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:28:57     all    9.16    0.00   50.74    0.00    0.00   21.29    0.00    0.00    0.00   18.81
03:28:57       0    1.98    0.00   37.13    0.00    0.00   23.76    0.00    0.00    0.00   37.13
03:28:57       1   16.34    0.00   64.36    0.00    0.00   18.81    0.00    0.00    0.00    0.50

03:28:57     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:28:59     all   10.53    0.00   50.63    0.00    0.00   19.80    0.00    0.00    0.00   19.05
03:28:59       0    2.01    0.00   37.69    0.00    0.00   22.61    0.00    0.00    0.00   37.69
03:28:59       1   19.00    0.00   63.50    0.00    0.00   17.00    0.00    0.00    0.00    0.50

03:28:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:29:01     all    9.69    0.00   51.02    0.00    0.00   20.92    0.00    0.00    0.00   18.37
03:29:01       0    3.14    0.00   39.27    0.00    0.00   20.42    0.00    0.00    0.00   37.17
03:29:01       1   15.92    0.00   62.19    0.00    0.00   21.39    0.00    0.00    0.00    0.50

03:29:01     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:29:03     all    9.90    0.00   48.76    0.00    0.00   22.77    0.00    0.00    0.00   18.56
03:29:03       0    1.46    0.00   37.56    0.00    0.00   24.88    0.00    0.00    0.00   36.10
03:29:03       1   18.59    0.00   60.30    0.00    0.00   20.60    0.00    0.00    0.00    0.50
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-15.0 sec   196 MBytes   110 Mbits/sec

Running iperf directly, without OpenVPN, from the R7800 to a PC connected to its WAN port gives ~851 Mbps.
So I think we need to figure out first why it's only ~110 Mbps without encryption. What do you think?

root@OpenWrt:/tmp/openvpn# iperf -c 192.168.2.2 -t 15 
------------------------------------------------------------
Client connecting to 192.168.2.2, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.1 port 38282 connected with 192.168.2.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-15.0 sec  1.49 GBytes   851 Mbits/sec

My guess is that there are two major reasons why it is 'slow':

  1. Interaction between kernel and user space
  2. Netfilter.

Imagine an OpenVPN connection established between two nodes, with the ipq806x SoC router acting as the server on one of them. When the router receives an encrypted packet from the client, the following happens (KS = kernel space, US = user space):

  1. Linux kernel receives the packet and forwards it to the OpenVPN socket listener (KS)
  2. OpenVPN wakes up and copies the packet from KS to US (** slow **)
  3. OpenVPN decrypts the packet in US (** slow **)
  4. OpenVPN decapsulates the tunnel (tun/tap) packet from the decrypted packet and prepares it for forwarding (US)
  5. OpenVPN sends the tun/tap packet back to the Linux kernel for routing (US)
  6. Linux kernel copies the tun/tap packet from US to KS (** slow **)
  7. The tun/tap packet then goes thru netfilter (slowpath) before it gets routed (** slow **)

For local tun/tap packets going to the other end, just reverse the flow above.

The above is what I think kills the performance on a 'slow' SoC like the ipq806x, and even the ipq807x without acceleration offload.

Try disabling netfilter (i.e. the firewall) and re-test. You should see higher thruput.
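On OpenWrt, a quick way to run that test (assuming the stock firewall init script) is:

```shell
# Temporarily stop the firewall service (flushes the netfilter rule sets),
# run the iperf test, then bring the firewall back up.
/etc/init.d/firewall stop
# ... re-run iperf here ...
/etc/init.d/firewall start
```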


I bricked the u-boot of my R7800 in another test, but I encountered a similar issue on IPQ807x when I turned off the IPQ807x OpenVPN offload and encryption. I just found that this low performance is caused by the default "--tun-mtu 1500". I adjusted it to --tun-mtu 48000 and performance went from ~300 to ~940 Mbps. Would you have a try on the R7800?
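A rough back-of-envelope using the rates quoted above shows why the larger tun-mtu helps: each tun frame costs a user/kernel round trip, so bigger frames mean far fewer crossings per second (these use the reported throughputs, not a claim about where every cycle goes):

```shell
# frames per second = throughput_bps / 8 bits-per-byte / frame_size_bytes
echo $(( 300000000 / 8 / 1500 ))    # ~300 Mbps at tun-mtu 1500  -> 25000 frames/s
echo $(( 940000000 / 8 / 48000 ))   # ~940 Mbps at tun-mtu 48000 -> 2447 frames/s
```

So the 48000-byte tun-mtu needs roughly a tenth of the user/kernel crossings to move three times the data.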