IPQ806x NSS Drivers

Performance is good, but as soon as we use userspace... performance is bad.

@rog do you see a lot of these or only sporadic logs? The driver code will send packets back to the Linux kernel if they fail to go into the NSS layer, so it should not affect the flow. Likely some edge cases that need to be handled.

The changes I pushed into my GitHub repo already allow the use of the NSS crypto engine via the OpenSSL engine mechanism with OpenVPN. Performance is not good though. At the moment it's only using the AES-CBC cipher, so HMAC is still software based.
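
For anyone who wants to try it, the wiring goes through OpenVPN's OpenSSL engine support. A rough sketch of the relevant config lines follows; the engine name below is only a placeholder, check the output of 'openssl engine' for the name the driver actually registers:

# placeholder engine name, use whatever 'openssl engine' lists on your build
engine qca-nss
# only AES-CBC is offloaded at the moment, HMAC stays in software
cipher AES-128-CBC
auth SHA1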

Let’s see the results when I manage to get the AEAD cipher working.

They appear exactly every 5 minutes. I thought that it could be related to "inactivity polling", whose default time limit seems to be set at 300 seconds, but I don't know if that applies to WDS connections. I tested disabling it but nothing changed.

Setting the timeout at the router will not do anything. The error is generated when the router receives a network packet from a wireless client and tries to send it into the NSS firmware, which fails. So it appears that every 5 minutes the client(s) connected to your router are probably sending an empty keep-alive packet which the NSS firmware disagrees with. It is probably a packet with no payload.

Well folks, I did a quick test using the NSS crypto AEAD cipher (i.e. aes-128-cbc-hmac-sha1) with OpenVPN. As I suspected, performance is not good. Below are the results, tested with iperf3 over the following setup:

iPad <-- WiFi --> R7800 <-- OpenVPN tunnel/LAN --> iMac

Results:

Without OVPN       : 500Mbps
With OVPN-OpenSSL  : 50Mbps - CPU 60-70% loaded
With OVPN-NSS-AEAD : 20Mbps - CPU 30-40% loaded
With OVPN-NSS-CBC  : 15Mbps - CPU 30-40% loaded
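
For reference, the numbers above are from a plain iperf3 run through the tunnel; something along these lines should reproduce the setup (the address used here is only an example):

# on the iMac (server side)
iperf3 -s
# on the iPad (client side), pointing at the iMac's address across the tunnel
# (192.168.8.10 is only an example address)
iperf3 -c 192.168.8.10 -t 30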

It appears that transferring buffers between user space and kernel space is the limiting factor. When using NSS crypto, this penalty is doubled: the first time when sending the buffer to the NSS crypto engine, and the second time when sending the encrypted/decrypted buffer to the network socket for routing.

The next step would probably be to write a virtual interface driver to perform encryption/decryption in the kernel before sending it over to the OpenVPN application in user space. This would probably keep the thruput comparable to OpenSSL software crypto, but should bring down the CPU load.

The ideal solution is to bring everything into kernel space, with the OpenVPN application managing the control plane.


@quarky Not sure if you have done a test using no encryption. Such a test would tell us whether the performance is limited by the division between kernel-space and user-space processing or by the encryption and decryption routines.

@quarky, sorry, I didn't read your post very carefully. The test you quoted below should be the one using no encryption ("cipher none" and "auth none" in the configuration file). Yeah, I got a similar result on another router.
I'm surprised to see that the performance is limited by the division between kernel-space and user-space processes. Is it possible for the OpenVPN team to optimize this, or have they already done it in the latest version? I am using a very old version, OpenVPN 2.3.6. What's your version?
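
For reference, the no-encryption baseline I have in mind would be something like this in both the server and client config files:

# no encryption, no HMAC authentication
cipher none
auth none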

With OVPN-OpenSSL  : 50Mbps - CPU 60-70% loaded

Actually, all my tests are done with aes-128-cbc and hmac-sha1 authentication. Without encryption, thruput will definitely be higher, but it's of no practical use. The line you quoted in your post uses the OpenSSL cipher.

I re-read the drivers from QSDK. It looks like I can use their techniques for the ipq806x crypto engine, so there are a couple of tricks that I can still try. My hope is that we can still achieve the same thruput as with the OpenSSL cipher but with a drastically reduced Krait CPU load. Then we can make full use of the SoC instead of letting the crypto engine sit idle.

So the QSDK driver actually uses both the CPU and NSS for crypto?

Crypto operations are done solely by the NSS firmware on the NSS cores.

From what I can gather so far, the QSDK drivers for OpenVPN (which unfortunately are only for the Hawkeye SoCs) patch the OpenVPN internals extensively so that they are able to:

  1. Take over the socket operations when sending and receiving encrypted packets to/from external OpenVPN servers/clients. OpenVPN will send packets in clear text to the QSDK drivers, and the driver will then shoot them into the NSS crypto engine for ciphering. On the receive end, it's the reverse.

  2. Hook into ECM to have the NSS firmware take over routing function for tun/tap packets after decryption.

I'll be trying out (1.) first and see if it's going to be useful for the Akronite SoCs. From what I can gather, it should work. This should reduce Krait CPU load. This technique will also be applicable to other SoCs with a crypto engine (e.g. MT7621A), which I'll definitely be trying on my Linksys EA7500v-2 if it works.

Will see if (2.) can be achieved once (1.) is successful.

Without encryption, thruput will definitely be higher, but it's of no practical use.

I just tested without encryption on the R7800. It's only about 110Mbps in tun mode (VPN IP range is 192.168.6.0/24). That doesn't seem like good performance.

root@OpenWrt:/tmp/openvpn# iperf -c 192.168.6.2 -t 15 &                                                                                                                                                            
root@OpenWrt:/tmp/openvpn# ------------------------------------------------------------                                                                                                                            
Client connecting to 192.168.6.2, TCP port 5001                                                                                                                                                                    
TCP window size: 43.8 KByte (default)                                                                                                                                                                              
------------------------------------------------------------                                                                                                                                                       
[  3] local 192.168.6.1 port 48828 connected with 192.168.6.2 port 5001                                                                                                                                            
                                                                                                                                                                                                                   
root@OpenWrt:/tmp/openvpn# mpstat  -P ALL 2                                                                                                                                                                        
Linux 4.4.30 (OpenWrt)  08/17/20        _armv7l_        (2 CPU)                                                                                                                                                    
                                                                                                                                                                                                                   
03:28:51     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle                                                                                                                   
03:28:53     all    8.29    0.00   48.99    0.00    0.00   22.86    0.00    0.00    0.00   19.85                                                                                                                   
03:28:53       0    1.02    0.00   33.50    0.00    0.00   25.89    0.00    0.00    0.00   39.59                                                                                                                   
03:28:53       1   15.42    0.00   64.18    0.00    0.00   19.90    0.00    0.00    0.00    0.50                                                                                                                   

03:28:53     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:28:55     all    8.64    0.00   51.83    0.00    0.00   19.11    0.00    0.00    0.00   20.42
03:28:55       0    1.64    0.00   32.24    0.00    0.00   24.04    0.00    0.00    0.00   42.08
03:28:55       1   15.08    0.00   69.85    0.00    0.00   14.57    0.00    0.00    0.00    0.50

03:28:55     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:28:57     all    9.16    0.00   50.74    0.00    0.00   21.29    0.00    0.00    0.00   18.81
03:28:57       0    1.98    0.00   37.13    0.00    0.00   23.76    0.00    0.00    0.00   37.13
03:28:57       1   16.34    0.00   64.36    0.00    0.00   18.81    0.00    0.00    0.00    0.50

03:28:57     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:28:59     all   10.53    0.00   50.63    0.00    0.00   19.80    0.00    0.00    0.00   19.05
03:28:59       0    2.01    0.00   37.69    0.00    0.00   22.61    0.00    0.00    0.00   37.69
03:28:59       1   19.00    0.00   63.50    0.00    0.00   17.00    0.00    0.00    0.00    0.50

03:28:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:29:01     all    9.69    0.00   51.02    0.00    0.00   20.92    0.00    0.00    0.00   18.37
03:29:01       0    3.14    0.00   39.27    0.00    0.00   20.42    0.00    0.00    0.00   37.17
03:29:01       1   15.92    0.00   62.19    0.00    0.00   21.39    0.00    0.00    0.00    0.50

03:29:01     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:29:03     all    9.90    0.00   48.76    0.00    0.00   22.77    0.00    0.00    0.00   18.56
03:29:03       0    1.46    0.00   37.56    0.00    0.00   24.88    0.00    0.00    0.00   36.10
03:29:03       1   18.59    0.00   60.30    0.00    0.00   20.60    0.00    0.00    0.00    0.50
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-15.0 sec   196 MBytes   110 Mbits/sec

If I run iperf directly (without OpenVPN) from the R7800 to a PC connected to the WAN port of the R7800, it's ~851Mbps.
So I think we need to figure out why there is only ~110Mbps without encryption first. What do you think?

root@OpenWrt:/tmp/openvpn# iperf -c 192.168.2.2 -t 15 
------------------------------------------------------------
Client connecting to 192.168.2.2, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.1 port 38282 connected with 192.168.2.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-15.0 sec  1.49 GBytes   851 Mbits/sec

My guess is that there are two major reasons why it is 'slow':

  1. Interaction between kernel and user space
  2. Netfilter.

Imagine an OpenVPN connection has been established between two nodes, with the ipq806x SoC router acting as the server on one end. Now when the router receives an encrypted packet from the client, the following happens (KS = kernel space, US = user space):

  1. The Linux kernel receives the packet and forwards it to the OpenVPN socket listener (KS)
  2. OpenVPN wakes up and copies the packet from (KS) to (US) (** slow **)
  3. OpenVPN decrypts the packet in (US) (** slow **)
  4. OpenVPN decapsulates the tunnel (tun/tap) packet from the decrypted packet and prepares it for forwarding (US)
  5. OpenVPN sends the tun/tap packet back to the Linux kernel for routing (US)
  6. The Linux kernel copies the tun/tap packet from (US) to (KS) (** slow **)
  7. The tun/tap packet then goes thru netfilter (slowpath) before it gets routed. (** slow **)

For a local tun/tap packet going to the other end, just reverse the flow above.

The above is what I think kills the performance for 'slow' SoCs like the ipq806x, and even the ipq807x without acceleration offload.
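
If you want to see these per-packet kernel/user transitions for yourself, something like the following on the router while iperf is running should show the read/write and socket syscalls piling up (assuming a single OpenVPN instance and that the strace package is installed):

# count the per-packet syscalls made by the OpenVPN process
strace -c -e trace=read,write,recvfrom,sendto -p $(pidof openvpn)
# press Ctrl-C after a few seconds to get the syscall count summary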

Try disabling netfilter (i.e. the firewall) and re-test. You should see higher thruput.
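
On OpenWrt the quick way to rule netfilter out would be something like this (remember to start it again afterwards):

/etc/init.d/firewall stop
# re-run the iperf test, then restore with: /etc/init.d/firewall start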


I bricked the u-boot of my R7800 in another test. But I encountered a similar issue on IPQ807x when I turned off the IPQ807x OpenVPN offload and encryption. I just found that this low performance is caused by the default "--tun-mtu 1500". I adjusted --tun-mtu to 48000 and the performance went from ~300 to ~940Mbps. Would you have a try on the R7800?

Sorry to hear about your R7800. Hope you can revive it.

Testing with '--tun-mtu 48000' is not a reasonable test, as it has no practical real world use. Jumbo frame MTUs max out at 9000, and no ISP in the world (at least from what I know) supports an MTU greater than 1500.

Setting the tunnel MTU to a greater value will just result in the Linux kernel fragmenting packets into smaller chunks before sending them out on the wire. I'm surprised that this setting actually increases your thruput. Are you sure you are testing it correctly?

It is possible that using such a high MTU value causes netfilter to do less work, thus improving thruput, but as I've pointed out, it is of no real world practical use. You probably can't connect reliably, or at all, to any external OpenVPN server on the Internet with such a big tunnel MTU value. Even if you can connect, you'll probably be experiencing PMTU issues.

See here: https://community.openvpn.net/openvpn/wiki/Gigabit_Networks_Linux
This is the official OpenVPN guide.

  • increase the MTU size of the tun adapter ('--tun-mtu') to 6000 bytes. This resembles Jumbo frames on a regular Ethernet LAN. Note that the MTU size on the underlying network switches was not altered.

By increasing the MTU size of the tun adapter and by disabling OpenVPN's internal fragmentation routines the throughput can be increased quite dramatically. The reason behind this is that by feeding larger packets to the OpenSSL encryption and decryption routines the performance will go up.
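
If I read the guide correctly, the knobs it tweaks translate to roughly the following in the OpenVPN config, on both ends and with matching values:

# larger tunnel MTU, similar to jumbo frames on a LAN
tun-mtu 6000
# disable OpenVPN's internal fragmentation and MSS clamping, as per the guide
fragment 0
mssfix 0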


There was a guy using this option in the real world.

https://forums.openvpn.net/viewtopic.php?t=21738

It really depends on your use case. If you control both ends of the OpenVPN tunnel, you can configure the MTUs, but that's not how most folks use OpenVPN. And even then, when you connect external LAN devices to the OpenVPN router, you will be limited to the jumbo frame limit of 9000 bytes. Most folks don't know how to configure jumbo frames and will leave the MTU at the default 1500.

Now, if you set the tunnel MTU to a very high value, like 16000, and use the NSS crypto engine, efficiency will go up. So if you transfer tons of data between two sites which you control, this would be a viable option. It's just that it doesn't provide much benefit to most real world use cases.

If you control both ends of the OpenVPN tunnel, you can configure the MTUs, but that's not how most folks use OpenVPN.

I'm afraid not. We only need to provide the OpenVPN client configuration file to users, and this configuration file contains the MTU setting. That covers most cases.

And even then, when you connect external LAN devices to the OpenVPN router, you will be limited to the jumbo frame limit of 9000 bytes. Most folks don't know how to configure jumbo frames and will leave the MTU at the default 1500.

But I captured the packets on the physical Ethernet, and their size is not above 1500. I guess it just affects the MTU of the virtual tunnel interface, not the physical Ethernet. See below:

enp0s31f6 Link encap:Ethernet  HWaddr 40:b0:34:ec:2f:91  
          inet addr:192.168.2.2  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::60a2:c73c:a31:e14f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:237819837 errors:0 dropped:22942 overruns:0 frame:0
          TX packets:52747925 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:348040825300 (348.0 GB)  TX bytes:51324449886 (51.3 GB)
          Interrupt:16 Memory:e1200000-e1220000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:2081881 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2081881 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:176931662 (176.9 MB)  TX bytes:176931662 (176.9 MB)

tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          inet addr:192.168.3.4  P-t-P:192.168.3.4  Mask:255.255.255.0
          inet6 addr: fe80::f215:af9e:b50:30a2/64 Scope:Link
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:48000  Metric:1
          RX packets:61543 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30967 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:2927940876 (2.9 GB)  TX bytes:1610264 (1.6 MB)

Would you have a try? I think it's a very good tip when we use the NSS crypto engine, because the bottleneck is not crypto when the NSS crypto engine is fast enough.
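
For reference, the wire-side packet sizes can be double-checked on the physical interface while iperf is running with something like this (assuming OpenVPN on its default UDP port 1194; the interface name is the one from the output above):

tcpdump -i enp0s31f6 -nn -v udp port 1194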

@quarky I misunderstood your point. Please ignore this reply. Yes, if I connect a LAN device and let the router running the VPN server forward packets to a VPN client that connects to the router from the WAN, I see the low performance again. I will try configuring jumbo frames on the LAN device (maybe also need to enable jumbo frames in the switch of the router) to see if I can get higher performance.
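
For what it's worth, on a Linux LAN host that would be something like the following (the interface name is just the one from the earlier ifconfig output, adjust to your own; the switch and router ports also need to support frames that large):

ip link set dev enp0s31f6 mtu 9000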