Setting the timeout at the router will not do anything. The error is generated when the router receives a network packet from a wireless client and tries to pass it to the NSS firmware, which rejects it. So it appears that every 5 minutes a client connected to your router sends something the NSS firmware disagrees with, probably an empty keep-alive packet, i.e. a packet with no payload.
Well folks, I did a quick test using the NSS crypto AEAD cipher (i.e. aes-128-cbc-hmac-sha1) with OpenVPN. As I suspected, performance is no good. Below are the iperf3 results for the following setup:
iPad <-- WiFi --> R7800 <-- OpenVPN tunnel/LAN --> iMac
Results:
Without OVPN : 500Mbps
With OVPN-OpenSSL : 50Mbps - CPU 60-70% loaded
With OVPN-NSS-AEAD : 20Mbps - CPU 30-40% loaded
With OVPN-NSS-CBC : 15Mbps - CPU 30-40% loaded
It appears that transferring buffers between user space and kernel space is the limiting factor. With NSS crypto this penalty is doubled: the buffer crosses once to reach the NSS crypto engine, and again when the encrypted/decrypted buffer is sent to the network socket for routing.
The next step would probably be to write a virtual interface driver that performs encryption/decryption in the kernel before handing packets to the OpenVPN application in user space. This would probably keep thruput comparable to OpenSSL software crypto, but should bring down the CPU load.
The ideal solution is to bring everything into kernel space, with OpenVPN application managing the control plane.
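One way to read the results above is throughput per unit of CPU load. The sketch below is only a rough illustration: it uses the midpoint of each reported CPU range (an assumption, since only ranges were given), and the names are made up for this example.

```python
# Figures copied from the results above: (throughput in Mbps, CPU load range in %).
results = {
    "OVPN-OpenSSL": (50, (60, 70)),
    "OVPN-NSS-AEAD": (20, (30, 40)),
    "OVPN-NSS-CBC": (15, (30, 40)),
}

def mbps_per_cpu_percent(mbps, cpu_range):
    """Throughput divided by the midpoint of the CPU range (midpoint is an assumption)."""
    low, high = cpu_range
    return mbps / ((low + high) / 2)

efficiency = {name: mbps_per_cpu_percent(*vals) for name, vals in results.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} Mbps per CPU %")
```

On these figures the NSS runs move less data per unit of CPU load than the OpenSSL run, which fits the idea that the extra buffer transfers, not the cipher itself, are eating the time.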
@quarky Not sure if you have run a test with no encryption. Such a test would tell us whether performance is limited by the division between kernel-space and user-space processing or by the encryption and decryption routines.
@quarky, sorry I didn't read your post very carefully. The test quoted below should be the one with no encryption ("cipher none" and "auth none" in the configuration file). Yeah, I got a similar result on another router.
I'm surprised to see that performance is limited by the division between kernel space and user space. Should the OpenVPN team optimize this, or have they already done so in the latest version? I am using a very old version, OpenVPN 2.3.6. What's your version?
With OVPN-OpenSSL : 50Mbps - CPU 60-70% loaded
Actually, all my tests were done with aes-128-cbc and hmac-sha1 authentication. Without encryption, thruput will definitely be higher, but it’s of no practical use. The line you quoted in your post uses the OpenSSL cipher.
I re-read the drivers from the QSDK. It looks like I can use their techniques for the ipq806x crypto engine, so there are a couple of tricks that I can still try. My hope is that we can still achieve the same thruput as with the OpenSSL cipher but with drastically reduced Krait CPU load. Then we can make full use of the SoC instead of letting the crypto engine sit idle.
So the QSDK driver actually uses both the CPU and NSS for crypto?
Crypto operations are done solely by the NSS firmware on the NSS cores.
From what I can gather so far, the QSDK drivers for OpenVPN (which unfortunately are only for the Hawkeye SoCs) come with extensive patches to the OpenVPN internals, so that the drivers are able to:
1. Take over the socket operations when sending and receiving encrypted packets to and from external OpenVPN servers/clients. OpenVPN sends packets in clear text to the QSDK driver, which then shoots them into the NSS crypto engine for ciphering. On the receive side, it's the reverse.
2. Hook into ECM to have the NSS firmware take over the routing function for tun/tap packets after decryption.
I'll be trying out (1.) first to see if it's going to be useful for the Akronite SoCs. From what I can gather, it should work, and it should reduce Krait CPU load. This technique will also be applicable to other SoCs with a crypto engine (e.g. MT7621A), which I'll definitely be trying on my Linksys EA7500v-2 if it works.
Will see if (2.) can be achieved once (1.) is successful.
Without encryption, thruput will definitely be higher, but it’s of no practical use.
I just tested without encryption on the R7800. It's only about 110Mbps in tun mode (VPN IP range is 192.168.6.0/24). That doesn't seem like good performance.
root@OpenWrt:/tmp/openvpn# iperf -c 192.168.6.2 -t 15 &
root@OpenWrt:/tmp/openvpn# ------------------------------------------------------------
Client connecting to 192.168.6.2, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.6.1 port 48828 connected with 192.168.6.2 port 5001
root@OpenWrt:/tmp/openvpn# mpstat -P ALL 2
Linux 4.4.30 (OpenWrt) 08/17/20 _armv7l_ (2 CPU)
03:28:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:28:53 all 8.29 0.00 48.99 0.00 0.00 22.86 0.00 0.00 0.00 19.85
03:28:53 0 1.02 0.00 33.50 0.00 0.00 25.89 0.00 0.00 0.00 39.59
03:28:53 1 15.42 0.00 64.18 0.00 0.00 19.90 0.00 0.00 0.00 0.50
03:28:53 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:28:55 all 8.64 0.00 51.83 0.00 0.00 19.11 0.00 0.00 0.00 20.42
03:28:55 0 1.64 0.00 32.24 0.00 0.00 24.04 0.00 0.00 0.00 42.08
03:28:55 1 15.08 0.00 69.85 0.00 0.00 14.57 0.00 0.00 0.00 0.50
03:28:55 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:28:57 all 9.16 0.00 50.74 0.00 0.00 21.29 0.00 0.00 0.00 18.81
03:28:57 0 1.98 0.00 37.13 0.00 0.00 23.76 0.00 0.00 0.00 37.13
03:28:57 1 16.34 0.00 64.36 0.00 0.00 18.81 0.00 0.00 0.00 0.50
03:28:57 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:28:59 all 10.53 0.00 50.63 0.00 0.00 19.80 0.00 0.00 0.00 19.05
03:28:59 0 2.01 0.00 37.69 0.00 0.00 22.61 0.00 0.00 0.00 37.69
03:28:59 1 19.00 0.00 63.50 0.00 0.00 17.00 0.00 0.00 0.00 0.50
03:28:59 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:29:01 all 9.69 0.00 51.02 0.00 0.00 20.92 0.00 0.00 0.00 18.37
03:29:01 0 3.14 0.00 39.27 0.00 0.00 20.42 0.00 0.00 0.00 37.17
03:29:01 1 15.92 0.00 62.19 0.00 0.00 21.39 0.00 0.00 0.00 0.50
03:29:01 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:29:03 all 9.90 0.00 48.76 0.00 0.00 22.77 0.00 0.00 0.00 18.56
03:29:03 0 1.46 0.00 37.56 0.00 0.00 24.88 0.00 0.00 0.00 36.10
03:29:03 1 18.59 0.00 60.30 0.00 0.00 20.60 0.00 0.00 0.00 0.50
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-15.0 sec 196 MBytes 110 Mbits/sec
Running iperf directly, without OpenVPN, from the R7800 to a PC connected to the R7800's WAN port gives ~851 Mbps.
So I think we first need to figure out why there is only ~110Mbps without encryption. What do you think?
root@OpenWrt:/tmp/openvpn# iperf -c 192.168.2.2 -t 15
------------------------------------------------------------
Client connecting to 192.168.2.2, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.2.1 port 38282 connected with 192.168.2.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-15.0 sec 1.49 GBytes 851 Mbits/sec
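The mpstat samples from the OpenVPN run above already hint at where the time goes. The sketch below averages the per-CPU rows copied from that output; the tuple layout and names are made up for this example, and it is a rough reading rather than a profile.

```python
# Per-CPU samples copied from the mpstat output above: (%usr, %sys, %soft, %idle).
samples = {
    0: [(1.02, 33.50, 25.89, 39.59), (1.64, 32.24, 24.04, 42.08),
        (1.98, 37.13, 23.76, 37.13), (2.01, 37.69, 22.61, 37.69),
        (3.14, 39.27, 20.42, 37.17), (1.46, 37.56, 24.88, 36.10)],
    1: [(15.42, 64.18, 19.90, 0.50), (15.08, 69.85, 14.57, 0.50),
        (16.34, 64.36, 18.81, 0.50), (19.00, 63.50, 17.00, 0.50),
        (15.92, 62.19, 21.39, 0.50), (18.59, 60.30, 20.60, 0.50)],
}
COLUMNS = ("usr", "sys", "soft", "idle")

def column_means(rows):
    """Average each column over all samples for one CPU."""
    return {name: sum(row[i] for row in rows) / len(rows)
            for i, name in enumerate(COLUMNS)}

means = {cpu: column_means(rows) for cpu, rows in samples.items()}
for cpu, m in means.items():
    print(f"CPU {cpu}: " + "  ".join(f"{k}={v:.1f}%" for k, v in m.items()))
```

CPU 1 sits at 0.5% idle in every sample, with most of its time in %sys and %soft rather than %usr, so one core is saturated mostly by kernel-side work. Note that iperf itself runs on the router in this test, so part of that load is the traffic generator rather than OpenVPN.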
My guess is that there are two major reasons why it is 'slow':
- Interaction between kernel and user space
- Netfilter.
Imagine an OpenVPN connection has been established between two nodes, with the ipq806x SoC router acting as the server on one of them. When the router receives an encrypted packet from the client, the following happens (KS = kernel space, US = user space):
- Linux kernel receives the packet and forwards it to the OpenVPN socket listener (KS)
- OpenVPN wakes up and copies the packet from (KS) to (US) (** slow **)
- OpenVPN decrypts the packet in (US) (** slow **)
- OpenVPN decapsulates the tunnel (tun/tap) packet from the decrypted packet and prepares it for forwarding (US)
- OpenVPN sends the tun/tap packet back to the Linux kernel for routing (US)
- Linux kernel copies the tun/tap packet from (US) to (KS) (** slow **)
- The tun/tap packet then goes thru netfilter (slowpath) before it gets routed. (** slow **)
For local tun/tap packet going to the other end, just reverse the flow above.
The above is what I think kills the performance on 'slow' SoCs like the ipq806x, and even the ipq807x without acceleration offload.
Try disabling netfilter (i.e. the firewall) and re-test. You should see higher thruput.
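To put a rough number on the per-packet cost, the two iperf rates measured earlier in the thread can be converted to packets per second. This is back-of-the-envelope arithmetic that assumes every packet carries a full 1500 bytes and ignores header overhead; the function names are made up for this example.

```python
def packets_per_second(mbps, packet_bytes=1500):
    """Packets needed each second to sustain `mbps`, assuming full-size packets."""
    return mbps * 1_000_000 / (packet_bytes * 8)

def microseconds_per_packet(mbps, packet_bytes=1500):
    """Average time budget per packet at the given rate."""
    return 1_000_000 / packets_per_second(mbps, packet_bytes)

for label, mbps in [("OpenVPN, no encryption", 110), ("iperf without OpenVPN", 851)]:
    pps = packets_per_second(mbps)
    us = microseconds_per_packet(mbps)
    print(f"{label}: {pps:,.0f} packets/s, {us:.1f} us per packet")
```

Under these assumptions the tunnel spends on the order of 109 us per packet against roughly 14 us without OpenVPN. The two tests are not strictly comparable, but the gap gives a feel for how much time the extra copies and the netfilter pass would have to account for.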
I bricked the u-boot of my R7800 in another test. But I encountered a similar issue on IPQ807x when I turned off the IPQ807x OpenVPN offload and encryption. I just found that this low performance is caused by the default "--tun-mtu 1500". I changed it to --tun-mtu 48000, and the performance went from ~300 to ~940. Would you give it a try on the R7800?
Sorry to hear about your R7800. Hope you can revive it.
Testing with '--tun-mtu 48000' is not a reasonable test, as it has no practical real-world use. Jumbo frame MTUs max out at 9000, and no ISP in the world (at least as far as I know) supports an MTU greater than 1500.
Setting the tunnel MTU to a greater value will just result in the Linux kernel fragmenting packets into smaller chunks before sending them out on the wire. I'm surprised that this setting actually increases your thruput. Are you sure you are testing it correctly?
It is possible that using such a high MTU value causes netfilter to do less work, thus improving thruput, but as I've pointed out, it is of no practical real-world use. You probably can't connect reliably, or at all, to any external OpenVPN server on the Internet with such a big tunnel MTU value. Even if you can connect, you'll probably experience PMTU issues.
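The fragmentation mentioned above can be estimated with simple arithmetic. This sketch assumes IPv4 with a 20-byte header and no options, OpenVPN running over UDP, and OpenVPN's own fragmentation disabled. `ovpn_overhead` is a placeholder for the per-packet OpenVPN header, which depends on the cipher and auth settings and is not given in this thread.

```python
import math

def ip_fragments(tun_mtu, link_mtu=1500, ovpn_overhead=0, ip_header=20, udp_header=8):
    """Estimate how many IPv4 fragments one full-size tunnel packet becomes on the wire."""
    # Payload of the outer IP packet: tunnel packet + OpenVPN overhead + UDP header.
    ip_payload = tun_mtu + ovpn_overhead + udp_header
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (link_mtu - ip_header) // 8 * 8
    return math.ceil(ip_payload / per_fragment)

for mtu in (1400, 6000, 9000, 48000):
    print(f"tun-mtu {mtu}: {ip_fragments(mtu)} wire packet(s)")
```

With these assumptions a single 48000-byte tunnel packet leaves as about 33 wire packets, each within the 1500-byte link MTU, and losing any one fragment means the whole tunnel packet is lost.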
See here: https://community.openvpn.net/openvpn/wiki/Gigabit_Networks_Linux
This is the official OpenVPN guide.
- increase the MTU size of the tun adapter ('--tun-mtu') to 6000 bytes. This resembles Jumbo frames on a regular Ethernet LAN. Note that the MTU size on the underlying network switches was not altered.
By increasing the MTU size of the tun adapter and by disabling OpenVPN's internal fragmentation routines the throughput can be increased quite dramatically. The reason behind this is that by feeding larger packets to the OpenSSL encryption and decryption routines the performance will go up.
There was a guy using this option in the real world.
It really depends on your use case. If you control both ends of the OpenVPN tunnel, you can configure the MTUs, but that's not how most folks use OpenVPN. And even then, when you connect external LAN devices to the OpenVPN router, you will be limited to the jumbo frame limit of 9000 bytes. Most folks don't know how to configure jumbo frames and will leave the MTU at the default 1500.
Now, if you set the tunnel MTU to a very high value, like 16000, and use the NSS crypto engine, efficiency will go up. So if you transfer tons of data between two sites which you control, this would be a viable option. It's just that it doesn't provide much benefit for most real-world use cases.
If you control both ends of the OpenVPN tunnel, you can configure the MTUs, but that's not how most folks use OpenVPN.
I'm afraid not. We only need to provide the OpenVPN client configuration file to users, and that file contains the MTU setting. This is the case most of the time.
And even then, when you connect external LAN devices to the OpenVPN router, you will be limited to the jumbo frame limit of 9000 bytes. Most folks don't know how to configure jumbo frames and will leave the MTU at the default 1500.
But I captured the packets on the physical Ethernet, and their size is not above 1500. I guess the setting only affects the MTU of the virtual tunnel interface, not the physical Ethernet. See below:
enp0s31f6 Link encap:Ethernet HWaddr 40:b0:34:ec:2f:91
inet addr:192.168.2.2 Bcast:192.168.2.255 Mask:255.255.255.0
inet6 addr: fe80::60a2:c73c:a31:e14f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:237819837 errors:0 dropped:22942 overruns:0 frame:0
TX packets:52747925 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:348040825300 (348.0 GB) TX bytes:51324449886 (51.3 GB)
Interrupt:16 Memory:e1200000-e1220000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:2081881 errors:0 dropped:0 overruns:0 frame:0
TX packets:2081881 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:176931662 (176.9 MB) TX bytes:176931662 (176.9 MB)
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.3.4 P-t-P:192.168.3.4 Mask:255.255.255.0
inet6 addr: fe80::f215:af9e:b50:30a2/64 Scope:Link
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:48000 Metric:1
RX packets:61543 errors:0 dropped:0 overruns:0 frame:0
TX packets:30967 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:2927940876 (2.9 GB) TX bytes:1610264 (1.6 MB)
Would you give it a try? I think it's a very good tip when we use the NSS crypto engine, because crypto is no longer the bottleneck when the NSS crypto engine is fast enough.
@quarky I misunderstood your point. Please ignore this reply. Yes, if I connect a LAN device and let the router running the VPN server forward packets to a VPN client that connects to the router from the WAN, I see the low performance again. I will try to configure jumbo frames on the LAN device (maybe I also need to enable jumbo frames in the router's switch) to see if I can get higher performance.
As much as I would like to have some magic setting(s) to improve performance instantly ... well, if only it were so easy.
The only way I can see performance improving for OpenVPN for the ipq806x SoCs, is to try to emulate what the QCA folks did for the ipq807x SoCs. I don't see any other way around it at the moment.
The reason I like this magic setting is that I have another router, a Cavium OCTEON III CN70XX, which already supports OpenSSL HW offload as shown below. So I can use the option "engine octeon" in OpenVPN to get better performance. But the bottleneck is the division between kernel-space and user-space processing rather than the encryption and decryption routines.
root@Openwrt:/etc/openvpn# openssl speed -evp aes-128-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 14253950 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 11043990 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 4856292 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1486099 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 198781 aes-128-cbc's in 3.00s
OpenSSL 1.0.2d 9 Jul 2015
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,int) des(idx,cisc,2,long) aes(partial) blowfish(idx)
compiler: mips64-octeon-linux-gnu-gcc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DZLIB_SHARED -DZLIB -DOPENSSL_THREADS -DDSO_DLFCN -DHAVE_DLFCN_H -I/home/tony.he/vpn-router/vpn-router.git/staging_dir/target-ml
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 76021.07k 235605.12k 414403.58k 507255.13k 542804.65k
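The summary row above can be reproduced from the operation counts, and converting it to bits per second makes it easier to compare with the iperf figures in this thread. The numbers are copied from the output above; the function and variable names are made up for this example.

```python
# (block size in bytes, operations completed in 3.00 s), copied from the output above.
runs = [(16, 14253950), (64, 11043990), (256, 4856292), (1024, 1486099), (8192, 198781)]
ELAPSED_S = 3.00

def kbytes_per_second(block_bytes, ops, elapsed=ELAPSED_S):
    """Throughput in 1000s of bytes per second, the unit `openssl speed` reports."""
    return block_bytes * ops / elapsed / 1000

def mbits_per_second(block_bytes, ops, elapsed=ELAPSED_S):
    """The same throughput expressed in Mbit/s."""
    return block_bytes * ops * 8 / elapsed / 1_000_000

for size, ops in runs:
    print(f"{size:5d} bytes: {kbytes_per_second(size, ops):10.2f}k  "
          f"{mbits_per_second(size, ops):8.1f} Mbit/s")
```

At 1024-byte blocks this works out to roughly 4 Gbit/s of raw aes-128-cbc, far above any of the tunnel rates discussed here, which fits the point that crypto is not the bottleneck on that router.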
The only way I can see performance improving for OpenVPN for the ipq806x SoCs, is to try to emulate what the QCA folks did for the ipq807x SoCs
Yes. I confirmed that IPQ807x SoCs don't need the high MTU value: performance is very good (at least 700+) when forwarding packets between LAN devices and a WAN VPN client.
Wow, the OpenSSL figures for the octeon engine are quite impressive. The 'solution' (for OpenVPN offload) I'm currently trying for the ipq806x SoCs is generic enough to be applicable to other SoCs as well, if I manage to get it working. Currently hitting a wall with the socket driver tho. Haiz.
Is there a qsdk11 firmware available, or only qsdk10?