Comparative Throughput Testing Including NAT, SQM, WireGuard, and OpenVPN

Overview

While many individuals have tested various routers, the question still always seems to come up, "What can I expect from Router X?"

I won't say that my testing is going to definitively answer those questions, but at least it should be a self-consistent data set.

To be very clear, this is a benchmark. Like most benchmarks, the performance you get in your own environment may be different. If you have a craptastic ISP, a lossy ISP connection, a flaky cable, hosts pummeling your LAN with broadcast traffic, a different traffic distribution, a poorly implemented router, or any of an uncountable number of other things, the results you see may not meet the levels reported here. It is your responsibility to use any information provided here to assist you in making your decision.

Testing Outline

Testing was performed using flent, which provides "stress testing" by using multiple streams. The three tests shown are tcp_8down, tcp_8up (both with ICMP ping), and RRUL (4 up, 4 down, with 3 UDP pings and ICMP ping). IPv4 was used. The standard, 60-second test duration (for an overall 70-second run duration) was used. These multi-stream tests are more stressful than single-stream testing.

This testing uses full-size (1500 MTU) TCP packets. If your traffic consists of a large fraction of small packets (such as VoIP), the PPS (packets-per-second) rate will be much higher for a given bandwidth. These effects are not explored in this thread, though they may further limit the performance in your environment.

Flow offloading is not enabled in the config. At the present time (September, 2019) it apparently does not work on master. (2019-09-27 – Appears to now be resolved on master)

Testing includes irqbalance running from boot for multi-core devices, unless noted. Note that it needs to be enabled in /etc/config/irqbalance to have an effect. At least one device under some conditions seems to perform worse with irqbalance enabled (IPQ4019). Performing tests with your device in your environment is suggested.
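For reference, enabling it amounts to flipping one option in /etc/config/irqbalance (the section and option names below are those shipped by the irqbalance package to the best of my knowledge; verify against your own build):

```
config irqbalance 'irqbalance'
	option enabled '1'
```

Followed by a restart of the service (or a reboot) so it picks up the change.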

Default CPU-speed governors and settings were used. Many suggest that "tweaks" here can improve start-up and steady-state performance. If your tweaks let you exceed these numbers in practice, that's great!

SQM using piece_of_cake.qos was applied to the WAN interface for NAT/routing, or to the VPN's tunnel interface for WireGuard and OpenVPN. The same bandwidth target was applied for upstream as well as downstream. An overhead of 22 was used for Ethernet, 82 for WireGuard [1], and 95 for OpenVPN [2]. The overhead values are believed to be close to correct, but are not prescriptive.
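As an illustration of that shaping setup, the settings correspond to an /etc/config/sqm section roughly like the following (the interface name and bandwidth targets are placeholders; the option names are those used by sqm-scripts):

```
config queue 'eth1'
	option enabled '1'
	option interface 'eth1'        # WAN, or the VPN tunnel interface
	option download '200000'       # kbit/s; same target used in both directions
	option upload '200000'
	option qdisc 'cake'
	option script 'piece_of_cake.qos'
	option linklayer 'ethernet'
	option overhead '22'           # 82 for WireGuard, 95 for OpenVPN
```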

After staring at too many of these runs early on in this effort, I realized how much of a judgement call it was to look at a run and say if it was good or bad. Also, the sheer number of runs needed, even at just over a minute each, pushed me to automate the testing (even automated, the full set of runs takes five or six hours per device). I settled on the following criteria for a "good" SQM run:

  • Coefficient of variation of the bandwidth of all eight TCP streams under 1%, or
  • Standard deviation of the bandwidth of all eight TCP streams under 0.02 Mbps

Basically, if the streams go out of balance with each other by even a "tiny" bit, then SQM is starting to break down. The second criterion comes into play for low-bandwidth runs (typically under 10 Mbps, aggregate), where the output from flent for a single stream might be 0.12 Mbps and the variance is dominated by the rounding precision and/or by single packets becoming significant in the flow measurement.
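The two criteria amount to a simple check on the per-stream bandwidths. A minimal sketch (`sqm_run_ok` is a hypothetical helper, not part of the actual test scripts):

```python
import statistics

def sqm_run_ok(stream_mbps, cov_limit=0.01, sd_limit=0.02):
    """Return True if a run meets either 'good SQM' criterion:
    coefficient of variation of the stream bandwidths under 1%,
    or their standard deviation under 0.02 Mbps."""
    mean = statistics.mean(stream_mbps)
    sd = statistics.stdev(stream_mbps)   # sample standard deviation
    return (sd / mean) < cov_limit or sd < sd_limit

# Eight well-balanced streams pass; one stream falling behind fails.
print(sqm_run_ok([12.50, 12.52, 12.48, 12.51, 12.49, 12.50, 12.51, 12.49]))  # True
print(sqm_run_ok([12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 9.0]))          # False
```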

When the ICMP-ping time exceeded 10 ms, the highest throughput with an ICMP-ping time of under 10 ms is shown as well.

The number shown is the reported throughput, not the SQM bandwidth-target setting.

LuCI is not "logged in" during testing, as early testing on a single-core device suggested that an active session can significantly degrade throughput. LuCI is installed, and nginx/OpenSSL is installed and running. Running LuCI and generating page content will likely reduce the performance seen on single-core devices.

Wireless is not enabled.

WireGuard and OpenVPN were installed on Debian 10 ("Buster") on two upper-range x86_64/AMD64 machines. An AMD Ryzen 5 2600X was configured as the VPN "server" and netserver host and an Intel i3-7100T was driving the test as the "LAN" client. OpenVPN is configured for UDP, without compression. The test harness is capable of over 900 Mbps rates for Ethernet and WireGuard using netperf (effectively at the media limits with iperf3). It is capable of ~500 Mbps for OpenVPN.

As I'm not recommending any specific devices in this thread, I've listed them by target, clock speed, number of cores, and the SoC.

All tested devices have at least 128 MB of RAM.

Speeds are in Mbps, aggregate, as reported by flent (which uses netperf). Non-SQM runs are the median of five runs. SQM runs required two "successes" with less than two non-successful runs at a given bandwidth target. The parenthetical numbers are the median ping time in ms. Results rounded to at most two significant figures.
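The success rule for SQM runs reads as: accept a bandwidth target once two runs succeed, reject it once two runs fail. A hypothetical reconstruction of that rule (not the actual test harness):

```python
def target_accepted(results):
    """results: iterable of booleans, one per run at this bandwidth target.
    A target needs two "good" runs before accumulating two bad ones."""
    successes = failures = 0
    for ok in results:
        if ok:
            successes += 1
        else:
            failures += 1
        if failures >= 2:
            return False   # two non-successful runs: reject this target
        if successes >= 2:
            return True    # two successes first: accept this target
    return False           # ran out of runs without a decision

print(target_accepted([True, True]))          # True
print(target_accepted([True, False, False]))  # False
print(target_accepted([False, True, True]))   # True
```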

With a minimum step size in the SQM bandwidth-target search of around 5%, differences of around 5-10% may not be significant in the SQM results. As ping time often increases abruptly as SQM starts to get overstressed (which is where these results are measured), I wouldn't consider ping time for SQM results as more than "interesting information" about the benefits of SQM when the router is stressed, not just the line.

Note that the RRUL test has upstream and downstream flows "competing" with each other. Especially noted in VPN testing, there can be a huge imbalance between the two without SQM, with one direction or the other dominating, or even effectively excluding the other.

Remember, the tcp_8up or tcp_8down tests' numbers don't imply that you can get that throughput in both directions simultaneously. Look at the RRUL numbers for simultaneous throughput. Remember that the RRUL numbers are already the total of upstream and downstream.

I would suggest selecting a device that exceeds your needs by a comfortable margin as measures of performance other than throughput (such as latency and stream "fairness") are often much better at slightly lower bandwidths than at the limits of performance.

Q&A

Router X's ping time is so high. Router Y's is lower and it's cheaper. Router X sucks!

Remember that this is a test to estimate the limits of performance, not to estimate latency under reasonable operating conditions. You shouldn't be regularly pushing your router this hard. At reasonable rates, the tested routers should all provide reasonable latency of a couple milliseconds or less, with another couple added if a VPN is involved for the slower SoCs. Even just backing down the SQM bandwidth target a small amount can significantly reduce the latency.

The note says that downstream OpenVPN dominates the results. Does that mean it only works well in one direction?

As with ping time, these numbers are at the limits of performance. When CPU or other limits come into play, "balance" can be greatly impacted by the code involved, kernel prioritization, implementation details of the TCP stack and Ethernet drivers, and the like. Virtually all of the OpenVPN tests without SQM and with competing upstream and downstream streams resulted in virtually only downstream results. Back off of these limits by reducing the throughput you're asking for (fewer requests or SQM) and reasonable, simultaneous upstream and downstream performance can be achieved.

WireGuard also exhibits imbalance, though not to the extremes seen in the OpenVPN testing.

My iperf3 results are better.

That wouldn't surprise me. There's less to keep track of with a single stream, especially with SQM in play. Similar comments apply to things like the DSLReports test, which runs a download test followed by an upload test.

Why didn't you test flow offloading?

Because at the present time (September, 2019) it apparently does not work on master. If it gets patched on master, I'll probably come back and test it. I'm curious as well about how it performs. (2019-09-27 – Appears to now be resolved on master)

Why haven't you tested Router X?

Likely because I don't own it, it's too old, or I haven't gotten to it yet.

But, it's faster if you set...

This is primarily about providing general guidance on whether a device is "sufficient" for an application, especially for users who do not build their own images or heavily configure OpenWrt. Things are pretty much stock with these configs. Any improvements over a stock config or build are a bonus.

Have a great tweak? I'd be interested in hearing about it on another thread, as would many others!

Results

Differences in SQM results of less than 5-10% may not be significant, as the smallest SQM step size explored was ~5%. Remember also that at the limits of performance of the device, dropping the SQM bandwidth target slightly can significantly reduce the latency.

Note that 10% of 920 Mbps is nearly 100 Mbps.

Results are rounded to at most two significant figures (a precision in keeping with the ~5% step size of the SQM tests).

If you skipped to here, there is no flow offloading configured, the CPU-speed (governor) parameters have not been changed from the defaults, and there are no other "performance tweaks" known to be applied in the build or configuration. piece_of_cake.qos is used for SQM testing. Wireless is not enabled.

Key:

  • 1720 – throughput in Mbit/s
  • (6) – ping time in ms

When two results are shown side-by-side in the same cell, the lower one represents the performance seen with a ping time of 10 ms or less; the other is the "by the test" limit where SQM was starting to "fall apart".

Routing/NAT

| Target | Clock (MHz) | Cores | SoC / CPU | Notes | 8 Dn | 8 Up | RRUL | 8 Dn / SQM | 8 Up / SQM | RRUL / SQM |
|---|---|---|---|---|---|---|---|---|---|---|
| x86_64 | 1500-2500 | 4 | Celeron J4105 | Realtek RT8111A | 920 (4) | 940 (4) | 1720 (6) | 900 (1) | 920 (1) | 1610 (2) |
| " | " | " | " | No irqbalance | 920 (5) | 940 (5) | 1720 (5) | 900 (1) | 920 (1) | 1620 (2) |
| x86_64 | 1000-1400 | 4 | AMD GX-412TC | Intel i211AT | 920 (4) | 940 (5) | 1740 (7) | 770 L (2) | 930 (2) | 700 (1) |
| ipq40xx | 717 | 4 | IPQ4019 | Single RGMII [3] | 920 (4) | 600 (44) | 840 (46) | 210 L (4) | 210 (7) | 210 (6) |
| " | " | " | " | No irqbalance | 920 (6) | 740 (20) | 1030 (20) | 220 L (3) | 240 (5) | 230 (4) |
| ath79 | 775 | 1 | QCA9563 | Single MAC | 470 (9) | 380 (9) | 400 (9) | 210 L (4) | 151 (3) | 185 (9) |
| ath79 | 720 | 1 | QCA9558 | Dual MAC | 390 (8) | 320 (9) | 340 (13) | 210 L (4) | 160 (3) | 196 (6) |
| ath79 | 650 | 1 | QCA9531 | 100 Mbps phys | 67 (1) | 86 (3) | 184 (15) | 93 (4) | 92 (3) | 165 (6) |
| ramips | 580 | 1 | MT7628N | 100 Mbps phys | 94 (6) | 94 (7) | 157 (87) | 81 (3) | 78 (3) | 75 (4) |

L – The downstream SQM throughput appears to be limited by something other than the SQM bandwidth target itself. Increasing the bandwidth target past this point did not significantly increase throughput, although the streams remained in balance.

WireGuard, Routing/NAT

| Target | Clock (MHz) | Cores | SoC / CPU | Notes | 8 Dn | 8 Up | RRUL | 8 Dn / SQM | 8 Up / SQM | RRUL / SQM |
|---|---|---|---|---|---|---|---|---|---|---|
| x86_64 | 1500-2500 | 4 | Celeron J4105 | Realtek RT8111A | 880 (5) | 900 (7) | 1670 (10) | 860 (2) | 880 (1) | 1470 (2) |
| " | " | " | " | No irqbalance | 880 (5) | 900 (6) | 1680 (10) | 870 (1) | 880 (1) | 1480 (2) |
| x86_64 | 1000-1400 | 4 | AMD GX-412TC | Intel i211AT | 550 (18) | 460 (8) | 500 (27) | 350 L (4) | 340 (2) | 340 (3) |
| ipq40xx | 717 | 4 | IPQ4019 | Single RGMII [3] | 270 (52) | 220 (104) | 230 (48) | 157 (8) | 124 (6) | 142 (8) |
| " | " | " | " | No irqbalance | 350 (20) | 240 (60) | 270 (45) | 170 (10) | 150 (8) | 161 (8) |
| ath79 | 775 | 1 | QCA9563 | Single MAC | 80 (33) | 75 (34) | 75 (37) | 41 (9) | 50 (9) | 46 (10) |
| ath79 | 720 | 1 | QCA9558 | Dual MAC | 78 (38) | 74 (40) | 74 (49) | 44 (10) | 51 (9) | 46 (10) |
| ath79 | 650 | 1 | QCA9531 | 100 Mbps phys | 51 (15) | 67 (59) | 69 (63) | 26 (4) | 35 (8) / 41 (13) | 32 (6) |
| ramips | 580 | 1 | MT7628N | 100 Mbps phys | 48 (34) | 46 (65) | 48 (77) | 29 (10) / 33 (15) | 28 (8) | 27 (10) |

L – The downstream SQM throughput appears to be limited by something other than the SQM bandwidth target itself. Increasing the bandwidth target past this point did not significantly increase throughput, although the streams remained in balance.

OpenVPN, Routing/NAT

OpenVPN tested at the device's limits with the bi-directional RRUL test and without SQM results in the vast majority of the bandwidth (often 90% or more) being downstream.


| Target | Clock (MHz) | Cores | SoC / CPU | Notes | 8 Dn | 8 Up | RRUL | 8 Dn / SQM | 8 Up / SQM | RRUL / SQM |
|---|---|---|---|---|---|---|---|---|---|---|
| x86_64 | 1500-2500 | 4 | Celeron J4105 | Realtek RT8111A | 240 (18) | 200 (5) | 240 (14) | 157 (5) | 155 (1) | 151 (4) |
| " | " | " | " | No irqbalance | 210 (20) | 196 (5) | 240 (16) | 156 (1) / 174 (12) | 170 (4) | 163 (5) |
| x86_64 | 1000-1400 | 4 | AMD GX-412TC | Intel i211AT | 68 (58) | 56 (16) | 68 (55) | 48 (5) | 49 (9) | 49 (9) |
| ipq40xx | 717 | 4 | IPQ4019 | Single RGMII [3] | 31 (83) | 26 (29) | 33 (76) | 23 (10) | 20 (9) | 20 (7) |
| " | " | " | " | No irqbalance | 31 (86) | 24 (32) | 33 (79) | 19 (5) / 22 (20) | 20 (8) | 18 (10) / 22 (35) |
| ath79 | 775 | 1 | QCA9563 | Single MAC | 20 (200) | 15 (60) | 20 (188) | 12 (8) / 14 (17) | 12 (7) / 13 (47) | 10 (10) / 14 (70) |
| ath79 | 720 | 1 | QCA9558 | Dual MAC | 21 (188) | 16 (57) | 21 (175) | 13 (8) / 15 (48) | 13 (8) / 14 (17) | 11 (10) / 14 (57) |
| ath79 | 650 | 1 | QCA9531 | 100 Mbps phys | 16 (220) | 12 (73) | 17 (230) | 10 (9) / 11 (19) | 10 (8) / 11 (72) | 9 (5) / 11 (74) |
| ramips | 580 | 1 | MT7628N | 100 Mbps phys | 14 (290) | 10 (88) | 14 (280) | 8 (9) / 9 (20) | 8 (9) / 9 (40) | 5 (10) / 9 (90) |

Notes

On VPN

Encryption is often more expensive than is decryption.

The RRUL test has upstream and downstream flows "competing" with each other. Without SQM, the results shown are often dominated by one direction or the other. For OpenVPN, the effect is nearly complete with "only" downstream traffic showing significant bandwidth (often 90% or more). Interestingly, for WireGuard, the upstream direction seems to get more of the traffic and there is still a reasonable downstream flow.

If latency is important to you, using SQM on the tunnel interface may help significantly (at the cost of overall bandwidth).

On Bandwidth

The link bandwidth will always exceed the upper-layer, payload throughput. This is due to a variety of factors including, for example:

  • Collision avoidance and sync on the media
  • Media-layer (Ethernet) framing
  • IP framing
  • TCP framing
  • Control, ACK packets
  • Other packets on the link (such as ARP and broadcasts)
  • Packet loss

The rough, theoretical throughput limit for TCP over GigE (without jumbo frames) is ~940 Mbps [4].

WireGuard or OpenVPN encapsulation further reduces this to ~900 Mbps [1][2].

ACK packets reduce the "opposite" direction's capacity by a bit over 5% [5].

It is not clear whether netperf measures bandwidth including IP and TCP headers. Similarly, I think it probably does not include Ethernet framing.

netperf documentation may be found at https://hewlettpackard.github.io/netperf/doc/netperf.html and https://github.com/HewlettPackard/netperf, along with source.

flent documentation may be found at https://flent.org/

Firmware Source References

The following commits are the base on which the various images were built (all builds from master):

  • x86_64 – commit 1c0290c5cc, CommitDate: Fri Aug 30 20:45:40 2019 +0200
  • IPQ4019 – commit c3a78955f3, CommitDate: Mon Aug 26 18:21:13 2019 +0200
  • QCA9558 – commit 921675a2d1, CommitDate: Sat Aug 24 08:55:33 2019 +0800
  • QCA9563 – commit b133e466b0, CommitDate: Wed Aug 14 12:36:37 2019 +0200
  • QCA9531 – commit b133e466b0, CommitDate: Wed Aug 14 12:36:37 2019 +0200

Note significant changes have occurred on master since the other devices' builds.

  • MT7628N – commit 2fedf023e4, CommitDate: Wed Nov 27 20:20:31 2019 +0100

Footnotes

[1] Additional 60-byte overhead for WireGuard over IPv4 (80 bytes for IPv6)

[2] Additional 73-byte overhead based on a reported 1427 MTU for OpenVPN

[3] The present (2019) DTS/driver for the ipq40xx is believed to only use one path, although two interfaces are revealed.

[4] A 1500-byte MTU, less a 20-byte minimal IPv4 header, less a 20-byte minimal TCP header, gives a 1460-byte maximum payload. Ethernet framing of 22 bytes (with 802.1Q VLAN tagging), plus the 8-byte preamble and start-of-frame delimiter, plus the 12-byte minimal interpacket gap, gives 1542 byte-times on the wire. 1460/1542 ~ 94.7%, under ideal timing and with no other clients or packets on the wire. Note that IPv6 headers are larger (40 bytes, minimum) and will reduce the factor further.

[5] A minimal Ethernet packet (for a "naked ACK") is 84 byte-times on the wire. 84/1542 ~ 5.4%
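The arithmetic in the footnotes can be checked directly (the constants below are exactly those stated above):

```python
# Per-packet byte counts for a full-size IPv4 TCP segment on Ethernet.
MTU = 1500
TCP_PAYLOAD = MTU - 20 - 20   # minimal IPv4 header + minimal TCP header -> 1460
ON_WIRE = MTU + 22 + 8 + 12   # 802.1Q framing + preamble/SFD + min gap -> 1542

print(round(TCP_PAYLOAD / ON_WIRE, 3))   # 0.947 -> ~94.7% of line rate at best

# SQM overhead values used in this testing: Ethernet framing plus encapsulation.
print(22 + 60)   # WireGuard over IPv4: 82
print(22 + 73)   # OpenVPN, from its reported 1427 MTU: 95

# A minimal Ethernet packet (a "naked ACK") is 84 byte-times on the wire.
print(round(84 / ON_WIRE, 3))   # 0.054 -> a bit over 5% of capacity
```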

Edit history:

  • 2019-09-16
  • 2019-09-17
    • Quantified ACK-packet bandwidth effect
  • 2019-09-27
  • 2019-10-03
    • Added "key" for table values
  • 2019-11-28
    • Added MediaTek MT7628N results

Stock CFLAGS or did you use -O2 and/or any other optimization on the OpenWrt devices?

Stock CFLAGS, yes. No changes to CPU governors or Ethernet settings from whatever "stock" sets.

Here's an example diffconfig (EFI enabled with https://github.com/openwrt/openwrt/pull/1968, "build_details" is a local package that captures details of my source tree in the image)

CONFIG_TARGET_x86=y
CONFIG_TARGET_x86_64=y
CONFIG_TARGET_x86_64_Generic=y
CONFIG_DEVEL=y
CONFIG_BUSYBOX_CUSTOM=y
CONFIG_BUILD_LOG=y
CONFIG_BUSYBOX_CONFIG_FEATURE_EDITING_SAVEHISTORY=y
CONFIG_BUSYBOX_CONFIG_FEATURE_REVERSE_SEARCH=y
CONFIG_BUSYBOX_CONFIG_FEATURE_VERBOSE_CP_MESSAGE=y
CONFIG_CCACHE=y
CONFIG_DOWNLOAD_FOLDER="/home/jeff/devel/openwrt_dl"
CONFIG_EFI_IMAGES=y
CONFIG_NGINX_HEADERS_MORE=y
CONFIG_NGINX_HTTP_ACCESS=y
CONFIG_NGINX_HTTP_AUTH_BASIC=y
CONFIG_NGINX_HTTP_AUTOINDEX=y
CONFIG_NGINX_HTTP_BROWSER=y
CONFIG_NGINX_HTTP_CACHE=y
CONFIG_NGINX_HTTP_CHARSET=y
CONFIG_NGINX_HTTP_EMPTY_GIF=y
CONFIG_NGINX_HTTP_FASTCGI=y
CONFIG_NGINX_HTTP_GEO=y
CONFIG_NGINX_HTTP_GZIP=y
CONFIG_NGINX_HTTP_LIMIT_CONN=y
CONFIG_NGINX_HTTP_LIMIT_REQ=y
CONFIG_NGINX_HTTP_MAP=y
CONFIG_NGINX_HTTP_MEMCACHED=y
CONFIG_NGINX_HTTP_PROXY=y
CONFIG_NGINX_HTTP_REFERER=y
CONFIG_NGINX_HTTP_REWRITE=y
CONFIG_NGINX_HTTP_SCGI=y
CONFIG_NGINX_HTTP_SPLIT_CLIENTS=y
CONFIG_NGINX_HTTP_SSI=y
CONFIG_NGINX_HTTP_UPSTREAM_HASH=y
CONFIG_NGINX_HTTP_UPSTREAM_IP_HASH=y
CONFIG_NGINX_HTTP_UPSTREAM_KEEPALIVE=y
CONFIG_NGINX_HTTP_UPSTREAM_LEAST_CONN=y
CONFIG_NGINX_HTTP_USERID=y
CONFIG_NGINX_HTTP_UWSGI=y
CONFIG_NGINX_NAXSI=y
CONFIG_NGINX_PCRE=y
CONFIG_OPENSSL_ENGINE=y
CONFIG_OPENSSL_OPTIMIZE_SPEED=y
CONFIG_OPENSSL_WITH_ASM=y
CONFIG_OPENSSL_WITH_CHACHA_POLY1305=y
CONFIG_OPENSSL_WITH_CMS=y
CONFIG_OPENSSL_WITH_DEPRECATED=y
CONFIG_OPENSSL_WITH_ERROR_MESSAGES=y
CONFIG_OPENSSL_WITH_PSK=y
CONFIG_OPENSSL_WITH_SRP=y
CONFIG_OPENSSL_WITH_TLS13=y
CONFIG_OPENVPN_openssl_ENABLE_DEF_AUTH=y
CONFIG_OPENVPN_openssl_ENABLE_FRAGMENT=y
CONFIG_OPENVPN_openssl_ENABLE_LZ4=y
CONFIG_OPENVPN_openssl_ENABLE_LZO=y
CONFIG_OPENVPN_openssl_ENABLE_MULTIHOME=y
CONFIG_OPENVPN_openssl_ENABLE_PF=y
CONFIG_OPENVPN_openssl_ENABLE_PORT_SHARE=y
CONFIG_OPENVPN_openssl_ENABLE_SERVER=y
CONFIG_OPENVPN_openssl_ENABLE_SMALL=y
CONFIG_PACKAGE_build-details=y
CONFIG_PACKAGE_ca-bundle=y
CONFIG_PACKAGE_diffutils=y
CONFIG_PACKAGE_ethtool=y
CONFIG_PACKAGE_findutils=y
CONFIG_PACKAGE_findutils-find=y
CONFIG_PACKAGE_findutils-locate=y
CONFIG_PACKAGE_findutils-xargs=y
CONFIG_PACKAGE_git=y
CONFIG_PACKAGE_grub2-efi=y
CONFIG_PACKAGE_htop=y
CONFIG_PACKAGE_i2c-tools=y
CONFIG_PACKAGE_ip-bridge=y
CONFIG_PACKAGE_ip-full=y
CONFIG_PACKAGE_iperf3=y
CONFIG_PACKAGE_iptables-mod-conntrack-extra=y
CONFIG_PACKAGE_iptables-mod-ipopt=y
CONFIG_PACKAGE_irqbalance=y
CONFIG_PACKAGE_jansson=y
CONFIG_PACKAGE_kmod-crypto-crc32c=y
CONFIG_PACKAGE_kmod-crypto-hash=y
CONFIG_PACKAGE_kmod-fs-ext4=y
CONFIG_PACKAGE_kmod-fs-vfat=y
CONFIG_PACKAGE_kmod-ifb=y
CONFIG_PACKAGE_kmod-ipt-conntrack-extra=y
CONFIG_PACKAGE_kmod-ipt-ipopt=y
CONFIG_PACKAGE_kmod-ipt-raw=y
CONFIG_PACKAGE_kmod-lib-crc16=y
CONFIG_PACKAGE_kmod-nls-base=y
CONFIG_PACKAGE_kmod-nls-cp437=y
CONFIG_PACKAGE_kmod-nls-iso8859-1=y
CONFIG_PACKAGE_kmod-nls-utf8=y
CONFIG_PACKAGE_kmod-sched-cake=y
CONFIG_PACKAGE_kmod-sched-core=y
CONFIG_PACKAGE_kmod-tun=y
CONFIG_PACKAGE_kmod-udptunnel4=y
CONFIG_PACKAGE_kmod-udptunnel6=y
CONFIG_PACKAGE_kmod-wireguard=y
CONFIG_PACKAGE_less=y
CONFIG_PACKAGE_libacl=y
CONFIG_PACKAGE_libattr=y
CONFIG_PACKAGE_libcap=y
CONFIG_PACKAGE_libelf=y
CONFIG_PACKAGE_libi2c=y
CONFIG_PACKAGE_libiwinfo=y
CONFIG_PACKAGE_libiwinfo-lua=y
CONFIG_PACKAGE_liblua=y
CONFIG_PACKAGE_liblucihttp=y
CONFIG_PACKAGE_liblucihttp-lua=y
CONFIG_PACKAGE_liblzo=y
CONFIG_PACKAGE_libmnl=y
CONFIG_PACKAGE_libncurses=y
CONFIG_PACKAGE_libopenssl=y
CONFIG_PACKAGE_libopenssl-conf=y
CONFIG_PACKAGE_libpcap=y
CONFIG_PACKAGE_libpcre=y
CONFIG_PACKAGE_libpopt=y
CONFIG_PACKAGE_libubus-lua=y
CONFIG_PACKAGE_libusb-1.0=y
CONFIG_PACKAGE_lua=y
CONFIG_PACKAGE_luci-app-firewall=y
CONFIG_PACKAGE_luci-app-openvpn=y
CONFIG_PACKAGE_luci-app-sqm=y
CONFIG_PACKAGE_luci-app-wireguard=y
CONFIG_PACKAGE_luci-base=y
CONFIG_PACKAGE_luci-lib-ip=y
CONFIG_PACKAGE_luci-lib-jsonc=y
CONFIG_PACKAGE_luci-lib-nixio=y
CONFIG_PACKAGE_luci-mod-admin-full=y
CONFIG_PACKAGE_luci-mod-network=y
CONFIG_PACKAGE_luci-mod-status=y
CONFIG_PACKAGE_luci-mod-system=y
CONFIG_PACKAGE_luci-proto-ipv6=y
CONFIG_PACKAGE_luci-proto-ppp=y
CONFIG_PACKAGE_luci-proto-wireguard=y
CONFIG_PACKAGE_luci-ssl-nginx=y
CONFIG_PACKAGE_luci-theme-bootstrap=y
CONFIG_PACKAGE_nginx-mod-luci-ssl=y
CONFIG_PACKAGE_nginx-ssl=y
CONFIG_PACKAGE_openssl-util=y
CONFIG_PACKAGE_openvpn-openssl=y
CONFIG_PACKAGE_procps-ng=y
CONFIG_PACKAGE_procps-ng-free=y
CONFIG_PACKAGE_procps-ng-kill=y
CONFIG_PACKAGE_procps-ng-pgrep=y
CONFIG_PACKAGE_procps-ng-pkill=y
CONFIG_PACKAGE_procps-ng-pmap=y
CONFIG_PACKAGE_procps-ng-ps=y
CONFIG_PACKAGE_procps-ng-pwdx=y
CONFIG_PACKAGE_procps-ng-skill=y
CONFIG_PACKAGE_procps-ng-slabtop=y
CONFIG_PACKAGE_procps-ng-snice=y
CONFIG_PACKAGE_procps-ng-tload=y
CONFIG_PACKAGE_procps-ng-top=y
CONFIG_PACKAGE_procps-ng-uptime=y
CONFIG_PACKAGE_procps-ng-vmstat=y
CONFIG_PACKAGE_procps-ng-w=y
CONFIG_PACKAGE_procps-ng-watch=y
CONFIG_PACKAGE_rpcd=y
CONFIG_PACKAGE_rpcd-mod-rrdns=y
CONFIG_PACKAGE_rsync=y
CONFIG_PACKAGE_sqm-scripts=y
CONFIG_PACKAGE_tc=y
CONFIG_PACKAGE_tcpdump-mini=y
CONFIG_PACKAGE_terminfo=y
CONFIG_PACKAGE_usbutils=y
CONFIG_PACKAGE_uwsgi-cgi=y
CONFIG_PACKAGE_uwsgi-cgi-luci-support=y
CONFIG_PACKAGE_wireguard=y
CONFIG_PACKAGE_wireguard-tools=y
CONFIG_PACKAGE_zlib=y
CONFIG_RSYNC_acl=y
CONFIG_RSYNC_xattr=y
CONFIG_RSYNC_zlib=y

Important factual information for all Newbies

OpenWrt is slower than stock firmware for throughput in real-life usage, in some cases by as much as 80%.


That looks like an opinion when you state it with no actual facts...


Very good work. It would be nice to add ARM CPUs: both ARMv8 with AES instructions, and AArch64 and AArch32 without them (link).


Not only are there numerous posts demonstrating this issue, there is also an unquestioned explanation for this behaviour (hardware vs. software NAT), and a couple of projects about possible solutions (fast-path and flow-offloading).


Back on topic...

These results are posted here rather than in the lead post because use of flow offload on master requires a patch to the sources, and because non-default compilation (-O2 vs. -Os) and settings (performance speed governor) were used. Multiple requests have been made to have OpenWrt adopt the "mainstream" -O2 optimization (or even the more aggressive -O3), though the project continues to use the -Os "optimize for size" setting, even for devices that do not have image-size issues. It is my understanding that flow offload is available for many devices in the 18.06 builds.

As expected, the performance of the routing/NAT increases significantly for the devices tested (a key objective of the switch-core design) and there appears to be a slight increase in the performance of advanced features that require ongoing CPU use (SQM, VPN). The improvement in SQM and VPN performance may be due to CPU offload, the choice of -O2, that all cores on the IPQ4019 are running at the same speed, or some combination of the foregoing.

I don't have plans at this time to test the other two SoC-based devices with flow offload enabled, qualify the differences in performance due to the more-standard -O2 (or -O3) compiler optimizations, or the impact of changing the CPU-speed governor. The test scripts are linked in the first post for anyone interested in exploring these effects further.

(Notes from the lead post and its tables in relation to stream balance and SQM behavior likely apply here as well.)

Routing/NAT – Flow Offload Enabled, -O2

| Target | Clock (MHz) | Cores | SoC / CPU | Notes | 8 Dn | 8 Up | RRUL | 8 Dn / SQM | 8 Up / SQM | RRUL / SQM |
|---|---|---|---|---|---|---|---|---|---|---|
| ipq40xx | 717 | 4 | IPQ4019 | Single RGMII | 920 (6) | 940 (5) | 1280 (4) | 240 (3) | 340 (7) | 270 (3) |
| ath79 | 775 | 1 | QCA9563 | Single MAC | 870 (6) | 640 (6) | 660 (5) | 230 (2) | 260 (3) | 260 (4) |

WireGuard, Routing/NAT – Flow Offload Enabled, -O2

| Target | Clock (MHz) | Cores | SoC / CPU | Notes | 8 Dn | 8 Up | RRUL | 8 Dn / SQM | 8 Up / SQM | RRUL / SQM |
|---|---|---|---|---|---|---|---|---|---|---|
| ipq40xx | 717 | 4 | IPQ4019 | Single RGMII | 400 (18) | 280 (57) | 300 (45) | 197 (8) | 171 (7) | 196 (9) |
| ath79 | 775 | 1 | QCA9563 | Single MAC | 88 (33) | 91 (39) | 89 (33) | 43 (10) | 48 (9) | 45 (10) / 47 (12) |

OpenVPN, Routing/NAT – Flow Offload Enabled, -O2

| Target | Clock (MHz) | Cores | SoC / CPU | Notes | 8 Dn | 8 Up | RRUL | 8 Dn / SQM | 8 Up / SQM | RRUL / SQM |
|---|---|---|---|---|---|---|---|---|---|---|
| ipq40xx | 717 | 4 | IPQ4019 | Single RGMII | 39 (68) | 30 (24) | 37 (66) | 23 (9) / 25 (59) | 22 (4) | 22 (5) |
| ath79 | 775 | 1 | QCA9563 | Single MAC | 24 (167) | 17 (55) | 23 (155) | 13 (9) / 15 (66) | 13 (8) / 14 (40) | 11 (10) / 15 (64) |

Edit History: 2019-09-20 -- Transcription error in WireGuard numbers for QCA9563 device corrected.


Thanks for the awesome set of numbers!
Can you clarify what the numbers in parentheses mean, e.g. 24 (167)?


I was looking for this as well, but it is in Jeff's excellent methods section already. Maybe adding it to the table headers, like "(RTT)", would make this clearer for casual users of the table, who might have forgotten all the details already?


24 Mbps throughput, with 167 ms ping time

(In my opinion, that kind of ping time is a good argument for running SQM when you're pushing your router's CPU hard, not just your ISP line.)



15 Mbps throughput with 64 ms ping time

but since ping time was over 10 ms, I found what was just under 10 ms and reported

11 Mbps throughput with 10 ms ping time


Edit: Key added above the first table in the lead post. Suggestions for additional improvements are welcome!


@Trenton @zakporter Please open a new topic for your issues.
Thanks.


Nice work. It's pretty hard to find performance numbers on SQM, so these benchmarks are appreciated.

Did you do anything to constrain the clock speed on Celeron J4105 and AMD GX-412TC so you're not just measuring performance at a temporary boost state?

Given that Celeron J4105 seems to max out the SQM gigabit benchmark, it's hard to tell how much headroom is left. It would be useful to know CPU core usage during the benchmark too.

I also noticed that routing seems to be entirely single-threaded with these benchmarks. Is this an inherent limitation of Linux? I did some benchmarking of my EdgeRouter X with NAT and CAKE on EdgeOS and saw that upstream and downstream combined were greater than any single direction. That suggests that routing can scale to at least 2 cores, though there might be other factors causing that.


Was hoping to get some CPU-load numbers for you in the Celeron J4105, but haven't quite yet.

I didn't constrain the clock speed on any of the devices shown in the lead tables. The runs load the router for 60 seconds, so I tend to believe that it isn't "burst" performance. Unless you consider 900 Mbit/s * 60 s > 6 Gbyte transferred a "burst".

It does seem that load from interface management, routing/NAT, and queue management (SQM) does not distribute itself well over multiple cores. I was a little surprised by that myself. I poked at it a tiny bit, but not enough to get a sense of what was going on. The surprising, self-limiting performance of the 1 GHz-class AMD64, the IPQ4019 and the GigE-enabled ath79 SoCs with SQM definitely caught my eye as well. That it is downstream doesn't surprise me as, the way Linux works, it seems you need to accept the packet, then queue it in an intermediate, virtual interface for bandwidth management, in contrast to upstream where you can simply queue and manage the transmit queue. There might be some "magic" possible with manual core affinity, but then again, you're messing with a single skb (packet buffer) which I would hope isn't copied around at all.

From general Linux performance testing on multi-CPU systems, you might want to look at per-core utilization, split between user/system and specifically irq/sirq (both parts of "system"). Often a single core is 100% utilized with IRQ handling, becoming the system bottleneck, while the other cores are somewhat idle.

I've used mpstat and top to get it; e.g. this is my single-CPU router doing ~240 Mbit/s and choking on software IRQs (last column) at the peak of a test against fast.com:

# while sleep 2; do top -n1 -b |grep "CPU"|grep -vE "grep|PPID"; done
CPU:  0.0% usr  0.0% sys  0.0% nic  100% idle  0.0% io  0.0% irq  0.0% sirq
CPU:  0.0% usr  0.0% sys  0.0% nic 83.3% idle  0.0% io  0.0% irq 16.6% sirq
CPU:  0.0% usr  0.0% sys  0.0% nic 91.6% idle  0.0% io  0.0% irq  8.3% sirq
CPU:  6.6% usr  0.0% sys  0.0% nic 13.3% idle  0.0% io  0.0% irq 80.0% sirq
CPU:  0.0% usr  8.3% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq 91.6% sirq
CPU:  0.0% usr  0.0% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq  100% sirq
CPU:  0.0% usr  0.0% sys  0.0% nic 90.9% idle  0.0% io  0.0% irq  9.0% sirq
CPU:  0.0% usr  0.0% sys  0.0% nic  100% idle  0.0% io  0.0% irq  0.0% sirq

Check if your router supports top -1 to show the per-CPU distribution; then you might be able to pinpoint the bottleneck. E.g. on this 4-core x86_64 Linux machine a single core handles all IRQs, utilizing up to 17% of a single core for software IRQs (si, second-to-last column):

$ while sleep 2; do top -n1 -b -1 |grep "%Cpu"|grep -vE "grep|PPID";echo; done
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  5.6 us,  5.6 sy,  0.0 ni, 83.3 id,  0.0 wa,  5.6 hi,  0.0 si,  0.0 st

%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  5.9 us,  0.0 sy,  0.0 ni, 94.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  6.2 us,  6.2 sy,  0.0 ni, 87.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 93.3 id,  0.0 wa,  0.0 hi,  6.7 si,  0.0 st

%Cpu0  :  6.2 us, 12.5 sy,  0.0 ni, 81.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 82.4 id,  0.0 wa,  0.0 hi, 17.6 si,  0.0 st

%Cpu0  :  6.7 us,  0.0 sy,  0.0 ni, 93.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 11.8 us,  5.9 sy,  0.0 ni, 82.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 82.4 id,  0.0 wa,  0.0 hi, 17.6 si,  0.0 st

%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  5.9 us, 11.8 sy,  0.0 ni, 82.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  5.9 us, 11.8 sy,  0.0 ni, 82.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 87.5 id,  0.0 wa,  0.0 hi, 12.5 si,  0.0 st

%Cpu0  :  0.0 us,  6.2 sy,  0.0 ni, 93.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni, 94.4 id,  0.0 wa,  0.0 hi,  5.6 si,  0.0 st
%Cpu2  :  5.6 us, 16.7 sy,  0.0 ni, 77.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 88.2 id,  0.0 wa,  0.0 hi, 11.8 si,  0.0 st

Just a related fact: traffic shaping requires timely access to the CPU, so latency is even more important than computational bandwidth. If a shaper fails to inject a packet into the underlying layer in time, there is going to be a "(micro-)bubble" in the queue, which will lead to less bandwidth efficiency and increased delay. My gut feeling is that it is this property that makes SQM/traffic-shapers quite sensitive to frequency-scaling/power-saving features, especially when it takes long to ramp the CPU back up again.


This is great work! It would be great to have this in the wiki so people can figure out what specs they need for a given WAN bandwidth.


About irqbalance:
When the CPU is getting maxed out, the scheduler will try to keep irqbalance near the top of the run list (because the SCHED_OTHER scheduler is biased toward running the processes that have waited the longest first). The problem is that real-time code, like the interrupt handlers and particularly the netfilter stack, will stop irqbalance from correctly migrating the IRQ to another CPU. On a CPU at its limit, migrating IRQs will reduce performance because it interferes with CPU caching.
irqbalance is a program that is better run under the real-time scheduler, and its purpose is not to improve raw performance but to allow the kernel to queue jobs more efficiently. If the CPU is not starving, a real benefit is measurable in terms of responsiveness/latency.


Plus, on some devices, important parts of the hardware (wifi and Ethernet subsystems, for example) use an IRQ that cannot migrate and must always be served by the same CPU. In these cases "irqbalance" cannot improve the performance.


Test results for an inexpensive MT7628N-based device from a reputable manufacturer added.

Note that there have been significant changes on master since the earlier results, including compiler changes.
