Comparative Throughput Testing Including NAT, SQM, WireGuard, and OpenVPN

I don't know if this has made it into OpenWrt yet.

[PATCH 5.6 029/177] wireguard: queueing: preserve flow hash across packet scrubbing

[ Upstream commit c78a0b4a78839d572d8a80f6a62221c0d7843135 ]

It's important that we clear most header fields during encapsulation and
decapsulation, because the packet is substantially changed, and we don't
want any info leak or logic bug due to an accidental correlation. But,
for encapsulation, it's wrong to clear skb->hash, since it's used by
fq_codel and flow dissection in general. Without it, classification does
not proceed as usual. This change might make it easier to estimate the
number of innerflows by examining clustering of out of order packets,
but this shouldn't open up anything that can't already be inferred
otherwise (e.g. syn packet size inference), and fq_codel can be disabled
anyway.

Furthermore, it might be the case that the hash isn't used or queried at
all until after wireguard transmits the encrypted UDP packet, which
means skb->hash might still be zero at this point, and thus no hash
taken over the inner packet data. In order to address this situation, we
force a calculation of skb->hash before encrypting packet data.

Of course this means that fq_codel might transmit packets slightly more
out of order than usual. Toke did some testing on beefy machines with
high quantities of parallel flows and found that increasing the
replay-attack counter to 8192 takes care of the most pathological cases
pretty well.

Rather happy with this. Squint at the bottom for the "after".

Holy shit, that is a massive improvement for bufferbloat on WireGuard! Is this already included in the mainline Linux kernel? And when will this hit the OpenWrt version? I am assuming both ends of the link require this patch for the full benefits?

Thought it was there

By the way, is this improvement active by default, or do you need to set up SQM on the WireGuard interface to reap these benefits?

No, SQM on the main egress interface (or line rate on the wifi) - or even line rate on ethernet, if that's your bottleneck - "just works".
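To make that concrete, here is a minimal sketch of what "SQM on the main egress interface" means in practice with the sqm-scripts package: the shaper goes on the WAN device, not on the wg interface. The device name (eth0.2) and the rates are placeholders for your own WAN device and roughly 90% of your measured line speed:

# sketch only: cake + piece_of_cake.qos on the WAN egress, not on wg0
opkg update && opkg install sqm-scripts luci-app-sqm
uci set sqm.@queue[0].interface='eth0.2'    # your WAN device
uci set sqm.@queue[0].qdisc='cake'
uci set sqm.@queue[0].script='piece_of_cake.qos'
uci set sqm.@queue[0].download='180000'     # kbit/s, ~90% of measured downlink
uci set sqm.@queue[0].upload='18000'        # kbit/s, ~90% of measured uplink
uci set sqm.@queue[0].enabled='1'
uci commit sqm && /etc/init.d/sqm restart

With the shaper sitting on the WAN device, the encrypted wg packets are classified there, which is exactly where the flow hash preserved by the patch above pays off.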

Note there's a cake patch also; I don't know if that's in OpenWrt yet, either.

I really do want some benchmarking of the differences here in the real world. I figure that MOST of the time WireGuard is rate-limited by the egress interface, not by crypto, in the OpenWrt world, but I'd like to know more... If there is anyone here who can do a before/after with something like the rrul test, or, for example, VoIP over WireGuard while under other loads, that would be nice.

If it's limited more by crypto, well, I'd long planned to stick in something called "crypto queue limits", but that fix was WAY more invasive and we never got around to it.

Well, that would be best, yes. Better levels of FQ-anything on either side, though, tend to drive both sides towards better multiplexing in general.

@jeff are you still set up to test some stuff? Cool WireGuard patch...

Another commit of interest on this front.

/me kisses @ldir on both cheeks and goes to compile OpenWrt from HEAD.

Here are some measurements on a Netgear WNDR3700v2 running OpenWrt 19.07.2 (mostly default settings, no performance-related tweaks).
It has an Atheros AR7161 CPU (ath79 family).

Command line used:

flent tcp_8down -H server.name
flent tcp_8up -H server.name
flent rrul -H server.name

Routing/NAT (plain IP with DHCP and PPPoE)

No Flow Offload

Target   Clock (MHz)   Cores   SoC / CPU   Notes       8 Dn       8 Up      RRUL
ath79    680           1       AR7161      DHCP/NAT    239 (10)   383 (7)   285 (13)
ath79    680           1       AR7161      PPPoE/NAT   347 (8)    344 (8)   372 (11)

With Flow Offload

Target   Clock (MHz)   Cores   SoC / CPU   Notes       8 Dn       8 Up      RRUL
ath79    680           1       AR7161      DHCP/NAT    684 (4)    883 (3)   711 (5)
ath79    680           1       AR7161      PPPoE/NAT   583 (4)    683 (5)   602 (5)

Note: no SQM tested

Raw data here: https://pastebin.ubuntu.com/p/kbSfF8X4mW/

Hmm, it seems odd that you get more throughput with costly PPPoE than with plain DHCP: DHCP should only cost a bit of computation once per lease renewal, while PPPoE adds overhead to every single packet. Any idea why?

No, but even directly PC-to-PC (so no router in between) it sometimes does not go at full speed.

For example, when I run the PPPoE server in one VM and the client in another (on the same hardware), I get very low speed (400 Mbps, barely faster than my little router). Also, running the server in a VM versus on bare metal gives different results (for example: one is fast for the tcp_8down test but also has a bigger ping, while the other has noticeably lower ping but also a lower transfer rate).
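For anyone wanting to reproduce that kind of VM-to-VM test, rp-pppoe ships a small test server; the interface name, address range and session count below are illustrative placeholders, not the exact setup used here:

# Ubuntu: the 'pppoe' package provides rp-pppoe's pppoe-server
sudo apt install pppoe
# serve PPPoE on the VM's NIC (ens3 is a placeholder), handing out
# addresses from 192.168.42.10 to at most 4 concurrent sessions
sudo pppoe-server -I ens3 -L 192.168.42.1 -R 192.168.42.10 -N 4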

The PC (server) is an Intel i5-2320 (4 cores, 3 GHz); the other PC (client, a laptop) has an i3-5005U (2 GHz, 2 cores/4 threads).

Weird, but I guess those numbers are the minimum. More might be possible in other circumstances.

PS: I used Ubuntu 20.04 x64 as the PPPoE server (and on the client PC).

Hi,

I'm interested in the RK3328 SoC; any idea what kind of numbers it would get?

Did anybody else see a WireGuard performance loss on ipq40xx (Zyxel NBG6617/GL-inet B1300) between 19.07.3 and 19.07.4?
I discovered this when I updated from 19.07.3 to 19.07.5: my WireGuard throughput dropped from about 350-400 to 250-280 when downloading from the Internet.

To test this, I did some synthetic benchmarks without the Internet, in a Gigabit network environment, just to take my ISP out of the equation, and they showed the same numbers...

The current master branch is also affected.

The cpufreq governor was set to "performance", irqbalance was running, and I had software flow offloading activated.

Anybody else with the same experience?
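For reference, that test setup corresponds roughly to the following on a stock image (a sketch only; the sysfs path assumes the target exposes cpufreq, and flow_offloading is the software offload toggle in the firewall defaults):

# set the cpufreq governor to performance on every core
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$c"
done

# enable software flow offloading
uci set firewall.@defaults[0].flow_offloading='1'
uci commit firewall && /etc/init.d/firewall restart

# spread hardware IRQs across cores (recent packages may also need
# enabled='1' in /etc/config/irqbalance)
opkg update && opkg install irqbalance
/etc/init.d/irqbalance enable && /etc/init.d/irqbalance start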

What I find interesting is that a dual-core 1 GHz SoC can almost beat an x86_64 4-core 1.5-2.5 GHz CPU, and in some tests the dual-core SoC even seems to outperform the x86_64 CPU.
Something tells me that something is wrong. Maybe it requires some special changes/tweaks for best results on the x86_64 platform? Have you tested with the CPU governor set to performance, or with the ondemand settings tweaked to ramp up to the highest clock possible?

Sadly I only have an IPQ8065 router and no x86_64 box, and I don't think using a VM is reliable, since it would share the same network card/interface, which would give inaccurate results if you tested from VM to VM or from a VM to the main PC.

I have this single-core NETGEAR R6220 (880 MHz).

In my preliminary tests, SQM + cake + piece_of_cake.qos cut my download from 200 Mbps to more like 106-130 Mbps. Shaving 10% off the up/down rates iteratively didn't do anything. top didn't appear to show the CPU maxing out (over 30% idle).

Would a new, faster router give me speeds more like 80% of non-SQM (160+ Mbps)? Or maybe I set something wrong? I double-checked the WAN settings and they seemed to be correct.

My connection: cable, able to sustain 200 Mbps on long downloads. Fast.com shows 250 Mbps; with SQM it showed around 150. Pre-SQM, the other tests showed 180-220. I don't have any results to share because I had to bash the settings a bit to get them to run, and I'm not sure why, so I doubt they are forum-optimal settings. I can run more tests, but first I want to know whether the router is even capable of high speeds with SQM.

You say you are getting both 106-130 Mbps and 150 Mbps with CAKE SQM. Assuming you mean that after some tuning you are getting up to 150 Mbps on OpenWrt 21.02, that sounds about right to me for your hardware.

My ER-X delivers 130-150 Mbps with CAKE SQM running 21.02. SQM runs on a single core (or thread) as I recall, so the two cores and four threads of the 880 MHz MT7621AT in the ER-X, compared to the single core and two threads of the 880 MHz MT7621ST in your R6220, should not increase its SQM capability that much.

My ER-X could handle 165-190 Mbps with CAKE SQM on 19.07, before 21.02 converted MT7621 targets to DSA (and I know not what else) and slowed things down. If an extra ~25% SQM throughput is worth re-configuring your network file from DSA back to swconfig, you could try downgrading OpenWrt to 19.07.

Also pay attention to SIRQ as well as idle; often that's where a maxout occurs.

I don't know that architecture off the top of my head, but if it's multi-core and you're running 21.x, you might try the new Packet Steering setting under Network > Interfaces > Global Network Settings. It seems to spread network SIRQs across CPU cores.

You might also try irqbalance; it seems to do similar but not quite the same things (hardware IRQ balancing only?). Both may benefit you as well. Experiment... it might help squeeze out more performance.
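For what it's worth, the Packet Steering toggle can also be flipped from the CLI on 21.02-era images; it lives in the globals section of /etc/config/network:

uci set network.globals.packet_steering='1'
uci commit network
/etc/init.d/network restart

Then re-run the speed test and watch whether the SIRQ load spreads across cores.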

Edit: I just noticed egnnic said it's a single core with two threads. Give it a try anyway; maybe it helps in a two-threaded situation?

Thanks. I did more extensive tests here: