Comparative Throughput Testing Including NAT, SQM, WireGuard, and OpenVPN

hmm yea, i was talking about nat+sqm tho...

I think NAT+SQM is also single threaded?

There's a cool new cake patch someone needs to slam into openwrt, here:

There was also a patch to wireguard that went by earlier, which has hopefully also made it in....

To answer the sqm question, yes, dang it, sqm is essentially single threaded. I'd love some help towards making a multicore shaper. Fq_codel_fast helpers, test and testers?


I am curious, what patch are you talking about here?

I don't know if this has made openwrt yet.

[PATCH 5.6 029/177] wireguard: queueing: preserve flow hash across packet scrubbing

[ Upstream commit c78a0b4a78839d572d8a80f6a62221c0d7843135 ]

It's important that we clear most header fields during encapsulation and
decapsulation, because the packet is substantially changed, and we don't
want any info leak or logic bug due to an accidental correlation. But,
for encapsulation, it's wrong to clear skb->hash, since it's used by
fq_codel and flow dissection in general. Without it, classification does
not proceed as usual. This change might make it easier to estimate the
number of innerflows by examining clustering of out of order packets,
but this shouldn't open up anything that can't already be inferred
otherwise (e.g. syn packet size inference), and fq_codel can be disabled

Furthermore, it might be the case that the hash isn't used or queried at
all until after wireguard transmits the encrypted UDP packet, which
means skb->hash might still be zero at this point, and thus no hash
taken over the inner packet data. In order to address this situation, we
force a calculation of skb->hash before encrypting packet data.

Of course this means that fq_codel might transmit packets slightly more
out of order than usual. Toke did some testing on beefy machines with
high quantities of parallel flows and found that increasing the
replay-attack counter to 8192 takes care of the most pathological cases
pretty well.
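To make the effect of the commit above concrete, here is a toy model (not kernel code) of why preserving the flow hash matters. It assumes a simplified classifier that maps a packet's hash to one of 1024 queues (fq_codel's default flow count) with a plain modulo, and it uses CRC32 as a stand-in for the kernel's flow hash. The helper names, addresses, and ports are all hypothetical.

```python
import zlib

NUM_QUEUES = 1024  # fq_codel's default number of flow queues


def flow_hash(five_tuple):
    """Stand-in for the kernel flow hash: CRC32 over the 5-tuple (assumption)."""
    return zlib.crc32(repr(five_tuple).encode())


def queue_for(skb_hash):
    """Simplified classifier: map a packet hash to a queue index."""
    return skb_hash % NUM_QUEUES


def encapsulate(inner_tuple, tunnel_tuple, preserve_hash):
    """Return the hash the qdisc sees for the encrypted outer packet.

    preserve_hash=True models the patched behaviour: the hash is computed
    over the inner packet before encryption and kept across scrubbing.
    preserve_hash=False models the old behaviour: the hash is cleared, so
    the qdisc can only hash the outer tunnel header.
    """
    if preserve_hash:
        return flow_hash(inner_tuple)
    return flow_hash(tunnel_tuple)


# Eight inner TCP flows, all carried by one WireGuard UDP tunnel.
inner_flows = [("10.0.0.2", 40000 + i, "10.0.0.1", 443, "tcp") for i in range(8)]
tunnel = ("192.0.2.10", 51820, "198.51.100.20", 51820, "udp")

queues_without = {queue_for(encapsulate(f, tunnel, False)) for f in inner_flows}
queues_with = {queue_for(encapsulate(f, tunnel, True)) for f in inner_flows}

print("queues used without the patch:", len(queues_without))
print("queues used with the patch:   ", len(queues_with))
```

Without the preserved hash every inner flow shares the tunnel's single queue, so a bulk transfer and a VoIP call inside the tunnel compete head to head. With it, the inner flows spread across queues and fq_codel can interleave them, which is also why some reordering across the tunnel becomes possible.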


Rather happy with this. Squint at the bottom for the "after".


Holy shit that is a massive improvement for bufferbloat on wireguard! Is this already included in the kernel version for mainline linux? And when will this hit the Openwrt version? I am assuming both ends of the link require this patch for the full benefits?

Thought it was there


By the way, is this improvement by default, or do you need to set up SQM on the wireguard interface to reap these benefits?

no, sqm on the main egress interface (or line rate on the wifi) - or even line rate on ethernet if that's your bottleneck - "just works".

Note there's a cake patch also, I don't know if that's in openwrt yet, either.

I really do want some benchmarking of the differences here in the real world. I figure that MOST of the time wireguard is rate limited by the egress interface, not by crypto, in the openwrt world, but I'd like to know more.... If there is anyone here who can do a before/after with stuff like the rrul test, or, for example, voip over wireguard while under other loads, it would be nice.

If it's limited more by crypto, well, I'd long planned to add something called "crypto queue limits", but that fix was WAY more invasive and we've not gotten around to it.

well, that would be best yes. Better levels of FQ-anything on either side, though, tend to drive both sides towards better multiplexing in general.

@jeff you still set up to test some stuff? Cool wireguard patch....

Another commit of interest on this front.

/me kisses @ldir on both cheeks and goes to compile openwrt from head.

Here are some measurements on a Netgear WNDR3700v2 running openwrt 19.07.2 (mostly default settings, no performance related tweaks)
It has an Atheros AR7161 CPU (ath79 family).

Command line used:

flent tcp_8down -H
flent tcp_8up -H
flent rrul -H
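For anyone wanting to repeat these runs, the following is a minimal sketch that builds the same three flent invocations for a given server. The host name `netperf.example.net` is a placeholder (the original commands do not show the host), and the 60 second length and the title suffix are assumptions rather than the settings used above.

```python
import shlex

# The three flent tests used in the measurements above.
TESTS = ["tcp_8down", "tcp_8up", "rrul"]


def flent_commands(host, length_s=60):
    """Build one flent command line (as an argv list) per test.

    host     -- netperf server to test against (placeholder, not from the post)
    length_s -- test length in seconds (assumed value)
    """
    return [
        ["flent", test, "-H", host, "-l", str(length_s), "-t", f"{test}-baseline"]
        for test in TESTS
    ]


if __name__ == "__main__":
    for argv in flent_commands("netperf.example.net"):
        # Print instead of executing, so the sketch is safe to run anywhere.
        print(shlex.join(argv))
```

To actually run the tests, pass each list to `subprocess.run(argv, check=True)` on a machine with flent and netperf installed.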

Routing/NAT (plain IP with DHCP and PPPoE)

No Flow Offload

Target Clock Cores SoC / CPU Notes 8 Dn 8 Up RRUL
ath79 680 1 AR7161 DHCP/NAT 239
ath79 680 1 AR7161 PPPoE/NAT 347

With Flow Offload

Target Clock Cores SoC / CPU Notes 8 Dn 8 Up RRUL
ath79 680 1 AR7161 DHCP/NAT 684
ath79 680 1 AR7161 PPPoE/NAT 583

Note: no SQM tested

Raw data here:


Hmm, it seems odd that you get more throughput with costly PPPoE than with DHCP. DHCP should only cost you a bit of computation once per lease timeout, while PPPoE adds a cost to every single packet. Any idea why?

No, but even directly PC-to-PC (so no router in between) sometimes does not go at full speed.

For example, when I run the PPPoE server in one VM and the client in another (on the same hardware), I get very low speed (400 Mbps, almost slower than my little router). Also, running the server in a VM versus on bare metal gives different results (for example: one is fast for the tcp_8down test but also has higher ping, while the other has noticeably lower ping but also a lower transfer rate).

The PC (server) is an Intel i5-2320 (4 cores, 3 GHz); the other PC (client, laptop) has an i3-5005U (2 GHz, 2 cores/4 threads).

Weird, but I guess those numbers are the minimum. More might be possible in other circumstances.

PS: I used Ubuntu 20.04 x64 as PPPoE server (and client PC)


I'm interested in the RK3328 SOC, any ideas regarding the specs it would get?

Did anybody else see a Wireguard performance loss on ipq40xx (Zyxel NBG6617/GL-inet B1300) between 19.07.3 and 19.07.4?
I discovered this when I updated from 19.07.3 to 19.07.5: my Wireguard throughput dropped from about 350-400 to 250-280 when downloading from the Internet.

To test this, I did some synthetic benchmarks without Internet in a Gigabit network environment, just to exclude my ISP from this calculation, which resulted in the same numbers...
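As a rough sanity check on the size of that regression, the relative drop can be worked out from the figures quoted above. This is only a sketch: it assumes the low ends and the high ends of the two ranges pair up, and that both ranges are in the same unit.

```python
def relative_drop(before, after):
    """Percentage of throughput lost going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Figures from the post: 350-400 before the update, 250-280 after.
low_end = relative_drop(350, 250)
high_end = relative_drop(400, 280)

print(f"low end of range:  {low_end:.1f}% slower")
print(f"high end of range: {high_end:.1f}% slower")
```

Either way the loss is a bit under a third of the previous throughput, which is well outside normal run-to-run noise.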

The current master branch is also affected.

I had "performance" set as the cpufreq governor, irqbalance was running, and software offload was activated.

Anybody else with the same experience?

What I find interesting is that a dual-core 1 GHz SoC can almost match, and in some tests nearly outperform, an x86_64 4-core 1.5-2.5 GHz CPU.
That makes me think something is wrong. Maybe x86_64 needs some special changes or tweaks for best results? Have you tested with the CPU policy set to performance, or with on-demand settings tweaked to ramp up to the highest possible clock?

Sadly I only have an IPQ8065 router and no x86_64 machine, and I don't think a VM would be reliable: it would share the same network card/interface with the host, which would give inaccurate results when testing VM to VM or VM to the main PC.