Comparative Throughput Testing Including NAT, SQM, WireGuard, and OpenVPN

That should be right. The MT7621A is a dual-core MIPS 1004Kc, each core with 2-way SMT (four hardware threads in total).

Isn't it weird that ath79 and ipq40xx perform almost the same with SQM, while one is a single-core MIPS and the other a quad-core ARM? Are the three other cores used at all?

Anyway, for what it's worth, on my MT7621A DIR-860L B1, NAT is about 800 Mbps and SQM is between 200 and 300 Mbps (cake, piece_of_cake.qos).
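
For context, a minimal sketch of the kind of sqm-scripts setup behind numbers like these; the WAN interface name and the kbit/s rates here are illustrative assumptions, not the poster's exact config:

```
# SQM via sqm-scripts: cake qdisc with the piece_of_cake.qos script.
# Interface name and rates (kbit/s) below are assumptions for illustration.
opkg update && opkg install sqm-scripts
uci set sqm.@queue[0].interface='eth0.2'   # WAN interface
uci set sqm.@queue[0].download='250000'    # shaped ingress rate, kbit/s
uci set sqm.@queue[0].upload='250000'      # shaped egress rate, kbit/s
uci set sqm.@queue[0].qdisc='cake'
uci set sqm.@queue[0].script='piece_of_cake.qos'
uci set sqm.@queue[0].enabled='1'
uci commit sqm
/etc/init.d/sqm restart
```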


OpenVPN is single-threaded. The IPQ40xx could probably run three OpenVPN sessions at the same speed each, though.
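
To illustrate the point: since each OpenVPN process is single-threaded, the only way to use the spare cores is to run several independent instances. A rough sketch, where the config file names and the use of taskset are my assumptions, not something from this thread:

```
# Hypothetical: pin three single-threaded OpenVPN instances to separate
# cores on a quad-core board (taskset is available as an OpenWrt package).
taskset -c 1 openvpn --config /etc/openvpn/tun1.conf --daemon
taskset -c 2 openvpn --config /etc/openvpn/tun2.conf --daemon
taskset -c 3 openvpn --config /etc/openvpn/tun3.conf --daemon
```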

Hmm, yeah, but I was talking about NAT+SQM though...

I think NAT+SQM is also single-threaded?

There's a cool new cake patch someone needs to slam into OpenWrt, here:

https://lists.bufferbloat.net/pipermail/cake/2020-May/005257.html

There was also a patch to WireGuard that went by earlier, which has hopefully also made it in...

To answer the SQM question: yes, dang it, SQM is essentially single-threaded. I'd love some help towards making a multicore shaper. fq_codel_fast helpers, tests, and testers?
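
If you want to see this on your own hardware, an illustrative check (not from the thread): run a saturating test through SQM and watch where the softirq load lands.

```
# While a shaped speed test runs, softirq time typically piles up on one
# CPU while the others idle. In busybox top, press '1' for the per-CPU
# view (if it was built with SMP support).
top -d 1
# The per-CPU NET_RX / NET_TX counters show the same imbalance:
cat /proc/softirqs
```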


I am curious, what patch are you talking about here? 🙂

I don't know if this has made it into OpenWrt yet.

[PATCH 5.6 029/177] wireguard: queueing: preserve flow hash across packet scrubbing

[ Upstream commit c78a0b4a78839d572d8a80f6a62221c0d7843135 ]

It's important that we clear most header fields during encapsulation and
decapsulation, because the packet is substantially changed, and we don't
want any info leak or logic bug due to an accidental correlation. But,
for encapsulation, it's wrong to clear skb->hash, since it's used by
fq_codel and flow dissection in general. Without it, classification does
not proceed as usual. This change might make it easier to estimate the
number of innerflows by examining clustering of out of order packets,
but this shouldn't open up anything that can't already be inferred
otherwise (e.g. syn packet size inference), and fq_codel can be disabled
anyway.

Furthermore, it might be the case that the hash isn't used or queried at
all until after wireguard transmits the encrypted UDP packet, which
means skb->hash might still be zero at this point, and thus no hash
taken over the inner packet data. In order to address this situation, we
force a calculation of skb->hash before encrypting packet data.

Of course this means that fq_codel might transmit packets slightly more
out of order than usual. Toke did some testing on beefy machines with
high quantities of parallel flows and found that increasing the
replay-attack counter to 8192 takes care of the most pathological cases
pretty well.
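
If you want to check whether that commit has landed in the tree you're building from, something along these lines works (the checkout paths are assumptions):

```
# In a kernel source checkout (assumed directory name "linux"):
git -C linux log --oneline --grep='preserve flow hash across packet scrubbing'
# In an OpenWrt buildroot, look for a backport patch carrying the change.
# Note: WireGuard may instead come in via the wireguard-linux-compat
# package, in which case check that package's source instead.
grep -rl 'preserve flow hash' target/linux/ package/
```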


Rather happy with this. Squint at the bottom for the "after".


Holy shit, that is a massive improvement for bufferbloat on WireGuard! Is this already included in the mainline Linux kernel? And when will it hit the OpenWrt kernel? I am assuming both ends of the link require this patch for the full benefit?

Thought it was there.


By the way, is this improvement on by default, or do you need to set up SQM on the WireGuard interface to reap these benefits?

No; SQM on the main egress interface (or line rate on the wifi, or even line rate on ethernet if that's your bottleneck) "just works".

Note there's a cake patch too; I don't know whether that's in OpenWrt yet, either.

I really do want some benchmarking of the differences here in the real world. I figure that MOST of the time WireGuard is rate-limited by the egress interface, not by crypto, in the OpenWrt world, but I'd like to know more. If anyone here can do a before/after with something like the RRUL test, or, for example, VoIP over WireGuard while under other loads, that would be nice; see the sketch below.
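
One possible before/after recipe with flent, run against a netserver endpoint reached through the tunnel; the tunnel address and the run titles are placeholders, not from this thread:

```
# Before the patch: a 60-second RRUL run through the WireGuard tunnel.
flent rrul -l 60 -H 10.0.0.1 -t wg-before
# Upgrade both endpoints, then repeat:
flent rrul -l 60 -H 10.0.0.1 -t wg-after
# Overlay the two runs in a single plot (needs matplotlib installed):
flent -p all_scaled -o compare.png *wg-before*.flent.gz *wg-after*.flent.gz
```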

If it's limited more by crypto, well, I'd long planned to add something called "crypto queue limits", but that fix was WAY more invasive and we've not gotten around to it.


Well, that would be best, yes. Better FQ-anything on either side, though, tends to drive both sides towards better multiplexing in general.

@jeff, are you still set up to test some stuff? Cool WireGuard patch...

Another commit of interest on this front.

/me kisses @ldir on both cheeks and goes to compile OpenWrt from HEAD.

Here are some measurements on a Netgear WNDR3700v2 running OpenWrt 19.07.2 (mostly default settings, no performance-related tweaks).
It has an Atheros AR7161 CPU (ath79 family).

Command lines used:

```
flent tcp_8down -H server.name
flent tcp_8up -H server.name
flent rrul -H server.name
```

Routing/NAT (plain IP with DHCP and PPPoE)

No Flow Offload

| Target | Clock (MHz) | Cores | SoC / CPU | Notes | 8 Dn (Mbps) | 8 Up (Mbps) | RRUL (Mbps) |
|---|---|---|---|---|---|---|---|
| ath79 | 680 | 1 | AR7161 | DHCP/NAT | 239 (10) | 383 (7) | 285 (13) |
| ath79 | 680 | 1 | AR7161 | PPPoE/NAT | 347 (8) | 344 (8) | 372 (11) |

With Flow Offload

| Target | Clock (MHz) | Cores | SoC / CPU | Notes | 8 Dn (Mbps) | 8 Up (Mbps) | RRUL (Mbps) |
|---|---|---|---|---|---|---|---|
| ath79 | 680 | 1 | AR7161 | DHCP/NAT | 684 (4) | 883 (3) | 711 (5) |
| ath79 | 680 | 1 | AR7161 | PPPoE/NAT | 583 (4) | 683 (5) | 602 (5) |

Note: SQM was not tested.

Raw data here: https://pastebin.ubuntu.com/p/kbSfF8X4mW/


Hmm, it seems odd that you get more throughput with costly PPPoE than with DHCP. DHCP should only cost a bit of computation once per lease renewal, while PPPoE adds encapsulation work to every single packet. Any idea why?

No, but even a direct PC-to-PC link (so no router in between) sometimes does not run at full speed.

For example, when I run the PPPoE server in one VM and the client in another (on the same hardware), I get very low speed (400 Mbps, almost slower than my little router). Also, running the server in a VM versus on bare metal gives different results (for example: one is fast on the tcp_8down test but also has higher ping, while the other has noticeably lower ping but also a lower transfer rate).

The PC (server) is an Intel i5-2320 (4 cores, 3 GHz); the other PC (client, a laptop) has an i3-5005U (2 GHz, 2 cores/4 threads).

Weird, but I guess those numbers are the minimum. More might be possible in other circumstances.

PS: I used Ubuntu 20.04 x64 as the PPPoE server (and on the client PC).
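
For reference, this is roughly how such a test server can be brought up with rp-pppoe on Ubuntu; the interface name and address pool are assumptions, not the exact setup described above:

```
# Start an rp-pppoe test server on a spare NIC: local side 192.168.100.1,
# client addresses from 192.168.100.10, up to 4 sessions.
sudo apt install pppoe
sudo pppoe-server -I enp3s0 -L 192.168.100.1 -R 192.168.100.10 -N 4
```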