Comparative Throughput Testing Including NAT, SQM, WireGuard, and OpenVPN

[Screenshot: 24 Mbps throughput, with 167 ms ping time]

(In my opinion, that kind of ping time is a good argument for running SQM when you're pushing your router's CPU hard, not just your ISP line.)


[Screenshot: 15 Mbps throughput with 64 ms ping time]

Since that ping time was over 10 ms, I found the throughput where it stayed just under 10 ms and reported:

11 Mbps throughput with 10 ms ping time


Edit: Key added above the first table in the lead post. Suggestions for additional improvements are welcome!

1 Like

@Trenton @zakporter Please open a new topic for your issues.
Thanks.

3 Likes

Nice work. It's pretty hard to find performance numbers for SQM, so these benchmarks are appreciated.

Did you do anything to constrain the clock speed on the Celeron J4105 and AMD GX-412TC, so that you're not just measuring performance in a temporary boost state?

Given that the Celeron J4105 seems to max out the gigabit SQM benchmark, it's hard to tell how much headroom is left. It would be useful to know CPU core usage during the benchmark too.

I also noticed that routing seems to be entirely single-threaded in these benchmarks. Is this an inherent limitation of Linux? I did some benchmarking of my EdgeRouter X with NAT and CAKE on EdgeOS and saw that upstream and downstream combined were greater than either single direction alone. That suggests routing can scale to at least two cores, though there might be other factors causing that.

2 Likes

I was hoping to get some CPU-load numbers for you on the Celeron J4105, but haven't quite gotten to it yet.

I didn't constrain the clock speed on any of the devices shown in the lead tables. The runs load the router for 60 seconds, so I tend to believe it isn't "burst" performance, unless you consider 900 Mbit/s × 60 s (over 6 GByte transferred) a "burst".

It does seem that load from interface management, routing/NAT, and queue management (SQM) does not distribute itself well over multiple cores. I was a little surprised by that myself. I poked at it a bit, but not enough to get a sense of what was going on. The surprisingly self-limiting SQM performance of the 1 GHz-class AMD64, the IPQ4019, and the GigE-enabled ath79 SoCs definitely caught my eye as well.

That it shows up downstream doesn't surprise me: the way Linux works, you have to accept the packet and then queue it on an intermediate, virtual interface for bandwidth management, in contrast to upstream, where you can simply queue and manage the real transmit queue. There might be some "magic" possible with manual core affinity, but then you're juggling a single skb (packet buffer), which I would hope isn't copied around at all.
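
For reference, a minimal sketch of the kind of ingress setup sqm-scripts builds for the downstream path (illustrative only: eth0 as the WAN interface and the bandwidth figures are placeholders, and the ifb and CAKE modules need to be available):

# Downstream: accept the packet on the WAN interface, then redirect it to a
# virtual IFB device where the shaper (CAKE here) can queue and pace it.
ip link add name ifb4eth0 type ifb
ip link set dev ifb4eth0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all \
    u32 match u32 0 0 action mirred egress redirect dev ifb4eth0
tc qdisc add dev ifb4eth0 root cake bandwidth 100mbit
# Upstream: the shaper can sit directly on the real transmit queue.
tc qdisc add dev eth0 root cake bandwidth 20mbit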

Coming from general Linux performance testing on multi-CPU systems: you might want to look at per-core utilization, split between user/system and specifically irq/sirq (both counted as part of system time). Often a single core is 100% utilized with IRQ handling and becomes the system bottleneck while the other cores sit mostly idle.

I've used mpstat and top to get this. For example, this is my single-CPU router doing ~240 Mbit/s and choking on software IRQs (last column) at the peak of a test against fast.com:

# while sleep 2; do top -n1 -b |grep "CPU"|grep -vE "grep|PPID"; done
CPU:  0.0% usr  0.0% sys  0.0% nic  100% idle  0.0% io  0.0% irq  0.0% sirq
CPU:  0.0% usr  0.0% sys  0.0% nic 83.3% idle  0.0% io  0.0% irq 16.6% sirq
CPU:  0.0% usr  0.0% sys  0.0% nic 91.6% idle  0.0% io  0.0% irq  8.3% sirq
CPU:  6.6% usr  0.0% sys  0.0% nic 13.3% idle  0.0% io  0.0% irq 80.0% sirq
CPU:  0.0% usr  8.3% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq 91.6% sirq
CPU:  0.0% usr  0.0% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq  100% sirq
CPU:  0.0% usr  0.0% sys  0.0% nic 90.9% idle  0.0% io  0.0% irq  9.0% sirq
CPU:  0.0% usr  0.0% sys  0.0% nic  100% idle  0.0% io  0.0% irq  0.0% sirq

Check whether your router supports top -1 to show the per-CPU distribution; then you might be able to pinpoint the bottleneck. For example, on this 4-core x86-64 Linux box a single core handles all the IRQs, using up to 17% of that core for software IRQs (si, second-to-last column):

$ while sleep 2; do top -n1 -b -1 |grep "%Cpu"|grep -vE "grep|PPID";echo; done
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  5.6 us,  5.6 sy,  0.0 ni, 83.3 id,  0.0 wa,  5.6 hi,  0.0 si,  0.0 st

%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  5.9 us,  0.0 sy,  0.0 ni, 94.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  6.2 us,  6.2 sy,  0.0 ni, 87.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 93.3 id,  0.0 wa,  0.0 hi,  6.7 si,  0.0 st

%Cpu0  :  6.2 us, 12.5 sy,  0.0 ni, 81.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 82.4 id,  0.0 wa,  0.0 hi, 17.6 si,  0.0 st

%Cpu0  :  6.7 us,  0.0 sy,  0.0 ni, 93.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 11.8 us,  5.9 sy,  0.0 ni, 82.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 82.4 id,  0.0 wa,  0.0 hi, 17.6 si,  0.0 st

%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  5.9 us, 11.8 sy,  0.0 ni, 82.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  5.9 us, 11.8 sy,  0.0 ni, 82.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 87.5 id,  0.0 wa,  0.0 hi, 12.5 si,  0.0 st

%Cpu0  :  0.0 us,  6.2 sy,  0.0 ni, 93.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni, 94.4 id,  0.0 wa,  0.0 hi,  5.6 si,  0.0 st
%Cpu2  :  5.6 us, 16.7 sy,  0.0 ni, 77.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 88.2 id,  0.0 wa,  0.0 hi, 11.8 si,  0.0 st
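
For completeness, mpstat gives a similar per-CPU view (it's part of the sysstat package, which usually has to be installed separately on OpenWrt); something like:

# Print per-CPU utilization every 2 seconds, 5 times; %irq and %soft are the
# hardware- and software-interrupt columns to watch during a throughput test.
mpstat -P ALL 2 5
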
4 Likes

Just a related fact: traffic shaping requires timely access to the CPU, so latency matters even more than raw computational bandwidth. If a shaper fails to inject a packet into the underlying layer in time, there will be a "(micro-)bubble" in the queue, which leads to lower bandwidth efficiency and increased delay. My gut feeling is that this property is what makes SQM/traffic shapers quite sensitive to frequency-scaling/power-saving features, especially when it takes a long time to ramp the CPU back up again.
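
One quick sanity check is to pin the cpufreq governor to performance for the duration of a run (a sketch; this assumes the target exposes the standard Linux cpufreq sysfs interface, which not every OpenWrt device does):

# Show the current governor for each core (if cpufreq is available at all)
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Temporarily pin every core to the performance governor for a benchmark run
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done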

1 Like

This is great work! It would be helpful to have this in the wiki so people can figure out what specs they need for a given WAN bandwidth.

3 Likes

About irqbalance:
When the CPU is maxed out, the scheduler will try to keep irqbalance near the top of the run list (because the normal, non-real-time scheduler is biased toward running the processes that have waited the longest first). The problem is that real-time code, like the interrupt handlers themselves and particularly the netfilter stack, will stop irqbalance from migrating the IRQ to another CPU in time. On a CPU that is already at its limit, migrating IRQs will reduce performance because it interferes with CPU caching.
irqbalance is a program that is better run under the real-time scheduler, and its purpose is not to improve raw performance but to allow the kernel to queue jobs more efficiently. If the CPU is not starved, a real benefit is measurable in terms of responsiveness/latency.
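
For example, one way to give it a real-time scheduling class (a sketch; chrt comes from util-linux or the BusyBox applet, and the priority value 10 is arbitrary):

# Start irqbalance under the SCHED_FIFO real-time policy
chrt -f 10 irqbalance
# Or change the policy of an already-running instance
chrt -f -p 10 "$(pidof irqbalance)"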

2 Likes

Plus, on some devices, important parts of the hardware (the WiFi and Ethernet subsystems, for example) use an IRQ that cannot migrate and must always be served by the same CPU. In those cases irqbalance cannot improve performance.
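
To check whether that applies on a given device, look at which CPU serves each interrupt and try to change its affinity mask (a sketch; IRQ 27 is just a placeholder number):

# Per-CPU interrupt counts: a column that only ever grows on CPU0 for the
# Ethernet/WiFi IRQ is a hint that the IRQ is pinned in hardware.
cat /proc/interrupts
# Current CPU affinity mask for one IRQ (27 is a placeholder number)
cat /proc/irq/27/smp_affinity
# Try moving it to CPU1; on hardware with fixed IRQ routing this write fails
echo 2 > /proc/irq/27/smp_affinity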

1 Like

Test results for an inexpensive MT7628N-based device from a reputable manufacturer have been added.

Note that there have been significant changes on master since the earlier results, including compiler changes.

2 Likes

I assume WireGuard is not multithreaded yet. It would be interesting to get results for mt7621; I assume those SMT cores would be quite worthless.

1 Like

I am seeing around 120 Mbit/s via WireGuard on my mt7621 device. Pretty decent, I'd say.

Edit: Whoops, it was 120 Mbit/s, not 120 MB/s.
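
For reference, numbers like this are typically measured with iperf3 run across the tunnel (a sketch, not necessarily how the figure above was obtained; 10.0.0.2 is a placeholder address for a host behind the WireGuard peer):

# On the remote end of the tunnel: iperf3 -s
# From a client behind the router, through the WireGuard tunnel:
iperf3 -c 10.0.0.2 -t 60        # upload direction, 60-second run
iperf3 -c 10.0.0.2 -t 60 -R     # reverse (download) direction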

3 Likes

Are there any cheap but functional MT7621 devices out there?

I'd like to test one, but they don't really have any other use for me, so it's hard to justify the spend. I've got enough spare routers on the bench already.

Disclaimer: I don't own any mt7621 devices myself

(Until recently, the Xiaomi Mi Router 3G v1 would have been part of the list, but that's no longer built and has been silently replaced by the v2, which requires flashing the SPI-NOR chip externally and has dropped the USB ports.)

1 Like

The Netgear r6220 is a single-core device and will not be a good representation of what mt7621 devices are capable of.

3 Likes

Ah, the dreaded difference between the mt7621st and the mt7621a; these seem to be equipped with the mt7621a instead:

  • Netgear r6260
  • Netgear r6350
  • Netgear r6850
3 Likes

The Ubiquiti EdgeRouter X seems to be liked by some.

It has PoE but no WiFi.

1 Like

Good suggestion. It's also pretty cheap since it lacks WiFi. A perfect benchmark candidate 🙂

Hi guys, I'm a newbie here. Are there any other cheap devices that get close to 20 Mbit/s with OpenVPN?

Welcome! Please create a new thread regarding your specific needs. 🎅

1 Like