RPi4 routing performance numbers

I got myself an RPi 4, since I've been recommending it to people, and ran an iperf3 routing test over IPv6 in a test network, with the following results:

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  50.0 MBytes   419 Mbits/sec                  
[  5]   1.00-2.00   sec  60.6 MBytes   508 Mbits/sec                  
[  5]   2.00-3.00   sec  60.4 MBytes   506 Mbits/sec                  
[  5]   3.00-4.00   sec  60.6 MBytes   509 Mbits/sec                  
[  5]   4.00-5.00   sec  61.3 MBytes   514 Mbits/sec                  
[  5]   5.00-6.00   sec  61.2 MBytes   514 Mbits/sec                  
[  5]   6.00-7.00   sec  61.2 MBytes   513 Mbits/sec                  

The setup was as follows:

An AmazonBasics USB 3 Ethernet adapter with an ASIX chipset:

Jan 27 21:34:30 pitest1 kernel: [  827.375819] usb 2-1: new SuperSpeed Gen 1 USB device number 2 using xhci_hcd
Jan 27 21:34:31 pitest1 kernel: [  827.412265] usb 2-1: New USB device found, idVendor=0b95, idProduct=1790, bcdDevice= 1.00
Jan 27 21:34:31 pitest1 kernel: [  827.412273] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Jan 27 21:34:31 pitest1 kernel: [  827.412278] usb 2-1: Product: AX88179
Jan 27 21:34:31 pitest1 kernel: [  827.412283] usb 2-1: Manufacturer: ASIX Elec. Corp.

It was attached to a laptop, with the Pi's built-in Ethernet attached to a switch connected to my main network.

I configured IPv6 only, to avoid complexity, installed dnsmasq on the Pi to provide router advertisements, and ran the test between a beefy server on my regular network and my laptop. The Pi was running Raspbian, with nftables loaded but only three empty default tables, so this is pure routing: no firewall rules and no queueing/SQM.
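
The dnsmasq side takes very little; roughly something like this (file path and interface name are illustrative, not my exact config):

# /etc/dnsmasq.d/ra.conf -- rough sketch; eth1 faces the test network
interface=eth1
enable-ra
# advertise the prefix already assigned to eth1, RA-only (no DHCPv6 leases)
dhcp-range=::,constructor:eth1,ra-only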

Installing irqbalance on the Pi got me from about 400 Mbps up to the 500+ Mbps shown above.
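
On Raspbian that's just the stock package, something like:

sudo apt install irqbalance
sudo systemctl enable --now irqbalance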

During this run, overall CPU idle was about 80 to 85%, whereas 75% idle would mean one core fully saturated. The kernel softirq thread was taking something like 50-60% of one core.


I don't know what's keeping it from hitting full speed. When I plugged the ASIX USB device directly into my laptop and went laptop -> server directly, I got 900+ Mbps, so the USB device itself can handle near line speed.

It seems like something is slow about transferring packets across USB3 on the Pi, but if that could be debugged, it could well be possible to route and SQM a full gigabit, given that about 80% of the available cycles are unused... Of course this assumes we can split the load between different CPUs, e.g. queueing for upload and download running on separate cores, and/or separate cores per network interface.

Any thoughts on what might be causing the slowdown?


OK, to follow up: I figured out how to enable Receive Packet Steering (RPS), and enabled it on my Ethernet devices:

echo 2 > /sys/class/net/eth1/queues/rx-0/rps_cpus
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus

so that CPU1 handles received packets on eth1 (the USB NIC), and CPU0 handles received packets on eth0 (the built-in NIC).
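
For anyone following along, these values are hex bitmasks over CPUs, with bit N selecting CPU N:

# 1 = CPU0, 2 = CPU1, 4 = CPU2, 8 = CPU3; c = CPUs 2+3, f = all four
cat /sys/class/net/eth0/queues/rx-0/rps_cpus   # inspect the current mask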

The result was 65% CPU idle, with ksoftirqd/2 using 100% of one core (odd, given the masks point at CPU0 and CPU1) and 35.5% softirq shown in top -d 1.

Bandwidth is now 830 Mbps or so:

[  5]   5.00-6.00   sec  99.2 MBytes   832 Mbits/sec                  
[  5]   6.00-7.00   sec  99.0 MBytes   830 Mbits/sec                  
[  5]   7.00-8.00   sec  99.0 MBytes   830 Mbits/sec                  
[  5]   8.00-9.00   sec  99.0 MBytes   830 Mbits/sec                  

So that feels like progress...

I bumped this to 860 or so by adjusting the interrupt affinity for eth0, and briefly saw about 920 at one point... so the potential to route at line speed is there. At no point does the device go below about 50% idle, so there are basically two cores unused even while routing almost a full gigabit.
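
Something along these lines (exact values aside; the IRQ numbers 47/48 for eth0 come from the /proc/interrupts listing below, and can change between boots):

# steer eth0's interrupts to CPU0 (mask 1)
echo 1 > /proc/irq/47/smp_affinity
echo 1 > /proc/irq/48/smp_affinity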

EDIT: further info.

I installed the simple nftables firewall from my other thread "QoS and nftables ... some findings to share".

I also put in a custom HFSC-based shaper on both eth0 and eth1, one that I have used before. With those in place, the Pi will route 575 Mbps, showing 67% idle and 34% softirq.
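
Not my exact script, but the skeleton of such a shaper looks roughly like this (the rates and the overhead value are placeholders):

# rough sketch: HFSC root with one rate-limited leaf and fq_codel under it
tc qdisc replace dev eth0 root handle 1: stab overhead 38 linklayer ethernet hfsc default 1
tc class add dev eth0 parent 1: classid 1:1 hfsc ls m2 575mbit ul m2 575mbit
tc qdisc add dev eth0 parent 1:1 fq_codel
# repeat on eth1 to shape the other direction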

The big issue seems to be the inability to multi-thread the packet handling. I would have thought the two Ethernet devices would split across two CPUs and we'd see higher throughput, but apparently not.

Here is the /proc/interrupts output:

           CPU0       CPU1       CPU2       CPU3       
 17:          0          0          0          0     GICv2  29 Level     arch_timer
 18:      87012    5643314      31811      46223     GICv2  30 Level     arch_timer
 23:        339          0          0          0     GICv2 114 Level     DMA IRQ
 31:       3560          0          0          0     GICv2  65 Level     fe00b880.mailbox
 34:       6554          0          0          0     GICv2 153 Level     uart-pl011
 37:          0          0          0          0     GICv2  72 Level     dwc_otg, dwc_otg_pcd, dwc_otg_hcd:usb3
 38:          0          0          0          0     GICv2 169 Level     brcmstb_thermal
 39:      25015          0          0          0     GICv2 158 Level     mmc1, mmc0
 45:          0          0          0          0     GICv2 106 Level     v3d
 47:    4251576          0          0          0     GICv2 189 Level     eth0
 48:   20076230          0          0          0     GICv2 190 Level     eth0
 54:         51          0          0          0     GICv2  66 Level     VCHIQ doorbell
 55:          0          0          0          0     GICv2 175 Level     PCIe PME, aerdrv
 56:    8839918          0          0          0  Brcm_MSI 524288 Edge      xhci_hcd
FIQ:              usb_fiq
IPI0:          0          0          0          0  CPU wakeup interrupts
IPI1:          0          0          0          0  Timer broadcast interrupts
IPI2:       9669      13473     199712      95848  Rescheduling interrupts
IPI3:        474       2470       1309       1222  Function call interrupts
IPI4:          0          0          0          0  CPU stop interrupts
IPI5:      27560       1708       1753       2936  IRQ work interrupts
IPI6:          0          0          0          0  completion interrupts
Err:          0


I wonder if someone like @moeller0 has an idea or knows someone who has an idea about how to improve interrupt handling and speed up shaping etc.


I actually wonder whether RPS/XPS, the way you do it and the way it is officially recommended, is actually a good solution for dual-core SoCs (including those with SMT that present themselves as quad-cores to the OS). At least for some dual-core mvebu SoCs, setting RPS to 3 (both CPUs) resulted in better performance... I believe this is because with RPS both shaper instances will probably run on the non-interrupt-processing CPU, and since shaping is even more computationally demanding than NIC interrupt servicing, this seems a sub-optimal arrangement when shaping is involved.
Since the RPi4 is a real quad-core, the above probably does not apply.
But note that the settings 1 and 2 you use for RPS seem sub-optimal as well, as they push RPS for each NIC onto a single core each, which might not be ideal for bidirectional traffic shaping. Have a look at OpenWrt's /etc/hotplug.d/net/20_smp_tune (this is from memory so might not be exactly correct, but you will find it) for how to set XPS for all CPUs but the interrupt-processing one(s).
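
The gist is something like this (a rough sketch from memory, not the actual script; the mask assumes a quad-core with the NIC's IRQs on CPU0):

# spread RPS/XPS over CPUs 1-3 (mask 0xe), leaving CPU0 for the IRQs
for q in /sys/class/net/eth0/queues/rx-*/rps_cpus \
         /sys/class/net/eth0/queues/tx-*/xps_cpus; do
    echo e > "$q"
done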

Will take a look. It was interesting fiddling around with the RPS settings: I tried 3 and c to give each NIC two CPUs, but as I remember that worked worse, though I'll try it again. One issue is that I'm trying to route just one iperf3 stream, so I think it hashes the stream and always sends it to the same processor anyway. With several streams, like a dslreports speed test, you'd probably get different results. But I'm trying to route a gigabit, so getting the WAN involved is not optimal.

I tried adjusting XPS and RPS on both NICs. XPS doesn't seem to be supported on the USB NIC, but RPS is... Fiddling around with these didn't seem to make any major difference so long as I had them sufficiently enabled. I'm currently echoing c to all of them, so CPUs 2 and 3 are available.

The issue, I think, is that I'm routing a single stream, which means it all gets hashed to the same place, so there is just one softirq thread involved and it maxes out at 100%.

The good news is that doing that produces 580+ Mbps sustained: routed, firewalled, and HFSC-shaped! Hey, that's not bad.

I do suspect that if I had multiple streams going, they would spread over multiple CPUs and I'd be able to saturate the LAN interfaces (see the sketch below). It also might help to use a bond between the two NICs rather than one for "wan" and one for "lan".
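
For reference, a quick way to test that theory would be iperf3's parallel-stream option (server name hypothetical):

iperf3 -c SERVER -P 4 -t 30   # four parallel TCP streams, hashed separately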

In my opinion the results show that the RPi is a capable router when combined with a smart switch. There is probably a better USB NIC than the AmazonBasics one I got. Maybe I'll try a TP-Link, which uses a Realtek chipset.

If I were looking for an inexpensive router for anything less than 500 Mbps, I would choose the RPi 4 and a TP-Link SG108E smart switch. Total cost is less than $150, and even if all you have today is, say, 10 Mbps, it will scale to almost anything you can get... probably a gigabit in aggregate, though a single stream does seem to be limited to around 500 Mbps reliably with shaping, maybe 800 without.

Interestingly, I went to my main router and ran a speed test while watching netdata. It uses far more CPU during the download phase of a speed test than during the upload phase. This appears to be due to a dramatically higher interrupt rate (softirq rates of about 115k/s during the download phase versus about 47k/s during the upload phase).

I don't really understand this: every packet my PC sends has to be received by the router and then sent on by the router, and every packet my PC receives likewise has to be received and sent by the router... so it's not as if the router receives fewer packets during upload than during download. And it's not the shaper, because it's actually hardware interrupts that are higher during the download phase (and disabling the shaper had little effect)...

What would make the download phase of a speed test produce dramatically more interrupts? (Tagging some people who might know: @moeller0, who seems to be the latency expert, and @jeff, who did a bunch of benchmarking recently; feel free to tag others who know the nitty-gritty of this stuff.) If I could figure this out, I might be able to tune both my main router and the Pi to keep the interrupt rate down and maybe get a bunch more performance out of either of them.
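
For anyone wanting to watch the same thing without netdata: the cumulative softirq counters live in /proc/softirqs, and the per-second deltas during each phase are what matter:

watch -n1 'grep -E "NET_RX|NET_TX" /proc/softirqs'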

Network topology for main router:

ISP Device -> Smart Switch <-bonded interface w vlans-> Router 
                   |
           More switches (but how do you *know* she is a switch?) ---> PC

EDIT: Further data....

Adding a second iperf3 instance on my main router, so I can send from my laptop to two separate servers, and setting the HFSC scheduler to rate-limit at 940 Mbps, the overall rate goes to 433*2 = 866 Mbps with shaping, and CPU usage drops to about 55% idle. Strangely, ksoftirqd uses about 45% CPU under this load, whereas it uses about 100% CPU when trying to route a single very fast stream (which it does at around 720 Mbps). It seems that with two separate streams things calm down, whether by spreading the load among different CPUs or by chunking the qdisc scheduling into fewer interrupts or something. In any case, I think it's safe to say that under realistic routing loads the Pi can handle 500 Mbps with shaping without a problem, and with tuning (possibly the default tuning under OpenWrt) could handle 700+ Mbps single-stream and at least 850 Mbps aggregated across multiple streams.

All this while using about 50% of its processor power... So you could run a squid proxy, or a NAS based on a USB3 spinning disk, or some kind of network monitoring with the remaining CPU.
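
For reference, the two-stream test was nothing fancier than two simultaneous iperf3 clients (server names hypothetical):

iperf3 -c SERVER_A -t 30 &
iperf3 -c SERVER_B -t 30 &
wait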


More numbers: after setting my network on fire, burning it to the ground, and rebuilding it yesterday (it wasn't a good day), I am back with a new TP-Link USB 3 adapter (the UE300, $10 on Amazon) for the RPi, and some new numbers:

iperf3 from my laptop, connected via the TP-Link USB Ethernet adapter, through the RPi to my desktop:

[  5]  39.00-40.00  sec   110 MBytes   922 Mbits/sec                  
[  5]  40.00-41.00  sec   110 MBytes   922 Mbits/sec                  
[  5]  41.00-42.00  sec   110 MBytes   923 Mbits/sec                  
[  5]  42.00-43.00  sec   110 MBytes   919 Mbits/sec                  
[  5]  43.00-44.00  sec   110 MBytes   923 Mbits/sec                  
[  5]  44.00-45.00  sec   110 MBytes   924 Mbits/sec                  
[  5]  45.00-46.00  sec   110 MBytes   923 Mbits/sec                  
[  5]  46.00-47.00  sec   110 MBytes   919 Mbits/sec                  
[  5]  47.00-48.00  sec   110 MBytes   921 Mbits/sec                  
[  5]  48.00-49.00  sec   102 MBytes   859 Mbits/sec                  

This is with HFSC shaping!! And I am watching it do the shaping using

watch tc -s qdisc show dev eth0

And it really is shaping a gigabit...

So how much CPU is that going to require?

top - 08:07:05 up 2 days, 13:15,  1 user,  load average: 0.10, 0.08, 0.02
Tasks: 116 total,   1 running, 115 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 97.5 id,  0.0 wa,  0.0 hi,  2.3 si,  0.0 st
MiB Mem :   3906.0 total,   3259.2 free,    120.6 used,    526.2 buff/cache
MiB Swap:    100.0 total,    100.0 free,      0.0 used.   3618.5 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                               
    9 root      20   0       0      0      0 S   1.2   0.0   2:24.87 ksoftirqd/0                           
  334 root      20   0   27656     80      0 S   0.6   0.0   0:28.83 rngd                                  

I am rather flabbergasted...
@moeller0 how is this possible?

EDIT: I set the shaper to 900 Mbps and got this:

[  5]  16.00-17.00  sec   106 MBytes   889 Mbits/sec                  
[  5]  17.00-18.00  sec   106 MBytes   890 Mbits/sec                  
[  5]  18.00-19.00  sec   106 MBytes   889 Mbits/sec                  
[  5]  19.00-20.00  sec   106 MBytes   889 Mbits/sec                  
[  5]  20.00-21.00  sec   106 MBytes   889 Mbits/sec                  
[  5]  21.00-22.00  sec   106 MBytes   889 Mbits/sec                  
[  5]  22.00-23.00  sec   106 MBytes   889 Mbits/sec                  
[  5]  23.00-24.00  sec   106 MBytes   889 Mbits/sec                  
[  5]  24.00-25.00  sec   106 MBytes   889 Mbits/sec                  
[  5]  25.00-26.00  sec   106 MBytes   889 Mbits/sec                  

Rock-solid speed, same CPU usage.


Sad to hear that, hope the issue is solved and today is a better day!

Puzzled. But could you press "1" in top to get per-CPU output? (That should work in the real top; it will not work in the busybox top used by OpenWrt.)

Question: how did you configure the shaper, and are you using the stab option to account for the >= 38 bytes of overhead on Ethernet (42 bytes if a VLAN is used)?


My guess: not all USB-to-Ethernet adapters are created equal, and neither are their drivers.

It appears that the Realtek chipset is either much better designed than the ASIX one, has better drivers, or both.


Here you go with the per-CPU output:

top - 08:14:28 up 2 days, 13:22,  1 user,  load average: 0.00, 0.02, 0.00
Tasks: 115 total,   1 running, 114 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 91.9 id,  0.0 wa,  0.0 hi,  8.1 si,  0.0 st
%Cpu1  :  0.0 us,  0.4 sy,  0.0 ni, 95.6 id,  0.0 wa,  0.0 hi,  4.1 si,  0.0 st
%Cpu2  :  0.0 us,  0.2 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
MiB Mem :   3906.0 total,   3258.0 free,    121.4 used,    526.6 buff/cache
MiB Swap:    100.0 total,    100.0 free,      0.0 used.   3617.9 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                               
  334 root      20   0   27656     80      0 S   0.4   0.0   0:29.86 rngd                                  
15681 pi        20   0   10312   2936   2432 R   0.4   0.1   0:00.07 top                                   
    9 root      20   0       0      0      0 S   0.2   0.0   2:27.82 ksoftirqd/0                           

@Entropy512
Yes, no doubt the TP-Link USB adapter is way better, so that's one part of it... but I'd have expected routing and shaping a gigabit to require more CPU than about 10% of one core and 5% of another, or whatever (especially considering that my x86 J1900 uses about 90% of each of three cores to do a similar job).

EDIT: I ran the same iperf3 test and simultaneously ran a fast ping (ping -i 0.2, one packet every 200 ms) and consistently got latency between 1 ms and 2 ms.

Edit2: Also @moeller0, I am using the stab option to account for 35 bytes of overhead; not quite right, but it basically makes no difference unless I start flooding the net with a gigabit of packets smaller than ~200-300 bytes.

Edit3: The Pi pinging my desktop takes about 0.250 ms, whereas pinging the laptop on the other side of the Pi, connected via the TP-Link, takes about 0.950 ms. So it looks like the USB device adds on the order of 0.5 ms of latency. I'm guessing there are decent buffers, plus some kind of DMA, letting it get away with an order of magnitude fewer interrupts than the ASIX-based version. That's fabulous: half a millisecond of extra delay is totally worth it to reach these kinds of numbers at such a low price.

I am now SERIOUSLY considering replacing my entire J1900 box with two RPis... one as router, shaper, and proxy, the other as a fileserver built on a USB3 RAID enclosure. The power savings and so forth are probably significant on their own, not to mention it'll last a lot longer on the UPS when the power goes out.


Sorry to ask, as I don't know if you mentioned it, but are you running these tests with Raspbian or OpenWrt loaded on the RPi4?

Thanks

Hi, yes, I did mention it in passing, but I'm happy to re-mention it for clarity.

I am using Raspbian for these tests. I downloaded the minimal server version of Raspbian, installed some basic packages, and I'm running an nftables firewall on the Pi during the tests.

I would not expect major differences with OpenWrt; in fact, it might be a little better tuned out of the box.


Pity that the built-in Ethernet is so much worse. Do you think there is any room for improvement there? Having an external USB Ethernet adapter is less than ideal.

Sorry if I wasn't clear: the built-in Ethernet is fine. It's the cheap AmazonBasics USB adapter I was using in the previous tests that's garbage. I replaced it with a better USB adapter, and now :star_struck:

If you want to do a full gigabit, you need two adapters. If you're OK with less than 500 Mbps, you can use one adapter and VLANs with a smart switch, as sketched below.
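
The VLAN option is plain router-on-a-stick; a rough sketch (VLAN IDs and interface name are assumptions):

# one physical NIC, wan and lan as VLANs trunked to a smart switch
ip link add link eth0 name eth0.10 type vlan id 10   # "wan"
ip link add link eth0 name eth0.20 type vlan id 20   # "lan"
ip link set eth0.10 up
ip link set eth0.20 up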

But at $10 for a decent USB 3 gigabit Ethernet adapter, please, please think twice before opting for the VLAN construct :wink:

Sure, but if you're routing a 100 or 200 Mbps VDSL or DOCSIS connection, the VLAN construct is fine and avoids a dongle hanging off the shelf and pulling the router onto the floor. Another option is to stick with the USB adapter and add some Velcro tape :wink:

1 Like

It should do 1 Gbit/s full duplex, so it will be fine for up to 1 Gbit/s connections.

Meh, if you don't need it, why bother? :smiley: I have a 500/500 Mbit connection at home, so a VLAN construction won't bottleneck it at all.


+1; or, I dare say it, duct tape! :wink:

If one person uploads a file to Google Drive while another person downloads a file from Google Drive, each will probably get, say, 400 Mbps. With a second Ethernet port they could each get 1 Gbps.

(This is, however, a rather theoretical situation... in practice, with 500/500 and regular home usage, you are unlikely to have a big problem.)

Future-proofing, basically, and peace of mind: one less bespoke piece of configuration to deal with. But hey, I am old school and always prefer separate wires over VLANs (partly because the first managed switches I worked with and I never became friends :wink: and I will bear a grudge for a long time ;)).
Also, you can save the cost of the managed switch, which probably comes in slightly above the ~15 EUR for the UE300.

@dlakelan, what does Linux report about the innards of that TP-Link? (They have a combo gigabit Ethernet adapter with a USB 3 hub for 16 EUR which I am tempted to try, especially if it uses the same Ethernet chip.)

That would only be an issue if you had a 1/1 Gbit connection. I have two things that make it a non-issue, even under the worst possible load:

  1. "Just" 500/500 Mbit/s. This is low enough for it to never be an issue.
  2. The connection doesn't seem to be full duplex, so that cuts the required bandwidth in half as well. I.e., upload + download cannot total more than 500 Mbit. I blame my ISP :frowning: