Terrible results using RPi4 + SQM to control bufferbloat... can anyone else with this setup comment?

My ISP provides 1 gigabit downstream. Using an RPi4 to run SQM, I find that with either cake/piece_of_cake, fq_codel/simplest_tbf, or fq_codel/simplest (750000 down and 22000 up), the best downstream I can achieve is around 565 Mbps. Watching htop on the RPi4, I see core0 consistently loaded >99%. I have read that the RPi4 can shape a symmetric gigabit connection, so I am wondering why mine is falling short.
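For reference, here is roughly what my /etc/config/sqm looks like for the cake/piece_of_cake case (the section name and interface are placeholders; see the EDIT below about which interface SQM was actually attached to):

config queue 'wan'
        option enabled '1'
        option interface 'eth1'
        option download '750000'
        option upload '22000'
        option qdisc 'cake'
        option script 'piece_of_cake.qos'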

Can someone with a fast connection and a RPi4 confirm the claim?

I am using DSA to run two VLANs on this RPi4, so perhaps that is limiting me, assuming others with the same hardware and connection can confirm better downstream speeds.

EDIT: For this entire thread, with the exception of the final post, I believe I had the SQM target on a bridged interface rather than on the physical one. Putting it on the physical one gave the lowest CPU usage, so keep that in mind when reading everything else in the thread.
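Concretely, the fix was just repointing the SQM instance at the physical device (the section reference is whatever yours is called in /etc/config/sqm; mine had been on the bridge, e.g. br-lan):

# uci set sqm.@queue[0].interface='eth1'
# uci commit sqm
# /etc/init.d/sqm restart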

I had irqbalance installed and active when I tested, and packet steering enabled. I can run the test again with them disabled and see if it makes a difference. The key point, though, is that SQM is single-threaded.
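For anyone wanting to repeat that test: irqbalance can simply be stopped, and OpenWrt's packet steering works by writing an RPS CPU mask per receive queue, so it can be checked and cleared per interface (eth0 here is just an example):

# /etc/init.d/irqbalance stop
# cat /sys/class/net/eth0/queues/rx-0/rps_cpus        (show current mask)
# echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus   (0 = steering off)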

Are you using a USB Ethernet adapter? If so, which one, and what driver?


Hi, I can confirm it works. Connecting my computer directly to the RPi4, I can max out my gigabit downstream and my upload bandwidth. My ISP's provided connection is 1000/50.


SQM does not work well on connections with such a big difference between download and upload speeds.

Actually, it does. No probs here.

I get slightly slower results with packet steering disabled: CPU saturation on one core and throughput around 470-500 Mbps.

It seems that ksoftirqd/0 is the process saturating during the download phase.

Yes, it uses kmod-usb-net-asix-ax88179. I did a test with iperf3 (no SQM) and found that this adapter is as fast as the internal NIC on the RPi4, so I do not think it is to blame.
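The iperf3 test was the usual client/server pair, roughly this (the address is an example):

# iperf3 -s                          (on the RPi4)
# iperf3 -c 192.168.1.1 -t 30        (from a LAN client, client-to-Pi)
# iperf3 -c 192.168.1.1 -t 30 -R     (reverse mode, Pi-to-client)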

The bufferbloat test does download first, then upload. The key thing I am seeing is single-core saturation, which I believe is limiting the speed achieved.

Can you share your SQM settings with me? How about your network setup? Are you using VLANs? Thanks!

Well, look at the load on the CPU during such a test... Regarding htop:
"Pressing F1 in htop, shows the color legend for the CPU bars, and F2 Setup → Display options → Detailed CPU time (System/IO-Wait/Hard-IRQ/Soft-IRQ/Steal/Guest), allows to enable the display of the important (for network loads) Soft-IRQ category."

I would not be amazed if the USB dongle by itself causes a noticeable CPU load, and adding SQM then drives the CPU into overload...

I could switch it, currently eth0 (onboard NIC) is for the LAN and eth1 (USB) is for the incoming WAN connection.

Running as-is without SQM enabled gives me 908-940 Mbps downstream (with CPU saturation), albeit with some significant bufferbloat.

Can you share the output of cat /proc/interrupts? Only a single CPU core is used, which is why you see the slow speed.

I believe SQM is only single-threaded. Here:

# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
 11:     398640     570957     358560     607649     GICv2  30 Level     arch_timer
 18:       7440          0          0          0     GICv2  65 Level     fe00b880.mailbox
 21:          2          0          0          0     GICv2 153 Level     uart-pl011
 22:          0          0          0          0     GICv2 112 Level     bcm2708_fb DMA
 24:         45          0          0          0     GICv2 114 Level     DMA IRQ
 31:          1          0          0          0     GICv2  66 Level     VCHIQ doorbell
 32:       9884          0          0          0     GICv2 158 Level     mmc1, mmc0
 38:        558    2936572          0          0     GICv2 189 Level     eth0
 39:        271          0          0    3737227     GICv2 190 Level     eth0
 45:          0          0          0          0     GICv2 175 Level     PCIe PME, aerdrv
 46:    3296892          0          0          0  BRCM STB PCIe MSI 524288 Edge      xhci_hcd
IPI0:      1852       1886       2155       1952       Rescheduling interrupts
IPI1:    131283     227722     449323     501978       Function call interrupts
IPI2:         0          0          0          0       CPU stop interrupts
IPI3:         0          0          0          0       CPU stop (for crash dump) interrupts
IPI4:         0          0          0          0       Timer broadcast interrupts
IPI5:      9299       4887       2029       2983       IRQ work interrupts
IPI6:         0          0          0          0       CPU wake-up interrupts
Err:          0
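Reading that output: eth0's two IRQs land mostly on CPU1 and CPU3, while the xhci interrupt (the USB NIC) sits entirely on CPU0, the same core that saturates. A rough experiment, if anyone thinks it's worth trying (the value is a hex CPU bitmask, so 2 = CPU1; irqbalance may rewrite it, and I have no idea yet whether it actually helps):

# echo 2 > /proc/irq/46/smp_affinity    (IRQ 46 = xhci_hcd, from the table above)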

Maybe, but I can equally load both CPU cores on my WRT3200ACM under load with SQM on a fiber link. What version of OpenWrt are you running?

Snapshot from 2 days ago.

OK, did you try using eth0 for WAN? Alternatively, you can move SQM to the LAN interface and swap the download/upload limits (unless the line is symmetrical).
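Something along these lines, with the rates swapped because SQM's download option shapes ingress on whichever interface it is attached to (the section reference is an assumption about your config):

# uci set sqm.@queue[0].interface='eth0'
# uci set sqm.@queue[0].download='22000'     (LAN ingress = traffic heading upstream)
# uci set sqm.@queue[0].upload='750000'      (LAN egress = traffic heading to the hosts)
# uci commit sqm && /etc/init.d/sqm restart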

Yes, I just finished swapping. It's about 8% faster, but still about half of what it should be.

Original setup:
WAN = eth1 (USB)
LAN = eth0 (internal)
Average download with SQM = 568 Mbps (580, 561, 562; ran 3x)
Average download without SQM = 907 Mbps (908, 906; ran 2x)

Reversed setup:
WAN = eth0 (internal)
LAN = eth1 (USB)
Average download with SQM = 615 Mbps (622, 617, 605; ran 3x)
Average download without SQM = 776 Mbps (797, 754; ran 2x)

It might be because I am using VLANs via DSA on this setup. I have to, in order to maintain my two networks (main and guest) on the dumb AP.
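For reference, the DSA-style bridge VLAN setup in /etc/config/network is roughly this shape (the VLAN IDs and port flags are simplified examples, not my exact config):

config bridge-vlan
        option device 'br-lan'
        option vlan '1'
        list ports 'eth0:u*'

config bridge-vlan
        option device 'br-lan'
        option vlan '2'
        list ports 'eth0:t'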

What is the CPU usage in both cases?

Someone over at the RPi4 launch thread had a similar issue with an ASIX-based dongle. They then tried an RTL8153-based dongle and could do gigabit speeds in both directions with CPU cycles to spare.

The TP-Link UE300 is RTL8153-based (it uses kmod-usb-net-rtl8152); see if you can get that one.
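If you pick one up, the driver is in the package mentioned above:

# opkg update
# opkg install kmod-usb-net-rtl8152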


Same... one core maxes out >99%.

I will look for one, thanks for the suggestion. Can you link the thread you referenced?

Sure.

You'll have to search around a bit. I'd linked to something related in the OP there; you can start from that.

See also my RPi4 performance thread; this is exactly the issue. The ASIX driver seems to generate a LOT more interrupts than the Realtek one, which probably aggregates them.

RPi4 routing performance numbers
