Terrible results using RPi4 + SQM to control bufferbloat... anyone else with this setup care to comment?

Yes, I just finished swapping. It's about 8% faster, but still about half of what it should be.

Original setup:
WAN = eth1 (USB)
LAN = eth0 (internal)
Average download with SQM = 568 Mbps (580, 561, 562; 3 runs)
Average download without SQM = 907 Mbps (908, 906; 2 runs)

Reversed setup:
WAN = eth0 (internal)
LAN = eth1 (USB)
Average download with SQM = 615 Mbps (622, 617, 605; 3 runs)
Average download without SQM = 776 Mbps (797, 754; 2 runs)

It might be due to the fact that I am using VLANs via DSA on this setup; I have to, in order to maintain my two networks (main and guest) on the dumb AP.
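
Roughly, that part of my /etc/config/network looks like this (a minimal sketch; the bridge name and VLAN IDs here are illustrative, not my exact config):

config bridge-vlan
	option device 'br-lan'
	option vlan '1'
	list ports 'eth0:u*'

config bridge-vlan
	option device 'br-lan'
	option vlan '2'
	list ports 'eth0:t'

VLAN 1 is the untagged main network; VLAN 2 is carried tagged to the dumb AP for the guest SSID.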

What is the CPU usage in both cases?

Someone over at the RPi4 launch thread had a similar issue with an ASIX-based dongle. They then tried an rtl8153-based dongle and could do gigabit speeds in both directions with CPU cycles to spare.

The TP-Link UE300 is rtl8153-based (it uses kmod-usb-net-rtl8152); see if you can get that one.
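
If you want to confirm which driver a dongle is bound to before and after swapping, something like this works (a quick sketch; the interface name will differ per setup):

# show the kernel driver behind the USB NIC
ls -l /sys/class/net/eth1/device/driver

# install the Realtek driver if it is not already present
opkg update && opkg install kmod-usb-net-rtl8152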

Same in both cases... one core maxes out at >99%.

I will look for one, thanks for the suggestion. Can you link the thread you referenced?

Sure.

You'll have to search around a bit. I'd linked to something related in the OP there; you can start from that.

See also my RPi4 performance thread; this is exactly the issue. The ASIX driver seems to generate a LOT more interrupts than the Realtek driver, which probably coalesces them.

RPi4 routing performance numbers

Sure thing. Disclaimer: I'm using @ldir's experimental Netify branch of SQM. All my devices are connected over WiFi, and DSCP marking is aligned as closely as possible with WMM; diffserv4 is used in my script.

My eth1 is a UE300 USB NIC connected to my WAN (1000/50 Mbps); eth0 is the internal NIC, connected to my NanoHD.

Packet steering and irqbalance are active, no VLANs are defined, and the RPi4 is overclocked to 2 GHz.
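
For anyone who wants to check the same things on their own box, a rough sketch (the packet_steering option assumes a recent build that has the global network setting):

# confirm irqbalance is running
ps w | grep '[i]rqbalance'

# see how interrupts are spread across cores
cat /proc/interrupts

# packet steering toggle (may be unset on older builds)
uci get network.globals.packet_steering

My SQM config follows: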

config queue 'eth1'
	option debug_logging '0'
	option verbosity '5'
	option qdisc 'cake'
	option linklayer 'ethernet'
	option overhead '44'
	option qdisc_advanced '1'
	option qdisc_really_really_advanced '1'
	option squash_dscp '0'
	option squash_ingress '0'
	option ingress_ecn 'ECN'
	option egress_ecn 'ECN'
	option interface 'eth0'
	option upload '46000'
	option enabled '1'
	option download '950000'
	option iqdisc_opts 'nat ingress'
	option eqdisc_opts 'nat ack-filter docsis'
	option script 'ctinfo_4layercake.qos'
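
After editing that, I just reload SQM and check that cake actually got attached (standard sqm-scripts usage, shown here as a quick sketch):

/etc/init.d/sqm restart

# egress qdisc on the shaped interface, ingress on its companion ifb
tc -s qdisc show dev eth0
tc -s qdisc show dev ifb4eth0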

A bit of your reply is over my head. I'm not sure what the Netify branch is (GitHub link?), nor about DSCP or diffserv4.

FYI, https://github.com/ldir-EDB0/sqm-scripts/tree/sqmqosnfa

What I meant is that I'm not simply using piece_of_cake; that's why I included the "disclaimer". I apologise for making it more confusing; I'll go back to my lockdown cave now.

Ah thanks. Are those within the sqm-scripts-extra package or something you add manually? What does it offer vs cake/piece_of_cake to the average user wanting to control upstream bufferbloat? You mentioned something about alignment with WMM and some other stuff.

Nope, sqm-scripts-extra is essentially abandoned, sorry. Not enough stuff to put in there.

I didn't know about sqm-scripts-extra. This offers you the flexibility to tag packets into specific priority buckets: Background, Best Effort, Video and Voice. In my household, with home schooling and working from home due to constant COVID lockdowns, this is very relevant; it's not easy to manage people gaming, streaming and video conferencing in parallel, and this work from @ldir makes it easy. Netify helps identify the different types of streams on the fly. It works well enough, though not perfectly.

When I referred to WMM, I meant that I'm tagging packets so that DSCP aligns with WMM, so this works as well as possible over WiFi too.
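
As a concrete example (a sketch, not my actual rules; the port is purely illustrative): tagging a voice stream as EF drops it into cake's Voice tin under diffserv4, and the WiFi stack maps that same DSCP into one of the higher-priority WMM access categories:

# mark outbound VoIP traffic (illustrative port) as EF / DSCP 46
iptables -t mangle -A FORWARD -p udp --dport 5060 -j DSCP --set-dscp-class EF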

The packet tagging, WMM and such are all very interesting, but they aren't really factors in how much CPU is used. The issue is the ASIX dongle, because its driver seems to cause a lot more interrupt handling. A simple fix is to replace it with the TP-Link UE300 for ~$15; it cuts interrupt overhead by a factor of 10 or more.
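
You can see the difference for yourself by snapshotting interrupt counts around a speed test (rough sketch; the interrupt names depend on the dongle and driver):

cat /proc/interrupts > /tmp/irq_before
# ... run the speed test ...
cat /proc/interrupts > /tmp/irq_after
diff /tmp/irq_before /tmp/irq_after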

OK, I finally got a UE300 (rtl8153) and ran the Waveform bufferbloat speed tests, comparing it to the Trendnet (asix-ax88179).

The other difference is that I am now running the RPi4 at 2,000 MHz:

/boot/config.txt
# cat /boot/config.txt 
################################################################################
# Bootloader configuration - config.txt
################################################################################

################################################################################
# For overclocking and various other settings, see:
# https://www.raspberrypi.org/documentation/configuration/config-txt/README.md
################################################################################

# OpenWrt config
include distroconfig.txt

[all]
# Place your custom settings here.
over_voltage=6
arm_freq=2000
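
To confirm the overclock took effect, the standard cpufreq sysfs node works (a quick sketch; with an on-demand governor it will only read 2000000 under load):

# current ARM core frequency in kHz
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq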

Conclusion: the UE300 achieves approx 25% higher speeds with more or less the same CPU load. Bufferbloat is controlled in each case.

Trendnet

  • CPU use: max 99+% on single core
  • Latency: 19 ms unloaded, +0 ms download active, +0 ms upload active
  • Speed: 670.5 Mbps down (average of two runs 672 and 669)

TP-Link UE300

  • CPU use: max 98% on single core (did not saturate)
  • Latency: 17 ms unloaded, +2 ms download active, +0 ms upload active
  • Speed: 839 Mbps down (average of two runs 826 and 852)

Both tests used fq_codel/simplest_tbf with 850000 kbit/s download and 24000 kbit/s upload. Packet steering was enabled for both.
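
In /etc/config/sqm terms, that corresponds to roughly the following (a sketch reconstructed from the settings above, not a paste of my actual file; 'eth1' stands in for whichever interface SQM is attached to):

config queue
	option enabled '1'
	option interface 'eth1'
	option qdisc 'fq_codel'
	option script 'simplest_tbf.qos'
	option download '850000'
	option upload '24000'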

As a control, no SQM (using Trendnet):

  • CPU use: 40% max on single core
  • Latency: 32 ms unloaded, +39 ms download active, +0 ms upload active
  • Speed: 909 Mbps down and 24.4 Mbps up

Are you running irqbalance? Because that may help a lot. I can not only shape a gigabit, but I can do it over a squid proxy at the native clock freq. So there should be a lot of headroom on this device.

Yes, it is running. The issue is that SQM is single-threaded. See the pic in post #7 of this thread.
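
You can also watch the single-core ceiling directly from the per-core counters (a rough sketch; the saturated core is the one whose idle column barely moves during a test):

# columns after the name: user nice system idle iowait irq softirq ...
grep -E '^cpu[0-9]' /proc/stat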

It's strange, because I managed to shape a gigabit with something like 10-15% of one core in the test thread. Now, I was using HFSC, but I don't expect cake to be 10x as CPU-intensive.

@darksky Are you using snapshots?

There seems to be a regression in SQM performance lately:

I guess this will get fixed at some point, but in the meantime you can try an older version of sqm-scripts as cited; that should get you results more in line with @dlakelan's.

Yes, built against HEAD. I guess I could downgrade to the commit from Dec 2020.

If you do, please try incrementally adding the 3 commits between Dec 2020 and Feb 28 2021, to see which of these innocent-looking changes is responsible for the observed CPU load.
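
Something along these lines, checking out each suspect commit in turn and re-testing (a sketch; I don't have the commit hashes at hand, so the placeholder stays a placeholder):

git clone https://github.com/tohojo/sqm-scripts.git
cd sqm-scripts
git log --oneline --since=2020-12-01 --until=2021-02-28
# for each suspect commit:
git checkout <commit>
# reinstall the scripts onto the router and re-run the speed test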