SQM/QoS can saturate the CPU: is this expected, or can the code be improved?

I recently observed that a 225 Mbps downstream connection is limited to approx. 150 Mbps if SQM/QoS is enabled, due to saturation of the router's CPU. Link. This is running 19.07.0-rc2 on an R7800, which has one of the most powerful CPUs supported by OpenWrt, a dual-core 1.7 GHz SoC. You can read through the linked thread for details, but in summary the recommended SQM setup, cake/piece_of_cake.qos, saturates one of the CPU cores, which causes the limited speed.

I am wondering if there can be improvements to SQM to optimize CPU usage and allow for faster throughput. For example, an ISP in my city is advertising 1000 Mbps downstream. If I am tuning for bufferbloat, I would think the CPU-bound nature of SQM would severely limit it. Thanks!

This is why you will find so many recommendations of x86_64 for lines over a few hundred Mbps.

SQM, by its nature, requires the CPU to examine and queue every packet. "Tricks" of using the switch fabric ("flow offload") don't significantly help the way that they do when only routing and NAT are involved.
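
For reference, a quick sketch of how software flow offloading gets toggled on OpenWrt; it helps plain routing/NAT, but, per the above, it cannot help a shaper that must see every packet:

# enable software flow offloading in the firewall defaults
uci set firewall.@defaults[0].flow_offloading='1'
uci commit firewall
/etc/init.d/firewall restart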

The IPQ806x platform has some "known challenges" around open-source support of the "NSS" cores, which, as I understand it, are used by the proprietary QSDK to accelerate network performance. It is not at all clear if, assuming they are ever supported, they would be applicable to SQM, due to the need for active queue management of the interfaces (which is how SQM works).

I wonder if the code can be made to use multiple threads?

The qdisc (queue discipline) code and the remainder of the TCP/IP stack are effectively those of upstream Linux. I would imagine that the performance demands of 10 Gbps, 100 Gbps, and faster on 8-, 12-, and more-core devices have already driven that exploration. One of the challenges is that there is a single queue for a given packet flow (by the definition of a queue) for a given interface, so throwing processors at it likely doesn't help. The cross-thread synchronization probably costs more than just processing the packet.

So something upstream of the CPU could be the rate-limiting part; in other words, perhaps it is already multithreaded. It just seems odd to me that htop shows >95% on a single core.

If you think about how SQM works, then the challenges with multi-threading may be clearer.

When the "clock ticks", the thread needs to make a decision as to if it is time to pass the head of the queue to the interface. So, that thread needs to gain exclusive lock on

  • Head of the queue
  • Current state variables

If it is "time", then it additionally needs to gain exclusive lock on

  • Interface "input"

then send the packet, update the state variables, and remove the packet from the head of the queue.

At high rates, there can only be a single thread operating on the queue at any given time. So while you might be able to spread that load over multiple threads, your time to process a packet goes from

  • Evaluate if to send
  • Send
  • Update state

to

  • Gain two locks
  • Evaluate if to send
  • Gain lock
  • Send
  • Release one lock
  • Update state
  • Release two locks

at best. If the locks aren't acquired and released in a consistent order (as shown above), there is the possibility of deadlock.

So while you can spread the load over multiple cores, at least as I think through it, you still can't go faster than a single thread can go. The cost of managing the locks would seem to only slow down the overall speed.

Well, there is a bit of interference coming in from the power-saving features, I believe: on a Turris Omnia with a similar SoC (but without any frequency and power scaling/saving features enabled) I can shape bidirectional simultaneous traffic at up to ~500/500 Mbps (gross shaper setting). Sure, that was without any wifi active, but with the default firewall and NAT (WAN IP over DHCP). So I would think it should be possible to get a tad more than 150 Mbps out of your router, but the r7800 threads make it clear that is not going to be easy.
The challenge for low-latency traffic shaping is not necessarily sustained CPU throughput, but rather that the latency budget is quite tight: if the shaper needs CPU cycles in the next X milliseconds and the SoC cannot deliver them in time, the shaper will waste bandwidth and throughput will suffer (but not catastrophically, as latency tends to stay within reasonable bounds).

What should work, though, on a multi-core SoC like the r7800's (even if it is not the default) is to move the ingress and egress shaper instances onto different CPUs. By default they will both compete for the same CPU... which does not help.
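
A hedged sketch of one way to do that, via IRQ affinity and receive packet steering (RPS); the IRQ number and interface name below are placeholders, so check /proc/interrupts and your actual WAN device first:

# keep the NIC interrupt (and thus the egress shaper work) on CPU0
echo 1 > /proc/irq/31/smp_affinity
# steer receive processing (where the ingress/IFB shaper runs) to CPU1
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus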

Assuming NSS/NPU offloading were supported, enabling it would make SQM impossible (same story as hardware flow-offloading for mt7621): the whole trick of h/w acceleration is to make large parts of your packet flow bypass the main SoC and the kernel's/netfilter's supervision, while SQM needs exactly this per-packet supervision to function.

--
Yes, the proprietary NSS firmware does implement a crude form of QoS (StreamBoost) itself, which can run hardware-accelerated on the NSS/NPU cores, but that's distinct from SQM and is black(box) magic.

As mentioned, the router workload depends mostly on the rate of packets (pps), not Mbps.

Packet sizes vary by a factor of >15!
The average packet size is mostly around 500 bytes (more if you stream/torrent, less if you use VoIP and gaming), so a Mbps->pps factor of ~3 relative to full-size 1500 byte packets.

Hence a router should be able to process ~3x line rate (in Mbps) to be able to cope with a reasonable amount of small packets at line rate; better 5x, ideally 10x.
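
A quick back-of-the-envelope check of that factor (assuming 1500 byte full-size packets, the 500 byte average from above, and a 200 Mbit/s line as an example):

awk 'BEGIN {
    rate = 200 * 10^6                              # line rate in bit/s
    printf "pps at 1500 B packets: %.0f\n", rate / (1500 * 8)
    printf "pps at  500 B packets: %.0f\n", rate / (500 * 8)
}'
# prints ~16667 vs ~50000 pps, i.e. the ~3x factor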

That seems especially relevant since the traditional speedtests basically all use maximally sized packets and hence emphasize achievable bandwidth while not saturating PPS at all.

By reasoning about speedtests I reckon that a "normal download" (two full-size packets plus one ACK in return) already gives a baseline average of around (2*1500+100)/3 = 1033 bytes per packet, so my factors are a bit off...

Running a router that can just about reach line rate while downloading is ill-advised.

Which is often not true anymore; with Linux hosts, GRO/GSO will result in a considerably lower ACK rate (but for estimating a reasonable-case scenario, 1 ACK per two full-MTU packets still has merit). But note that data and ACK packets for speedtests typically travel in opposite directions (which is good, in that the load is easier to spread over multiple CPUs if available).
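
If you want to check whether a host coalesces that way, ethtool shows the relevant offloads (the interface name here is just an example):

ethtool -k eth0 | grep -E 'generic-(segmentation|receive)-offload'
# typical output:
#   generic-segmentation-offload: on
#   generic-receive-offload: on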

Yes, at least one needs to be conscious that in that configuration achieving line rate is pretty much the best-case scenario and will not be terribly robust against any interference.

I will jump into this conversation as it also affects my current router and network speed. At the moment I have a TP-Link C2600 (IPQ8064) running OpenWrt 18.06.5.
My ISP recently upgraded the bandwidth from 100 to 200 Mbit download/upload. If I have SQM enabled with the recommended scheduler (cake/piece_of_cake) and test it with a speedtest, it is limited to 100 Mbit. If SQM is disabled I can get around 170-180 Mbit/s.
And I am not sure what the cause could be. Would this mean, as suggested in the 2nd post, that for QoS/SQM to work I need at least an x86-64 router because of the per-packet processing?
At the moment I am looking at Mikrotik offerings (such as the hAP ac2), but I use WireGuard, which is not available there.

EDIT: My question is: what device would be sufficient for SQM/QoS to sustain 200 Mbit network speed?

Maybe ... have you tried an alternative queuing discipline and script? fq_codel and simple.qos give me throughput >215 Mbps and good bufferbloat scores on my R7800.
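
For reference, that switch can also be made from the shell (assuming the default first queue section in /etc/config/sqm):

uci set sqm.@queue[0].qdisc='fq_codel'
uci set sqm.@queue[0].script='simple.qos'
uci commit sqm
/etc/init.d/sqm restart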

Unfortunately I did not; I tested my network speed with SQM disabled. I will give it a try. Thanks!

A typical x86 mini PC would do it, no question, and a lot more. I think the Raspberry Pi 4 would do it no problem, even if you don't use an extra USB NIC; you'd need a smart switch and to use VLANs. I also think the Linksys WRT32X or 3200 would do it fine. The EspressoBin would do it as well.

That seems a bit odd, but could be caused by interference between frequency scaling/power saving and the low-latency CPU demand of traffic shapers.

You could try to disable frequency scaling (assuming IPQ8064 does that in the first place) and/or you could switch to fq_codel/simple.qos. On OpenWrt 19.07-RC you can edit /usr/lib/sqm/defaults.sh:
Change [ -z "$SHAPER_BURST_DUR_US" ] && SHAPER_BURST_DUR_US=1000 to, say, [ -z "$SHAPER_BURST_DUR_US" ] && SHAPER_BURST_DUR_US=10000 to allow for 10 ms of CPU latency. That will cause an additional 9 ms increase in delay, but might get you back more bandwidth (but first try fq_codel/simple.qos without the edit).
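
A one-liner sketch of that edit (same file and variable as quoted above; note that a sqm-scripts upgrade may overwrite the file, so re-check it afterwards):

sed -i 's/SHAPER_BURST_DUR_US=1000/SHAPER_BURST_DUR_US=10000/' /usr/lib/sqm/defaults.sh
/etc/init.d/sqm restart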

That depends a bit on your traffic mix / packet size distribution, but an x86 or even an mvebu-based ARM router will allow you to do that for normal cases (for worst-case saturating loads with minimal packet sizes, x86_64 is the only affordable game in town).

Can you please explain how to do this? I'm trying to fully utilize the underwhelming power of my R7800.

Are you using any of the R7800-recommended (according to some other threads) CPU on-demand scaling settings in /etc/rc.local?

# keep both cores at a minimum of 800 MHz so they can react quickly
echo 800000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
echo 800000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq
# make the ondemand governor scale up already at 35% load
echo 35 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold
# and stay at the higher frequency longer before scaling back down
echo 10 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor
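
After a reboot (or after pasting those lines into a root shell), you can verify they took effect:

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpufreq/ondemand/up_threshold
cat /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor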