General Discussion of SQM

It's difficult to approximate my WAN ingress max-rate goodput, as there is a shaper on the line (which could be a good thing, depending on what shaper is actually in use). My best guess from my modem <> DSLAM is that the sync rate for ingress is 60022 kbit/s, with some form of overhead averaging about 1.2%, which leaves an average of 59300 kbit/s. This is where the shaper comes in, leaving about 55500 kbit/s for throughput. The egress side shows the same 1.2% but is not shaped in the modem, which is in IP passthrough mode. Its throughput is about 12800 kbit/s, leaving 12064 kbit/s on the table.
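Just to sanity-check that estimate (the 1.2% figure is my own rough average, not a measured constant):

awk 'BEGIN { printf "ingress after overhead: %.0f kbit/s\n", 60022 * (1 - 0.012) }'   # ~59300 kbit/s, before the shaper takes its cut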

Is it me, or does cake use more CPU power?
I remember that cake was able to shape up to ~400 Mbit/s on my device,
with CPU usage around ~80% on one core and ~30% on the other.
Now it is 100% and 50%, and the speed caps out at ~310 Mbit/s.
HTB + fq_codel is a bit better:
CPU usage ~90% and ~40%, capping out at ~340 Mbit/s
(configured at 360 Mbit/s - overhead = ~340 Mbit/s, which seems right).

It's not you :slight_smile:

There was an effort by @dtaht to make it leaner,
but for the time being tbf+fq_codel seems to perform more efficiently.

OpenWrt piece of CAKE and EdgeOS HTB/FQ_CODEL both top out at ~185 Mbps down on my Edgerouter X (MT7621). Not much difference between them. I flashed OpenWrt hoping CAKE would get a little more out of the multi-core (2C/4T) 880 MHz MIPS CPU than HTB/FQ_CODEL, but no. Still sticking with OpenWrt though: more flexible with its packages, just as stable, and less worry about it phoning home.

If you run OpenWrt master builds, you could try sqm-scripts simple.qos/fq_codel again while manually editing the following section in /usr/lib/sqm/defaults.sh:

# HTB without a sufficiently large burst/cburst value is a bit CPU hungry
# so allow to specify the permitted burst in the time domain (microseconds)
# so the user has a feeling for the associated worst case latency cost
# set to zero to use the htb default burst of one MTU
[ -z "$SHAPER_BURST_DUR_US" ] && SHAPER_BURST_DUR_US=1000

Setting SHAPER_BURST_DUR_US to, say, 5000 instead of 1000 will add 5 more milliseconds to your delay, but should give better performance under load when the shaper gets the CPU less often than once every millisecond. You can go wild and configure different values for the down- and upstream directions, and the same for the shaper quantum, but unless you know why you would want that, just setting the bidirectional value should be sufficient. This feature has not seen much explicit testing yet, so if you try it, please report back if/how it works (ideally with links to speedtest results).
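For example, to get a 5 ms burst you would change the fallback assignment quoted above to

[ -z "$SHAPER_BURST_DUR_US" ] && SHAPER_BURST_DUR_US=5000

and then restart sqm (/etc/init.d/sqm restart) so the new value is picked up.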
Thanks in advance...

Thanks.
There is no difference between tbf and htb on my device.
// Edit
Seems like it is not a "problem" with cake, because today there is also high CPU usage with htb+fq_codel.
Weird.
// Edit2

Did some testing tonight.
Shaper Settings:
Down: 360000
Up: 36000
Overhead: 22
MPU: 64
Packet Queue Hard Limit: 10240 (fq_codel default)
HTB Burst Time: 150ms
Ingress+Egress: No ECN
Cake ingress settings: ingress nat wash dual-dsthost
Cake egress settings: egress nat wash dual-srchost
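(For reference, a rough sketch of the cake invocations those settings boil down to; the interface names eth1/ifb4eth1 are placeholders from my setup, sqm-scripts builds the actual commands:

tc qdisc replace dev eth1 root cake bandwidth 36mbit overhead 22 mpu 64 nat wash dual-srchost
tc qdisc replace dev ifb4eth1 root cake bandwidth 360mbit overhead 22 mpu 64 nat wash dual-dsthost ingress
)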

No SQM:

Starting speedtest for 60 seconds per transfer session.
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
...................................................................
 Download: 417.77 Mbps
   Upload:  36.29 Mbps
  Latency: [in msec, 67 pings, 0.00% packet loss]
      Min:  13.100
    10pct:  13.600
   Median:  48.500
      Avg:  48.655
    90pct:  67.000
      Max:  88.200

cake (piece of cake)
Concurrent Test Mode

Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
...................................................................
 Download: 295.74 Mbps
   Upload:  25.52 Mbps
  Latency: [in msec, 67 pings, 0.00% packet loss]
      Min:  13.000
    10pct:  13.900
   Median:  23.100
      Avg:  28.373
    90pct:  43.700
      Max:  74.400

cake (piece of cake)
Sequential Test Mode

Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are sequential, each with 8 simultaneous streams.
.............................................................
 Download: 319.35 Mbps
  Latency: [in msec, 60 pings, 0.00% packet loss]
      Min:  10.500
    10pct:  11.700
   Median:  16.400
      Avg:  23.912
    90pct:  21.800
      Max: 432.000
.............................................................
   Upload:  32.14 Mbps
  Latency: [in msec, 61 pings, 0.00% packet loss]
      Min:  12.700
    10pct:  13.500
   Median:  15.500
      Avg:  15.544
    90pct:  17.300
      Max:  19.100

htb+fq_codel (simplest)
Concurrent Test Mode

Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
 Download: 313.86 Mbps
   Upload:  28.14 Mbps
  Latency: [in msec, 66 pings, 0.00% packet loss]
      Min:  13.200
    10pct:  16.400
   Median:  26.200
      Avg:  28.930
    90pct:  39.800
      Max:  79.800

htb+fq_codel (simplest)
Sequential Test Mode

Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are sequential, each with 8 simultaneous streams.
Download: 335.50 Mbps
  Latency: [in msec, 61 pings, 0.00% packet loss]
      Min:  11.500
    10pct:  13.000
   Median:  18.100
      Avg:  20.385
    90pct:  23.300
      Max:  31.900
.............................................................
   Upload:  24.68 Mbps
  Latency: [in msec, 61 pings, 0.00% packet loss]
      Min:  13.000
    10pct:  17.400
   Median:  21.700
      Avg:  23.067
    90pct:  28.700
      Max:  42.400

htb+fq_codel (simplest, target 10ms, interval 20ms, quantum 500)

Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
Download: 291.63 Mbps
   Upload:  28.12 Mbps
  Latency: [in msec, 66 pings, 0.00% packet loss]
      Min:  13.000
    10pct:  17.000
   Median:  24.900
      Avg:  25.323
    90pct:  30.900
      Max:  59.100

tbf+fq_codel (simplest_tbf)

Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
 Download: 313.70 Mbps
   Upload:  28.05 Mbps
  Latency: [in msec, 66 pings, 0.00% packet loss]
      Min:  13.500
    10pct:  14.300
   Median:  26.700
      Avg:  28.555
    90pct:  40.600
      Max:  69.300

tbf+fq_codel (simplest_tbf, target 10ms, interval 20ms, quantum 500)

Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
 Download: 315.04 Mbps
   Upload:  27.96 Mbps
  Latency: [in msec, 65 pings, 0.00% packet loss]
      Min: 112.000 (is this actually 12 ms? Maybe a bug in the speedtest script...)
    10pct:  16.300
   Median:  25.100
      Avg:  26.595
    90pct:  31.700
      Max:  44.400

Even without SQM, the CPU is already hitting its limits.
Why the upload speed fluctuates that much, I don't know.
At first I thought, as you say, that it is because of congestion on the link,
but then there should also be a latency increase?
By default fq_codel is also attached to the CPU ports.
On my device that means 8 fq_codel instances, one per hardware queue.
I still haven't figured out how the packets are distributed over the queues, but that is a different story.
Basically, all packets forwarded to the internet go through two qdisc instances:
first through the fq_codel instance on the CPU port and then through the SQM-configured qdisc on the WAN port. Maybe this is causing too much overhead?
Because in the no-SQM test the latency isn't that bad...
That makes me believe that fq_codel on the CPU port is already doing a good job.
(Settings are target 1ms, interval 20ms.)
I will try it again with multiq on the CPU port in a few minutes.
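Roughly like this (the device name eth0 is an assumption for the CPU port on my box; check where fq_codel currently sits first):

tc -s qdisc show dev eth0                          # inspect the per-queue fq_codel instances
tc qdisc replace dev eth0 root handle 1: multiq    # swap the root qdisc on the CPU port to multiq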

//edit3
With multiq
Same settings as above.

No SQM:

Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
 Download: 404.25 Mbps
   Upload:  36.57 Mbps
  Latency: [in msec, 66 pings, 0.00% packet loss]
      Min:  10.200
    10pct:  17.100
   Median:  53.700
      Avg:  56.747
    90pct:  68.900
      Max:  93.200

htb+fq_codel:

Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
 Download: 314.92 Mbps
   Upload:  28.03 Mbps
  Latency: [in msec, 67 pings, 0.00% packet loss]
      Min: 115.000
    10pct:  16.300
   Median:  25.300
      Avg:  29.931
    90pct:  40.300
      Max:  94.800

cake

Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
 Download: 322.65 Mbps
   Upload:  25.80 Mbps
  Latency: [in msec, 66 pings, 0.00% packet loss]
      Min:  15.000
    10pct:  16.200
   Median:  27.700
      Avg:  30.064
    90pct:  46.900
      Max:  66.100

Hmm, not that much of a difference.
I know I had cake shaping 350 Mbit/s with around ~80% CPU usage.
Without SQM the CPU usage is now nearly as high. Did something change in the kernel?

Thanks for the testing and the data!

I take it that means SHAPER_BURST_DUR_US=150000 ?

This means the theoretical goodput limits are:
360 Mbps * ((1500 - 20 - 20) / 1522) = 345.34 Mbps
36 Mbps * ((1500 - 20 - 20) / 1522) = 34.53 Mbps
(1500 - 20 - 20 is the TCP payload of a full-sized packet after the IPv4 and TCP headers; 1522 is the 1500-byte MTU plus your configured 22 bytes of overhead.) In simultaneous tests the reverse ACK traffic for each flow also needs to be accommodated, but it is not reported by netperf (it does not know about it, since the ACKs terminate in the kernel), which explains some of the differences observed between simultaneous and sequential testing.
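A quick way to recompute these numbers (just restating the arithmetic above):

awk 'BEGIN { eff = (1500-20-20)/1522; printf "down: %.2f Mbps, up: %.2f Mbps\n", 360*eff, 36*eff }'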

Could well be...

Hmm...
When I set a higher bandwidth limit in cake, like 450 Mbit/s,
cake is able to push ~375 Mbit/s.
I have the feeling that either my ISP is doing something, or it is an oddity of DOCSIS.
The 150 ms burst seems quite high,
but with a higher burst time it is more likely that the configured bandwidth caps are reached.
With the default 1 ms,
the download ramps up really, really slowly:
the first seconds rest in the mid-150 Mbit/s range and it takes a long time to reach 300 Mbit/s,
sometimes only reaching ~230 Mbit/s.
With a burst time between 10-20 ms it gets noticeably better.


What does tc -s qdisc report as quantum for cake in both cases?

Could well be.

Yes, that is a setting that is interesting for testing, but certainly not for production use (TCP flows will not be happy if the RTT under load jumps up by 2 * 150 ms).

+1; the question is, what is a reasonable compromise? My gut feeling is that maybe 5-10 ms might be acceptable, but I also believe this is a policy decision each SQM operator needs to take for themselves, assuming that it actually helps with bandwidth :wink:

With 150 ms per direction, or 300 ms total, I am not too amazed that TCP reacts sluggishly.

That is also a rather more palatable range than 150 ms :wink:


I meant with the 1ms default burst time.

How does using the burst/cburst feature of HTB reduce CPU time?
I guess only using cburst reduces CPU time, because then no shaping is applied?

cburst bytes

Amount of bytes that can be burst at 'infinite' speed, in other words, as fast as the interface can transmit them. For perfect evening out, should be equal to at most one average packet. Should be at least as high as the highest cburst of all children.

So setting an extremely high cburst value should allow overshooting the configured bandwidth.
After testing: yes, it seems like it does.

Further down the page they also explain how to calculate the burst value.
For my configured caps the values equal a 10 ms burst in sqm-scripts (with a 100 Hz kernel):
450/45 kilobytes.
I guess that explains why the down/up rates ramp up so nicely
(with a 10+ ms burst time versus the default 1 ms).
So is the conclusion sqm burst time = kernel tick? 100 Hz = 10 ms, 250 Hz = 4 ms, and so on?
However, this will only work well with one traffic class, I guess?
Two or more classes trying to burst up to the line limit seems not so good.
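Checking my own 450/45 KB figures above (bytes = rate / 8 * burst time, for my 360000/36000 kbit/s caps):

awk 'BEGIN { printf "down: %d bytes, up: %d bytes\n", 360000*1000/8*0.010, 36000*1000/8*0.010 }'   # 450000 / 45000 bytes for a 10 ms burst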

//edit
Some speed tests, while watching a YouTube HD video :joy:
htb+fq_codel (360/36 Mbit/s)
10 ms burst time
quantum ingress/egress: 1024/512
queue hard limit: 10240p (fq_codel default)
fq_codel memory limit: 4 MB (OpenWrt default)
target/interval: 5 ms/100 ms (fq_codel default)

2019-09-19 02:12:35 Testing against netperf-eu.bufferbloat.net (ipv4) with 12 simultaneous sessions while pinging heise.de (60 seconds in each direction)
.............................................................
 Download:  330.33 Mbps
  Latency: (in msec, 60 pings, 0.00% packet loss)
      Min: 11.100 
    10pct: 13.500 
   Median: 19.300 
      Avg: 21.337 
    90pct: 26.200 
      Max: 81.200
..................................................................
   Upload:  35.83 Mbps
  Latency: (in msec, 67 pings, 0.00% packet loss)
      Min: 13.400 
    10pct: 18.000 
   Median: 23.000 
      Avg: 23.710 
    90pct: 28.600 
      Max: 60.400

Looks good to me :joy:

You are correct, burst by itself does not make the required work smaller; what it does is relax the timing constraint on when the shaper needs to execute next. If we allow only the default burst of one packet, we only have the packet transmission time at the set bandwidth before the shaper needs to run again to keep sending at the set rate. If the shaper is executed too late, the interface will have been idle for a while, and that translates into wasted bandwidth. At a certain point the system is not going to be able to execute the shaper reliably enough with the default 1-MTU burst buffer, and then setting the buffer higher helps. To allow for higher rates, the new sqm-scripts code defaults to a 1-millisecond-equivalent buffer, but this is user-configurable on purpose. In case this is helpful we can also expose this in the GUI...
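To put rough numbers on that (my own illustration, assuming a 360 Mbit/s rate and 1522-byte frames on the wire): with a 1-MTU burst the shaper has to run roughly once per frame time, i.e. every ~34 µs, while the 1 ms default relaxes that to about 30 frames per invocation:

awk 'BEGIN { t = 1522*8/360e6; printf "one frame: %.1f us -> ~%.0f frames per 1 ms burst\n", t*1e6, 0.001/t }'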

Why not choose a 5ms buffer? lol

??? For the sqm-scripts default we opted for something sane but small; trading off latency for bandwidth is a policy decision each network's operator needs to take individually. We just opted for something sane that does not come at a large bandwidth cost (the old implicit buffer-scaling code also aimed for <= 1 ms, so the default behavior is similar to the past, but changing it is now possible...).

Maybe try setting the average base RTT plus 5-10 ms as the burst when SQM is active.

Well, the burst buffer is added to the base RTT under load, so with burst set to 100 ms a 10 ms RTT can balloon up to 210 ms (assuming burst is set identically for ingress and egress, as is the default). Ideally burst should be set as low as possible (~one MTU) and only increased if the shaper does not get access to the CPU reliably, and all increases should be accompanied by bufferbloat testing, so the latency cost of increasing burst becomes obvious...

Random question: what is the difference between mq and multiq? Any man pages I find seem to treat them as synonymous, not citing a difference.

mq exposes all hardware queues, so it is possible to attach a qdisc of choice to each queue.
multiq doesn't expose the queues; it sends packets in a round-robin manner through the queues.
// correction
It is possible to assign traffic to specific queues with:
tc ... skbedit queue_mapping
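For illustration (device name eth0 and the number of queues are assumptions; mq auto-creates one class per hardware tx queue), attaching fq_codel per queue under mq would look roughly like:

tc qdisc replace dev eth0 root handle 1: mq
tc qdisc replace dev eth0 parent 1:1 fq_codel
tc qdisc replace dev eth0 parent 1:2 fq_codel
# ...and so on for each remaining queue (1:3, 1:4, ...)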

I must agree, finding docs on mq is a bit hard.
How the packets are distributed over the mq queues, I'm not sure.
In theory it should be configurable via RSS/RPS/XPS, but I can't get this working.
It should also be possible to set up some tc rules to assign traffic to a queue,
at least on my device (WRT1200).

Last night I tried out multiq with some tc filters. It shows 8 bands for use, but I didn't see any traffic?

tc filter add dev eth1 parent 1:0 protocol ip prio 1 u32 \
  match ip protocol 0 0x00 action skbedit queue_mapping 1

tc filter add dev eth1 parent 1:0 protocol ipv6 prio 2 u32 \
  match ip6 protocol 0 0x00 action skbedit queue_mapping 2

What do you mean by "I didn't see any traffic?" ?

When replacing the qdisc on eth1 did you specify the handle id?
For example:
tc qdisc add dev eth1 root handle 1: multiq

My tc u32 skills are a bit limited :joy:
But is it protocol 0 0x00 or protocol 0 0 ?

I'm just beginning to learn tc filters. That one I pulled from the simple.qos script. Could there be a mistake in that script that we just caught by luck? It would explain the missing traffic! I'll test now and see if it makes a difference.