It's difficult to approximate my WAN ingress max-rate goodput, as there is a shaper on the line (which could be a good thing, given what shaper is actually in use). My best guess from my modem <> DSLAM is that the sync rate for ingress is 60022 kbit/s, with some form of overhead averaging about 1.2%, which leaves an average of 59300 kbit/s. This is where the shaper comes in, leaving about 55500 kbit/s for throughput. The egress side shows the same 1.2% but is not shaped in the modem, which is in IP passthrough mode. Its throughput is about 12800 kbit/s, leaving 12064 on the table.
Is it me, or does cake use more CPU power?
I remember that cake was able to shape up to ~ 400 Mbit/s on my device.
CPU usage around ~80% on core and ~30% on the other one.
Now it is 100% and 50% and speed caps out at ~310 Mbit/s.
HTB + fq_codel is a bit better.
CPU usage ~90% and ~40%; caps out at ~340 Mbit/s.
(Configured at 360 Mbit/s - overhead = ~340 Mbit/s, seems good.)
It's not you.
There was an effort by @dtaht to make it more lean,
but for the meantime tbf+fq_codel seems to perform more efficiently.
OpenWrt piece of CAKE and EdgeOS HTB/FQ_CODEL both top out ~185 Mbps down on my Edgerouter X (MT7621). Not much difference between them. I flashed OpenWrt hoping CAKE would get a little more out of the multi core (2C/4T) 880 MHz MIPS CPU than HTB/FQ_CODEL, but no. Still sticking with OpenWrt though - more flexible with its packages, just as stable and less worry about it phoning home.
If you run OpenWrt master builds, you could try sqm-scripts simple.qos/fq_codel again, while manually editing the following section in /usr/lib/sqm/defaults.sh:
# HTB without a sufficiently large burst/cburst value is a bit CPU hungry
# so allow to specify the permitted burst in the time domain (microseconds)
# so the user has a feeling for the associated worst case latency cost
# set to zero to use the htb default burst of one MTU
[ -z "$SHAPER_BURST_DUR_US" ] && SHAPER_BURST_DUR_US=1000
Setting SHAPER_BURST_DUR_US to, say, 5000 instead of 1000 will add 5 more milliseconds to your delay, but should give better performance under load, when the shaper gets the CPU less often than once every millisecond. You can go wild and configure different values for the down- and upstream directions, and the same for the shaper quantum, but unless you know why you would want that, just setting the bi-directional value should be sufficient. This feature has not seen much explicit testing yet, so if you try it, please report back if/how it works (ideally with links to speedtest results).
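For reference, a manual edit as described would just change the fallback value on that line; the 5000 here is only the example figure from above, not a recommendation:

```shell
# /usr/lib/sqm/defaults.sh -- raise the fallback burst duration from
# 1000 to 5000 microseconds. Note: a manual edit like this is lost
# when a package upgrade overwrites the file.
[ -z "$SHAPER_BURST_DUR_US" ] && SHAPER_BURST_DUR_US=5000
```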
Thanks in advance...
Thanks.
There is no difference between tbf and htb on my device.
// Edit
Seems like it is not a "problem" with cake, because today there is also high CPU usage with htb+fq_codel.
Weird.
// Edit2
Did some testing tonight.
Shaper Settings:
Down: 360000
Up: 36000
Overhead: 22
MPU: 64
Packet Queue Hard Limit: 10240 (fq_codel default)
HTB Burst Time: 150ms
Ingress+Egress: No ECN
Cake ingress settings: ingress nat wash dual-dsthost
Cake egress settings: egress nat wash dual-srchost
No SQM:
Starting speedtest for 60 seconds per transfer session.
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
...................................................................
Download: 417.77 Mbps
Upload: 36.29 Mbps
Latency: [in msec, 67 pings, 0.00% packet loss]
Min: 13.100
10pct: 13.600
Median: 48.500
Avg: 48.655
90pct: 67.000
Max: 88.200
cake (piece of cake)
Concurrent Test Mode
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
...................................................................
Download: 295.74 Mbps
Upload: 25.52 Mbps
Latency: [in msec, 67 pings, 0.00% packet loss]
Min: 13.000
10pct: 13.900
Median: 23.100
Avg: 28.373
90pct: 43.700
Max: 74.400
cake (piece of cake)
Sequential Test Mode
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are sequential, each with 8 simultaneous streams.
.............................................................
Download: 319.35 Mbps
Latency: [in msec, 60 pings, 0.00% packet loss]
Min: 10.500
10pct: 11.700
Median: 16.400
Avg: 23.912
90pct: 21.800
Max: 432.000
.............................................................
Upload: 32.14 Mbps
Latency: [in msec, 61 pings, 0.00% packet loss]
Min: 12.700
10pct: 13.500
Median: 15.500
Avg: 15.544
90pct: 17.300
Max: 19.100
htb+fq_codel (simplest)
Concurrent Test Mode
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
Download: 313.86 Mbps
Upload: 28.14 Mbps
Latency: [in msec, 66 pings, 0.00% packet loss]
Min: 13.200
10pct: 16.400
Median: 26.200
Avg: 28.930
90pct: 39.800
Max: 79.800
htb+fq_codel (simplest)
Sequential Test Mode
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are sequential, each with 8 simultaneous streams.
Download: 335.50 Mbps
Latency: [in msec, 61 pings, 0.00% packet loss]
Min: 11.500
10pct: 13.000
Median: 18.100
Avg: 20.385
90pct: 23.300
Max: 31.900
.............................................................
Upload: 24.68 Mbps
Latency: [in msec, 61 pings, 0.00% packet loss]
Min: 13.000
10pct: 17.400
Median: 21.700
Avg: 23.067
90pct: 28.700
Max: 42.400
htb+fq_codel (simplest, target 10ms, interval 20ms, quantum 500)
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
Download: 291.63 Mbps
Upload: 28.12 Mbps
Latency: [in msec, 66 pings, 0.00% packet loss]
Min: 13.000
10pct: 17.000
Median: 24.900
Avg: 25.323
90pct: 30.900
Max: 59.100
tbf+fq_codel (simplest_tbf)
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
Download: 313.70 Mbps
Upload: 28.05 Mbps
Latency: [in msec, 66 pings, 0.00% packet loss]
Min: 13.500
10pct: 14.300
Median: 26.700
Avg: 28.555
90pct: 40.600
Max: 69.300
tbf+fq_codel (simplest_tbf, target 10ms, interval 20ms, quantum 500)
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
Download: 315.04 Mbps
Upload: 27.96 Mbps
Latency: [in msec, 65 pings, 0.00% packet loss]
Min: 112.000 (Is that actually 12 ms? Maybe a bug in the speedtest script...)
10pct: 16.300
Median: 25.100
Avg: 26.595
90pct: 31.700
Max: 44.400
Even without SQM, CPU is already hitting its limits.
Why the upload speed fluctuates that much, I don't know.
At first I thought, as you say, that it is because of congestion on the link,
but then there should also be a latency increase?
By default, fq_codel is also attached to the CPU ports.
On my device that means 8 fq_codel instances, 1 per hardware queue.
I still haven't figured out how the packets are distributed over the queues, but that is a different story.
But basically all packets forwarded to the internet go through 2 qdisc instances:
first through the fq_codel instance on the CPU port, and then through the SQM-configured qdisc on the WAN port. Maybe this is causing too much overhead?
Because in the No SQM test, the latency isn't that bad...
That makes me believe that fq_codel on the CPU port is already doing a good job.
(Settings are target 1ms, interval 20ms)
I will try it again with multiq on the cpu port in a few minutes.
//edit3
With multiq
Same settings as above.
No SQM:
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
Download: 404.25 Mbps
Upload: 36.57 Mbps
Latency: [in msec, 66 pings, 0.00% packet loss]
Min: 10.200
10pct: 17.100
Median: 53.700
Avg: 56.747
90pct: 68.900
Max: 93.200
htb+fq_codel:
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
Download: 314.92 Mbps
Upload: 28.03 Mbps
Latency: [in msec, 67 pings, 0.00% packet loss]
Min: 115.000
10pct: 16.300
Median: 25.300
Avg: 29.931
90pct: 40.300
Max: 94.800
cake
Measure speed to netperf-eu.bufferbloat.net (IPv4) while pinging 1.1.1.1.
Download and upload sessions are concurrent, each with 8 simultaneous streams.
..................................................................
Download: 322.65 Mbps
Upload: 25.80 Mbps
Latency: [in msec, 66 pings, 0.00% packet loss]
Min: 15.000
10pct: 16.200
Median: 27.700
Avg: 30.064
90pct: 46.900
Max: 66.100
Hmm. Not that much of a difference.
I know I had cake shaping 350 Mbit/s with around ~80% CPU usage.
Without SQM it is now nearly as high. Did something change in the kernel?
Thanks for the testing and the data!
I take it that means SHAPER_BURST_DUR_US=150000?
This means the theoretical goodput limits are:
360000 kbit/s * ((1500 - 20 - 20) / 1522) = 345.34 Mbps
36000 kbit/s * ((1500 - 20 - 20) / 1522) = 34.53 Mbps
In simultaneous tests the reverse ACK traffic for each flow also needs to be accommodated, but it is not reported by netperf (which does not know about it, as these ACKs terminate in the kernel); this explains some of the differences observed between simultaneous and sequential testing.
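The goodput arithmetic above is easy to check with a quick awk one-liner (payload and frame sizes taken straight from the formulas above):

```shell
# Theoretical TCP goodput for a 360000/36000 kbit/s shaper with
# 1522-byte on-the-wire frames carrying 1460 bytes of TCP payload
# (1500-byte MTU minus 20 bytes IPv4 and 20 bytes TCP header):
awk 'BEGIN {
    payload = 1500 - 20 - 20   # TCP payload per full-size packet
    wire    = 1522             # frame size incl. configured overhead
    printf "down: %.2f Mbps\n", 360000 * (payload / wire) / 1000
    printf "up:   %.2f Mbps\n",  36000 * (payload / wire) / 1000
}'
```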
Could well be...
Hmm..
When I set a higher bandwidth limit in cake, like 450 Mbit/s,
cake is able to push ~375 Mbit/s.
I have the feeling that it is something my ISP is doing, or it is an oddity of DOCSIS.
The 150ms burst seems quite high.
But with a higher burst time it is more likely that the configured bandwidth caps are reached.
With the default 1 ms burst,
the download ramps up really, really slowly:
the first seconds rest in the mid-150 Mbit/s range, and it takes a long time to reach 300 Mbit/s,
sometimes only reaching ~230 Mbit/s.
With a burst time between 10-20 ms, it noticeably gets better.
What does tc -s qdisc report as quantum for cake in both cases?
Could well be.
Yes, that is a setting interesting for testing, but certainly not for production use (TCP flows will not be happy if the RTT under load jumps up by 2 * 150 ms).
+1; the question is, what is a reasonable compromise? My gut feeling is that maybe 5-10 ms might be acceptable, but I also believe this is a policy decision each sqm-operator needs to take for themselves, assuming that it actually helps with bandwidth.
With 150 ms per direction, or 300 ms total, I am not too amazed that TCP reacts sluggishly.
That is also a rather more palatable range than 150 ms.
I meant with the 1ms default burst time.
How does using the burst/cburst feature of htb reduce cpu time?
I guess only using cburst reduces CPU time, because no shaping is applied?
cburst bytes
Amount of bytes that can be burst at 'infinite' speed, in other words, as fast as the interface can transmit them. For perfect evening out, should be equal to at most one average packet. Should be at least as high as the highest cburst of all children.
So setting an extremely high cburst value should allow overshooting the configured bandwidth.
After testing, yes it seems like it does.
Further down the page they also explain how to calculate the burst value.
For my configured caps the values work out to 450/45 kilobytes,
which equals the 10 ms burst in sqm-scripts (with a 100 Hz kernel).
I guess that explains why the down/up rates ramp up so nicely.
(with 10+ms burst time over the default 1ms)
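As a sanity check, the burst size in bytes follows directly from the shaped rate and the burst duration; for the 360000/36000 kbit/s caps and a 10 ms burst this reproduces the 450/45 kilobyte figures mentioned above:

```shell
# burst bytes = rate_kbit/s * burst_ms / 8 (then / 1000 for kB),
# because rate_kbit/s * burst_ms conveniently equals bits.
awk 'BEGIN {
    dur_ms = 10
    printf "down: %d kB\n", 360000 * dur_ms / 8 / 1000
    printf "up:   %d kB\n",  36000 * dur_ms / 8 / 1000
}'
```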
So, conclusion: sqm burst time = kernel tick? 100 Hz = 10 ms, 250 Hz = 4 ms, and so on?
However, this will only work well with 1 traffic class, I guess?
Because 2 or more classes trying to burst up to the line limit seems not so good.
//edit
Some speed tests, while watching a YouTube HD video:
htb+fq_codel (360/36Mbit/s )
10 ms burst time
quantum ingress/egress: 1024/512
queue hard limit: 10240p (fq_codel default)
fq_codel mem limit: 4 MB (OpenWrt default)
target/interval: 5 ms/100 ms (fq_codel default)
2019-09-19 02:12:35 Testing against netperf-eu.bufferbloat.net (ipv4) with 12 simultaneous sessions while pinging heise.de (60 seconds in each direction)
.............................................................
Download: 330.33 Mbps
Latency: (in msec, 60 pings, 0.00% packet loss)
Min: 11.100
10pct: 13.500
Median: 19.300
Avg: 21.337
90pct: 26.200
Max: 81.200
..................................................................
Upload: 35.83 Mbps
Latency: (in msec, 67 pings, 0.00% packet loss)
Min: 13.400
10pct: 18.000
Median: 23.000
Avg: 23.710
90pct: 28.600
Max: 60.400
Looks good to me
You are correct, burst by itself is not making the required work smaller; what it does though is relax the timing constraints on when the shaper needs to execute next. If we allow only the default burst of one packet, we only have the packet transmission time at the set bandwidth before the shaper needs to run again to keep sending at the set rate. If the shaper is executed too late, the interface will have been idle for a while, and that translates into wasted bandwidth. At a certain point the system is not going to be able to execute the shaper reliably enough with the default 1-MTU burst buffer, and then setting the buffer higher helps. To allow for higher rates the new sqm-scripts code defaults to a 1 millisecond equivalent buffer, but this is user configurable on purpose. In case this is helpful we can also expose this in the GUI...
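To put numbers on "packet transmission time at the set bandwidth" for the rates discussed in this thread: with the default one-MTU burst the shaper effectively has to run again within one MTU's transmission time, which is tight at high rates:

```shell
# Transmission time of one 1500-byte packet at the shaped rate;
# with a one-MTU burst the shaper must run roughly this often,
# while a 1 ms burst relaxes the deadline to about once per ms.
awk 'BEGIN {
    printf "%.0f\n", 1500 * 8 / 360e6 * 1e6   # us per MTU at 360 Mbit/s
    printf "%.0f\n", 1500 * 8 /  36e6 * 1e6   # us per MTU at 36 Mbit/s
}'
```
So at 360 Mbit/s the deadline is about 33 microseconds, at 36 Mbit/s about 333 microseconds.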
Why not choose a 5ms buffer? lol
??? For the sqm-scripts default we opted for something sane but small; trading off latency for bandwidth is a policy decision each network's operator needs to take individually. We just opted for something that does not come at a large bandwidth cost (the old implicit buffer-scaling code also aimed for <= 1 ms, so the default behavior is similar to the past, but changing it is now possible...)
Maybe try setting the average base RTT + 5-10 ms as the burst when sqm is active.
Well, the burst buffer is added to the base RTT under load, so with burst set to 100 ms a 10 ms RTT can balloon up to 210 ms (assuming burst is set identically for ingress and egress, as is the default). Ideally burst should be set as low as possible (~ one MTU) and only increased if the shaper does not get access to the CPU reliably, and all increases should be accompanied by bufferbloat testing, so the latency cost of increasing burst becomes obvious...
Random question: what is the difference between mq and multiq? Any man pages I find seem to treat them as synonymous, not citing a difference.
mq exposes all hardware queues, so it is possible to attach a qdisc of choice to each queue.
multiq doesn't expose the queues; it sends packets in a round-robin manner through the queues.
//correction
it is possible to assign traffic to specific queues with:
tc ... skbedit queue_mapping
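For illustration, a minimal mq + skbedit setup might look like the sketch below; the device name eth0, the queue handles, and the dport match are assumptions for the example, not taken from anyone's actual config:

```shell
# Expose the hardware queues via mq, then attach a qdisc of choice
# to each queue (mq gives each hw queue the class id 1:1, 1:2, ...):
tc qdisc replace dev eth0 root handle 1: mq
tc qdisc replace dev eth0 parent 1:1 fq_codel
tc qdisc replace dev eth0 parent 1:2 fq_codel

# Steer selected traffic to a specific queue from a clsact hook;
# here SSH traffic (dport 22) is pinned to queue 1 as an example:
tc qdisc add dev eth0 clsact
tc filter add dev eth0 egress protocol ip prio 1 u32 \
    match ip dport 22 0xffff action skbedit queue_mapping 1
```

This needs a multi-queue-capable device and root privileges; on a single-queue NIC the mq qdisc cannot be attached at all.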
I must agree, finding docs on mq is a bit hard.
How the packets are distributed over the mq queues, I'm not sure.
But in theory it should be configurable by RSS/RPS/XPS but I can't get this working.
And it should also be possible to set up some tc rules to assign traffic to a queue.
At least on my device (WRT1200).
Last night I tried out multiq with some tc filters. It shows 8 bands for use. I didn't see any traffic?
tc filter add dev eth1 parent 1:0 protocol ip prio 1 u32 \
match ip protocol 0 0x00 action skbedit queue_mapping 1
tc filter add dev eth1 parent 1:0 protocol ipv6 prio 2 u32 \
match ip6 protocol 0 0x00 action skbedit queue_mapping 2
What do you mean by "I didn't see any traffic?" ?
When replacing the qdisc on eth1 did you specify the handle id?
For example:
tc qdisc add dev eth1 root handle 1: multiq
My tc u32 skills are a bit limited
But is it protocol 0 0x00 or protocol 0 0?
I'm just beginning to learn tc filters. That one I pulled from the simple.qos script. Could be a mistake in that script we just caught by luck? It would explain no traffic! I'll test now and see if it makes a difference.