SQM cake: traffic prioritisation

@moeller0 I am very bemused.

So I ran a speed test with Speedtest.net on the wired PC and I checked the ping directly on the Archer C7.

root@OpenWrt:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=117 time=8.528 ms
64 bytes from 8.8.8.8: seq=1 ttl=117 time=8.322 ms
64 bytes from 8.8.8.8: seq=2 ttl=117 time=8.129 ms
64 bytes from 8.8.8.8: seq=3 ttl=117 time=9.549 ms
64 bytes from 8.8.8.8: seq=4 ttl=117 time=9.308 ms
64 bytes from 8.8.8.8: seq=5 ttl=117 time=8.478 ms
64 bytes from 8.8.8.8: seq=6 ttl=117 time=9.211 ms
64 bytes from 8.8.8.8: seq=7 ttl=117 time=8.682 ms
64 bytes from 8.8.8.8: seq=8 ttl=117 time=9.592 ms
64 bytes from 8.8.8.8: seq=9 ttl=117 time=8.618 ms
64 bytes from 8.8.8.8: seq=10 ttl=117 time=12.825 ms
64 bytes from 8.8.8.8: seq=11 ttl=117 time=10.183 ms
64 bytes from 8.8.8.8: seq=12 ttl=117 time=10.067 ms
64 bytes from 8.8.8.8: seq=13 ttl=117 time=9.720 ms
64 bytes from 8.8.8.8: seq=14 ttl=117 time=9.705 ms
64 bytes from 8.8.8.8: seq=15 ttl=117 time=10.932 ms
64 bytes from 8.8.8.8: seq=16 ttl=117 time=10.619 ms
64 bytes from 8.8.8.8: seq=17 ttl=117 time=10.579 ms
64 bytes from 8.8.8.8: seq=18 ttl=117 time=8.099 ms
64 bytes from 8.8.8.8: seq=19 ttl=117 time=10.397 ms
64 bytes from 8.8.8.8: seq=20 ttl=117 time=8.050 ms
64 bytes from 8.8.8.8: seq=21 ttl=117 time=9.420 ms

This is the kind of latency increase I would have expected (?) to see under load. I can't understand why torrents would make such a big difference: as far as I am aware, Speedtest.net uses a multi-threaded test, and if all the traffic is being put into the "best effort" queue there is surely no difference in classification.

Please can you explain this?

I ran the same ping test with SQM set to 25000Kbps on the downstream.

root@OpenWrt:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=117 time=8.042 ms
64 bytes from 8.8.8.8: seq=1 ttl=117 time=8.335 ms
64 bytes from 8.8.8.8: seq=2 ttl=117 time=9.005 ms
64 bytes from 8.8.8.8: seq=3 ttl=117 time=8.927 ms
64 bytes from 8.8.8.8: seq=4 ttl=117 time=8.213 ms
64 bytes from 8.8.8.8: seq=5 ttl=117 time=8.352 ms
64 bytes from 8.8.8.8: seq=6 ttl=117 time=8.781 ms
64 bytes from 8.8.8.8: seq=7 ttl=117 time=8.777 ms
64 bytes from 8.8.8.8: seq=8 ttl=117 time=9.124 ms
64 bytes from 8.8.8.8: seq=9 ttl=117 time=8.677 ms
64 bytes from 8.8.8.8: seq=10 ttl=117 time=8.416 ms
64 bytes from 8.8.8.8: seq=11 ttl=117 time=8.641 ms
64 bytes from 8.8.8.8: seq=12 ttl=117 time=8.718 ms
64 bytes from 8.8.8.8: seq=13 ttl=117 time=8.521 ms
64 bytes from 8.8.8.8: seq=14 ttl=117 time=8.779 ms
64 bytes from 8.8.8.8: seq=15 ttl=117 time=8.291 ms
64 bytes from 8.8.8.8: seq=16 ttl=117 time=8.277 ms
64 bytes from 8.8.8.8: seq=17 ttl=117 time=7.981 ms
64 bytes from 8.8.8.8: seq=18 ttl=117 time=8.197 ms
64 bytes from 8.8.8.8: seq=19 ttl=117 time=8.430 ms
64 bytes from 8.8.8.8: seq=20 ttl=117 time=8.293 ms
64 bytes from 8.8.8.8: seq=21 ttl=117 time=8.388 ms
64 bytes from 8.8.8.8: seq=22 ttl=117 time=8.192 ms

So is my SQM therefore still set too high at 50000Kbps?

Not sure, but speedtest.net uses a (few) handful of flows for a short time, while your torrent client might operate dozens of concurrent long-running flows in both directions, assuming you did not limit the number of concurrent flows in the client...
Flow fairness, sqm's default, will deal well with slight flow imbalances, like a handful of speedtest flows versus a handful of browsing flows, but it will simply not work as expected with heavy torrenting.
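To make the flow-fairness point concrete, here is a toy calculation (plain Python, the flow counts are made up purely for illustration):

# Idealised per-flow fair sharing under cake/fq_codel: every backlogged flow
# gets an equal slice of the shaped rate (ignoring the sparse-flow boost).
def per_flow_share(shaper_mbps, n_flows):
    return shaper_mbps / n_flows

shaper = 25.0          # Mbps, hypothetical downstream shaper rate
speedtest_flows = 4    # a handful of short-lived test flows
torrent_flows = 60     # dozens of long-running peer connections

share = per_flow_share(shaper, speedtest_flows + torrent_flows)
print(f"each flow gets ~{share:.2f} Mbps")
print(f"speedtest total ~{speedtest_flows * share:.1f} Mbps, "
      f"torrents total ~{torrent_flows * share:.1f} Mbps")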

Not sure, could be. Just try it out.

By the way, you can configure the fast.com test to also measure the upload rate, and you can increase the duration; I tend to recommend 30-60 seconds....

@moeller0 What kind of latency increase should I see under load with SQM? Is a few ms acceptable, or should there be no increase at all?

SQM will use either fq_codel or cake, and both use CoDel's core idea of only allowing a relatively small standing queue (called the target, defaulting to 5% of the interval, i.e. 5ms with the default interval of 100ms), while giving flows a larger period to react (called the interval, defaulting to 100ms). Each packet's waiting time in the queue is measured on dequeue as its sojourn time; if the sojourn time exceeds the target for a full interval, CoDel will mark or drop a packet to signal that flow to slow down. In essence, with responsive flows this results in an average latency increase under load of target milliseconds (5 by default) per direction, though empirically it often feels more like 2*target per loaded direction.
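To illustrate that target/interval logic, here is a very condensed sketch in Python (my own simplification, not the actual fq_codel/cake code; real CoDel additionally shortens the spacing between drops while the queue stays above target):

# Simplified CoDel marking/dropping decision; times in milliseconds.
TARGET = 5.0      # acceptable standing-queue delay (5% of the interval)
INTERVAL = 100.0  # grace period for flows to react (~worst-case RTT)

class CodelSketch:
    def __init__(self):
        self.first_above_time = None  # when the sojourn time first exceeded TARGET

    def should_mark_or_drop(self, sojourn_ms, now_ms):
        if sojourn_ms <= TARGET:
            self.first_above_time = None    # queue drained below target, all good
            return False
        if self.first_above_time is None:
            self.first_above_time = now_ms  # start waiting one full INTERVAL
            return False
        # the queue has stood above TARGET for a whole INTERVAL:
        # mark (ECN) or drop to signal this flow to slow down
        return now_ms - self.first_above_time >= INTERVAL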
The FQ part in fq_codel and cake will, in addition to that logic, isolate flows stochastically from each other, treat each flow independently for signalling, and give sparse flows a small boost (interactive traffic is often sparse, so this is a very helpful heuristic). But while a packet is "on the wire" the link is blocked, so for very slow links the expected delay is affected not only by the desired latency target but also by the time it takes to actually transfer a packet at the given rate. SQM and cake will automatically adjust the latency target for each link (and, in cake's case with diffserv modes active, for each tier) based on the configured shaper rate, as empirically a latency target smaller than the time to transfer at least ~2 full packets leads to a severe throughput loss.
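To put numbers on the serialisation-time point, a quick back-of-the-envelope calculation (the 2-packet rule below is an approximation of the behaviour just described, not the exact sqm/cake code):

MTU_BITS = 1514 * 8  # full-size Ethernet frame, in bits

def serialisation_ms(rate_mbps, packet_bits=MTU_BITS):
    # time to put one packet on the wire at the given shaper rate
    return packet_bits / (rate_mbps * 1e3)

for rate in (0.5, 1, 5, 25, 50):
    tx = serialisation_ms(rate)
    effective_target = max(5.0, 2 * tx)  # never below ~2 packet times
    print(f"{rate:5.1f} Mbps: 1 MTU takes {tx:6.2f} ms, "
          f"effective target ~ {effective_target:5.2f} ms")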

So in short the expected average delay on a loaded link with SQM/cake depends on:

  1. the configured shaper rate and link rate
  2. the configured interval (the default 100ms works well for general internet usage, but short-RTT CDN connections can profit from an interval set to <= 10ms, though only if there are no longer-RTT flows in the mix, as those will suffer in the presence of short-RTT flows if the interval is below the long-RTT flow's RTT)
  3. the responsiveness of the flows; especially on ingress, SQM/cake needs some cooperation, e.g. non-responsive DOS-type traffic will cause large delays, because for ingress traffic SQM typically sits on the wrong end of the bottleneck (and for real DOS attacks even the ISP's side of a home link is probably not a sufficient position to mitigate an attack, but I digress)

As I tried to explain, SQM/cake will introduce some delay under load, albeit only in the range of 5-10ms per congested/saturated direction; whether that is acceptable is a judgement call every network admin will need to make. IMHO that is perfectly fine, as it is orders of magnitude better than the occasional multiple thousands of milliseconds some poor souls see on their internet access links without a competent AQM.

Ideally yes, but that seems an unrealistic goal that could only be achieved by things like perfect synchronisation of all senders with the time slots in the AQM, so that each packet gets transmitted over the bottleneck immediately. There is work on making traffic behave a bit better than standard TCP, but even there the nominal goal is 1 millisecond of queueing delay. To put that number in perspective, many flows over the internet acquire >> 1 ms of delay variation/jitter along their journey.
IMHO, I am not opposed to 1ms of queueing delay, but compared to the status quo of >> 100ms, the ~10ms that SQM/cake delivers today is the bigger game changer. (Sure, from 100 to 10 is the same factor as from 10 to 1, but in reality most applications can tolerate a certain latency budget quite well, and shaving off the first 90ms simply has more impact than shaving off the next 9ms.)

Hope that helps.


I'm not sure we're on the same wavelength; I'm not referring to the abstract, I'm referring to my connection.

With the cap at 25Mbps, I see no latency increase at all; at 50Mbps I see a few ms, up to about 10-15ms maximum. Does that mean 50Mbps is too high?

I think I understood that, but I just do not think I should tell you what you should consider acceptable :wink: Not my call to make.

As I tried to explain, with cake on a saturated link you will see >= 5-10ms of delay per direction; that is not optional but really unavoidable if the link is truly saturated for long enough.
That said, if you see higher delays with the shaper set to 50 than to 25, then, assuming the latency under load @50Mbps is not acceptable, 50 Mbps seems too high. :wink:

Okay, thanks, but I am bemused as to why I then have to lose over half of my bandwidth to get an acceptable result?

Recall that I am syncing at 64Mbps and have to cap my speed to 25Mbps; surely this is not typical?

Well, do you?

I think it might make sense to start a binary search for the highest shaper rate that still yields an acceptable latency-under-load increase.... We know 50 Mbps seems too high, while 25 seems okay, so maybe try 37.5 next? If that is still acceptable try 43.75, if not try 31.25....
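If you want to be systematic about it, this is just a bisection over the shaper rate; a rough sketch in Python, where the measurement step is a placeholder for whatever latency-under-load test you actually run:

def find_highest_acceptable(latency_ok, low_ok=25.0, high_bad=50.0, step=1.0):
    # latency_ok(rate_mbps) -> bool is your own measurement, e.g. set the
    # shaper to rate_mbps, load the link, and judge the worst-case ping.
    while high_bad - low_ok > step:
        mid = (low_ok + high_bad) / 2
        if latency_ok(mid):
            low_ok = mid       # mid is still acceptable, search higher
        else:
            high_bad = mid     # mid is too high, search lower
    return low_ok

# Starting from the values above (25 ok, 50 too high) this tests 37.5,
# then 43.75 or 31.25, and so on, until the bracket is ~1 Mbps wide.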

But yes, sometimes the required rate "sacrifice" to get a low latency-under-load increase can be significant. That is one of the reasons why I believe that every network admin needs to make a policy decision about what trade-off he/she is willing to accept.


I haven't read all the way through this thread so I've probably missed something. In terms of deliberate capacity loss due to cake, in my setup I run egress at 99.5% of the upstream vdsl modem rate (i.e. 20000kbps ends up as 19900kbps configured). Ingress I run at 96.25% of the downstream modem rate (i.e. 80000kbps ends up as 77000kbps).
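For reference, those percentages work out as follows (plain arithmetic on the sync rates quoted above):

egress_sync_kbps, ingress_sync_kbps = 20000, 80000
print(round(egress_sync_kbps * 0.995))    # 99.5%  -> 19900
print(round(ingress_sync_kbps * 0.9625))  # 96.25% -> 77000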

Note I have ingress mode enabled on the ingress shaper, along with 'bridged-ptm ether-vlan mpu 72' to help ensure vdsl framing, overheads and frame sizes are respected/calculated. As a Sky customer I don't have to get involved with PPPoE (thank goodness), so if PPPoE is involved you'll need the 'pppoe-ptm' keyword instead of 'bridged-ptm'. UK vdsl also has a vlan tag overhead, hence the 'ether-vlan' keyword.

In terms of induced latency on a busy link, I typically see around 3ms in the egress direction and 7ms on ingress, though ingress is usually a bit more spiky.

In the graph below, 3am-9am was my daily backup to OneDrive going through, and from just after 9am to the end of the day was a large download where I decided to actually test one of my backups (it worked).

https://www.thinkbroadband.com/broadband/monitoring/quality/share/5bf57e52bbdcec699e3141cb95622dc901a0fc26-09-10-2020

In my setup backup traffic is categorised as 'bulk' traffic so everyone was oblivious to the fact the link was 'full' and normal browsing, downloads, streaming just worked...the backup traffic gave way to everything else.


I would not use the two ptm variants, as I consider this to be too much work for too little gain. PTM's 64/65 encoding really is ignorant of packet sizes and hence is cheaper to account for by simply reducing the shaper rate to <= 100*64/65 = 98.46% of the sync rate, instead of doing an approximate calculation for every packet (this is different from ATM/AAL5, which requires per-packet adjustments)... Especially since quite a lot of ISPs use secondary traffic shapers, so that the sync rate is not the relevant limit anyway... ;)
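In other words (the sync rate below is just an example figure):

ptm_efficiency = 64 / 65               # 64b/65b framing, ~0.9846
sync_kbps = 80000                      # hypothetical downstream sync rate
print(f"{ptm_efficiency * 100:.2f}%")  # 98.46%
print(f"shape to <= {sync_kbps * ptm_efficiency:.0f} kbps")  # 78769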

Well, always good to see numbers from real life usage, thanks.
That said, my >= 5 ms per congested direction really is not an approximation but a fundamental property of how codel-type AQMs work with their default target of 5ms, if faced with saturation of sufficient duration :wink:


Hi, are you able to post your full SQM config please?

Sure. Beware that I use a custom sqm script, 'ctinfo_5layercake.qos', which combines a patched version of cake implementing a 'diffserv5' 5-tin classification system with 'ctinfo/connmark dscp' to store DSCPs across connection egress/ingress paths, so DSCPs, and hence classifications, can be restored on packet ingress.

# cat /etc/config/sqm 

config queue
	option debug_logging '0'
	option ingress_ecn 'ECN'
	option interface 'eth0'
	option qdisc_advanced '1'
	option egress_ecn 'ECN'
	option qdisc_really_really_advanced '1'
	option squash_dscp '0'
	option squash_ingress '0'
	option linklayer 'none'
	option enabled '1'
	option script 'ctinfo_5layercake.qos'
	option iqdisc_opts 'dual-dsthost bridged-ptm ether-vlan nat mpu 72 ingress'
	option eqdisc_opts 'dual-srchost bridged-ptm ether-vlan nat mpu 72 ack-filter'
	option verbosity '1'
	option qdisc 'cake'
	option upload '19900'
	option download '77000'

And the modified sqm scripts https://github.com/ldir-EDB0/sqm-scripts/commit/e701873cf6393d056360dbc4f174db602ca02f09

Patches to cake for 'diffserv5' https://git.openwrt.org/?p=openwrt/staging/ldir.git;a=shortlog;h=refs/heads/mine

and some domains filled in by dnsmasq (/etc/config/dhcp)

        list ipset '/zoom.us/Zoom4,Zoom6'                                                                                           
        list ipset '/googlevideo.com/Vid4,Vid6'                                                                                     
        list ipset '/nflxvideo.net/rangeA-netflix.cdn.enbgk.isp.sky.com/Vid4,Vid6'                                                  
        list ipset '/aiv-cdn.net/r.cloudfront.net/aiv-delivery.net/Vid4,Vid6'                                                       
        list ipset '/s.loris.llnwd.net/as-dash-uk-live.bbcfmt.hs.llnwd.net/aod-dash-uk-live.bbcfmt.hs.llnwd.net/aod-dash-uk-live.aka
        list ipset '/vs-dash-uk-live.akamaized.net/Vid4,Vid6'                              
        list ipset '/cdn.bllon.isp.sky.com/live.bidi.net.uk/Vid4,Vid6'                     
        list ipset '/ssl-bbcdotcom.2cnt.net/Vid4,Vid6'                                     
        list ipset '/fbcdn.net/Vid4,Vid6'                                                  
        list ipset '/ttvnw.net/Vid4,Vid6'                                                  
        list ipset '/vevo.com/Vid4,Vid6'                                                   
        list ipset '/millicast.com/Vid4,Vid6'                                              
        list ipset '/xirsys.com/Vid4,Vid6'                                                 
        list ipset '/audio-fa.scdn.cot/Vid4,Vid6'                                          
        list ipset '/deezer.com/Vid4,Vid6'                                                 
        list ipset '/sndcdn.com/Vid4,Vid6'                                                 
        list ipset '/last.fm/Vid4,Vid6'                                                    
        list ipset '/v.redd.it/Vid4,Vid6'                                                                                           
        list ipset '/ttvnw.net/Vid4,Vid6'                                                  
        list ipset '/ms-acdc.office.com/windowsupdate.com/update.microsoft.com/Bulk4,Bulk6'
        list ipset '/1drv.ms/Bulk4,Bulk6'                                    
        list ipset '/1drv.com/Bulk4,Bulk6'                                   
        list ipset '/graph.microsoft.com/BE4,BE6'                            
        list ipset '/web.whatsapp.com/BE4,BE6'   

Thanks. I am totally bemused then why I have to lose so much bandwidth to get this to work.

Somebody else said earlier that it's because the issue is outside my network; I still don't really know what this means.

Could it be a router issue? Is my Archer C7 too old? Would a new router be better?

@ldir what are you using?

I use an APU2 - since you're not using the built-in wifi I think the C7 should be good enough... just. Check CPU usage.

One quick and dirty way of doing that is to log into the router via SSH, run top -d 1, and observe the value in the % idle column at the top. If that value gets close to, say, 10% or lower (on a single-core router; for traffic shaping it is typically the sirq% that goes up when idle% goes down), that would indicate your router is running out of CPU cycles to spend on traffic shaping, which would result in undesired latency-under-load increases. CPU time accounting is a bit peculiar, so looking at the time the CPU does nothing (idle) is the simplest way to gauge the CPU load. With multicore CPUs you either need a top variant that can show values per CPU, or you need to adjust the reference value; on a dual-core CPU, 50% idle could mean that one CPU is maxed out (0%) and the other is fully idle (100%), or any mixture. Keep in mind that cake, at the moment, really only runs on a single CPU, and hence a single maxed-out CPU can already be a sign of critical overload.
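If you prefer something scriptable over eyeballing top, the same information can be read from /proc/stat; a rough sketch, assuming a python3 interpreter is available somewhere (on OpenWrt you may need to install it first; the percentages ignore steal/guest time, which is fine for this purpose):

import time

def snapshot():
    # returns per-CPU (total, idle, softirq) jiffy counters from /proc/stat
    stats = {}
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("cpu") and line[3].isdigit():
                fields = line.split()
                user, nice, system, idle, iowait, irq, softirq = map(int, fields[1:8])
                stats[fields[0]] = (user + nice + system + idle + iowait + irq + softirq,
                                    idle, softirq)
    return stats

a = snapshot()
time.sleep(1)
b = snapshot()
for cpu in sorted(a):
    dt = b[cpu][0] - a[cpu][0]
    if dt:
        print(f"{cpu}: idle {100 * (b[cpu][1] - a[cpu][1]) / dt:5.1f}%  "
              f"sirq {100 * (b[cpu][2] - a[cpu][2]) / dt:5.1f}%")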

Whilst downloading a single torrent @moeller0

Well, that looks a lot like CPU overload. The issue is that traffic shaping is not necessarily that dependent on raw CPU throughput, but to maintain low latencies SQM needs to get access to the CPU quickly enough that there are no "gaps" in the transmission (at the desired shaper rate). On a number of modern multicore SoCs SQM runs into issues: its sustained load is not high enough to keep the CPU in the high-frequency/high-power regime, but the scaled-down CPU is then not powerful enough to process SQM's bursty load in a timely enough fashion, and throughput suffers.

What services are you running on your router? Sometimes ostensibly sane and cheap services can require way more CPU cycles than one expects (case in point: SQM itself; traffic shaping takes way more CPU cycles than it should).
