For HFSC you probably need to add the "stab" option to the invocation of the root of your HFSC hierarchy...
something like:
stab mtu 2047 tsize 512 mpu 68 overhead ${OH} linklayer atm
seems like a better match for a real ATM/AAL5 link...
replace it like this?
tc qdisc replace dev $DEV stab mtu 2047 tsize 512 mpu 68 overhead ${OH} linklayer atm handle 1: root hfsc default 3
Yes, probably. In sqm we do something like:
tc qdisc replace dev $DEV handle 1: root stab mtu 2047 tsize 512 mpu 68 overhead ${OH} linklayer atm hfsc default 3
So you might need to test which invocation works (check with tc -s qdisc and tc -d qdisc).
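For example, scoped to the shaped interface (these are just the standard tc show commands, nothing script-specific):
tc -s qdisc show dev "$DEV"
tc -d qdisc show dev "$DEV"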
Thanks. We also iterated on the idea of actually using the proper bandwidth for the game... and fixed some syntax errors... Here's where we are now:
#!/bin/sh
WAN=eth0.2 # change this to your WAN device name
UPRATE=650 #change this to your kbps upload speed
LAN=eth0.1
DOWNRATE=3000 #change this to about 80% of your download speed (in kbps)
GAMEUP=450
GAMEDOWN=1200
setqdisc () {
DEV=$1
RATE=$2
OH=37
let highrate=$RATE*90/100
let lowrate=$RATE*10/100
let gamerate=$3
## for ethernet / DOCSIS / VDSL etc use this
#tc qdisc replace dev $DEV stab overhead $OH linklayer ethernet handle 1: root hfsc default 3
# for old school DSL with ATM use this:
tc qdisc replace dev $DEV handle 1: root stab mtu 2047 tsize 512 mpu 68 overhead ${OH} linklayer atm hfsc default 3
tc class add dev $DEV parent 1: classid 1:1 hfsc ls m2 ${RATE}kbit ul m2 ${RATE}kbit
# high prio class
tc class add dev $DEV parent 1:1 classid 1:2 hfsc rt m1 ${highrate}kbit d 20ms m2 ${gamerate}kbit
tc class add dev $DEV parent 1:1 classid 1:3 hfsc ls m1 ${lowrate}kbit d 20ms m2 ${highrate}kbit
tc qdisc add dev $DEV parent 1:2 pfifo limit 10
tc qdisc add dev $DEV parent 1:3 pfifo limit 10
}
setqdisc $WAN $UPRATE $GAMEUP
## shape the download direction via the output of the LAN (comment out the next line to disable)
setqdisc $LAN $DOWNRATE $GAMEDOWN
@moeller0 Sorry for not posting... I've been testing @dlakelan's script, making changes, fixing errors, etc.
@dlakelan I will post the fixed script.
We made some other changes for more "robust" shell scripting, etc. Here's the current version, which @Knomax says doesn't give any errors:
#!/bin/sh
WAN=eth0.2 # change this to your WAN device name
UPRATE=650 #change this to your kbps upload speed
LAN=eth0.1
DOWNRATE=3000 #change this to about 80% of your download speed (in kbps)
GAMEUP=450
GAMEDOWN=1200
setqdisc () {
DEV=$1
RATE=$2
OH=37
highrate=$((RATE*90/100))
lowrate=$((RATE*10/100))
gamerate=$3
## for ethernet / DOCSIS / VDSL etc use this
#tc qdisc replace dev $DEV stab overhead $OH linklayer ethernet handle 1: root hfsc default 3
# for old school DSL with ATM use this:
tc qdisc replace dev "$DEV" handle 1: root stab mtu 2047 tsize 512 mpu 68 overhead ${OH} linklayer atm hfsc default 3
tc class add dev "$DEV" parent 1: classid 1:1 hfsc ls m2 "${RATE}kbit" ul m2 "${RATE}kbit"
# high prio class
tc class add dev "$DEV" parent 1:1 classid 1:2 hfsc rt m1 "${highrate}kbit" d 20ms m2 "${gamerate}kbit"
tc class add dev "$DEV" parent 1:1 classid 1:3 hfsc ls m1 "${lowrate}kbit" d 20ms m2 "${highrate}kbit"
tc qdisc add dev "$DEV" parent 1:2 pfifo limit 10
tc qdisc add dev "$DEV" parent 1:3 pfifo limit 10
}
setqdisc $WAN $UPRATE $GAMEUP
## shape the download direction via the output of the LAN (comment out the next line to disable)
setqdisc $LAN $DOWNRATE $GAMEDOWN
Ok, @Knomax did some testing, with some decent results... We mainly made a change that puts a pie qdisc on the non-gaming "channel". Doing that gives a reasonable amount of delay control to the "less sensitive" users while still mainly prioritizing the game traffic from the special machine... Here is the script:
#!/bin/sh
WAN=eth0.2 # change this to your WAN device name
UPRATE=650 #change this to your kbps upload speed
LAN=eth0.1
DOWNRATE=3000 #change this to about 80% of your download speed (in kbps)
GAMEUP=450
GAMEDOWN=1200
setqdisc () {
DEV=$1
RATE=$2
OH=37
highrate=$((RATE*90/100))
lowrate=$((RATE*10/100))
gamerate=$3
## for ethernet / DOCSIS / VDSL etc use this
#tc qdisc replace dev $DEV stab overhead $OH linklayer ethernet handle 1: root hfsc default 3
# for old school DSL with ATM use this:
tc qdisc replace dev "$DEV" handle 1: root stab mtu 2047 tsize 512 mpu 68 overhead ${OH} linklayer atm hfsc default 3
tc class add dev "$DEV" parent 1: classid 1:1 hfsc ls m2 "${RATE}kbit" ul m2 "${RATE}kbit"
# high prio class
tc class add dev "$DEV" parent 1:1 classid 1:2 hfsc rt m1 "${highrate}kbit" d 20ms m2 "${gamerate}kbit"
tc class add dev "$DEV" parent 1:1 classid 1:3 hfsc ls m1 "${lowrate}kbit" d 20ms m2 "${highrate}kbit"
tc qdisc add dev "$DEV" parent 1:2 pfifo limit 10
tc qdisc add dev "$DEV" parent 1:3 pie limit 100 target 80ms ecn tupdate 40ms bytemode
}
setqdisc $WAN $UPRATE $GAMEUP
## shape the download direction via the output of the LAN (comment out the next line to disable)
setqdisc $LAN $DOWNRATE $GAMEDOWN
This results in bufferbloat in the 80-150 ms range for things like general web surfing, but essentially zero bufferbloat and zero packet drops for the gaming traffic.
For typical stuff, 80-150ms is actually not a particularly big deal; you blink your eye in about 100-150ms. It is, however, a huge deal for games or VoIP. So for people who dedicate a particular console or gaming machine to their games, this script should produce very good results after tuning the appropriate values in the header.
Just as a point to add here (and echoing what was already said before), these rules have turned out pretty similar to how Gargoyle would have generated them.
Swap out pfifo for sfq and it's basically there.
I would be interested to understand the performance difference between the two if you are ever inclined to test.
Does Gargoyle use HFSC? And how does it set up the classes? I find that most people, including most script-makers, don't really understand HFSC.
Also, in the latest iteration we're using pie for the non-sensitive channel. It does a decent job of keeping the everyday traffic from climbing to crazy-high latency.
Note 1: Happy to take this to PM to not clutter this thread if it starts getting too far OT
Note 2: A lot of this was written before I came along, but I've had to learn it and understand it as part of the IPv6 updates.
Yes it uses HFSC. Agreed, HFSC is confusing! The man pages do a pretty good job but it is easy to get lost in there.
The setup is performed mostly around two parts of the Gargoyle source code.
For the most part, the actual values of the classes aren't that important, only the ratios. Everything is normalised to 1 Gbps and a ratio of that is calculated, which gives us the fully saturated link share between all classes using LS.
We use UL to let any class have a defined maximum bandwidth, and RT to let any class have a defined minimum bandwidth. We use a 2x burst on these to help them come back up to their minimum from idle more quickly.
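To make the LS/UL/RT split a bit more concrete, here is a rough tc sketch of that kind of class structure (illustrative numbers on a made-up 10 Mbit/s link and device; these are not Gargoyle's actual generated commands, which normalise everything against 1 Gbps):
# sketch only: two classes sharing a 10 Mbit/s link at a 3:1 LS ratio,
# with an RT floor on one class and a UL cap on the other
tc qdisc replace dev eth0 root handle 1: hfsc default 20
tc class add dev eth0 parent 1: classid 1:1 hfsc ls m2 10000kbit ul m2 10000kbit
tc class add dev eth0 parent 1:1 classid 1:10 hfsc ls m2 7500kbit rt m2 2000kbit
tc class add dev eth0 parent 1:1 classid 1:20 hfsc ls m2 2500kbit ul m2 5000kbit
tc qdisc add dev eth0 parent 1:10 sfq
tc qdisc add dev eth0 parent 1:20 sfq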
There's also a clever minimum round-trip-time function in there, built on RT, which prioritises certain classes in certain situations.
This is coupled with the Gargoyle Active Congestion Controller, which monitors latency under all conditions and actively reduces the total link limit when latency gets poor. The ACC is significantly more aggressive when one of the minRTT classes is active. This effectively pulls any bottleneck outside our network back under our control so we can regain latency control. The ACC can throttle the connection right down to 15% of its total link speed.
Happy to take any feedback from anyone who knows better and wants to help improve it!
Notably, the STAB calculations need massive improvements (they're based on outdated connection speed/type assumptions in general) and the SFQ parameters could do with some tweaks to suit higher speed connections.
Would be happy to continue discussion on PM. I do think there's something to be said for rethinking the way those classes are set up.
I disagree, that is pretty terrible... Web page loading often involves chained resources, where you need to invest >= 1 RTT before you can even figure out how/where to get the next asset, and sometimes rendering the page requires all assets, so an 80-150ms RTT explodes into page delays in the seconds. And for things to feel interactive you only have a certain "delay" budget; every ms saved on avoidable queueing gives you more headroom...
But once you have separated the gaming traffic out of the way, why not use fq_codel or cake for the left-over queue? As the CoDel RFC argues, a PID controller like PIE is great for a linear system, but a network queue does not behave like a linear system... And if you must, think about fq_pie instead of pure PIE?
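For example, just swapping the leaf on the bulk class in the script above could be as simple as this (untested sketch; the target/interval values here are guesses and would want tuning for such a slow link):
tc qdisc replace dev "$DEV" parent 1:3 fq_codel limit 100 target 20ms interval 200ms ecn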
Agreed that if chained resources lead to seconds of delay it would be terrible. In this particular case @Knomax has only ~600kbps of upload bandwidth, so fairly long delays are inevitable: one MTU-sized packet takes on the order of 20ms to serialize, so 80ms of delay is just 4 packets... I'm not clear on whether fq-type queues are worthwhile under those conditions. Basically, if you try to hold the latency down to, say, 40ms, you have 2 of your fair queues holding one packet and all the rest holding 0 packets... Obviously for, say, 5-10Mbps or more of upload we could do lots more.
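To spell the arithmetic out: a 1500-byte packet is 12,000 bits, and 12,000 bits / 650 kbit/s ≈ 18.5 ms; on an ATM/AAL5 link that frame becomes roughly 32 cells × 53 bytes ≈ 1700 bytes, so closer to 21 ms per full-size packet, and four such packets queued ahead of you already puts you in the 80 ms range.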
All of this goes to show that it's really challenging to design qdiscs "in the abstract". There are essentially four major "domains":
1. Bandwidth is so low that long delays are inevitable... like a 2400 baud modem: nothing will keep you from experiencing multi-minute page load times.
2. Bandwidth is low but high enough that individual packets take close to the noticeable delay time, 5-50ms (see the rough serialization numbers after this list)... In this regime a single packet or two ahead of your "critical" packet produces annoying delays. If you want to prioritize interactive, real-time gaming type stuff you will have to force the less interactive stuff into delays that are definitely noticeable, 100s to 1000s of ms for example. In this regime queue lengths of even 2 or 3 packets are painful, so it's hard to "learn" anything from queue lengths.
3. Bandwidth is moderate, ~5-50Mbps: a single packet makes no appreciable difference, queue lengths can be 5 to 10 packets without much problem, and you can make decisions based on queue lengths because queues are not "either empty or 1 packet".
4. Bandwidth is large, ~50-1000Mbps: a single packet is trivial, queue lengths can be hundreds of packets, and CPU power and memory become the dominant concern.
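For a rough sense of scale across these regimes: the serialization time of one 1500-byte (12,000-bit) packet is about 18 ms at 650 kbit/s, 2.4 ms at 5 Mbit/s, 0.24 ms at 50 Mbit/s, 0.12 ms at 100 Mbit/s, and 12 µs at 1 Gbit/s.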
Solutions that work in one regime don't work in the others. I believe cake primarily targets (3); it does its best in regime (2), and it requires increasingly significant CPU as you move into regime (4). For now you can probably do cake into the 200Mbps range on cheap hardware... Something like an RPi4 or an x86 box is required to do cake in the upper part of that range and beyond... but at the same time, it may be that cake's sophisticated algorithms are less needed in this regime, and things like TBF and drr might do just as well. Remember that individual packets make no difference once you are in regime (4), so being approximately correct at low CPU usage might be fine compared to being bang-on but requiring a lot of CPU... I'm not sure, I haven't really played with it. I admit to just throwing CPU at my HFSC shaper.
Ah, thanks for spelling that out; agreed, he is clearly in the "all you can do is manage the pain" class of internet access links. I do not envy him.
Mmmh, on egress they will still help to isolate responsive from less responsive flows... but that is not going to be a big effect, I agree.
I, luckily, started with 14.4 kbit/s, but I still remember how it felt to switch even from 56K to ISDN's 64K...
I agree, in that scenario per-flow fairness is not going to give reasonable results, as you really want/need directed unfairness. That said, I wonder whether cake's "precedence" keyword might not be applicable here. It implements hard precedence, so unlike your nice solution it offers no guarantee at all for anything else, and hence would need an additional rate shaper on the game flows to not be completely unsafe, but it looks like it offers the required level of unfairness...
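Something along these lines, perhaps (just a sketch; the device and the 650 kbit / 37-byte overhead values are simply taken from the script above and would need to match the real link, and per the caveat above the game flows would still want their own rate limit):
tc qdisc replace dev eth0.2 root cake bandwidth 650kbit atm overhead 37 precedence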
Not 100% sure I agree. In (1) we are managing pain, sure, so that is special, but from (2) through (4) I see no real differences... cake and fq_codel are relatively cheap in their FQ components and can easily be used at 1-10 Gbps rates; it is the shaper component that carries the high CPU cost (and in the end all shapers, be it TBF, HTB, HFSC, or cake's built-in shaper, face the same problem: they need timely access to the CPU if they want to keep queueing delay low).
I would consider the RPi4B cheap hardware... but it clearly is not a drop-in router replacement for most users, so I think I get your point.
Well, TBF is going to have the same major CPU-intensive component as cake's shaper, so I am not sure that this will change the picture significantly. And DRR should be similarly cheap to fq_codel or cake's equivalent, so sure, you might save a (guessing here) dozen percentage points of CPU load, but that will hardly let you keep using an old single-core MIPS router at Gbps link rates?
My hope is that we find a way to offload the shaper component more to hardware... even though I have no clear idea how to do that without introducing more queueing delay...
It'd be great if switches had hardware TBF and drr built in that you could attach to the egress of various ports. At speeds above say 100Mbps, routers could mainly tag packets and spew them out into the switch... then the switch could send them in chunks of about 0.5 ms at a time; at 100Mbps that's about 6250 bytes, which means about 4 MTU packets at a time, and at gigabit it'd be 10x that much, so 40 MTU packets. Being not quite perfect would still probably only cost 1-2ms of unnecessary delay, which I think we should be happy to pay in order not to have to do it in the CPU.
Switches like the Zyxel GS1900 get close to this... I believe they only police the ports, but they do have WRR with adjustable weights, so you can boost your game packets. But then, that device costs about $100. The TP-Link SG108E is pretty cheap, but it has pretty much dumb, non-adjustable WRR queues and policing...
So we're not there yet. But I think the tech is there, it'd be trivial for someone like TP-Link to get an ASIC that did TBF + drr, it's just not in demand.
Technically, I guess below the DRR classes you'd need something to buffer packets; at these speeds, under a 0.5ms time quantum, you could probably get away with appropriately sized bfifos, but I could also get behind something like PIE or RED or another smarter qdisc that's nevertheless easy to implement in hardware.
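In software terms, the kind of thing such an ASIC would do looks roughly like this (purely illustrative: the device name, the 100 Mbit/s rate, the DSCP match and the buffer sizes are all made-up placeholders):
# token bucket at the port rate, with roughly a 0.5 ms burst (6250 bytes at 100 Mbit/s)
tc qdisc replace dev eth1 root handle 1: tbf rate 100mbit burst 6250 latency 50ms
# DRR below it: one class for boosted (game) packets, one for everything else
tc qdisc add dev eth1 parent 1:1 handle 2: drr
tc class add dev eth1 parent 2: classid 2:1 drr quantum 1514
tc class add dev eth1 parent 2: classid 2:2 drr quantum 1514
# simple byte-limited buffers per class; PIE or RED could go here instead
tc qdisc add dev eth1 parent 2:1 bfifo limit 12500
tc qdisc add dev eth1 parent 2:2 bfifo limit 12500
# EF-marked packets go to the boosted class, everything else to the default class
tc filter add dev eth1 parent 2: protocol ip prio 1 u32 match ip dsfield 0xb8 0xfc flowid 2:1
tc filter add dev eth1 parent 2: protocol all prio 2 matchall flowid 2:2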
They do. You just need to find the right switch.
Can you provide suggestions? Preferably in the $50 range.
My experience:
Zyxel GS1900-24E: has some kind of port speed limit, but if you set it above about 200Mbps it fails to work at those higher speeds. I have gigabit fiber and tried setting the port speed on the switch to 700Mbps, thinking this might work instead of a custom shaper, and wound up getting something like 220Mbps... Not sure what it uses, probably a token bucket... It also has WRR with adjustable weights, so that's nice.
TP-Link SG108E: has port speed limits, and they do seem to work at higher rates... but it has a fixed WRR system that is kinda braindead.
No, I can't provide suggestions for a $50 switch. I can sell you a PoE+ one for $300, though, which can do what you want.
Which is why the right solution is to throw the CPU of an RPi4 at it, because it will do CAKE or HFSC easily for about $99 out the door.