Qosify: new package for DSCP marking + cake

Wouldn't triple-isolate be a better option for "host_isolate" than dual-srchost/dual-dsthost?

1 Like

Not really. dual-srchost will distribute capacity first by IP source address (and within each source address all flows will be treated equally); dual-dsthost will do the same by IP destination address. If the internet-ingress cake instance is configured for dual-dsthost and the internet-egress cake instance is configured for dual-srchost, what you get is fairness by internal IP address, which basically gives all internal machines/IP addresses an equal share of the internet capacity. That is often not exactly what people ideally want, but it is something a lot of users/leaf networks seem willing to live with (and it is clearly better than any single host hogging all bandwidth).

triple-isolate however cannot really offer strict per-internal-IP fairness (since it does not know which of the IP addresses in a packet live inside and which outside the network), so it does something more complicated that mostly approximates what the dual-xxxhost combination offers, except under some conditions often seen during testing.
The beauty of triple-isolate is that it mostly does the right thing without knowing all the necessary details; the beauty of the dual-xxxhost combination is that what it offers ("strict internal IP fairness") is easy to predict and to confirm even with simple testing approaches.

I would vote for trying hard to enable a reasonable dual-xxxhost configuration and only resorting to triple-isolate if the required information cannot be found.
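
For reference, this is roughly what the dual-xxxhost combination looks like at the tc level (a hedged sketch, not qosify syntax; the WAN device name is taken from the config further down, the ifb device name is assumed, and qosify's host_isolate option presumably arranges something equivalent for you):

# egress/upload: give each internal source IP a fair share (nat lets cake look up the pre-NAT address)
tc qdisc replace dev eth0.1074 root cake bandwidth 600mbit dual-srchost nat

# ingress/download, shaped on an ifb device: fair share per internal destination IP
tc qdisc replace dev ifb-wan root cake bandwidth 600mbit dual-dsthost nat ingress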

3 Likes

Oh fine, thanks for the explanation. One more thing: I use the no-split-gso flag for both egress and ingress. I don't know what it does, but it lowers the CPU usage with cake.

I don't know. All I see is that CS7 is sent by my wife's copy of Zoom, traverses MY network and hits my router with that mark. I then remark it down to something that matches my scheme, probably CS4. I VERY much doubt it survives the internet and comes into my router from AT&T at CS7; I'd give that a 0.1% chance. But I haven't actually looked (mainly because my router re-marks everything, so I'd have to capture it before it hits my router, and I haven't bothered). In fact, I think I have my main switch set up to remark everything coming from the internet to CS3, which is my default mark (so I have two tiers I can down-prioritize to, CS2 or CS1, and it's compatible with the QoS on my cheap TP-Link 8-port switches), so I'd have to capture it before the router.
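
For anyone wanting to do the same kind of wholesale re-marking on the router itself rather than on a switch, a hypothetical nftables sketch (the "wan" interface name and the table/chain names are placeholders):

table inet bleach {
	chain wan_in {
		type filter hook prerouting priority mangle; policy accept;
		# remark everything arriving from the internet to the local default (CS3)
		iifname "wan" ip dscp set cs3
		iifname "wan" ip6 dscp set cs3
	}
}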

1 Like

That's the config I have right now with diffserv8. I called all the priority queues "tins"; the lower they are, the lower priority they have in cake.

config defaults
	list defaults /etc/qosify/*.conf
	option dscp_prio tin6
	option dscp_icmp tin7
	option dscp_bulk tin0
	option dscp_default_tcp tin1
	option dscp_default_udp	tin2
	option bulk_trigger_timeout 5
	option bulk_trigger_pps	100
	option prio_max_avg_pkt_len 500

config alias tin0
	option ingress LE
	option egress LE

config alias tin1
	option ingress CS1
	option egress CS1

config alias tin2
	option ingress CS0
	option egress CS0

config alias tin3
	option ingress CS3
	option egress CS3

config alias tin4
	option ingress AF21
	option egress AF21

config alias tin5
	option ingress CS2
	option egress CS2

config alias tin6
	option ingress CS5
	option egress CS5

config alias tin7
	option ingress CS6
	option egress CS6

config device wandev
	option name eth0.1074
	option bandwidth_up 600mbit
	option bandwidth_down 600mbit
	# defaults:
	option mode diffserv8
	option ingress 1
	option egress 1
	option nat 1
	option host_isolate 1
	option autorate_ingress 0
	option ingress_options ""
	option egress_options "ack-filter"
	option options "no-split-gso overhead 42"

Then I mark every rule with the tin I want.

# DNS
tcp:53		tin5
tcp:5353	tin5
udp:53		tin5
udp:5353	tin5

# NTP
udp:123		tin7

# SSH
tcp:22		+tin4

# HTTP/QUIC
tcp:80		+tin3
tcp:443		+tin3
udp:80		+tin3
udp:443		+tin3
1 Like

Well, Linux allows passing meta-packets through the networking stack (the kernel pretends that a number of mergeable segments of a flow are actually one bigger segment, which then incurs the per-packet cost of in-kernel processing just once instead of once per original segment, resulting in a nice efficiency boost*). Unfortunately, these meta-packets also appear as big packets to qdiscs like cake (they are only split back into their constituent original segments at the driver level, IIRC). These metas can be as large as 64KB and will hog the link for considerably longer than a ~1500 byte packet would, and that can be a problem: all packets queued behind such a meta now have to wait for at least the meta's transmission time (which on a slow link can easily exceed cake's sojourn target), which increases the probability of dropping/marking packets in other flows. And that runs counter to cake's flow-queueing design.
The solution is to split metas back into their constituent segments already in cake, which allows different flows to be multiplexed better and does not "punish" other flows for the long transmission time hogged by large meta-packets. By default cake will split metas unless the configured rate is >= 1 Gbps.

I would strongly recommend keeping the gso-splitting defaults, as cake does the right thing for most users.

*) This brings in one of the advantages that jumbo frames have, a lower packets-per-second rate as far as things like firewalling and routing are concerned, while staying compatible with the MTU <= 1500 byte internet as it exists today.
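
In tc terms this is the split-gso/no-split-gso pair of cake options (a sketch, assuming cake sits at the root of the WAN device from the config above):

# cake's default below ~1 Gbps: split super-packets back into individual segments
tc qdisc change dev eth0.1074 root cake split-gso

# the CPU-saving variant; only sensible when the link is fast enough that a
# 64KB meta-packet no longer dominates the sojourn time
tc qdisc change dev eth0.1074 root cake no-split-gso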

Thanks. CS7 is an unusual choice; or rather, it indicates that the RFCs proposing specific DSCPs are mostly advisory and any DSCP domain can do whatever it pleases...
One more reason to split e2e intent from the current marking, but this is not the right place to advocate for that.

1 Like

Why do you do this? ping allows you to select a TOS value, so you can select which DSCP ping is going to use, thereby allowing you to probe the RTT/latency under load in any priority tin. But if you force ICMP into the highest tin, there is zero dynamic range/selectability left... So if you then use ICMP to measure bufferbloat under load, these ping RTTs will only be helpful in estimating what RTT sparse flows in tin6 might see (while for most users most traffic should end up in tin1, so measuring bufferbloat in tin1 with ICMP by default might be the more interesting measurement).
It is quite possible that for your network and conditions CS5/tin6 is the right thing for ICMP, but others should carefully consider whether that is also true for their own networks.
Especially since cake will give sparse flows a small priority boost anyway, most ICMP packets should already be doing just fine without such specific prioritization.
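
The point being that when ICMP is left alone, each tin can be probed directly (a quick sketch; ping's -Q takes the whole TOS byte, i.e. DSCP << 2, and the tin mapping is the one from the config above):

ping -Q 0x00 example.com   # CS0 -> tin2, the default UDP tin
ping -Q 0x20 example.com   # CS1 -> tin1, where most TCP traffic lands
ping -Q 0xa0 example.com   # CS5 -> tin6, the priority tin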

My impression is that Zoom offers administrators the ability to set a policy, and so most likely someone at USC just logged into the policy server and selected the highest possible value from the menu. I think this indicates why networks should remark everything. Though I do like the idea of separating 3 bits for "application intent" and 3 bits for "network actual".

1 Like

Well, by default it's set to CS6, which makes cake put it into tin7. I took the default value and just pointed the alias at it.

CS6 and CS7 are treated the same; cake puts them both in tin7.

1 Like

Well, serves me right for pontificating about qosify without installing it myself. Sorry, my remark was hence misdirected.

@nbd could we maybe keep ICMP in the besteffort tin, or at least do the "+CS6" thing so that manual TOS settings stay intact?

I will adjust the default for ICMP soon. By the way, the + sign can also be used in combination with aliases (both within the alias and in the reference to the alias), so it's possible to be flexible about where to preserve the existing mark.
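
A minimal sketch of what the "+CS6" variant could look like in /etc/config/qosify (untested, based on the description above; the "+" keeps an already-set mark, e.g. from ping -Q, and only applies CS6 to otherwise unmarked ICMP):

config defaults
	option dscp_icmp +CS6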

4 Likes

Yes they do, but they also have a set of "sensible" recommendations (sensible only in that they are aligned with IETF recommendations, not that these recommendations are necessarily sane).
The biggest issue I see, however, is on the IEEE side: IMHO the default WMM priority system is not fit for purpose, as it lacks an appropriate capacity-sharing concept where capacity share is inversely proportional to priority (I accept that AC_BK is special in that its definition is basically just "left-overs"). Mind you, being able to configure a system like the current defaults, where under load priority and capacity share are directly proportional, is fine, but making it the default seems unwise. (I am also a big fan of strict priority scheduling, albeit only in strictly controlled environments with admission control, just as an example of another system that would be unwise to inflict on laypersons :wink: )

Just note that this "boosting" of sparse flows is so elegant because it is based on the observed behavior of a flow and not on some artificial classifier that can easily be gamed; and if load increases enough that sparse flows start to build up queues as well, that boosting ceases (because it is no longer justified; new flows however still receive that initial boost, helping to establish new connections on an otherwise saturated link).

Yes, this is a big reason why I have found that classifiers based on packet sending rates do a lot of the heavy lifting; I don't actually use port-based rules at all. In my nftables script I have something like this:

          ## large transfers over 3 seconds long at 100Mbps (37.5 MBytes) get deprioritized
          ip protocol tcp tcp sport != {2049} ip dscp < af41 ct bytes ge 37500000 ip dscp set cs1
          ip6 nexthdr tcp tcp sport != {2049} ip6 dscp < af41 ct bytes ge 37500000 ip6 dscp set cs1

and

          ip protocol udp ip dscp < cs5 udp dport != {http,https,domain,51820,51821} udp sport != {http,https,domain} meter udp4meter {ip saddr . ip daddr . udp sport . udp dport limit rate over 200/second burst 100 packets } counter ct mark set 0x55
          ip6 nexthdr udp ip6 dscp < cs5 udp dport != {http,https,domain,51820,51821} udp sport != {http,https,domain} meter udp6meter {ip6 saddr . ip6 daddr . udp sport . udp dport limit rate over 200/second burst 100 packets } counter ct mark set 0x55

          ct mark 0x55 numgen random mod 10000 < 5 ct mark set 0x00 comment "small probability to unmark over-threshold connections"

          ## prioritize small packet udp flows on any ports:
          ct mark != 0x55 ip protocol udp ip dscp < cs5 udp dport != {http,https,domain,51820,51821} udp sport != {http,https,domain} ct avgpkt 0-450 counter ip dscp set cs5
          ct mark != 0x55 ip6 nexthdr udp ip6 dscp < cs5 udp dport != {http,https,domain,51820,51821} udp sport != {http,https,domain} ct avgpkt 0-450 counter ip6 dscp set cs5

The first rules down-prioritize long-running bulk transfers based on their total transfer so far... The second set of rules marks UDP flows that go above 200 pps, gives them a small chance to be unmarked (so that if they only briefly exceed 200 pps they can eventually recover), and then, for flows with packets smaller than 450 bytes that stay under the 200 pps limit, puts them in a high-priority group (CS5).

by itself this pretty much handles all gaming and voip on my network, though I still have separate rules specifically for my VOIP server.

(note: 450*8*200 = 720000 bits/sec so with most games needing at most a few hundred kbps this is well above the rate most games send at)
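
To see whether rules like these actually match on a given network, the live counters can be inspected (a quick sketch; adjust the grep to taste):

nft list ruleset | grep -E 'counter packets|ct mark set'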

3 Likes

This is really interesting. What obstacles do you think we could face in building on nftables in OpenWrt?

I've been thinking a bit more about how to properly handle the criteria for prioritizing and de-prioritizing flows based on observed behavior, and how to make it more flexible.
Using the aliases to specifically opt into both behaviors separately seems to be a very simple solution.
Example:

in /etc/config/qosify:

config defaults
    option dscp_default_udp udp_default

config alias udp_default
    option prio_boost 1
    option bulk_detect 1
    option ingress CS4
    option egress CS4

With this in place, whenever you want to use prioritization or de-prioritization, you simply use an alias that has the appropriate flags. Specifying DSCP values directly is of course still supported, but in that case it will simply skip those checks.

What do you think?

1 Like

Ping should always be either besteffort or worse.

https://www.bufferbloat.net/projects/bloat/wiki/Wondershaper_Must_Die/

I go back to my original comment: marking from 0 -> something else is OK, but especially in the case of trying to use ping as a diagnostic tool, preserving the sender's marking is a good idea.

1 Like

Why do you think prioritizing udp in general is a good idea in the age of quic?

1 Like

Eh. When I use ping, I use it to figure out how well my priority packets will behave, since I intentionally treat those other priorities roughly and make them wait. So I typically remark pings to the same priority as game packets. If they're delayed, I know that games or VoIP will be too.

Yes, I could leave it all alone and manually set priorities with ping, but this would be a more viable option if ping weren't so stupid at the command line: -Q TOS, where TOS is a decimal or hex number... no thanks. -Q CS4 or -Q AF41 would be totally fine with me.
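
For reference, the numeric equivalents (TOS byte = DSCP << 2; the host is a placeholder):

ping -Q 0x80 192.0.2.1   # CS4  (DSCP 32)
ping -Q 0x88 192.0.2.1   # AF41 (DSCP 34)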

1 Like