NFtables and QoS in 2021

Haha, that's awesome! Thanks for sharing that story, this kind of stuff is exactly why we do this! :slight_smile:

1 Like

Totally agree with this. Without CAKE my LTE connection was not at all usable for Teams or Zoom. Now I use it all the time and this helps me to work from home in the Scottish Highlands with only my 4G connection. And with help from individuals like you @dlakelan, @moeller0 and @dave14305 I have managed to get it to work in the context of VPN pbr (to overcome ISP throttling) and despite br-lan and br-guest networks. Fantastic.

I'm curious about how much DSCP marking is necessary to make Teams and Zoom work well despite saturation of a connection with bandwidth > 10Mbit/s. I understand that that may hinge upon how effective the sparse flow optimisation is in CAKE and otherwise how CAKE manages flows lumped together in 'besteffort'.

@tohojo how does the sparse flow optimisation work in CAKE, or otherwise for flows lumped together in 'besteffort'? Say I have an ongoing Teams or Zoom or WhatsApp call and saturate the connection with a Windows Update and OneDrive transfer - how well will CAKE work in prioritising the latency-sensitive flow(s) here? How does CAKE decide what to prioritise?

My subjective experience using an LTE connection with a capacity that varies between 10-70Mbit/s (using the bash cake-autorate script) is that Teams and Zoom work very well even during intense saturation of the network, e.g. through Netflix or OneDrive transfers, but aside from 'tc -s qdisc' I don't know how to see what CAKE is doing to the flows or which packets actually get dropped etc.
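
For what it's worth, the most I know to do is watch the counters refresh (the interface name below is a placeholder for wherever cake is attached):

```shell
# Refresh cake's statistics every second; the sp_flows/bk_flows rows show how
# many flows cake currently treats as sparse vs bulk, and the drops/marks rows
# show what it is doing to them ("eth0" is a placeholder for the shaped device).
watch -n 1 'tc -s qdisc show dev eth0'
```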

I'd love to be able to peer in and see what CAKE is prioritising, or at least more fully understand the sparse flow optimisation that apparently mostly ameliorates the need for DSCP marking at all (as you write above, besteffort seems to get us 90% of the way there).

No amount of DSCP marking is going to achieve that (at least not without a scheduler that gives different sets of DSCPs differential service, i.e. capacity share and/or lower latency) :wink:

I tend to agree with @tohojo that cake/fq_codel basically do the right thing out of the box, and equitable capacity sharing (either per flow, or per-IP-per-flow) has a lot going for it... E.g. it is the single policy that guarantees forward progress for all (or nobody). IMHO, unless a scheduler has more reliable information about the relative importance of different flows' packets, FQ is the best available strategy, or rather the least worst.
That said, in the home network context the situation occasionally arises that capacity is so tight that the equitable share does not suffice for some traffic deemed important; in that case users desire "active targeted unfairness", and in the context of cake DSCPs allow some coarse targeting of unfairness (in diffserv4 the video tin gets up to 50% of capacity at elevated priority)...

Cake will give a gentle priority boost to those flows that do not queue up more than their quantum's worth of bytes, IIRC. Essentially that treats "well-behaved" traffic sources that spread out their packets slightly better than those sources that do not. (I think the "slightly" and "gentle" parts are somewhat important, as there are applications, like real-time games, that tend not to behave that way: all packets describing the latest world state need to be "crashed" through to the end-user ASAP for best game performance, and world state can easily require more than a small quantum's worth of packets.)

For this you would need to get two packet captures, one before and one after cake, and then compare which packets went missing...

1 Like

As usual very informative. Can I be cheeky and ask for even more? Well you know me by now...

Can you possibly elaborate with an example like this. Let's say I have the following flows:

  • active Teams call
  • active Zoom call
  • active Windows Update
  • active iOS update
  • active Netflix traffic

How does the 'out of the box' flow fairness / sparse optimisation result in somehow cutting back more on the Windows Update / iOS Update / Netflix traffic flows, but permitting the Teams and Zoom calls to go on uninterrupted?

And say I have a localized capacity dip in my LTE connection that goes down to 12Mbit/s and let's say the cake-autorate script set the CAKE bandwidth correctly to say 10Mbit/s - during this choke point what happens?

Like is it that say you have 10Mbit/s and 5 active flows it will default to 2Mbit/s per flow. And the idea is that Teams or Zoom will both generally be less than 2Mbit/s and the others will try to push for more, so the others will get punished more? That is presumably not the same as 'sparse flow optimisation' - or is that exactly what 'sparse flow optimisation' relates to?

I think examples like this could be helpful to better understand the need for DSCP marking, or lack thereof.

What you described was per-flow fairness, and definitely not the same as sparse flow optimization. A sparse flow might be something like the DNS lookups needed to let browsing make forward progress... A couple packets get sent, a few come back... then a big gap before you load the next page and maybe need to do some more DNS lookups.

A voip call with 20ms between packets might also be a sparse flow. In its "lane" most of the time there are 0 packets, then occasionally 1... but never 10 or 15.
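
To put a rough number on it (the packet size here is just an assumption for illustration):

```shell
# A VoIP-ish flow: one ~200-byte packet every 20 ms, i.e. 50 packets/sec.
pkt_bytes=200
pkts_per_sec=50
voip_bps=$(( pkt_bytes * 8 * pkts_per_sec ))
echo "$voip_bps"   # 80000 bps, i.e. ~80 Kbps - tiny next to a saturating flow
```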

1 Like

The sparse-boost only works if Teams and Zoom do not exceed their quantum. However, even if they do and end up qualifying as bulk flows for cake, with only 5 applications there will only be a small round-robin delay for all flows. As long as Teams and Zoom work okay with their equitable capacity share you should be good to go, independent of whether they qualify as sparse flows or not.
One of the nice things about the sparseness boost is that even on a loaded link it will still allow newly starting flows a foot "in the door" so that they can start to grow up to their capacity share.

Each of your flows (assuming a single flow per application) will get at best 10/2 = 5 Mbps of gross rate; if Teams/Zoom are fine with that you are golden, otherwise they probably need to downscale the quality/bitrate a bit...

Others will get the slack capacity in this model: say out of your 5 flows, 1+2 are for some reason limiting themselves to 500 Kbps each; the other flows would then get up to (10-(2*0.5))/3 = 3 Mbps gross rate. However if in your Zoom and Teams calls you start to transfer big files (assuming that is possible, I know that worked in Skype, never tried in Zoom/Teams), each flow will get 10/5 = 2 Mbps again.
(Please note that for simplicity we are talking about gross rates here to avoid dragging overhead accounting into this issue).
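
To make the division explicit, a quick back-of-the-envelope (gross rates in Mbps, numbers from the example above):

```shell
# 10 Mbps shared by 5 greedy flows; then flows 1+2 self-limit to 0.5 Mbps each
# and the remaining 3 flows split the slack between them.
rate=10
equal_share=$(awk "BEGIN{print $rate/5}")            # all 5 flows greedy
slack_share=$(awk "BEGIN{print ($rate - 2*0.5)/3}")  # per remaining flow
echo "$equal_share $slack_share"   # prints: 2 3
```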

OK, here is an example: let's assume you actually want to twitch-stream yourself playing solitaire at 5 Mbps gross CBR with 4 other capacity-seeking flows active. You would be in a world of pain, as the other flows each will move up to their capacity share, resulting in only 2 of 5 Mbps for twitch.
Now if you use diffserv4 and make sure to only DSCP-mark the twitch traffic CS2 (which maps into the Video tin) and all other packets CS0, twitch will now get access to the Video tin's 50% of capacity... et voilà, you can twitch-stream while other greedy traffic is happening in the background.
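
A minimal sketch of that marking with nftables (the table/chain names are made up for illustration, and the port is an assumption; RTMP streaming commonly uses TCP 1935):

```shell
# Mark only the stream's packets CS2 so cake's diffserv4 puts them in the
# Video tin; everything else stays CS0/best effort by default.
nft add table inet dscpmark
nft add chain inet dscpmark post '{ type filter hook postrouting priority mangle; }'
nft add rule inet dscpmark post tcp dport 1935 ip dscp set cs2
```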

That is what I call targeted unfairness... IMHO cake made the right call in keeping the interface simple, but it might have been a good idea to make the tins' capacity-share percentages configurable (to allow targeted unfairness for 1-2 applications).
I note that this is simpler with simple.qos (HTB+fq_codel), as one can simply modify the guaranteed rate of the tins. (Note: the guaranteed rate is only guaranteed if the sum of these rates stays <= 100% of the total.)
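
For comparison, the HTB variant looks roughly like this (interface and rates are placeholders; the point is simply that each class carries an explicit, tunable rate):

```shell
# Two-class HTB + fq_codel sketch: the guaranteed "rate" is configurable per
# class, and classes may borrow up to "ceil" when there is slack capacity.
tc qdisc add dev eth0 root handle 1: htb default 12
tc class add dev eth0 parent 1:  classid 1:1  htb rate 10mbit ceil 10mbit
tc class add dev eth0 parent 1:1 classid 1:11 htb rate 5mbit  ceil 10mbit  # "video"
tc class add dev eth0 parent 1:1 classid 1:12 htb rate 5mbit  ceil 10mbit  # best effort
tc qdisc add dev eth0 parent 1:11 fq_codel
tc qdisc add dev eth0 parent 1:12 fq_codel
```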

+1; video calls however are somewhat different beasts: occasional key frames might require a bunch of packets, so they can occasionally exceed their quantum and get demoted to bulk flows.
As I tried to explain, being a bulk-flow in cake is still pretty decent/usable.

1 Like

This is more or less correct, except I think video conferences can easily be more than 2Mbps. So this is getting into the region where unfairness could be beneficial.

Let's say you have 10Mbps symmetric, and you have two people in the household doing video conferences. Let's say that you're using something like Jitsi, and by default you're sending 2Mbps of voice and video up and each person you're watching is sending you say 700kbps of video/voice and there are 5 other participants... (this is based on a little testing I did this morning on Jitsi). So now you've got 2000kbps up and 3500kbps down for this conference. (note that Jitsi WILL adjust bandwidth but suppose our goal is to have this bandwidth be fairly constant and that people don't freeze or go blurry during the conf)

Now there are 2 of these conferences... one for you and one for your wife say.

So, conference traffic is 4000kbps up and 7000kbps down. In addition to those two conferences, there is a windows update and an iOS update going on, and let's say your kids are streaming netflix.

In order for this to work the way you want it to, you'll want the windows update, iOS update, and Netflix to all share 3000kbps download while the conferences get 7000kbps download.
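
Adding that up (Kbps, numbers as above):

```shell
# Two identical conferences on a 10 Mbps symmetric link.
up=$(( 2 * 2000 ))            # each conference sends ~2000 Kbps
down=$(( 2 * 3500 ))          # each receives 5 x 700 Kbps
spare_down=$(( 10000 - down ))
echo "$up $down $spare_down"  # prints: 4000 7000 3000
```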

Ideally you don't want your kids complaining too much, so their Netflix should maybe get half of the 3000kbps download, and then Windows and iOS updates should get what's left. Except by the way Netflix tends to "pulse" packets in big chunks. So they'll want 3000Kbps for say 100ms and then nothing for 100ms

You can see where in these circumstances, maybe the per stream or per IP or etc fairness isn't going to do it.

On the other hand, if you open up your throttle to 50Mbps... none of this will matter, and everyone can fit in their allotment.

2 Likes

Thank YOU guys. I literally spent multiple days walking my friend through installing OpenWrt on a gl-inet router, configuring several APs, connecting up powerline networking equipment and configuring cake in 2 or 3 spots on his network (both at the router towards the WAN and across the powerline segments to keep them from saturating).

His wife was teaching anatomy classes, needed a camera on her face, a camera on her desk where she was showing images and drawing diagrams, and I think a camera on a microscope. All that just for her. They had something like a 600/60 DOCSIS connection, and in addition to all the interactive teaching she was doing, he was in video meetings all day and both their kids in online school, plus they were living with his parents during the stay-at-home period. So there were maybe 6 or so devices video streaming at any given moment throughout the day. The full download capacity of that internet connection was close to 10x the powerline speed...

Expanding to 3 APs, ethernet over powerline, and cake to limit all the bottlenecks including over the powerline (which at best was getting something like 70Mbps reliably...) with Diffserv4 and prioritization and network segmentation and all that stuff... they could seamlessly walk through the house roaming from AP to AP with everyone getting 0 latency issues after a week of configuration....

Before all that work no-one could understand any of her lectures at all and she was literally in danger of losing her job. The fact that she kept her job is undoubtedly directly responsible for them saving up and affording to buy a house last year. So easily we're talking a few hundred thousand in value to their family alone from all that work you guys did. Do NOT underestimate the literally billions of dollars of real value you guys created during the pandemic.

8 Likes

Thanks guys for working through these helpful examples.

Shouldn't that be 1 of 5 Mbps (as 5/5?).

And so also this 10/5 = 2 Mbps?

I think I must be missing something here.

BTW what is the time period for determining flows and allocating bandwidth shares? Is that configurable? Presumably that's fairly important. I mean depending on the time period there will be different numbers of active flows and different bandwidth shares.

To expand on the last paragraph, whereas we have been thinking about a fixed number of simultaneous flow-generating services, I presume all it takes to interrupt a latency-sensitive flow is an unfortunate temporally localized convergence of bandwidth hungry flows that squeeze the available capacity for the latency-sensitive flow so much that its corresponding latency-sensitive application shows stuttering.

I presume the time window to assess numbers of concurrent flows has a bearing on the above?

Does CAKE have something to mitigate against this? Or perhaps for one reason or another this is not such a big issue.

The qdisc only sees flows that are currently backlogged; it has no notion of an 'ongoing' flow other than that. There's no explicit 'allocating of bandwidth'; rather, the qdisc simply alternates between flows, dequeueing one packet (or quantum's worth) from each at a time.

So yeah, you can DoS it by spraying out a bunch of packets to different IP/port numbers at once. And if a burst of new flows appear at once that will pretty much instantly cut down on the bandwidth available to bulk flows. Note that a flow is only bulk if it continually uses at least the share of bandwidth available to it; I wrote a detailed analysis for FQ-CoDel here: https://doi.org/10.1109/LCOMM.2018.2871457

The scenario in your second graph can happen, but for most normal traffic it rarely (hand-wave) does, because user behaviour (e.g., clicking on things) tends to be nicely random, which spreads things out.

CAKE does not explicitly mitigate this, other than by the host fairness: if that is enabled, all flows originating from a single host will get counted together for that host's "share", so they'll get throttled together, which can lessen the effect of, for instance, a single host starting up a bunch of flows at once.

2 Likes

I was trying to stick to your 10 Mbps example, though not very artfully... so 10 divided by 5 gives 2 Mbps per flow (assuming flows that would consume more than 2 Mbps if left to their own devices).

Yes, you are correct... that should teach me not to respond to posts while simultaneously preparing a presentation for work... sorry.

IIUC, this is where DRR++ (deficit round robin) comes in, as that is what cake uses to select the next queue to service.

Not really; in the end cake will essentially service all queues containing packets in round-robin fashion. So the number of occupied queues will have an effect on the minimal inter-packet interval for each queue. Note that by default cake uses 1024 queues, and will tolerate one packet per queue even under severe dropping conditions. Hence the worst-case capacity share is shaper-rate/1024. For our 10 Mbps example that results in slightly less than 10 Kbps per flow....
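
Spelled out (rate in Kbps):

```shell
# Worst case: the shaper rate divided across all 1024 occupied cake queues.
worst=$(awk 'BEGIN{printf "%.2f", 10000/1024}')
echo "$worst"   # prints: 9.77 (Kbps per queue)
```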

This is where the host-isolation modes in cake shine: they will restrict most of the fall-out from using excessive numbers of concurrent flows to the IP address actually sourcing or sinking those flows.
So make sure not to do your AWS S3 syncs and bittorrenting from your video conferencing computer, at least not during a conference.
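
For reference, the relevant cake options (option names are from the tc-cake man page; interface names and the rate are placeholders):

```shell
# dual-srchost on egress / dual-dsthost on ingress: each internal host gets an
# equal share, with per-flow fairness inside each host's share. "nat" makes
# cake consult conntrack so it sees internal addresses despite masquerading.
tc qdisc replace dev eth0     root cake bandwidth 10mbit dual-srchost nat
tc qdisc replace dev ifb4eth0 root cake bandwidth 10mbit dual-dsthost nat
```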

1 Like

So if I put certain traffic in diffserv4's video tin (the 50% one), does that mean that traffic, if in a fight, will get up to 50% of bandwidth, and if not in a fight, up to 100%?

So I could e.g. ensure my work PC will get at least 50% if competing with other flows? And so around 4-5Mbit/s at worst congestion in my circa 10-70Mbit/s LTE?

Is there not a complicated interaction between such tinning and the sparse flow boost? If you start tinning stuff won't that mess with the sparse flow boost?

I am trying to work out if tinning could end up somehow adversely affecting performance of Teams or Zoom.

Because then the sparse flow boost is isolated to each tin rather than global across all flows? Or will the sparse boost never give more than 50% anyway, so perhaps this concern is invalid.

yes

Generally not a good idea to put all of a general purpose computer's traffic in a higher prio tin. If it's like a gaming console or IP camera or something where almost all the traffic it uses is high prio, then that'd be fine. But you don't want your work computer to get windows updates at high priority most likely.

Not really. If you're putting just the video conference stuff into a video tin, it's going to almost always help that vid conf be more stable than otherwise.

sparse flow boost is a real thing, but I wouldn't rely on it for known important traffic. Stuff like a VOIP phone or a video conference stream should be bumped to Video tin if you can easily identify it. It's very unlikely you will experience problems. On the other hand, if you accidentally start bumping other stuff too, that could cause problems. So you want very targeted bumps. Fortunately nftables makes targeting that stuff relatively easy.

1 Like

That's helpful. But whereas prioritising my whole work computer traffic is easy, figuring out how to prioritise MS Teams, Zoom and Calling over WiFi or WhatsApp traffic is not easy because the ports seem to be variable and subject to change.

That's why I have given serious consideration to having Windows 11 mark DSCPs and then trying to have nftables work out the marks for download packets based on associated upload packets relating to a tracked connection, but I understand that doing that in nftables ingress hook is not possible, so I'd need egress hook on br-lan/br-guest and that's not available yet. Or?

Would you say most of them are UDP at least? You can prioritize UDP from your work computer to minimize the unintended consequences.

I'll look into that.

BTW what's the status with having nftables mark download packets based on the identified marks associated with upload packets associated with tracked connections? Did you get that to work? I'm wondering what the state of this is and whether I could somehow make it work.

I would say this is the easiest part, see below for Teams:

-p udp -m multiport --sports 50000:50019 -m multiport --dports 50000:50019 -m comment --comment "Teams P2P EF VO" -j DSCP --set-dscp-class EF
-p udp -m multiport --sports 50020:50039 -m multiport --dports 50020:50039 -m comment --comment "Teams P2P AF41 VI" -j DSCP --set-dscp-class AF41
-p udp -m multiport --sports 50040:50059 -m multiport --dports 50040:50059 -m comment --comment "Teams P2P AF21 VI" -j DSCP --set-dscp-class AF21
-p udp -m multiport --sports 50000:50019 -m multiport --dports 443,3478:3481 -m set --match-set msteams dst -m comment --comment "Teams EF VO" -j DSCP --set-dscp-class EF
-p udp -m multiport --sports 50020:50039 -m multiport --dports 443,3478:3481 -m set --match-set msteams dst -m comment --comment "Teams AF41 VI" -j DSCP --set-dscp-class AF41
-p udp -m multiport --sports 50040:50059 -m multiport --dports 443,3478:3481 -m set --match-set msteams dst -m comment --comment "Teams AF21 VI" -j DSCP --set-dscp-class AF21

#ipset msteams
52.112.0.0/14
52.114.20.0/22
20.64.0.0/10
52.113.76.0/22
13.107.64.0/18
52.120.0.0/14

The same applies to Zoom, just look for the info online, it's available on their website; for WhatsApp, just match your UDP packets against the negation of the msteams ipset, as they use a different set of servers but the same ports.
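
In case it helps, a rough nftables translation of the first and fourth rules above (an untested sketch; the table/chain layout and the `msteams` set mirroring the ipset are assumptions):

```shell
table inet qos {
	set msteams {
		type ipv4_addr
		flags interval
		elements = { 52.112.0.0/14, 52.114.20.0/22, 13.107.64.0/18 }
	}
	chain forward {
		type filter hook forward priority mangle; policy accept;
		udp sport 50000-50019 udp dport 50000-50019 ip dscp set ef comment "Teams P2P EF VO"
		udp sport 50000-50019 udp dport { 443, 3478-3481 } ip daddr @msteams ip dscp set ef comment "Teams EF VO"
	}
}
```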

It was working at the time with some ugly looking (but functional) compromises to save the DSCP to the connmark.

Your challenge would probably be that you would not be masquerading traffic on the LAN bridges for any conntrack mark functions to work with act_ctinfo.
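
For the record, the restore side looks roughly like this (interface names are placeholders, the masks follow the common convention of stashing the DSCP in the top six bits of the conntrack mark, and the egress-side saving of DSCP into the connmark is assumed to be done elsewhere, e.g. by nftables):

```shell
# Redirect ingress traffic to an ifb, restoring each packet's DSCP from the
# conntrack mark via act_ctinfo before cake on the ifb classifies it.
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all matchall \
	action ctinfo dscp 0xfc000000 0x01000000 \
	action mirred egress redirect dev ifb4eth0
```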

1 Like

On the one hand that does look ugly and on the other hand impressive nftables coding! Is there hope for this getting simplified in the future with further nftables development? You mentioned a patch earlier?

That's where DNS+(ip)set is handy. These services use large ranges which may overlap with other services.

1 Like