NFtables and QoS in 2021

yelreve · August 31, 2022, 8:37am

From a high level, I've implemented my classification logic on the input (to router) and postrouting (forwarded/instigated by router) hooks, rather than on the device ingress hooks.

The main rationale here is so that I have full visibility of translated client/host IP addresses.

I've two classification methods:

Compare the first non-connmarked packet travelling in the original connection direction against pre-defined rules; any packets that do not match a rule are marked with the dynamic bit.
Any packets with a dynamic conntrack bit are sent to the dynamic chain.
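
To make the first-packet logic above a little more concrete, here is a minimal Python sketch of it. The mark layout (`CLASS_MASK`, `DYNAMIC_BIT`) and the contents of `STATIC_RULES` are assumptions made up for illustration; they are not taken from the actual nft script.

```python
# Hypothetical conntrack mark layout: the low 6 bits hold a DSCP class,
# one extra bit flags the connection for the dynamic chain.
CLASS_MASK = 0x3F
DYNAMIC_BIT = 0x40

# Hypothetical pre-defined rules: (protocol, destination port) -> DSCP value.
STATIC_RULES = {
    ("udp", 5060): 46,  # example VoIP port -> EF
    ("tcp", 22): 16,    # example interactive port -> CS2
}


def classify_first_packet(proto, dport, ct_mark=0):
    """Return the conntrack mark after a packet in the original direction."""
    if ct_mark != 0:
        return ct_mark  # connection already marked, leave it alone
    dscp = STATIC_RULES.get((proto, dport))
    if dscp is None:
        return DYNAMIC_BIT  # no rule matched: hand over to the dynamic chain
    return dscp & CLASS_MASK


def needs_dynamic_chain(ct_mark):
    """True if the connection carries the dynamic bit."""
    return bool(ct_mark & DYNAMIC_BIT)
```

With these made-up rules, a new TCP connection to port 443 matches nothing and ends up with only the dynamic bit set, so later packets of that connection would go to the dynamic chain.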

The dynamic chain has 3 mechanisms:

Firstly the first packet of any new connection is sent to a connection threading detection chain
- This identifies any hosts which open >2 connections to the same service and IP range within 5 seconds.
- This identifies any hosts which open >10 connections from the same source port within 5 seconds.
Any future packets which match one of the identified threaded sets are then assessed for reducing their DSCP class from CS0 to AF11/CS1.
- The first set works well at classifying threaded http downloads like steam.
- The second set works well at classifying P2P traffic (and importantly moving it into cake's bulk tin)
The third mechanism emulates some of @ldir's iptables work and will increase the priority of low throughput small packet UDP streams (intended to capture game traffic)
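
The first threading rule above (more than 2 connections to the same service and IP range within 5 seconds) can be sketched roughly as follows. This is a simulation in Python, not the nft implementation; the /24 prefix used for "IP range" and the class name `ThreadDetector` are assumptions, and flagged entries never expire in this sketch.

```python
from collections import defaultdict, deque
import ipaddress

WINDOW_S = 5.0   # detection window mentioned in the post
THRESHOLD = 2    # flag hosts opening more than this many connections


class ThreadDetector:
    def __init__(self, prefix_len=24):
        self.prefix_len = prefix_len      # assumed size of the "IP range"
        self.recent = defaultdict(deque)  # key -> timestamps of new connections
        self.threaded = set()             # keys flagged as threaded

    def _key(self, src_ip, dst_ip, dst_port):
        net = ipaddress.ip_network(f"{dst_ip}/{self.prefix_len}", strict=False)
        return (src_ip, str(net), dst_port)

    def new_connection(self, now, src_ip, dst_ip, dst_port):
        """Record a new connection; return True if the key is flagged."""
        key = self._key(src_ip, dst_ip, dst_port)
        times = self.recent[key]
        times.append(now)
        # Drop timestamps that have fallen out of the window.
        while times and now - times[0] > WINDOW_S:
            times.popleft()
        if len(times) > THRESHOLD:
            self.threaded.add(key)
        return key in self.threaded
```

The second rule (many connections from the same source port) would use the same structure with a key of source IP and source port and a threshold of 10.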

No further logic occurs in the input chain (since it's already been received by its target host).
Packets in the postrouting chain are then DSCP marked either using the CT to WMM map (if exiting via LAN), or CT to DSCP map.
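
As a rough illustration of the "two maps" idea, here is a small Python sketch. The class names, the codepoints in each map and the interface name `br-lan` are assumptions for illustration only. The `wmm_access_category` helper assumes an AP that takes the top three DSCP bits as the 802.11 user priority; real APs differ on this.

```python
# Hypothetical codepoints and maps; illustrative only.
EF, CS6, AF41, CS0, CS1 = 46, 48, 34, 0, 8

CT_TO_DSCP = {"voice": EF, "video": AF41, "default": CS0, "bulk": CS1}
CT_TO_WMM = {"voice": CS6, "video": AF41, "default": CS0, "bulk": CS1}

LAN_INTERFACES = {"br-lan"}  # assumed LAN-facing interface name


def egress_dscp(ct_class, out_iface):
    """Pick the DSCP for a classified connection based on the egress interface."""
    table = CT_TO_WMM if out_iface in LAN_INTERFACES else CT_TO_DSCP
    return table[ct_class]


def wmm_access_category(dscp):
    """Access category, ASSUMING user priority = top three DSCP bits."""
    user_priority = dscp >> 3
    return {
        1: "AC_BK", 2: "AC_BK",
        0: "AC_BE", 3: "AC_BE",
        4: "AC_VI", 5: "AC_VI",
        6: "AC_VO", 7: "AC_VO",
    }[user_priority]
```

Under that assumption EF would land in AC_VI and CS6 in AC_VO, which is the kind of vendor-dependent difference a separate LAN-side map is meant to smooth over.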

One could argue that the majority of forwarded packets from WAN to LAN should already have the correct DSCP due to the ctinfo restore (and therefore they could be skipped rather than marked in postrouting), assuming they'd been classified previously and we don't want to use WMM retagging.

I do still think that the WMM marking is something I'd like to explore. As noted by @amteza, OpenWrt APs themselves actually do put EF in the VO bucket, as do a lot of commercial routers, but some don't (and I would like consistency in my network regardless of the AP vendor).
I guess the argument of diminishing returns is valid, but it doesn't really introduce any more overhead since it's just using an alternative map.

moeller0 · August 31, 2022, 8:54am

Overall, your design makes sense, offering quite a lot of points for intervention. Personally I would probably be happy with a more minimalistic set, but then I barely use prioritization at all, so might simply lack the understanding why these different points are attractive to have ;).

Well the more generic method here is not to remark DSCPs to match the WMM mapping rules but to use OpenWrt's iw_qos_map_set to change the DSCP to AC mapping to match one's intentions.
As I said, there is little empirical justification for putting EF into AC_VO besides the spurious correlation that EF is often used for voice packets and AC_VO expands to access category (for) voice. IMHO that is a bit thin of a justification to use something with noticeable side-effects like AC_VO... using AC_VO will noticeably diminish your total aggregate WiFi throughput (which can well be justified, but it is hard to do so without actually measuring the effects for a network's typical traffic mix). RFC 8325, while a nice and compelling read, offers little/no data supporting the proposed changes, AND assumes that a network actually tries to implement the PHBs for the named DSCPs, which is not necessarily what a home network wants. (However, cake's default DSCP-to-tin maps were inspired by the same RFCs as RFC 8325, so generally these recommendations appear sane in the context of compliance with RFCs, but my objection here is that compliance with reality seems more important, and RFCs are, alas, not necessarily all that bothered to make sense in the existing internet*).
The thing about DSCPs is, the actual bit-pattern does not matter much, what counts is how you treat them (what RFCs call per-hop-behaviour or PHB). Yes the IETF makes some recommendations which DSCP to use for which PHB that can make debugging easier if used consistently but these are no hard and strict rules that need to be followed at all costs**.

*) Some are more aspirational and might work in the internet the authors envision for the future.
**) The consequence of violating IETF RFCs is literally that you are not allowed to claim compliance with the violated RFCs anymore ;).
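
To illustrate the point that the bit pattern is just a label and the treatment is a local decision, here is a short Python sketch. The codepoint arithmetic (CSx = 8x, AFxy = 8x + 2y, EF = 46) is standard; the `HOME_TREATMENT` table is a made-up local policy, not anything prescribed by an RFC or by cake.

```python
def cs(x):
    """Class selector CSx codepoint."""
    return 8 * x


def af(x, y):
    """Assured forwarding AFxy codepoint."""
    return 8 * x + 2 * y


EF = 46


def tos_byte(dscp, ecn=0):
    """DSCP occupies the upper six bits of the TOS / traffic class byte."""
    return (dscp << 2) | ecn


# Hypothetical local policy: which treatment each codepoint gets at home.
HOME_TREATMENT = {cs(1): "bulk", af(1, 1): "standard", EF: "voice"}


def treatment(dscp):
    """Look up the local treatment; unknown codepoints get the default."""
    return HOME_TREATMENT.get(dscp, "standard")
```

Swapping entries in `HOME_TREATMENT` changes behaviour without touching a single packet header, which is the sense in which the per-hop behaviour matters more than the codepoint.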

moeller0 · August 31, 2022, 9:02am

I think classifying based on observed behaviour is a decent approach in general. Not sure your rules are strict enough for my taste. These would likely trigger for speedtests or Apple's networkQuality tool, and I am not sure I would be all that interested in measuring the bulk tin's performance. I would probably rely on per-internal-IP fairness and accept that the host doing the Steam download might not be the best machine to do interactive work on during the download...

I would respectfully first test whether that is a) improving things and b) whether it actually triggers for the game I play. I saw reports of valorant with ~15 players and 128 Hz ticks resulting in 10-20 Mbps ingress traffic, I am pretty sure this would not qualify for this rule, and yet a gamer might really want to prioritize that traffic. I also note that cake automatically tends to give a small boost to traffic with such properties (sparse, small packets) which might make special casing this moot (or not, I have not tested that).
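
A quick way to sanity-check such a rule before deploying it is to run the numbers. The sketch below uses made-up thresholds (`MAX_RATE_BPS`, `MAX_AVG_PACKET_BYTES`) for a "low throughput, small packet" rule; the real thresholds in the script may differ, and the 1200-byte packet size for the heavy flow is also an assumption.

```python
from dataclasses import dataclass

# Hypothetical thresholds for a "low throughput, small packet" UDP rule.
MAX_RATE_BPS = 1_000_000
MAX_AVG_PACKET_BYTES = 450


@dataclass
class FlowStats:
    bytes_seen: int
    packets_seen: int
    duration_s: float

    @property
    def rate_bps(self):
        return self.bytes_seen * 8 / self.duration_s

    @property
    def avg_packet_bytes(self):
        return self.bytes_seen / self.packets_seen


def looks_like_game_traffic(flow):
    """True if the flow stays under both hypothetical thresholds."""
    return (flow.rate_bps <= MAX_RATE_BPS
            and flow.avg_packet_bytes <= MAX_AVG_PACKET_BYTES)


# A light flow: 128 packets/s of 200 bytes for 10 s (about 0.2 Mbit/s).
light = FlowStats(bytes_seen=128 * 200 * 10, packets_seen=128 * 10, duration_s=10)
# A heavy flow: 15 Mbit/s for 10 s in 1200-byte packets (assumed size).
heavy = FlowStats(bytes_seen=15_000_000 // 8 * 10, packets_seen=15_625, duration_s=10)
```

With these assumed numbers the light flow qualifies and the 15 Mbit/s flow does not, which is the concern raised above about busy game servers.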

Lynx · August 31, 2022, 9:18am

I really appreciate that you took the time to explain this. The diagram is fantastic and it seems logical.

From my perspective all of this helps facilitate a better understanding of nftables and linux networking, and is intellectually very intriguing, however I remain dubious about all of this DSCP-related work. @tohojo made the point above that CAKE with besteffort gets you 90% of the way there. That seems apt.

Most users seem to have >50Mbit/s connections. I seem to be the odd one out with a really challenging 4G LTE connection that ranges between 10Mbit/s and 70Mbit/s, yet with CAKE using cake-autorate and besteffort I only very rarely see stutter in Teams or Zoom calls, and only ever during peak congestion when the connection capacity is at its bare minimum of 10Mbit/s. That makes me think that this is all mostly academic. @dlakelan made the point that a router with only a few users is one thing, whereas an office with many users is another. So that is true, but I think most users here fall more into the former category. And so I fear there is a great deal of rearranging of furniture, but it is clearly fun to rearrange, and we are learning about nftables.

Personally since I am dubious about all of this I favour a sparing approach to DSCPs and trying to keep things super simple. Isn't there a danger with approaches like yours that things have gotten overly complicated and are not necessarily being made better? I am open minded, but my gut tells me that these complicated DSCP marking scripts may be more about having fun in writing and thinking about them than actually offering improved overall performance. But I may well be wrong here.

To add a general question into the mix, @moeller0, there isn't an easy way to achieve something like this with the way CAKE works with tins, is there? I would like to prioritise Teams/Zoom if there is Teams/Zoom traffic. If there is no Teams/Zoom traffic I would like to prioritise Netflix/Prime traffic. So I could set up fancy nftables rules to conditionally classify Netflix/Prime traffic as 'video' in the event no Teams/Zoom traffic is present. But I don't think I can somehow signal that I want to prioritise traffic like Teams/Zoom > Netflix/Prime > Windows Update. Or can I? Maybe I can by just placing Windows Update into 'bulk'. Which I might even be able to do in Windows. But then it's not just Windows Update, it's also iPad, iPhone, Android, etc.

It feels like once you start down this DSCP rabbit hole there is no end!

All, I worry about the potential for duplication of efforts here and effort not necessarily giving tangible benefit. Might it not be better if all of us instead put our minds to scrutinising the way the sparse flow boost works in CAKE and seeing whether there is any fruit to be had there, rather than messing about with all of these DSCP shenanigans?

See here: https://doi.org/10.1109/LCOMM.2018.2871457

This is a beautifully generic approach.

And also it would be good if we can somehow actually test the efficacy of DSCP setting vs best effort. Feedback like 'oh my game works so great now' may be nice to see but is not necessarily conclusive.

I mean all of this in a positive spirit. Great minds here and just wondering if we are harnessing the full potential of putting our minds together for the benefit of others.

yelreve · August 31, 2022, 10:34am

Haha yes you're right there, that's the exact behaviour currently - I'd figured in a set and leave environment that wouldn't matter, but on reflection upon your feedback I've removed the second 'bulk' step, and these will only now be marked as AF11 and thus remain in the standard tin.

The P2P (many connections from a single source ip:port) logic still drops to CS1 as I've found otherwise it swamps the standard tin.

Definitely agree Lynx, in this initial script I've tried to come up with something that requires minimal explicit rule setup - but definitely keen to cut the logic back further.

For context behind the script's logic you'll recall I came from a much lower bandwidth LTE link, where downrating connections allowed me to get more out of the limited bandwidth.

On the previous release I was actually using a lua service plugged into netifyd to explicitly recognise connections to netflix, spotify, zoom, whatsapp etc and set the conntracks rather than using dynamic 'approximate' rules like you see in my nftables implementation.

Lynx · August 31, 2022, 10:50am

Is there something like this out there that works pretty well? A generic traffic classifier. Like what Trend Micro does for Asus routers. Something that can be relied upon?

I wondered about the ipset approach mentioned above that uses keywords like zoom in the IP address lookups.

yelreve · August 31, 2022, 10:53am

I'll put a copy of the Lua service on GitHub so you can have a poke around. I've not tested it on 22.03, and was going to see if there was anything simpler that could be achieved without the dependency on netifyd and its deep packet inspection.
It's one of the ways I've found however to accurately detect connections like whatsapp video/voice calls.
It goes without saying though that you'd need a more powerful router - mine runs in a proxmox vm.

moeller0 · August 31, 2022, 10:56am

The problem with that is probably that an ipset populated from DNS snooping will essentially store the IP addresses from which the services in question are served, but the same IP address can well also supply different services that you might not want to prioritize... in other words, this will have false positives if the service operates on shared IPs... This is why I consider marking at the local source (aka the zoom client) to be superior (which will only misclassify the occasional zoom update); however, I have not checked whether a zoom conversation uses a single connection for the in and out traffic, so I can only guess/hope that the ct_info approach works...
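
The shared-IP problem can be shown with a toy example. All host names and addresses below are invented for illustration (documentation address ranges), and `build_ipset` is a hypothetical stand-in for whatever populates the real ipset from DNS snooping.

```python
# Hypothetical DNS answers observed by the router: name -> resolved IPs.
DNS_OBSERVATIONS = {
    "media.zoom.example": ["203.0.113.10", "203.0.113.11"],
    "cdn.videoshop.example": ["203.0.113.11"],  # shares an IP with the above
    "www.news.example": ["198.51.100.7"],
}


def build_ipset(observations, keyword):
    """Collect every IP that was returned for a name containing the keyword."""
    return {ip for name, ips in observations.items() if keyword in name for ip in ips}


def served_names(observations, ip):
    """All observed names that resolved to this IP."""
    return sorted(name for name, ips in observations.items() if ip in ips)


def false_positive_ips(observations, keyword):
    """IPs in the set that also serve names without the keyword."""
    ipset = build_ipset(observations, keyword)
    return {
        ip for ip in ipset
        if any(keyword not in name for name in served_names(observations, ip))
    }
```

Here traffic to `203.0.113.11` would be prioritised even when it belongs to the unrelated service, because the set only knows addresses, not which service a given connection is using.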

Deep packet inspection can help there to decrease the rate of false positive detection, but at a steep cost...

ldir · August 31, 2022, 11:00am

Unfortunately at present the right hand side of a binary operation has to be a constant, so as much as one might want 'nft add rule t c ct mark set ct mark and 0xffff0000 or meta mark and 0xffff' (or net headers or whatever), nft simply doesn't let you 'or' in dynamic values.

Jeremy's patches help solve that.
A possible way around is to use verdict jump tables, but quite frankly this is getting really silly.
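
For readers unfamiliar with the bit arithmetic, this is what the desired expression would compute, written as plain Python rather than nft. The second function only mirrors the idea of the verdict-table workaround (one constant per possible value); the 6-bit field width is an assumption for illustration.

```python
def merge_marks(ct_mark, meta_mark):
    """Keep the upper 16 bits of ct_mark, take the lower 16 bits from meta_mark."""
    return (ct_mark & 0xFFFF0000) | (meta_mark & 0xFFFF)


def merge_via_constant_branches(ct_mark, value):
    """Emulate OR-ing in a dynamic 6-bit value using only constant operands.

    One branch per possible value, each OR-ing in a constant, which is
    roughly what a verdict map with 64 entries would have to do.
    """
    branches = {const: (ct_mark & 0xFFFFFFC0) | const for const in range(64)}
    return branches[value]
```

The first function is a single expression; the second needs one entry per possible value, which is why the workaround gets unwieldy quickly.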

Lynx · August 31, 2022, 11:09am

Does diffserv actually make sense vs simply prioritising traffic? So I mean isn't it easier just to have besteffort plus you say which flows take priority over the others? Why is the latter worse than diffserv? Also with diffserv is priority respected within a tin? So if I have different classifications within the same tin is that ignored or taken into account?

yelreve · August 31, 2022, 11:14am

I encountered the same and did precisely what you describe with verdict maps for my POC.

I also found it looks like we may be missing some kernel config required for using the 'dscp set ct mark map' type functionality, as well as 'ct count' (connection count) with nft meters.
Doing some more investigation around this.

yelreve · August 31, 2022, 11:17am

I'd still recommend a second 'bulk' class tin for P2P traffic at a minimum. It's one class I've never gotten to play well with besteffort due to the sheer number of connections opened. I'm tempted to agree, however, that for most other classes of traffic a single besteffort pot might be simplest.

I might work up a simpler version of the nft script which retains only the P2P and simple 'voice' matching (I've a VoIP phone with a fixed IP), which would be best suited to diffserv3 (losing the video class).
Can likely then drop all the verdict map logic.

Lynx · August 31, 2022, 11:34am

I checked this and it looked OK based on tcpdump using live Zoom/Teams sessions. So we just set DSCPs in the LAN clients on upload and have the router re-apply those on download based on the connection tracking built into Linux.

I really like being able to just set DSCPs in Windows under the local Group Policy Editor:

For anyone thinking about using this approach you also need to set this registry key:

otherwise the values set in the Group Policy Editor are ignored.

So I've set this approach up in my 'cake-dual-ifb' solution. Seems to work. And it's very simple. Just DSCP mark set to conntrack on upload and wash DSCP on download, using nftables, and then restore DSCP from conntrack, using tc-ctinfo.
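
The store, wash and restore flow can be modelled in a few lines of Python. This is only a simulation of the idea: the dictionary stands in for conntrack, the packet dictionaries and function names are invented, and NAT is ignored (real conntrack handles the address translation for you).

```python
conntrack = {}  # flow key -> DSCP stored on upload


def flow_key(src, sport, dst, dport, proto):
    """Direction-independent key so that replies find the same entry."""
    endpoints = tuple(sorted(((src, sport), (dst, dport))))
    return (proto,) + endpoints


def on_upload(pkt):
    """LAN -> WAN: remember the DSCP the client chose for this connection."""
    key = flow_key(pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"], pkt["proto"])
    conntrack[key] = pkt["dscp"]
    return pkt


def on_download(pkt):
    """WAN -> LAN: wash the incoming DSCP, then restore the stored one if any."""
    key = flow_key(pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"], pkt["proto"])
    washed = dict(pkt, dscp=0)  # wash: do not trust the DSCP from the internet
    if key in conntrack:
        washed["dscp"] = conntrack[key]
    return washed
```

A reply on a connection the client marked gets that mark back, while unrelated download traffic keeps the washed value.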

moeller0 · August 31, 2022, 11:42am

diffserv3 gives 25% of capacity as guaranteed high priority in the 3rd tin; if that is enough, go for it. Otherwise, if you need more, just use diffserv4 but leave the highest tin empty, as diffserv4's Video tin gets up to 50% of capacity as high priority.
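
As a worked example of that rule of thumb, here is a small helper using the 25% and 50% figures quoted above. The function name and the returned strings are made up for illustration.

```python
# Priority shares as quoted in the post.
DIFFSERV3_VOICE_SHARE = 0.25
DIFFSERV4_VIDEO_SHARE = 0.50


def suggest_scheme(link_mbit, priority_mbit):
    """Suggest a cake scheme for a given amount of high-priority traffic."""
    if priority_mbit <= link_mbit * DIFFSERV3_VOICE_SHARE:
        return "diffserv3"
    if priority_mbit <= link_mbit * DIFFSERV4_VIDEO_SHARE:
        return "diffserv4 (use Video, leave Voice empty)"
    return "priority traffic exceeds 50% of capacity"
```

On a 20 Mbit/s link, 4 Mbit/s of priority traffic fits within diffserv3's 5 Mbit/s, while 8 Mbit/s would need the Video tin of diffserv4.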

Bulk for P2P is probably a good idea if you want to work on the torrent machine at the same time. If you can outsource P2P to its own machine, maybe per-internal-IP isolation is enough to rein in BitTorrent and friends? The difference is that in bulk, BitTorrent will, under load from other traffic, only get 6.25% of capacity in total, while on its independent host it will get 1/N of the total for N active hosts; this will allow more forward progress for the P2P traffic.
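
The two shares are easy to compare numerically. The sketch below just encodes the 6.25% figure and the 1/N rule from the paragraph above; the function names are invented.

```python
BULK_SHARE = 0.0625  # 6.25% of capacity under load, as quoted above


def bulk_tin_rate(link_mbit):
    """Rate P2P gets in the bulk tin while other tins are busy."""
    return link_mbit * BULK_SHARE


def per_host_rate(link_mbit, active_hosts):
    """Rate a dedicated P2P host gets under per-host fairness."""
    return link_mbit / active_hosts


def break_even_hosts():
    """Number of active hosts at which both approaches give the same rate."""
    return 1 / BULK_SHARE
```

On a 100 Mbit/s link with 4 active hosts, the dedicated P2P host would get 25 Mbit/s against 6.25 Mbit/s in bulk; the two only meet at 16 active hosts.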

yelreve · August 31, 2022, 11:46am

The primary driver with the P2P in my case is that a fair few game clients use P2P for updates. I won't be too popular if I suggest no games whilst they update in the background.

moeller0 · August 31, 2022, 12:13pm

For that use the background/bulk tin seems a great match.

Sometimes proposals appear to move all "bulk" downloads above a certain size into bulk. IMHO that can be counter-intuitive: yes, there are downloads where I only care that they finish eventually (like I am fine when the transfer is finished the next morning), but there are some that I consider more urgent, like an OS update fixing some 0-day exploits (where I would prefer to update ASAP). IMHO it is hard to find a rule that will do the right thing in both situations...

Lynx · August 31, 2022, 2:11pm

@yelreve I just re-read your nft code. There are some serious nuggets in there for nftables coding in general. How did you pick all this up? Any particular good nft resources you could recommend?

Also just a general question about your code and nftables in general. You have e.g.:

## IP version agnostic DSCP set chains
chain dscp_set_cs0 {
    ip dscp set cs0 return
    ip6 dscp set cs0
}

chain dscp_set_cs1 {
    ip dscp set cs1 return
    ip6 dscp set cs1
}

... and so on.

Would there be a way to have just one 'chain' to set the marks? Can nftables chains not take in arguments like a function?

dlakelan · August 31, 2022, 2:17pm

The tighter your bandwidth the more incentive to mess with it. Back in the early days with @hisham2630 he had around 1-2Mbps out in the middle of Iraq, and playing with this stuff made a huge difference... Later I think he upgraded to ~5Mbps and we haven't heard from him since.

My gamer script uses a realtime tier for games and voice, and then I think it's 4 tiers for normal traffic. Up, normal, down, and way down priority.

This lets you put work related stuff in realtime, and Netflix in up priority... Which would do exactly what you wanted.

Lynx · August 31, 2022, 2:25pm

Sorry I haven't been any help yet there. I'm so new to nftables. Is it working more or less now? Is that compatible with variable capacity connections?

dlakelan · August 31, 2022, 2:38pm

Maybe... It comes down to whether we can adjust speeds on classes at runtime. I haven't looked into it yet.