NFtables and QoS in 2021

Lynx · August 26, 2022, 9:19pm

DNS plus ipset? Ah wait, you prioritise traffic through a certain facility and then ensure you use that facility for teams/zoom? Like a proxy that is prioritized?

Or, reading about this more, it seems the idea is that DNS lookups are leveraged so that firewall rules get applied to the IPs associated with those lookups.

Will that work all the time for teams and zoom and stuff then? That seems promising.

grrr2 · August 26, 2022, 9:58pm

the 2nd, i.e. these services (teams, zoom) are centralized in the sense that clients are connected via the teams/zoom servers. so you just need the ip addresses when dns resolution happens, add them to an (ip/nft)set and use the set in the firewall for dscp marking.

but this can also be used to downgrade traffic: windowsupdate.com or googledrive.com can be marked as bulk, for example.

this is working with fw3+dnsmasq 2.86, and basically qosify is doing the same. but dnsmasq nft set support requires version 2.87, which is not yet in owrt as far as i know.
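
A minimal sketch of that dns + set approach, for illustration only. The domain names, the set name voip4 and the chain name dscp_mark are assumptions (none of them come from this thread), and the real list of domains used by Teams or Zoom would need to be looked up. The nftset line needs dnsmasq 2.87 or later; the nftables part is written as it would appear in a file included into the inet fw4 table.

```conf
# dnsmasq with ipset support (fw3 setups): add every address resolved
# for these domains to the ipset "voip4". The ipset has to exist first,
# e.g. created with: ipset create voip4 hash:ip
ipset=/zoom.us/teams.microsoft.com/voip4

# dnsmasq 2.87+ (nftables setups): same idea, filling an nft set instead
nftset=/zoom.us/teams.microsoft.com/4#inet#fw4#voip4

# nftables side: the set dnsmasq fills, plus rules that mark on it
set voip4 {
        type ipv4_addr
}

chain dscp_mark {
        type filter hook postrouting priority mangle; policy accept;
        ip daddr @voip4 ip dscp set af41
        ip saddr @voip4 ip dscp set af41
}
```

The same pattern with a second set and a lower class (cs1, say) would cover the "mark as bulk" case.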

Lynx · August 26, 2022, 10:05pm

This seems promising.

amteza · August 26, 2022, 10:29pm

It's simpler than that, you have your Zoom and Teams range addresses list and ports, just match accordingly and you are all set. No need to use fancy DNS resolution to populate ipsets.

dlakelan · August 27, 2022, 2:04am

Prioritize all UDP from/to your work computer not on port 80 or 443 and using less than 3.5Mbps for example (though the bandwidth probably requires conntrack)

grrr2 · August 27, 2022, 12:40pm

yeah that simple ... or maybe not.

many service providers run their business in AWS for example - how do you pick the corresponding ip addresses, and only the addresses you are interested in, not the whole AWS range? or how do you know the ip addresses when a service provider is using a CDN? in short, if ip addresses change dynamically, how do you know them upfront?
or how do you distinguish between overlapping port ranges? there is no standard for ports above 1024: applications pick port ranges as they wish and there is no port police to stop two apps from using the same range. one app you want to promote, the other you want to demote, but both pick a random port from the same range and the ip's are not fixed?

so, yeah, sometimes it is simple: if the service provider uses static ip addresses all the time and publicly documents its ips or ports. other cases may not be that simple, and that is where the dns+ipset approach could be used.

Lynx · August 27, 2022, 12:58pm

I must admit I keep returning to the conntrack concept - see here:

https://man7.org/linux/man-pages/man8/tc-ctinfo.8.html

But I am struggling to quite wrap my head around it. Is the idea that you can store DSCP in the outgoing conntrack and that somehow that same conntrack is pushed into incoming packets and thus you can restore DSCPs applied on upload to download packets associated with a tracked connection?

I only see there how to restore DSCPs from conntrack but not how to set DSCPs into conntrack.

I'd really value an explanation concerning how this conntrack stuff is leveraged.

I think I need to understand conntracks better.

Is it possible at the moment to set the DSCP into conntrack using nftables? See e.g.:

https://lore.kernel.org/netfilter-devel/20191209224710.GI795@breakpoint.cc/t/

dave14305 · August 27, 2022, 1:50pm

That’s the code I keep linking to from last year. Ugly but functional.

ldir · August 27, 2022, 2:00pm

In essence a conntrack entry is meta-data about packet flows - it knows/can associate inbound and outbound packet flows together and forms the basis for the Linux firewall.

In the conntrack meta-data there is a data field called 'mark' which can be used to store a value. Since the conntrack entry applies to both outgoing and incoming packets for a flow it's 'easy' (or all the hard work has been done for us) to store some sort of magic value in there that could then be applied to subsequent packets of those flows, both in & out. If you stored a desired DSCP value in there all you'd need is some tool to apply the stored DSCP to a packet's DSCP as it traverses your interface/s

act_ctinfo is the tool to apply 'magic numbers' stored in conntrack's mark to packet DSCP fields. act_ctinfo is in kernel now and has been for a while (years)
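
To make the 'magic number' idea concrete, here is a small shell sketch of one possible mark layout. The layout is an assumption for illustration: DSCP in the low six bits (mask 63) and bit 7 (value 128) as the "a DSCP has been stored" flag. The function names are made up; nothing here touches the kernel, it only does the bit arithmetic.

```shell
#!/bin/sh
# Assumed layout of the conntrack mark (illustrative only):
DSCP_MASK=63      # 0x3f: low six bits hold the DSCP
STATE_FLAG=128    # 0x80: set once a DSCP has been stored

# store_dscp OLD_MARK DSCP: print the new mark, leaving all other bits alone
store_dscp() {
    echo $(( ($1 & ~DSCP_MASK) | ($2 & DSCP_MASK) | STATE_FLAG ))
}

# restore_dscp MARK: print the stored DSCP, or nothing if the flag is unset
restore_dscp() {
    if [ $(( $1 & STATE_FLAG )) -ne 0 ]; then
        echo $(( $1 & DSCP_MASK ))
    fi
}

mark=$(store_dscp 0 46)
echo "mark=$mark dscp=$(restore_dscp "$mark")"   # prints: mark=174 dscp=46
```

The flag bit is what lets a stored DSCP of zero be told apart from "nothing stored yet".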

The more difficult bit is getting the desired DSCP magic value stored in the conntrack mark. Openwrt has a hack to iptables called 'connmark' which allows just that, copying a packet's current DSCP value into conntrack mark.

Upstream kernel land wanted an nftables equivalent which was beyond my coding abilities, involving parsers & all sorts of (to me) horrors. Since nothing new can go in upstream iptables land if there's no equivalent nftables version, the iptables version didn't go in either.

Jeremy Sowden took on the challenge and has been doing stuff with it over the past 2+ years.

There's no direct equivalent, yet, of openwrt iptables connmark functionality - Jeremy should be cheered on https://lore.kernel.org/netfilter-devel/20220404121410.188509-1-jeremy@azazel.net/

dave14305's nftables code is good, unfortunately it's not quite equivalent enough for my personal (deranged?) requirements.

Lynx · August 27, 2022, 2:11pm

Thanks a lot for your explanation. I've read your various posts with great interest and also share your frustration with getting something that clearly works implemented. It works - so why all the pushback!

Can I just check my understanding? An upload packet can have a DSCP value set, say by Windows, and then the router can copy that value to a conntrack mark associated with the flow that the upload packet belongs to. Then for a corresponding download packet, the DSCP can be restored from that same conntrack entry, since the download packet belongs to the same flow.

Is that right? I like this idea a lot because setting DSCPs based on port ranges and the like feels awfully clumsy or prone to error. I do also quite like this ipset DNS idea as an alternative.

My use case is I'd like to set DSCPs in Windows for outgoing packets and have my router set corresponding DSCPs for the incoming packets.

I set up IFBs based on ingress and egress on br-lan and br-guest to deal with WireGuard VPN and believe with this technique I could possibly restore DSCPs on the egress side using the tc restore capability.

So all I lack is the way to have DSCP stored into conntrack and sounds like that's a bit of a mess with nftables at the moment?

Also it just dawned on me. Can't a map just be created between the conntrack and DSCP in nftables? So DSCP is never set in conntrack rather map is created between conntracks and DSCPs? Is that what your code does @dave14305?

Is that what @dlakelan suggested here:

dlakelan · August 27, 2022, 4:02pm

I've had a lot of success setting DSCP based on behavior. For example boosting all UDP flows that transmit less than 150 packets per second. This is typical behavior for games or other realtime control. It might be that video conference is different but you can also do something like 3500kbps as a limit there, and there may be port combos you can include with that (for example ignoring port 80 and 443 which are likely web traffic over QUIC)

To do this requires conntrack though and that means your nftables code should be in a forward chain. It isn't compatible with your IFB approach where the queuing happens very early.

You could perhaps solve the issue by fiddling with network namespaces and veth pairs and such but I don't think it'll be worth the complexity.

Lynx · August 27, 2022, 4:13pm

Isn't that just rewriting the sparse flow boost though? I had a look at the paper @tohojo wrote above and that was very interesting and helpful for understanding but it is admittedly difficult to get a handle on whether Teams and Zoom flows will necessarily get treated as sparse. If they will under most use conditions (@tohojo?) then all of this is just rearranging furniture for the sake of learning about nftables isn't it? Nothing wrong with it since it's enjoyable, but I have a gut feeling we really are dealing with the remaining 10 in the 100 percent.

Nevertheless I'm still extremely curious and I'd still like to figure out about using nftables to map between conntrack and DSCP. I mean it's not that conntrack needs to be actually set if a map can just be made between the two right?

So I mean:

incoming upload packet - save conntrack and associated DSCP to map; and
incoming download packet - look up map for conntrack and restore DSCP from map.

Can't that be fairly simple in nftables?

dave14305 · August 27, 2022, 4:20pm

IIRC, the map idea didn’t work because of type mismatches, which is why I ended up using raw header data instead. That’s what made it ugly.

You need to start badgering the netfilter-devel list for some forward progress on Jeremy’s patches. I believe the ball was in Jeremy’s court after Pablo reviewed the patches.

dlakelan · August 27, 2022, 4:21pm

For you the problem is you get your packets at ingress and the conntrack code hasn't seen the packet yet so you can't get access to the connmark (I think)

It depends a lot on your bandwidth levels and traffic mix ... If you have 30 sparse flows all competing and you want two or three of them to be reliably boosted, you can't rely on boosting all sparse flows... So you might mix sparsity with endpoint addresses etc... But yeah it might not be worth it for you. An office with 15 people might be different

ldir · August 27, 2022, 4:33pm

Copying the current set DSCP in a packet to the conntrack mark using nftables isn't that horrendous (with the caveat that I've not actually tried this recently)

eg: to copy the set dscp as a 'raw' number into the conntrack mark

meta nfproto ipv4 ct mark set (@nh,8,8 & 252) >> 2       
meta nfproto ipv6 ct mark set (@nh,0,16 & 4032) >> 6

Assuming you're doing this for act_ctinfo's benefit then you'll also need to set another bit used as a flag to tell act_ctinfo that a dscp has been stored, even if it's zero.

eg.

ct mark set ct mark or 128

You need to (optionally) set the dscp ('cos windows might have set it for you, that's nice) and (not optionally) copy that dscp to conntrack mark and then set the flag. You can do that as often as you like, the last value stored is the one used for the restore path.
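
The masks and shifts in those two raw payload expressions can be checked with plain shell arithmetic. This is only a worked example of the bit twiddling; the function names are made up for illustration. For IPv4, bits 8..15 of the network header are the TOS byte (DSCP plus two ECN bits). For IPv6, the first 16 bits are version (4 bits), traffic class (8 bits) and the top 4 bits of the flow label.

```shell
#!/bin/sh
# IPv4: argument is the TOS byte; 252 (0xfc) drops the two ECN bits
ipv4_dscp() {
    echo $(( ($1 & 252) >> 2 ))
}

# IPv6: argument is the first 16 bits of the header; 4032 (0x0fc0)
# selects the six DSCP bits inside the traffic class
ipv6_dscp() {
    echo $(( ($1 & 4032) >> 6 ))
}

ipv4_dscp 0xb8     # TOS 0xb8                       -> 46
ipv4_dscp 0xb9     # same DSCP with an ECN bit set  -> 46
ipv6_dscp 0x6b80   # version 6, traffic class 0xb8  -> 46
```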

So what don't I like about 'meta nfproto ipv4 ct mark set (@nh,8,8 & 252) >> 2'? It's the 'set' bit - it blows away the whole conntrack mark value, which is a pain if you're using some of the other bits for flags and stuff.

and the instantiation of act_ctinfo would be something like

tc qdisc add dev eth0 handle ffff: ingress 
tc filter add dev eth0 parent ffff: matchall action ctinfo dscp 63 128

Lynx · August 27, 2022, 4:39pm

Seems odd that type could come into play here though or not? I mean the concept seems pretty simple and I read that nftables supports maps in a fairly generic looking way.

Hmm. Just looking at this now:

https://wiki.nftables.org/wiki-nftables/index.php/Data_types

So @dave14305 you can only map between elements belonging to a particular type? If so... argh!

dave14305 · August 27, 2022, 4:54pm

Without bothering to search the thread myself, I am probably wrong. The issue with maps was probably specific to IPv6 because the IPv6 DSCP field crosses a byte boundary in the header. So I just scrapped the maps completely instead of using it only for IPv4.

Lynx · August 27, 2022, 4:57pm

Sorry just saw this. Will try and code tonight if I get a chance.

amteza · August 27, 2022, 9:14pm

To be clear, this is the scenario for Zoom and Teams, which I was referring to, nothing else.

Lynx · August 27, 2022, 10:14pm

Sweet @ldir thank you so much - with your help I got this to work, as follows:

root@OpenWrt:/etc/init.d# cat cake-dual-ifb
#!/bin/sh /etc/rc.common

exec &> /tmp/cake-dual-ifb.log

START=50
STOP=4

start() {
        # ifb interface for handling ingress on WAN (and VPN interface if wg show reports endpoint)
        ip link add name ifb-ul type ifb
        ip link add name ifb-dl type ifb
        ip link set ifb-ul up
        ip link set ifb-dl up

        tc qdisc add dev br-lan handle ffff: ingress
        tc qdisc add dev br-guest handle ffff: ingress
        tc qdisc add dev br-lan handle 1: root prio
        tc qdisc add dev br-guest handle 1: root prio

        # capture upload (ingress) on br-lan and br-guest
        tc filter add dev br-lan parent ffff: protocol ip prio 1 u32 match ip dst 192.168.1.0/24 action pass
        tc filter add dev br-lan parent ffff: protocol ip prio 1 u32 match ip dst 224.0.0.0/4 action pass
        tc filter add dev br-lan parent ffff: protocol ip prio 1 u32 match ip dst 255.255.255.255/32 action pass
        tc filter add dev br-lan parent ffff: protocol ip prio 2 matchall action mirred egress redirect dev ifb-ul
        tc filter add dev br-guest parent ffff: protocol ip prio 1 u32 match ip dst 192.168.2.0/24 action pass
        tc filter add dev br-guest parent ffff: protocol ip prio 1 u32 match ip dst 224.0.0.0/4 action pass
        tc filter add dev br-guest parent ffff: protocol ip prio 1 u32 match ip dst 255.255.255.255/32 action pass
        tc filter add dev br-guest parent ffff: protocol ip prio 2 matchall action mirred egress redirect dev ifb-ul

        # capture download (egress) on br-lan and br-guest
        tc filter add dev br-lan parent 1: protocol ip prio 1 u32 match ip src 192.168.1.0/24 action pass
        tc filter add dev br-lan parent 1: protocol ip prio 2 matchall action ctinfo dscp 63 128 action mirred egress redirect dev ifb-dl
        tc filter add dev br-guest parent 1: protocol ip prio 1 u32 match ip src 192.168.2.0/24 action pass
        tc filter add dev br-guest parent 1: protocol ip prio 2 matchall action mirred egress redirect dev ifb-dl

        # apply CAKE on the ifbs
        tc qdisc add dev ifb-ul root cake bandwidth 30Mbit besteffort triple-isolate nonat nowash no-ack-filter split-gso rtt 100ms noatm overhead 92
        tc qdisc add dev ifb-dl root cake bandwidth 25Mbit besteffort triple-isolate nonat nowash ingress no-ack-filter split-gso rtt 100ms noatm overhead 92
}

stop() {
        tc qdisc del dev br-lan ingress
        tc qdisc del dev br-guest ingress
        tc qdisc del dev br-lan root
        tc qdisc del dev br-guest root
        tc qdisc del dev ifb-ul root
        tc qdisc del dev ifb-dl root
        ip link set ifb-ul down
        ip link del ifb-ul
        ip link set ifb-dl down
        ip link del ifb-dl
}

And for nftables I just created an nft file for /etc/nftables.d/ with:

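chain cake-dual-ifb {
        type filter hook prerouting priority mangle; policy accept;
        # copy the packet's DSCP into the conntrack mark (IPv4: TOS byte; IPv6: first 16 header bits)
        meta nfproto ipv4 ct mark set (@nh,8,8 & 252) >> 2
        meta nfproto ipv6 ct mark set (@nh,0,16 & 4032) >> 6
        # flag bit telling act_ctinfo that a DSCP has been stored
        ct mark set ct mark or 128
}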

I managed to set the DSCPs for 'chrome.exe' in Windows 11 using powershell based on a guide @moeller0 sent me (thanks @moeller0 that saved me a lot of time), as follows:

New-NetQosPolicy -Name "test" -AppPathNameMatchCondition "chrome.exe" -ThrottleRateActionBitsPerSecond 1MB -PolicyStore ActiveStore -NetworkProfile All -DSCPAction 47 -WhatIf
New-NetQosPolicy -Name "test" -AppPathNameMatchCondition "chrome.exe" -PolicyStore ActiveStore -NetworkProfile All -DSCPAction 46

And with that I see tos=0xb8 on upload, and with the above setup I now see tos=0xb8 on download too.
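
For anyone wanting to check their own values: the tos byte is just the DSCP shifted left by two bits (the low two bits are ECN), so DSCP 46 gives 0xb8. A small shell sketch for a few common codepoints, with a made-up function name:

```shell
#!/bin/sh
# dscp_to_tos DSCP: print the TOS byte in hex, assuming both ECN bits are zero
dscp_to_tos() {
    printf '0x%02x\n' $(( $1 << 2 ))
}

for pair in EF:46 AF41:34 CS1:8 CS0:0; do
    name=${pair%%:*}
    dscp=${pair##*:}
    echo "$name dscp=$dscp tos=$(dscp_to_tos "$dscp")"
done
# prints:
# EF dscp=46 tos=0xb8
# AF41 dscp=34 tos=0x88
# CS1 dscp=8 tos=0x20
# CS0 dscp=0 tos=0x00
```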

And just to confirm: I restore the DSCP mark on br-lan egress using the tc egress capability. It seems to work. And by setting the nftables filter hook to 'prerouting' it works even over my VPN interface.

This seems to me like a pretty elegant and generic solution (at least for IPv4 - I don't use IPv6 yet). All I have to do is assign appropriate DSCPs to applications in Windows and then everything is taken care of.

Any thoughts or comments? If anyone sees any possible improvements I would be very pleased to hear them. I'm least confident about the nftables file.