NFtables and QoS in 2021

Sorry just saw this. Will try and code tonight if I get a chance.

To be clear, this is the scenario for Zoom and Teams, which I was referring to, nothing else.

Sweet @ldir thank you so much :smiley: - with your help I got this to work, as follows:

root@OpenWrt:/etc/init.d# cat cake-dual-ifb
#!/bin/sh /etc/rc.common

exec &> /tmp/cake-dual-ifb.log

START=50
STOP=4

start() {
        # ifb interface for handling ingress on WAN (and VPN interface if wg show reports endpoint)
        ip link add name ifb-ul type ifb
        ip link add name ifb-dl type ifb
        ip link set ifb-ul up
        ip link set ifb-dl up

        tc qdisc add dev br-lan handle ffff: ingress
        tc qdisc add dev br-guest handle ffff: ingress
        tc qdisc add dev br-lan handle 1: root prio
        tc qdisc add dev br-guest handle 1: root prio

        # capture upload (ingress) on br-lan and br-guest
        tc filter add dev br-lan parent ffff: protocol ip prio 1 u32 match ip dst 192.168.1.0/24 action pass
        tc filter add dev br-lan parent ffff: protocol ip prio 1 u32 match ip dst 224.0.0.0/4 action pass
        tc filter add dev br-lan parent ffff: protocol ip prio 1 u32 match ip dst 255.255.255.255/32 action pass
        tc filter add dev br-lan parent ffff: protocol ip prio 2 matchall action mirred egress redirect dev ifb-ul
        tc filter add dev br-guest parent ffff: protocol ip prio 1 u32 match ip dst 192.168.2.0/24 action pass
        tc filter add dev br-guest parent ffff: protocol ip prio 1 u32 match ip dst 224.0.0.0/4 action pass
        tc filter add dev br-guest parent ffff: protocol ip prio 1 u32 match ip dst 255.255.255.255/32 action pass
        tc filter add dev br-guest parent ffff: protocol ip prio 2 matchall action mirred egress redirect dev ifb-ul

        # capture download (egress) on br-lan and br-guest
        tc filter add dev br-lan parent 1: protocol ip prio 1 u32 match ip src 192.168.1.0/24 action pass
        tc filter add dev br-lan parent 1: protocol ip prio 2 matchall action ctinfo dscp 63 128 action mirred egress redirect dev ifb-dl
        tc filter add dev br-guest parent 1: protocol ip prio 1 u32 match ip src 192.168.2.0/24 action pass
        tc filter add dev br-guest parent 1: protocol ip prio 2 matchall action mirred egress redirect dev ifb-dl

        # apply CAKE on the ifbs
        tc qdisc add dev ifb-ul root cake bandwidth 30Mbit besteffort triple-isolate nonat nowash no-ack-filter split-gso rtt 100ms noatm overhead 92
        tc qdisc add dev ifb-dl root cake bandwidth 25Mbit besteffort triple-isolate nonat nowash ingress no-ack-filter split-gso rtt 100ms noatm overhead 92
}

stop() {
        tc qdisc del dev br-lan ingress
        tc qdisc del dev br-guest ingress
        tc qdisc del dev br-lan root
        tc qdisc del dev br-guest root
        tc qdisc del dev ifb-ul root
        tc qdisc del dev ifb-dl root
        ip link set ifb-ul down
        ip link del ifb-ul
        ip link set ifb-dl down
        ip link del ifb-dl
}

And for nftables I just created an nft file for /etc/nftables.d/ with:

chain cake-dual-ifb {

                type filter hook prerouting priority mangle; policy accept;
                meta nfproto ipv4 ct mark set (@nh,8,8 & 252) >> 2
                meta nfproto ipv6 ct mark set (@nh,0,16 & 4032) >> 6
                ct mark set ct mark or 128
        }
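
For reference, here is how I read the arithmetic in those two `ct mark set` rules, as a minimal sketch in plain shell. The function names are made up for illustration and nothing here touches nftables; it only reproduces the mask-and-shift on example header values. 0xb8 is the IPv4 TOS byte for DSCP 46 (EF), and 0x6b80 is what the first 16 bits of an IPv6 header would look like for the same DSCP.

```shell
#!/bin/sh
# Illustrative helpers only; the names are hypothetical, not part of nftables.

# IPv4: @nh,8,8 is the TOS byte; 252 (0xfc) keeps the six DSCP bits.
ipv4_dscp() { echo $(( ($1 & 252) >> 2 )); }

# IPv6: @nh,0,16 is version + traffic class + top 4 bits of the flow label;
# 4032 (0x0fc0) keeps the six DSCP bits.
ipv6_dscp() { echo $(( ($1 & 4032) >> 6 )); }

# "ct mark or 128" sets bit 7 as the "a DSCP has been stored" flag.
flagged_mark() { echo $(( $1 | 128 )); }

ipv4_dscp 0xb8                      # prints 46
ipv6_dscp 0x6b80                    # prints 46
flagged_mark "$(ipv4_dscp 0xb8)"    # prints 174
```

174 is 0xae: as far as I understand it, the low six bits are what `ctinfo dscp 63 128` restores on egress, and 128 is the flag it checks first.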

I managed to set the DSCPs for 'chrome.exe' in Windows 11 using powershell based on a guide @moeller0 sent me (thanks @moeller0 that saved me a lot of time), as follows:

New-NetQosPolicy -Name "test" -AppPathNameMatchCondition "chrome.exe" -ThrottleRateActionBitsPerSecond 1MB -PolicyStore ActiveStore -NetworkProfile All -DSCPAction 47 -WhatIf
New-NetQosPolicy -Name "test" -AppPathNameMatchCondition "chrome.exe" -PolicyStore ActiveStore -NetworkProfile All -DSCPAction 46

And with that I see tos=0xb8 on upload, and with the above setup I now see tos=0xb8 on download too.

And just to confirm: I restore the DSCP mark on br-lan egress using the tc egress capability. It seems to work. And by setting the nftables filter hook to 'prerouting' it works even over my VPN interface.

This seems to me like a pretty elegant and generic solution (at least for IPv4 - I don't use IPv6 yet). All I have to do is assign appropriate DSCPs to applications in Windows and then everything is taken care of.

Any thoughts or comments? If anyone sees any possible improvements I would be very pleased to hear them. I'm least confident about the nftables file.

Have you tried simply using:

New-NetQosPolicy -Name "test" -AppPathNameMatchCondition "chrome.exe"  -NetworkProfile All -DSCPAction 46

That is, omit the -PolicyStore ActiveStore directive; according to Microsoft's documentation that should result in persistent storage of the policy. Or, if I misinterpret things, -PolicyStore $env:computername might work... In short, there should be a way to make these things persistent.

Question: Do cake's counters actually increase now when you use chrome? BTW, treating all chrome traffic as EF is somewhat generic; you need to keep good discipline and only use chrome for really important traffic, unless chrome here is just a convenient test vehicle for the whole elaborate construction.

I saw that later and have not tested it yet. But I'm also intending to look into the Group Policy stuff generally to see how to make this setting apply to my work domain or at least my user@domain.

Chrome was purely for speedy testing, because I wanted to somehow send some packets out with a custom DSCP. I initially tried ping, but that didn't work, since this only works for TCP or UDP, not ICMP.

I'm very confident this will work in CAKE because my tc call on br-lan egress happens before the redirection to the IFB, and I checked the IFBs using tcpdump and all looked well. I haven't checked the counters yet, as I'm still trying to figure out which cake options to use for my prioritized Teams and Zoom applications: diffserv4 or diffserv3, and which DSCP marks. I also need to work out how to stop anything other than the DSCPs restored from conntrack (e.g. DSCPs set by the ISP) from having an impact. I think I can just set DSCPs to 0 on download, so that the incoming traffic is washed and only the DSCPs restored from conntrack persist through to CAKE.

BTW I have a technical question. How is it that the incoming packets take on the modified conntrack rather than the original conntrack? As in we change the conntrack on outgoing packets so how come the modified conntrack on incoming packets gets applied? Does the kernel update its conntracks based on the outgoing packets so that modifying the conntrack on outgoing packets before it exits then updates the conntrack information that gets applied to the incoming packets of the corresponding flow?

Put another way, we modify the stamp rather than the stamper. So does the stamper get modified by modifying a stamp?

The short version is that the conntrack entry has 2 sub-entries, one for the 'original' direction of a flow and one for the 'reply' direction of the flow. Both of those sub-entries are linked to the conntrack meta-data such as conntrack mark.

https://thermalcircle.de/doku.php?id=blog:linux:connection_tracking_1_modules_and_hooks goes into more depth than my mind can cope with.

Thanks.

Working with the DSCP restore from conntrack on lan egress seems to simplify things a lot. So I'm pretty pleased with the way this has worked out thanks to your help above.

BTW is there an nftables way to:

A) specify only packets from br-lan destined for wan or vpn are to have conntrack updated; and

B) only update conntrack once when necessary rather than for every packet?

Group Policy requires a domain controller IIRC, so it is not an option for standalone Windows instances.

Hmm, so this technique does not work for ICMP? I guess I was lucky then using putty for testing and not ping...

DSCPs really have no inherent value; the only thing that matters is what your prioritization set-up does with them. For cake's diffserv4 and diffserv3 the following mappings hold:
diffserv4:

/*  Further pruned list of traffic classes for four-class system:
 *
 *	    Latency Sensitive  (CS7, CS6, EF, VA, CS5, CS4)
 *	    Streaming Media    (AF4x, AF3x, CS3, AF2x, TOS4, CS2, TOS1)
 *	    Best Effort        (CS0, AF1x, TOS2, and those not specified)
 *	    Background Traffic (CS1)
 *
 *		Total 4 traffic classes.
 */

diffserv3:

/*  Simplified Diffserv structure with 3 tins.
 *		Low Priority		(CS1)
 *		Best Effort
 *		Latency Sensitive	(TOS4, VA, EF, CS6, CS7)
 */

Which one to pick is something I would base on how much traffic I expect Teams to require.
Here are the priority rate percentages:

diffserv3:
Low Priority:       6.25%
Best Effort:      100.00%
Latency Sensitive: 25.00%
diffserv4:
Background Traffic:   6.25%
Best Effort:        100.00%
Streaming Media:     50.00%
Latency Sensitive:   25.00%
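
To make those percentages concrete, here is a small sketch that turns them into absolute rates for one shaper setting. The 10000 kbit/s rate and the helper name are assumptions for illustration only; plug in your own lowest shaper rate.

```shell
#!/bin/sh
# tin_threshold_kbit RATE_KBIT HUNDREDTHS_OF_PERCENT
# Hypothetical helper: 625 means 6.25%, 5000 means 50%, 2500 means 25%.
tin_threshold_kbit() { echo $(( $1 * $2 / 10000 )); }

RATE=10000   # assumed shaper rate in kbit/s, for illustration only

echo "diffserv4 at ${RATE} kbit/s:"
echo "  Background Traffic: $(tin_threshold_kbit "$RATE" 625) kbit/s"
echo "  Streaming Media:    $(tin_threshold_kbit "$RATE" 5000) kbit/s"
echo "  Latency Sensitive:  $(tin_threshold_kbit "$RATE" 2500) kbit/s"
```

So at that rate the latency sensitive tin covers about 2.5 Mbit/s and the streaming media tin about 5 Mbit/s, which is the number to compare against what a call actually needs.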

Assuming that at your lowest shaper setting Teams will use a considerable fraction of your capacity, I would use diffserv4 and specifically the "streaming media/video" tin:

static const u8 diffserv4[] = {
	0, 1, 0, 0, 2, 0, 0, 0,
	1, 0, 0, 0, 0, 0, 0, 0,
	2, 0, 2, 0, 2, 0, 2, 0,
	2, 0, 2, 0, 2, 0, 2, 0,
	3, 0, 2, 0, 2, 0, 2, 0,
	3, 0, 0, 0, 3, 0, 3, 0,
	3, 0, 0, 0, 0, 0, 0, 0,
	3, 0, 0, 0, 0, 0, 0, 0,
};

so CS2/16 would be a decent choice.

If, however, 25% of the minimal shaper rate still works OK, I would probably use EF/46, as that would allow using either diffserv4 or diffserv3.
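
If you want to check other codepoints against that table, a quick lookup sketch in shell follows. The helper name is made up, and I am reading the table values as 0 = Best Effort, 1 = Background, 2 = Streaming Media, 3 = Latency Sensitive, which is what the CS0, CS1, CS2 and EF entries suggest when compared with the class comment above.

```shell
#!/bin/sh
# The diffserv4 array quoted above, one row of eight DSCPs per line.
DIFFSERV4="
0 1 0 0 2 0 0 0
1 0 0 0 0 0 0 0
2 0 2 0 2 0 2 0
2 0 2 0 2 0 2 0
3 0 2 0 2 0 2 0
3 0 0 0 3 0 3 0
3 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
"

# diffserv4_tin DSCP (hypothetical helper): prints the table entry for DSCP 0..63.
diffserv4_tin() {
    dscp=$1
    set -- $DIFFSERV4    # word-split the table into $1..$64
    shift "$dscp"        # drop the entries before the requested DSCP
    echo "$1"
}

diffserv4_tin 16    # CS2, prints 2
diffserv4_tin 46    # EF,  prints 3
diffserv4_tin 8     # CS1, prints 1
```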

I don't use nftables (yet) so it's difficult for me to be definitive about things. re: B, yes that's possible. If you think about it you could use the same flag 'act_ctinfo' is using to know if a DSCP has been stored but in reverse.

ct mark and 0x80 == 0 counter goto yourmarkingstoringchain

A 'set once and forget' approach. If you've a number of classification rules then there's some CPU usage merit in doing this 'set and forget'. If you're hoping for your clients to provide and decide the classification for you, then you have to weigh the CPU cost of 'check & jump' vs just doing it anyway.
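
The test in that rule is just a bit check, sketched here in shell with a made-up helper name. It assumes the same 0x80 flag bit as above, and it illustrates the logic only, not how nftables evaluates the rule.

```shell
#!/bin/sh
# needs_marking MARK (hypothetical helper)
# Succeeds while bit 7 (0x80) of the conntrack mark is still clear.
needs_marking() { [ $(( $1 & 0x80 )) -eq 0 ]; }

for mark in 0 46 174; do
    if needs_marking "$mark"; then
        echo "mark $mark: flag clear, classify and store"
    else
        echo "mark $mark: flag set, skip"
    fi
done
```

Only the last value (174 = 46 | 128) has the flag set, so only that one is skipped.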

Look closely at Dave's implementation and I think you'll see both of your questions are answered - he based his implementation on my original iptables approach (I have a more complicated implementation now) - and now things go full circle because I'm quoting him.

@ldir do I understand correctly that the issue with @Lynx's current approach is that it basically hogs the full "mark" field for itself, as it does not allow specifying a mask value for setting/extracting the DSCP? So this would be potentially problematic with, say, mwan3?

Also @ldir, I had a look at the output of the conntrack helper binary, and unlike my naive assumption, IPv6 connections have proper conntrack entries; I guess I was under the wrong impression that conntrack is only used for NAT while it appears to contain the state for the stateful firewall, so at least for fw3 your elegant approach will not care much about IPv4 or IPv6. You really came up with a pretty cool solution to a hard problem. (Once my router is switched from fw3/iptables to fw4/nftables I will try to see whether I can integrate that into sqm-scripts in a way that convinces Toke).

@Lynx, the cake-dual-ifb file, what name does it use and how is it actually enabled (simply by existing, and if so are there any specific rules for naming it)?

conntrack is an enabler for NAT but it exists for stateful firewall purposes. It has always been IPv4, IPv6, ICMP aware (yes you can set DSCP values for 'ping')

And yes, mwan3 will fail to work because of the mark field hogging. It's another reason why Jeremy's nftables patches need to go in.

Occasionally it takes me a while (like several years) to grok conceptually simple things :wink:

I agree, but I think if the current approach works in a non-mwan3 compatible fashion but without requiring special binaries it still would be a valid option for all those users not using mwan3 and the like.

It might be worth commenting on @tohojo's comments re 'excessive' packet classification. There are two reasons why I do it and they have a direct bearing on why I patched cake to have 'diffserv5'. One is packet importance, the other is guaranteed bandwidth.

Least Effort: Bittorrent. I don't care about jitter and I don't want it to consume bandwidth if I've anything more important. It really is least effort, but if there's nothing competing, go right ahead. Minimal guaranteed bandwidth
Bulk/Background: 'long term' transfers, backups, updates etc - must complete, don't want it to stall to nothing. Small guaranteed bandwidth, but not nothing.
Best Effort: Normal traffic - 'interactive' in sense of 'click a mouse & something happens'.
Video: Streaming (in the hope that providers really do get to streaming level and not bursting), interactive video conferencing (zoom, facetime, etc). Guaranteed 50% b/w min
Voice: Latency & jitter sensitive. Guaranteed 25% min

Basically it's diffserv4 with a 'least effort' tin.

Packet classification for me is a combination of making sure latency/jitter sensitive traffic takes priority whilst making less important traffic sit in reasonable bandwidth limits.

Yes, so the catch here is the way cake deals with tin overruns... if your video tin exceeds its 50% guaranteed rate (which is only guaranteed because Voice only hogs 25% exclusively and 100-25>=50) it will be scheduled at lower priority... So in a sense, for the tins to maintain their relative priority you need to make sure not to stuff capacity-seeking traffic into Video/Voice, as that can and will degrade their priority properties.

Personally, I am fine with 3 priority tiers, one normal one with less and one with higher priority, but I can also understand the rationale behind your 5-tier set-up. I think I follow Toke's concerns about 'excessive' packet classification, and hence aim to modify priorities only sparingly and I expect only very few applications/flows to ever require priority games. At least so far we have been lucky that the default behaviour is just fine (family of 5 on a link cake-shaped to 105/36 Mbps). I use diffserv4 so I am prepared to change priorities if need be, but am not unhappy if that is going to be "never" :wink:

Isn't it possible to just do something like

ct mark set ct mark & ...

To manually mask in your own bit field?

The rule of thumb (synthesised from that paper) is that any flow that uses less than its "fair share" of the link traffic is considered sparse and prioritised appropriately (e.g., if there are 10 flows on a 20Mbit link, anything using less than 2Mbps in instantaneous bandwidth (it's really about the space between individual packets, not really "bandwidth" over longer time periods) will be sparse). The benefit of this is that this property is completely dynamic and will work regardless of the marking or type of traffic, as long as it's in a separate flow...

...and the drawback is that it's completely dynamic, so if the traffic pattern is such that the "fair share" drops below what the flow uses, it'll suddenly no longer be considered sparse and may experience worse latency. Whether this happens in practice depends on a lot of different factors, so it's generally a YMMV kind of thing; and diffserv prioritisation is a way of dealing with the case where the sparse flow optimisation is not enough on its own.
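
The rule of thumb can be put into numbers with a tiny shell sketch. This is only the back-of-the-envelope version with made-up helper names; the actual decision is about spacing between packets, not a rate comparison like this.

```shell
#!/bin/sh
# Hypothetical helpers; all rates in kbit/s.

# fair_share_kbit LINK_KBIT ACTIVE_FLOWS
fair_share_kbit() { echo $(( $1 / $2 )); }

# is_sparse FLOW_KBIT LINK_KBIT ACTIVE_FLOWS
# Succeeds when the flow uses less than its fair share.
is_sparse() { [ "$1" -lt "$(fair_share_kbit "$2" "$3")" ]; }

fair_share_kbit 20000 10    # prints 2000
is_sparse 1500 20000 10 && echo "1.5 Mbit/s flow among 10 flows: sparse"
is_sparse 1500 20000 20 || echo "same flow among 20 flows: no longer sparse"
```

The second case is the drawback described above: the flow did not change, but the fair share dropped to 1 Mbit/s underneath it.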

Just to be clear, I wasn't trying to imply there's anything wrong with that. If you feel like spending the time and effort to setup and maintain this, by all means go for it. I was just saying that for my deployments, this has never been worth the effort (to me), because FQ and the sparse flow optimisation is sufficient to keep things smooth in my setup. But I'm lucky enough that the link I'm administering that has a lot of users also has a lot of bandwidth, which is obviously not the case for everyone. So as above, YMMV.

I guess what I'm really encouraging people to do is to understand their use cases properly so they can make informed decisions about what works best for them given the time and resources available to improve stuff :slight_smile:

Thanks all - this is really interesting and enjoyable. Since setting up CAKE on my 10-70Mbit/s 4G connection I'd say out of about ten Zoom/Teams meetings one has a glitch at some point. It occurs when my capacity hits rock bottom during peak congestion hours (around 5pm) so circa 10Mbit/s and my wife is streaming Netflix. So that drove me to look into setting up DSCPs. Which has proven quite the challenge. So it's a good thing that the default plus sparse boost will cut it for most. It could also be that my minimum in cake autorate is set too high (10Mbit/s) and that actually the real rate sometimes drops slightly below. Or even just unlucky enough to coincide with my LTE 48hr wan IP refresh. My connection has for sure made things interesting.

@moeller0 thanks a lot for your very helpful diffserv post above. This will help me in trying this out.

Also I didn't understand the point about hogging the conntrack mark. Could you elaborate there?

Yes, I'm not entirely clear on that, but if you do

ct mark set ...

it will overwrite the whole mark... so if some other thing is using some bits in the mark to indicate something you'll bork that functionality, unless you somehow preserve that other mark. I THINK you could preserve it by for example...

ct mark set ((ct mark & 0xff000000) | (0x00ffffff & mymark))

which should grab the mark, set the lower bits to be 0, then or them with your own mark... but keep the upper 8 bits equal to whatever they used to be.

obviously move the masks around as needed relative to where the other functionality is putting its bits.
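
The same masking can be tried out in plain shell arithmetic before putting it in a ruleset. The helper name and the example values are made up; the masks match the rule above (top 8 bits preserved, low 24 bits replaced).

```shell
#!/bin/sh
# merge_mark OLD_MARK MY_MARK (hypothetical helper)
# Keeps the top 8 bits of OLD_MARK and takes the low 24 bits from MY_MARK.
merge_mark() { printf '0x%08x\n' $(( ($1 & 0xff000000) | ($2 & 0x00ffffff) )); }

merge_mark 0x12000000 0xae    # prints 0x120000ae
merge_mark 0x12345678 0xae    # prints 0x120000ae
```

In both cases the 0x12 in the top byte survives, and whatever was in the low 24 bits is replaced by the new mark.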

Do you think there's a chance the egress hook may be made available for 5.10 kernel?

I don't know :man_shrugging:

Most likely stuff won't be back ported.