Nftables custom QoS, round 2

dlakelan · June 27, 2020, 11:49pm

According to the most recent posts on the old thread, nftables is working with appropriate configurations in recent OpenWrt QoS and nftables ... some findings to share

Thanks to those who really helped a lot over there: @amteza, @anon50098793, @summers and the rest of the gang.

So, assuming you look through that thread and discover the magic ingredients, and you have an nftables only firewall, how do you actually use it to control your bufferbloat.

Let's start with the following config, which is a modified version of the config from before, including now some extra tables and chains:

# A simple stateful firewall with some packet tagging,
# based originally on nftables archlinux wiki
# https://wiki.archlinux.org/index.php/nftables

## this assumes eth0 is LAN and eth1 is WAN, modify as needed

flush ruleset

## change these

define wan = eth1
define lan = eth0

table inet filter {
	chain input {
		type filter hook input priority 0; policy drop;

		# established/related connections
		ct state established,related accept

		# loopback interface
		iifname lo accept

		## icmpv6 is a critical part of the protocol, we just
		## accept everything, you can lookin to making this
		## more restrictive but be careful
		ip6 nexthdr icmpv6 accept

		# we are more restrictive for ipv4 icmp
		ip protocol icmp icmp type { destination-unreachable, router-solicitation, router-advertisement, time-exceeded, parameter-problem } accept

		ip protocol igmp accept

		ip protocol icmp meta iifname $lan accept

		## ntp protocol accept from LAN
		udp dport ntp iifname $lan accept

		## DHCP accept
		iifname $lan ip protocol udp udp sport bootpc udp dport bootps log prefix "FIREWALL ACCEPT DHCP: " accept

		## DHCPv6 accept from LAN
		iifname $lan udp sport dhcpv6-client udp dport dhcpv6-server accept

		## allow dhcpv6 from router to ISP
		iifname $wan udp sport dhcpv6-server udp dport dhcpv6-client accept

		# SSH (port 22), limited to 10 connections per minute,
		# you might prefer to not allow this from WAN for
		# OpenWrt, in which case you should also add an
		# iifname eth0 filter in the front so we're only
		# allowing from LAN
		
		ct state new tcp dport ssh meter ssh-meter4 {ip saddr limit rate 10/minute burst 15 packets} accept
		ct state new ip6 nexthdr tcp tcp dport ssh meter ssh-meter6 {ip6 saddr limit rate 10/minute burst 15 packets} accept 

		## allow access to LUCI from LAN
		iifname $lan tcp dport {http,https} accept

		## DNS for main LAN, we limit the rates allowed from each LAN host to reduce chance of denial of service
		iifname $lan udp dport domain meter dommeter4 { ip saddr limit rate 240/minute burst 240 packets} accept
		iifname $lan udp dport domain meter dommeter6 { ip6 saddr limit rate 240/minute burst 240 packets} accept

		iifname $lan tcp dport domain meter dommeter4tcp { ip saddr limit rate 240/minute burst 240 packets} accept
		iifname $lan tcp dport domain meter dommeter6tcp { ip6 saddr limit rate 240/minute burst 240 packets} accept

		## allow remote syslog input? you might want this, or remove this
		
		# iifname $lan udp dport 514 accept

		counter log prefix "FIREWALL INPUT DROP: " drop
	}

	chain forward {
	    type filter hook forward priority 0; policy drop;

	    ct state established,related accept

	    iifname lo accept
	    iifname $lan oifname $wan accept ## allow LAN to forward to WAN

	    counter log prefix "FIREWALL FAIL FORWARDING: " drop
	}
}

## masquerading for ipv4 output on WAN
table ip masq {
	chain masqout {
	    type nat hook postrouting priority 0; policy accept;
	    oifname $wan masquerade

	}

	## this empty table is required to make the kernel do the unmasquerading
	chain masqin {
	    type nat hook prerouting priority 0; policy accept;

	}
	
}

## lets create a tagger table

table inet tag {
      chain wanin {
      	    type filter hook ingress device $wan priority 0;
	    jump tagchain
      }
      chain lanin {
      	    type filter hook ingress device $lan priority 0;
	    jump tagchain
      }
      chain tagchain {
      	    ## just some example tags for Steam games
      	    ip protocol udp udp dport {7000-9000, 27000-27200} ip dscp set cs5
      	    ip6 nexthdr udp udp dport {7000-9000, 27000-27200} ip6 dscp set cs5
      	    ip protocol udp udp sport {7000-9000, 27000-27200} ip dscp set cs5
      	    ip6 nexthdr udp udp sport {7000-9000, 27000-27200} ip6 dscp set cs5
      }
      
}

Someone try this out and see if you can get it working to begin with, then we'll move on to more advanced rules to tag stuff with DSCP

bqd · June 28, 2020, 10:59am

For egress DSCP marking, unless there's a special requirement I think it's better to do it on the device that generates the traffic as it would allow you to mark packets for individual processes which is simpler and more reliable than matching IP:port, and also the device is likely to have more resource to do the marking than the router.

moeller0 · June 28, 2020, 11:49am

While I agree in general, I also see this as an issue of trust and policy. If you trust your endpoints to play nice and fair, leave the marking to them, but if not stricter control might be in order. Since I do not know about your network I believe the best option is to explain how to do either (with a few words of the pros and cons of each approach) and then leave it to each networks admin to select the appropriate policy, no?

dlakelan · June 28, 2020, 1:49pm

besides @moeller0 s good answer, I've also found all kinds of stuff making no sense that is done by clients. for example Zoom on my wife's Mac marks all it's packets CS7. hell no. And ssh seems like it still uses the old TOS markings, and people playing games on consoles have no control.

you are free to keep dscp or change it, remap it, whatever. nftables is very flexible.

bqd · June 28, 2020, 4:26pm

I tried this years ago when I was a student sharing broadband with housemates. I was aware of the potential problem of people deliberately misusing DSCP which was mentioned in the quoted post.
That's why initially I used IP:port to identify traffic with iptables. It was time consuming, while it's relatively easy to find ports used by a game, keeping the IP list up to date was really no fun, I needed the remote IPs to prevent overmatching (e.g. a remote BT client might use a well known game port either by coincidence or on purpose).
Eventually I was fed up, so I started marking DSCP on endpoints and asked people to not misuse it, it was so much easier.

What I did for devices that didn't support DSCP marking was that I used the device's MAC address to identity its traffic and assigned a priority based on its intended use, e.g. game console would be given higher priority, NAS box doing downloads would be given lower priority. This usually worked fine, the caveat was that a game console might use P2P to download game updates, normal download would be fine as congestion tended to happen on egress queue.

When I managed a small network, I found it easier to mark DSCP on endpoints in general and override undesirable/missing DSCP values when it's needed.

To summarize the two approaches

always marking DSCP on the router - router assumes DSCP values from endpoints are always incorrect/missing, so the router does all the marking based on rules that need to be maintained.
marking DSCP on endpoints and overriding them on the router when necessary is like keeping a blacklist - router assumes DSCP values from most endpoints are correct, it only overrides incorrect/missing ones when needed.

dlakelan · June 28, 2020, 5:20pm

Those are both good strategies. And nftables make it easy to go anywhere in between the two extremes.

Once we get a few people trying this out I'll throw in some more sophisticated rules for people to test. In particular rather than using known ports, it's quite useful to look for traffic that has the right "signature". For example most game or VOIP or vid-conf traffic uses

UDP
A relatively steady flow of packets (a range of packets per second)
A relatively steady bandwidth (a range of packet sizes)

By looking for these criteria it's possible to heuristically mark streams without worrying about ports.

Whenever you've got ingress from the internet you absolutely MUST assume the DSCP is meaningless. This is actually explicit in the description of DSCP, that it's "domain" specific. So when it comes to ingress from the internet I think you really want to whitewash and re-mark all the packets. nftables makes it possible to do some smart stuff. I haven't tried this particular trick, but I think it would be possible in nftables to build a set of destinations where high priority packets are being sent from your LAN, and then remark incoming packets from those same IPs. This obviates the need for tc related hooks that have been created by others, where they copy the DSCP mark to the firewall mark and then restore it on ingress etc.

nftables is extremely powerful, and I'm hoping to get enough interest here that I can get some people to do some testing under real world gaming type conditions... My own network isn't a great place to test, because I've got gigabit fiber and run a bunch of business related stuff where breakage would be bad, but I recognize that there are many people out there who do want customizable QoS, and iptables simply can't hack it (not least in part because it has no chain attached to ingress).

bqd · June 28, 2020, 5:58pm

Agreed. I went down a somewhat similar path in the beginning, I was using layer 7 classifier , but it wasn't reliable and I had to use stateful iptables rules to make it work properly, it still produced too many incorrect matches than I could tolerate (maybe I wasn't using it correctly?), then I added IP:port to the mix to improve it. Being a lazy person, I really like a fully automated solution, so far network namesapce on Linux and Local Group Policy Editor on Windows are the simplest reliable solutions I could think of. Maybe you could find something in l7 classifier that could help with your traffic signature analysis.

dlakelan · June 28, 2020, 6:01pm

Modern internet is largely encrypted, so there is no real layer 7 analysis anymore... I use squid proxy to do a certain amount of layer 7 on the basis of domain names, but beyond that it's become kind of hopeless.

But the traffic characteristics are still valuable, and yes I think I've found ways to make this work acceptably, especially now there may be some more headroom than back in the day on 750kbps DSL or whatever.

Kherby · June 30, 2020, 3:25am

I would love to test SQM in combination with nftables and my VDSL connection as I do have a lot of gaming traffic in my network but I already failed somewhat with veth + SQM and my setup so I doubt that I'll get nftables to work. If I understand it correctly we have to replace the standard OpenWRT firewall with nftables and this seems to be only possible via compiling a fresh OpenWrt Image and creating every firewall rule from the scratch, which is sadly out of scope for me with my limited skills. For now it seems like that this kind of stuff is more or less only for the skilled and expert people.

Anyway I will look out for this thread as this is very interesting!

anon50098793 · July 1, 2020, 4:05am

so... some mods were needed to get it to "load" on 19.x.3, whether or not it works right i'm not able to test atm.

had to use non-variables ( only ) here...

changed the name of 'tag' here and/or had to use netdev for ingress to play ball

dlakelan · July 1, 2020, 3:24pm

whoops this is correct but easy fix. good catch. the name shouldn't be important, but it does need netdev.

Ok, let's take another step... let's prioritize DNS from or to various public providers. This is normally a very low bandwidth process and latency is important because it affects the responsiveness of browsing etc. So we add the following to the "tagchain"

I'm following a priority scheme where cs5 is used for games and other high priority stuff, cs6 is used for precisely matched VOIP (ie. from/to known servers). af41 is used for interactive video (zoom, jitsi etc) cs4 is used for buffered streaming video (netflix, fubo, sling, amazon). cs0 is used for best effort, and cs1 is used for bulk

ip saddr {8.8.8.8, 8.8.4.4, 1.1.1.1, 1.0.0.1, 9.9.9.9, 149.112,112,112} ip dscp set cs5
ip daddr {8.8.8.8, 8.8.4.4, 1.1.1.1, 1.0.0.1, 9.9.9.9, 149.112,112,112} ip dscp set cs5

ip6 saddr{ 2001:4860:4860::8888, 2001:4860:4860::8844, 2606:4700:4700::1111, 2606:4700:4700::1001, 2620:fe::fe, 2620:fe::9 } ip6 dscp set cs5
ip6 daddr{ 2001:4860:4860::8888, 2001:4860:4860::8844, 2606:4700:4700::1111, 2606:4700:4700::1001, 2620:fe::fe, 2620:fe::9 } ip6 dscp set cs5

## lets set a known port used by our torrent client to use cs1

udp dport 51419 ip dscp set cs1
udp sport 51419 ip dscp set cs1

## let's also downweight http and https ports that are currently cs0 and that have transferred more than say 20Mbytes 

tcp sport {http, https} ip dscp cs0 ct bytes ge 20000000 ip dscp set cs1
tcp dport {http,https} ip dscp cs0 ct bytes ge 20000000 ip dscp set cs1

In addition, let's assume that all udp streams where avgpkt is less than 500 bytes is a latency sensitive stream, except that we'll limit the total rate this rule matches to 4Mbps = 500 kbytes/second. Feel free to up this to some other limit, like maybe 1/4 of your bandwidth. With 500kbytes/second and up to 500 byte packets, we're talking a thousand packets per second or more. This should be sufficient for several players for most games.


# boost UDP packets with small average size up to some allowed bandwidth, ignoring the QUIC protocol to http and https ports:

ip protocol udp udp dport != {http,https} udp sport != {http,https} ip dscp < cs5 ct avgpkt 0-500 limit rate 500 kbytes/second ip dscp set cs5

## and limit QUIC to best effort

udp dport {http,https} ip dscp set cs0
udp sport {http,https} ip dscp set cs0

Finally, let's prioritize all the udp packets from a set of known "gaming consoles"

ip protocol udp ip saddr {192.168.1.2, 192.168.1.3, 192.168.1.4} ip dscp set cs5
## ideally here we'd do some NAT lookup and see if the daddr after NAT is one of the consoles... but it's not obvious how to do this. I think we can, or we can do this marking in a filter chain somewhere else instead of a netdev table on ingress.

ip6 nexthdr udp ip6 saddr {2001:db8::2, 2001:db8::3, 2001:db8::4} ip6 dscp set cs5
ip6 nexthdr udp ip6 daddr {2001:db8::2, 2001:db8::3, 2001:db8::4} ip6 dscp set cs5
## no NAT issue exists with public ipv6 addresses, but you have to either know your prefix, or do some kind of masking off the prefix here.. I think that's possible but I"m not sure

Give these rules a try and see if we can get something that loads and runs properly. @anon50098793 are you up for testing this? If so, can you also post your complete example script once it's running? Thanks

moeller0 · July 1, 2020, 5:22pm

I note that both fq_codel and cake in their default flow-isolation modes will boost small/sparse streams like DNS requests simply by virtue of them being sparse, so no magic filtering required, if your leaf qdisc employs the sparseness up-prioritization (this also works pretty well and natural for TCP session intitiation and ACKs, and this naturally scales to overload situations, where a flow needs to be increasingly sparse to still gain that boost, which for overload is a sane strategy).
That said, the syntax is pretty much a joy in its clarity and these are great examples...

dlakelan · July 1, 2020, 6:19pm

You can see why I insist on using nftables for the last couple years !

The sparse flow prioritization is a great feature of cake. This kind of nftables based prioritization allows such things for other qdiscs as well, for example HFSC or drr or whatever.

anon50098793 · July 1, 2020, 6:34pm

so... 'ct' rules seem to not like netdev. shown here in random 'inet' to demonstrate they do actually load... but obviously need to be restructured.

dlakelan-sample-w-fixes-tmpcopy

flush ruleset
define wan = eth1
define lan = eth0

table inet filter {
	chain input {
		type filter hook input priority 0; policy drop;
		ct state established,related accept
		iifname lo accept
		ip6 nexthdr icmpv6 accept
		ip protocol icmp icmp type { destination-unreachable, router-solicitation, router-advertisement, time-exceeded, parameter-problem } accept
		ip protocol igmp accept
		ip protocol icmp meta iifname $lan accept
		udp dport ntp iifname $lan accept
		iifname $lan ip protocol udp udp sport bootpc udp dport bootps log prefix "FIREWALL ACCEPT DHCP: " accept

		iifname $lan udp sport dhcpv6-client udp dport dhcpv6-server accept
		iifname $wan udp sport dhcpv6-server udp dport dhcpv6-client accept
		ct state new tcp dport ssh meter ssh-meter4 {ip saddr limit rate 10/minute burst 15 packets} accept
		ct state new ip6 nexthdr tcp tcp dport ssh meter ssh-meter6 {ip6 saddr limit rate 10/minute burst 15 packets} accept 
		iifname $lan tcp dport {http,https} accept
		iifname $lan udp dport domain meter dommeter4 { ip saddr limit rate 240/minute burst 240 packets} accept
		iifname $lan udp dport domain meter dommeter6 { ip6 saddr limit rate 240/minute burst 240 packets} accept
		iifname $lan tcp dport domain meter dommeter4tcp { ip saddr limit rate 240/minute burst 240 packets} accept
		iifname $lan tcp dport domain meter dommeter6tcp { ip6 saddr limit rate 240/minute burst 240 packets} accept
		counter log prefix "FIREWALL INPUT DROP: " drop
	}
	chain forward {
	    type filter hook forward priority 0; policy drop;
	    ct state established,related accept
	    iifname lo accept
	    iifname $lan oifname $wan accept
	    ip protocol tcp tcp sport {http,https} ip dscp cs0 ct bytes ge 20000000 ip dscp set cs1
	    ip protocol tcp tcp dport {http,https} ip dscp cs0 ct bytes ge 20000000 ip dscp set cs1
	    ip protocol udp udp dport != {http,https} udp sport != {http,https} ip dscp < cs5 ct avgpkt 0-500 limit rate 500 kbytes/second ip dscp set cs5
	    counter log prefix "FIREWALL FAIL FORWARDING: " drop
	}
}

table ip masq {
	chain masqout {
	    type nat hook postrouting priority 0; policy accept;
	    oifname $wan masquerade
	}
	chain masqin {
	    type nat hook prerouting priority 0; policy accept;
	}
}

table netdev tagger {
      chain wanin {
      	    type filter hook ingress device eth1 priority 0;
	    jump tagchain
      }
      chain lanin {
      	    type filter hook ingress device eth0 priority 0;
	    jump tagchain
      }
      chain tagchain {
      	    ip protocol udp udp dport {7000-9000, 27000-27200} ip dscp set cs5
      	    ip6 nexthdr udp udp dport {7000-9000, 27000-27200} ip6 dscp set cs5
      	    ip protocol udp udp sport {7000-9000, 27000-27200} ip dscp set cs5
      	    ip6 nexthdr udp udp sport {7000-9000, 27000-27200} ip6 dscp set cs5
	    ip saddr {8.8.8.8, 8.8.4.4, 1.1.1.1, 1.0.0.1, 9.9.9.9, 149.112.112.112} ip dscp set cs5
	   ip daddr {8.8.8.8, 8.8.4.4, 1.1.1.1, 1.0.0.1, 9.9.9.9, 149.112.112.112} ip dscp set cs5
           ip6 saddr{ 2001:4860:4860::8888, 2001:4860:4860::8844, 2606:4700:4700::1111, 2606:4700:4700::1001, 2620:fe::fe, 2620:fe::9 } ip6 dscp set cs5
           ip6 daddr{ 2001:4860:4860::8888, 2001:4860:4860::8844, 2606:4700:4700::1111, 2606:4700:4700::1001, 2620:fe::fe, 2620:fe::9 } ip6 dscp set cs5
	  udp dport 51419 ip dscp set cs1
	  udp sport 51419 ip dscp set cs1
	  udp dport {http,https} ip dscp set cs0
	  udp sport {http,https} ip dscp set cs0
	  ip protocol udp ip saddr {192.168.1.2, 192.168.1.3, 192.168.1.4} ip dscp set cs5
	  ip6 nexthdr udp ip6 saddr {2001:db8::2, 2001:db8::3, 2001:db8::4} ip6 dscp set cs5
	  ip6 nexthdr udp ip6 daddr {2001:db8::2, 2001:db8::3, 2001:db8::4} ip6 dscp set cs5
	}
}

dlakelan · July 1, 2020, 7:02pm

Ok, that probably makes sense, it's getting the packets before conntrack has been able to see them because it's in a netdev chain

that does limit what we can do in the ingress chain in terms of automatic rules. But we can still do those in the upstream direction no problem, and we can do them if we're willing to use a veth or have our AP on the LAN.

I'll see about reorganizing things a bit so we can take advantage of those things at least in the upstream direction.

_FailSafe · July 14, 2020, 12:01am

With nftables, is there a way to do the equivalent of what ipset allows in combination with dnsmasq?

In other words, a fairly straightforward way to build IP sets based on DNS resolution that can be plugged into an nftables statement for dynamic source/dest rules?

UPDATE 1
Doing some research to get myself up to speed on nftables. It looks pretty great, honestly! Coming from a primarily C-based language background, I'm a big fan of the nftables config syntax.

It looks like Named Sets are a big piece of my puzzle I was inquiring about. But it looks like dnsmasq was never updated to support nftables injection of sets. That said, it seems conceivable that a Named Set could be programmatically populated with a script either at startup or periodically via cron.

Also, for giggles, here's a [slightly dated] comparison of performance between iptables and nftables as rule sets scale: https://developers.redhat.com/blog/2017/04/11/benchmarking-nftables/. Intriguing!

dlakelan · July 14, 2020, 1:38am

as far as I know dnsmasq doesn't support this yet. nftables has that capability so it's straightforward to add it to dnsmasq I believe, but the expertise and documentation wasn't around making it not obvious to the maintainer of dnsmasq how to accomplish it (as of 2016):

http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2016q2/010501.html

I don't know if since then it's been revisited. Maybe if you google something up and discover more info, let us know

ldir · July 15, 2020, 8:39am

Simon said 'patches welcome' and has been killed in the rush...ah, no, my mistake, no nftables patches at all. As is often the case with open source, you want it, you code it, unless you can persuade $corp to do it and share for you.

ldir · July 15, 2020, 9:58am

This is a rant and I'm quite cross but I need to get it off of my chest.

nf'f**king'tables.

The status from my 'QoS' perspective is:

dnsmasq doesn't support nftables' named sets so any 'dynamically populated IP address'/port combinations easily achieved with iptables/ipsets is currently off the table. AFAIK there's no 'transition path'. nftables literate people need to provide patches to dnsmasq.
AFAIUI The promise of being able to mangle DSCP on ingress in nftables is only partially fulfilled in that the hook point as pre-NAT so any classifications based on internal IP addresses are still not possible. Leading to...
Workarounds for lack of easy ingress classification in the form of act_ctinfo and storing DSCPs into firewall connmarks are currently impossible in nftables, easily achieved with iptables (and a straightforward patch)

What really pisses me off is that 'act_ctinfo' and 'CONNMARK --set-dscpmark' are solving real world problems TODAY in Openwrt and '--set-dscpmark' in 'old'n'busted' iptables wasn't accepted upstream because there isn't an implementation for the new hotness of nftables. The 'new hotness' is actually stopping development of 'old'n'busted' even though 'new hotness' doesn't have the support mechanism for something that 'old'n'busted' can do with ease.

Storing stuff into connmarks from nftables apparently requires a parser 're-write' which quite frankly is beyond my C and failure to attend a computer science degree. I was hopeful at one point: Jeremy Sowden looked at '--set-dscpmark' and thought "seems simple enough" before being sent down the rabbit hole of parser changes and not seen since.

I want to like nftables. iptables is clunky, I suspect 'ctinfo_4/5layercake.qos' could be written in a much nicer way under nftables but there are some key functionality points missing AFAICT in nftables to do so.... and that's leaving aside openwrts 'fw3'.

sigh. and breath.

moeller0 · July 15, 2020, 11:19am

Let's try to make lemonade here. How about we try to get your nifty script into sqm-scripts proper while OpenWrt is still using iptables based fw3?
I think we already prepared a few things in sqm-scripts (like the automatic iptables unrolling at sqm stop) but we might still need a few more (like allowing custom tear-down routines)...