Nftables custom QoS, round 2

I tried this years ago when I was a student sharing broadband with housemates. I was aware of the potential problem of people deliberately misusing DSCP which was mentioned in the quoted post.
That's why I initially used IP:port to identify traffic with iptables. It was time consuming: while it's relatively easy to find the ports used by a game, keeping the IP list up to date was really no fun. I needed the remote IPs to prevent overmatching (e.g. a remote BT client might use a well-known game port, either by coincidence or on purpose).
Eventually I got fed up, so I started marking DSCP on the endpoints and asked people not to misuse it. It was so much easier.

For devices that didn't support DSCP marking, I used the device's MAC address to identify its traffic and assigned a priority based on its intended use, e.g. a game console would be given higher priority and a NAS box doing downloads would be given lower priority. This usually worked fine. The caveat was that a game console might use P2P to download game updates; normal downloads would be fine, as congestion tended to happen on the egress queue.

When I managed a small network, I found it easier to mark DSCP on endpoints in general and override undesirable/missing DSCP values when it's needed.

To summarize the two approaches

  • always marking DSCP on the router - router assumes DSCP values from endpoints are always incorrect/missing, so the router does all the marking based on rules that need to be maintained.
  • marking DSCP on endpoints and overriding them on the router when necessary is like keeping a blacklist - router assumes DSCP values from most endpoints are correct, it only overrides incorrect/missing ones when needed.
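As a rough illustration of the second approach, here is a minimal sketch of router-side overrides keyed on MAC address. The table name, the LAN interface eth0 and both MAC addresses are made up for the example, so substitute your own.

```nft
# Sketch only: override DSCP for two LAN devices identified by MAC.
# eth0 (LAN side) and both MAC addresses are placeholders.
table netdev overrides {
	chain lanin {
		type filter hook ingress device eth0 priority 0;

		# NAS doing bulk downloads: force everything it sends to cs1
		ether saddr 02:00:00:00:00:10 ip dscp set cs1
		ether saddr 02:00:00:00:00:10 ip6 dscp set cs1

		# game console that cannot mark its own traffic: raise its UDP to cs5
		ether saddr 02:00:00:00:00:20 ip protocol udp ip dscp set cs5
		ether saddr 02:00:00:00:00:20 ip6 nexthdr udp ip6 dscp set cs5
	}
}
```

Traffic from every other device keeps whatever DSCP the endpoint set, which is the "blacklist" behaviour described in the second bullet.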

Those are both good strategies, and nftables makes it easy to go anywhere in between the two extremes.

Once we get a few people trying this out I'll throw in some more sophisticated rules for people to test. In particular rather than using known ports, it's quite useful to look for traffic that has the right "signature". For example most game or VOIP or vid-conf traffic uses

  1. UDP
  2. A relatively steady flow of packets (a range of packets per second)
  3. A relatively steady bandwidth (a range of packet sizes)

By looking for these criteria it's possible to heuristically mark streams without worrying about ports.

Whenever you've got ingress from the internet you absolutely MUST assume the DSCP is meaningless. This is actually explicit in the description of DSCP, that it's "domain" specific. So when it comes to ingress from the internet I think you really want to whitewash and re-mark all the packets. nftables makes it possible to do some smart stuff. I haven't tried this particular trick, but I think it would be possible in nftables to build a set of destinations where high priority packets are being sent from your LAN, and then remark incoming packets from those same IPs. This obviates the need for tc related hooks that have been created by others, where they copy the DSCP mark to the firewall mark and then restore it on ingress etc.
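A minimal, untested sketch of that set-based idea, kept entirely in a netdev table so no conntrack is involved. The table and set names, the interfaces (eth0 = LAN, eth1 = WAN) and the 5 minute timeout are all assumptions for illustration.

```nft
# Sketch only, untested: remember where the LAN sends cs5 traffic,
# then re-mark packets arriving from those same addresses.
table netdev learn {
	set hiprio4 {
		type ipv4_addr
		flags dynamic, timeout
		timeout 5m
	}

	chain lanin {
		type filter hook ingress device eth0 priority 0;
		# a LAN host sent a cs5 packet: record (or refresh) the remote address
		ip dscp cs5 update @hiprio4 { ip daddr }
	}

	chain wanin {
		type filter hook ingress device eth1 priority 0;
		# wash whatever DSCP the internet sent us
		ip dscp set cs0
		# then boost packets coming back from the learned addresses
		ip saddr @hiprio4 ip dscp set cs5
	}
}
```

IPv6 would need a second set of type ipv6_addr with matching ip6 rules. Older nft releases may not recognise the dynamic flag, in which case declaring only the timeout flag may be needed.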

nftables is extremely powerful, and I'm hoping to get enough interest here that I can get some people to do some testing under real world gaming type conditions... My own network isn't a great place to test, because I've got gigabit fiber and run a bunch of business related stuff where breakage would be bad, but I recognize that there are many people out there who do want customizable QoS, and iptables simply can't hack it (not least in part because it has no chain attached to ingress).


Agreed. I went down a somewhat similar path in the beginning. I was using the layer 7 classifier, but it wasn't reliable and I had to use stateful iptables rules to make it work properly. It still produced more incorrect matches than I could tolerate (maybe I wasn't using it correctly?), so I added IP:port to the mix to improve it. Being a lazy person, I really like a fully automated solution; so far network namespaces on Linux and the Local Group Policy Editor on Windows are the simplest reliable solutions I could think of. Maybe you could find something in the l7 classifier that could help with your traffic signature analysis.

Modern internet is largely encrypted, so there is no real layer 7 analysis anymore... I use squid proxy to do a certain amount of layer 7 on the basis of domain names, but beyond that it's become kind of hopeless.

But the traffic characteristics are still valuable, and yes I think I've found ways to make this work acceptably, especially now there may be some more headroom than back in the day on 750kbps DSL or whatever.

I would love to test SQM in combination with nftables on my VDSL connection, as I have a lot of gaming traffic in my network, but I already somewhat failed with veth + SQM on my setup, so I doubt that I'll get nftables to work. If I understand it correctly, we have to replace the standard OpenWrt firewall with nftables, and this seems to be possible only by compiling a fresh OpenWrt image and creating every firewall rule from scratch, which is sadly out of scope for me with my limited skills. For now it seems that this kind of stuff is more or less only for skilled and expert people.

Anyway I will look out for this thread as this is very interesting! :slight_smile:


so... some mods were needed to get it to "load" on 19.x.3, whether or not it works right i'm not able to test atm.

had to use non-variables ( only ) here...

changed the name of 'tag' here and/or had to use netdev for ingress to play ball

whoops this is correct but easy fix. good catch. the name shouldn't be important, but it does need netdev.

Ok, let's take another step... let's prioritize DNS from or to various public providers. This is normally a very low bandwidth process and latency is important because it affects the responsiveness of browsing etc. So we add the following to the "tagchain"

I'm following a priority scheme where cs5 is used for games and other high priority stuff, and cs6 is used for precisely matched VOIP (i.e. from/to known servers). af41 is used for interactive video (zoom, jitsi etc.), cs4 for buffered streaming video (netflix, fubo, sling, amazon), cs0 for best effort, and cs1 for bulk.

ip saddr {,,,,, 149.112.112.112} ip dscp set cs5
ip daddr {,,,,, 149.112.112.112} ip dscp set cs5

ip6 saddr { 2001:4860:4860::8888, 2001:4860:4860::8844, 2606:4700:4700::1111, 2606:4700:4700::1001, 2620:fe::fe, 2620:fe::9 } ip6 dscp set cs5
ip6 daddr { 2001:4860:4860::8888, 2001:4860:4860::8844, 2606:4700:4700::1111, 2606:4700:4700::1001, 2620:fe::fe, 2620:fe::9 } ip6 dscp set cs5

## lets set a known port used by our torrent client to use cs1

udp dport 51419 ip dscp set cs1
udp sport 51419 ip dscp set cs1

## let's also downweight http and https ports that are currently cs0 and that have transferred more than say 20Mbytes 

tcp sport {http, https} ip dscp cs0 ct bytes ge 20000000 ip dscp set cs1
tcp dport {http,https} ip dscp cs0 ct bytes ge 20000000 ip dscp set cs1

In addition, let's assume that all udp streams where avgpkt is less than 500 bytes are latency sensitive streams, except that we'll limit the total rate this rule matches to 4Mbps = 500 kbytes/second. Feel free to up this to some other limit, like maybe 1/4 of your bandwidth. With 500 kbytes/second and packets of up to 500 bytes, we're talking a thousand packets per second or more. This should be sufficient for several players for most games.

# boost UDP packets with small average size up to some allowed bandwidth, ignoring the QUIC protocol to http and https ports:

ip protocol udp udp dport != {http,https} udp sport != {http,https} ip dscp < cs5 ct avgpkt 0-500 limit rate 500 kbytes/second ip dscp set cs5

## and limit QUIC to best effort

udp dport {http,https} ip dscp set cs0
udp sport {http,https} ip dscp set cs0

Finally, let's prioritize all the udp packets from a set of known "gaming consoles"

ip protocol udp ip saddr {,,} ip dscp set cs5
## ideally here we'd do some NAT lookup and see if the daddr after NAT is one of the consoles... but it's not obvious how to do this. I think we can, or we can do this marking in a filter chain somewhere else instead of a netdev table on ingress.

ip6 nexthdr udp ip6 saddr {2001:db8::2, 2001:db8::3, 2001:db8::4} ip6 dscp set cs5
ip6 nexthdr udp ip6 daddr {2001:db8::2, 2001:db8::3, 2001:db8::4} ip6 dscp set cs5
## no NAT issue exists with public ipv6 addresses, but you have to either know your prefix, or do some kind of masking off the prefix here.. I think that's possible but I'm not sure
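One way the prefix masking might look, assuming the consoles keep a fixed interface identifier (the low 64 bits, here ::2, ::3 and ::4 as in the rules above) while the delegated prefix changes. This is a sketch that has not been verified against every nft version; the table name and interfaces are placeholders.

```nft
# Sketch only: match on the low 64 bits of the address, ignoring the prefix.
# eth0 = LAN, eth1 = WAN are assumptions.
table netdev tagger6 {
	chain lanin {
		type filter hook ingress device eth0 priority 0;
		ip6 nexthdr udp ip6 saddr & ::ffff:ffff:ffff:ffff == ::2 ip6 dscp set cs5
		ip6 nexthdr udp ip6 saddr & ::ffff:ffff:ffff:ffff == ::3 ip6 dscp set cs5
		ip6 nexthdr udp ip6 saddr & ::ffff:ffff:ffff:ffff == ::4 ip6 dscp set cs5
	}

	chain wanin {
		type filter hook ingress device eth1 priority 0;
		ip6 nexthdr udp ip6 daddr & ::ffff:ffff:ffff:ffff == ::2 ip6 dscp set cs5
		ip6 nexthdr udp ip6 daddr & ::ffff:ffff:ffff:ffff == ::3 ip6 dscp set cs5
		ip6 nexthdr udp ip6 daddr & ::ffff:ffff:ffff:ffff == ::4 ip6 dscp set cs5
	}
}
```

This only helps for devices whose suffix is stable; hosts using privacy addresses change their low 64 bits and would not match.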

Give these rules a try and see if we can get something that loads and runs properly. @anon50098793 are you up for testing this? If so, can you also post your complete example script once it's running? Thanks


I note that both fq_codel and cake in their default flow-isolation modes will boost small/sparse streams like DNS requests simply by virtue of them being sparse, so no magic filtering is required if your leaf qdisc employs the sparseness up-prioritization. This also works pretty well and naturally for TCP session initiation and ACKs, and it scales naturally to overload situations, where a flow needs to be increasingly sparse to still gain that boost, which is a sane strategy under overload.
That said, the syntax is pretty much a joy in its clarity and these are great examples...


You can see why I insist on using nftables for the last couple years ! :wink:

The sparse flow prioritization is a great feature of cake. This kind of nftables based prioritization allows such things for other qdiscs as well, for example HFSC or drr or whatever.


so... 'ct' rules seem to not like netdev. shown here in random 'inet' to demonstrate they do actually load... but obviously need to be restructured.

flush ruleset
define wan = eth1
define lan = eth0

table inet filter {
	chain input {
		type filter hook input priority 0; policy drop;
		ct state established,related accept
		iifname lo accept
		ip6 nexthdr icmpv6 accept
		ip protocol icmp icmp type { destination-unreachable, router-solicitation, router-advertisement, time-exceeded, parameter-problem } accept
		ip protocol igmp accept
		ip protocol icmp meta iifname $lan accept
		udp dport ntp iifname $lan accept
		iifname $lan ip protocol udp udp sport bootpc udp dport bootps log prefix "FIREWALL ACCEPT DHCP: " accept

		iifname $lan udp sport dhcpv6-client udp dport dhcpv6-server accept
		iifname $wan udp sport dhcpv6-server udp dport dhcpv6-client accept
		ct state new tcp dport ssh meter ssh-meter4 { ip saddr limit rate 10/minute burst 15 packets } accept
		ct state new ip6 nexthdr tcp tcp dport ssh meter ssh-meter6 { ip6 saddr limit rate 10/minute burst 15 packets } accept
		iifname $lan tcp dport {http,https} accept
		iifname $lan udp dport domain meter dommeter4 { ip saddr limit rate 240/minute burst 240 packets } accept
		iifname $lan udp dport domain meter dommeter6 { ip6 saddr limit rate 240/minute burst 240 packets } accept
		iifname $lan tcp dport domain meter dommeter4tcp { ip saddr limit rate 240/minute burst 240 packets } accept
		iifname $lan tcp dport domain meter dommeter6tcp { ip6 saddr limit rate 240/minute burst 240 packets } accept
		counter log prefix "FIREWALL INPUT DROP: " drop
	}

	chain forward {
		type filter hook forward priority 0; policy drop;
		ct state established,related accept
		iifname lo accept
		iifname $lan oifname $wan accept
		ip protocol tcp tcp sport {http,https} ip dscp cs0 ct bytes ge 20000000 ip dscp set cs1
		ip protocol tcp tcp dport {http,https} ip dscp cs0 ct bytes ge 20000000 ip dscp set cs1
		ip protocol udp udp dport != {http,https} udp sport != {http,https} ip dscp < cs5 ct avgpkt 0-500 limit rate 500 kbytes/second ip dscp set cs5
		counter log prefix "FIREWALL FAIL FORWARDING: " drop
	}
}

table ip masq {
	chain masqout {
		type nat hook postrouting priority 0; policy accept;
		oifname $wan masquerade
	}

	chain masqin {
		type nat hook prerouting priority 0; policy accept;
	}
}

table netdev tagger {
	chain wanin {
		type filter hook ingress device eth1 priority 0;
		jump tagchain
	}

	chain lanin {
		type filter hook ingress device eth0 priority 0;
		jump tagchain
	}

	chain tagchain {
		ip protocol udp udp dport {7000-9000, 27000-27200} ip dscp set cs5
		ip6 nexthdr udp udp dport {7000-9000, 27000-27200} ip6 dscp set cs5
		ip protocol udp udp sport {7000-9000, 27000-27200} ip dscp set cs5
		ip6 nexthdr udp udp sport {7000-9000, 27000-27200} ip6 dscp set cs5
		ip saddr {,,,,,} ip dscp set cs5
		ip daddr {,,,,,} ip dscp set cs5
		ip6 saddr { 2001:4860:4860::8888, 2001:4860:4860::8844, 2606:4700:4700::1111, 2606:4700:4700::1001, 2620:fe::fe, 2620:fe::9 } ip6 dscp set cs5
		ip6 daddr { 2001:4860:4860::8888, 2001:4860:4860::8844, 2606:4700:4700::1111, 2606:4700:4700::1001, 2620:fe::fe, 2620:fe::9 } ip6 dscp set cs5
		udp dport 51419 ip dscp set cs1
		udp sport 51419 ip dscp set cs1
		udp dport {http,https} ip dscp set cs0
		udp sport {http,https} ip dscp set cs0
		ip protocol udp ip saddr {,,} ip dscp set cs5
		ip6 nexthdr udp ip6 saddr {2001:db8::2, 2001:db8::3, 2001:db8::4} ip6 dscp set cs5
		ip6 nexthdr udp ip6 daddr {2001:db8::2, 2001:db8::3, 2001:db8::4} ip6 dscp set cs5
	}
}

Ok, that probably makes sense: it's getting the packets before conntrack has been able to see them, because it's in a netdev chain.

That does limit what we can do in the ingress chain in terms of automatic rules. But we can still do those in the upstream direction no problem, and we can do them if we're willing to use a veth or have our AP on the LAN.

I'll see about reorganizing things a bit so we can take advantage of those things at least in the upstream direction.
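A sketch of what that reorganization could look like: the conntrack-dependent rules move into a chain on the forward hook, where conntrack has already seen the packet, while the netdev table keeps only the stateless rules. The table name, the eth1 WAN interface and the -150 (mangle) priority are assumptions, and the ct bytes / ct avgpkt matches assume conntrack accounting (net.netfilter.nf_conntrack_acct) is enabled.

```nft
# Sketch only: conntrack-based marking in the upstream direction (LAN -> WAN).
table inet cttag {
	chain fwd {
		# runs before the filter chain; policy accept so it only marks
		type filter hook forward priority -150; policy accept;

		# demote long-running http/https transfers to bulk
		oifname "eth1" ip dscp cs0 tcp dport {http, https} ct bytes ge 20000000 ip dscp set cs1

		# boost small-packet UDP flows (not QUIC), capped at 500 kbytes/second
		oifname "eth1" ip protocol udp udp dport != {http, https} ip dscp < cs5 ct avgpkt 0-500 limit rate 500 kbytes/second ip dscp set cs5
	}
}
```

Because this chain has its own table and an accept policy, it should sit alongside an existing filter table without changing what gets forwarded.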


With nftables, is there a way to do the equivalent of what ipset allows in combination with dnsmasq?

In other words, a fairly straightforward way to build IP sets based on DNS resolution that can be plugged into an nftables statement for dynamic source/dest rules?

Doing some research to get myself up to speed on nftables. It looks pretty great, honestly! Coming from a primarily C-based language background, I'm a big fan of the nftables config syntax. :slight_smile:

It looks like Named Sets are a big piece of my puzzle I was inquiring about. But it looks like dnsmasq was never updated to support nftables injection of sets. That said, it seems conceivable that a Named Set could be programmatically populated with a script either at startup or periodically via cron.
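A minimal sketch of that script-populated approach, using nft's own file syntax. The table and set names, the cs4 class and the 192.0.2.x documentation addresses are placeholders; a real script would substitute whatever it resolved.

```nft
# Sketch only: a named set with per-element timeouts, matched in both directions.
table inet dnstag {
	set video4 {
		type ipv4_addr
		flags timeout
	}

	chain fwd {
		type filter hook forward priority -150; policy accept;
		ip daddr @video4 ip dscp set cs4
		ip saddr @video4 ip dscp set cs4
	}
}

# Lines like this could be generated periodically (e.g. from cron) after
# resolving the hostnames of interest; entries expire on their own.
add element inet dnstag video4 { 192.0.2.10 timeout 1h, 192.0.2.11 timeout 1h }
```

The rules reference the set by name, so refreshing its contents does not require reloading the ruleset. The same add element command can also be issued directly through the nft command line.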

Also, for giggles, here's a [slightly dated] comparison of performance between iptables and nftables as rule sets scale: Intriguing!

As far as I know dnsmasq doesn't support this yet. nftables has that capability, so I believe it's straightforward to add it to dnsmasq, but the expertise and documentation weren't around, so it wasn't obvious to the maintainer of dnsmasq how to accomplish it (as of 2016):

I don't know if since then it's been revisited. Maybe if you google something up and discover more info, let us know :slight_smile:


Simon said 'patches welcome' and has been killed in the rush...ah, no, my mistake, no nftables patches at all. As is often the case with open source, you want it, you code it, unless you can persuade $corp to do it and share for you.

This is a rant and I'm quite cross but I need to get it off of my chest.


The status from my 'QoS' perspective is:

  1. dnsmasq doesn't support nftables' named sets so any 'dynamically populated IP address'/port combinations easily achieved with iptables/ipsets is currently off the table. AFAIK there's no 'transition path'. nftables literate people need to provide patches to dnsmasq.

  2. AFAIUI the promise of being able to mangle DSCP on ingress in nftables is only partially fulfilled, in that the hook point is pre-NAT, so any classifications based on internal IP addresses are still not possible. Leading to...

  3. Workarounds for lack of easy ingress classification in the form of act_ctinfo and storing DSCPs into firewall connmarks are currently impossible in nftables, easily achieved with iptables (and a straightforward patch)

What really pisses me off is that 'act_ctinfo' and 'CONNMARK --set-dscpmark' are solving real world problems TODAY in Openwrt and '--set-dscpmark' in 'old'n'busted' iptables wasn't accepted upstream because there isn't an implementation for the new hotness of nftables. The 'new hotness' is actually stopping development of 'old'n'busted' even though 'new hotness' doesn't have the support mechanism for something that 'old'n'busted' can do with ease.

Storing stuff into connmarks from nftables apparently requires a parser 're-write' which quite frankly is beyond my C and failure to attend a computer science degree. I was hopeful at one point: Jeremy Sowden looked at '--set-dscpmark' and thought "seems simple enough" before being sent down the rabbit hole of parser changes and not seen since.

I want to like nftables. iptables is clunky, I suspect 'ctinfo_4/5layercake.qos' could be written in a much nicer way under nftables but there are some key functionality points missing AFAICT in nftables to do so.... and that's leaving aside openwrts 'fw3'.

sigh. and breathe.


Let's try to make lemonade here. How about we try to get your nifty script into sqm-scripts proper while OpenWrt is still using iptables based fw3?
I think we already prepared a few things in sqm-scripts (like the automatic iptables unrolling at sqm stop) but we might still need a few more (like allowing custom tear-down routines)...


I don't think it's as bad as all that. I am working on a grant to massively improve COVID testing availability so I don't have lots of spare time for networking issues, but I think it's straightforward to look at egress connections that have high prio DSCP and plunk the IP addresses and ports into a set and on ingress match source ips and set dscp. it doesn't require any special act_ctinfo.
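Roughly, and only as an untested sketch, the idea could look like this for UDP. The table and set names, interfaces (eth0 = LAN, eth1 = WAN) and the 2 minute timeout are assumptions.

```nft
# Sketch only: learn remote IP . port pairs from outgoing cs5 UDP traffic,
# then mark the matching return traffic.
table inet replymark {
	set hiprio_flows4 {
		type ipv4_addr . inet_service
		flags dynamic, timeout
		timeout 2m
	}

	chain fwd {
		type filter hook forward priority -150; policy accept;

		# egress: remember (and refresh) where high priority UDP is going
		iifname "eth0" oifname "eth1" meta l4proto udp ip dscp cs5 update @hiprio_flows4 { ip daddr . udp dport }

		# ingress: packets coming back from those same endpoints get cs5
		iifname "eth1" oifname "eth0" meta l4proto udp ip saddr . udp sport @hiprio_flows4 ip dscp set cs5
	}
}
```

Since update refreshes the timeout on every outgoing packet, an entry disappears about two minutes after the LAN side stops sending to that endpoint.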

or am I being obtuse?

That sounds like essentially duplicating already existing conntrack information into a set, something the iptables CONNMARK solution handled more elegantly by simply reusing the mark field of kernel connection tracking entries.

Besides the obvious downsides like increased memory usage, there are also issues with different timeouts of set entries vs. ct entries, unrelated connections reusing the same ports etc.

It seems to me like connmark is a very limited thing, it can attach a 32 bit integer to a connection, but only ONE integer, and it can only be used by one thing at a time... like if you want to use it for copying the DSCP field, you can't use it to say mark packets going to a known set of servers or coming from low-priority machines or that have been validated by a captive portal or whatever else you might want to do with them. For example, I put all udp streams into a higher priority tin and then when they go over a certain pps I connmark them to permanently downgrade their priority. This automatically captures a LOT of interactive RTP traffic without much work.
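For illustration, the UDP example at the end of that paragraph might be written along these lines in nftables. This is a sketch rather than the actual rules in use: the 200 packets/second threshold, the 0x55 mark value, the 60 second meter timeout and the choice of cs1 for downgraded flows are all arbitrary assumptions.

```nft
# Sketch only: start UDP flows at cs5, demote a flow for good once it
# exceeds a packet rate.
table inet udpprio {
	chain fwd {
		type filter hook forward priority -150; policy accept;

		# a flow exceeding the rate gets a connmark that sticks for the connection
		meta nfproto ipv4 meta l4proto udp ct mark != 0x55 meter udp_pps { ip saddr . udp sport . ip daddr . udp dport timeout 60s limit rate over 200/second } ct mark set 0x55

		# marked flows are downgraded, all other UDP stays high priority
		meta nfproto ipv4 meta l4proto udp ct mark 0x55 ip dscp set cs1
		meta nfproto ipv4 meta l4proto udp ct mark != 0x55 ip dscp set cs5
	}
}
```

Note that ct mark set overwrites the whole 32 bit mark here, which is exactly the "one use at a time" limitation described above.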

I don't really see that much in the way of downsides to the nftables approach. It basically comes down to the difference between having composable primitives that are general purpose, and fast but limited special purpose things. In general I think composable primitives win out in the long run, but for certain people with machines that have limited resources the special purpose thing can be better at the moment.

For the most part, the nftables sets are lightweight, I mean maybe if you've got a million simultaneous connections you'll run into memory problems on small routers. My RPi4 has 4GB of RAM and cost less than most enthusiast all-in-wonder routers so I guess I don't care about that myself. If you want to play around with this level of network nerdery I think it's worthwhile to invest in somewhat higher performance hardware than say the low end gl-inet travel routers or the like.

Also, I think the best way to handle ingress is to convert it to egress on the LAN side interface... First do simple total bandwidth throttling on the ingress side, and then do reprioritization on the egress side. This could be as simple as a TBF on the IFB associated with the WAN interface, and then a DRR on the egress, or even a Cake on egress if you've got the CPU.

I feel like in some way my dnsmasq/ipsets post kicked a hornet's nest here. Perhaps I am misreading the correlation, too. It certainly was not my intention to stir things up, if that was related.

Regardless, my interest in Nftables is in no way meant to be a jab at anyone’s blood, sweat, and tears that have gone into making iptables + SQM the awesome combination it is currently. I would wager most would agree with that sentiment. Speaking for myself, I am a tinkerer and find enjoyment in fiddling with new, often "bleeding edge" stuff. Yeah, I want my internet connection to work well, and it does, but seeing if I can make it run even better is a motivation that has turned into somewhat of a hobby of mine.

Back to the matter at hand, I certainly do not want to see betterment of iptables taking a hit at the expense of nftables. Iptables is out there, all over the place, and it should continue to get attention. I am sorry to hear of the stalled development because of 'new hotness'. But, it is the nature of IT for "things" to iterate and [hopefully] improve. We all know that does not happen by sticking to one platform/tool/ecosystem forever, no matter how good it is today. Otherwise, we would all still be using "operating systems" like GEOS and trying to reach outside our own four walls with 300 baud modems. Yes, those things worked at the time, but thankfully tinkerers around the world decided not to stay with "good enough" forever.

Please know I am not trying to lecture anyone here. I am furthest from the smartest person in this virtual room. On that point, I will admit it right here and now... I have a hard time understanding the intricacies of iptables. Heck, I have a hard time even understanding the basics of it at times. I mean no offense to anyone reading this, but iptables syntax seems complicated. I work in IT and can develop in multiple languages. But, for some dang reason, iptables confounds me at times. I can look at nftables syntax now after only about a week of reading up on it and I get it. It just makes sense to me. Maybe not to everyone else, but that leads me to my next point.

Options. I would like to see nftables get to the point of being equivalent to iptables, whether that's next month or years from now, and I'm sure it will. But I think it's important for there to be options. For those that get iptables, use the heck out of it. For those who just cannot wrap their heads around all of it, having another workable alternative might be a better fit. I am not proposing advancement of nftables be at the cost of iptables, but I would like to see people continue to tinker with nftables and figure out what needs to happen to make it feature-equivalent to iptables.

Wrapping this up... @ldir the work you've done to get the ctinfo_4/5layercake is flat-out awesome. I am using the ctinfo_5layercake configuration now and it is the finest I have ever seen my internet connection operate at. The improvement in responsiveness is great. I have the utmost respect for you, and many others here, and I am 100% behind @moeller0's suggestion to "make lemonade" :slight_smile: At the end of the day, I do hope those others like @dlakelan will continue pursuing ways to help make nftables more well-rounded to bring parity to the amazingness that SQM + iptables is today.

Sincerely hoping for no hard feelings here. :+1:

P.S. You might find it "interesting" to note that NAT modules appear to be missing for nftables in the latest kernel 5.4 builds at the moment: [kernel 5.4.x | nft] NAT not working due to missing kmods :smirk:
