CAKE w/ DSCPs - cake-qos-simple

As the on-duty fly in the ointment: the ipsets approach is as elegant as it is dangerous. The kernel will match by IP address, but in the CDN world we live in, multiple services might map onto the same IP address. That is not an argument against it as one tool in the hands of consenting and informed adults, but I recommend caution when thinking about offering default domain names to populate such an ipset...
As always, I would also recommend carefully checking, for each and every element added to the ipset, whether special-casing it is actually an improvement overall (I understand that this is mostly about default down-prioritization, so any fall-out will be limited). The point is that the normal per-IP, per-flow isolation already works pretty well for most applications, so the traditional QoS hierarchies of old might not actually be needed.


If setting DNS/NTP to voice is actually pointless (which I suspect it may be, given cake's sparse flow boost), then presumably the simplest approach would be to have LAN clients set DSCPs and just work with that, and for me to remove all attempts at classification from cake-qos-simple. But then there are some LAN clients that can't set DSCPs, like televisions or IoT devices. So perhaps offering the limited degree of classification I already offer (to complement DSCP-setting by LAN clients) is sufficient:

        # correspondence between protocol, destination port and DSCPs
        # the format is:
        # 'protocol' . 'destination port' : goto dscp_set_bulk OR dscp_set_besteffort OR dscp_set_video OR dscp_set_voice
        define PROTO_DPORT_DSCP_MAP = {
                tcp . 53 : goto dscp_set_voice,  # DNS
                udp . 53 : goto dscp_set_voice,  # DNS
                tcp . 853 : goto dscp_set_voice, # DNS-over-TLS
                udp . 853 : goto dscp_set_voice, # DNS-over-TLS
                udp . 123 : goto dscp_set_voice  # NTP
        }

        # local MAC addresses to set to bulk (e.g. IoT devices)
        # replace MAC address below with comma separated entries
        define BULK_MACS = {
                02:00:00:00:00:00
        }

Don’t forget “caveman”. :kissing_heart:

I neglected to mention that I have patched dnsmasq to pass the TTL when adding the element to the set, so the impact is usually short-lived (5 minutes or less). But the point is valid.

Once I have everything running the way I want, I tend to abandon it all and go back to besteffort, which is even easier now that Comcast has fixed the inbound CS1 problem. No washing needed.

Well, that is my point: if for your use case this re-marking yields noticeably more responsive internet usage, then it is not pointless, but does it? I think answering that question first might make some sense, no? My gut feeling is that over a wired network this is likely in the noise, i.e. not noticeable one way or the other. Over a WiFi network this might be different, but here one needs to keep the trade-off in sight: WiFi, with its listen-before-talk and per-TXOP determination of the coding scheme, has a considerable per-TXOP overhead, and wasting that overhead on a single NTP or DNS packet can easily result in a noticeable reduction in the total possible throughput on that channel (and nearby channels). (One of the schemes WiFi uses to ameliorate these rather high per-TXOP costs is to send more data per TXOP, which increases throughput but also introduces noticeably higher jitter over the air than over, e.g., switched Ethernet, but I digress.)
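To make the per-TXOP overhead argument concrete, here is a back-of-the-envelope sketch. The overhead and PHY rate are hypothetical placeholders chosen for illustration, not measurements; only the shape of the result matters: a fixed per-TXOP cost dwarfs a single small packet, while aggregation amortizes it.

```python
# Rough airtime-efficiency sketch for one Wi-Fi transmit opportunity (TXOP).
# All numbers are HYPOTHETICAL placeholders, not measured values: real per-TXOP
# overhead depends on PHY, contention, preamble and ACK timing.

def txop_efficiency(payload_bytes: int, phy_rate_mbps: float, overhead_us: float) -> float:
    """Fraction of a TXOP's airtime spent actually carrying payload bits."""
    payload_us = payload_bytes * 8 / phy_rate_mbps  # 1 Mbit/s == 1 bit per microsecond
    return payload_us / (overhead_us + payload_us)

OVERHEAD_US = 100.0    # hypothetical fixed cost per TXOP (contention, preamble, ACK)
PHY_RATE_MBPS = 400.0  # hypothetical PHY rate

single_dns = txop_efficiency(80, PHY_RATE_MBPS, OVERHEAD_US)        # one small DNS/NTP packet
aggregate = txop_efficiency(64 * 1500, PHY_RATE_MBPS, OVERHEAD_US)  # ~64 full-size frames aggregated

print(f"single small packet: {single_dns:.1%} of airtime is payload")
print(f"aggregated burst:    {aggregate:.1%} of airtime is payload")
```

Plugging in real PHY parameters changes the numbers, but not the trade-off: bigger aggregates mean better efficiency and longer, more variable TXOPs.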

You can always add firewall rules for those? And only IoT devices that are used interactively will ever care about how fast DNS/NTP are, no? In any case, nft rules in cake-qos-simple or in FW4 both suffer from the fact that on the path from the actual device to the router, whatever DSCP the device uses is in effect, including any side-effects on WiFi AC selection. That is, if an IoT device sends bulk data marked CS7, that will map to the highest-priority WiFi AC (AC_VO) and hence hog TXOPs (unless you use a specially tailored qos_map to counteract such shenanigans*).

*) I am beginning to see why a dedicated SSID for IoT devices might make a ton of sense (now I only need IoT devices to put this theory to a test :wink: ).
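A minimal sketch of how a DSCP can end up selecting a WiFi access category, assuming the classic precedence-based default where the user priority (UP) is simply the top three DSCP bits. Real drivers, kernel versions and any configured qos_map may map differently. The second function applies only the CS1/LE split asked about further down the thread (CS1 to UP 0/AC_BE, LE to UP 1/AC_BK), which is the direction RFC 8622 took when it updated RFC 8325's WiFi mapping; everything else still uses precedence.

```python
# Sketch: DSCP -> 802.1D user priority (UP) -> Wi-Fi access category (AC).
# ASSUMPTION: classic precedence-based default, UP = DSCP >> 3.
# Real drivers, kernel versions and any configured qos_map may differ.

# Standard 802.11 EDCA table: UP -> access category
UP_TO_AC = {
    1: "AC_BK", 2: "AC_BK",
    0: "AC_BE", 3: "AC_BE",
    4: "AC_VI", 5: "AC_VI",
    6: "AC_VO", 7: "AC_VO",
}

def up_precedence(dscp: int) -> int:
    """Use the top three DSCP bits (the old IP precedence) as the UP."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp >> 3

def up_with_le_update(dscp: int) -> int:
    """Precedence mapping plus only the CS1/LE split:
    CS1 (8) -> UP 0 (best effort), LE (1) -> UP 1 (background)."""
    return {8: 0, 1: 1}.get(dscp, up_precedence(dscp))

examples = {"CS0": 0, "LE": 1, "CS1": 8, "EF": 46, "CS6": 48, "CS7": 56}
for name, dscp in examples.items():
    print(f"{name:4} {UP_TO_AC[up_precedence(dscp)]:6} {UP_TO_AC[up_with_le_update(dscp)]}")
```

Under the assumed precedence mapping, a bulk upload marked CS7 lands in AC_VO, while EF only reaches AC_VI, because only the top three bits are looked at.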

+1; unfrozen caveman at that :wink:

Oh, I have zero doubts that this tool is in 'good hands' with you; my concern is more about people starting out with the whole QoS game who are not on top of the potential side-effects.

:wink: I typically end up running diffserv3 but do no active DSCP management; this is mostly so I can run quick tests easily. Sure, my ISP throws me an occasional curve-ball, like ICMP error messages coming in as CS6, but honestly these are rare enough not to bother, and I think CS6 is just fine for true network errors (it's not my ISP's fault that some of these errors come from my artificial TTL games in MTR and friends).

I am still impressed by Comcast's, and specifically @jlivingood's, action in that regard, and I feel somewhat guilty that we did not try to contact Comcast earlier (though personally, I never lived in Comcast's territory and was never their customer, which made it somewhat awkward to report an issue).


DSCPs affecting WiFi is funky. Should DSCPs be zeroed before packets go over WiFi? Or should WiFi be instructed to ignore such signals? There seems to be a disconnect arising from conflating DSCPs meant for the WAN with signals for the LAN.

This post from @moeller0 has a lot of references to the complexities involved.

Read the RFC before bed for maximum REM sleep.

I had forgotten about the differences between CS1 and LE. Is it preferred to put our bulk traffic in CS1 (AC_BE) or LE (AC_BK)?


Oh, cool, I thought at least one TV was negotiating ECN. I wonder if it is on universally for webOS?

I discovered recently that when using pedit, you need to tell it at the very end to recalculate the checksum on IPv4 when scribbling on the ECN or DiffServ bits. You don't need it on IPv6.

I had kvetched to Comcast a few times about the CS1 remapping issue over the last decade. Glad it is fixed now (thx @jlivingood!!). But my next question would be: which codepoints (like NQB) actually survive transit over Comcast now?

Yes. @Lynx came to the same conclusion: the IPv4 header checksum needs to be recalculated, while IPv6 is fine because it lacks a header checksum altogether. As the TOS/traffic class byte (unlike, e.g., the source and destination addresses) is not included in the pseudo-header used for the TCP or UDP checksum calculation, those checksums do not need to be refreshed after DSCP/ECN manipulations.
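Here is a minimal sketch of why rewriting the TOS byte breaks the IPv4 header checksum and how it can be repaired, either by full recomputation (RFC 1071 one's-complement sum) or by the incremental update from RFC 1624. It is an illustration of the arithmetic, not what pedit or the kernel literally execute; `set_tos` is a hypothetical helper name.

```python
# Sketch: rewriting DSCP/ECN (the TOS byte) and refreshing the IPv4 header checksum.
# ipv4_header_checksum: full recomputation (RFC 1071 style one's-complement sum).
# incremental_update:   RFC 1624 eqn. 3, HC' = ~(~HC + ~m + m').

def ones_complement_sum16(data: bytes) -> int:
    """Sum 16-bit big-endian words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total

def ipv4_header_checksum(header: bytes) -> int:
    """Checksum computed with the checksum field (bytes 10-11) treated as zero."""
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    return ~ones_complement_sum16(zeroed) & 0xFFFF

def incremental_update(old_csum: int, old_word: int, new_word: int) -> int:
    """Update a checksum after one 16-bit word changed from old_word to new_word."""
    s = (~old_csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    s = (s & 0xFFFF) + (s >> 16)
    s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def set_tos(header: bytes, tos: int) -> bytes:
    """Hypothetical helper: write a new TOS byte (DSCP << 2 | ECN), fix the checksum."""
    old_csum = int.from_bytes(header[10:12], "big")
    old_word = int.from_bytes(header[0:2], "big")  # version/IHL + TOS
    new_word = (header[0] << 8) | tos
    new_csum = incremental_update(old_csum, old_word, new_word)
    return bytes([header[0], tos]) + header[2:10] + new_csum.to_bytes(2, "big") + header[12:]

# Example 20-byte IPv4 header (UDP, 192.168.0.1 -> 192.168.0.199), checksum filled in
hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
hdr = hdr[:10] + ipv4_header_checksum(hdr).to_bytes(2, "big") + hdr[12:]
print(hex(ipv4_header_checksum(hdr)))  # 0xb861

ef = set_tos(hdr, 46 << 2)  # DSCP EF, ECN Not-ECT
assert int.from_bytes(ef[10:12], "big") == ipv4_header_checksum(ef)
```

Nothing equivalent is needed for IPv6, and the TCP/UDP checksums stay valid in both cases because the pseudo-header does not cover the TOS/traffic class byte.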

I think a few things in Comcast's focus had changed recently, and there was more intuitive understanding of the consequences of mis-marking packets as background... in short, there are better and worse times to get such issues fixed...

Side note: I currently see ECT(1) packets coming into my network from my ISP, and my ISP does not really bother with these bits, so I am having a hard time figuring out whether this is datacenters leaking DCTCP or an inartful attempt at DSCP re-mapping by some intermediate AS or even my ISP, but I digress.

I wonder whether the TV issue might be only 'apparent' ECN: that is, instead of intentional ECT(0) marking (to which cake responds by setting CE instead of dropping), these could just be inartful DSCP/TOS-byte re-markings (like trying to set DSCP 1 but writing 1 into the whole TOS byte), or worse, DCTCP somehow escaping into the wild...

Good question, but also inconvenient to test... I guess one could use irtt/flent (with manually set DSCP values), capture the irtt traffic, and look at the DSCPs in the packets; however, that would be easier if the payload contained the intended DSCP as well...

Capturing that particular TV's traffic to see whether a rate reduction happened (CWR) and whether it echoed the ECN bits properly would be good. BBR still just plain ignores the ECN bits.

I was seeing (from my LibreQoS perspective) more ECN than could be accounted for by Apple devices alone.

what codepoints actually survive

Right now nothing: any inbound DSCP marks are bleached at domain boundaries. In the next couple of months we will begin testing DSCP-45 (NQB) for inbound packets, but the plan is still in development because we rely on some readiness/testing by the core network team, and associated network policies are changed very carefully. That will likely mean we test it with a small number of peers initially, maybe even just a few interconnects at first, soak it to see what happens, then expand gradually. We have been testing DSCP-45 inside the network, currently to an internal DNS server, to test out LL via low-bitrate UDP/53 traffic.

This is relatively easy to test with tcpdump:

        # catch non-zero DSCP
        tcpdump -v -n -i pppoe-wan '(ip6 and (ip6[0:2] & 0xfc0) >> 6 != 0) or (ip and (ip[1] & 0xfc) >> 2 != 0)' # NOT CS0

If run on a router with the appropriate WAN interface (pppoe-wan here), this will show all IPv4 and IPv6 packets that carry a non-zero DSCP.

Here are a few invocations to look more closely at the ECN bitfield:

        # catch ECN
        tcpdump -v -n -i pppoe-wan '(ip6 and (ip6[0:2] & 0x30) >> 4 == 0) or (ip and (ip[1] & 0x3) == 0)' # Not-ECT

        tcpdump -v -n -i pppoe-wan '(ip6 and (ip6[0:2] & 0x30) >> 4 != 0) or (ip and (ip[1] & 0x3) != 0)' # NOT Not-ECT
        tcpdump -v -n -i pppoe-wan '(ip6 and (ip6[0:2] & 0x30) >> 4 == 1) or (ip and (ip[1] & 0x3) == 1)' # ECT(1)
        tcpdump -v -n -i pppoe-wan '(ip6 and (ip6[0:2] & 0x30) >> 4 == 2) or (ip and (ip[1] & 0x3) == 2)' # ECT(0)
        tcpdump -v -n -i pppoe-wan '(ip6 and (ip6[0:2] & 0x30) >> 4 == 3) or (ip and (ip[1] & 0x3) == 3)' # CE
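The masks and shifts in these filters can be checked offline. This sketch applies the same arithmetic to a raw IPv4 TOS byte and to the first 16 bits of an IPv6 header (version, traffic class, top of the flow label). The last IPv4 pair also illustrates the 'DSCP 1 versus TOS 1' confusion hypothesized earlier: writing 1 into the whole TOS byte yields ECT(1), not LE.

```python
# Sketch: the bit arithmetic behind the tcpdump DSCP/ECN filters.
# IPv4:   ip[1]     = TOS byte = DSCP (6 bits) | ECN (2 bits)
# IPv6:   ip6[0:2]  = version (4) | traffic class (8) | top 4 bits of flow label

ECN_NAMES = {0: "Not-ECT", 1: "ECT(1)", 2: "ECT(0)", 3: "CE"}

def decode_ipv4_tos(tos: int) -> tuple:
    dscp = (tos & 0xFC) >> 2  # same as (ip[1] & 0xfc) >> 2
    ecn = tos & 0x03          # same as ip[1] & 0x3
    return dscp, ECN_NAMES[ecn]

def decode_ipv6_first16(word: int) -> tuple:
    dscp = (word & 0x0FC0) >> 6  # same as (ip6[0:2] & 0xfc0) >> 6
    ecn = (word & 0x0030) >> 4   # same as (ip6[0:2] & 0x30) >> 4
    return dscp, ECN_NAMES[ecn]

print(decode_ipv4_tos(0xB8))        # (46, 'Not-ECT')  EF
print(decode_ipv4_tos(0x01))        # (0, 'ECT(1)')    "TOS 1" written where DSCP 1 was meant
print(decode_ipv4_tos(1 << 2))      # (1, 'Not-ECT')   DSCP 1 (LE) written correctly
print(decode_ipv6_first16(0x6B80))  # (46, 'Not-ECT')  version 6, traffic class 0xB8
```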

My gut feeling is that the reports of the lack of RFC 3168 signaling over the existing internet might not have been all that objective...
I am currently puzzling over a number of ECT(1)-marked packets hitting my ingress, and over some Steam downloads that accumulate 17 million CE marks for ~29 GB over a 105 Mbps (~13 MB/s) link. My gut feeling is that this smells like too many CE marks, but I have neither tested ECE/CWR nor tried to calculate how many marks I should expect...
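A first rough step toward that calculation is the fraction of packets that carried a CE mark. This sketch assumes "29 GB" means 29e9 bytes and that every packet is full-size (1500 bytes); smaller packets would mean more packets and therefore a lower marked fraction. It says nothing about how many marks to *expect*, which depends on the AQM and the sender's congestion control.

```python
# Back-of-the-envelope: what fraction of packets carried a CE mark?
# ASSUMPTIONS: 29 GB == 29e9 bytes, and all packets are full-size (1500 bytes).
# Smaller average packets would mean more packets and a lower marked fraction.

def ce_mark_fraction(total_bytes: float, ce_marks: float, packet_bytes: int = 1500) -> float:
    """Marks divided by the estimated number of packets."""
    packets = total_bytes / packet_bytes
    return ce_marks / packets

frac = ce_mark_fraction(29e9, 17e6)
print(f"~{frac:.0%} of packets CE-marked")  # ~88% of packets CE-marked
```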

I predict this is working reasonably well. After all, without any greedy traffic in the L-queue this is essentially classic prioritization (or 'conditional' prioritization, to use a term from the DOCSIS standards).
Questions:

  1. Is this operational in both directions?
  2. Will this DSCP-45 path be between Comcast's DNS servers and the forwarder on the DOCSIS router? Put differently, will these packets traverse a WiFi link or not?

DSCP-45 will eventually work in both directions (out of the CM and into the CM). As I noted, we need to sort inbound policy but we will send outbound 45 (though presumably most peers will bleach it).

In terms of WiFi, yes it should traverse the wireless network fine in most cases - but we're doing a bunch of testing of that right now in the LL trials so I can't say 100% just yet.

This is a heads up that this script does not work for egress if you are using fastpath (in nftables it's called flowtables). Below are some charts where I'm doing a line rate download on a 1000/35 connection. In the charts you'll see two downloads. In the first download I have fastpath turned off. In the second download I have fastpath turned on.

Here is the bulk ingress tin. You can see that the packets are put in the bulk tin in both cases:

Here is the bulk egress tin. You can see there are no packets when fastpath is on:

Here is the best effort egress tin, which confirms it:

And finally, the chart of the flows in the bulk egress tin. I am using 150 connections in the download. It is interesting:

How do you assign these downloads to the bulk tin in your rules?

What mark is shown in /proc/net/nf_conntrack for all these connections?
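To answer that systematically, one could pull the `mark=` fields out of `/proc/net/nf_conntrack` (printed there in decimal) and decode them. The bit layout below is hypothetical: DSCP in the low 6 bits (mask 0x3f) plus a "DSCP was saved" flag bit (0x40); check the masks actually used in your cake-qos-simple configuration. The decoding roughly mirrors how tc's ctinfo action restores a DSCP from a 6-bit field of the conntrack mark, gated by a state mask. The sample line is made up.

```python
# Sketch: extract mark=N from nf_conntrack lines and decode a DSCP stored in it.
# HYPOTHETICAL layout: DSCP in mask 0x3f, "saved" flag in 0x40. Adjust to your setup.
import re

DSCP_MASK = 0x3F   # hypothetical
STATE_MASK = 0x40  # hypothetical

def mask_shift(mask: int) -> int:
    """Position of the lowest set bit, i.e. how far the DSCP field is shifted."""
    return (mask & -mask).bit_length() - 1

def dscp_from_mark(mark: int):
    """Return the stored DSCP, or None if the state bit says nothing was saved."""
    if not mark & STATE_MASK:
        return None
    return (mark & DSCP_MASK) >> mask_shift(DSCP_MASK)

def marks_from_conntrack(text: str):
    """Yield the integer value of every mark=N field (printed in decimal)."""
    for m in re.finditer(r"\bmark=(\d+)", text):
        yield int(m.group(1))

# Made-up sample line in the /proc/net/nf_conntrack style
sample = ("ipv4     2 tcp      6 431999 ESTABLISHED src=192.168.1.10 dst=203.0.113.5 "
          "sport=50000 dport=443 src=203.0.113.5 dst=198.51.100.2 sport=443 dport=50000 "
          "[ASSURED] mark=72 zone=0 use=2")

# On a router, one might instead read: open("/proc/net/nf_conntrack").read()
for mark in marks_from_conntrack(sample):
    print(mark, dscp_from_mark(mark))  # 72 -> state bit set, DSCP 8 (CS1)
```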