CAKE w/ DSCPs - cake-qos-simple

Yes and no. Explicit congestion notification (ECN) uses a 2-bit-wide bitfield within the IP header (carved out of the old IPv4 type-of-service (TOS) byte, or the IPv6 traffic class (tclass)). A 2-bit-wide bitfield can encode 4 codepoints, which for ECN are named in RFC 3168:

5.  Explicit Congestion Notification in IP

   This document specifies that the Internet provide a congestion
   indication for incipient congestion (as in RED and earlier work
   [RJ90]) where the notification can sometimes be through marking
   packets rather than dropping them.  This uses an ECN field in the IP
   header with two bits, making four ECN codepoints, '00' to '11'.  The
   ECN-Capable Transport (ECT) codepoints '10' and '01' are set by the
   data sender to indicate that the end-points of the transport protocol
   are ECN-capable; we call them ECT(0) and ECT(1) respectively.  The
   phrase "the ECT codepoint" in this document refers to either of the
   two ECT codepoints.  Routers treat the ECT(0) and ECT(1) codepoints
   as equivalent.  Senders are free to use either the ECT(0) or the
   ECT(1) codepoint to indicate ECT, on a packet-by-packet basis.
   The use of both the two codepoints for ECT, ECT(0) and ECT(1), is
   motivated primarily by the desire to allow mechanisms for the data
   sender to verify that network elements are not erasing the CE
   codepoint, and that data receivers are properly reporting to the
   sender the receipt of packets with the CE codepoint set, as required
   by the transport protocol.  Guidelines for the senders and receivers
   to differentiate between the ECT(0) and ECT(1) codepoints will be
   addressed in separate documents, for each transport protocol.  In
   particular, this document does not address mechanisms for TCP end-
   nodes to differentiate between the ECT(0) and ECT(1) codepoints.
   Protocols and senders that only require a single ECT codepoint SHOULD
   use ECT(0).

   The not-ECT codepoint '00' indicates a packet that is not using ECN.
   The CE codepoint '11' is set by a router to indicate congestion to
   the end nodes.  Routers that have a packet arriving at a full queue
   drop the packet, just as they do in the absence of ECN.

      +-----+-----+
      | ECN FIELD |
      +-----+-----+
        ECT   CE         [Obsolete] RFC 2481 names for the ECN bits.
         0     0         Not-ECT
         0     1         ECT(1)
         1     0         ECT(0)
         1     1         CE

      Figure 1: The ECN Field in IP.
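Decoding the field is trivial: mask the two least-significant bits of the TOS/tclass byte and look the value up in Figure 1. A minimal sketch (the helper name is made up):

```shell
# Print the RFC 3168 name of the ECN codepoint carried in a TOS/tclass byte.
# The ECN field occupies the two least-significant bits of that byte.
ecn_codepoint() {
    case $(( $1 & 3 )) in
        0) echo "Not-ECT" ;;
        1) echo "ECT(1)" ;;
        2) echo "ECT(0)" ;;
        3) echo "CE" ;;
    esac
}
ecn_codepoint 0x03   # prints: CE
```

Note that the DSCP in the upper six bits is irrelevant here; a byte of 0xb9 (EF DSCP plus ECN bits 01) still decodes as ECT(1).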

BUT since then ECT(1) has been retroactively re-defined to indicate L4S-style congestion control... To be explicit: the IETF did not honor its creed of "improving the internet for all users" with L4S, but rather bowed to the commercial interests of a select few by ratifying a really underwhelming design that falls well short of the state of the art; L4S: too little, too late. This process taught me a lot about IETF "standards" and how to use them (spoiler alert: you need to actually read them; there are really great ones among the clunkers and the engineering-by-wishful-thinking), but I digress.

No, the per-packet-overhead and MPU values are required for our traffic shaper to do its job properly; these should be set >= the true values (empirically measuring the true values can be hard, so rather err on the side of slightly too big than too small). There is really no theoretical use in playing with these values after you have found proper settings. Our traffic shaper needs to know the accountable size of every packet on the bottleneck link so it can make sure not to exceed its configured gross rate. Some link layers, like ethernet, have a minimum packet size that can be larger than the IP payload size (plus per-packet overhead) the Linux kernel tracks; unless you configure the MPU value accordingly, our traffic shaper will under-estimate the true effective size of small packets. If your traffic mix has only a few small packets you will not notice, but the more such small packets are in the mix, the more likely you are to experience bufferbloat. But since MPU only becomes relevant with small packets, it is not something you can easily "test" with a few waveform bufferbloat tests...
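The accounting rule this implies can be sketched as size = max(ip_len + overhead, mpu). This mirrors the idea, not cake's actual kernel code, and all the numbers below are illustrative:

```shell
# Accountable size the shaper must charge for one packet on the bottleneck
# link: the kernel-tracked IP length plus per-packet overhead, but never
# less than the link layer's minimum packet unit (MPU).
accountable_size() {  # args: ip_len overhead mpu
    local size=$(( $1 + $2 ))
    [ "$size" -lt "$3" ] && size=$3
    echo "$size"
}

# A full-size packet dwarfs the MPU; a tiny packet (e.g. a bare ACK) does not:
accountable_size 1500 34 84   # prints: 1534
accountable_size 30 34 84     # prints: 84
```

The second case shows why under-configured MPU only bites with small packets: for the 30-byte packet the shaper would otherwise charge 64 bytes when the link actually transmits 84.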

See above: you should set the per-packet-overhead to >= the true overhead, otherwise our traffic shaper is not able to do its intended job robustly and reliably. But for a given fixed (dominant) packet size, under-estimating the per-packet-overhead can be balanced out by setting the gross shaper rate lower (though that setting will fail to work reliably for much smaller packet sizes); see here:
https://openwrt.org/docs/guide-user/network/traffic-shaping/sqm-details#sqmlink_layer_adaptation_tab
for a discussion of how the gross shaper rate and the per-packet-overhead depend on each other.

In short: unless we have a bug somewhere, per-packet-overhead and MPU should be set >= their true (but hard to measure) sizes, while the gross shaper rate needs to be set <= the true bottleneck rate. (Note: especially for ingress shaping, it tends to be more reliable to set the shaper rate to 80-90% of the true bottleneck rate, otherwise epochs where the queues in the ISP's upstream elements overfill become too likely. That, however, is a policy question: how many of these "back-spill" events are you willing to tolerate versus the static loss in achievable throughput; "your network, your rules" applies here :wink:)
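Put together, a cake set-up following these rules of thumb might look roughly like this (a tc-cake sketch; the interface names, rates, and overhead/mpu numbers are placeholders to be replaced with your link's values):

```shell
# Egress: gross rate <= true uplink rate; overhead and mpu set >= their true values.
tc qdisc replace dev wan root cake bandwidth 20Mbit overhead 44 mpu 96

# Ingress (via an IFB device): ~85% of the true downlink rate to limit back-spill
# into the ISP's upstream queues.
tc qdisc replace dev ifb-wan root cake bandwidth 85Mbit overhead 44 mpu 96 ingress
```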

See above, but in short: ECT means ECN-capable transport, so these codepoints are relevant for ECN-style signalling (be that RFC 3168 or DCTCP/L4S style). Cake will always honor ECT, that is, it will normally not drop ECT packets but mark them CE according to the RFC 3168 marking rules. This is generally a good thing, but occasionally it helps to be able to disable this behaviour, and since cake does not offer a toggle for that, some more effort is required. In addition, cake does not offer DCTCP/L4S-style "shallow marking", and such traffic will not behave well with cake's marking rule (re-defining what a CE mark means is one of the more obvious mis-designs in L4S*), so being able to selectively "neuter" ECT(1) can be useful for cake.

*) DCTCP was only ever intended for local use in data centers where the operator controls all machines; there, re-defining what CE means is much less of a problem than extending this over the whole internet as L4S did. But then L4S was badly engineered in multiple dimensions, and this is just one of them.
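Since cake offers no toggle, one way to selectively "neuter" ECT(1), as mentioned above, is to rewrite the ECN field before cake sees the packet. A hypothetical nftables sketch (the table and chain names are made up; verify the ecn keywords against your nft version):

```
table inet neuter_ect1 {
	chain forward {
		type filter hook forward priority mangle; policy accept;
		# Rewrite ECT(1) (ECN bits 01) to Not-ECT so cake treats such
		# flows as non-ECN and drops rather than CE-marks on congestion.
		ip ecn ect1 ip ecn set not-ect
		ip6 ecn ect1 ip6 ecn set not-ect
	}
}
```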

Thank you @moeller0, I learn something new every day.

@Lynx adding "flags interval" worked! Thank you @dave14305

Ah sweet! Thanks for that feedback @Amalvajdan. I’ll push a commit with that flag to make the nftables template a little better. By the way, in case you were wondering, it made more sense to me to have users modify nftables code directly rather than code up a bridge between a config file and nftables code. Nftables is very powerful so experimenting is highly worthwhile.

@dave14305 any clue what's going on with the below?

Namely, I split my wan interface into three VLANs, with wan.1 now carrying the actual wan traffic.

But with this:

Setting up tc filter to restore DSCPs from conntrack on ingress packets on interface: 'wan.1' and redirecting to IFB interface: 'ifb-wan.1'.

I'm not seeing all the download packets in ifb-wan.1.

root@OpenWrt-1:~# tc -s filter show dev wan.1
filter parent 1: protocol all pref 49152 matchall chain 0
filter parent 1: protocol all pref 49152 matchall chain 0 handle 0x1
  not_in_hw (rule hit 10770)
        action order 1: ctinfo zone 0 continue
         index 2 ref 1 bind 1 dscp 0x0000003f 0x00000080 installed 88 sec used 0 sec firstused 88 sec DSCP set 228 error 0 CPMARK set 0
        Action statistics:
        Sent 1386963 bytes 10792 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
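For reference, a ctinfo filter like the one in this output is typically installed along these lines (a sketch only; the masks mirror the output above, and the exact syntax may vary with the iproute2 version):

```shell
# Sketch: restore the DSCP stored in conntrack (ctinfo action) on ingress
# packets, then mirror them to the IFB device for shaping.
tc qdisc add dev wan.1 handle ffff: ingress
tc filter add dev wan.1 parent ffff: matchall \
    action ctinfo dscp 0x0000003f 0x00000080 continue \
    action mirred egress redirect dev ifb-wan.1
```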

I’m not a VLAN guru, but why would your WAN be segmented instead of your LAN?

It’s because I have a downstream router and an upstream router and want to pass WiFi traffic across the wan cable to leverage the upstream router’s WiFi.

What I’m not getting is why I see download packets with VLAN ID 1 on interface wan, but on interface wan.1 I don’t see these download packets. I only see download packets that terminate at the router, not packets that terminate at end clients.

@patrakov maybe you have an idea here?

@Lynx No idea. Could you please post the whole switch and interface configuration?

On non-OpenWrt routers, it might be due to the fact that VLAN 1 is "special" by default. Does the problem disappear if you replace VLAN 1 with any other unused number like 5?

Backtracking: if I create an 802.1q device based on wan, called wan.1, and then make my wan interface use device wan.1, would you expect wan.1 to include download packets that terminate in client devices? I’m confused as to why I’m only seeing download packets that terminate at the router.

The whole thing works, it’s just that when I use tcpdump on interface wan.1 I’m not seeing what I’d expect.

And this explains why the tc mirroring isn’t picking up all the packets.

But with tcpdump on wan itself with the -e vlan switch I see all the packets correctly including the missing packets to end devices with vlan id 1.

It depends on the switch configuration. That's why I asked.

root@OpenWrt-1:~# cat /etc/config/network 

config interface 'loopback'
        option device 'lo'
        option proto 'static'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'

config globals 'globals'
        option ula_prefix 'xxx'

config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'lan1'
        list ports 'lan2'
        list ports 'lan3'
        list ports 'lan4'
        list ports 'wan.2'

config interface 'lan'
        option device 'br-lan'
        option proto 'static'
        option ipaddr '192.168.1.1'
        option netmask '255.255.255.0'
        option ip6assign '60'

config interface 'wan'
        option proto 'dhcp'
        option device 'wan.1'

config interface 'wan6'
        option proto 'dhcpv6'
        option reqaddress 'try'
        option reqprefix 'auto'
        option device 'wan.1'

config device
        option type 'bridge'
        option name 'br-guest'
        list ports 'vxguest2'
        list ports 'vxguest3'
        list ports 'wan.3'

config interface 'guest'
        option proto 'static'
        option device 'br-guest'
        option ipaddr '192.168.2.1'
        option netmask '255.255.255.0'

config interface 'vxguest2'
        option proto 'vxlan'
        option peeraddr '192.168.1.2'
        option vid '1'

config interface 'vxguest3'
        option proto 'vxlan'
        option peeraddr '192.168.1.3'
        option vid '2'

config device
        option type '8021q'
        option ifname 'wan'
        option vid '1'
        option name 'wan.1'

config device
        option type '8021q'
        option ifname 'wan'
        option vid '2'
        option name 'wan.2'

config device
        option type '8021q'
        option ifname 'wan'
        option vid '3'
        option name 'wan.3'

Thanks for posting the configuration. I think that what you hit is indeed a case of "VLAN 1 is the default" quirk, but I would need to test this to tell for sure.

EDIT: no, VLAN 1 is not the default. I don't know how to explain the missing packets.

Yes, I just tested with VLAN ID 5 and see the same issue.

Is software flow offloading relevant?

It might be, depending on which devices are included in the flowtable, and this depends on the OpenWrt version and on whether you apply the required workaround for WiFi roaming (see https://github.com/openwrt/openwrt/issues/10224#issuecomment-2316382659). See the output of nft list ruleset (search for flowtable ft).

Turn it off to eliminate all doubts.

Also test the commands from that GitHub comment that update the firewall rules to only include logical (not physical) devices in the offload, this might also resolve the packet visibility issue.

Yes this is the reason and your comment concerning nftables seems very relevant - thanks very much indeed, and impressive knowledge as ever.

With software flow offloading enabled, I see that the wan device appears in that flowtable ft declaration.

flowtable ft {
    hook ingress priority filter
    devices = { lan1, lan2, lan3, lan4, vxguest2, vxguest3, wan, wl0-ap0, wl0-ap1, wl1-ap0, wl1-ap0.sta1, wl1-ap0.sta2 }
    counter
}

I presume this is relevant?

And when disabling software flow offloading, I see the packets again.

Yes, this is relevant. The latest fw4 code puts only logical devices, not physical, into the flowtable. For me, this is:

	flowtable ft {
		hook ingress priority filter
		devices = { br-lan, pppoe-wan, wan, wwan0 }
		counter
	}

wan is there only because of the need to access the modem UI, but with up-to-date kernels, it does not matter whether the table includes wan or pppoe-wan.

Interesting. So what does an entry in this flowtable ft declaration achieve? What does it do?

https://wiki.nftables.org/wiki-nftables/index.php/Flowtables

If a packet arrives via an interface listed in the flowtable, and there is a cached entry internal to the flowtable that matches the source and destination addresses and ports, then the normal firewalling and routing logic is skipped. Instead, the relevant part of the packet is copied (possibly with the TTL decremented, decapsulation performed, and addresses changed to satisfy the NAT requirements) directly to the cached outgoing interface. No bridge port re-lookup or ARP is ever done. Collectively, this is known as flow offloading.

In short: flow offloading applies to packets arriving via these interfaces.
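Schematically, offloading needs two pieces: the flowtable with its device list, and a forward-chain rule that feeds established flows into it. A minimal sketch (device names are placeholders):

```
table inet example {
	flowtable ft {
		hook ingress priority filter
		devices = { br-lan, wan }
	}
	chain forward {
		type filter hook forward priority filter; policy accept;
		# Established TCP/UDP flows bypass this chain on subsequent packets.
		ip protocol { tcp, udp } flow add @ft
	}
}
```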

Very helpful and makes sense. Thank you.

But regarding:

I don’t get this bit.

Thanks, so it looks like software flow offloading steals packets even before qdiscs get hold of them? If that is the case, it would be sub-optimal for a future where SQM and SFO cooperate.