CAKE w/ DSCPs - cake-qos-simple

I’m interested in applying cake-qos-simple to only specific MACs. I’ve commented out the necessary parts and it appears to work. Is there a better way to go about it?

ether saddr { XX:XX:XX:XX:XX:XX, XX:XX:XX:XX:XX:XX } oifname wan ct mark & 128 == 0 goto store-dscp-in-conntrack

Edit:
Looks like there’s more I would need to do and I’m not sure it’s within this project’s scope. Sorry for the noise.

Looks fine to me - only classify connections coming from certain source MACs that have yet to be classified?

Seems well within scope to me. Helpful to explore ways to customize things.

I’ve been running with it since I posted about it and it’s working well. I just feel more comfortable applying conntrack to only machines that need it and that I’m in complete control of. It limits the potential for abuse and unforeseen issues.

I thought I could make it more efficient if I did it via the ‘tc filter’ portion of the script but my knowledge is lacking. It seems I would need to add a new line per MAC if I go that route. I didn’t get too far. I just have this in my notes should I get motivated to come back to this.

u32 match ether src XX:XX:XX:XX:XX:XX

@Lynx do we agree that with this script we need to add our own DSCP marks for it to be effective and fair?

I wouldn't go that route. What you could do instead is have nftables apply a mark and then match on that mark in tc - like so:
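
(Sketch only - the mark value, interface name and handles are placeholders, and the nftables rule would go in whichever chain suits your setup:)

ether saddr { XX:XX:XX:XX:XX:XX, XX:XX:XX:XX:XX:XX } oifname wan meta mark set 0x1

tc filter add dev wan parent 1: protocol all handle 0x1 fw classid 1:4

where, if I recall correctly, the minor number of the classid selects the cake tin (so 1:4 would be Voice with diffserv4).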

But I'm not sure what the advantage of that would be versus instead just limiting classification in nftables to source MAC as you do already.


Correct - it's up to the user to set the outbound DSCPs as desired using nftables and/or by setting them on the individual LAN clients. I personally have a mixture - e.g. the router sets stuff like DNS or NTP requests to 'voice' and my office computer sets Teams/Zoom traffic to a mixture of 'video' and 'voice'.
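
For example, something along these lines in nftables (illustrative only - the ports and the choice of EF are just an example; with diffserv4, EF ends up in the 'voice' tin):

oifname wan udp dport { 53, 123 } ip dscp set ef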

OK, thanks for your reply.
To be clear, the rules I want to set are not to be added in the script itself, but in LuCI traffic rules for example.

or in a separate .nft script of my own

You can tweak /root/cake-qos-simple/nft.rules and follow the template, or add your own rules. Or use LuCI.

That’s reassuring, thank you! If you think this feature is worthy of addition then I’d be happy to test.

It's difficult to generalise. My thinking is that with the custom nft.rules file users can easily add their own rules for individual use cases by following the nftables format. I don't see the point in making a separate system for cake-qos-simple that translates to nftables rules.

That’s a big reason why I chose this solution. It’s easily hackable! :smiley:

Yes it’s very powerful. I think leveraging connection tracking is super useful to enable setting of DSCPs on upload at router and/or by individual LAN clients, saving those to conntrack, and then restoring DSCPs on download. And harnessing nftables directly provides the flexibility and power that comes with that.

@moeller0 I have a question for you. I've been observing the tc cake stats whilst running speed tests and have observed, for example, the below taken during a saturating download speed test with the 'ondemand' CPU frequency governor:

root@OpenWrt-1:~# service cake-qos-simple download
qdisc cake 1: root refcnt 2 bandwidth 80Mbit diffserv4 triple-isolate nat nowash ingress no-ack-filter split-gso rtt 100ms noatm overhead 0
 Sent 151319932806 bytes 149596359 pkt (dropped 75750, overlimits 160266270 requeues 0)
 backlog 159534b 117p requeues 0
 memory used: 3656528b of 4Mb
 capacity estimate: 60Mbit
 min/max network layer size:           46 /    1500
 min/max overhead-adjusted size:       46 /    1500
 average network hdr offset:           14

                   Bulk  Best Effort        Video        Voice
  thresh          5Mbit       80Mbit       40Mbit       20Mbit
  target            5ms          5ms          5ms          5ms
  interval        100ms        100ms        100ms        100ms
  pk_delay       22.3ms       14.6ms        1.3ms        457us
  av_delay       15.7ms       11.9ms        217us        102us
  sp_delay         12us       3.32ms          6us          7us
  backlog            0b      159534b           0b           0b
  pkts           845346    147801944         7173      1017763
  bytes       964498637 150044846463      4826489    409767981
  way_inds           17     13124212            0        49586
  way_miss         2582       207280          349        54387
  way_cols            0            0            0            0
  drops            1010        74735            1            4
  marks               0            0            0            0
  ack_drop            0            0            0            0
  sp_flows            1            5            1            2
  bk_flows            0            1            0            0
  un_flows            0            0            0            0
  max_len         67326        68700        10992         3490
  quantum           300         1514         1220          610

I am trying to understand 'pk_delay', 'av_delay' and 'sp_delay' - I gather that these are short for 'peak delay', 'average delay' and 'sparse delay', and that EWMAs are involved. But what these represent is very hazy in my mind.

I'd really value a kind of summary on what these values represent and how we can interpret them.

Looking through various posts on the forum in which you comment on 'pk_delay', I gather that sometimes large values can be indicative of CPU saturation.

In what way can CPU saturation give rise to an increase in 'pk_delay', and how can we tell whether a large 'pk_delay' is a result of CPU saturation or something else? I'm still curious about whether the 'ondemand' CPU frequency governor might be hurting performance on my router.

Yes, these are EWMAs of the peak, the average and the "sparse" delay... sparse delay is for flows classified as sparse (sp_flows); these get a slight boost over other flows and hence are expected to see lower latencies. The main problem with these measures is that they are EWMAs that get updated with every packet dequeued, so after a saturating load, as long as there still is some traffic, even the peak delay value will come down relatively quickly... I always fantasize about having the peak value be the true maximum, but reset on each read (so the maximum sojourn time since the last read-out), but I have not even thought about how to accomplish that...

pk_delay weights the most recent value more heavily than av_delay does, so it will show larger swings and hence is easier to interpret. Other than that, if cake runs out of CPU cycles (or does not get scheduled quickly enough) it will not so much drop packets as show increased delay, and as I said pk_delay tends to be more affected by that.

We can only take this as a hint, not as proof; I would then correlate this with current CPU load (ideally per CPU) and potentially CPU frequency to get stronger evidence.

Thanks for this explanation. I'm still a bit hazy. Is there any documentation on these values I wonder? Might it be possible or helpful to describe the situation from the perspective of a few particular packets and what cake does with them, and how this affects these metrics?

For example:

What governs this other than available CPU time?

Is there some sense in which the connection itself has an impact?

Suppose cake bandwidth is set to 60Mbit/s and true bandwidth as in capacity on the line is 10Mbit/s, and there is an attempt to exchange a) load that is 20Mbit/s b) load that is 70Mbit/s. How does the scheduling work in those situations? Or perhaps other such hypothetical examples might be better.

I guess the source code helps here:

	/* collect delay stats */
	/* 'delay' is this packet's sojourn time: now minus its enqueue timestamp */
	delay = ktime_to_ns(ktime_sub(now, cobalt_get_enqueue_time(skb)));
	b->avge_delay = cake_ewma(b->avge_delay, delay, 8);	/* shown as av_delay */
	b->peak_delay = cake_ewma(b->peak_delay, delay,		/* shown as pk_delay: */
				  delay > b->peak_delay ? 2 : 8);	/* rises fast, decays slowly */
	b->base_delay = cake_ewma(b->base_delay, delay,		/* shown as sp_delay: */
				  delay < b->base_delay ? 2 : 8);	/* falls fast, rises slowly */

/* with shift 8 a new sample contributes 1/256 of its value, with shift 2 it contributes 1/4 */
static u64 cake_ewma(u64 avg, u64 sample, u32 shift)
{
	avg -= avg >> shift;
	avg += sample >> shift;
	return avg;
}

So pk_delay simply weights past and current value differently than av_delay, but both get updated for every dequeued packet...
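
To put numbers on it: if pk_delay currently sits at 1ms and a 20ms sample arrives, shift 2 gives 1 - 1/4 + 20/4 = 5.75ms, whereas av_delay with its shift of 8 only moves to 1 - 1/256 + 20/256 ≈ 1.07ms; and once samples drop back below pk_delay it decays with the slow shift of 8, which is why pk_delay spikes quickly but comes down only gradually.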

If I set up a timer to wake/run cake again in 10ms and the kernel takes 20ms to deliver it, there is not much I can do, as I will be 10ms later than desired; the question then is how to deal with that unsolicited delay...

I will need to read more closely through the code at https://elixir.bootlin.com/linux/latest/source/net/sched/sch_cake.c to understand that fully.

So are the delays 'pk_delay', 'av_delay' and 'sp_delay' always associated with (avoidable) delays introduced by processing limitations (in terms of taking time to get round to doing something), or is there a situation in which something else can give rise to the delays?

Would a hypothetical computer with infinite processing capability result in these values always being zero?

No, these really just are (biased) EWMAs over the sojourn times of dequeued packets... anything that makes a packet stay enqueued longer will result in delay spikes... be it that the queue cannot be serviced as quickly as desired or that the number of arriving packets exceeds the number of departing packets (and that, to a degree, is the reason for having queues, they act as "shock absorbers")

This is what I'm trying to better understand. What is it that limits the servicing of a queue?

In respect of my download example:

root@OpenWrt-1:~# service cake-qos-simple download
qdisc cake 1: root refcnt 2 bandwidth 80Mbit diffserv4 triple-isolate nat nowash ingress no-ack-filter split-gso rtt 100ms noatm overhead 0
 Sent 151319932806 bytes 149596359 pkt (dropped 75750, overlimits 160266270 requeues 0)
 backlog 159534b 117p requeues 0
 memory used: 3656528b of 4Mb
 capacity estimate: 60Mbit
 min/max network layer size:           46 /    1500
 min/max overhead-adjusted size:       46 /    1500
 average network hdr offset:           14

                   Bulk  Best Effort        Video        Voice
  thresh          5Mbit       80Mbit       40Mbit       20Mbit
  target            5ms          5ms          5ms          5ms
  interval        100ms        100ms        100ms        100ms
  pk_delay       22.3ms       14.6ms        1.3ms        457us
  av_delay       15.7ms       11.9ms        217us        102us
  sp_delay         12us       3.32ms          6us          7us
  backlog            0b      159534b           0b           0b
  pkts           845346    147801944         7173      1017763
  bytes       964498637 150044846463      4826489    409767981
  way_inds           17     13124212            0        49586
  way_miss         2582       207280          349        54387
  way_cols            0            0            0            0
  drops            1010        74735            1            4
  marks               0            0            0            0
  ack_drop            0            0            0            0
  sp_flows            1            5            1            2
  bk_flows            0            1            0            0
  un_flows            0            0            0            0
  max_len         67326        68700        10992         3490
  quantum           300         1514         1220          610

Is the servicing of the queues in any sense limited by the cake bandwidth? I initially wondered about that, but I thought the cake bandwidth dictates the rate of dropping, not the time of servicing packets in the queue.

Does a slow connection somehow limit the rate of servicing a queue? If so, how does that work?

Once the packets have been downloaded, and cake can either send them on their way or drop them, shouldn't there be essentially zero build-up of queues given sufficient processing capacity? If so, then aren't significant delay values always a measure of processing limitation?

Yes and no, conceptually cake needs to become active when:
a) a new packet arrives and is enqueued (and timestamped)
b) a packet is about to be dequeued (if operating as a traffic shaper, cake will dequeue some packets and then wait for some time before dequeuing the next packets)

And b) is clearly affected by cake's bandwidth setting.

As long as b) is not happening, packets already in the queue will "age": that is, the current time will increase while the enqueuing timestamp stays constant, so the sojourn time at eventual dequeue will increase. Any packet being enqueued while cake is waiting for the next time to dequeue packet(s) will grow the queue (actually cake uses stochastic flow queueing so there are multiple queues in parallel, but that is irrelevant here).

Anything that delays b) beyond the time it was supposed to happen will result in lower throughput than expected from the shaper setting, unless the traffic shaper tries to compensate and then dequeues a bit too much (resulting in less throughput sacrifice for being late, at the cost of transiently increasing latency a bit).

Yes, if we keep things simplistic and assume only fixed-size packets and a set shaper rate of, say, 100 packets/second, that means that if cake dequeued a packet now, the next packet will ideally be dequeued at time now+10ms. During those 10ms, packets being passed to cake will need to be put into a queue and stored... (now, during that wait time, conceptually for a given queue there could be a drop scheduled; I am not sure whether such drops will actually be executed at that time or whether they will wait for the next dequeue action, I believe the latter).
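
To make that concrete, here is a toy stand-alone simulation (not cake code, just the arithmetic with made-up rates): a shaper releasing 100 packets/second while 150 packets/second arrive - the backlog and the sojourn time of each released packet grow steadily, which is exactly what the delay EWMAs then pick up.

#include <stdio.h>

int main(void)
{
	double arrive_interval = 1000.0 / 150;	/* ms between arrivals: 150 packets/s offered */
	double depart_interval = 1000.0 / 100;	/* ms between dequeues: 100 packets/s shaper  */
	double enqueue_time[200];		/* enqueue timestamps of queued packets       */
	int head = 0, tail = 0;
	double t_arrive = 0.0, t_depart = 0.0;

	for (int step = 0; step < 30; step++) {
		/* everything that arrived before this dequeue opportunity sits in the queue */
		while (t_arrive <= t_depart && tail < 200) {
			enqueue_time[tail++] = t_arrive;
			t_arrive += arrive_interval;
		}
		/* release one packet; its sojourn time is "now" minus its enqueue timestamp */
		if (head < tail) {
			double sojourn = t_depart - enqueue_time[head++];
			printf("t=%6.1f ms  backlog=%2d pkts  sojourn=%5.1f ms\n",
			       t_depart, tail - head, sojourn);
		}
		t_depart += depart_interval;	/* shaper waits 10 ms before the next dequeue */
	}
	return 0;
}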

No, the whole idea of AQM is to better manage the queue, and the trick of sqm with its traffic shapers is to get control over the relevant queue... the only time a queue does not change noticeably (that is, the queue size will only change by one packet) is if the incoming rate and the outgoing rate are identical (with the limiting case being ingress rate < egress rate, in which case the queue size fluctuates between 0 and 1).

But the consequence of this is that with a traffic shaper we cannot disambiguate whether the queueing delay is growing because of more ingress than ideal, or because the egress rate is lower than expected/desired...

As I said, to diagnose "processing limitation" we would also like to see that, correlated with the longer queues, we also see (close to) saturated CPUs.

I guess, from my simplistic description, we could try to add a "dequeueing slack" statistic where we would EWMA the difference between desired and realized dequeueing times, with the expectation that this value will increase when we are not getting CPU access in a timely enough fashion...
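
Something like this at the point where cake actually gets to dequeue under shaping (purely a sketch - 'deq_slack' is a made-up field that does not exist in sch_cake; q->time_next_packet is the time the shaper wanted to send the next packet):

	u64 now_ns = ktime_get_ns();
	if (now_ns > q->time_next_packet)	/* we got to run later than the shaper wanted */
		b->deq_slack = cake_ewma(b->deq_slack, now_ns - q->time_next_packet, 8);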

Terrific explanation there, thank you.

So it seems that cake is queuing up packets, retaining them for some time in dependence on the shaper rate (not just processing limitation), and then removing packets from the queues and either sending them onwards or dropping them, again in dependence upon the shaper rate.

Doesn't this introduce some degree of avoidable delay between the time a packet is received by the modem and sent on its merry way, which seems to run counter to avoiding any latency increase? I am sure you are going to tell me that this is not avoidable, but I am not fully getting why yet.

I am thinking - isn't buffering only needed to compensate for processing limitation, rather than to effect the shaper rate?

Returning to the 100 packets/second example, rather than queuing packets and picking them out from the queues every 10ms, wouldn't it instead be possible to simply process every packet as quickly as possible (with queuing only if needed to deal with processing limitation) and drop packets as necessary in dependence upon the shaper rate and whether the packet is to be prioritised or not? I am supposing this should mean not seeing the sort of 10ms+ values I am seeing in the various delay statistics.