Fine-tune your CAKE setup

Description:
This post details the results of a series of ROUGH benchmarks conducted on a Raspberry Pi 4B running OpenWrt 23.05.5 to optimize CAKE parameters for low latency and fair traffic distribution among devices.

Test Setup:
Hardware: Raspberry Pi 4B
Firmware: OpenWrt 23.05.5
PPPoE WAN Connection: 20 Mbps download, 20 Mbps upload. Offload: Disabled.

SQM file:

config queue
	option enabled '1'
	option interface 'pppoe-wan'
	option download '19500'
	option upload '19500'
	option debug_logging '0'
	option verbosity '5'
	option qdisc 'cake'
	option script 'layer_cake.qos'
	option qdisc_advanced '1'
	option squash_dscp '1'
	option squash_ingress '1'
	option ingress_ecn 'ECN'
	option egress_ecn 'NOECN'
	option qdisc_really_really_advanced '1'
	option iqdisc_opts 'nat ingress diffserv8 dual-dsthost'
	option eqdisc_opts 'nat ack-filter diffserv8 dual-srchost'
	option linklayer 'none'

By squashing and ignoring incoming DSCP values, all traffic is treated equally, allowing CAKE to prioritize traffic based on packet size and type, rather than relying on pre-assigned DSCP markings.

Qdisc Options (ingress): nat ingress diffserv8 dual-dsthost
Qdisc Options (egress): nat ack-filter diffserv8 dual-srchost

CAKE leverages the kernel's NAT/conntrack subsystem to look up the internal host address associated with each packet. This lets the dual-srchost/dual-dsthost host-isolation modes classify and share bandwidth accurately per internal host, even behind NAT.

Ingress mode improves responsiveness during high download activity: it allows the shaper rate to be set closer to the actual bottleneck bandwidth, wasting less capacity. It also helps latency-sensitive traffic like games and VoIP by preventing queues from spilling back into the ISP's buffers, keeping ping stable without fluctuation.
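
For readers who want to see where each keyword lands, here is roughly what sqm-scripts builds from the config above. This is a hand-rolled sketch using the rates and device names from this post, not the exact script output; note the ingress instance ends up with wash, matching the tc -s qdisc output later in this thread:

# Egress: shape upload directly on the PPPoE interface
tc qdisc replace dev pppoe-wan root cake bandwidth 19500kbit nat ack-filter diffserv8 dual-srchost

# Ingress: redirect incoming traffic through an IFB device and shape it there
ip link add ifb4pppoe-wan type ifb
ip link set ifb4pppoe-wan up
tc qdisc replace dev pppoe-wan handle ffff: ingress
tc filter add dev pppoe-wan parent ffff: protocol all matchall action mirred egress redirect dev ifb4pppoe-wan
tc qdisc replace dev ifb4pppoe-wan root cake bandwidth 19500kbit nat wash ingress diffserv8 dual-dsthost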

Testing Methodology:
To assess performance, I simulated heavy network load using Internet Download Manager (IDM) with 16 maximum connections. Simultaneously, I measured the loading times of various websites on different devices connected to the network, both wired and wireless.

CAKE Parameter Evaluation:
Layer-Cake vs. Piece-of-Cake: Layer-Cake with DiffServ8 consistently outperformed Piece-of-Cake by approximately 40-50%, even without explicit DSCP marking. DiffServ4 provided a smaller improvement over Piece-of-Cake.
Triple-Isolate vs. Dual Mode: While both modes yielded similar overall performance, Dual mode exhibited slightly better results (~14%) on individual devices.


(Graph omitted; it is taken from the research paper. For a more in-depth analysis, please refer to "Piece of CAKE: A Comprehensive Queue Management Solution for Home Gateways".)

Ack-Filter vs. Ack-Filter-Aggressive: Minimal differences were observed between the two on the upload side. Given the potential risks of aggressive packet dropping with future TCP extensions, the standard ack-filter was preferred.

Recommendations:

  • Layer-Cake with DiffServ8: This configuration is highly recommended for optimal performance.
  • Dual Mode: This mode provides a good balance between isolation and performance.
  • Link Layer Adaptation: While the default Link Layer Adaptation settings gave me satisfactory results, manual configuration may further optimize performance. The paper also highlights the potential inaccuracy of kernel-reported overhead values, so it is strongly recommended to benchmark thoroughly and adjust the overhead parameter manually for optimal results.

Final thoughts: while these observations offer valuable insights into optimizing CAKE parameters, network conditions can vary widely. To achieve optimal performance, I encourage you to conduct thorough benchmarking tailored to your specific setup. I've also developed a user-friendly tool (Qdisc Benchmarking Tool) to facilitate the benchmarking process; you're welcome to try it.

============================================================
For readers seeking a straightforward CAKE configuration, employing layer-cake with diffserv8 could potentially provide half the latency of piece-of-cake without the need for more complex configurations.
For advanced users and gamers, stick to QoSmate or qosify.
Based on my benchmarks, the QoSmate configuration utilizing CAKE+diffserv4 has yielded the best results without the need for additional rules.


These are subtly different. The ignore option (squash_ingress) just sets up a single-priority-tier cake instance for ingress, which you then override with diffserv8, so it has no effect here.
The squash option (squash_dscp) in turn translates into cake's wash option, so with your recommended settings cake will actually select the priority tier based on the incoming DSCP values and then re-mark them to DSCP 0.

I believe the "modern" approach would be to use conntrack to assign incoming packets the same DSCP as outgoing packets instead. Note sqm-scripts does not offer that right now.
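
For the curious, a minimal sketch of that technique (as used by projects like qosmate and cake-qos-simple): store the egress DSCP in the conntrack mark with nftables, then copy it back onto ingress packets with tc's ctinfo action before they reach cake on the IFB. This assumes the pppoe-wan/ifb4pppoe-wan pair from this thread with sqm's ffff: ingress qdisc already in place, IPv4 only (IPv6 needs a matching ip6 dscp rule), and kmod-sched-ctinfo installed:

# nftables: remember each connection's outgoing DSCP in the conntrack mark
# (bit 128 flags "DSCP stored", the low 6 bits hold the DSCP itself)
nft -f - <<'EOF'
table inet dscpstore {
	chain postrouting {
		type filter hook postrouting priority mangle; policy accept;
		oifname "pppoe-wan" ct mark set ip dscp or 128
	}
}
EOF

# tc: restore the stored DSCP onto incoming packets, then hand them to the IFB
tc filter add dev pppoe-wan parent ffff: protocol all matchall \
	action ctinfo dscp 63 128 \
	action mirred egress redirect dev ifb4pppoe-wan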

The expected difference is in "fairness": both triple and dual aim for per-host fairness, but dual is more precise, since it can equalize exactly between internal hosts; triple does not know which hosts are internal and which are external, and hence equalizes over both internal and external hosts. For organic traffic both are really close, but in speedtest scenarios triple tends to be a bit looser in achieved internal fairness.
Triple's big advantage, though, is not having to know which are the desired ingress and egress directions....

This is odd; if anything, diffserv8 is computationally and memory-wise more expensive than besteffort. How did you measure performance here? Do you have packet captures of such a test, or the output of tc -s qdisc from just before and after each test?
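
(For reference, grabbing such a snapshot before and after a test run is just, with the interface names from this thread:

tc -s qdisc show dev pppoe-wan
tc -s qdisc show dev ifb4pppoe-wan
)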

I am, to be clear, not doubting your numbers but want to understand why these numbers come about...

There is no meaningful automatic adaptation; unless you specify something, you are treated to the kernel's default values, which are IIRC 14 bytes for e.g. ethernet interfaces and 0 bytes for pppoe interfaces...

For a given packet size, getting the overhead wrong can be worked around by adjusting the shaper rate, and most saturating traffic will likely be bulk down-/uploads using (close to) maximum-size packets, so an incorrect overhead would only show up in the rare situations where the link is saturated with smaller packets. If one considers that sufficiently unlikely, then sure, no explicit overhead accounting is needed. But e.g. on cable/DOCSIS links the download/upload capacity ratio is often close to the ~20:1-40:1 ratio of (small) ACK to forward data traffic, so situations where a significant portion of traffic consists of small packets are not completely theoretical either...
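
To make that concrete, a quick back-of-the-envelope sketch (the 34-byte figure is an assumed example of true-but-unaccounted per-packet overhead, not a measured value):

awk 'BEGIN {
	oh = 34                             # assumed unaccounted per-packet overhead, bytes
	n = split("40 150 1492", sz, " ")   # pure ACKs, small packets, full-size data
	for (i = 1; i <= n; i++)
		printf "%4d byte packet: shaper ignores %4.1f%% of its wire size\n", \
			sz[i], 100 * oh / (sz[i] + oh)
}'
# -> roughly 46% for 40-byte ACKs, but only ~2% for 1492-byte data packets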

Thanks @moeller0 for the further detailed explanation. I agree that leveraging conntrack to assign DSCPs offers potential performance benefits, but practically I still need to test it out, so that will be in the future. For the sake of simplicity, and to avoid the additional complexity of manually assigning DSCPs to traffic, I chose the default cake script.

Given a 20 Mbps connection and a DiffServ8 config, the RPi 4 should exhibit minimal CPU utilization. To assess real-world performance, a controlled test was conducted involving the opening of 15-25 websites with static content while a bandwidth-hogging IDM download was running in the background. This test aimed to measure the impact on website loading times under heavy network load. I disabled browser caching to ensure consistent latency measurements.

Example: single-tin Piece-of-Cake took 10 seconds to load all the sites, while Layer-Cake with DiffServ8 took 5 seconds at 100% network load. The test was conducted multiple times with different sites, and the results were similar.
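
My actual runs used IDM plus my own tooling, but for readers who want a rough stand-in for the website-loading part, something like this works (the URLs are placeholders; a no-cache header is only an approximation of disabling the browser cache):

#!/bin/sh
# Fetch a list of sites back-to-back and print the total fetch time for each
for url in https://example.com https://example.org https://example.net; do
	curl -o /dev/null -s -H 'Cache-Control: no-cache' \
		-w "%{time_total}s  $url\n" "$url"
done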

Here's the tc -s qdisc output after downloading only bulk data through IDM:

qdisc cake 80eb: dev pppoe-wan root refcnt 2 bandwidth 21Mbit diffserv8 dual-srchost nat nowash ack-filter split-gso rtt 100ms raw overhead 0
 Sent 265161 bytes 4452 pkt (dropped 0, overlimits 25 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 8256b of 4Mb
 capacity estimate: 21Mbit
 min/max network layer size:           40 /    1280
 min/max overhead-adjusted size:       40 /    1280
 average network hdr offset:            0

                  Tin 0        Tin 1        Tin 2        Tin 3        Tin 4        Tin 5        Tin 6        Tin 7
  thresh         21Mbit    18375Kbit    16078Kbit    14068Kbit    12309Kbit    10771Kbit     9424Kbit     8246Kbit
  target            5ms          5ms          5ms          5ms          5ms          5ms          5ms          5ms
  interval        100ms        100ms        100ms        100ms        100ms        100ms        100ms        100ms
  pk_delay          0us          0us         77us          0us          0us          0us          0us          0us
  av_delay          0us          0us          5us          0us          0us          0us          0us          0us
  sp_delay          0us          0us          4us          0us          0us          0us          0us          0us
  backlog            0b           0b           0b           0b           0b           0b           0b           0b
  pkts                0            0         4452            0            0            0            0            0
  bytes               0            0       265161            0            0            0            0            0
  way_inds            0            0            0            0            0            0            0            0
  way_miss            0            0           29            0            0            0            0            0
  way_cols            0            0            0            0            0            0            0            0
  drops               0            0            0            0            0            0            0            0
  marks               0            0            0            0            0            0            0            0
  ack_drop            0            0            0            0            0            0            0            0
  sp_flows            0            0           19            0            0            0            0            0
  bk_flows            0            0            1            0            0            0            0            0
  un_flows            0            0            0            0            0            0            0            0
  max_len             0            0         1280            0            0            0            0            0
  quantum           640          560          490          429          375          328          300          300

qdisc ingress ffff: dev pppoe-wan parent ffff:fff1 ----------------
 Sent 6718121 bytes 4538 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 80ec: dev ifb4pppoe-wan root refcnt 2 bandwidth 21Mbit diffserv8 dual-dsthost nat wash ingress no-ack-filter split-gso rtt 100ms raw overhead 0
 Sent 6500289 bytes 4392 pkt (dropped 116, overlimits 8976 requeues 0)
 backlog 44760b 30p requeues 0
 memory used: 79360b of 4Mb
 capacity estimate: 21Mbit
 min/max network layer size:           60 /    1492
 min/max overhead-adjusted size:       60 /    1492
 average network hdr offset:            0

                  Tin 0        Tin 1        Tin 2        Tin 3        Tin 4        Tin 5        Tin 6        Tin 7
  thresh         21Mbit    18375Kbit    16078Kbit    14068Kbit    12309Kbit    10771Kbit     9424Kbit     8246Kbit
  target            5ms          5ms          5ms          5ms          5ms          5ms          5ms          5ms
  interval        100ms        100ms        100ms        100ms        100ms        100ms        100ms        100ms
  pk_delay          0us          0us        585us        338us          0us          0us       71.5ms          0us
  av_delay          0us          0us         44us         10us          0us          0us       13.3ms          0us
  sp_delay          0us          0us         44us         10us          0us          0us        838us          0us
  backlog            0b           0b           0b           0b           0b           0b       44760b           0b
  pkts                0            0           34            7            0            0         4497            0
  bytes               0            0        17592          860            0            0      6699669            0
  way_inds            0            0            0            0            0            0            0            0
  way_miss            0            0           10            1            0            0           19            0
  way_cols            0            0            0            0            0            0            0            0
  drops               0            0            0            0            0            0          116            0
  marks               0            0            0            0            0            0            0            0
  ack_drop            0            0            0            0            0            0            0            0
  sp_flows            0            0            1            1            0            0            7            0
  bk_flows            0            0            0            0            0            0            9            0
  un_flows            0            0            0            0            0            0            0            0
  max_len             0            0         1452          188            0            0         1492            0
  quantum           640          560          490          429          375          328          300          300


What's weird is that despite the traffic having CS0 and a packet length of 1506 bytes, it filled Tin 6. Do you know why?

Here's the tc -s qdisc output after using YouTube and Spotify:

qdisc cake 80e7: dev pppoe-wan root refcnt 2 bandwidth 21Mbit diffserv8 dual-srchost nat nowash ack-filter split-gso rtt 100ms raw overhead 0
 Sent 719302 bytes 5174 pkt (dropped 1, overlimits 496 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 118336b of 4Mb
 capacity estimate: 21Mbit
 min/max network layer size:           29 /    1492
 min/max overhead-adjusted size:       29 /    1492
 average network hdr offset:            0

                  Tin 0        Tin 1        Tin 2        Tin 3        Tin 4        Tin 5        Tin 6        Tin 7
  thresh         21Mbit    18375Kbit    16078Kbit    14068Kbit    12309Kbit    10771Kbit     9424Kbit     8246Kbit
  target            5ms          5ms          5ms          5ms          5ms          5ms          5ms          5ms
  interval        100ms        100ms        100ms        100ms        100ms        100ms        100ms        100ms
  pk_delay          0us          0us        163us          0us          0us          0us          0us          0us
  av_delay          0us          0us         12us          0us          0us          0us          0us          0us
  sp_delay          0us          0us          3us          0us          0us          0us          0us          0us
  backlog            0b           0b           0b           0b           0b           0b           0b           0b
  pkts                0            0         5174            0            0            0            0            0
  bytes               0            0       720520            0            0            0            0            0
  way_inds            0            0            4            0            0            0            0            0
  way_miss            0            0           81            0            0            0            0            0
  way_cols            0            0            0            0            0            0            0            0
  drops               0            0            1            0            0            0            0            0
  marks               0            0            0            0            0            0            0            0
  ack_drop            0            0            0            0            0            0            0            0
  sp_flows            0            0            3            0            0            0            0            0
  bk_flows            0            0            1            0            0            0            0            0
  un_flows            0            0            0            0            0            0            0            0
  max_len             0            0         1947            0            0            0            0            0
  quantum           640          560          490          429          375          328          300          300

qdisc ingress ffff: dev pppoe-wan parent ffff:fff1 ----------------
 Sent 14311972 bytes 11837 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 80e8: dev ifb4pppoe-wan root refcnt 2 bandwidth 21Mbit diffserv8 dual-dsthost nat wash ingress no-ack-filter split-gso rtt 100ms raw overhead 0
 Sent 14279183 bytes 11814 pkt (dropped 13, overlimits 22010 requeues 0)
 backlog 14920b 10p requeues 0
 memory used: 314112b of 4Mb
 capacity estimate: 21Mbit
 min/max network layer size:           40 /    1492
 min/max overhead-adjusted size:       40 /    1492
 average network hdr offset:            0

                  Tin 0        Tin 1        Tin 2        Tin 3        Tin 4        Tin 5        Tin 6        Tin 7
  thresh         21Mbit    18375Kbit    16078Kbit    14068Kbit    12309Kbit    10771Kbit     9424Kbit     8246Kbit
  target            5ms          5ms          5ms          5ms          5ms          5ms          5ms          5ms
  interval        100ms        100ms        100ms        100ms        100ms        100ms        100ms        100ms
  pk_delay          0us          7us       43.2ms        447us          0us          0us       10.9ms          0us
  av_delay          0us          0us       25.7ms         16us          0us          0us       2.64ms          0us
  sp_delay          0us          0us       7.07ms          9us          0us          0us          5us          0us
  backlog            0b           0b       14920b           0b           0b           0b           0b           0b
  pkts                0            5        11270           27            0            0          535            0
  bytes               0          284     13863626         6912            0            0       441150            0
  way_inds            0            0            0            0            0            0            0            0
  way_miss            0            4           48            8            0            0           17            0
  way_cols            0            0            0            0            0            0            0            0
  drops               0            0           12            0            0            0            1            0
  marks               0            0            0            0            0            0            0            0
  ack_drop            0            0            0            0            0            0            0            0
  sp_flows            0            1            3            1            0            0            1            0
  bk_flows            0            0            1            0            0            0            0            0
  un_flows            0            0            0            0            0            0            0            0
  max_len             0           76         1492         1274            0            0         1278            0
  quantum           640          560          490          429          375          328          300          300

I didn't mean to say "automatic" link layer adaptation; you're correct that CAKE assigns a default overhead size and leverages kernel-provided interface details to add extra overhead. Also, I adjusted the shaper rate to mitigate bufferbloat, which worked out perfectly in my case.

Please correct me if I've made any mistakes or am missing something.

In that case set the ingress qdisc to besteffort and not diffserv8, as with diffserv8 you are essentially operating on whatever DSCP values your ISP gave you, and the tc -s qdisc output shows Tin 2, Tin 3, and loads in Tin 6; later also Tin 1 saw traffic.

Well, how do you know that all incoming packets were CS0? Since you are using the wash keyword and diffserv8, cake will, for each incoming packet, first look up the existing DSCP value and decide which of the 8 tins to put that packet into, and then re-mark the packet's DSCP value to CS0. So if you do a packet capture after cake, all you will see is CS0, but that is exactly because you instructed cake to re-mark all packets to DSCP 0...
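
One way to check is to capture on pppoe-wan itself; packet taps there should see ingress packets before they are redirected to ifb4pppoe-wan, i.e. before wash is applied. A sketch, assuming tcpdump is installed:

# Show incoming packets whose DSCP (top six bits of the ToS byte) is non-zero
tcpdump -n -i pppoe-wan -c 20 'inbound and ip and (ip[1] & 0xfc) != 0'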

But it did not:

Here, for pppoe-wan and ifb4pppoe-wan, the kernel added exactly 0 bytes of overhead (those 14 bytes are for the parts of the ethernet header the kernel knows about on ethernet interfaces)...

This is certainly incorrect.

If things work well for you, you must be doing something right. However, I am not sure that all your theories about why things work well are fully developed yet...

Does wash happen after CAKE puts traffic into tins? Also, after rebooting the router I tried downloading the same file, and now it's correctly placed in Tin 1, so I don't know why the earlier result was skewed.

I see. So basically I'm running the script without overhead accounting. Surprisingly, I don't know why, but in practice it gave really good results, even better than piece-of-cake.

I'm testing hudra0/qosmate, since his implementation assigns traffic to the proper tins. I'll update soon.

Yes.

For diffserv8, IIRC CS0 should map to tin 2...

As I tried to explain before, for a given packet size one can make up for a too-small per-packet overhead by setting the shaper a bit lower...

That would solve the "ingress DSCP marks make little sense" issue.

I'm testing qosmate and so far, so good. Packets are correctly placed in their respective tins. And it's giving ~10% faster webpage loading at full network load compared to my original advice.

One question though: do I still need to manually assign the overhead (qosmate > Link Preset: PPPoE PTM=30), even though at raw I don't get any bufferbloat or latency spikes?


How do you measure that?

IDM with 16 connections, plus a tool to load websites back-to-back with caching disabled. The tool I posted will give you a rough measurement. On my device I use BenchmarkDotNet along with my own code.


Since you ask: my advice is always to set the per-packet overhead to the exact value, if known, or to a value that is expected to be a bit larger than the true overhead. Not configuring the overhead correctly can be "invisible" for large packets, but it shows up as increased latency under load if the traffic is made up of smaller packets... the OpenWrt SQM details wiki has some text on that.
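
In sqm-scripts terms that would look something like this in the queue section of /etc/config/sqm (a sketch; the 30-byte value just mirrors the PPPoE/PTM preset mentioned above, so verify the right number for your link against the wiki):

	option linklayer 'ethernet'
	option overhead '30'

(cake also knows a ptm keyword for VDSL2's 64b/65b encoding, which could additionally go into the iqdisc_opts/eqdisc_opts strings.)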
