CAKE w/ Adaptive Bandwidth [October 2021 to September 2022]

Not a plot but a summary:

[...]
                        Min     Mean   Median      Max  Stddev                                                                                                                                 
                         ---     ----   ------      ---  ------                                                                                                                                 
                RTT  16.64ms  17.49ms  17.18ms  64.67ms  1.41ms                                                                                                                                 
         send delay   6.62ms   7.56ms   7.48ms  25.31ms   619µs                                                                                                                                 
      receive delay   9.19ms   9.92ms   9.78ms  56.47ms  1.08ms                                                                                                                                 
                                                                                                                                                                                                
      IPDV (jitter)       0s    259µs    188µs  19.59ms   420µs                                                                                                                                 
          send IPDV       0s    187µs    115µs  16.15ms   351µs                                                                                                                                 
       receive IPDV       0s    156µs    108µs   19.8ms   240µs                                                                                                                                 
                                                                                                                                                                                                
     send call time   6.73µs   25.2µs             263µs  7.18µs                                                                                                                                 
        timer error      1ns    276µs            13.6ms   169µs                                                                                                                                 
  server proc. time    880ns   6.42µs            4.99ms  22.1µs                                                                                                                                 
                                                                                                                                                                                                
                duration: 20m0s (wait 194ms)                                                                                                                                                    
   packets sent/received: 399562/399327 (0.06% loss)                                                                                                                                            
 server packets received: 399559/399562 (0.00%/0.06% loss up/down)                                                                                                                              
     bytes sent/received: 23973720/23959620                                                                                                                                                     
       send/receive rate: 159.8 Kbps / 159.7 Kbps                                                                                                                                               
           packet length: 60 bytes                                                                                                                                                              
             timer stats: 438/400000 (0.11%) missed, 9.21% error                         

Thank you very much for your detailed thoughts. It seems you think the existing approach of not clearing the window and keeping the delay measurements running even during the refractory period is OK, especially if the window is shorter than the refractory period.

Should we include a sanity check that the refractory period is greater than the window?

At the moment we have a fixed refractory period. Should the refractory period be dynamically adjusted then? Based on e.g.

max_wire_packet_rtt_us=$(( (1000*$dl_max_wire_packet_size_bits)/$dl_shaper_rate_kbps + (1000*$ul_max_wire_packet_size_bits)/$ul_shaper_rate_kbps ))

https://github.com/lynxthecat/CAKE-autorate/blob/fcc9ec8e654e3b8770d65e732747ccad16e84f9c/CAKE-autorate.sh#L361

Or shaper rate? Or change of shaper rate?
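
For concreteness, here is a hedged sketch of one option: scale the refractory period with the max_wire_packet_rtt_us computed above. The example inputs, the 20x multiplier, and the 100 ms floor are made-up illustrative values, not anything taken from the script:

# example inputs (illustrative only)
dl_max_wire_packet_size_bits=12384   # e.g. a 1548 byte wire packet
ul_max_wire_packet_size_bits=12384
dl_shaper_rate_kbps=50000
ul_shaper_rate_kbps=10000

# per-packet wire RTT, as in the script line linked above
max_wire_packet_rtt_us=$(( (1000*$dl_max_wire_packet_size_bits)/$dl_shaper_rate_kbps + (1000*$ul_max_wire_packet_size_bits)/$ul_shaper_rate_kbps ))

# hypothetical dynamic refractory period: 20x the wire RTT, floored at 100 ms
dynamic_refractory_period_us=$(( 20*$max_wire_packet_rtt_us ))
(( $dynamic_refractory_period_us < 100000 )) && dynamic_refractory_period_us=100000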

A warning should be okay, but I would not enforce this as the theory of operation we have might be wrong...

We would need a theory for how long it takes for traffic to behave well again after we reduced the shaper rate*... but that depends on things like the RTT of each individual flow and the CC algorithm used. I really think we are better off with a simple fixed duration parameter here and accepting that this is just a heuristic aimed at keeping our control loop from overshooting/oscillating too wildly.

To be explicit, this assumes that the speed we shaped down to is actually >= the true bottleneck speed, and the question is how long it takes for the "shape-down" signal, which was based on the already-remedied mismatch between shaper and bottleneck rate, to subside. It is during that period that we can't disambiguate between our shaper still being set too high and the shaper being correct while we simply wait for the accumulated data to be "serviced" away (i.e. the clean-up of the old "bloat").

1 Like

Does this look reasonable:

Example output:

DEBUG Warning: bufferbloat refractory period:  100000  us.
DEBUG Warning: but expected time to overwrite samples in bufferbloat detection window is:  200000  us.
DEBUG Warning: Consider increasing bufferbloat refractory period or decreasing bufferbloat detection window.
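
For reference, a minimal sketch of the kind of startup check that could emit such a warning. The variable names, the example values, and the sampling assumption (that results from the pingers interleave, so the window refills roughly every detection_window * ping_interval / no_pingers) are illustrative rather than taken from the script:

# illustrative sanity check with hypothetical variable names and example values
bufferbloat_refractory_period_us=100000
bufferbloat_detection_window=8          # number of delay samples in the window
reflector_ping_interval_us=150000       # per reflector
no_pingers=6

# expected time (us) to completely overwrite the bufferbloat detection window
window_fill_time_us=$(( $bufferbloat_detection_window*$reflector_ping_interval_us/$no_pingers ))

if (( $bufferbloat_refractory_period_us < $window_fill_time_us )); then
        echo "DEBUG Warning: bufferbloat refractory period: $bufferbloat_refractory_period_us us."
        echo "DEBUG Warning: but expected time to overwrite samples in bufferbloat detection window is: $window_fill_time_us us."
        echo "DEBUG Warning: Consider increasing bufferbloat refractory period or decreasing bufferbloat detection window."
fi
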
1 Like

Yes, that is IMHO great: it directs users to what we think is correct while still allowing alternative settings, if only for proper A/B testing.

1 Like

Field Report: Success on Belkin RT3200

I got a "brand new" Belkin RT3200 ($95 on eBay, apparently same specs as Linksys e8450) and flashed it to OpenWrt using the Dangowrt instructions to get all the bootloader stuff working. (I'm not sure what that's all about, but I just followed the instructions.) I then upgraded it to OpenWrt 22.03.0-rc4 by downloading the binary from firmware-selector.openwrt.org.

Installing CAKE-autorate

  • I installed the luci-app-sqm software and configured it for the router's wan port.

  • I followed the steps on the CAKE-autorate page for the main branch to install the software. I didn't do any "manual testing".

  • In config.sh, I used wan and ifb4wan for the interface names and set the speeds to:

min_dl_shaper_rate_kbps=65000  # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=75000 # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=78000  # maximum bandwidth for download (Kbit/s)

min_ul_shaper_rate_kbps=66000  # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=75000 # steady state bandwidth for upload (Kbit/s)
max_ul_shaper_rate_kbps=77500  # maximum bandwidth for upload (Kbit/s)

Results on my 75 Mbps/75 Mbps fiber ISP

  • I don't expect my fiber internet connection to change speeds very much, so this won't test CAKE-autorate's ability to track speed changes. But I wanted to document what happened to CPU performance with "out of the box" settings.

  • When the router is essentially idle (no traffic), htop shows CPU usage of 8-15% while CAKE-autorate is running as a service.

  • Running betterspeedtest.sh on the router, CPU load is good (30-40%) and latency is very well controlled.

root@OpenWrt-Belkin-RT3200:~# sh betterspeedtest.sh -t 20 -p 1.1.1.1
2022-06-26 19:32:46 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging 1.1.1.1 (20 seconds in each direction)
.....................
 Download: 73.62 Mbps
  Latency: (in msec, 21 pings, 0.00% packet loss)
      Min: 7.300
    10pct: 7.440
   Median: 7.840
      Avg: 8.621
    90pct: 8.170
      Max: 24.100
.....................
   Upload: 73.60 Mbps
  Latency: (in msec, 21 pings, 0.00% packet loss)
      Min: 7.910
    10pct: 7.990
   Median: 8.540
      Avg: 10.233
    90pct: 9.400
      Max: 28.500

  • Running betterspeedtest.sh on my Wi-Fi-connected laptop, router CPU is good (about 35-40%, peaking at 55%), but speeds are degraded and latency is sometimes bad:
√ OpenWrtScripts % sh betterspeedtest.sh -t 20 -p 1.1.1.1
2022-06-26 19:35:45 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging 1.1.1.1 (20 seconds in each direction)
.....................
 Download: 55.73 Mbps
  Latency: (in msec, 21 pings, 0.00% packet loss)
      Min: 9.111
    10pct: 9.611
   Median: 10.766
      Avg: 29.164
    90pct: 27.721
      Max: 327.052
.....................
   Upload: 71.70 Mbps
  Latency: (in msec, 21 pings, 0.00% packet loss)
      Min: 8.935
    10pct: 9.022
   Median: 10.677
      Avg: 11.540
    90pct: 11.443
      Max: 30.117
√ OpenWrtScripts % sh betterspeedtest.sh -t 20 -p 1.1.1.1
2022-06-26 19:36:30 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging 1.1.1.1 (20 seconds in each direction)
.....................
 Download: 73.58 Mbps
  Latency: (in msec, 21 pings, 0.00% packet loss)
      Min: 8.751
    10pct: 9.000
   Median: 10.598
      Avg: 10.536
    90pct: 11.296
      Max: 14.897
.....................
   Upload: 58.06 Mbps
  Latency: (in msec, 21 pings, 0.00% packet loss)
      Min: 9.268
    10pct: 9.346
   Median: 10.608
      Avg: 23.781
    90pct: 13.284
      Max: 270.800

TL;DR CAKE-autorate on a capable router seems to work very well. Wi-Fi may induce latency that harms responsiveness.

2 Likes

@richb-hanover-priv firstly thank you very much indeed for taking the time and trouble to test the bash implementation. It is reassuring to know that the instructions seem to be apt, and I have now addressed the errors you helpfully identified in connection with the service file installation. It is also good to see that the script worked on your RT3200 with its fixed fibre connection. Any chance you might test this on your variable rate connection at some point too?

As you know, this script is designed for use on variable rate connections, such as LTE, or connections that otherwise suffer from e.g. congestion-related capacity variation that prevents the use of a static CAKE bandwidth. For fixed connections without such variation, setting a static bandwidth is still optimal. Nevertheless, testing on fixed bandwidth connections is still helpful, although where there really is no capacity variation the best the script can do is not make things significantly worse than simply setting static CAKE bandwidths.

By the way there have been a number of very significant commits to the OpenWrt 22.03 branch affecting Wi-Fi on the RT3200 such as this:

And these:

Using 3x RT3200s in a WDS setup myself, I also find it frustrating to spend all this time getting a latency-free experience on my LTE connection, only for clients to nevertheless experience latency associated with Wi-Fi.

@richb-hanover-priv I, and I imagine also @nbd, would be interested to know whether the latest 22.03 snapshot incorporating these commits:

reduces latency. I think you can get a 22.03 snapshot that contains these commits by setting the sysupgrade server to: https://chef.libremesh.org/

and then using auc or LuCI attended sysupgrade to get the latest 22.03 snapshot on your RT3200. I think you need to change the sysupgrade server because the default one lags behind the most up-to-date snapshots for some reason.
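
If you prefer the command line, pointing attended sysupgrade at the alternative server should look something like the following (a hedged sketch; this assumes the stock attendedsysupgrade UCI layout, so check /etc/config/attendedsysupgrade on your build):

# switch auc / LuCI attended sysupgrade to the alternative server
uci set "attendedsysupgrade.@server[0].url=https://chef.libremesh.org"
uci commit attendedsysupgrade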

You can verify which commit the snapshot corresponds to on the 22.03 GitHub page here:

by looking at the GitHub hashes:

and, when it offers an upgrade of the form rXXXX-YYYYYY, checking which GitHub hash the YYYYYY corresponds to. 08e1812 seems to be the latest as of writing this post.

I would be really interested to know whether, from your perspective, the latest 22.03 snapshot reduces latency compared to rc4, given the changes to the Wi-Fi driver and these airtime fairness alterations. Of course you could also just wait for the release of rc5 and test then.

I have added instructions to suggest including:

/root/CAKE-autorate
/etc/init.d/cake-autorate

to the list of files to preserve during backups or upgrades.

Thanks for the update. I'll check the most recent snapshot builds soon-ish.

I also appreciate your tip about backing up CAKE-autorate files. Could I suggest the following change for the README? (I also edited the wiki page to clarify the use of the Configuration tab.) Thanks again.



## Preserving CAKE-autorate files for backup or upgrades

The [Backup and Restore page on the wiki](https://openwrt.org/docs/guide-user/troubleshooting/backup_restore#customize_and_verify)
describes how files can be saved across upgrades. 

[Add these files on the **Configuration** tab](https://openwrt.org/docs/guide-user/troubleshooting/backup_restore#back_up),
so they will be saved in backups and preserved across snapshot upgrades.
/root/CAKE-autorate
/etc/init.d/cake-autorate
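
For anyone working from the command line rather than LuCI, the same list can be extended directly (as far as I understand, the Configuration tab simply edits /etc/sysupgrade.conf):

# keep the CAKE-autorate files across sysupgrades and in backups
cat >> /etc/sysupgrade.conf <<EOF
/root/CAKE-autorate
/etc/init.d/cake-autorate
EOF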

2 Likes

PS I am a huge fan of the Attended Sysupgrade package. It is the future of OpenWrt, since it gives a "one-click" upgrade to your router from the LuCI web GUI while preserving all the packages and settings of your current setup. That said...

Neither the chef.libremesh.org nor the asu.aparcar.org server has snapshot builds - just the -rc1 build. I will just wait for -rc5, since I'm up to my elbows in other work right now...

1 Like

Thanks for making this revised version. I gave it a try this morning, but I wasn't exactly sure what settings you wanted me to try, or whether you wanted me to revert some of the earlier things we had done. So basically I took your provided config.sh and changed:


reflector_ping_interval_s=0.15 # (seconds, e.g. 0.2s or 2s)
no_pingers=6
delay_thr_ms=75 # (milliseconds)

min_dl_shaper_rate_kbps=10000  # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=50000 # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=200000  # maximum bandwidth for download (Kbit/s)

min_ul_shaper_rate_kbps=2000  # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=10000 # steady state bandwidth for upload (Kbit/s)
max_ul_shaper_rate_kbps=30000  # maximum bandwidth for upload (Kbit/s)

So some of the other things we tried aren't in there, but I can put them back in; let me know what you think.

My impression is that it was working well at cutting back the bandwidth to reduce latency, but it sometimes got stuck in a lower bandwidth range than it probably needed to be in.

Here's an example log

which only averaged 26 Mbps download for the 60s period, and it seems like it probably wasn't ramping up the bandwidth fast enough, so I'm thinking I need to adjust some other parameters from the defaults again.
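
To put a rough, hedged number on "not ramping up fast enough" (the 1.04 factor below is an assumption for the sake of the arithmetic, not the script's actual setting): if the shaper rate is multiplied by a fixed factor on each high-load sample without detected bufferbloat, the number of qualifying samples needed to climb from 26 Mbps towards, say, 100 Mbps is roughly:

# illustrative only: samples needed to scale the shaper rate from 26 to 100 Mbps,
# assuming each qualifying sample multiplies it by 1.04
awk 'BEGIN { print log(100/26)/log(1.04) }'   # ~34 samples

How quickly those samples actually accumulate then depends on how often the load is classified as high and how often bufferbloat events (and their refractory periods) interrupt the climb, which is presumably where the other parameters come in.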

@gba yeah, I meant: could you try with these (if you haven't already)?

OK, I tried these parameters. gping plot for download:

here is the log

Speed test averaged 51 Mbps down, 7.1 Mbps up for this run. Thanks.

1 Like

Thanks, plotting now.

I think this looks better:

Here is a closeup on the download phase:

And here is a further close up of the big hump:

Here were the protrusions of the RTT delta beyond the threshold:

Perhaps still some room for improvement? Maybe we need lengthier download data, e.g. from a long sustained download lasting 2 mins? Thoughts @moeller0?

1 Like

Just curious, I still seem to see "stay the course" periods, but should these not go away with:

medium_load_thr=0.75 # % of currently set bandwidth for detecting medium load
high_load_thr=0.75   # % of currently set bandwidth for detecting high load

puzzled?
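
For reference, a minimal sketch (with illustrative names, example values, and scaled integer math, not necessarily the script's own) of how two such thresholds typically split the classification, and why setting both to the same value should remove the medium band:

# illustrative load classification, thresholds scaled by 1000 for integer math
high_load_thr_x1000=750      # high_load_thr=0.75
medium_load_thr_x1000=750    # medium_load_thr=0.75
achieved_rate_kbps=26000     # example achieved rate
shaper_rate_kbps=50000       # example shaper rate

load_ratio_x1000=$(( 1000*$achieved_rate_kbps/$shaper_rate_kbps ))

if (( $load_ratio_x1000 >= $high_load_thr_x1000 )); then
        load_classification="high"
elif (( $load_ratio_x1000 >= $medium_load_thr_x1000 )); then
        load_classification="medium"   # unreachable when both thresholds are equal
else
        load_classification="low"
fi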

1 Like

Yes they should. @gba did you make all the changes according to:

Save for that issue, any further observations @moeller0? What explains this:

Am I right that here CAKE is not actually shaping because there is a continued large discrepancy between shaper rate and achieved rate? Or is there another explanation for why the download is not increasing closer to the shaper bandwidth?

I mean, obviously the medium load logic is still being applied (which I hope is because @gba did not set the values). But I'm still interested in what is causing the connection to cruise along, subject to a maximum other than our own shaper, without the RTT spiking. Does this mean Starlink has its own shaper?
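
One hedged way to check whether CAKE itself is the active limiter during such a run is to watch its queue statistics while the test runs and see whether backlog, drops, or ECN marks accumulate (interface names below follow the wan/ifb4wan convention used earlier in the thread; adjust to the actual interface on the Starlink setup):

# inspect CAKE statistics on the egress and ingress (ifb) interfaces during a test
tc -s qdisc show dev wan
tc -s qdisc show dev ifb4wan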

For comparison purposes, see how with my LTE connection there is not such a discrepancy between the shaper rate and the achieved rate:

I mean this:

1 Like

Ahh, so sorry, I missed changing medium_load_thr so it was set to 0.50. I just changed it to 0.75 and double-checked everything else to be the same as your list there. I'll run another test with medium_load_thr set to 0.75. Sorry again...

1 Like

No worries. I am still confused by what is going on, so I'm hoping @moeller0 can shed some light.

OK, I fixed that parameter and ran another test, this time doing it for 120s, in case that is more helpful to you.

log

Average download speed was about 33 Mbps, average upload 9.4 Mbps this time.

1 Like