Thank you very much for your detailed thoughts. It seems you think the existing approach (not clearing the window and keeping the delay measurements running even during the refractory period) is OK, especially if the window is narrower than the refractory period.
Should we include a sanity check that the refractory period is greater than the window?
At the moment we have a fixed refractory period. Should the refractory period be dynamically adjusted then? Based on e.g.
A warning should be okay, but I would not enforce this as the theory of operation we have might be wrong...
We would need a theory for how long it takes for traffic to behave well again after we reduced the shaper rate*... but that depends on things like the RTT of each individual flow and the CC algorithm used. I really think we are better off with a simple fixed duration parameter here, accepting that this is just a heuristic aimed at keeping our control loop from overshooting/oscillating too wildly.
*To be explicit, this assumes that the speed we shaped down to is actually <= the true bottleneck speed, and the question is: how long does it take for our "shape-down" signal, which was based on the already remedied mismatch between shaper and bottleneck rate, to subside? It is that period in which we can't disambiguate between the shaper still being set too high and the shaper being correct while we still need to wait for accumulated data to be "serviced" away (so clean-up of the old "bloat").
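To make the fixed-duration heuristic concrete, here is a minimal sketch of how such a hold-off could gate rate reductions. The function name and the way timestamps are passed in are assumptions for illustration, not taken from the cake-autorate source; only the 100000 us value comes from the debug output in this thread.

```shell
#!/bin/sh
# Hypothetical names; not the actual cake-autorate implementation.
bufferbloat_refractory_period_us=100000  # fixed hold-off after a rate reduction

# may_reduce_rate NOW_US LAST_REDUCTION_US
# Succeeds (exit 0) only once the refractory period since the last
# shaper rate reduction has fully elapsed.
may_reduce_rate() {
    [ $(( $1 - $2 )) -ge "$bufferbloat_refractory_period_us" ]
}

# Example: bufferbloat detected 40 ms and 150 ms after a reduction at t=0.
for now_us in 40000 150000; do
    if may_reduce_rate "$now_us" 0; then
        echo "t=${now_us}us: act on bufferbloat"
    else
        echo "t=${now_us}us: still refractory, ignore"
    fi
done
```

The point of keeping it this simple is that nothing here tries to model per-flow RTT or CC behaviour; it only stops the loop from reacting twice to the same old bloat.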
DEBUG Warning: bufferbloat refractory period: 100000 us.
DEBUG Warning: but expected time to overwrite samples in bufferbloat detection window is: 200000 us.
DEBUG Warning: Consider increasing bufferbloat refractory period or decreasing bufferbloat detection window.
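A warning-only check along these lines could produce the output above. This is a sketch under assumptions: the variable names, the example values, and the formula for the window overwrite time (window length times ping interval, divided by the number of pingers) are illustrative guesses rather than the script's actual code.

```shell
#!/bin/sh
# Illustrative values only; variable names and formula are assumptions.
bufferbloat_refractory_period_us=100000
bufferbloat_detection_window=4       # samples held in the window
reflector_ping_interval_us=300000    # interval between pings to one reflector
no_pingers=6                         # reflectors pinged concurrently

# Expected time for fresh samples to overwrite the whole window,
# assuming responses arrive evenly spread across the pingers.
window_overwrite_time_us() {
    echo $(( bufferbloat_detection_window * reflector_ping_interval_us / no_pingers ))
}

# Prints warnings and returns non-zero (without aborting) when the
# refractory period is shorter than the window overwrite time.
check_refractory_period() {
    overwrite_us=$(window_overwrite_time_us)
    if [ "$bufferbloat_refractory_period_us" -lt "$overwrite_us" ]; then
        echo "DEBUG Warning: bufferbloat refractory period: ${bufferbloat_refractory_period_us} us."
        echo "DEBUG Warning: but expected time to overwrite samples in bufferbloat detection window is: ${overwrite_us} us."
        echo "DEBUG Warning: Consider increasing bufferbloat refractory period or decreasing bufferbloat detection window."
        return 1
    fi
    return 0
}

check_refractory_period || true
```

Returning a status instead of exiting keeps this a warning rather than an enforced rule, in line with the concern that the theory of operation might be wrong.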
I got a "brand new" Belkin RT3200 ($95 on eBay, apparently same specs as Linksys e8450) and flashed it to OpenWrt using the Dangowrt instructions to get all the bootloader stuff working. (I'm not sure what that's all about, but I just followed the instructions.) I then upgraded it to OpenWrt 22.03.0-rc4 by downloading the binary from firmware-selector.openwrt.org.
Installing CAKE-autorate
I installed the luci-app-sqm software and configured it for the router's wan port.
In config.sh, I used wan and ifb4wan for the interface names and set the speeds to:
min_dl_shaper_rate_kbps=65000 # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=75000 # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=78000 # maximum bandwidth for download (Kbit/s)
min_ul_shaper_rate_kbps=66000 # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=75000 # steady state bandwidth for upload (Kbit/s)
max_ul_shaper_rate_kbps=77500 # maximum bandwidth for upload (Kbit/s)
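As a quick sanity check before starting the service, the ordering of these values can be verified with a few lines of shell. This is only a sketch: `check_rates` is a hypothetical helper, and it assumes the intended relationship is min <= base <= max.

```shell
#!/bin/sh
# check_rates LABEL MIN BASE MAX
# Hypothetical helper: succeeds when MIN <= BASE <= MAX (all in Kbit/s).
check_rates() {
    if [ "$2" -le "$3" ] && [ "$3" -le "$4" ]; then
        echo "$1: OK ($2 <= $3 <= $4 Kbit/s)"
    else
        echo "$1: inconsistent rates (min=$2 base=$3 max=$4)" >&2
        return 1
    fi
}

# Values from the config.sh settings above.
check_rates download 65000 75000 78000
check_rates upload 66000 75000 77500
```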
NB: I believe the two service... commands should refer to cake-autorate not CAKE-autorate
Results on my 75 Mbps/75 Mbps fiber ISP
I don't expect my fiber internet connection to change speeds very much, so this won't test CAKE-autorate's ability to track speed changes. But I wanted to document what happened to CPU performance with "out of the box" settings.
When the router is essentially idle (no traffic), htop shows CPU usage of 8-15% when CAKE-autorate is running as a service.
Running betterspeedtest.sh on the router, CPU load is good (30-40%) and latency is very well controlled.
root@OpenWrt-Belkin-RT3200:~# sh betterspeedtest.sh -t 20 -p 1.1.1.1
2022-06-26 19:32:46 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging 1.1.1.1 (20 seconds in each direction)
.....................
Download: 73.62 Mbps
Latency: (in msec, 21 pings, 0.00% packet loss)
Min: 7.300
10pct: 7.440
Median: 7.840
Avg: 8.621
90pct: 8.170
Max: 24.100
.....................
Upload: 73.60 Mbps
Latency: (in msec, 21 pings, 0.00% packet loss)
Min: 7.910
10pct: 7.990
Median: 8.540
Avg: 10.233
90pct: 9.400
Max: 28.500
Running betterspeedtest.sh on my Wi-Fi connected laptop, router CPU is good (about 35-40%, peaking at 55%), but speeds are degraded and latency is sometimes bad:
√ OpenWrtScripts % sh betterspeedtest.sh -t 20 -p 1.1.1.1
2022-06-26 19:35:45 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging 1.1.1.1 (20 seconds in each direction)
.....................
Download: 55.73 Mbps
Latency: (in msec, 21 pings, 0.00% packet loss)
Min: 9.111
10pct: 9.611
Median: 10.766
Avg: 29.164
90pct: 27.721
Max: 327.052
.....................
Upload: 71.70 Mbps
Latency: (in msec, 21 pings, 0.00% packet loss)
Min: 8.935
10pct: 9.022
Median: 10.677
Avg: 11.540
90pct: 11.443
Max: 30.117
√ OpenWrtScripts % sh betterspeedtest.sh -t 20 -p 1.1.1.1
2022-06-26 19:36:30 Testing against netperf.bufferbloat.net (ipv4) with 5 simultaneous sessions while pinging 1.1.1.1 (20 seconds in each direction)
.....................
Download: 73.58 Mbps
Latency: (in msec, 21 pings, 0.00% packet loss)
Min: 8.751
10pct: 9.000
Median: 10.598
Avg: 10.536
90pct: 11.296
Max: 14.897
.....................
Upload: 58.06 Mbps
Latency: (in msec, 21 pings, 0.00% packet loss)
Min: 9.268
10pct: 9.346
Median: 10.608
Avg: 23.781
90pct: 13.284
Max: 270.800
TL;DR CAKE-autorate on a capable router seems to work very well. Wi-Fi may induce latency that harms the responsiveness.
@richb-hanover-priv firstly thank you very much indeed for taking the time and trouble to test the bash implementation. It is reassuring to know that the instructions seem to be apt, and I have now addressed the errors you have helpfully identified in connection with the service file installation. And it is of course reassuring to see that the script worked on your RT3200 with fixed fibre connection. Any chance you might test this on your variable rate connection at some point too?
As you know, this script is designed for variable rate connections such as LTE, or connections that otherwise suffer from e.g. congestion-related capacity variation that prevents the use of a static CAKE bandwidth. For fixed connections without such variation, setting a static bandwidth is still optimal. Nevertheless, testing on fixed bandwidth connections is still helpful, although where there really is no capacity variation the best the script can do is avoid making things significantly worse than simply setting static CAKE bandwidths.
By the way there have been a number of very significant commits to the OpenWrt 22.03 branch affecting Wi-Fi on the RT3200 such as this:
Using 3x RT3200's in a WDS setup myself, I also find it frustrating that I spend all this time getting a latency-free experience on my LTE connection, only for clients to nevertheless experience latency associated with the use of Wi-Fi.
@richb-hanover-priv I, and I imagine also @nbd, would be interested to know if the latest 22.03 snapshot incorporating these commits:
reduces latency. I think you can get a 22.03 snapshot that contains these commits by setting the sysupgrade server to: https://chef.libremesh.org/
and then using auc or LuCI Attended Sysupgrade to get the latest 22.03 snapshot on your RT3200. I think you need to change the sysupgrade server because the default one lags behind the most up-to-date snapshots for some reason.
You can verify which commit the snapshot corresponds to on the 22.03 GitHub page here:
and, when it offers an upgrade of the form rXXXX-YYYYYY, checking which GitHub hash the YYYYYY corresponds to. 08e1812 seems to be the latest as of writing this post.
I would be really interested to know whether, from your perspective, the latest 22.03 snapshot reduces latency compared to rc4, given the changes to the Wi-Fi driver and these airtime fairness alterations. Of course you could also just wait for the release of rc5 and test then.
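If it helps, the hash part can be pulled out of such a version string with plain shell parameter expansion. The version string below is a made-up placeholder in the rXXXX-YYYYYY form, and `hash_from_version` is a hypothetical helper name.

```shell
#!/bin/sh
# hash_from_version VERSION
# Prints everything after the last "-" (the abbreviated commit hash).
hash_from_version() {
    echo "${1##*-}"
}

# Placeholder version string, not a real build.
version="r00000-0123abc"
echo "commit hash prefix: $(hash_from_version "$version")"
# → commit hash prefix: 0123abc
```

The printed prefix can then be compared against the commit list on the branch page.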
Thanks for the update. I'll check the most recent snapshot builds soon-ish.
I also appreciate your tip about backing up CAKE-autorate files. Could I suggest the following change for the README? (I also edited the wiki page to make it clear about using the Configuration tab.) Thanks again.
## Preserving CAKE-autorate files for backup or upgrades
The [Backup and Restore page on the wiki](https://openwrt.org/docs/guide-user/troubleshooting/backup_restore#customize_and_verify)
describes how files can be saved across upgrades.
[Add these files on the **Configuration** tab](https://openwrt.org/docs/guide-user/troubleshooting/backup_restore#back_up),
so they will be saved in backups and preserved across snapshot upgrades.
PS I am a huge fan of the Attended Sysupgrade package. It is the future of OpenWrt, since it gives a "one-click" upgrade to your router from the LuCI web GUI while preserving all the packages and settings of your current setup. That said...
Neither chef.libremesh.org nor asu.aparcar.org servers have snapshot builds - just the -rc1 build. I will just wait for -rc5, since I'm up to my elbows in other work right now...
Thanks for making this revised version. I gave it a try this morning but I wasn't exactly sure what settings you wanted me to try, if you wanted me to revert some of the earlier things we had done or not. So basically I took your provided config.sh and changed:
reflector_ping_interval_s=0.15 # (seconds, e.g. 0.2s or 2s)
no_pingers=6
delay_thr_ms=75 # (milliseconds)
min_dl_shaper_rate_kbps=10000 # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=50000 # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=200000 # maximum bandwidth for download (Kbit/s)
min_ul_shaper_rate_kbps=2000 # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=10000 # steady state bandwidth for upload (Kbit/s)
max_ul_shaper_rate_kbps=30000 # maximum bandwidth for upload (Kbit/s)
So some of the other things we tried aren't in there, but I can put them back in there, let me know what you think.
My impression is that it is working well at cutting back the bandwidth to reduce latency, but it sometimes got stuck in a lower bandwidth range than it probably needed to be in.
which only averaged 26 Mbps download for the 60s period, and it seems like it probably wasn't ramping up the bandwidth fast enough, so I'm thinking I need to adjust some other parameters from the defaults again.
Perhaps still some room for improvement? Maybe we need longer download data, e.g. from a sustained download lasting 2 minutes? Thoughts @moeller0?
Just curious, I still seem to see "stay the course" periods, but should these not go away with:
medium_load_thr=0.75 # % of currently set bandwidth for detecting medium load
high_load_thr=0.75 # % of currently set bandwidth for detecting high load
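To spell out the reasoning behind that question, here is a rough sketch of a load classifier using the two thresholds. It is an assumption about how the classification works, not the script's actual code: `classify_load` is a hypothetical helper, thresholds are passed as integer percentages to stay within shell integer arithmetic, and the 30000 Kbit/s achieved rate is an invented example.

```shell
#!/bin/sh
# classify_load ACHIEVED_KBPS SHAPER_KBPS MEDIUM_PCT HIGH_PCT
# Hypothetical helper: prints high, medium or low depending on the
# achieved rate as a percentage of the currently set shaper rate.
classify_load() {
    load_pct=$(( 100 * $1 / $2 ))
    if [ "$load_pct" -ge "$4" ]; then
        echo high
    elif [ "$load_pct" -ge "$3" ]; then
        echo medium
    else
        echo low
    fi
}

# 30000 of 50000 Kbit/s is 60% load.
classify_load 30000 50000 50 75   # → medium
classify_load 30000 50000 75 75   # → low
```

With both thresholds equal, the medium branch can never be reached in this sketch, which is why the medium-load "stay the course" periods would be expected to disappear.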
Am I right that here CAKE is not actually shaping because there is a continued large discrepancy between shaper rate and achieved rate? Or is there another explanation why download is not increasing to closer to the shaper bandwidth?
I mean obviously the medium rate logic is still being applied (which I hope is because @gba did not set the values). But I'm still interested in what is causing the connection to cruise along subject to a maximum other than our own shaper without RTT spiking? Does this mean Starlink has its own shaper?
For comparison purposes, see how with my LTE connection there is not such a discrepancy between the shaper rate and the achieved rate:
Ahh, so sorry, I missed changing medium_load_thr so it was set to 0.50. I just changed it to 0.75 and double-checked everything else to be the same as your list there. I'll run another test with medium_load_thr set to 0.75. Sorry again...