CAKE w/ Adaptive Bandwidth [Historic]

Can we test both on the 'RTT' column here:

                 log_time;       rx_load;       tx_load;  baseline_RTT;           RTT;     delta_RTT;   cur_dl_rate;   cur_ul_rate;
20211119T210141.632780922;          0.00;          0.10;         45.59;         47.88;          2.29;      25000.00;      25000.00;
20211119T210142.543874758;          0.02;          0.36;         45.59;         47.95;          2.36;      25000.00;      25000.00;
20211119T210143.573865440;          0.02;          0.67;         45.59;         48.43;          2.84;      25000.00;      25050.00;

I would like to see how with this real RTT data set this alternative routine compares in terms of how it adapts the baseline RTT. Maybe we need a larger data set though.

I'm a bit dubious about the importance of the RTT baseline. For my LTE connection it stays around 50ms, and some tests on my mobile phone show similar values in completely different locations. We discussed above the case of someone driving around, but that sounds like a pretty extreme one!

For most use cases, so long as the baseline RTT adjusts UP very slowly, isn't an EWMA with a constant that takes, say, 15 minutes to adjust just fine?

As @moeller0 observed above, if the baseline RTT is underestimated, we merely punish the bandwidth increase a little, which is not so bad because we still avoid bufferbloat. Viewing this whole concept as a kind of 'turbo' function on top of the bandwidth that would otherwise be set in CAKE (based on the worst-case bandwidth scenario), we are still offering an increase over what would otherwise have been set, just perhaps not as much as is actually available.

In stark contrast, by allowing the baseline RTT to increase too quickly, we risk overestimating the true baseline RTT and letting bufferbloat through. If we try to squeeze too much, we introduce the possibility of letting lemon pips (or, in our case, bufferbloat) through.

Seems like the safe bet is slow RTT increase and fast RTT decrease.
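
To make that concrete, here is a minimal sketch of what such an asymmetric EWMA update might look like, reusing the alpha_RTT_increase/alpha_RTT_decrease names that appear later in this thread; the exact awk invocation is illustrative, not lifted from the script:

# Minimal sketch (not the actual script): update baseline_RTT from the latest
# RTT sample, using a small alpha when the sample is above the baseline
# (slow increase) and a larger alpha when it is below (fast decrease).
baseline_RTT=$(awk -v rtt="$RTT" -v base="$baseline_RTT" \
    -v a_up="$alpha_RTT_increase" -v a_down="$alpha_RTT_decrease" \
    'BEGIN {
        a = (rtt >= base) ? a_up : a_down
        printf "%.2f", a * rtt + (1 - a) * base
    }')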

Set against this backdrop, comparing the EWMA with this two-pole Butterworth filter, what benefit would the latter give in terms of 'reacting faster to changes'?

A complication is that the RTT we measure includes an increase associated with bufferbloat, and we need to avoid capturing that bufferbloat-related increase in the baseline. Would the two-pole Butterworth be good at that?

The present approach at least seems to handle the above issue well:

20211119T210541.207899347;          0.78;          0.08;         45.01;         49.69;          4.69;      34332.00;      25000.00;
20211119T210542.207050486;          0.77;          0.09;         45.01;         48.33;          3.32;      34557.00;      25000.00;
20211119T210543.253130538;          0.75;          0.09;         45.06;         91.12;         46.11;      32307.00;      25000.00;
20211119T210544.345748622;          0.69;          0.07;         45.15;        134.99;         89.93;      30057.00;      25000.00;
20211119T210545.464845349;          0.48;          0.05;         45.46;        358.24;        313.09;      27807.00;      25000.00;
20211119T210546.274829760;          0.82;          0.06;         45.46;         47.49;          2.03;      28032.00;      25000.00;

Yes, I think that's correct. For an EWMA, the alpha value is typically calculated as 2 / (period + 1), so you can use that to calculate a comparable setting for this filter.
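
Just to illustrate that 2 / (period + 1) relationship with made-up numbers:

# Illustration only: an EWMA that takes roughly 15 minutes to adjust, at
# one sample per second, corresponds to a period of ~900 ticks.
period=900
alpha=$(awk -v p="$period" 'BEGIN { printf "%.5f", 2 / (p + 1) }')   # ~0.00222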

Just had a really good example tonight of why users with variable rate connections really need adaptive bandwidth for CAKE.

A CAKE user faced with a variable-rate connection is typically forced to sacrifice bandwidth, and even then bufferbloat is only avoided most of the time. That is undesirable.

My LTE connection goes up to 70Mbit/s at best, and mostly, say 90% of the time, it can safely handle above 30Mbit/s. Having to set the absolute worst case in CAKE means taking a huge bandwidth hit, so I go for a compromise of 30Mbit/s or 35Mbit/s.

But during a lengthy 2.5-hour video call with my client today, the call ran into my connection's time of peak congestion and I started to see stuttering on Microsoft Teams. I had strayed into the 5-10% worst-case scenario.

During the call I had to log in to my router and manually set the CAKE bandwidth to 20Mbit/s for the stuttering to go away.

Had I been using an adaptive bandwidth routine, I'd have happily set a bandwidth of 20Mbit/s as the minimum, in the knowledge that the 'turbo' function of the routine would allow the bandwidth to scale up during periods of heavy usage, so long as the connection can sustain it without introducing heavy bufferbloat.

Of course the key point is the 'so long as the connection can sustain it without introducing heavy bufferbloat': a key goal is to ensure that the bandwidth increase does not itself result in heavy bufferbloat, which is especially important when latency-sensitive applications are in use.

So the idea is that an adaptive bandwidth routine allows a user to opt for a lower baseline bandwidth than they would otherwise have been comfortable compromising on, mindful that the routine will recover the extra bandwidth when it is needed for heavy downloads or uploads.

An alternative or complementary solution would be to vary CAKE's bandwidth depending on the detected traffic type. This latter option may warrant further consideration.

That does not sound bad: coefficients can be pre-calculated at start-up, and keeping two instead of one historic value is not too bad (except that doing that in shell is still not much fun :wink: ).
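
For reference, a two-pole Butterworth low-pass in its usual biquad form would look roughly like the sketch below. The cutoff (fc), tick rate (fs) and input file name are placeholders, not values anyone has proposed for the script:

# Rough sketch: biquad (two-pole Butterworth low-pass) over RTT samples,
# coefficients pre-calculated once in BEGIN; x1/x2 and y1/y2 are the two
# historic input and output values mentioned above.
awk -v fc=0.01 -v fs=1.0 'BEGIN {
    w = 3.14159265358979 * fc / fs
    K = sin(w) / cos(w)                 # tan(w); awk has no tan()
    Q = 1 / sqrt(2)                     # Butterworth Q
    n = 1 / (1 + K / Q + K * K)
    b0 = K * K * n; b1 = 2 * b0; b2 = b0
    a1 = 2 * (K * K - 1) * n; a2 = (1 - K / Q + K * K) * n
}
{   # $1 holds the current RTT sample
    y = b0 * $1 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
    printf "%.2f\n", y
    x2 = x1; x1 = $1; y2 = y1; y1 = y
}' rtt_samples.txt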

But why the two-pole Butterworth for the RTT baseline?

I can see its application to financial data, but for a baseline RTT derived from actual RTT + noise + bufferbloat?

What should the baseline look like? Should it look like the below, but obviously smoother (I just drew this with lines in Paint by eye)?

I think I can imagine how the two-pole Butterworth might give something more like the purple line above than the blue line. Is that better? The promise is a faster response to a true RTT increase, right? Because the increase related to bufferbloat should only survive a few ticks?

I wonder how the two approaches will compare when handling a true change:

The existing approach will result in lost bandwidth as it slowly catches up with the new RTT.

Just thinking out loud as usual. I think I am getting it.

Hi,

Will this work for me?

Thanks

That really depends on the true change in capacity and our downward adjustment step size: say the true capacity drops from 80Mbps to 20Mbps, but we only decrement in (made-up) steps of 1Mbps; it will take us a minute (60 ticks) to reduce the rate enough for the bufferbloat-related RTT increase to subside. If however we need X steps, then at worst it takes us X steps * tick duration seconds.
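
In shell-arithmetic terms, the made-up numbers above work out as:

# Worked example of the made-up numbers above, in integer Kbps:
# (80 Mbps - 20 Mbps) / 1 Mbps per tick = 60 ticks, i.e. ~1 minute at 1 s/tick.
echo $(( (80000 - 20000) / 1000 ))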

Mmmh, hard to say in general, but luckily easy to test.... just try it out and report success/failure/issues back to this thread, please.

What bandwidth does each of your LTE connections provide (max / min), and out of curiosity what are your LTE connection stats for each? Is this a fixed location or a moving one? If the latter, then especially given your wacky setup reported on the other thread, you present a very interesting test case for the shell script.

Network1

Why the separate WANs I wonder? One for heavy downloads and one for latency sensitive applications? That is one way to help manage things with LTE. Or redundancy? I am curious.

I personally have the B818-263 by the way. It's possible to write a bash script to do stuff like send and receive SMS. Also, do you have phones connected via the RJ11 port to use the SIM cards? I find mine drops from 4G to 3G when I use that. I think that is because the device identifier in the IMEI is not whitelisted by any UK provider. With your older router I think you might be able to spoof that so you can get VoLTE and hence stay on 4G during calls, but this is just speculation, and obviously just an academic / theoretical point.

https://device-wiki.com/en/changing-imei-on-a-3g-4g-modem-huawei-step-by-step.html

@moeller0 I received a pull request to replace every single bc call with awk calls. I'm excited to look into this and verify all is still well. Do the substitutions look reasonable to you?
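
For anyone following along, the sort of substitution involved would look something like this; a generic illustration rather than an actual line from the PR:

# Generic illustration of a bc-to-awk substitution (not an actual line from
# the PR): compute delta_RTT = RTT - baseline_RTT.
# before: delta_RTT=$(echo "$RTT - $baseline_RTT" | bc)
delta_RTT=$(awk -v r="$RTT" -v b="$baseline_RTT" 'BEGIN { printf "%.2f", r - b }')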

I have Lynx's script in use and it works well in my case, thank you! I made it much quicker to respond, because in my case the LTE bandwidth changes are big. My usable download speed varies between 20 and 120 Mbit/s and upload from 5 to 30 Mbit/s. Evenings (18:00-22:00) and weekends are worst, because everyone is streaming at that time. Workdays are actually very good, because there is so little streaming in the neighbourhood.

That is good in that it reduces the external dependencies (but I have not looked at the PR yet). What I wanted to do, however, was to see how far we can get with pure shell, that is, switch to integer/fixed-point maths and avoid the calls/hand-overs to external binaries. But since it is not clear whether that will work, maybe doing the all-awk thing first might be a good idea.

Mmmh, maybe I should change my todo list and work on getting CPU load numbers into the log next, so we can see how switching from bc to awk affects the run-time costs (the same is also indicated for my all-shell tests, because going all-shell only makes sense if it actually helps in some dimension, as it is going to be painful).
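
One cheap way to get such a number would be something like the sketch below; it is not what the script does today, and log_line/log_file are placeholder names:

# Sketch only: grab the 1-minute load average from /proc/loadavg and append
# it to each log line ($log_line and $log_file are placeholders).
read -r cpu_load _ < /proc/loadavg
printf '%s; %s\n' "$log_line" "$cpu_load" >> "$log_file"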

Will need to have a look, but at the earliest tonight after work.

Out of curiosity, which numbers did you change?

I am now testing with

alpha_RTT_increase=0.2
alpha_RTT_decrease=0.2

I am not sure if I am doing it right though.

Delighted to hear this. We have a similar setup, albeit mine is 20 to 70 Mbit/s and 25 to 35 Mbit/s. You must benefit from carrier aggregation to get up to 120 Mbit/s, right? My cell tower only supports one band. Which device do you have?

I don't think you will want to change those so much.

I'd rather recommend tweaking the bandwidth increment and decrement factors. I would imagine these would work fairly well for you given our similar connections. LTE takes a while to improve its connection under sustained use anyway; at least that has been my experience.

rate_adjust_RTT_spike=0.05 # how rapidly to reduce bandwidth upon detection of bufferbloat
rate_adjust_load_high=0.005 # how rapidly to increase bandwidth upon high load detected
rate_adjust_load_low=0.0025 # how rapidly to decrease bandwidth upon low load detected
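
To give a feel for how factors of this sort act on the shaper rate, here is an assumed multiplicative form for illustration; the bufferbloat_detected and load_high flags are hypothetical, so consult the script itself for the exact mechanism:

# Assumed multiplicative adjustment, illustration only; $bufferbloat_detected
# and $load_high are hypothetical flags, not the script's actual variables.
if [ "$bufferbloat_detected" -eq 1 ]; then
    # cut the rate sharply when an RTT spike indicates bufferbloat
    cur_dl_rate=$(awk -v r="$cur_dl_rate" -v f="$rate_adjust_RTT_spike" \
        'BEGIN { printf "%.0f", r * (1 - f) }')
elif [ "$load_high" -eq 1 ]; then
    # probe upward gently while the link is heavily used
    cur_dl_rate=$(awk -v r="$cur_dl_rate" -v f="$rate_adjust_load_high" \
        'BEGIN { printf "%.0f", r * (1 + f) }')
else
    # drift back down gently when the load is low
    cur_dl_rate=$(awk -v r="$cur_dl_rate" -v f="$rate_adjust_load_low" \
        'BEGIN { printf "%.0f", r * (1 - f) }')
fi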

Could you please paste the output from the script when the load is saturated with a few ISO downloads? I'd really like to graph the results for your connection, specifically how the bandwidth scales up from the minimum and then holds around the maximum available bandwidth.

Mmmh, these basically increase the up and down learning rates for the baseline RTT; I am not sure whether that is the ideal solution for big rate changes...

I would probably try:

rate_adjust_RTT_spike=0.2 # how rapidly to reduce bandwidth upon detection of bufferbloat
rate_adjust_load_high=0.05 # how rapidly to increase bandwidth upon high load detected
rate_adjust_load_low=0.025 # how rapidly to decrease bandwidth upon low load detected

instead. Changing the alphas IMHO really only makes sense if the true path RTT to your reflectors changes quickly; otherwise you are simply learning to accept your self-induced bufferbloat, which will result in generally higher throughput, but also worse latency-under-load.
That said, these are policy questions, and if you are happy with the performance with the changed alphas, just keep them that way; your network, your rules :wink:

I have added some more info on my other post.
It's fixed LTE; the speed ranges from 2 Mbps to 80 Mbps.
I use 2 WANs because I need to cater for 6 users with ±25 devices (PCs, laptops, phones, tablets, Android TV media players, IoT devices).
I do not have any phones connected via RJ11.

Had a quick look, and as far as I can tell it looks fine; maybe pull this into a fresh branch and test it on your link (as on my link this script does not really help, at best it does not hurt, making it tough to get meaningful testing done).

Side note: if we make sure all rates are integer numbers of Kbps (which we should do anyway, as cake does not really like fractional values there IIRC), we will be able to convert a few tests to pure shell.
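
As a rough illustration of what that would buy, once everything is an integer number of Kbps the adjustments can be done with plain shell arithmetic (percentages made up):

# Rough illustration, made-up percentages: with rates in integer Kbps,
# POSIX shell arithmetic suffices and no bc/awk call is needed.
cur_dl_rate_kbps=25000
cur_dl_rate_kbps=$(( cur_dl_rate_kbps * 95 / 100 ))     # -5% on an RTT spike
cur_dl_rate_kbps=$(( cur_dl_rate_kbps * 1005 / 1000 ))  # +0.5% on high load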

If, as I assume, the RPi4 acts as the router and is the only device talking directly to the LTE devices via the switch, you should be able to run two instances of this script, one for each link. (Note to self: we need to add the interface name to the log file/tmp folder so we are certain not to clobber ourselves here.)
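
Something along these lines would cover that note-to-self; the file names and the $ul_if variable are illustrative, standing in for whatever the script actually uses:

# Illustrative only: key the log and tmp paths on the interface so two
# instances cannot clobber each other ($ul_if is a placeholder for the
# variable holding the interface name).
log_file="/tmp/autorate-${ul_if}.log"
tmp_dir="/tmp/autorate-${ul_if}"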

Yes, that's correct.
Okay, I will try the script when I get a chance and report back later.
Thanks

@ryan74 and @smoltron, please, please, please do copy and paste the lines output by the script into a pastebin. You can even just run the script from SSH and then copy the lines from the terminal into the pastebin. Then I can plot the data and we can evaluate how well it works on your test cases. Please start the test with no load, then saturate the download with multiple simultaneous .ISO downloads; the script should ramp up the download bandwidth until it hits bufferbloat and then oscillate around it a bit. We are interested in the initial ramp-up and then the maintenance at high load. 5-10 minutes of data would be fine.

Quite a few people have indicated they use the script (either in this thread or by private message to me), but I would really like more data to see how well it is actually performing on different connections. Such data helps us improve the script.