CAKE w/ Adaptive Bandwidth [August 2022 to March 2024]

I misread the question, apologies. And I agree with @Lynx that in such a case reducing the shaper rate in response to bloat epochs seems like the right thing to do. Especially since, in the absence of both load and bufferbloat, things will converge again on the baseline rates...
That leaves the situation where an idle link experiences exceptionally high delays that go away once a bit of constant traffic appears. But before trying to handle such a hypothetical case, I guess it makes sense to first wait for such a beast to appear :wink:

1 Like

Seems like we may have such a beast here:

When load resumes, the shaper rate should recover pretty quickly. But data/plots showing what's going on would really help.

Not sure; if this happens only occasionally, I would just ignore it. The point is: do these latency spikes go away under mild load (so are they unique to the network idling), or are they simply extreme delays in the network that affect everything, including the low-rate probe traffic cake-autorate uses?

Yes!

Interesting. Thank you @Lynx and @moeller0 for the great explanation.

I'll change my xl_owd_delta_thr_ms/xl_avg_owd_delta_thr_ms and bufferbloat_detection_window/bufferbloat_detection_thr settings back to the default values to see if the behavior returns, and will post the logs if it does.

Given the explanation (for xl_idle_bb), a follow-up question I have is, what if the connection changes for example so that the latency "permanently" doubles both when idle and when under load?

Would the consequence of that be that the speeds would get stuck at min_xl_shaper_rate_kbps, due to both the xl_idle_bb and xl_high_bb events never allowing them to go back to the baseline? At least until the xl_owd_delta_thr_ms/xl_avg_owd_delta_thr_ms parameters were manually adjusted to reflect the new reality?

That's what I suspected was happening to my connection, though perhaps not literally "permanently" but at least for several days. And I'll see if it's happening, so we can confirm it one way or the other.

Thank you! I wasn't aware of this and will make sure to adjust accordingly.

So the code already (slowly) adjusts to increases in persistent delay. This works by dissecting the total delay into a "static" part and a "dynamic" part; we use a slow EWMA to make the baseline auto-adjust to the minimal RTT/OWD along a path, so if the true path RTT increases, autorate will sooner or later catch up to it. During that time, however, autorate will likely be stuck at the configured minimal rate. If the path RTT decreases, autorate will follow much more quickly (almost instantaneously): decreases in minimal RTT are truly indicative of a shorter path (or clock issues :wink: ), while increases in RTT can come either from structural path-length changes (e.g. 1.1.1.1 being anycasted to a close-by location or to the other side of the world) or simply from sustained bufferbloat. So adjusting the baseline delay value upwards needs to be slow, otherwise autorate gets desensitized against bufferbloat/congestion, which would not be helpful ;).

No, what slowly adapts is the baseline value we subtract from the actual RTT/OWD samples so that the delta_delay values are meaningful again. In our experience so far, the actual dynamic delay changes due to queuing/bufferbloat are surprisingly constant and mostly independent of the baseline delay values (but note that @Lynx implemented some pretty clever reflector sanity checks, so reflectors that behave massively differently will be pruned from the active set). So if there was a true increase in path delay, autorate will take a while to adjust to it, and while adapting it will likely be stuck at the configured minimal rate.
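To make the mechanism concrete, here is a minimal sketch of that asymmetric EWMA (written in Python purely for illustration; the actual cake-autorate implementation is a bash script that keeps per-reflector state and differs in detail):

def update_baseline(baseline_ms, owd_ms,
                    alpha_increase=0.001,  # small: sustained bloat leaks into the baseline only slowly
                    alpha_decrease=0.9):   # large: a shorter path is adopted almost instantly
    # Pick the learning rate depending on the direction of the change.
    alpha = alpha_increase if owd_ms > baseline_ms else alpha_decrease
    baseline_ms = (1 - alpha) * baseline_ms + alpha * owd_ms
    # delta is the "dynamic" component the bufferbloat detection actually looks at.
    return baseline_ms, owd_ms - baseline_ms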

That adaptation process will not take several days... logs would be really helpful. If you do not want to or cannot run the analysis yourself, collect the autorate logfile and post it at some file-sharing host like easyupload.io and I will see whether I can have a go....

Now, I am sure there are situations where my recommendation does not hold, but the general idea is to ask a set of different reflectors and let them "vote" on whether they show indicators of bufferbloat. The diversity is a clever idea @dlakelan brought to the table that deals with the fact that we measure the total delay along a path, but we want to control the rate for our access link. If, e.g., your local Google data center is congested and 8.8.8.8 returns massively increased RTTs/OWDs, but your real link is just fine, the idea is to also ask Cloudflare (1.1.1.1) and Quad9 (9.9.9.9) and others, and only if most/all agree that there is bufferbloat do we act.

If we had a robust and reliable reflector just on the other side of the link in the ISP's network, the autorate code could be much simpler ;), but generally we cannot fully trust any single reflector, hence the observed complexity.

The other factor is that the delay measurements are interleaved so as not to overload any reflector, and that means that requiring more samples (mostly via bufferbloat_detection_thr) will make the control loop take longer to engage.
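A sketch of that X-out-of-Y vote (again illustrative Python, not the actual implementation; the parameter names mirror the bufferbloat_detection_window/bufferbloat_detection_thr options with their 3-of-6 defaults and a hypothetical 30 ms delta threshold):

from collections import deque

def make_detector(window=6, thr=3, owd_delta_thr_ms=30.0):
    recent = deque(maxlen=window)  # last `window` delta samples, interleaved across reflectors
    def on_sample(delta_ms):
        recent.append(delta_ms > owd_delta_thr_ms)  # each sample casts one vote
        return sum(recent) >= thr                   # engage only if thr of the last window agree
    return on_sample

detect = make_detector()
detect(5.0)                               # False: a single quiet sample
detect(45.0); detect(50.0); detect(70.0)  # True on the third call: 3 of the last 6 exceeded the threshold

Because the pings are interleaved, filling a larger window simply takes more wall-clock time, which is why larger window/threshold values make the controller slower to engage.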

1 Like

Here's a log file of the current behavior: https easyupload io/jgy6e0

(Sorry for the lack of clickable link, for some reason it would not allow me to post a valid link to easyupload.)

It's not a perfectly clean experiment due to my wife using the internet too (and I cannot kick her out right now), but it still shows bufferbloat taking the speeds all the way to the minimum (20/3 Mbps).

It also wasn't as bad this time compared to what I've seen before: this time it did not start adjusting the speeds down immediately after restarting cake-autorate, but I had to wait for a while until I noticed the speeds dropping.

Also, to add a little more context to the issues I've seen lately: one place I have noticed the increased latency and the increased latency variance is during Rocket League gameplay, where the mid-game ping used to be ~50ms, but this weekend, for example, it was mostly in the range of 100-200ms, with lows near 50ms but occasional spikes as high as 1,000ms.

Based on your explanation of the EWMA-adjusted baseline RTT: if it adjusts slowly to increased latency but quickly to decreased latency, could it be that the baseline therefore follows the minimum observed latency more closely than the average latency? And under conditions where the latency has very high variance, it adjusts down quickly whenever a ~50ms latency measurement comes in, but does not adjust up when the 100-200ms or 1,000ms measurements come in, and therefore fails to track the average baseline latency? I may be completely off as I don't know the system the way you do, but that might explain why my connection now sometimes tends to adjust to the minimum speeds even when the connection is idle?

That sounds harsh, divorcing (kicking out) your wife just for easier network debugging :wink:

Exactly, as that static minimum is most descriptive of the underlying network path.

Not exactly: you need X out of the last Y delay samples (likely from different reflectors) to indicate increased delay before the controller engages... in your case the latency simply kept creeping up in the absence of any noticeable load, at least in the beginning:



Yes, your data shows this odd rate increase... but the actual load stays well below the minimum rate. What I find odd is that I fail to see larger rate-reduction steps in the graphs; maybe you could post your config again, please?

2 Likes

Of course, sorry about that oversight. I do have my custom settings, though I just changed the ms thresholds back to the defaults (30/60ms) and also the bufferbloat detection back to the default of 3 out of 6 samples required.

min_dl_shaper_rate_kbps=20000  # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=50000 # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=100000  # maximum bandwidth for download (Kbit/s)

min_ul_shaper_rate_kbps=3000  # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=6000 # steady state bandwidth for upload (Kbit/s)
max_ul_shaper_rate_kbps=10000  # maximum bandwidth for upload (Kbit/s)

shaper_rate_min_adjust_down_bufferbloat=0.998    # how rapidly to reduce shaper rate upon detection of bufferbloat (min reduction)
shaper_rate_max_adjust_down_bufferbloat=0.99    # how rapidly to reduce shaper rate upon detection of bufferbloat (max reduction)
shaper_rate_adjust_up_load_high=1.001           # how rapidly to increase shaper rate upon high load detected
shaper_rate_adjust_down_load_low=0.999          # how rapidly to return down to base shaper rate upon idle or low load detected
shaper_rate_adjust_up_load_low=1.002            # how rapidly to return up to base shaper rate upon idle or low load detected

Note that I'm no longer confident with the above parameters, as they were adjusted to improve my Rocket League gaming experience before my connection started behaving strangely.

Thanks! Sure, these :grinning:

shaper_rate_min_adjust_down_bufferbloat=0.998    # how rapidly to reduce shaper rate upon detection of bufferbloat (min reduction)
shaper_rate_max_adjust_down_bufferbloat=0.99    # how rapidly to reduce shaper rate upon detection of bufferbloat (max reduction)

are super gentle: at worst you reduce the shaper rate by 1%, but mostly by only 0.2%... this will result in a somewhat sluggish control loop. Maybe think about increasing shaper_rate_max_adjust_down_bufferbloat to 0.95 or 0.9? That would still be gentle but at least pack some punch if there is true congestion/queuing delay. At the moment your control variables are really close to the shaper_rate_adjust_XX_load_low values, which are intended to be slow and sluggish.
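To put numbers on "sluggish": the reductions compound multiplicatively, so the number of consecutive bufferbloat events needed to halve the shaper rate is ln(0.5)/ln(factor). A quick illustration:

import math

# Consecutive bufferbloat events needed to halve the shaper rate, per reduction factor.
for factor in (0.998, 0.99, 0.95, 0.9):
    print(f"factor {factor}: {math.log(0.5) / math.log(factor):.0f} events to halve the rate")
# factor 0.998: 346 events; 0.99: 69; 0.95: 14; 0.9: 7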

That is unfortunately often the case with optimizations: we end up optimizing for the specific condition we can test, and that sometimes results in less generality than one would hope... Maybe try to find new optimal values for the current condition, and then, after having found them, pick values somewhere in the middle between the two sets?

1 Like

Not disagreeing in theory, but are you sure that this method also leads to the best results in practice? (I know you added a little more context than that, but I'm still guessing that this might be the root cause to the issues I've been seeing.)

For example, if my connection actually has an idle latency between 20 and 1,000 ms (both extremes being somewhat rare), averaging somewhere in the 100-200ms range, I think what can happen is that the baseline stays below 100ms most of the time (due to adjusting back down much faster than it adjusts up), but since my idle latency is higher than 100ms most of the time, my speed will keep getting adjusted down by xl_idle_bb events.

And if this is the case, then that is not what I would want it to do, since adjusting the speeds down is not going to improve latency if the baseline does not reflect the average idle performance of the connection.

Instead, what I would like to see in this scenario is a baseline of ~150ms, and then be adjusted down by xl_low_bb or xl_high_bb events if the latency under load exceeds 150ms by some margin.

Or is my logic somehow flawed?

Agreed, and in my previous iteration I actually had it at 0.90. I think I started adjusting it even higher after seeing the first issues with cake-autorate reducing the speeds without any high-load bufferbloat, and I wanted to try to reduce that.

To restart my optimizations, I am going to test disabling cake-autorate first and get a fresh feel for how the latency/lag in Rocket League is with standard cake without autorate. (Of course, "feel" is the most unscientific way possible to test or optimize anything, but in case the difference is big enough, even that can work.)

1 Like

So, from my perspective the problem is that we want to deduce the actual capacity share we can use over a link (under the assumption that staying inside one's capacity share helps to avoid massive queuing delays and hence the negative effects of 'bufferbloat'), but we really only have limited information.
So we try to decompose the observed latency into an invariable and a variable component and reason that since queuing is mostly transient and variable, we can ignore the invariable component and focus on the variable component. This approach allows us to effortlessly evaluate and aggregate over a diverse set of reflectors with differing distances/path RTTs.
However, one challenge is that we can really only dissect invariable from variable latency, while what we really want is the bottleneck queue's queuing delay. If queuing is truly only transient, our approach works really well; if queuing delay is too sluggish (like potentially in your case), we assign part of the queuing delay to the invariable component, thereby compromising on latency. That is why we are by default conservative in growing the baseline delay values, so we do not accidentally account a persistent queue as part of the "path RTT". However, this is not set in stone; if your local policy is to adapt faster to changing baselines, just change the following parameters:

# OWD baseline against which to measure delays
# the idea is that the baseline is allowed to increase slowly to allow for path changes
# and slowly enough such that bufferbloat will be corrected well before the baseline increases,
# but it will decrease very rapidly to ensure delays are measured against the shortest path
alpha_baseline_increase=0.001  # how rapidly baseline RTT is allowed to increase
alpha_baseline_decrease=0.9  # how rapidly baseline RTT is allowed to decrease

That way the baseline estimates can be adjusted faster or slower...
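To get a feeling for the time scales, here is a toy step-response simulation (my own illustration, not project code) of how long the baseline takes to absorb a sudden 100 ms jump in the true path RTT. I am assuming ~20 baseline updates per second for the conversion to seconds; the real cadence depends on no_pingers and reflector_ping_interval_s:

def samples_to_absorb(jump_ms=100.0, alpha_increase=0.001, tolerance_ms=5.0):
    # Count EWMA steps until the baseline is within tolerance of the new path RTT.
    baseline, samples = 0.0, 0
    while jump_ms - baseline > tolerance_ms:
        baseline = (1 - alpha_increase) * baseline + alpha_increase * jump_ms
        samples += 1
    return samples

for alpha in (0.001, 0.05):  # the default above vs. an example faster setting
    n = samples_to_absorb(alpha_increase=alpha)
    print(f"alpha_baseline_increase={alpha}: {n} samples (~{n / 20:.0f} s at 20 samples/s)")
# alpha_baseline_increase=0.001: ~2994 samples (~150 s); 0.05: ~59 samples (~3 s)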

An additional challenge with our approach is that we really only have proxy data for the value we want to know; that is, we use latency changes to reason about underlying capacity-share changes. And our three rate values per direction really only describe our policy, not anything true about the bottleneck link. E.g., in your case the observed latency-increase epochs might well be caused by other users in your radio cell pushing/pulling massively too much traffic through the link, thereby increasing the queuing delay, OR they could be caused by intermittent RF noise that affects transmissions and causes massive amounts of retransmissions and/or fallback to low "MCS" values (not sure whether GSM-type networks call this MCS), in which case the path capacity massively changes.

The problem with your "average latency as baseline" approach as a general policy is that it can make the control loop insensitive to queuing delay quite quickly, and if that happens during a longer-term data transfer, low latency will be compromised for the duration of that transfer. Recently, members of my family started playing modern games that occasionally require multi-dozen-GB downloads, which over our 100/40 Mbps link can take a few hours; it would be really sad if videoconferencing, VoIP, and gaming for the other members were compromised for such long durations...

Well, adding more traffic to an already overfull queue will increase the queuing delay even more and/or drive the queue into tail-drop (though mobile carriers apparently often opt for massive queues*, so dropping is going to be rare).

*) My best guess is that this is intended to help with the variability of the radio link: if you use your phone in a car and e.g. drive through a tunnel, you really expect not to notice any hiccups, but the radio link might actually be lost or massively degraded transiently. In such a case, shallow buffering will lead to noticeable packet loss, which takes time to recover from and which for many applications will be user-noticeable, while transiently letting the queues fill up to "insane" values will be far less noticeable (unless the user is using a real real-time service at the time).

1 Like

Thank you for the very detailed explanation @moeller0 !

I think I understand the current logic much better now, and can see that it is well thought out.

But yes, my link seems to have such high latency variance in the base/invariable component that a significant part of it gets counted as the "variable" part, which then triggers the bufferbloat events and speed adjustments.

I could try to adjust that based on the parameters you pointed out, but after some other tests I ran last night and this morning, I'm somewhat convinced that my connection is just so bad now that it's not really compatible with the adaptive bandwidth approach.

What I did last night was disable cake-autorate and set the basic SQM speed limits to 70/6 Mbps (down/up), which are higher than my base rates were and far higher than my minimum rates were in cake-autorate.

With this setup the Waveform bufferbloat test gave an A+ rating with zero latency added under load, and at that exact time the average latency was 60-70ms (which would be good enough if it stayed stable).

The next test was to play some Rocket League, and while it started OK, soon the ping started jumping around, causing visible lag in the gameplay and triggering the "Latency Variation" warnings in the game. (So, latency-wise this Rocket League test was no better, nor any worse, than what it has been lately when using cake-autorate. But it was better in the sense that there was no download/upload speed punishment due to the latency variation, which is what I do see with cake-autorate.)

So, my conclusion from the above (and from all my other recent experiences) is that the new behavior of my link is one where the real bufferbloat happens at much higher speeds, but regardless of the speed, there is just too much variation in the base latency (i.e. the "invariable" component) for cake-autorate to be able to tell the difference between the "base latency variance" and the actual bufferbloat-related latency spikes.

Perhaps with a lot more tuning of the parameters I could get it to work decently, but I'm not confident enough in my ability to tune them that well. And given the time it would take even if I did get it to work, I'm not sure it's worth the effort compared to running just the basic SQM with fixed upload and download speeds for now. And for my Rocket League hobby, I'd need a new internet provider either way.

I blame Verizon. :frowning:

1 Like

Buried in this thread somewhere is the idea of developing profiles for certain situations, like a gaming profile: the active DHCP leases or fwmarks within a time period are monitored to activate a given profile, and certain cake settings are then applied based on that.

So, for example, there could be two modes:

  • normal - maximise shaper rate whilst maintaining latency below set criteria; and
  • gaming - use fixed shaper rate.

Or perhaps:

  • normal - maximise shaper rate whilst maintaining latency below relaxed criteria; and
  • low-latency - maximise shaper rate whilst maintaining latency below aggressive criteria.

But this would add yet another layer of complexity to cake-autorate, and certainly for my use case cake-autorate seems to perform well enough now on my 4G connection that I no longer think about it - everything just works satisfactorily, including at least: VoIP; Teams/Zoom; browsing; YouTube 4K videos; Netflix; Prime; Windows Update; and occasional heavy downloads.

1 Like

Well, these two parameters allow you to get pretty much the behaviour you asked for, so maybe try:

alpha_baseline_increase=0.05  # how rapidly baseline RTT is allowed to increase
alpha_baseline_decrease=0.5  # how rapidly baseline RTT is allowed to decrease

This will make the baseline adjustments faster when it goes up and slower when it goes down...

I can understand if you are done and go for a static configuration, but then I would also love to learn whether your proposal results in a more usable link for your use-cases :wink:

1 Like

Side-note:
I have been hearing for some time now how wired internet is going to go extinct because the younger generations all flock to mobile-only; given the results we see in this thread, I wonder whether this claim is made up mostly by mobile-carrier PR folk?
(I also traveled last week with only my LTE phone and did a few speedtests on the go, with latency during the speedtests up to the full-second range; quite irritating even for plain browsing.)

So what is it that makes one opt for wireless only? Is it a lack of alternatives, the convenience of being available everywhere, or even price? Truly puzzled why anybody would do this voluntarily without a good rationale :wink:

Funny, this makes me think about amateur radio. I studied for and sat a tricky exam covering what was, for me back then, pretty tricky physics, partly so that I could brag about it on my CV for university applications. And even though I did think from time to time that there was something more special about the idea of wireless rather than wired communication, as a teenager around the turn of the millennium, when 'the internet' became a big thing, I was drawn more to fixed-line internet over a 56k telephone line (and later ADSL) than to my amateur radio hobby.

In terms of the internet, fixed cable ruled for long enough, but 3G, 4G, LTE, Starlink and friends have changed things.

These days I live in the middle of nowhere in the Scottish Highlands, and there is only copper cable with circa 2-4 Mbit/s. 4G is relatively inexpensive compared with having to pay both: a) line rental; and b) an internet contract. And as a mobile contract, it simply and conveniently rolls over each month, rather than me having to set up a new fixed-term contract every time the fixed term expires to avoid killer variable-rate fees.

Perhaps when I get more time I will return to amateur radio and the intrigue associated with wireless communication.

1 Like

Sure, flexibility is great, but the change clearly was not all for the better either?

Ah, okay, over here things are simpler: we pay a single total for internet access without having to lease a line and then find an ISP, and contracts, after their initial fixed period (in which neither side can easily cancel), now continue on afterwards with each side only having to give one month's notice before cancellation. ISPs so far rarely initiate cancellations, as that might poke consumers enough to research their options and switch ISP... so it is essentially the roll-over you have in your mobile contract (and I fully agree that is considerably more convenient than having to go through the motions every X months*). I am not 100% sure whether it is by law, but here the price does not change after the fixed period. I remember this from California, though, where prices increased significantly after 2 years, with the only chance of getting decent prices again being to sign for another 2 years.

*) I still occasionally check prices with other ISPs but just because I want to know instead of having to do so on a fixed schedule...

Since you made it that easy for me to just plug in the new values for the alpha_baseline variables, I'm more than happy to try them out. I'll plug them in later today...

If we go really deep into the reasons, then for me it starts with the legalized corruption in America, where ISPs are allowed to divide areas among themselves so that each area is only served by one ISP, which can then charge monopolistic prices for very poor service.

At least for the last 3 addresses that I have lived in, each was served by just one ISP at the time we moved in, and in each location it was a different ISP even though my last 2 addresses are just a couple of miles apart.

My previous address was served by Verizon DSL, and they took 2-3 weeks just to get that service up and running from the day I signed up for their internet service.

At our current address the only non-wireless option is Spectrum cable, and though I cannot remember the exact details, I think they charged $90/mo for maybe a 20 Mbps connection, which is so overpriced that I hate it with a passion even if I didn't care about the $90 myself.

So, I switched from Spectrum to Verizon 5G Home Internet as soon as it became available, after first testing it out. For $25/mo I got a 100 down / 10 up Mbps connection (which actually delivers up to 300/20 speeds on a good day), and with just basic SQM the performance in low-latency gaming was excellent at ~20ms ping. (Note that that is not at all what I see from Verizon 5G Home now, but I just can't see myself giving in to corporate greed and going back to Spectrum at this point.)

I have pre-ordered a 10 Gbps fiber from Sonic that is promising to start servicing this area soon (and charges only $50/mo for it), but I'm not holding my breath that they'll actually do that anytime soon.

2 Likes

Everything seems to work well with the settings I was able to hone in on.

I'm running it as a service with logging disabled, and am consistently seeing 90-95% load on cpu1 of my RT3200 (irqbalance, no steering) while running the download portion of this bufferbloat test. The actual results are quite nice and I haven't seen issues yet, but I am concerned that under certain loading conditions I could max out the CPU. Is this something I should concern myself with, or is htop not the best means to check for this?

I tried restarting the service after changing the following settings yet it seems to make little to no difference.

no_pingers=4 # number of pingers to maintain
reflector_ping_interval_s=0.4 # (seconds, e.g. 0.2s or 2s)

This is what htop looks like while downloading a Steam game at approx. 200 Mbps:

@Lynx Is there a chance this script could be included as an option to select in sqm/qos from luci?