CAKE w/ Adaptive Bandwidth

So looking a tiny bit closer: the DL OWD delta is positive and hovers around 0, but the UL OWD delta is massively negative, so something in the compensation code is incorrect...

The UL delta is circa 1188810 microseconds? But the compensation code only kicks in for deltas of circa 50 minutes or more.

So there are two issues here:

  • the compensation code to deal with rollover, for deltas > 50 mins

  • the selection of alpha depending on whether the OWD value is greater or lower than the baseline

Wouldn't this be fixed by adopting what I thought was your suggestion above:

dl_alpha = ${dl_owd_us#-} >= ${dl_owd_baselines_us[${reflector}]#-} ? alpha_baseline_increase : alpha_baseline_decrease,

That is, remove any negative sign in the comparison?
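Something like the following sketch, where the `#-` parameter expansion strips any leading minus so the comparison runs on magnitudes (the scalar variables and alpha values here are simplified stand-ins for the real per-reflector arrays):

```shell
#!/bin/bash
# Toy illustration, not the actual cake-autorate code. ${var#-} strips a
# leading minus sign, so the comparison is done on magnitudes regardless
# of the clock-offset sign.
dl_owd_us=-890
dl_owd_baseline_us=-990
alpha_baseline_increase=1     # placeholder alphas
alpha_baseline_decrease=100

if (( ${dl_owd_us#-} >= ${dl_owd_baseline_us#-} )); then
	dl_alpha=${alpha_baseline_increase}
else
	dl_alpha=${alpha_baseline_decrease}
fi
echo "dl_alpha=${dl_alpha}"   # 890 < 990, so the decrease alpha is chosen
```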

Maybe this would work, I do not have the time/calm to actually think this through....

Actually maybe nothing needs changing:

This situation would resolve itself as the baseline is just conservatively increasing.

Bottom line: don't have local clock randomly jumping around if using tsping / ICMP type 13 pings?

This is what @Simon10362 concluded:

@Simon10362 you could also consider switching to fping (since then clock issues are no longer a concern), albeit we lose some accuracy in managing heavily mixed (simultaneous download and upload) loads.

But we are not seeing clocks jumping around, really the two clocks are just out of sync with each other....

I thought we were seeing an issue with local clock being out of whack on startup. Maybe @Simon10362 can offer more details surrounding the clock skew thing he was describing.

Maybe the issue is clock changes on startup after cake-autorate started? Like clock is set on boot?

It seems we are both feeling mentally tired at the moment!

I guess clock skew here just implies two clocks that are not incrementing at the same rate... even with something like NTP on each server, that is unfortunately the normal behaviour of two independent clocks... (and to some degree expected; time progression depends on gravity)
I hope that our normal baseline adaptation code handles that gracefully, but I am not sure whether that is fully debugged yet, and my hunch is @Simon10362 just brought up a novel corner case :wink:

I think it’s as simple as this: if the clock(s) change in a way that results in a positive delta, we don’t immediately correct for it unless that delta is greater than 50 minutes (which we assume relates to midnight rollover). We can’t easily distinguish a positive delta that arose from bufferbloat from one that arose from clock changes, so we assume it’s the former and let the baseline slowly catch up.
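That rule can be sketched as follows (a toy illustration; the threshold matches the 50-minute figure discussed here, while the variable names are made up):

```shell
#!/bin/bash
# Toy illustration, not the actual cake-autorate code. ICMP type 13
# timestamps are milliseconds since midnight UTC, so once a day they wrap
# and produce a huge spurious delta; anything smaller is assumed to be
# bufferbloat (or a clock change) and is left to the slow baseline update.
rollover_threshold_us=3000000000   # 50 minutes expressed in microseconds
owd_delta_us=-1188810              # magnitude taken from the trace above

if (( ${owd_delta_us#-} >= rollover_threshold_us )); then
	action="compensate_immediately"
else
	action="slow_baseline_catchup"
fi
echo "${action}"
```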

Well, we are doing something suboptimally, as in the trace the UL OWD delta seems to have stayed negative for a long time... which is not really possible :wink:

Uhh by skew I guess what I meant was 'out of sync'?
Because it corrects itself after a delay and then it no longer errors out.
At the moment I have chrony synced with a local NTP server and things are good, even after boot.

I haven't given thorough thought to the possibility of clocks incrementing at different rates.


250 samples there. So not that long, right?

And remember, we only allow baselines to catch up very slowly by design:

Not really, only generally slow: we increase the baseline slowly, but decrease it fast.
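As a sketch of that asymmetry (the alpha values and per-mille scaling are made up for illustration; the real code differs):

```shell
#!/bin/bash
# Sketch of an asymmetric EWMA baseline: a tiny alpha when the new OWD is
# above the baseline (creep up slowly), a big alpha when it is below
# (snap down fast). Alphas are per-mille here purely to keep the
# arithmetic in integers; the values are illustrative, not the real ones.
update_baseline() {
	local baseline_us=$1 owd_us=$2 alpha
	if (( owd_us >= baseline_us )); then
		alpha=10    # slow increase: move 1% of the way per sample
	else
		alpha=900   # fast decrease: move 90% of the way per sample
	fi
	echo $(( (alpha * owd_us + (1000 - alpha) * baseline_us) / 1000 ))
}

b=20000                            # 20 ms baseline
b=$(update_baseline "$b" 120000)   # 100 ms spike above the baseline
echo "after spike: ${b}"           # barely moves
b=$(update_baseline "$b" 10000)    # dip below the baseline
echo "after dip: ${b}"             # most of the way down in one step
```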

Let's go through an example, with true minimal delays:
DL OWD: 20 ms, UL OWD: 10 ms
(asymmetric so we do not confuse ourselves with accidentally equal values)

Case A:
local clock (2000) > remote clock (1000)
our sequence of timestamps (assume no remote delay):
local origin send -> remote receive -> remote send -> local receive
2000 -> 1000 + 10 = 1010 -> 1000 + 10 = 1010 -> 2000 + 30 = 2030
UL OWD baseline: 1010 - 2000 = -990
DL OWD baseline: 2030 - 1010 = 1020
RTT = -990 + 1020 = 30
Now add 100ms queueing delay in the UL direction (local -> remote), which shifts every subsequent timestamp by 100:
2000 -> 1000 + 10 + 100 = 1110 -> 1000 + 10 + 100 = 1110 -> 2000 + 30 + 100 = 2130
UL OWD: 1110 - 2000 = -890 -> delta: -890 - (-990) = 100
DL OWD: 2130 - 1110 = 1020 -> delta: 1020 - 1020 = 0

Case B:
local clock (1000) < remote clock (2000)
our sequence of timestamps (assume no remote delay):
local origin send -> remote receive -> remote send -> local receive
1000 -> 2000 + 10 = 2010 -> 2000 + 10 = 2010 -> 1000 + 30 = 1030
UL OWD baseline: 2010 - 1000 = 1010
DL OWD baseline: 1030 - 2010 = -980
RTT = 1010 + (-980) = 30
Now add 100ms queueing delay in the UL direction (local -> remote), which shifts every subsequent timestamp by 100:
1000 -> 2000 + 10 + 100 = 2110 -> 2000 + 10 + 100 = 2110 -> 1000 + 30 + 100 = 1130
UL OWD: 2110 - 1000 = 1110 -> delta: 1110 - (1010) = 100
DL OWD: 1130 - 2110 = -980 -> delta: -980 - (-980) = 0

This is the case we seem to have with @Simon10362's data
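Both cases can be checked mechanically: each one-way delay absorbs the clock offset (with opposite signs), so the offset cancels in their sum, which is why the RTT comes out at the true 30 ms either way:

```shell
#!/bin/bash
# Recompute both worked cases from the raw timestamps above.
owd_and_rtt() {
	local t_local_send=$1 t_remote=$2 t_local_recv=$3
	ul=$(( t_remote - t_local_send ))   # UL OWD, offset-contaminated
	dl=$(( t_local_recv - t_remote ))   # DL OWD, offset-contaminated
	rtt=$(( ul + dl ))                  # offsets cancel in the sum
}

owd_and_rtt 2000 1010 2030   # Case A: local clock ahead of remote
echo "A: UL=${ul} DL=${dl} RTT=${rtt}"
owd_and_rtt 1000 2010 1030   # Case B: local clock behind remote
echo "B: UL=${ul} DL=${dl} RTT=${rtt}"
```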

So indeed we take a long time to adjust the UL baseline, because we do not believe the initial value to be correct (we initialise the baselines with 100000µs and here we are simply off by a factor of ten... so it takes quite a while to reach the 'correct' range). So yes, you are right, this is working as intended...

But I guess that a hallmark of this situation might be that one of UL_OWD_US or DL_OWD_US is always negative and the other positive... and we might simply implement a special rule: on the first sample from a reflector with sign(UL_OWD_US) != sign(DL_OWD_US), initialise the baselines with the first respective OWD_US values....

So I think we might need to revisit the baseline initialisation code somewhat, and maybe always initialise with UL_OWD_US or DL_OWD_US for tsping data... IFF one and only one of the two is negative... (the current logic is to start really high)
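A sketch of that proposed seeding rule (hypothetical, not current cake-autorate behaviour; `seed_baselines` and all values are made up):

```shell
#!/bin/bash
# Hypothetical seeding rule: if exactly one of the first UL/DL OWD
# samples is negative (the tsping clock-offset signature), seed the
# baselines from the measurements instead of the fixed 100000 µs start.
seed_baselines() {
	local ul_owd_us=$1 dl_owd_us=$2
	# arithmetic comparisons yield 0/1, so != acts as an XOR test
	if (( (ul_owd_us < 0) != (dl_owd_us < 0) )); then
		ul_baseline_us=${ul_owd_us}
		dl_baseline_us=${dl_owd_us}
	else
		ul_baseline_us=100000
		dl_baseline_us=100000
	fi
}

seed_baselines -990000 1020000   # a Case-A-style first sample
echo "UL baseline=${ul_baseline_us} DL baseline=${dl_baseline_us}"
```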

Or alternatively play with

if (( (${dl_owd_delta_us#-} + ${ul_owd_delta_us#-}) < 3000000000 ))

Now, here the difference was around 2.5 seconds, and that is still within what real networks can show as induced queueing delay. So I am not sure that would be all that helpful...
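For scale: that constant works out to 50 minutes, so a ~2.5 s combined delta (split illustratively across the two directions here) sails far under it:

```shell
#!/bin/bash
# The constant from the condition above: 3000000000 µs is 3000 s, i.e.
# 50 minutes. A ~2.5 s combined delta stays far below it, so this guard
# cannot tell it apart from genuine queueing delay.
threshold_us=3000000000
echo "threshold: $(( threshold_us / 1000000 / 60 )) minutes"

dl_owd_delta_us=1250000    # hypothetical split of the ~2.5 s difference
ul_owd_delta_us=-1250000
sum_us=$(( ${dl_owd_delta_us#-} + ${ul_owd_delta_us#-} ))
if (( sum_us < threshold_us )); then
	echo "sum=${sum_us} us: looks like plausible queueing delay"
fi
```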

Wow nice example. That’s helpful.

I wonder how frequent these time sync issues would be. I mean, since the situation corrects itself, if it occurs only on startup or not too frequently then perhaps it's no biggie.

That would help on startup, but then what about clock changes that occur randomly?

Maybe we can gain some insight from @tievolu’s post here:

One idea is some sort of correction logic that triggers if the deltas remain above a threshold for a certain length of time whilst the load is classified as idle. But I worry this might not be reliable enough.
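A rough sketch of that first idea, with a made-up streak counter and thresholds (speculative design, not cake-autorate code):

```shell
#!/bin/bash
# Speculative sketch: count consecutive idle samples whose delta stays
# above a threshold, and re-seed the baseline once the streak is long
# enough to trust the idle measurement over the stale baseline.
idle_streak=0
idle_streak_trigger=30      # hypothetical: 30 consecutive idle samples
delta_threshold_us=30000    # hypothetical: 30 ms

process_sample() {
	local load_class=$1 owd_delta_us=$2 owd_us=$3
	if [[ ${load_class} == idle ]] && (( owd_delta_us > delta_threshold_us )); then
		(( ++idle_streak ))
	else
		idle_streak=0
	fi
	if (( idle_streak >= idle_streak_trigger )); then
		baseline_us=${owd_us}   # trust the idle measurement
		idle_streak=0
	fi
}

for (( i = 0; i < 30; i++ )); do
	process_sample idle 50000 100000   # persistently elevated while idle
done
echo "re-seeded baseline: ${baseline_us}"
```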

Another idea: can we look at the signs and/or work with the total RTT somehow when working out whether to use the high or low alpha?

I have a custom x86 router with a PCI 5G modem with modemmanager and MHI driver. I have tried installing Cake autorate to test it with my 5G connection.
I am following the guide, but after installing luci-app-sqm, when I select my interface and start sqm, OpenWrt immediately reboots and the modem does not work after the reboot (in dmesg I see MHI kernel messages when starting the modem), so I have to uninstall all the packages installed for sqm and reboot to make it work again.

Why is this happening? I have already tested two times.

Ouch. One for @moeller0 methinks.

Not really, just disabling sqm should solve this issue; after all, sqm-scripts only sets up qdiscs that already exist on your system, and those do nothing if not used. Worst case you might need to unload some qdisc kmods. But the whole thing smells like a bug, potentially in the 5G modem driver...

Unfortunately I have no clue...

I would really not bother with this... The point is: if the two clocks actually jump in relation to each other more often than twice a day (if they are out of sync, both will roll over at different times, so we expect 2 jumps, but these jumps are already handled by the 50 minutes thing...), then that reflector should simply be avoided (modulo corner cases like the user changing the router's clock manually). That IMHO leaves only the initial values for the baselines; currently we set these to 100000µs IIRC, which typically is OK, just not when the true baseline is considerably above 100ms. So either we declare the current operation as 'by design' or we fix this cosmetic glitch by changing the initialisation for the tsping case if the two baselines have different signs... (or switch to initialising with the first OWD values in all cases*).
Maybe we could just add a note that autorate will typically take 10 minutes (I am guessing here, but others might know precisely) before it has reached 'optimal operating temperature' so users will know that they might have to wait a bit before starting their bufferbloat tests :wink:

*) I seem to recall we pondered that and did not do it to avoid starting with a totally crap baseline if autorate is accidentally started during heavy congestion... (but even then our 'slow increase fast decreases' approach should get us to the true baseline quickly once the congestion subsides).

Today I upgraded cake-autorate on my x64 debian router and tried to plot the octave graphs.
It fails with this:

[gustavo@dellg155 cake-octave]$ octave -qf --eval 'fn_parse_autorate_log("./cake-autorate.primary.log", "./outpug.tif")'
error: parse error near line 411 of file /home/gustavo/bin/cake-octave/fn_parse_autorate_log.m

  syntax error

>>>                     x_vec.DATA = (autorate_log.DATA.LISTS.PROC_TIME_US .- first_sample_timestamp);
[gustavo@dellg155 cake-octave]$ ls
cake-autorate.primary.log  fn_parse_autorate_log.m

Am I missing something?

I'm running stable branch 3.1.

In case of any help, the log is here:

Mmmh, which version of Octave are you using?
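For what it's worth, the `.-` in the failing line is suggestive: the element-wise `.+`/`.-` operators were deprecated and, if I recall correctly, removed in Octave 7, so a script written for older Octave fails with exactly this kind of parse error on newer releases. Since `.-` on numeric data is plain subtraction, replacing ` .- ` with ` - ` should be safe; for example:

```shell
#!/bin/bash
# Rewrite the removed '.-' operator to plain '-' on the failing line from
# the error message; demonstrated on a string here, but the same sed
# expression could be applied to fn_parse_autorate_log.m (back it up first).
line='x_vec.DATA = (autorate_log.DATA.LISTS.PROC_TIME_US .- first_sample_timestamp);'
fixed=$(printf '%s\n' "$line" | sed 's/ \.- / - /g')
echo "${fixed}"
```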