CAKE w/ Adaptive Bandwidth [August 2022 to March 2024]

Just to clarify: when autorate enters "sleep/idle" mode it stops sending latency probes. The rationale is that over a day the probes accumulate quite some traffic volume, which can get expensive on volume-metered links, so the goal is to spare these probes at times when they are not needed.
So getting epochs with only LOAD records is fine AS LONG AS the record type switches almost instantaneously back to DATA (plus LOAD) after you start a heavier use case. I tend to simply run a speedtest; then two things need to happen:
a) DATA entries need to appear in autorate's log
b) the rate reported by the speedtest needs to be well above your configured minimum rate, assuming min_shaper_rates_enforcement is set to 1; if min_shaper_rates_enforcement was set to 0 this becomes less clear, as the "residual" shaper rate depends on whatever rate was active when sleep was entered.
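A quick way to check point a) from the command line. The log path here is a hypothetical example, so adjust it to wherever your cake-autorate instance writes its log:

```shell
# Quick wake-up check. LOG is a hypothetical path -- adjust to wherever
# your cake-autorate instance writes its log file.
LOG=/var/log/cake-autorate.primary.log

# Count DATA records among the last 200 log lines; shortly after starting
# a speedtest this should become non-zero if wake-up from sleep worked.
recent_data=$(tail -n 200 "$LOG" 2>/dev/null | grep -c '^DATA' || true)

if [ "${recent_data:-0}" -gt 0 ]; then
    echo "awake: DATA records present"
else
    echo "possibly still asleep: only LOAD records"
fi
```

Run it once during idle, then again a few seconds into a speedtest; the output should flip from the second message to the first.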

So do you have any recollection what type of sleep you encountered?

We have no good theory yet for why deep sleep happens, but it might involve a race condition somewhere that triggers rarely enough on decent links to make debugging a pain. However, my ISP or modem recently put my link into a high-packet-loss mode for a few days, which triggered the condition frequently, so maybe your Starlink link with its inherently higher packet loss might help us finally solve that riddle...

@Lynx: if the dependence on packet loss holds, I think we really should strictly serialize sleep, reflector replacement, and stall handling. I understand that we currently do not see how these could interfere, but we also do not actually make interference impossible; we merely do not see how they could interfere. That might mean they truly are mutually exclusive, or just that we cannot see far enough, so let me argue we should make these things mutually exclusive by construction.

Here are a bunch of lines from yesterday if you want to look at them. I will note, though, that I read on the Starlink reddit that a bunch of people have observed unusual ping spikes on Starlink the past few days, so it's possible that I'm seeing a temporary thing that will pass after they get that fixed, I'm not sure.

Well, you're in luck, as it looks like it's happened again. I haven't been on the connection much all day, so I'm guessing there were long idle periods. But when I checked the logs they only contain LOAD lines, and I tried a speed test with no effect -- they continue to show only LOAD lines.

One thing I'll note is that when I do "tc -s qdisc" I can see cake is currently set to my maximum shaper rate of 200 Mbps down / 30 Mbps up, which is unusual as I don't think I've ever noticed cake-autorate actually getting fully up to the maximum I have defined as the max shaper rate.

So perhaps there could be a clue in that? Why would it be running at the maximum rate?

Here are some log lines, unfortunately as far back as the logs go it is just LOAD lines. I'll keep autorate running as is if there is something else you want me to try before rebooting:

LOAD; 2023-01-22-13:31:20; 1674415880.021940; 1674415880.021201; 0; 0; 25414; 4973
LOAD; 2023-01-22-13:31:20; 1674415880.232015; 1674415880.231271; 0; 0; 25414; 4973
LOAD; 2023-01-22-13:31:20; 1674415880.442129; 1674415880.441390; 6; 3; 25414; 4973
LOAD; 2023-01-22-13:31:20; 1674415880.652076; 1674415880.651330; 0; 0; 25414; 4973
LOAD; 2023-01-22-13:31:20; 1674415880.862088; 1674415880.861349; 8; 8; 25414; 4973
LOAD; 2023-01-22-13:31:21; 1674415881.072102; 1674415881.071359; 0; 0; 25414; 4973
LOAD; 2023-01-22-13:31:21; 1674415881.282119; 1674415881.281369; 113; 117; 25414; 4973
LOAD; 2023-01-22-13:31:21; 1674415881.492402; 1674415881.491656; 58; 6; 25414; 4973
LOAD; 2023-01-22-13:31:21; 1674415881.702008; 1674415881.701276; 2; 0; 25414; 4973
LOAD; 2023-01-22-13:31:21; 1674415881.912034; 1674415881.911295; 0; 0; 25414; 4973
LOAD; 2023-01-22-13:31:22; 1674415882.122701; 1674415882.121956; 0; 4; 25414; 4973
LOAD; 2023-01-22-13:31:22; 1674415882.332168; 1674415882.331418; 65; 36; 25414; 4973
LOAD; 2023-01-22-13:31:22; 1674415882.541945; 1674415882.541212; 0; 0; 25414; 4973
LOAD; 2023-01-22-13:31:22; 1674415882.752068; 1674415882.751332; 0; 8; 25414; 4973
LOAD; 2023-01-22-13:31:22; 1674415882.962109; 1674415882.961369; 0; 29; 25414; 4973
LOAD; 2023-01-22-13:31:23; 1674415883.172166; 1674415883.171422; 63; 65; 25414; 4973
LOAD; 2023-01-22-13:31:23; 1674415883.382180; 1674415883.381433; 122; 131; 25414; 4973
LOAD; 2023-01-22-13:31:23; 1674415883.592124; 1674415883.591369; 212; 35; 25414; 4973
LOAD; 2023-01-22-13:31:23; 1674415883.803009; 1674415883.801870; 28; 71; 25414; 4973
LOAD; 2023-01-22-13:31:24; 1674415884.012416; 1674415884.011645; 107; 78; 25414; 4973
LOAD; 2023-01-22-13:31:24; 1674415884.222124; 1674415884.221382; 18; 2; 25414; 4973
LOAD; 2023-01-22-13:31:24; 1674415884.432143; 1674415884.431402; 10; 75; 25414; 4973
LOAD; 2023-01-22-13:31:24; 1674415884.642192; 1674415884.641455; 91; 33; 25414; 4973
LOAD; 2023-01-22-13:31:24; 1674415884.852371; 1674415884.851622; 13; 13; 25414; 4973
LOAD; 2023-01-22-13:31:25; 1674415885.062093; 1674415885.061352; 18; 16; 25414; 4973
LOAD; 2023-01-22-13:31:25; 1674415885.272120; 1674415885.271381; 12; 13; 25414; 4973
LOAD; 2023-01-22-13:31:25; 1674415885.482161; 1674415885.481414; 13; 15; 25414; 4973
LOAD; 2023-01-22-13:31:25; 1674415885.692442; 1674415885.691690; 2; 6; 25414; 4973
LOAD; 2023-01-22-13:31:25; 1674415885.902448; 1674415885.901694; 13; 10; 25414; 4973
LOAD; 2023-01-22-13:31:26; 1674415886.112182; 1674415886.111439; 19; 125; 25414; 4973
LOAD; 2023-01-22-13:31:26; 1674415886.322087; 1674415886.321347; 91; 5; 25414; 4973
LOAD; 2023-01-22-13:31:26; 1674415886.532147; 1674415886.531409; 26; 0; 25414; 4973
LOAD; 2023-01-22-13:31:26; 1674415886.742243; 1674415886.741496; 0; 0; 25414; 4973
LOAD; 2023-01-22-13:31:26; 1674415886.952007; 1674415886.951272; 0; 0; 25414; 4973
LOAD; 2023-01-22-13:31:27; 1674415887.162375; 1674415887.161619; 0; 3; 25414; 4973
LOAD; 2023-01-22-13:31:27; 1674415887.372166; 1674415887.371432; 3; 6; 25414; 4973
LOAD; 2023-01-22-13:31:27; 1674415887.582093; 1674415887.581351; 3; 7; 25414; 4973
LOAD; 2023-01-22-13:31:27; 1674415887.792071; 1674415887.791329; 0; 3; 25414; 4973
LOAD; 2023-01-22-13:31:28; 1674415888.002104; 1674415888.001364; 3; 85; 25414; 4973
LOAD; 2023-01-22-13:31:28; 1674415888.212130; 1674415888.211385; 86; 14; 25414; 4973
LOAD; 2023-01-22-13:31:28; 1674415888.422158; 1674415888.421410; 21; 9; 25414; 4973
LOAD; 2023-01-22-13:31:28; 1674415888.632085; 1674415888.631348; 3; 7; 25414; 4973
LOAD; 2023-01-22-13:31:28; 1674415888.842527; 1674415888.841780; 43; 14; 25414; 4973
LOAD; 2023-01-22-13:31:29; 1674415889.052104; 1674415889.051360; 5; 84; 25414; 4973
LOAD; 2023-01-22-13:31:29; 1674415889.262123; 1674415889.261380; 27; 30; 25414; 4973
LOAD; 2023-01-22-13:31:29; 1674415889.472127; 1674415889.471381; 3; 10; 25414; 4973
LOAD; 2023-01-22-13:31:29; 1674415889.682146; 1674415889.681409; 4; 11; 25414; 4973
LOAD; 2023-01-22-13:31:29; 1674415889.892425; 1674415889.891681; 8; 2; 25414; 4973


Great! Those are the symptoms we are looking for! But now we need to make sure that we enable the correct logging/debugging options to catch the required evidence for what is happening.

So the current script will do either of the following when entering idle-sleep:
a) reduce the shaper to the minimum rate (if min_shaper_rates_enforcement=1 in the config file)
b) keep the shaper setting just as it was when sleep was entered (if min_shaper_rates_enforcement=0 in the config file)

In the second case the shaper can in theory end up stuck at the maximum, even though I would expect the rate to sit somewhat below the maximum due to the regression towards the base rates without load.
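In shell terms, the two cases above amount to this sketch (the rate values are illustrative; the config variable name follows the cake-autorate config file):

```shell
# Sketch of the idle-sleep shaper decision, mirroring cases a) and b).
# min_shaper_rates_enforcement comes from the cake-autorate config file;
# the rate values below are made up for illustration.
min_shaper_rates_enforcement=1
min_dl_shaper_rate_kbps=25000
current_dl_shaper_rate_kbps=200000   # whatever was active on entering sleep

if [ "$min_shaper_rates_enforcement" -eq 1 ]; then
    # case a): drop to the configured minimum on entering sleep
    sleep_dl_shaper_rate_kbps=$min_dl_shaper_rate_kbps
else
    # case b): freeze whatever rate was active -- possibly the maximum
    sleep_dl_shaper_rate_kbps=$current_dl_shaper_rate_kbps
fi
echo "shaper rate during sleep: ${sleep_dl_shaper_rate_kbps} kbps"
```

With min_shaper_rates_enforcement=1 this prints the configured minimum (25000 kbps here); with 0 it prints whatever rate happened to be active.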

Could you set the following in your config file, please:

log_DEBUG_messages_to_syslog=1	# enable (1) or disable (0) logging of all DEBUG records into the system log. 

Then you can use:

logread | grep autorate > autorate.syslog.txt

to capture all DEBUG and ERROR messages into a file, which might help in figuring out which processes (with which process IDs) should be running, so we can check whether these still run and what their status is.

In the past we got as far as seeing that the termination of maintain_pingers_pid was somehow stuck in a wait because the termination of monitor_reflector_responses_fping() was somehow blocked, but we could not get any deeper with this.

@Lynx, the re-occurrence of deep sleep seems to indicate that the timeout on read might not actually solve the blocking read (or our diagnosis might be wrong). It also suggests that the issue is not likely to be caused by an error in concurrent_read_integer, as the new version should be robust against odd inputs...

What again was the command to manually write something to the fping fifo to get things unstuck?

Because I think the next logical debugging step would be to see whether in deep sleep tickling the fifo/read_loop might get things going again?


@gba this is with the latest code from 'testing'?

Just e.g.:

echo "anything" > ${run_path}/pinger_0_fifo

In any case, the 1.2.0 release can wait until we've resolved this. There is presumably a fault in the logic somewhere. I wondered before about the pause in maintain reflectors, but that still involves checking for termination, so it is not that.

Oh, and another thing: what if rx_bytes is reset on an interface reset? Perhaps that could produce e.g. negative loads, so that we never wake because the achieved rates are less than what they need to be? We don't check for anything like this in the rate monitoring code.
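A minimal guard against such a counter reset could look like the following sketch (variable names and values are made up for illustration; this is not the shipped code):

```shell
# Sketch of a counter-reset guard for the achieved-rate calculation.
# If the interface byte counter went backwards (e.g. after an interface
# reset), the delta would be negative; detect that and discard the sample
# instead of computing a bogus (negative or wrapped) load.
prev_rx_bytes=500000
cur_rx_bytes=1200      # counter restarted near zero after an interface reset
interval_ms=200

if [ "$cur_rx_bytes" -lt "$prev_rx_bytes" ]; then
    # counter reset detected: treat this interval as carrying no data
    rx_delta_bytes=0
else
    rx_delta_bytes=$((cur_rx_bytes - prev_rx_bytes))
fi

# bytes * 8 bits over interval_ms milliseconds gives kbit/s directly
rx_load_kbps=$(( rx_delta_bytes * 8 / interval_ms ))
echo "achieved rx rate: ${rx_load_kbps} kbps"
```

One could also remember cur_rx_bytes as the new baseline in the reset branch, so the very next interval measures correctly again.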

For whatever reason, this doesn't manifest on my RT3200.

My testing indicates that signal handlers still run while the read is waiting for data in the FIFO. Perhaps we can put the echo that unblocks the read into the signal handler, subject to a check that the FIFO exists? Maybe even remove the FIFO in the signal handler then, too?
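A minimal sketch of that idea, assuming a bash reader that holds the FIFO open read/write (the pinger_0_fifo name follows this thread; everything else is illustrative, not the shipped code):

```shell
#!/bin/bash
# Sketch: a TERM handler writes a sentinel into the pinger FIFO so a
# reader blocked on that FIFO can wake up and exit cleanly.
run_path=$(mktemp -d)
mkfifo "${run_path}/pinger_0_fifo"

terminate() {
    # only tickle the FIFO if it still exists
    if [ -p "${run_path}/pinger_0_fifo" ]; then
        echo "TERMINATE" > "${run_path}/pinger_0_fifo"
    fi
}
trap terminate TERM

# hold the FIFO open read/write so writers (including the handler) never block
exec 3<> "${run_path}/pinger_0_fifo"

# simulate an external termination request arriving while we are blocked
( sleep 0.2; kill -TERM $$ ) &

while IFS= read -r line <&3; do
    if [ "$line" = "TERMINATE" ]; then
        break
    fi
done
echo "reader unblocked"

exec 3<&-
rm -rf "$run_path"
```

Whether bash resumes the read (delivering the sentinel) or lets it return with a >128 status after the trap, the loop ends either way, which is the point of the exercise.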


But would we not see this in the LOAD records?

I guess if we are certain that the read timeout does not solve the issue, then we might want to reconsider the approach of sending a termination request to pinger_0_fifo after the actual pinger has been killed and waited on. With all the pattern matching we already do in one of our hottest loops, adding a single additional value comparison should not kill us...

Yes, either the latest commit or one of the latest. One of the ones on January 20.

I changed the config file and will capture the log lines and let you know.

So the read timeout is not effective at unblocking in the context of a FIFO held open for read/write?

In fairness, we don't know for sure where it's blocking for @gba. @gba, if you could install strace, you can see exactly which system call each process is presently stuck in.

root@OpenWrt:~/cake-autorate# strace -y -p 18903
strace: Process 18903 attached
read(0</tmp/run/cake-autorate/primary/pinger_0_fifo>(deleted),

@Lynx please note that the result of your command may be misleading. It will show where in the code bash is sitting at the time you look at it, which may or may not be the same place where it misbehaved first. I think we saw this already earlier in this thread, where bash "successfully" completed the timeout processing while we wanted it to be interrupted, and then hung in a different read, where we didn't want it to go at all.


Good point. I don't suppose you'd have time to try testing and see if you see this or any other issue? It has been running for days without issue on my RT3200, but something for sure seems not quite right at the moment. Perhaps there's another sticking point we've not tracked down.

Maybe post both the config files for the lua and for the bash version here?
The lua version uses ICMP timestamps to measure congestion for both directions independently; the bash version still uses RTT data, so it cannot differentiate which direction is congested (changing that is planned for the next development cycle after we release version 1.2).

Also both versions have different CPU costs so you might run into CPU limits for the bash version.
But my first guess would be that your cake-autorate config does not match your link's characteristics well and needs tuning.

Thanks for your reply. I deleted my post because it turned out to be a fluke - lua is not running at all as I'm on a SNAPSHOT release with lualanes - 3.13.0-2. So, in the end I was comparing bash autorate vs sqm.

I'll keep an eye on the bandwidth once I figure out how to patch sqm-autorate, thanks.

We probably should still try to optimize your cake-autorate config :wink:

my connection is:
download burst: 900Mbit
download max: 650Mbit
download most of the time: 400Mbit

upload burst: 40Mbit
upload max: 22.5Mbit
upload most of the time: 22.5Mbit

EDIT:
I set bandwidth settings to:
dl min/base/max:250000/400000/900000
up min/base/max:18000/22500/40000

EDIT:
question1: as I mentioned, my connection is allowed to burst for about three to five seconds. Which parameters should I tweak to increase sensitivity and rate of change for allowed bandwidth up and down?

question2: what is achieved rate vs shaper rate? How often is rate calculated? Does 0.9 value mean 10% reduction?

achieved_rate_adjust_down_bufferbloat=0.9 # how rapidly to reduce achieved rate upon detection of bufferbloat
shaper_rate_adjust_down_bufferbloat=0.9   # how rapidly to reduce shaper rate upon detection of bufferbloat
shaper_rate_adjust_up_load_high=1.01      # how rapidly to increase shaper rate upon high load detected
shaper_rate_adjust_down_load_low=0.9      # how rapidly to return down to base shaper rate upon idle or low load detected
shaper_rate_adjust_up_load_low=1.01       # how rapidly to return up to base shaper rate upon idle or low load detected

Makes sense. I probably would set the minimum for upload a bit lower to allow for a deeper reduction under congestion, but if it works like this, there is nothing to worry about.

We do not model that kind of burst, so you need to use the generic rate_adjust factors for that: the further above 1, the larger the increase per step up; the further below 1, the larger the reduction per step down. But you need to trade these off between capturing your bursts well and not overshooting too hard under normal, non-power-boost conditions...
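To get a rough feel for the trade-off, here is a back-of-the-envelope calculation. The one-step-per-200-ms cadence is an assumption for illustration; the real update rate depends on load classification and reflector responses:

```shell
# Back-of-the-envelope: how many multiplicative steps of 1.01 it takes
# to climb from the 400 Mbit/s base rate to the 900 Mbit/s burst ceiling,
# and roughly how long that is at one step per 200 ms sample (assumed).
awk 'BEGIN {
    steps = log(900/400) / log(1.01);
    printf "steps: %d, approx time: %.1f s\n", steps, steps * 0.2;
}'
```

So with the default 1.01 step-up factor the shaper would need on the order of 16 seconds of sustained high load to reach the burst ceiling, which is longer than a three-to-five-second burst window; a larger factor would get there faster at the cost of harder overshoot on the non-boosted link.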

The achieved rate is measured by the code over:

monitor_achieved_rates_interval_ms=200 # (milliseconds) 

milliseconds (by default 200 ms); the shaper rate is the rate the code configures into the running cake instances.
For the download direction we can actually use the achieved throughput during bufferbloat epochs as a measure of how much traffic can pass the bottleneck, and use that to reduce the shaper rate. Going to 100% of the last achieved rate is likely still too high, so we allow a reduction factor; we also allow a reduction factor based on the current shaper setting.
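One plausible reading of how the two reduction factors interact, as an illustrative sketch rather than the verbatim cake-autorate code (values are made up; integer arithmetic uses percent to stay shell-friendly):

```shell
# Illustrative bufferbloat step-down combining both factors: scale the
# last achieved rate and the current shaper rate by their respective
# reduction factors and take the smaller of the two results.
achieved_rate_kbps=180000   # measured throughput during the bufferbloat epoch
shaper_rate_kbps=200000     # currently configured cake rate
achieved_factor_pct=90      # achieved_rate_adjust_down_bufferbloat=0.9
shaper_factor_pct=90        # shaper_rate_adjust_down_bufferbloat=0.9

from_achieved=$(( achieved_rate_kbps * achieved_factor_pct / 100 ))
from_shaper=$(( shaper_rate_kbps * shaper_factor_pct / 100 ))

if [ "$from_achieved" -lt "$from_shaper" ]; then
    new_rate_kbps=$from_achieved
else
    new_rate_kbps=$from_shaper
fi
echo "new shaper rate: ${new_rate_kbps} kbps"
```

So with a 0.9 factor the answer to the earlier question is yes: each application of the factor is roughly a 10% reduction relative to the rate it is applied to.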

But you really need to take some care in configuring:

# delay threshold in ms is the extent of OWD increase to classify as a delay
# these are automatically adjusted based on maximum on the wire packet size
# (adjustment significant at sub 12Mbit/s rates, else negligible)  
dl_delay_thr_ms=30 # (milliseconds)
ul_delay_thr_ms=30 # (milliseconds)

These 30/30 ms work okay for @Lynx's LTE link; on my VDSL2 link I use 7.5/7.5 ms, so they will likely need to be adjusted for your link as well.
Our controller will (by default; most of these numbers can be configured) measure 6 reflectors and maintain the last 6 samples, and if 3 of these samples show a (baseline-corrected*) delta-delay value above dl_delay_thr_ms or ul_delay_thr_ms, bufferbloat is declared for that direction (currently these are identical, but this will change once we integrate OWDs; the code is already prepared for that).
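The 3-of-6 window logic can be sketched like this (the sample values are made up; the real code classifies each reflector sample against the threshold as it arrives):

```shell
# Sketch of the windowed delay detection: keep the verdicts for the last
# 6 samples (1 = delta-delay above threshold, 0 = below) and declare
# bufferbloat when at least 3 of them are set.
window="0 1 1 0 1 0"   # most recent 6 sample verdicts, made up
detection_thr=3

delayed=0
for verdict in $window; do
    delayed=$(( delayed + verdict ))
done

if [ "$delayed" -ge "$detection_thr" ]; then
    echo "bufferbloat detected"
else
    echo "ok"
fi
```

Requiring 3 out of 6 rather than a single delayed sample makes the detection robust against one-off latency spikes from an individual reflector.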

I would recommend you try the "testing" branch, as that will become the new stable branch soon and already has many substantial changes compared to main. You could e.g. use the logging feature to record a log file during, say, 15 minutes of quiet network and post it, so we can figure out a suitable starting point for the thresholds...

*) Different reflectors will have a wildly differing RTT to your router, depending mostly on the "network distance" and the resulting propagation delay**). Interestingly, the absolute amount of RTT change under congestion is pretty similar even for wildly different reflectors, so we maintain a running average of the minimum RTT for each reflector and subtract that baseline from each sample. That leaves only the delta-delay values for each reflector, and as these tend to be in the same order of magnitude, we get away with a single delta-delay threshold setting per direction.

**) Not only but let's not complicate things any further


Thanks for the tip!

Changing to 5 ms made a huge difference for upload. It's as good as it can be, I think.
Download latency is ok-ish, but the bandwidth is 200M too little; I will have to work on that.

@moeller0, is a special case for burst warranted, in which the sampling frequency is temporarily increased? Like a sort of steroids mode? Hitherto we have kept the frequency static rather than dynamic.

Also, does changing the cake rate extremely rapidly have any potentially negative effects, such that at very high reflector rates we should skip a few beats in terms of cake rate changes?

@juliank you could consider trying the perl version linked at the top of this thread as well. Not sure how it would fare under this burst scenario.