CAKE w/ Adaptive Bandwidth [August 2022 to March 2024]

The omnia actually allows to use a SSD (vie its 2 mini PCIe slots) if the eMMC is truly dead, I just want to push this off as long as possible (the omnia uses btrfs, and the snapshot feature is pretty sweet even for a router). I already connected a spinning rust USB harddisk and direct loging towards that (I like hard disks in that they effortlessly allow r/w filesystems and compared to most routers' internal storage they are not that slow either).
But I think that installing stuff or running git in /root is not the best way forward on this device.

According to the posted output you use squashfas and ubifs so no f2fs.

Ah interesting. Should I be switching to f2fs?

Maybe not given post above.

For core OpenWrt I personally would stick to the defaults, I would however consider F2FS for USB sticks and SD cards and the like (unless data exchange with other OSs without F2FS support is absolutely unavoidable).

@patrakov I put in the change about resuming from sleep.

        # wait until load increases again
        while true
        do
                t_start_us=${EPOCHREALTIME/./}
                get_loads

                if ((10#${dl_achieved_rate_kbps}>$connection_active_thr_kbps || 10#${ul_achieved_rate_kbps}>$connection_active_thr_kbps)); then
                        (($debug)) && log_msg "DEBUG" "dl load percent: $dl_load_percent or ul load percent: $ul_load_percent exceeded medium load threshold percent: ${medium_load_thr_percent}. Resuming normal
                        break
                fi
                sleep_remaining_tick_time $t_start_us $reflector_ping_interval_us
        done

The 10#$ is to deal with very rare event that there is a leading zero. Is there a way in bash to make this global for all operations?

I don't know. And I don't have time to test - completely swamped with meetings, emails, and also the need to make a presentation before Thursday :frowning:

Here are the return to minimum and reflector randomization toggles added:

And in previous commit I added the missing comments you provided @patrakov.

I think that's everything now. Let me know if I've missed anything.

If anyone could test the testing branch code all still works with all these changes that'd be super helpful.

Started the new code for testing.

Minor nit:
I changed the launcher from
cake_instances=(/root/cake-autorate/cake-autorate_config*sh)
to
cake_instances=(/root/cake-autorate/cake-autorate_config.*.sh)
since I think we should enforce an instance_id, like in my case:
cake-autorate_config.vdsl2.sh

To harvest the log file(s) I now use:

ssh root@192.168.42.1 'kill -USR1 $( cat /var/run/cake-autorate/*/maintain_log_file_pid )'  && sleep 30 && scp  root@192.168.42.1:/var/log/cake-autorate*.log.gz ./SCRATCH/ && ssh root@192.168.42.1 'rm /var/log/cake-autorate*.log.gz'

Here an example withe the new code, seems to work as expected...

1 Like

Hello!

Fyi, the log parser is returning a no valid data warning message, but I believe data is there.

The log file presenting the issue is cake-autorate_2022-12-06_16_56_32.log.

Hi @gadolf that log file has no latency/delay data at all. But the load/traffic data should still be plotted. Will need to fix this...

Mmmh, got it to plot something, but the only thing we have is literally achieved up- and download, which is IMHO not worth the hassle....

1 Like

The smaller the binary, the higher the likelyhood it will stay in L2 cache.

I wish far, far more tools used busybox like methods, unfortunately we're in an era of virtual machines and docker containers all slowing our worlds down.

2 Likes

What's your rationale for this? I thought if we just have cake-autorate_config.sh then we can just have empty instance ID and the code writes our to /var/run/cake-autorate and /var/log/cake-autorate.log. And that instance ID matters more for multiple instances.

One reason to force instance identifier might be to ensure we don't have one blank and then one not blank, in which case the cleanup of the blank one might wipe out the non-blank one.

Since the identifier results in subdirectory as per:

So would that mean that if user tries to run with cake-autorate_config.sh we refuse to run and explain that an instance identifier must be signalled with cake-autorate_config.X.sh?

Should we require user to also put in matching identifier in the config itself as a kind of sanity check?

Then again both steps create more hoops for the user to have to jump through to get cake-autorate to work.

By the way, thanks for showing the graph - looks good. Do you think setting initial baseline to 1s and delta ewma to 1.5s makes sense? We could also set delta ewma to 0 but I thought 1.5 might work better as the baseline smooths out to reasonable value. I think you've called for waiting and setting baseline based on real values before. We could also do that if you think it's worth it.

Given this graph

Am I right to conclude that the circled reflectors should be removed from the reflectors list?

1 Like

BTW I just identified and fixed an issue with monitoring process termination - detailed explanation here:

I noticed that I sometimes had a few leftover processes, and I think this was the culprit. It probably surfaced because at the end of the monitor processing we now write out all baselines and delta ewmas for every reflector, and this did not always complete before a new monitoring process was created, resulting in issues with empty baselines / delta ewmas, or even worse, stray monitoring processes.

So now we retain monitoring process pids and wait for the monitoring processes to end (upon deletion of the fifo, which ends the while/read loop, and begins the baseline and delta ewma write outs) on any pinger kill.

This also relates to my query above about 1.0s baseline and 1.5s delta ewma. This aspect could need some attention, or perhaps it is already fine now.

At the moment on any monitor process termination (sleep/stall/reflector rotation) the baselines / delta ewmas are written to a file and then read back in on new monitor process creation for any common reflectors. I think that makes sense, but if not I could remove the write/read lines and instead adopt a new way to initialize the baselines / delta ewmas.

1 Like

Mostly selfish, then my "how to get autorate logs to your analysis machine' instructions work generically:

ssh root@192.168.42.1 'kill -USR1 $( cat /var/run/cake-autorate/*/maintain_log_file_pid )'  && sleep 30 && scp  root@192.168.42.1:/var/log/cake-autorate*.log.gz ./SCRATCH/ && ssh root@192.168.42.1 'rm /var/log/cake-autorate*.log.gz'

Note, that this expects /var/run/cake-autorate/*/ subdirectories if these are missing the instructions need to be changed...

Yes, but that means we need two instructions how to send the signal to nicely package the compressed log one for empty and one for filled instance_id....

Sure it is only really needed for multiple instances, but in a sense "single instance" is just one possibility in the set "permissive multiples" :wink:

Yes, pretty much that is what I would do. Most users however would never change the supplied cake-autorate_config.single.sh or whatever we name it... The name is already special and users can not change it arbitrarily...

I would not bother with this, as part of the filename this should be onbious enough...

Not at all, we simply supply an appropriately named config file just as we do now?

I really think either collect a few samples per reflector before starting to report OR just initialize baseline to the first value measured and let the EWMA handle the rest, these large transients when coming from 0 or even more 1.5 are probematic for autpmatic scaling of the delay data Y-axis.

OK - but we terminate with error if the user tries on cake-autorate_config.sh?

I would also store the timestamp together with the baseline values and do something like if the stored baseline is older than 4 hours (courtesy of "thin air") re-initialize with the newest sample, otherwise inherit the old value...

How about then:

# ======= Start of the Main Routine ========

trap ":" USR1

# cake-autorate first argument is config file path
if [[ ! -z $1 ]]; then
        config_path=$1
else
        config_path=/root/cake-autorate/cake-autorate_config.sh
fi

[[ ! -f "$config_path" ]] && { log_msg_bypass_fifo "ERROR" "No config file found. Exiting now."; exit; }
. $config_path
[[ $config_file_check != "cake-autorate" ]] && { log_msg_bypass_fifo "ERROR" "Config file error. Please check config file entries."; exit; }

if [[ $config_path =~ cake-autorate_config\.(.*)\.sh ]]; then
        instance_id=${BASH_REMATCH[1]}
        run_path=/var/run/cake-autorate/$instance_id
else
        log_msg_bypass_fifo "ERROR" "Instance identifier X set by cake-autorate_config.X.sh cannot be empty. Exiting now."
fi

if [[ ! -z "$log_file_path_override" ]]; then
        if [[ ! -d $log_file_path_override ]]; then
                broken_log_file_path_override=$log_file_path_override
                log_file_path=/var/log/cake-autorate${instance_id:+.${instance_id}}.log
                log_msg_bypass_fifo "ERROR" "Log file path override: '$broken_log_file_path_override' does not exist. Exiting now."
                exit
        fi
        log_file_path=${log_file_path_override}/cake-autorate${instance_id:+.${instance_id}}.log
else
        log_file_path=/var/log/cake-autorate${instance_id:+.${instance_id}}.log
fi

The relevant portion being:

if [[ $config_path =~ cake-autorate_config\.(.*)\.sh ]]; then
        instance_id=${BASH_REMATCH[1]}
        run_path=/var/run/cake-autorate/$instance_id
else
        log_msg_bypass_fifo "ERROR" "Instance identifier X set by cake-autorate_config.X.sh cannot be empty. Exiting now."
fi

That is a policy question you need to answer yourself.... But subjectively I agree, using reflectors with > 100ms larger RTT than other reflectors seems sub optimal, and you seem to have sufficient reflectors left....
The RTT to/from each reflector accumulates the total queueing delay over the whole network path, but what we are interested in is the queueing delay over a specific link (typically the internet access link), so we employ multiple reflectors partially so that hopefully they mostly share the access link as common path and then take a vote over all reflectors to decide whether we encounter congestion over the access link (that is if the path to a some reflectors is congested somewhere upstream of our access link and return above threshold delay data, but other reflectors stay below the threshold we can differentiate coursely between 'near' and 'far' congestion).

I really like the clarity of forcing all instances to have an ID string... that will make it relatively intuitive to add more instances later. But I do not feel deeply about that as I can just supply an instance_id for myself. But supporting both options results in more complexity in the code and code-paths that are more and paths that are less well tested...

How much value is there in reading/writing from the baseline and delta ewma files? I'm struggling to figure out the best way to handle this.

I just started with this:

        # Read in baselines if they exist, else just set them to 1s (rapidly converges downwards on new RTTs)
        for (( reflector=0; reflector<$no_reflectors; reflector++ ))
        do
                if [[ -f $run_path/reflector_${reflectors[$reflector]//./-}_baseline_us ]]; then
                        read reflector_last_timestamp_us < $run_path/reflector_${reflectors[$reflector]//./-}_last_timestamp_us
                        if (($reflector_last_timestamp_us < ($t_start_us+4*60*60*1000000))); then
                                read rtt_baselines_us[${reflectors[$reflector]}] < $run_path/reflector_${reflectors[$reflector]//./-}_baseline_us

                else
                        rtt_baselines_us[${reflectors[$reflector]}]=??
                fi

But it's looking a bit fussy.

I want to avoid if check inside the tight while read loop:

while read timestamp reflector _ seq_rtt
        do
                t_start_us=${EPOCHREALTIME/./}

                [[ $seq_rtt =~ \[([0-9]+)\].*[[:space:]]([0-9]+)\.?([0-9]+)?[[:space:]]ms ]] || continue

                seq=${BASH_REMATCH[1]}

                rtt_us=${BASH_REMATCH[3]}000
                rtt_us=$((${BASH_REMATCH[2]}000+10#${rtt_us:0:3}))

                alpha=$(( (( $rtt_us >= ${rtt_baselines_us[$reflector]} )) ? $alpha_baseline_increase : $alpha_baseline_decrease ))

                ewma_iteration $rtt_us $alpha rtt_baselines_us[$reflector]

                rtt_delta_us=$(( $rtt_us-${rtt_baselines_us[$reflector]} ))

                ewma_iteration $rtt_delta_us $alpha_delta_ewma rtt_delta_ewmas_us[$reflector]

That is, ideally the baselines and delta ewmas have been initialized in advance somehow.

Any ideas?