CAKE w/ Adaptive Bandwidth [August 2022 to March 2024]

I'm working on a new way to share variables between processes:

#!/bin/bash

var_store()
{
        while read -r -u "${var_store_input_fd}" command arg1 arg2
        do
                echo "${command} ${arg1} ${arg2}"
                case ${command} in

                        WRITE)
                                declare -g "${arg1}=${arg2}"
                                ;;
                        READ)
                                printf '%s\n' "${!arg1}" > "${arg2}"
                                ;;
                                ;;
                esac

        done
}

var_store_write()
{
        local var=${1}
        local val=${2}

        printf 'WRITE %s %s\n' "${var}" "${val}" >&"${var_store_input_fd}"
}

var_store_read()
{
        local -n var=${1}

        printf 'READ %s var_store_output_fifo\n' "${!var}" >&"${var_store_input_fd}"
        read -r var < var_store_output_fifo
}

var_store_link()
{
        mkfifo var_store_output_fifo
        # hold the FIFO open read-write so reads and writes do not block on EOF
        exec <> var_store_output_fifo
}

var_store_unlink()
{
        [[ -p var_store_output_fifo ]] && rm var_store_output_fifo
}

process_1()
{
        var_store_link

        var_store_write x 100

        var_store_read x

        echo ${x}

        var_store_unlink
}

# open a bidirectional fd to serve as the var_store input channel
exec {var_store_input_fd}<> <(:) || true

var_store &

process_1

This should be faster than writing out to temporary files and reading/rereading from those temporary files.

Any thoughts on how I might improve?

My plan is that each process creates a FIFO and reads from it, so all information for that process should be sent as a "message". This will require some playing around with how to parse the different record types, but all in all that looks like the cleanest design:

  • use exported variables from the calling script (or positional arguments) to distribute invariant variables to child processes
  • use a FIFO for everybody else to send specific information to any process.

I have just about completed the shift over to this new methodology and CPU usage does indeed look significantly lower (not that CPU usage was ever a problem on my RT3200).

It only recently dawned on me that we could have one variable store FIFO to write values to variable store and then individual bridge FIFOs for each process for reading values from global store and sending to each process. It seems to work.

I do not think that trying to turn all/most variables into global variables is actually required... we really just seem to want to shave the cost down of getting per reflector information to maintain_pingers(), so I will go and just write them into maintain_pingers FIFO, problem solved :wink: (I want more generic parsing anyway that can differentiate between different commands, so shutdown requests can be sent the same way, and having one FIFO per process seems nicely symmetric...)
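To make the parsing idea concrete, here is a minimal sketch of a per-process FIFO servicing loop that differentiates between record types, so variable updates and shutdown requests can share the same channel. The record names (SET_VAR/TERMINATE) and the variable name are illustrative, not from the repo; the sketch feeds records via a heredoc instead of a FIFO:

```shell
#!/bin/bash

# Sketch: one servicing loop per process, with a record-type prefix so
# variable updates and control commands travel over the same FIFO.
service_input()
{
        local record_type var_name var_val
        while read -r record_type var_name var_val
        do
                case ${record_type} in
                        SET_VAR)
                                # assign the named variable in the global scope
                                printf -v "${var_name}" '%s' "${var_val}"
                                ;;
                        TERMINATE)
                                terminated=1
                                return
                                ;;
                esac
        done
}

terminated=0

# records would normally arrive on the process's FIFO; a heredoc stands in here
service_input <<'EOF'
SET_VAR dl_shaper_rate_kbps 25000
TERMINATE
EOF

echo "dl_shaper_rate_kbps=${dl_shaper_rate_kbps} terminated=${terminated}"
```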

I'm only proposing this for those shared between processes. There are more pathways though like from the achieved rates monitor. So I think a variable store that is accessible from any process is helpful.

For some of the Lua autorate crowd, @lochnair, @_FailSafe, @CharlesJC and whoever else is interested, has anyone looked at the nim language https://nim-lang.org/ ? It compiles to C which should make it able to produce binaries for any OpenWrt platform and it seems like it's the right mix of high level language yet capable of doing the sort of stuff needed for nimble embedded purposes. Just a thought.


Hi all. First off, thanks for all your work on this project. I don't know if this is the right place to ask questions about the project, so let me know if it's not. I'm wondering if the issue with Starlink filtering out ICMP timestamp response packets has been resolved yet. And is this something I could install onto an EdgeRouter X? I am new to bufferbloat and networking in general, but looking for a way to increase latency stability on Starlink brought me to you guys. Thanks again.

Perhaps @gba can answer in respect of Starlink, but in respect of cake-autorate I'm personally still working with ordinary ICMPs via fping on my own 4G connection. This actually seems to work very well - I've had this running for months for day-to-day personal and business use. I understand that it works reasonably well for Starlink connections too and we even tried to compensate for Starlink satellite switching, but it's been a while since we have seen active testing of that.

You can try out cake-autorate on your Starlink connection using the code on the master branch of cake-autorate.

I have implemented support for timestamp ICMPs via tsping, but that is untested and probably needs some more work.

I have been very distracted by trying to make a variable store in bash using FIFOs as a replacement for temporary files for inter-process variables, but it is looking like this will not work out because bash uses byte-by-byte reads when reading from a named pipe, which kills performance.

So here is a question: we converged on using FIFOs for passing the delay records to the main loop, didn't we do so after checking whether passing by standard file would be slower....
I am btw not convinced that reading byte by byte is itself a showstopper; after all, to split the records, read needs to do that anyway....

P.S.: Even if not for efficiency, FIFOs solve the concurrency/atomicity problem with multiple writers (if multiple pingers are used) in a pretty elegant way, so might still be a good enough solution for the delay records... also for the delay records we already multiplex a set of variables into each record...


Ah that is immensely helpful - I hadn't appreciated that, and it reminds me how much I depend on such insights. Does that mean that for streams with log lines and reflector data lines there would be no benefit in forcing fixed-length reads, reading with "-N" and then re-parsing the read variable to split it up into the individual records?

It seems that a huge benefit of reading from temporary files that I hadn't appreciated before is that lseek() is available. With a FIFO the data is consumed as it is read and any next character could be the new line to trigger a given read return, so reads are byte-by-byte.
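A quick illustration of the difference, with made-up record contents: a regular file can be reread at will (each redirection reopens it at offset zero), whereas a FIFO yields its data only once and then there is nothing left to read:

```shell
#!/bin/bash

# Regular file: the same record can be reread because each redirection
# reopens the file at offset 0.
tmpfile=$(mktemp)
printf 'OWD 42000\n' > "${tmpfile}"
read -r first < "${tmpfile}"
read -r second < "${tmpfile}"   # the reread sees the same record again
rm -f "${tmpfile}"

# FIFO: the record is consumed on read; a second read finds nothing.
fifo=$(mktemp -u)
mkfifo "${fifo}"
printf 'OWD 42000\n' > "${fifo}" &   # writer blocks until a reader opens
exec {fd}< "${fifo}"
read -r -u "${fd}" third             # consumes the only record
read -r -t 0.1 -u "${fd}" fourth || fourth="(nothing left)"
exec {fd}<&-
rm -f "${fifo}"

echo "file: '${first}' then '${second}'"
echo "fifo: '${third}' then '${fourth}'"
```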

But yes, FIFOs give benefits not so readily available with temporary files including those you mention and also the consumption of data on read, which is also helpful for our needs.

I'm hazy since that was so long ago now.

Yes it's definitely all good enough since what we've been doing works just fine on my RT3200 and can even work on @richb-hanover-priv's Archer C7.

Nevertheless I thought that compared to simple file writes and reads/rereads my var_store mechanism would be an improvement.

My idea was that individual processes write/read by sending write/read command to var_store FIFO, and a new var_store process reads from the var_store FIFO and sends write confirmation or read data back to each individual process FIFO, but this turned out to be very slow (circa max 50% CPU) compared to just writing/reading temp files (with rereading as necessary) for those variables (circa max 20% CPU).

I assumed that this byte-by-byte read would be the cause of this jump, but perhaps it's something else? Like needing to read at both ends for any read/write and separate process churning away in the background? I'd be really interested to know what you think.

I really do not know, that would require testing. After all, we are interested in the behaviour of the existing bash implementation and not just theoretical musings :wink:

Yes, but our records are small anyway... so I do not think this should be too big a problem?

Yes that solves the costly truncate issue; however, that can be solved differently as well, if e.g. the writer writes a new file every time and the consumer deletes that file (which could be done as a background job), but this would require considerably more complexity. As much as I was hesitant initially, I think using FIFOs was/is a decent design decision, I would probably not revisit right now...

Not sure either, could be that the main argument was the simplicity of the solution (keeping the code maintainable has value as well).

And it still might be once you test "batched writes" (I did not check whether you tested that already...)

Again I think we really only need to send information from:
a) pinger binary to parse_pinger(), here the FIFO seems working well
b) parse_pinger() to main_loop, again here FIFO seems working well
c) parse_pinger() to maintain_pingers(), here we currently use per-value file writes

I think for c) either batched file writes/reads or batched FIFO writes/reads should be usable.
I would try the FIFO route first (simply because that would establish a nice symmetry, where all/most of our processes maintain a FIFO that can be used to send information their way, unless this turns out to be absolutely detrimental to performance).

In your solution I see:
a) non-batched writes/reads into the var_store FIFO
b) additional writes to the per-process FIFOs

I think, if my analysis is correct (and it likely is not, I have not looked at the code recently) we could reduce this to a single FIFO write/read of a full record of the per-reflector state/history information that maintain_pingers() needs, no?


There are a few different routes, as exposed by this trial commit for my var_store concept here:

So there are multiple different processes that need to read in data written by different processes.

The point of the var_store was:

  1. to remove the need for the rereads on the basis that writing and reading is now much more disciplined - any write waits for confirmation and any read waits for data and FIFOs ensure atomicity for data smaller than PIPE_BUF (4096 bytes); and

  2. to offer increased performance compared to constantly truncating, writing and opening temporary files.
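For reference, the PIPE_BUF limit mentioned above can be queried on the target system; POSIX guarantees at least 512 bytes, and Linux normally uses 4096:

```shell
#!/bin/bash

# Query the atomic-write limit for pipes/FIFOs on the filesystem that
# would host them (writes up to this size are not interleaved).
pipe_buf=$(getconf PIPE_BUF /tmp)
echo "PIPE_BUF=${pipe_buf}"
```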

But in my testing whilst 1) is satisfied, 2) is definitely not since CPU load actually massively increased.

I'd love to know exactly why and whether there might be a way to make it work. I'm wondering if it's the processing overhead needed to have reads at both ends for every write/read since any write waits for write confirmation and any read waits for the data, but I'm really not sure.

If anyone reading this (@colo, @patrakov) can offer any insight into why this commit actually slows things down rather than speeds things up, I'd really like to know. The var_store is introduced here:

Else if we do just abandon this var_store concept then I can certainly try to group together some of the temporary file writes and reads by reflector and also to group together download and upload pairs.

Which ones exactly, if I might ask?

I think:

I've linked in examples for each.

I believe one or two variables are read in by multiple processes.

Regarding the first, doesn't the main loop see all reflector responses anyway? In that case it could just maintain this information itself without needing to read a file/FIFO at all.

Regarding the second, this is conditional on output_load_stats and was added IIRC on my request, but it seems OK to drop the shaper rates from the load records again, if it turns out to be too costly...

Regarding the third: do we really need to maintain and log rtt_delta_ewma_us at all? If I understand correctly we only added this as part of experiment, but ended up sticking to min-base_line, no?

And for maintain pinger we should try the batch approach anyway, no?

Yes but this is for the stall handling in which the main loop waits for increased load or a new reflector timestamp more recent than at the beginning of the stall.

Yes it does seem this could be dropped. I should check to see if that would improve performance. Maybe not by very much.

We also use this for culling bad reflectors. If the delta ewma is too big compared to others we slash that reflector. Same for baseline.

Agreed it seems the biggest performance gain might be had here since this amounts to ten writes (from the writer end at say 20Hz) and ten reads (at the reader end at the much lower maintain pingers frequency).

I think the writes could be grouped by reflector. And what form would we use? Put them as lines into a temp file of the form "X=Y" that is sourced?

Another alternative would be rather than writing at 20Hz we could have maintain pingers signal the monitor(s) that it wants an update of the latest state of the variables and update that way.
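A minimal sketch of that signalling idea, assuming SIGUSR1 and hypothetical names (the monitor's trap handler merely notes that an update was requested; the actual state dump would follow in the main loop):

```shell
#!/bin/bash

# Pull model sketch: maintain_pingers sends a signal, and the monitor's
# trap handler records that an update of the latest state is wanted.
state_dump_requested=0
trap 'state_dump_requested=1' USR1

kill -USR1 $$   # in cake-autorate another process would send this

# bash runs the trap handler between commands, so the flag is set by now
echo "state_dump_requested=${state_dump_requested}"
```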


But it clearly should still service its input FIFO so it should see new timestamps pretty quickly, no?

I like this idea of measuring the cost first before dropping something, but if the cost is noticeable maybe it could at least be made configurable?

Nah, if you write a file at all, write an array to the file and read an array from the file....

Sure, but let's see whether that is still needed after trying the batch approach?

The stall detection and handling is outside the point where its input FIFO is serviced. Say there's a read timeout from the servicing then we get into the stall handling stuff.

Is that because the built-in readarray is very fast? That does mean making a dedicated array within which we stuff the variables and then readarray and extracting the variables again. But maybe that's fine. I think I've read something like this endorsed elsewhere too.
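For concreteness, that array round trip might look something like this (the file name, reflector addresses and field names are made up): stuff the per-reflector values into an array, write the whole batch in one printf, and slurp it back with one readarray before splitting the fields out again:

```shell
#!/bin/bash

tmpfile=$(mktemp)

# Writer side: one printf writes the whole batch, one record per line.
reflector_stats=("1.1.1.1 25000" "8.8.8.8 31000" "9.9.9.9 28000")
printf '%s\n' "${reflector_stats[@]}" > "${tmpfile}"

# Reader side: a single readarray call reads the whole batch back...
readarray -t batch < "${tmpfile}"
rm -f "${tmpfile}"

# ...and the individual fields are extracted afterwards.
for record in "${batch[@]}"
do
        read -r reflector rtt_us <<< "${record}"
        echo "reflector=${reflector} rtt_us=${rtt_us}"
done
```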

Yes, makes sense. Signalling and traps seem to require careful attention.

Any other ideas why my var_store concept doesn't work? I think this has potential as a robust solution but the present massive CPU hit doesn't seem worth it. I just wish I understood why or what I did wrong with that.

Perhaps I should abstract and put both the existing concurrent_read_integer temp file based solution and the newer var_store solution into a script and profile.

Well, it should not be... that is part of my point, we should never "block" hard somewhere. I think we should put attempted reads into the common part of the main loop... (we could reduce the timeout during stalls so we check conditions more often).

Mostly because it is much cleaner/simpler than creating a sourceable bash file to begin with, but sure we should test its speed.

So, my idea is to trap TERM and INT and not do anything for those, but have the main loop issue commands to monitor_pingers (via the FIFO) and have monitor_pingers deal with the parse_pinger processes in turn. And as a big gun run a SIGKILL after a long enough timeout to make sure "we ask softly, but carry a big stick"....

No idea, but I have not looked too closely. My bet is on using a few named record types for inter-process information exchange...

@moeller0 I think I have finally identified an efficient way to propagate variables from one process to another without using temp files in bash that can be integrated into cake-autorate.

  • each process must call 'var_bridge_link' to register itself and get assigned its own proc_var_bridge_fifo
  • variables are sent to a target process by calling 'var_bridge_send_var', which prints variables in the form "x 100" to the proc_var_bridge_fifo associated with the target process
  • each process must periodically call 'var_bridge_get_vars', which reads from its proc_var_bridge_fifo using a tight while read loop with '-t 0.02' and sets the accumulated variables that were sent to the process since the last run
  • when a process terminates, it calls 'var_bridge_unlink' to deregister itself, which removes the proc_var_bridge_fifo for that process
  • the printf calls to send variables use 'set -o noclobber' and thus if a process and its corresponding proc_var_bridge_fifo is down, sends will fail

#!/bin/bash

set -o noclobber

var_bridge_get_vars()
{
        while read -r -t 0.02 var_name var_val
        do
                [[ ${var_name} && ${var_val} ]] || continue
                ((cnt++))
                export -n "${var_name}=${var_val}"
        done < "${proc_var_bridge_fifo}"
}

var_bridge_send_var()
{
        { printf '%s %s\n' "${1}" "${2}" > "${var_bridge_path}/${3}_var_bridge_fifo"; } 2>/dev/null || return 1
        return 0
}

var_bridge_link()
{
        var_bridge_path=${1}
        local proc_name=${2}

        proc_var_bridge_fifo="${var_bridge_path}/${proc_name}_var_bridge_fifo"
        [[ -p ${proc_var_bridge_fifo} ]] && printf '%s\n' "${proc_name} already registered" >&2
        mkfifo "${proc_var_bridge_fifo}"
        exec <> "${proc_var_bridge_fifo}"
}

var_bridge_unlink()
{
        rm -f "${proc_var_bridge_fifo}"
}

process1()
{
        var_bridge_link "/tmp" process1

        cnt=0
        t=0

        for((i=5;i--;))
        do
                echo "started process1 iteration"
                var_bridge_get_vars
                sleep 1
        done

        echo "read ${cnt} variables"
        echo "last read variable: ${t}"
        var_bridge_unlink
}

process1 &

var_bridge_link "/tmp" main

sleep 1

for((i=50000;i--;))
do
        var_bridge_send_var "t" "${EPOCHREALTIME}" "process1"
done

var_bridge_unlink

wait

One possible drawback of this approach compared to temp files is that if process1 sends variables at a higher frequency than process2 reads them in, there is a degree of redundancy in respect of the variables that process2 reads in. For example process2 might read in 'x=2' then later 'x=3' in the same call of 'var_bridge_get_vars'. There might be a clever way round that?
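One possible way round it (a sketch with hypothetical names, not code from cake-autorate) would be to drain the FIFO into an associative array keyed by variable name, so that only the last value sent for each name is actually assigned:

```shell
#!/bin/bash

# Drain accumulated "name value" records, keeping only the most recent
# value per variable name, then perform a single assignment per name.
declare -A latest

var_bridge_drain_dedup()
{
        local var_name var_val
        # records arrive on stdin in this sketch; in the real script this
        # would be the 'read -t' loop over proc_var_bridge_fifo
        while read -r var_name var_val
        do
                [[ ${var_name} && ${var_val} ]] || continue
                latest[${var_name}]=${var_val}   # later records overwrite earlier ones
        done
        for var_name in "${!latest[@]}"
        do
                printf -v "${var_name}" '%s' "${latest[${var_name}]}"
        done
}

# x is sent three times but assigned only once, with its last value
var_bridge_drain_dedup <<'EOF'
x 2
y 7
x 3
EOF

echo "x=${x} y=${y}"
```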

Nevertheless, in my testing with simply sending 50000 variables from one process to another, this is approximately twice as fast as writing and reading/rereading temp files on my RT3200.

Any thoughts?