CAKE w/ Adaptive Bandwidth [October 2021 to September 2022]

Here is the run-up to a different crash with very light traffic. It tries to respawn the bandwidth monitor but apparently doesn't.

So there are presently two crash causes then? The one from reading the rx/tx files (in the log in the link below) and the one identified in the log links in my previous posts above?

Looking at the various output lines it all looks a little random to me. If you look at the lines does it seem right to you?

Here is another run with a download running:

It seems it is time to admit defeat? :frowning:

No, not at all. It really just needs a little more attention to make sure things get respawned. Unfortunately without a testbed it's harder for me to reproduce the tests, but let me look at it over the weekend; we'll see if I can get something robustified by Monday or Tuesday.

Weird, those logs you show don't show the same crash; they've got some crash associated with Erlang itself, in the formatting code... we'll see if I can reproduce that, or maybe work around it by avoiding the use of that code. That might be an issue with Erlang on routers.

Aside from the issues causing the crashes, I wonder if it is problematic to check the bandwidth used every second and make decisions on that depending on whether a random number is less than 1/20. Might it be better to check bandwidth less frequently to get a more reliable indication, or to use a moving average? For example there could be a thread that reads the Rx and Tx counters every second and uses that to keep an updated record of the present load. Then that record could be read to make change decisions. That way a more accurate representation of the present load feeds into the decision-making process.
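
Something along these lines is what I have in mind, just as a rough, untested sketch (the wan interface name and the 1/8 smoothing factor are placeholder choices on my part):

#!/bin/sh
# Rough sketch: sample the wan rx/tx byte counters once per second and keep
# an exponentially weighted moving average of the achieved rates in kbit/s.
# Interface name, smoothing factor (1/8) and output file are placeholders.

rx_file="/sys/class/net/wan/statistics/rx_bytes"
tx_file="/sys/class/net/wan/statistics/tx_bytes"

prev_rx=$(cat $rx_file)
prev_tx=$(cat $tx_file)
avg_rx_rate=0
avg_tx_rate=0

while true
do
        sleep 1
        cur_rx=$(cat $rx_file)
        cur_tx=$(cat $tx_file)
        rx_rate=$(( (cur_rx - prev_rx) * 8 / 1000 ))   # kbit/s over the last second
        tx_rate=$(( (cur_tx - prev_tx) * 8 / 1000 ))
        # integer EWMA: new average = 7/8 old + 1/8 new
        avg_rx_rate=$(( (7 * avg_rx_rate + rx_rate) / 8 ))
        avg_tx_rate=$(( (7 * avg_tx_rate + tx_rate) / 8 ))
        prev_rx=$cur_rx
        prev_tx=$cur_tx
        # leave the smoothed load where the decision logic can read it
        echo "$avg_rx_rate $avg_tx_rate" > /tmp/sqm_load
done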

I still wonder about my original idea:

  • load > 50% AND no delay encountered → increase bandwidth
  • load < 50% OR delay encountered → reduce bandwidth

Here the load would be that record kept as a moving average. Perhaps this is too simplistic. But perhaps it helps.
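
In rough shell terms the download direction could be handled like this (purely illustrative and untested; all the variables below are placeholders that a load monitor and pinger would need to maintain):

delay_detected=0     # placeholder: 0/1 flag that a pinger would maintain
avg_rx_rate=15000    # placeholder: smoothed download rate in kbit/s from the monitor above
cur_dl_rate=20000    # placeholder: current download shaper rate in kbit/s
rate_step=1000       # placeholder: adjustment step in kbit/s

if [ "$delay_detected" -eq 0 ] && [ $(( 2 * avg_rx_rate )) -gt "$cur_dl_rate" ]; then
        cur_dl_rate=$(( cur_dl_rate + rate_step ))   # load > 50% and no delay: increase
else
        cur_dl_rate=$(( cur_dl_rate - rate_step ))   # load < 50% or delay: reduce
fi
echo "new download rate: $cur_dl_rate kbit/s"        # clamping to min/max omitted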

I have also been wondering about keeping track of the unloaded ping (load < 50% of the set minimum) using a moving average, and comparing a moving average of the loaded ping with the former.

Otherwise I am conscious of:

So see e.g.:

When the link goes above 10% utilisation, the pinger is turned on and the latency is actively monitored.
As long as the latency is under the target, and there is more demand for the link, the fair link limit is increased towards 100% of the line rate. If the latency goes above the target, the fair link limit is reduced towards 15% (but no lower).
The algorithm it uses to adjust the bandwidth drops the limit more aggressively depending on how much the latency has exceeded the target. For recovering back to the line rate it steps up in some amount (can't remember).

Does anything there help?

Since that seems to be a tried and tested technique.

@lantis1008 how do you monitor load? Do you retain a moving average or something like that? Over what time period?

Direct calls to TC via tc_util.h and tc_common.h

The problem with a longer running average of bandwidth is that lots of processes have spiky bandwidth usage. Watching a streaming video, for example, will be 50% nothing and 50% fully saturated (or 75% nothing and 25% saturated). This could go on for hours without you detecting more than 50% bandwidth usage, yet that full hour could be terrible quality for a video call.

I actually don't think the algorithm is bad. Changing the bandwidth every 10-30 seconds is no big deal, but having it crash is definitely a problem.

I believe that the issue is related to not explicitly closing the bandwidth files that I open up. I had assumed that Erlang would garbage collect those, but it looks like maybe not. I pushed a 2 line fix to devel that will hopefully avoid this issue. @Lynx


No crash so far.

Does:

if
(((C2 > L2) and (RxBW < C2/2)) or ((C1 > L1) and (TxBW < C1/2))) and (RND < 1.0/20) ->
    self() ! {factor,0.85} ;
true -> true
end,

not mean that it will decay even if download is high (and needing more bandwidth), but there is no ongoing upload? Or am I reading this wrongly?

Does lumping them together in this way make sense? Should download/upload not be handled independently?

In any case, the issue I have now is that it stays around the minimum set download/upload (20 Mbit/s download and 25 Mbit/s upload) even with an active download of an .iso file at a time when I know the connection can handle more - does it do the same on your connection, I wonder?

But just manually setting to 35 Mbit/s download and 30 Mbit/s upload does not seem to result in significant bufferbloat:

Any thoughts?

I don't like giving up, but I am close to just going back to 30 Mbit/s upload and download (and accepting the loss of bandwidth and occasional bufferbloat during heavy congestion).

I really appreciate and am grateful for your work in adapting the script to try to take bandwidth into account. But looking at the output lines and calls to 'tc qdisc change', something doesn't feel right. My gut feeling is that a different approach altogether may be needed. Would that be fair, or is it just a case of further tweaking?

It certainly seems handling variable bandwidth connections well in SQM is not an easy problem.

I am thinking there may be value in going back to basics and trying to think about the simplest possible algorithm and working from there, to see if that helps at all. I can have a go at writing a bash script based on such an algorithm.

How about something along the lines of the following:

  • if load < 50% of the minimum set load then assume no load and update the moving average of the unloaded ping to 8.8.8.8
  • if load > 50% of the minimum set load, acquire a set of sample points by pinging 8.8.8.8 and compute the sample mean
  • measure bufferbloat by subtracting the moving average of the unloaded ping from the sample mean
  • ascertain the load during sample acquisition and make a bandwidth increase or decrease decision based on the determined load and whether bufferbloat was detected

Thoughts?

Before giving up with this, I would really value your thoughts or suggestions @moeller0. If you had to propose the most basic possible universally applicable algorithm, what would you propose?

So, unlike @dlakelan I did not actually try to implement anything like this at all, so I can really just speculate; take my words with caution.

I understand that what you probably desire is to be able to configure a hard minimum and hard maximum shaper rate (per direction) and want to keep the shaper at a minimum, unless the network is actually used and the increased shaper setting does not compromise latency-under-load/bufferbloat mitigation too much. I hope that this is a decent description of your issue?

In that case, I see the need for two measurements, "network load" and "bufferbloat" on which to base the actual shaper decisions.

"network load ratio": easy to measure periodically sample a counter (either from an interface or via tc) and divide the number of bytes by the duration of the sampling interval, and then divide this by the shaper rate for a give direction.

"bufferbloat": trickier, but @dlakelan has found a decent algorithm, sample a X known good ICMP echo reflectors and assess their RTT increase relative to a slowly updated baseline, only assume access link congestion, when Y out of the X RTT sources indicate latency increase. Sampling just a single source like 8.8.8.8 will cause false negatives, if you can live with those set X to one otherwise "voting" seems to be a decent method, especially since access link congestion will result in all RTTs being increased (so you could set Y = X).

Let's assume you generate estimates of these two quantities every 30 seconds or so (TCPs generally take a few RTTs to adjust to changing rates, so there is a lower limit on the interval at which changing the shaper makes sense (especially for ingress)).

Now, I would probably do the following logic (modulo all the bugs I'm sure to introduce):

INIT:

set current_load_sample_time = now
set last_load_sample_time = now
set last_transfer_volume = read transfer counters
set ingress_shaper_rate = read SQM's ingress shaper rate
set egress_shaper_rate = read SQM's egress shaper rate
for all RTT_IPs get the current_average_RTT for 10 samples
	and set last_average_RTT to current_average_RTT

-> save all of these out to /tmp in a file

RUNTIME
1) get timestamp t(start)
2) read in data from file (if the file does not exist or values are missing, run the INIT function and start over)
3) for all RTT_IPs get the current_average_RTT for 10 samples (in parallel)
4) wait for completion and calculate: dRTT = current_average_RTT - last_average_RTT
	update last_average_RTT (as an exponentially weighted moving average to give it some persistence, here alpha = 0.1 but that probably needs tuning):*
		last_average_RTT = (1 - 0.1) * last_average_RTT + 0.1 * current_average_RTT
	if dRTT > bufferbloat_threshold
		-> bufferbloat detected, need to reduce the shaper rate; unless we can differentiate 
		between ingress and egress congestion, we need to reduce both shapers:
		set shaper_rate = max(min_rate, shaper_rate - (max_rate - min_rate)/4)
	
5) get current_transfer_volume
6) get timestamp t(current_load_sample_time)
7) calculate load_ratio per direction: ((current_transfer_volume - last_transfer_volume) / (t(current_load_sample_time) - t(last_load_sample_time))) / (shaper rate)
	set last_load_sample_time = current_load_sample_time

8) IF dRTT < bufferbloat_threshold AND current_load_ratio >= load_threshold
	increase shaper rates: set shaper_rate = shaper_rate + (max_rate - min_rate)/8
	Note: for stability it seems best to increase with a lower increment than to decrease, but the exact numbers probably need tuning

9) IF dRTT < bufferbloat_threshold AND current_load_ratio < load_threshold
	slowly decay to the minimum shaper rate if no load and no bufferbloat detected
		set shaper_rate = max(min_rate, shaper_rate - (max_rate - min_rate)/8)

10) get timestamp t(end), write out all parameters to backing store and sleep for 30 - (t(end) - t(start))


This obviously needs to be integrated with init.d so it gets automatically restarted if accidentally killed. And all of the numbers need tuning.

*) Here the trick is to select alpha such that transiently increased RTTs, when the shaper is set too high, do not significantly pull up the reference RTT in the time it takes this script to adjust the shaper such that the RTT gets under control again; the idea is that this allows the reference RTTs to adjust to path changes. (With alpha = 0.1 and a 30 second interval the baseline roughly averages over the last ~10 samples, i.e. around 5 minutes.)

None of this is tested, and almost all numerical values need to be tuned/selected properly. Shell does not allow floating point arithmetic so a big question is what run time environment to use.
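
(One could of course push the floating point pieces out to bc or awk and leave the rest to the shell, along these lines:)

cur_rate=20000
# truncate 0.9 * cur_rate to an integer with bc...
new_rate=$(echo "scale=0; 0.9 * $cur_rate / 1" | bc)
# ...or do the same with awk
new_rate=$(awk -v r=$cur_rate 'BEGIN { printf "%d", 0.9 * r }')
echo $new_rate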

I guess once all of this is implemented it will look pretty close to @dlakelan's solution, except it will be biased towards the configured minimum rate instead of the maximum (then again, I have not looked too closely into the erlang script -ECANNOTREADERLANGFLUENTLY :wink: so I might be wrong about the similarity between the approaches).


Would autorate-ingress be useful on a wireguard link that behaves like a variable rate link? My VDSL2 internet connection is very stable at 55Mb/12.5Mb but the wireguard connection can vary from 30Mb to about 50Mb on the DL and 9-12Mb on the UL. Was thinking that autorate-ingress would work well for this if it actually worked?

Right now I have layer_cake set up on my wan connection eth0.2 and this works very nicely for devices not using wireguard and gives a bufferbloat score of A+, but I wonder if I should also set up cake on my wireguard connection and run it in parallel with the first instance; is this what people do when they have VPNs? How does one determine the per-packet overhead? My cake instance on the wan is set right now to 34 for packet overhead.

Ideally by piping a packet through an otherwise quiescent wireguard link and taking packet captures before and after wireguard has done its thing.... I believe you will find wireguard adds 60 or 80 bytes of overhead depending on whether it runs over IPv4 or IPv6....

ping -l 1234 1.1.1.1

Pinging 1.1.1.1 with 1234 bytes of data:
Reply from 1.1.1.1: bytes=1234 time=57ms TTL=59
Reply from 1.1.1.1: bytes=1234 time=68ms TTL=59
Reply from 1.1.1.1: bytes=1234 time=61ms TTL=59
Reply from 1.1.1.1: bytes=1234 time=71ms TTL=59

Ping statistics for 1.1.1.1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 57ms, Maximum = 71ms, Average = 64ms
root@OpenWrt:~# tcpdump -vpni vpn |grep "1.1.1.1"
tcpdump: listening on vpn, link-type RAW (Raw IP), capture size 262144 bytes
    10.5.0.2 > 1.1.1.1: ICMP echo request, id 1, seq 130, length 1242
    1.1.1.1 > 10.5.0.2: ICMP echo reply, id 1, seq 130, length 1242
root@OpenWrt:~# tcpdump -vpni wan |grep "length 1324"
tcpdump: listening on wan, link-type EN10MB (Ethernet), capture size 262144 bytes
17:02:17.330951 IP (tos 0x0, ttl 64, id 44611, offset 0, flags [none], proto UDP (17), length 1324)
17:02:17.396561 IP (tos 0x28, ttl 54, id 50885, offset 0, flags [none], proto UDP (17), length 1324)

Does that make it 82?

Yes, the original code biased upwards until it got some delays, and then brought the ceiling down. It didn't distinguish between the two directions, and it didn't measure bandwidth usage. Keeping track of upload and download separately would require some significant refactoring, to be able to send messages for upload factors and download factors separately.

The problem is fairly complicated. Suppose, for example, that you have some download going, and you widen the bandwidth for download, now you get delays... is that because you're using too much download? Or is it because suddenly some radio noise is interfering with your upload speed and your upload is the real problem? I'm not saying the problem is insurmountable, but it's not trivial either. In essence, we're trying to learn two functions of time: download capacity and upload capacity, and we have only observational measures of the SUM of the delays (round trip time) and current bandwidth usage in each direction (not controlled by the script).

The original script was a hack that I did so I could learn more about Erlang while maybe solving the original person's particular problem, but it wasn't a principled approach, and the choice of Erlang means fewer people will be able to work on it / modify it.

Doing a version in shell script would make it so more people could work on it, but would severely limit the mathematics/inference tools available, and it'd be more fragile as a running daemon (though perhaps it could persist state to ram disk and just run on cron). This is really a kind of statistical inference problem: given some measurements of bandwidth in current use, and without injecting any traffic other than a few pings to measure round trip delay time, estimate the channel capacity in each direction.

Most likely to get a good, fast method we either need to do some research to get a principled approach to the statistical estimation problem, and then create a principled approximation to that which runs relatively quickly on router type hardware in a language available on the router (shell, or lua being good candidates), or people need to fiddle with heuristics and see whether they can come up with one that's "good enough" for their personal preferences.

The principled approach is certainly a doable way to go, if maybe 1000 people all were willing to pay $100 to develop it over the next year on some kickstarter / GoFundMe type project, we could get somewhere. I'd head down to Caltech and find a CS Masters student to work on it with me or something. We'd set up a network lab, use a token bucket filter and a pfifo which we could adjust the rate of dynamically to emulate the ISP's variable rate link. Then have 3 or 4 devices on the LAN try to "surf" naturally while the rate is adjusted, and have the software do the estimation, and compare the estimates to the known true dynamically changing rate... But, realistically, that's an actual Masters Thesis problem in Computer Science (and I'd use Julia to do the math on an RPi router so it had gigabytes of ram and gigahertz of multi-core power to do statistical model fitting)

Short of that, I think it's heuristic time. Shell seems a bit masochistic; since Lua is already on the router and is a small and reasonable language, it might make sense to use that. https://www.lua.org/about.html

I see what you mean about masochistic:

#!/bin/sh

read -d '' reflectors << EOF
1.1.1.1
8.8.8.8
EOF

no_reflectors=$(echo "$reflectors" | wc -l)
sum_RTT=0;

for reflector in $reflectors;
do
        sum_RTT=$(echo $sum_RTT + $(ping -i 0.2 -c 10 $reflector | tail -1 | awk '{print $4}' | cut -d '/' -f 2) | bc)
        echo $sum_RTT
done
avg_RTT=$(echo "scale=2; $sum_RTT / $no_reflectors" | bc)
echo $avg_RTT

Anyone know how to get pings to run in parallel? Would I need to fork each ping to background and output to tmp file and then read tmp file?

#!/bin/sh

read -d '' reflectors << EOF
1.1.1.1
8.8.8.8
EOF

no_reflectors=$(echo "$reflectors" | wc -l)

RTTs=$(mktemp)

for reflector in $reflectors;
do
        echo $(ping -i 0.2 -c 10 $reflector | tail -1 | awk '{print $4}' | cut -d '/' -f 2) >> $RTTs&
done
wait
awk '{ total += $1; count++ } END { print total/count }' $RTTs

@moeller0 I am implementing your algorithm in ash using bc.

You proposed always updating the moving average of the RTT average to subtract from the current RTT average to get the delta RTT. I understand the point about setting alpha such that bufferbloat encountered will not unduly raise the baseline.

I just wanted to check with you your thoughts on not always updating the RTT average and only updating when load is below a threshold. If that would be better would you still recommend using exponential moving average with alpha for when it is updated?

Otherwise any comments on what I have so far?

root@OpenWrt:~/sqm-autorate# cat sqm-autorate
#!/bin/sh

read -d '' reflectors << EOF
1.1.1.1
8.8.8.8
EOF

no_reflectors=$(echo "$reflectors" | wc -l)

alpha=0.1

max_delta_RTT=20

rx_bytes_file="/sys/class/net/wan/statistics/rx_bytes"
tx_bytes_file="/sys/class/net/wan/statistics/tx_bytes"

RTTs=$(mktemp)

get_avg_RTT() {

for reflector in $reflectors;
do
        echo $(ping -i 0.2 -c 10 $reflector | tail -1 | awk '{print $4}' | cut -d '/' -f 2) >> $RTTs&
done
wait
avg_RTT=$(awk '{ total += $1; count++ } END { print total/count }' $RTTs)
> $RTTs
}

get_avg_RTT

prev_avg_RTT=$avg_RTT;

while true
do
        t_start=$(date +%s)
        get_avg_RTT
        delta_RTT=$(echo "scale=2; $avg_RTT - $prev_avg_RTT" | bc)
        prev_avg_RTT=$(echo "scale=2;(1-$alpha)*$prev_avg_RTT+$alpha*$avg_RTT" | bc)
        echo $delta_RTT
        echo $max_delta_RTT
        if (echo "$delta_RTT > $max_delta_RTT" | bc -l); then
                echo "bufferbloat detected!!!"
        fi
        sleep 2 # work in progress 
done

Have a look at https://github.com/richb-hanover/OpenWrtScripts/blob/master/betterspeedtest.sh for how to handle pings in the background and how to robustify your script.

If you only update the reference RTT under load, it will not reflect the shortest currently possible RTT; and if you only update when the load is below a threshold, you might not update it for a long time if, say, you download a multi-gigabyte software update. But the proof is in the pudding here....


Thanks. Will keep programming.

Am I right about the 82 above by the way?

It depends, when I run the numbers to get from the pre-encrypted IP payload to the full IP packet size:

ICMP payload + ICMP header + IPv4 header + documented wireguard per-packet overhead:
(pre-encrypted IP payload) + pre-encrypted IP header + wireguard overhead
(1234 + 8) + 20 + 60 = 1322
which is 2 bytes shorter than what you measure, so I would go with your higher number (because to fight bufferbloat it is better to err on the side of too much overhead). cake on the pre-encrypted interface will handle the IP header size just fine, so 62 bytes is the number

But you still need to add the additional overhead on top of the encrypted, encapsulated IP packet, which for LTE is complicated, especially because it is unclear to me which overhead we can safely ignore....

I also came across this:

I used to actually work on Intel/Samsung LTE patent applications, but every single application was on one very detailed aspect. I recall so many hacks to make each new innovation work with the (revised) 3GPP standard. Anyway that was a fair while ago now.

I also have the fun complexity that I am applying CAKE to interfaces (veth-lan for download, and wan for upload) that see both WireGuard and non-WireGuard traffic.

So presumably I should go for worst case with VPN and should set 82 + 10 = 92? Or just 100 because it is a nice number and with this variable and uncertain 4G LTE connection, approximation rather than precision seems apt?

Very much enjoying working on this script. Thank you so much for taking the trouble to write out the algorithm above. I feel fairly confident this will get something that roughly works.

Perhaps I should use this as an opportunity to try github. I have some experience in C and C++, even though I opted for a different career path than one involving programming.


Interesting, but ROHC does not seem to guarantee a strict header size, but then according to your plot and the wikipedia article it might actually result in a smaller size than the full (outer) IP packet... I just have no clue how to test whether it is actually in use... but assuming ROHC is used, then your 82+10 should be on the safe side....

See, you are the resident LTE expert then (as demonstrated by the helpful chart above) :wink:

Well, that is the easy part, implementing it and addressing all the ignored corner-cases to make it robust is the real work.