CAKE w/ Adaptive Bandwidth [August 2022 to March 2024]

# Think carefully about the following settings
# to avoid excessive CPU use (proportional with ping interval / number of pingers)
# and to avoid abusive network activity (excessive ICMP frequency to one reflector)
# The author has found an ICMP rate of 1/(0.2/4) = 20 Hz to give satisfactory performance on 4G
no_pingers=4 # number of pingers to maintain
reflector_ping_interval_s=1 # (seconds, e.g. 0.2s or 2s)

# delay threshold in ms is the extent of OWD increase to classify as a delay
# these are automatically adjusted based on maximum on the wire packet size
# (adjustment significant at sub 12Mbit/s rates, else negligible)  
dl_delay_thr_ms=250 # (milliseconds)
ul_delay_thr_ms=250 # (milliseconds)

# Set either of the below to 0 to adjust one direction only 
# or alternatively set both to 0 to simply use cake-autorate to monitor a connection
adjust_dl_shaper_rate=1 # enable (1) or disable (0) actually changing the dl shaper rate
adjust_ul_shaper_rate=1 # enable (1) or disable (0) actually changing the ul shaper rate

min_dl_shaper_rate_kbps=160    # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=1800   # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=50000  # maximum bandwidth for download (Kbit/s)

min_ul_shaper_rate_kbps=160    # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=1800   # steady state bandwidth for upload (Kbit/s)
max_ul_shaper_rate_kbps=50000  # maximum bandwidth for upload (Kbit/s)

# sleep functionality saves unnecessary pings and CPU cycles by
# pausing all active pingers when the connection is not in active use
enable_sleep_function=1 # enable (1) or disable (0) sleep functionality
connection_active_thr_kbps=100   # threshold in Kbit/s below which dl/ul is considered idle
sustained_idle_sleep_thr_s=20.0  # time threshold to put pingers to sleep when the sustained dl/ul achieved rate stays below connection_active_thr_kbps (seconds)

min_shaper_rates_enforcement=0 # enable (1) or disable (0) dropping down to minimum shaper rates on connection idle or stall

startup_wait_s=5.0 # number of seconds to wait on startup (e.g. to wait for things to settle on router reboot)

So despite the shaper rates having been held down to roughly the minimum shaper rates set in the config, i.e. 160 Kbit/s, we still see RTTs of greater than one second!

I notice a very low ping response frequency has been set: 4 pingers with an interval of 1 second, giving an effective sample interval of 250 ms. I wonder why, because in my experience a higher response frequency works better.

The delay threshold has been set very high, at 250 ms OWD (500 ms RTT).

The settings could be relaxed even further so that such huge latency periods are not punished, e.g. by setting a larger bufferbloat detection window and detection threshold?

The minimum shaper rates could be set higher?

Also, the rate at which the shaper rates increase under high load could be raised from 1.01 to something higher.

But with a new delay sample only every 250 ms, this is hardly going to be very responsive.
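
To put rough numbers on that (a back-of-the-envelope sketch only, ignoring the achieved-rate handling in the actual controller; rates taken from the config above, and 1.01 as the increase factor mentioned):

import math

# Rough estimate of how long the shaper rate takes to climb under sustained load,
# assuming one multiplicative increase of 1.01 per delay sample and a new sample
# every reflector_ping_interval_s / no_pingers = 1 s / 4 = 250 ms.
sample_interval_s = 1.0 / 4
increase_factor = 1.01

def ramp_time_s(from_kbps, to_kbps):
    samples = math.log(to_kbps / from_kbps) / math.log(increase_factor)
    return samples * sample_interval_s

print(round(ramp_time_s(160, 1800)))    # min -> base: ~61 s
print(round(ramp_time_s(160, 50000)))   # min -> max: ~144 s

So even if everything else behaves, climbing back from the 160 Kbit/s floor to the 1800 Kbit/s base rate takes on the order of a minute at this sampling rate.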

What do you think? Any suggestions?

I think this is a very challenging case. I mean what can we really do in situations like this?

I didn't manage to get Julia going that way, either (email might be better). I am focused on my other slides just now.

1 Like

I seem to recall that his LTE ISP does something hostile like throttling all ICMP traffic if there is too much traffic...



But with a delay threshold > 200 ms, I am amazed this results in usable applications at all. Latency spikes like the one at around second 30 look especially mean, and this is already with a traffic rate well below 1 Mbps... The ramp looks a bit like we are overfilling a buffer, resulting in linearly increasing delay, but the steep step down is odd. I think it would be great to have faster delay sampling... which in his case might mean harnessing hping3 or similar and using either UDP or TCP probes to work around the asinine ICMP policy of his ISP.
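
Just to illustrate the TCP-probe idea (not hping3 itself, merely timing a TCP handshake in Python; the target host and port are placeholders, and this only yields an RTT, not OWDs):

import socket
import time

# Crude RTT probe using a TCP handshake instead of ICMP, for links that throttle ICMP.
# Measures the connect() time (SYN -> SYN/ACK) to a placeholder reflector and port.
def tcp_rtt_ms(host="9.9.9.9", port=443, timeout_s=2.0):
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout_s):
        pass
    return (time.monotonic() - start) * 1000.0

print(tcp_rtt_ms())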

I really, really think that on @patrakov's link we need to go to real OWDs, to at least untangle the two directions...

1 Like

Ok Dave, send me a direct message with your email. We can work out either how to get Julia working for you, or I can output some graphs for you. Maybe we should do a video conference and work through what the graphs should look like. It's very fast to iterate on that in Julia since it's an interactive, REPL-based language.

I likely will have time around 11-1 Pacific time.

1 Like

@patrakov if you want to rely on your backup LTE connection, you probably need to configure it and make sure all is well ahead of time, while your main connection is still up and running, rather than waiting until it kicks in.

Thx!

I am at the understandinglatency.com conference until about then (it starts at 7 AM my time). Free signup; Stuart is talking today. I'm also on a panel...

1 Like

Thanks for the idea, I will definitely run some tool to measure OWDs during the meeting.

This post will be updated with the results.

Baseline without SQM at 20:14 PST (the meeting will be at 23:00 PST):

Speedtest: https://www.speedtest.net/result/14440296383 (but note that the ISP cheats and gives speedtest very preferential treatment, so this only indicates what the radio channel is capable of)

Waveform bufferbloat: https://www.waveform.com/tools/bufferbloat?test-id=56f48ca0-427d-49fe-9587-8fb58d60644d

Note: after these measurements, the modem was slightly repositioned (literally 2 cm closer to the wall) in the hope of getting better uplink quality, unfortunately at the expense of the downlink.

Baseline without SQM just before the meeting (22:57 PST):

Speedtest: https://www.speedtest.net/result/14441050030

Waveform bufferbloat: https://www.waveform.com/tools/bufferbloat?test-id=ef844441-a846-4060-b7fe-9e3484332aa5

Baseline with SQM set high enough just after the meeting (23:36 PST):

Speedtest: https://www.speedtest.net/result/14441225196

Waveform bufferbloat: https://www.waveform.com/tools/bufferbloat?test-id=64fef20f-b531-4057-aeb8-54fa2cb9d67a

Logs and configs used during the meeting: https://u.pcloud.link/publink/show?code=kZufOPVZw7Pt1fRuLuy6DUH5YV7VpjksN8bV (too large for the pastebin)

During the first part of the meeting, I tried to keep cake-autorate on, with the cake-autorate_config.lte.sh config that you can see in this folder. It worked for some time, then dropped to 200 kbps (and yes I have to keep the minimum that low, because during heavy rains it is sometimes that bad) and never recovered. The log is saved as cake-autorate.lte.log.bad. Bad, because Discord, at least inside Firefox, does not adapt to such low bandwidth.

During the second part of the meeting, SQM was restarted and reset to a high bandwidth limit that surely could not be hit (15000/15000 kbps). Then, cake-autorate was run with a config that never actually adjusts the rates, cake-autorate_config.lte.sh.new. The corresponding log is cake-autorate.lte.log. The second part of the meeting went OK-ish, but there was one complaint that my voice was choppy. This log also has the two speedtests (speedtest.net + Waveform bufferbloat) recorded at the end.

The OWDs have not been recorded properly, because I forgot to use mwan3 use lte. Sorry!

Also, the signal quality was monitored and saved as hcsq.log. The columns are the timestamp, the constant string "LTE", and the four numbers (r1, r2, r3, r4) that "AT^HCSQ?" returns after the system-mode string on a Huawei E3372s modem. Interpretation:

RSSI_dBm = -120 + r1 
RSRP_dBm = -140 + r2
SINR_dB = -20 + (r3 * 0.2)
RSRQ_dB = -19.5 + (r4 * 0.5)
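
For post-processing hcsq.log, a small sketch applying those conversions could look like this (assuming whitespace-separated columns in the layout described above; adjust the split if the log uses a different separator):

# Sketch: convert raw AT^HCSQ values from hcsq.log to dBm/dB using the formulas above.
# Assumed line layout: <timestamp> LTE <r1> <r2> <r3> <r4>
def parse_hcsq_line(line):
    ts, _sysmode, r1, r2, r3, r4 = line.split()[:6]
    return {
        "timestamp": ts,
        "rssi_dbm": -120 + int(r1),
        "rsrp_dbm": -140 + int(r2),
        "sinr_db": -20 + int(r3) * 0.2,
        "rsrq_db": -19.5 + int(r4) * 0.5,
    }

with open("hcsq.log") as f:
    for line in f:
        if line.strip():
            print(parse_hcsq_line(line))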

I hope that this array of raw data will be of some use in determining how to deal with such bad links.

2 Likes

You could try tsping!

1 Like

Do they do this only for actual speedtest endpoints, or do they unlock your link during a speedtest? If the latter, I would constantly run speedtests in parallel with my actual load...

Constantly running speedtests would be prohibitively expensive; this is not an unlimited-gigabytes plan. And no, they don't unlock the link completely.

Well... here is the piece of the log:

LOAD; 2023-03-06-08:18:45; 1678090725.200001; 1678090725.199551; 254; 71; 395; 173
DATA; 2023-03-06-08:18:45; 1678090725.389069; 1678090725.388542; 254; 71; 64; 41; 1678090725.379890; 208.67.220.123; 28; 34904; 235500; 86644; 200596; 253797; 34904; 235500; 86644; 200596; 258670; 0; 0; dl_low; ul_idle; 395; 173
LOAD; 2023-03-06-08:18:45; 1678090725.452103; 1678090725.451715; 268; 47; 395; 173
LOAD; 2023-03-06-08:18:45; 1678090725.704648; 1678090725.704230; 271; 8; 395; 173
DATA; 2023-03-06-08:18:45; 1678090725.740085; 1678090725.739544; 271; 8; 68; 4; 1678090725.731080; 94.140.14.141; 28; 35014; 285500; 82296; 250485; 253797; 35014; 285500; 82296; 250485; 258670; 0; 0; dl_low; ul_idle; 395; 173
LOAD; 2023-03-06-08:18:45; 1678090725.957256; 1678090725.956841; 256; 3; 395; 173
LOAD; 2023-03-06-08:18:46; 1678090726.209206; 1678090726.208820; 274; 3; 395; 173
LOAD; 2023-03-06-08:18:46; 1678090726.461923; 1678090726.461498; 277; 5; 395; 173
DATA; 2023-03-06-08:18:46; 1678090726.489668; 1678090726.489119; 277; 5; 70; 2; 1678090726.479380; 185.228.168.10; 29; 36497; 535000; 101031; 498503; 253797; 36497; 535000; 101031; 498503; 258670; 1; 1; dl_low; ul_idle; 395; 173
DATA; 2023-03-06-08:18:46; 1678090726.510097; 1678090726.509557; 277; 5; 70; 2; 1678090726.500760; 9.9.9.10; 29; 46148; 420500; 93130; 374352; 253797; 46148; 420500; 93130; 374352; 258670; 2; 2; dl_low; ul_idle; 395; 173
DATA; 2023-03-06-08:18:46; 1678090726.538820; 1678090726.538249; 277; 5; 70; 2; 1678090726.518380; 208.67.220.123; 29; 35173; 304000; 103951; 268827; 256024; 35173; 304000; 103951; 268827; 259375; 3; 3; dl_low_bb; ul_idle_bb; 249; 160
DATA; 2023-03-06-08:18:46; 1678090726.545078; 1678090726.544525; 277; 5; 111; 3; 1678090726.526160; 94.140.14.141; 29; 35161; 182500; 88475; 147338; 256024; 35161; 182500; 88475; 147338; 259375; 3; 3; dl_high_bb; ul_idle_bb; 249; 160
DATA; 2023-03-06-08:18:46; 1678090726.566100; 1678090726.565563; 277; 5; 111; 3; 1678090726.557580; 185.228.168.10; 30; 36534; 73500; 101031; 36966; 256024; 36534; 73500; 101031; 36966; 259375; 3; 3; dl_high_bb; ul_idle_bb; 249; 160
LOAD; 2023-03-06-08:18:46; 1678090726.714613; 1678090726.713463; 191; 57; 249; 160
DATA; 2023-03-06-08:18:46; 1678090726.787126; 1678090726.786454; 191; 57; 76; 35; 1678090726.776430; 9.9.9.10; 30; 46159; 57500; 93130; 11341; 256024; 46159; 57500; 93130; 11341; 259375; 3; 3; dl_low_bb; ul_idle_bb; 249; 160
LOAD; 2023-03-06-08:18:46; 1678090726.966438; 1678090726.966044; 166; 28; 249; 160
DATA; 2023-03-06-08:18:47; 1678090727.026173; 1678090727.025639; 166; 28; 66; 17; 1678090727.017750; 208.67.220.123; 30; 35190; 53000; 95767; 17809; 256024; 35190; 53000; 95767; 17809; 259375; 2; 2; dl_low; ul_idle; 249; 160
LOAD; 2023-03-06-08:18:47; 1678090727.219073; 1678090727.218679; 172; 3; 249; 160

I think this is explainable with a stall in the upload direction only (so not meeting our definition of a stall). The modem buffered the pings, and then, when the upload channel cleared up, released them all at once. And then they got reflected all at once and, as you can see, they arrived almost simultaneously.

I think this near-simultaneous arrival of multiple responses within just 80 ms, with no bloat in the last one, serves as a good indicator to ignore the spike.
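
For example, a post-filter along these lines might work (a hypothetical helper, not something cake-autorate does today; the window and threshold values are just placeholders):

# Hypothetical spike filter: treat a burst of reflector responses that arrive
# almost simultaneously, with the newest one showing no bloat, as queued probes
# released after an upload-side stall rather than as genuine bufferbloat.
def is_released_burst(responses, burst_window_ms=100, delay_thr_ms=250):
    # responses: list of (arrival_time_ms, owd_delta_ms), oldest first
    if len(responses) < 3:
        return False
    arrivals = [t for t, _ in responses]
    arrived_together = (arrivals[-1] - arrivals[0]) <= burst_window_ms
    newest_clean = responses[-1][1] < delay_thr_ms
    return arrived_together and newest_clean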

Well you could throttle the actual speedtest and only run it if you want to use the link...

But that would not help.

In the past, IIRC, Telekom Mobile in the US used a "trick" in which they unlocked links during a speedtest... but that obviously was not above board, and they stopped doing this once caught... (Their argument was that they wanted to show customers the capacity they could have, except they did not bother to inform customers about this fact.)

Any idea what 'finished' relates to:

root@OpenWrt-1:~# tsping 9.9.9.9
Starting tsping 0.2.2 - pinging 1 targets
9.9.9.9         : [0] Down: 50373319, Up: 50373344, RTT: 50373344, Originate: 50373378, Received: 59, Transmit: 34, Finished: 25
9.9.9.9         : [1] Down: 50373419, Up: 50373434, RTT: 50373434, Originate: 50373468, Received: 49, Transmit: 34, Finished: 15
9.9.9.9         : [2] Down: 50373520, Up: 50373534, RTT: 50373534, Originate: 50373568, Received: 48, Transmit: 34, Finished: 14
9.9.9.9         : [3] Down: 50373620, Up: 50373633, RTT: 50373633, Originate: 50373666, Received: 46, Transmit: 33, Finished: 13
9.9.9.9         : [4] Down: 50373720, Up: 50373723, RTT: 50373723, Originate: 50373756, Received: 36, Transmit: 33, Finished: 3
9.9.9.9         : [5] Down: 50373821, Up: 50373833, RTT: 50373833, Originate: 50373866, Received: 45, Transmit: 33, Finished: 12
9.9.9.9         : [6] Down: 50373921, Up: 50373933, RTT: 50373933, Originate: 50373966, Received: 45, Transmit: 33, Finished: 12
9.9.9.9         : [7] Down: 50374021, Up: 50374033, RTT: 50374033, Originate: 50374067, Received: 46, Transmit: 34, Finished: 12
9.9.9.9         : [8] Down: 50374121, Up: 50374132, RTT: 50374132, Originate: 50374166, Received: 45, Transmit: 34, Finished: 11
9.9.9.9         : [9] Down: 50374222, Up: 50374233, RTT: 50374233, Originate: 50374266, Received: 44, Transmit: 33, Finished: 11
^C
root@OpenWrt-1:~# tsping -m 9.9.9.9
Starting tsping 0.2.2 - pinging 1 targets
9.9.9.9,0,50395549,50395563,50395563,50395596,47,33,14
9.9.9.9,1,50395650,50395663,50395663,50395696,46,33,13
9.9.9.9,2,50395750,50395763,50395763,50395796,46,33,13
9.9.9.9,3,50395850,50395863,50395863,50395896,46,33,13
9.9.9.9,4,50395951,50395963,50395963,50395996,45,33,12
9.9.9.9,5,50396051,50396063,50396063,50396096,45,33,12
9.9.9.9,6,50396151,50396163,50396163,50396196,45,33,12
9.9.9.9,7,50396252,50396271,50396271,50396303,51,32,19
9.9.9.9,8,50396352,50396363,50396363,50396396,44,33,11
9.9.9.9,9,50396452,50396463,50396463,50396496,44,33,11

@Lochnair?

All I will say is: :person_facepalming:

Didn't catch that the ordering of the columns is different when I changed how the printing works...

	char FMT_ICMP_TIMESTAMP_HUMAN[] = "%-15s : [%u] Down: %d, Up: %d, RTT: %d, Originate: %u, Received: %u, Transmit: %u, Finished: %u\n";

and

				printf(FMT_OUTPUT, ip, result.sequence, result.originateTime, result.receiveTime, result.transmitTime, result.finishedTime, rtt, down_time, up_time);
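
Lining the printf arguments up against that format string (assuming FMT_OUTPUT resolves to FMT_ICMP_TIMESTAMP_HUMAN in this code path) suggests the human-readable labels are shifted like this:

# My reading of the mismatch above (a hypothetical mapping, not verified against
# the rest of the source): printed label -> value actually passed to printf.
LABEL_TO_ACTUAL_VALUE = {
    "Down":      "originateTime",
    "Up":        "receiveTime",
    "RTT":       "transmitTime",
    "Originate": "finishedTime",
    "Received":  "rtt",
    "Transmit":  "down_time",
    "Finished":  "up_time",
}

That reading at least matches the sample output above: "Received: 59" equals 50373378 - 50373319, i.e. the RTT, so "Finished: 25" would actually be the upstream delay.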

1 Like

Yes, somehow I didn't catch that when testing my changes :see_no_evil:
So need to fix that up

Any chance you could also add an option to print out in µs?

ICMP timestamps only have millisecond resolution... so that is the natural way of reporting them. However, bash-autorate uses µs because bash's EPOCHREALTIME reports in µs... so µs would be convenient...

I have updated the post with more data:

EDIT: useless, I did not record OWDs properly. Please wait for the next Discord meeting, which is on Wednesday.

Something doesn't feel right with the time intervals.

root@OpenWrt-1:~/cake-autorate# tsping  --print-timestamps --machine-readable=' ' --sleep-time 200 --target-spacing 100 9.9.9.9 9.9.9.10
Starting tsping 0.2.2 - pinging 2 targets
1678118756.728162 9.9.9.9 0 57956682 57956703 57956703 57956728 46 25 21
1678118756.829150 9.9.9.10 0 57956782 57956803 57956803 57956829 47 26 21
1678118757.129260 9.9.9.9 1 57957082 57957103 57957103 57957129 47 26 21
1678118757.229163 9.9.9.10 1 57957183 57957204 57957204 57957229 46 25 21
1678118757.528159 9.9.9.9 2 57957483 57957503 57957503 57957528 45 25 20
1678118757.630785 9.9.9.10 2 57957583 57957603 57957603 57957630 47 27 20
1678118757.928160 9.9.9.9 3 57957883 57957903 57957903 57957928 45 25 20
1678118758.028105 9.9.9.10 3 57957984 57958003 57958003 57958028 44 25 19
1678118758.328203 9.9.9.9 4 57958284 57958303 57958303 57958328 44 25 19
1678118758.428184 9.9.9.10 4 57958384 57958403 57958403 57958428 44 25 19
1678118758.728151 9.9.9.9 5 57958685 57958703 57958703 57958728 43 25 18
1678118759.130258 9.9.9.9 6 57959085 57959105 57959105 57959130 45 25 20
1678118759.229255 9.9.9.10 6 57959186 57959203 57959203 57959229 43 26 17
1678118759.528163 9.9.9.9 7 57959486 57959502 57959502 57959528 42 26 16
1678118759.630210 9.9.9.10 7 57959586 57959603 57959603 57959630 44 27 17
1678118759.928276 9.9.9.9 8 57959886 57959903 57959903 57959928 42 25 17
1678118760.028298 9.9.9.10 8 57959987 57960003 57960003 57960028 41 25 16

I expected this to give a round-robin spacing of 1 s, with 500 ms spacing between reflectors, i.e. one response every 500 ms, but we don't see that:

root@OpenWrt-1:~/cake-autorate# tsping  --print-timestamps --machine-readable=' ' --sleep-time 1000 --target-spacing 500 9.9.9.9 9.9.9.10
Starting tsping 0.2.2 - pinging 2 targets
1678119096.278095 9.9.9.9 0 58296228 58296253 58296253 58296278 50 25 25
1678119096.768772 9.9.9.10 0 58296728 58296743 58296743 58296768 40 25 15
1678119098.277984 9.9.9.9 1 58298229 58298253 58298253 58298277 48 24 24
1678119098.777821 9.9.9.10 1 58298729 58298753 58298753 58298777 48 24 24
1678119100.278943 9.9.9.9 2 58300229 58300253 58300253 58300278 49 25 24
1678119100.769150 9.9.9.10 2 58300730 58300743 58300743 58300769 39 26 13
1678119102.277942 9.9.9.9 3 58302230 58302253 58302253 58302277 47 24 23
1678119102.778920 9.9.9.10 3 58302730 58302753 58302753 58302778 48 25 23
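
For what it's worth, the spacing in both runs above is consistent with the sleep being added after the last target of a round rather than the round repeating once per sleep interval (just my reading of the timestamps, not a statement about what the code actually does):

# The timestamps above fit: period = n_targets * target_spacing + sleep_time,
# i.e. the sleep appears to be appended after the final target of each round,
# rather than the round starting once per sleep interval.
def period_ms(sleep_ms, spacing_ms, n_targets):
    return n_targets * spacing_ms + sleep_ms

# Run 1: --sleep-time 200 --target-spacing 100, 2 targets -> observed ~400 ms between rounds
print(period_ms(200, 100, 2))    # 400
# Run 2: --sleep-time 1000 --target-spacing 500, 2 targets -> observed ~2000 ms between rounds
print(period_ms(1000, 500, 2))   # 2000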

@Lochnair is the sleep between rounds counted from the last response rather than from the first send?

I'm trying to figure out how these correspond with fpings.