CAKE w/ Adaptive Bandwidth

2.0.0 uses a different config format and the log will show 2.0.0 running.

It could be that your settings aren't reflecting what you want or need. Or that you need a better LTE connection. Or there is more we could do to work with such a connection. I'm open-minded.

I thought that in 2.0.0 the majority of the config was moved to cake-autorate_defaults.sh. I thought that just copying the old config would override everything and thus still be valid. Is that not true?

Regarding the need for a "better LTE connection" - well, as Discord works well during daytime without cake-autorate, this is ruled out, I think. The other two operators are even worse.

So, as explained above and in the readme, there is now a set of defaults; the interface names, whether to apply shaper rates, the bandwidths and any overrides are now placed in the config. But again, your log shows 1.2.0 running.

As I said, I am open-minded, but my point is that your connection presents a significant challenge. Spurious latency spikes are mixed in with real bufferbloat. How much have you tried playing with the settings?

That's because the config says, wrongly,

cake_autorate_version="1.2.0"

I still maintain it is valid to override everything by copying the old config.

Oh I see. Well yes maybe. Although that's really not what was intended.

Maybe you could run your data through @moeller0's Octave plotter? Would be nice to see what's going on. I can't easily plot myself right now.

Well, I think it would be a good idea. There is another optional internal meeting in a few hours, I will try to attend it from LTE and to record half of it with cake-autorate on and half of it with adjust_{dl,ul}_shaper_rate=0.

Maybe this should not be set in a user-configurable file in the first place? ;) It is not as if one could change versions by editing this variable.

Please take two separate log files for the two conditions, otherwise the CDF plots will simply mix up both conditions. Or, if you want a unified log, use the option to plot only certain time ranges and do this separately for each of the two conditions.
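If a unified log has already been recorded, another way to keep the two conditions apart is to split it on a timestamp before plotting. Below is a minimal Python sketch, not part of cake-autorate or the plotter. It assumes that the third semicolon-separated field of every log line is the epoch timestamp in seconds, as appears to be the case in the LOAD and DATA lines quoted later in this thread; the function name and the cutoff value are hypothetical.

```python
from typing import Iterable, List, Tuple


def split_log_by_time(
    lines: Iterable[str], cutoff_epoch_s: float
) -> Tuple[List[str], List[str]]:
    """Split log lines into (before, after) lists around a cutoff time.

    Assumption: index 2 of each ';'-separated line holds an epoch
    timestamp in seconds. Lines that do not parse are skipped.
    """
    before: List[str] = []
    after: List[str] = []
    for line in lines:
        fields = [f.strip() for f in line.split(";")]
        try:
            timestamp = float(fields[2])
        except (IndexError, ValueError):
            continue  # header or malformed line
        if timestamp < cutoff_epoch_s:
            before.append(line)
        else:
            after.append(line)
    return before, after
```

Each returned list can then be written to its own file and plotted separately.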

# Think carefully about the following settings
# to avoid excessive CPU use (proportional with ping interval / number of pingers)
# and to avoid abusive network activity (excessive ICMP frequency to one reflector)
# The author has found an ICMP rate of 1/(0.2/4) = 20 Hz to give satisfactory performance on 4G
no_pingers=4 # number of pingers to maintain
reflector_ping_interval_s=1 # (seconds, e.g. 0.2s or 2s)

# delay threshold in ms is the extent of OWD increase to classify as a delay
# these are automatically adjusted based on maximum on the wire packet size
# (adjustment significant at sub 12Mbit/s rates, else negligible)  
dl_delay_thr_ms=250 # (milliseconds)
ul_delay_thr_ms=250 # (milliseconds)

# Set either of the below to 0 to adjust one direction only 
# or alternatively set both to 0 to simply use cake-autorate to monitor a connection
adjust_dl_shaper_rate=1 # enable (1) or disable (0) actually changing the dl shaper rate
adjust_ul_shaper_rate=1 # enable (1) or disable (0) actually changing the ul shaper rate

min_dl_shaper_rate_kbps=160    # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=1800   # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=50000  # maximum bandwidth for download (Kbit/s)

min_ul_shaper_rate_kbps=160    # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=1800   # steady state bandwidth for upload (Kbit/s)
max_ul_shaper_rate_kbps=50000  # maximum bandwidth for upload (Kbit/s)

# sleep functionality saves unnecessary pings and CPU cycles by
# pausing all active pingers when connection is not in active use
enable_sleep_function=1 # enable (1) or disable (0) sleep functionality
connection_active_thr_kbps=100   # threshold in Kbit/s below which dl/ul is considered idle
sustained_idle_sleep_thr_s=20.0  # time threshold to put pingers to sleep on sustained dl/ul achieved rate < idle_thr (seconds)

min_shaper_rates_enforcement=0 # enable (1) or disable (0) dropping down to minimum shaper rates on connection idle or stall

startup_wait_s=5.0 # number of seconds to wait on startup (e.g. to wait for things to settle on router reboot)

So despite the shaper rates having been held down to circa the minimum shaper rates set in the config, i.e. 160Kbit/s, we see greater than one second RTT!

I notice a very low ping response frequency has been set - 4 pingers with an interval of 1 second, so an effective interval of 250ms. I wonder why? Because in my experience a higher response frequency works better.
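To make the arithmetic explicit, here is a small Python sketch that mirrors the formula in the config comment. It assumes the pingers are evenly staggered, so the aggregate response rate is simply the number of pingers divided by the per-reflector interval; the function names are hypothetical.

```python
def aggregate_ping_rate_hz(no_pingers: int, reflector_ping_interval_s: float) -> float:
    """Aggregate ping response rate across all pingers, in Hz."""
    return no_pingers / reflector_ping_interval_s


def effective_interval_ms(no_pingers: int, reflector_ping_interval_s: float) -> float:
    """Average spacing between consecutive responses, in milliseconds."""
    return 1000.0 * reflector_ping_interval_s / no_pingers


# Settings from the posted config: 4 pingers, 1 s per reflector
print(aggregate_ping_rate_hz(4, 1.0), "Hz,", effective_interval_ms(4, 1.0), "ms")
# Settings suggested in the config comment: 4 pingers, 0.2 s per reflector
print(aggregate_ping_rate_hz(4, 0.2), "Hz,", effective_interval_ms(4, 0.2), "ms")
```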

The delay threshold has been set very high to 250ms OWD (500ms RTT).

The settings could be relaxed even further so as not to punish such huge latency periods, e.g. by setting a larger bufferbloat detection window and detection threshold?

The minimum shaper rates could be set higher?

Also the rate of increase of the shaper rates on load high could be increased from 1.01 to something higher.

But at every 250ms this is hardly going to be very responsive.
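A rough way to see how slow this is: the Python sketch below counts how many multiplicative steps it takes to climb from the minimum to the base shaper rate. It assumes, as a simplification, exactly one increase step per ping response under sustained high load with no bufferbloat detected, and ignores everything else the script does. The function names are hypothetical.

```python
def steps_to_recover(start_kbps: float, target_kbps: float, factor: float) -> int:
    """Number of multiplicative increases needed to reach target_kbps."""
    if factor <= 1.0:
        raise ValueError("factor must be greater than 1")
    steps = 0
    rate = start_kbps
    while rate < target_kbps:
        rate *= factor
        steps += 1
    return steps


def recovery_time_s(start_kbps: float, target_kbps: float,
                    factor: float, sample_interval_s: float) -> float:
    """Recovery time, assuming one step per sample (a simplification)."""
    return steps_to_recover(start_kbps, target_kbps, factor) * sample_interval_s


# Values from the posted config: 160 -> 1800 Kbit/s, factor 1.01, one sample per 250 ms
print(recovery_time_s(160, 1800, 1.01, 0.25), "seconds")
```

Under these assumptions the climb from 160 to 1800 Kbit/s takes on the order of a minute, which illustrates why both a larger increase factor and a faster sampling rate would help.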

What do you think? Any suggestions?

I think this is a very challenging case. I mean what can we really do in situations like this?

I didn't manage to get Julia going that way, either (email might be better). I am focused on my other slides just now.

I seem to recall that his LTE ISP does something hostile like throttling all ICMP traffic if there is too much traffic...



But with a delay threshold > 200ms, I am amazed this results in usable applications anyway. These latency spikes, like the one at around second 30, look especially mean. This is already with a traffic rate well below 1 Mbps... The ramp looks a bit like we are overfilling a buffer, resulting in linearly increasing delay, but the steep step down is odd. I think it would be great to have faster delay sampling... which in his case might mean harnessing hping3 or similar and using either UDP or TCP probes to work around the asinine ICMP policy of his ISP.

I really really think on @patrakov's link we need to go to real OWDs, to at least untangle both directions...

Ok Dave, send me a direct message with your email. We can work out either how to get Julia working for you or I can output some graphs for you. Maybe we should do a video conference and work through what the graphs should look like. It's very fast to iterate on that in Julia since it's an interactive, REPL-based language.

I likely will have time around 11-1 Pacific time.

@patrakov if you want to rely on your backup LTE connection, you probably need to configure it and make sure all is well before it kicks in, i.e. while your main connection is still up and running.

Thx!

I am at the understandinglatency.com conference until about then (it starts at 7 AM my time). Free signup; Stuart is talking today. I'm on a panel, also...

Thanks for the idea, I will definitely run some tool to measure OWDs during the meeting.

This post will be updated with the results.

Baseline without SQM at 20:14 PST (the meeting will be at 23:00 PST):

Speedtest: https://www.speedtest.net/result/14440296383 (but note that the ISP cheats and gives speedtest a very preferential treatment - so this only indicates what the radio channel is capable of)

Waveform bufferbloat: https://www.waveform.com/tools/bufferbloat?test-id=56f48ca0-427d-49fe-9587-8fb58d60644d

Note: after these measurements, the modem was slightly repositioned (literally 2 cm closer to the wall) in the hope of getting better uplink quality, unfortunately at the expense of downlink.

Baseline without SQM just before the meeting (22:57 PST):

Speedtest: https://www.speedtest.net/result/14441050030

Waveform bufferbloat: https://www.waveform.com/tools/bufferbloat?test-id=ef844441-a846-4060-b7fe-9e3484332aa5

Baseline with SQM set high enough just after the meeting (23:36 PST):

Speedtest: https://www.speedtest.net/result/14441225196

Waveform bufferbloat: https://www.waveform.com/tools/bufferbloat?test-id=64fef20f-b531-4057-aeb8-54fa2cb9d67a

Logs and configs used during the meeting: https://u.pcloud.link/publink/show?code=kZufOPVZw7Pt1fRuLuy6DUH5YV7VpjksN8bV (too large for the pastebin)

During the first part of the meeting, I tried to keep cake-autorate on, with the cake-autorate_config.lte.sh config that you can see in this folder. It worked for some time, then dropped to 200 kbps (and yes I have to keep the minimum that low, because during heavy rains it is sometimes that bad) and never recovered. The log is saved as cake-autorate.lte.log.bad. Bad, because Discord, at least inside Firefox, does not adapt to such low bandwidth.

During the second part of the meeting, SQM was restarted and reset to a high bandwidth limit that surely could not be hit (15000/15000 kbps). Then cake-autorate was run with a config that never actually adjusts the rates, cake-autorate_config.lte.sh.new. The corresponding log is cake-autorate.lte.log. The second part of the meeting went OK-ish, but there was one complaint that my voice was choppy. This log also records the two speedtests (speedtest.net + waveform bufferbloat) at the end.

The OWDs have not been recorded properly, because I forgot to use mwan3 use lte. Sorry!

Also, the signal quality was monitored, and saved as hcsq.log. The columns are the timestamp, the constant string "LTE", and the four numbers (r1, r2, r3, r4) that "AT^HCSQ?" returns after it on a Huawei E3372s modem. Interpretation:

RSSI_dBm = -120 + r1 
RSRP_dBm = -140 + r2
SINR_dB = -20 + (r3 * 0.2)
RSRQ_dB = -19.5 + (r4 * 0.5)
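For anyone post-processing hcsq.log, the conversion above can be written as a small Python helper. This is only a sketch of the four formulas as stated in this post; the function name and the example input values are hypothetical, and parsing of the log columns is left out because the exact delimiter is not shown here.

```python
from typing import Dict


def decode_hcsq_lte(r1: int, r2: int, r3: int, r4: int) -> Dict[str, float]:
    """Convert raw AT^HCSQ? LTE values to physical units.

    Uses the formulas quoted in this post for a Huawei E3372s modem.
    """
    return {
        "rssi_dbm": -120 + r1,
        "rsrp_dbm": -140 + r2,
        "sinr_db": -20 + r3 * 0.2,
        "rsrq_db": -19.5 + r4 * 0.5,
    }


# Hypothetical raw values, for illustration only
print(decode_hcsq_lte(50, 40, 100, 20))
```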

I hope that this array of raw data will be of some use in determining how to deal with such bad links.

You could try ts-ping!

Do they do this only for actual speedtest endpoints, or do they unlock your link during a speedtest? If the latter, I would constantly run speedtests in parallel to my actual load...

Constantly running speedtests would be prohibitively expensive, this is not an unlimited-gigabytes plan. And no, they don't unlock the link completely.

Well... here is the piece of the log:

LOAD; 2023-03-06-08:18:45; 1678090725.200001; 1678090725.199551; 254; 71; 395; 173
DATA; 2023-03-06-08:18:45; 1678090725.389069; 1678090725.388542; 254; 71; 64; 41; 1678090725.379890; 208.67.220.123; 28; 34904; 235500; 86644; 200596; 253797; 34904; 235500; 86644; 200596; 258670; 0; 0; dl_low; ul_idle; 395; 173
LOAD; 2023-03-06-08:18:45; 1678090725.452103; 1678090725.451715; 268; 47; 395; 173
LOAD; 2023-03-06-08:18:45; 1678090725.704648; 1678090725.704230; 271; 8; 395; 173
DATA; 2023-03-06-08:18:45; 1678090725.740085; 1678090725.739544; 271; 8; 68; 4; 1678090725.731080; 94.140.14.141; 28; 35014; 285500; 82296; 250485; 253797; 35014; 285500; 82296; 250485; 258670; 0; 0; dl_low; ul_idle; 395; 173
LOAD; 2023-03-06-08:18:45; 1678090725.957256; 1678090725.956841; 256; 3; 395; 173
LOAD; 2023-03-06-08:18:46; 1678090726.209206; 1678090726.208820; 274; 3; 395; 173
LOAD; 2023-03-06-08:18:46; 1678090726.461923; 1678090726.461498; 277; 5; 395; 173
DATA; 2023-03-06-08:18:46; 1678090726.489668; 1678090726.489119; 277; 5; 70; 2; 1678090726.479380; 185.228.168.10; 29; 36497; 535000; 101031; 498503; 253797; 36497; 535000; 101031; 498503; 258670; 1; 1; dl_low; ul_idle; 395; 173
DATA; 2023-03-06-08:18:46; 1678090726.510097; 1678090726.509557; 277; 5; 70; 2; 1678090726.500760; 9.9.9.10; 29; 46148; 420500; 93130; 374352; 253797; 46148; 420500; 93130; 374352; 258670; 2; 2; dl_low; ul_idle; 395; 173
DATA; 2023-03-06-08:18:46; 1678090726.538820; 1678090726.538249; 277; 5; 70; 2; 1678090726.518380; 208.67.220.123; 29; 35173; 304000; 103951; 268827; 256024; 35173; 304000; 103951; 268827; 259375; 3; 3; dl_low_bb; ul_idle_bb; 249; 160
DATA; 2023-03-06-08:18:46; 1678090726.545078; 1678090726.544525; 277; 5; 111; 3; 1678090726.526160; 94.140.14.141; 29; 35161; 182500; 88475; 147338; 256024; 35161; 182500; 88475; 147338; 259375; 3; 3; dl_high_bb; ul_idle_bb; 249; 160
DATA; 2023-03-06-08:18:46; 1678090726.566100; 1678090726.565563; 277; 5; 111; 3; 1678090726.557580; 185.228.168.10; 30; 36534; 73500; 101031; 36966; 256024; 36534; 73500; 101031; 36966; 259375; 3; 3; dl_high_bb; ul_idle_bb; 249; 160
LOAD; 2023-03-06-08:18:46; 1678090726.714613; 1678090726.713463; 191; 57; 249; 160
DATA; 2023-03-06-08:18:46; 1678090726.787126; 1678090726.786454; 191; 57; 76; 35; 1678090726.776430; 9.9.9.10; 30; 46159; 57500; 93130; 11341; 256024; 46159; 57500; 93130; 11341; 259375; 3; 3; dl_low_bb; ul_idle_bb; 249; 160
LOAD; 2023-03-06-08:18:46; 1678090726.966438; 1678090726.966044; 166; 28; 249; 160
DATA; 2023-03-06-08:18:47; 1678090727.026173; 1678090727.025639; 166; 28; 66; 17; 1678090727.017750; 208.67.220.123; 30; 35190; 53000; 95767; 17809; 256024; 35190; 53000; 95767; 17809; 259375; 2; 2; dl_low; ul_idle; 249; 160
LOAD; 2023-03-06-08:18:47; 1678090727.219073; 1678090727.218679; 172; 3; 249; 160

I think this is explainable with a stall in the upload direction only (so not meeting our definition of a stall). The modem buffered the pings, and then, when the upload channel cleared up, released them all at once. And then they got reflected all at once and, as you can see, they arrived almost simultaneously.

I think this near-simultaneous arrival of multiple responses within just 80 ms, with no bloat in the last one, serves as a good indicator to ignore the spike.
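To make the proposed indicator concrete, here is a minimal Python sketch of such a check. It is not how cake-autorate currently works. It assumes that, in a DATA line, index 2 is the arrival timestamp, index 14 is the delay delta in microseconds and index 15 is the adjusted delay threshold; these positions are inferred from the excerpt above, not taken from documentation. The function names and the min_count parameter are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, bool]  # (arrival time in s, delta exceeded threshold?)


def parse_data_line(line: str) -> Optional[Sample]:
    """Extract (arrival time, bloated flag) from a DATA log line.

    Assumed field positions (inferred from the excerpt): 2 = timestamp,
    14 = delay delta (us), 15 = adjusted delay threshold (us).
    """
    fields = [f.strip() for f in line.split(";")]
    if fields[0] != "DATA" or len(fields) < 16:
        return None
    return float(fields[2]), int(fields[14]) > int(fields[15])


def is_release_burst(samples: List[Sample], window_s: float = 0.08,
                     min_count: int = 4) -> bool:
    """True if at least min_count responses arrive within window_s,
    some of them bloated, and the last one of the run is not bloated."""
    ordered = sorted(samples)
    start = 0
    for end in range(len(ordered)):
        # Shrink the run until it fits inside the time window
        while ordered[end][0] - ordered[start][0] > window_s:
            start += 1
        run = ordered[start:end + 1]
        if (len(run) >= min_count and not run[-1][1]
                and any(bloated for _, bloated in run[:-1])):
            return True
    return False
```

Feeding it the DATA lines from 08:18:46 above should flag the cluster as a buffered release rather than sustained bloat, whereas responses that arrive at the normal 250 ms spacing never form a run long enough to trigger it. The 80 ms window comes from the observation above; the minimum count of 4 is a guess that would need tuning.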