CAKE w/ Adaptive Bandwidth [August 2022 to March 2024]

This actually looks like a misuse of mwan3, but well, if it works and the upstream ISP router supports nothing else (e.g. bonding or teaming) for aggregating the bandwidth of LAN ports, then it is a valid solution.

@Lynx a suggestion to replace the help on cake-autorate_config

replace

# a wrapper for ping binary - used as a prefix for the real command
# e.g., when using mwan3, it is recommended to set it like this:
# ping_prefix_string="mwan3 use gpon exec"
# WARNING: the wrapper must exec ping as the final step, not run it as a subprocess.
# Running ping or fping as a subprocess will lead to problems stopping it.
# WARNING: no error checking - so use at own risk!
ping_prefix_string="mwan3 use gpon exec"

with

# a wrapper for ping binary - used as a prefix for the real command
# e.g., when using mwan3, it is recommended to set it like this:
# ping_prefix_string="mwan3 use gpon exec"
# **replacing gpon with your mwan3 interface**
# WARNING: the wrapper must exec ping as the final step, not run it as a subprocess.
# Running ping or fping as a subprocess will lead to problems stopping it.
# WARNING: no error checking - so use at own risk!
ping_prefix_string="mwan3 use gpon exec"

I am using ping_prefix_string="mwan3 use wan exec" & ping_prefix_string="mwan3 use wanb exec"

The two plots look identical, which should not be possible. There might be something wrong with the logging or the plotting (I assume you made sure to load the two different log files for the two plots...)
Also it looks doubly weird that the high delay periods seem to follow the actual higher load periods instead of coinciding with them.
Could you share the two log files for your two wan interfaces?

How so? Load-balancing is one of the two use cases that mwan3 advertises, no? Bonding/teaming, as you indicate, requires a participating upstream, and for many mass-market consumer ISPs such options are not available. (In Germany the incumbent, as far as I know, does not even offer bonding for affordable business contracts; things might be different for the bespoke "SLA'd" business contracts, but these are outside of my pocket-book range ;)*).

One of the themes in this thread is working around one's ISP's sub-optimal services, and I feel that mwan3 load-balancing falls into the same category. It would be nice if that were not necessary, but I also understand why mass-market ISPs do not want to offer special-menu items that likely only interest small minorities...

*) At the time other ISPs in Europe offered bonded DSL links, the local incumbent started to deploy vectoring (and later profile 35b), allowing similar ~100 Mbps links as other ISPs reached via bonding but using up fewer of the scarce end-user-to-CO wires, so I can see why they never rolled out bonding for the mass market. Some smaller ISPs, however, did/do offer bonding IIRC.

yes mwan3 offers load balancing and it does quite a good job ...

you are absolutely right my mistake ... tried to combine two and obviously missed the b for wanb ...

here you go wanb

and wan

OK, now the shaper settings and the precise loads differ between the two, but the timing still looks like the load data is shifted in relation to the shaper settings and latency data. Also, the weird green ~450 seconds part at the left of these plots looks wrong.

Here the fight is not against the ISP, but against the crappy ISP router that has a fast fiber WAN interface but no LAN ports that can match that speed alone. Therefore, the perfect solution would be to ask the ISP about a compatible SFP module and insert it into a switch/mediaconverter that has an SFP port and allows LAN port aggregation (or simply has a 2.5G LAN port).

Also your configured rates seem a bit too high for what your router seems to be able to shape here.
It looks like your base rate is set to 1000 and the minimum rate to 500, but in the speedtest you barely get 400 Mbps, which in essence means autorate is just wasting some CPU cycles without actually affecting your latency under load.
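
To make that point concrete, here is a minimal sketch of the sanity check I have in mind. The function name `shaper_headroom_check` is made up for illustration and is not part of cake-autorate; the numbers are simply the ones from this post, in Mbit/s.

```shell
#!/bin/sh
# Hypothetical helper: compare a measured unshaped throughput with the
# configured shaper rates. All three arguments are integers in Mbit/s.
shaper_headroom_check() {
    achieved=$1; min_rate=$2; base_rate=$3
    if [ "$achieved" -lt "$min_rate" ]; then
        # The link never reaches even the minimum shaper rate,
        # so the shaper can never become the bottleneck.
        echo "below-min"
    elif [ "$achieved" -lt "$base_rate" ]; then
        # The shaper only bites after autorate has lowered the rate.
        echo "below-base"
    else
        echo "ok"
    fi
}

# ~400 Mbit/s achieved, minimum rate 500, base rate 1000
shaper_headroom_check 400 500 1000   # prints: below-min
```

If this reports "below-min", the configured rates would need to come down to below what the router can actually push before cake can control the queue.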

right ... ok unsure what happened there just did the octave command against the logs generated by the kill command (let me know if you are interested to check the log gz files) ... same logs .. ok next test no wifi clients just speedtest ran from the router itself (Francesco Laurita speedtest)

wanb

wan

Thanks, I completely misread his description, and I agree that load-balancing is not the typical use-case here. Actually, since he seems to be getting considerably less than 1 Gbps using one port, I would argue that the whole exercise does not seem to gain much, no?

+1. Occasionally one can do something like that even without a cooperating ISP, but that depends on the used link technology and how secure the ISP set up the provisioning system.

It seems I was simply confused: the achievable speed stays below your configured rates, so autorate never meaningfully engages, but queueing delay also never noticeably increases as a function of load. The latency pattern of the earlier tests still looks odd though.

yes with cake i am always at 1/2 of the achievable rate ... and that is sqm / qosify and now cake autorate ... the only benefit I see is actually latency: I get an A in most cases when using cake; without it, although I get rates of 1.2 Gbps, I get a C with latencies over 100 ms sometimes.

this is what I see with luci statistics (note - veth10g is wan ; and vethlan1 is wanb)

Mmmh, what is your physical and what is your logical network topology with regard to the two wan interfaces...

something like this

in order to use mwan3 i need to use macvlan ... cake is setup against the veth devices

Do you pay more for 1.2Gbit/s vs 1Gbit/s? I ask because of the challenge associated with benefiting from the extra 200Mbit/s. Have you tried just connecting via a single 1Gbit/s line and verifying that you can actually shape at that rate and that your ISP is actually providing that throughput?

Otherwise I agree with @moeller0 that the plots don't look right - cake isn't actually biting in your plots. Here is what cake biting looks like:

You see how the achieved rate is fairly close to the cake rate? What is happening here is oscillation around the maximum available capacity at my chosen latency settings on my 4G link.

But in your plots the achieved rate never gets close to the cake rate:

I'm not an expert on connection types, etc. What sort of connection is this? I'm wondering whether your capacity is truly variable in a way that would benefit from dynamic cake bandwidth adjustment or whether you might be better with just a fixed cake rate. But the others here (@moeller0, @dlakelan, @patrakov) can better advise on these things than I.

no i don't pay more for 1.2g vs 1g ... I've asked for an isp router that offers a 2.5g port still waiting ..hence the idea of using mwan3

my observation to date is that sqm helps with latency for my use case... in spite of using docsis on the sqm setup etc I still haven't found a setting that bumps it over 600-700 Mbit/s ... just took it out ... I will run for a few days and see what the report shows ... still, what you guys created is tremendously useful ... I can play with the settings and see if there is any improvement... thank you again also for all your help

I would ask as well, does this approach actually ever yield gross throughput in the expected 1.2 Gbps range?

But then this is apparently DOCSIS, which traditionally is uplink starved (without full duplex DOCSIS, upload bands cannot be used to distribute what used to be cable's mainstay, television and to a lesser extent radio, so DOCSIS providers that still have a TV business are always hesitant about how much uplink they are willing to tolerate).

Great idea, but ATM this does not seem to be a winner yet?

Which, to be honest, is what sqm is designed to do, so yeah: Go sqm, go! :wink:

Would be interesting to see whether speedtests without sqm over a single link will hit around the 1000 * ((1500-20-20)/(1500+38)) = 949.28 Mbps maximal IPv4/TCP throughput possible over gigabit ethernet?
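
For reference, the arithmetic above as a small sketch. The helper name `max_tcp_goodput_mbps` is just for illustration. It assumes a 1500-byte MTU, 20-byte IPv4 and 20-byte TCP headers without options, and 38 bytes of per-packet Ethernet overhead on the wire (preamble, start-of-frame delimiter, header, FCS and inter-frame gap).

```shell
#!/bin/sh
# Maximum TCP goodput over Ethernet for a given line rate.
# Arguments: line rate in Mbps, MTU, IP header bytes, TCP header bytes,
# per-packet Ethernet overhead in bytes.
max_tcp_goodput_mbps() {
    awk -v rate="$1" -v mtu="$2" -v ip="$3" -v tcp="$4" -v oh="$5" \
        'BEGIN { printf "%.2f\n", rate * (mtu - ip - tcp) / (mtu + oh) }'
}

# Gigabit Ethernet, IPv4/TCP without options
max_tcp_goodput_mbps 1000 1500 20 20 38   # prints: 949.28
```

A speedtest result well below that figure on a single port, with sqm disabled, would point at the router or the ISP rather than at the shaper.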

As @Lynx mentioned, unless your DOCSIS segment varies wildly, you might be happier with just setting a single sqm/cake instance to fixed limits and being done with it. Then again, some users on DOCSIS segments experience regular throughput degradation and latency increases during peak usage time, and for those segments autorate can still be a great match.

@rmandrad just to add to the post above that @_FailSafe had a DOCSIS connection and used variable rate cake given the variability he experienced. So it could be that in your case using an autorate algorithm is beneficial. It seems some testing with just a simple setup (1Gbit/s cable) is in order. It'd be nice to run cake-autorate to track bandwidth and latency when hammering the connection with just a normal single-cable setup.

I tried setting stall_detection_thr=10 and restarted and I'm still getting the reflector misbehaving and "dl_owd_baseline_us exceeds the minimum by set threshold" a couple times per hour. Does this rarely happen on your connection?

So Starlink has both repeated epochs of increased delay (like every 15 seconds) and, as far as I know, relatively high packet loss; both of these can result in stall/misbehaviour detection with the current defaults. So it might be necessary to relax some of these thresholds for your link. Yes, it would be nice if these could auto-tune themselves for a given link, but ATM we have too little data to figure out what to do next... @Lynx, which parameters would you recommend to tweak here to help?

@gba can you grep all the reflector entries in a log and we can plot them?

The relevant parameters here are:

reflector_comparison_interval_mins=1       # how often to compare reflectors 
reflector_owd_baseline_delta_thr_ms=10     # max increase from min baseline before reflector rotated
reflector_owd_delta_ewma_delta_thr_ms=10   # max increase from min delta ewma before reflector rotated

So every comparison interval the reflectors are compared, and if a reflector's baseline or delta EWMA exceeds the minimum across all reflectors by more than the corresponding threshold, that reflector is replaced.

We could try increasing 10 to 20 for both.
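
Roughly, the comparison works like the sketch below. This is not the actual cake-autorate code: the function name `reflectors_to_rotate`, the input format, and the sample values are made up for illustration, and it treats a single OWD value per reflector rather than handling download and upload separately.

```shell
#!/bin/sh
# Print the reflectors that would be rotated out.
# $1: baseline threshold in ms, $2: delta EWMA threshold in ms.
# stdin: one line per reflector: "<address> <owd_baseline_us> <owd_delta_ewma_us>"
reflectors_to_rotate() {
    awk -v base_thr_ms="$1" -v ewma_thr_ms="$2" '
        {
            addr[NR] = $1; base[NR] = $2 + 0; ewma[NR] = $3 + 0
            # Track the minimum of each metric across all reflectors.
            if (NR == 1 || base[NR] < min_base) min_base = base[NR]
            if (NR == 1 || ewma[NR] < min_ewma) min_ewma = ewma[NR]
        }
        END {
            # A reflector is rotated if either metric exceeds the minimum
            # by more than its threshold (thresholds converted to microseconds).
            for (i = 1; i <= NR; i++)
                if (base[i] > min_base + base_thr_ms * 1000 ||
                    ewma[i] > min_ewma + ewma_thr_ms * 1000)
                    print addr[i]
        }'
}

# Hypothetical sample data (documentation addresses, values in microseconds)
sample() {
    printf '%s\n' \
        '192.0.2.1 20000 1500' \
        '192.0.2.2 24000 2000' \
        '192.0.2.3 36000 1800'
}

sample | reflectors_to_rotate 10 10   # prints: 192.0.2.3
sample | reflectors_to_rotate 20 20   # prints nothing
```

With the thresholds at 10 the third reflector sits 16 ms above the minimum baseline and gets rotated; at 20 it survives, which is the effect the proposed change would have.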

But it'd be helpful to see the data.

Also don't forget the rotations aren't necessarily a bad thing. The whole point is to try to set up a convergence to a good set.