CAKE w/ Adaptive Bandwidth

I started autorate in manual mode to look at the output:

DEBUG; 2022-11-14-00:29:44; 1668400184.214326; Starting CAKE-autorate 1.1.0
DEBUG; 2022-11-14-00:29:44; 1668400184.216444; Down interface: ifb4eth0.2 (15000 / 25000 / 50000)
DEBUG; 2022-11-14-00:29:44; 1668400184.218474; Up interface: eth0.2 (3000 / 5000 / 10000)
DEBUG; 2022-11-14-00:29:44; 1668400184.220657; rx_bytes_path: /sys/class/net/ifb4eth0.2/statistics/tx_bytes
DEBUG; 2022-11-14-00:29:44; 1668400184.222640; tx_bytes_path: /sys/class/net/eth0.2/statistics/tx_bytes
DEBUG; 2022-11-14-00:29:44; 1668400184.224788; log_file_path: /var/log
DEBUG; 2022-11-14-00:29:44; 1668400184.313899; Warning: bufferbloat refractory period: 300000 us.
DEBUG; 2022-11-14-00:29:44; 1668400184.335878; Warning: but expected time to overwrite samples in bufferbloat detection window is: 1500000 us.
DEBUG; 2022-11-14-00:29:44; 1668400184.338189; Warning: Consider increasing bufferbloat refractory period or decreasing bufferbloat detection window.
DEBUG; 2022-11-14-00:30:03; 1668400203.763477; no ping response from reflector: 1.0.0.1 within reflector_response_deadline: 1s
DEBUG; 2022-11-14-00:30:03; 1668400203.785959; reflector=1.0.0.1, sum_reflector_offences=0 and reflector_misbehaving_detection_thr=3
DEBUG; 2022-11-14-00:30:22; 1668400222.029982; no ping response from reflector: 8.8.8.8 within reflector_response_deadline: 1s
DEBUG; 2022-11-14-00:30:22; 1668400222.032505; reflector=8.8.8.8, sum_reflector_offences=0 and reflector_misbehaving_detection_thr=3
DEBUG; 2022-11-14-00:30:40; 1668400240.314502; no ping response from reflector: 8.8.4.4 within reflector_response_deadline: 1s
DEBUG; 2022-11-14-00:30:40; 1668400240.327080; reflector=8.8.4.4, sum_reflector_offences=0 and reflector_misbehaving_detection_thr=3
DEBUG; 2022-11-14-00:31:55; 1668400315.362999; no ping response from reflector: 8.8.4.4 within reflector_response_deadline: 1s
DEBUG; 2022-11-14-00:31:55; 1668400315.366103; reflector=8.8.4.4, sum_reflector_offences=0 and reflector_misbehaving_detection_thr=3
DEBUG; 2022-11-14-00:31:56; 1668400316.392848; no ping response from reflector: 8.8.4.4 within reflector_response_deadline: 1s
DEBUG; 2022-11-14-00:31:56; 1668400316.406329; reflector=8.8.4.4, sum_reflector_offences=0 and reflector_misbehaving_detection_thr=3
DEBUG; 2022-11-14-00:32:12; 1668400332.653389; no ping response from reflector: 1.1.1.1 within reflector_response_deadline: 1s
DEBUG; 2022-11-14-00:32:12; 1668400332.667514; reflector=1.1.1.1, sum_reflector_offences=1 and reflector_misbehaving_detection_thr=3

What are the warnings about and do I need to change anything?
Also, I noticed that it's getting no ping response from some reflectors... Is this normal?

No, typically not. The code tries to replace unresponsive reflectors, but the initial list is not that long, so it will bring these reflectors back into service quite quickly...

@Lynx, maybe we should increase the default list of reflectors to, say, 10 or so, so that the health check has a chance of improving things?

Or we could take inspiration from @tievolu and create a separate script that prunes a large list of reflector candidates down to a reasonable set (say 3 times the number of concurrently used reflectors)?

Yes seems like we should. At the moment we use Google, Cloudflare and Quad9. Do you happen to know any other big hitters we should add?

By the way, shouldn't we find it odd that @hammerjoe sees so many missed ICMP responses from Google and Cloudflare?

Could be similar to @patrakov's ISP, which has some general ICMP limits, so maybe all that is needed is to ease up on the aggregate measurement frequency a bit. For testing I would reduce the frequency by at least a factor of two and see whether that results in fewer missing replies.

I have not thought about this deeply, anything heavily anycasted should do....
Maybe @tievolu could offer a proposal or @_FailSafe, please?

A few options:

94.140.14.14     AdGuard DNS
64.6.64.6        Neustar DNS 
208.67.222.222   OpenDNS 
185.228.168.168  CleanBrowsing DNS
149.112.112.112  Alternative Quad 9 DNS

There are lots of other alternate IPs for these providers too:

AdGuard

94.140.14.15
94.140.14.140
94.140.14.141
94.140.15.15
94.140.15.16

Neustar

64.6.65.6
156.154.70.1
156.154.70.2
156.154.70.3
156.154.70.4
156.154.70.5
156.154.71.1
156.154.71.2
156.154.71.3
156.154.71.4
156.154.71.5

OpenDNS

208.67.220.2
208.67.220.123
208.67.220.220
208.67.222.2
208.67.222.123

CleanBrowsing

185.228.168.9
185.228.168.10
185.228.169.11
185.228.169.9
185.228.169.168

Quad 9

9.9.9.10
9.9.9.11
149.112.112.10
149.112.112.11

Great, I think we should add about two IPs per public DNS provider, as these are likely anycasted and will expect some ICMP traffic. Or we can add all of them to the candidate list, simply pick N randomly from the set, and afterwards just do round-robin replacements if necessary?
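
To make the "pick N randomly, then replace round-robin" idea concrete, here is a minimal sketch. It is not how cake-autorate is implemented; the function names `make_pool` and `replace_reflector` are made up for illustration, and the addresses are simply copied from the lists posted above.

```python
import random
from collections import deque

# Candidate reflectors copied from the lists posted earlier in this thread.
CANDIDATES = [
    "94.140.14.14", "94.140.14.15",      # AdGuard
    "64.6.64.6", "64.6.65.6",            # Neustar
    "208.67.222.222", "208.67.220.220",  # OpenDNS
    "185.228.168.168", "185.228.168.9",  # CleanBrowsing
    "9.9.9.10", "149.112.112.112",       # Quad9
]

def make_pool(candidates, n_active, rng):
    """Hypothetical helper: shuffle the candidates, use the first n_active
    as active reflectors and keep the rest in a spare queue."""
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    return shuffled[:n_active], deque(shuffled[n_active:])

def replace_reflector(active, spares, bad):
    """Hypothetical helper: swap a misbehaving reflector for the next spare
    and send the bad one to the back of the queue (round-robin)."""
    if not spares:
        return  # nothing to swap in, keep the current reflector
    idx = active.index(bad)
    active[idx] = spares.popleft()
    spares.append(bad)

rng = random.Random(0)  # fixed seed only so the example is repeatable
active, spares = make_pool(CANDIDATES, 4, rng)
print("active:", active)
replace_reflector(active, spares, active[0])
print("after replacing the first reflector:", active)
```

With a longer spare queue, a reflector that was swapped out is only retried after every other spare has had a turn, which addresses the "brought back into service too quickly" problem mentioned above.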

I think whatever autorate is doing is making Xplornet throttle my speeds. Yesterday I tested it all day and they would not go over 15-20 down and 3 up.
I stopped autorate and after a couple of hours it went up to 40-50 down again and 8 up.
I deduce that whatever autorate is doing is deemed too aggressive by them.

One thing with Xplornet wireless internet, and I think it's the same for others as well, is that although the speeds do vary throughout the day, they do not constantly swing.
I.e. if the tower is congested then it reduces the speed to, say, 50% of the plan, which means that it will hover around 25 Mbps with a bit of fluctuation for quite some time.
It will not swing between 30 Mbps one second, 15 the next, then 40 the next and so on.
So I think there is no need to check the download and upload speeds every second.
I am thinking 5 seconds or even longer is probably more than enough, because SQM/cake should still be able to handle a sudden change of speed for that amount of time, imo.

So what settings do I need to change so that autorate only checks every 5 seconds?
Is it reflector_ping_interval_s? I changed it to 2.

@hammerjoe the first thing you need to do is to increase the delay thresholds in the config file:

# delay threshold in ms is the extent of OWD increase to classify as a delay
# these are automatically adjusted based on maximum on the wire packet size
# (adjustment significant at sub 12Mbit/s rates, else negligible)  
dl_delay_thr_ms=20.0 # (milliseconds)
ul_delay_thr_ms=20.0 # (milliseconds)

(This is still a small change, so you could try larger numbers if these do not work.)
According to your cake-autorate_config.sh you had these at the default value of 12.5 each, which is too low for your link (this is in addition to any other issue). To elaborate, your idle RTTs cross these thresholds so often that autorate gets essentially stuck on the configured minimum rates, while under idle conditions without load we actually expect the shaper to slowly creep up to the configured baseline rates. The fact that it does not do that even outside of the bigger pink load spikes on your link implies you need to adjust the threshold.

Now, the threshold is something that needs to be adjusted for each link anyway... we might come up with a more automatic way of proposing threshold values, but in the end a network's administrator (aka you for your own network) will need to set these values, as there is not going to be one size that fits all in all circumstances.

I would propose we try this first and only embark on the ping frequency question after we know whether adjusting the thresholds did help or not.

By default we are checking this 5 times per second interleaved for 4 reflectors for an approximate aggregate rate of 20Hz (and period of 50ms)...
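
The arithmetic behind that figure, as a small sketch. The function name `aggregate_ping_timing` and the parameter `n_reflectors` are made up for illustration; only `reflector_ping_interval_s` is an actual config setting quoted in this thread, and the model assumes the pings to the reflectors are evenly interleaved.

```python
def aggregate_ping_timing(reflector_ping_interval_s, n_reflectors):
    """Return (aggregate rate in Hz, aggregate period in ms), assuming each
    of n_reflectors is pinged once per reflector_ping_interval_s and the
    pings are evenly interleaved."""
    rate_hz = n_reflectors / reflector_ping_interval_s
    period_ms = 1000.0 / rate_hz
    return rate_hz, period_ms

# 0.2 s is the default described above; 2 s and 4 s are the slower
# values mentioned in this thread.
for interval_s in (0.2, 2, 4):
    rate_hz, period_ms = aggregate_ping_timing(interval_s, 4)
    print(f"interval {interval_s} s -> {rate_hz:.1f} Hz aggregate, "
          f"one response every {period_ms:.0f} ms")
```

So with 4 reflectors, an interval of 0.2 s gives the 20 Hz mentioned above, while an interval of 4 s gives one probe per second in aggregate.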

Autorate has been running since last night, and what I am seeing is that whatever it's doing, Xplornet doesn't like it and is throttling my speeds.
In my observations autorate is updating SQM correctly; it's just getting the info in a way that my provider hates.

So I have turned it off, and I will wait for the throttling to go away and then turn it on again.
I have attached my current config, which is the one I was using.

Please let me know what settings you would like me to test to see what happens and I will log it and report.

On another note, something that came to mind.
Autorate is supposed to find out what the current internet speed is and adjust SQM accordingly.
I'm not exactly sure what autorate is doing, but it seems to me that it's duplicating work that is already being done by others.

Ping and latency are, afaik, dealt with by SQM; that's its purpose. It just needs to know the current speed to do its work.
The router knows what the current speed is, so why can't autorate query the router to determine the current speed and adjust SQM?

Something like this conversation.

Autorate: "Hey router, whats up? what is the current internet speed you are seeing bro?"
Router:"Not much man, just cruising along at 25mbps"
Autorate: "Cool, see ya later. Hey SQM, what's your speed, dude?"
SQM: " Hey, I was told I can work with 50mbps"
autorate:" Yeah, thats a bit high, bring it down x% please"
sqm : "Done, thx bro"
Autorate :"No problema gringo... wheres my newspaper, have some time to kill".
rinse and repeat.

Why isn't autorate doing that, so it doesn't need to anger the boss (Xplornet)? :slight_smile:

Not exactly, cake-autorate monitors the induced latency and will reduce the traffic shaping rate if the induced delay increases (above a threshold), this results not in the maximum "speed" (aka rate) but the maximum speed which does not induce undue delay.

Yes, the problem is not new, and this thread alone has contributions of at least 4 different implementations of the same general idea.

Yes, alas on variable rate links the speed to put into SQM is a moving target... hence all the complication of tracking latency-under-load and using that to control the shaper speed, exactly to deal with those cases where a single fixed shaper rate is unsatisfactory.

Because the router does not really know... for download traffic the ISP's device upstream would know, but it does not tell us, so we need to get creative. For upstream traffic it is often similar: many devices, like DSL modems, cable modems and the like, do not tell our router about the speed, so again we need to get creative. In theory DOCSIS 3.1 supports some form of upstream AQM that is interconnected with the request-grant mechanism and hence has the promise to work better than autorate; however, in my limited experience that promise stays mostly unfulfilled. Let's hope DOCSIS 4.0 or low-latency DOCSIS will change this for the better.

Again, we do not know the instantaneous available capacity; if we did, we would plug it into SQM/cake ASAP and be done with it (until the capacity changes, then rinse and repeat). Instead, what happens if we exceed our capacity share is that some device's buffers will fill up (we are sending data into the buffer faster than it can be emptied), and as a consequence all packets now experience the queueing delay of that buffer... and that is essentially what we measure and what we use to decide how to change the shaper rate (we also look at the achieved throughput as an optimization). So basically, at the core, if latency is below the configured threshold we (slowly) increase the shaper rate, and if latency is above the threshold we (more abruptly) reduce the shaper rate...
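
As a rough illustration of that core rule only (not the actual cake-autorate code, which also takes achieved load and the baseline rate into account), a sketch might look like the following. The function name and the increase/decrease factors are invented for the example; the rates are the download figures from the log at the top of the thread, which I am assuming are minimum and maximum in kbit/s.

```python
def next_shaper_rate(rate_kbps, delay_ms, delay_thr_ms, min_kbps, max_kbps,
                     increase_factor=1.02, decrease_factor=0.75):
    """One control step: back off sharply when the measured delay exceeds
    the threshold, otherwise creep upwards; always stay within the limits.
    The two factors are illustrative values, not cake-autorate defaults."""
    if delay_ms > delay_thr_ms:
        new_rate = rate_kbps * decrease_factor  # abrupt reduction
    else:
        new_rate = rate_kbps * increase_factor  # slow increase
    return max(min_kbps, min(max_kbps, new_rate))

rate = 25000.0
# Made-up delay samples in ms: quiet, then a burst of bufferbloat, then quiet.
for delay in (5.0, 6.0, 35.0, 40.0, 8.0, 7.0):
    rate = next_shaper_rate(rate, delay, delay_thr_ms=20.0,
                            min_kbps=15000, max_kbps=50000)
    print(f"delay {delay:5.1f} ms -> shaper rate {rate:8.0f} kbit/s")
```

This also shows why a threshold below the link's idle delay is a problem: if idle samples keep landing above `delay_thr_ms`, every step takes the reduction branch and the rate stays pinned at the minimum.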

EDIT: this is the wrong place, sorry, I should have paid more attention
try changing:

# interval in ms for monitoring achieved rx/tx rates
# this is automatically adjusted based on maximum on the wire packet size
# (adjustment significant at sub 12Mbit/s rates, else negligible)  
monitor_achieved_rates_interval_ms=200 # (milliseconds) 

to say 4000 (for one probe per second) to test something so slow that your ISP should not throttle it.

EDIT: Try changing

reflector_ping_interval_s=5 # (seconds, e.g. 0.2s or 2s)

instead, but you already did, sorry....

OR set-up a VPN and ping across that VPN to work around your ISP's apparent throttling.

Thanks for the explanation. I'm just learning as I go along.
I will wait for the throttle to ease up and will restart autorate and log the results to see how it handles.

Great and helpful explanation from @moeller0 above @hammerjoe.

Should be:

reflector_ping_interval_s=4 # (seconds, e.g. 0.2s or 2s)

And @hammerjoe, don't forget to change the delay thresholds (dl_delay_thr_ms and ul_delay_thr_ms).

Maybe even try 30.0 or 40.0.

And you could also relax these to:

# bufferbloat is detected when (bufferbloat_detection_thr) samples
# out of the last (bufferbloat detection window) samples are delayed
bufferbloat_detection_window=10  # number of samples to retain in detection window
bufferbloat_detection_thr=5      # number of delayed samples for bufferbloat detection
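
For illustration, the rule described in those two comment lines can be sketched as below. This is my own minimal reading of the comments, not the cake-autorate implementation; the class name and the delay samples are made up.

```python
from collections import deque

class BufferbloatDetector:
    """Keep the last `window` samples and report bufferbloat when at least
    `threshold` of them exceeded the delay threshold."""

    def __init__(self, window=10, threshold=5):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # oldest sample drops out automatically

    def add_sample(self, delay_ms, delay_thr_ms):
        self.samples.append(delay_ms > delay_thr_ms)
        return sum(self.samples) >= self.threshold

detector = BufferbloatDetector(window=10, threshold=5)
# Made-up delay samples in ms, checked against a 20 ms threshold.
for delay in (8, 25, 9, 30, 31, 28, 40, 7, 6, 5):
    detected = detector.add_sample(delay, delay_thr_ms=20.0)
    print(f"delay {delay:3d} ms -> bufferbloat detected: {detected}")
```

Note that the window is counted in samples, so the time it covers depends on how often ping responses arrive: at one response per second in aggregate, 10 samples span roughly 10 seconds.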

I use a VPN because my ISP has in the past selectively throttled on ports 80 and 443. But I send ICMPs out in a way that bypasses the VPN.

@hammerjoe I rather doubt your ISP is a) detecting your ICMPs and then b) punishing your available bandwidth for that. That seems far-fetched. But you could indeed, as @moeller0 suggests, just push everything over a VPN, and that way the ISP only sees traffic to one IP and that's it.

So I'd recommend relaxing the settings first (cake-autorate is very aggressive by default, partly because I favour low latency over high numbers in speed tests), and posting a log file so we can see what's happening.

And don't forget, when you upload the log file, to also upload the revised config.

I think what happens is that the ISP detects ICMP "abuse" and then throttles ICMP hard, and as a result autorate does not work... I would probably try to look at the ISP's EULA/terms of service and politely ask them to allow high ICMP use on the link under discussion.

It could well be, but I don't think it's just ICMP that gets throttled.

btw do I keep monitor_achieved_rates_interval_ms=4000 # (milliseconds)?

No, please don't, my mistake. monitor_achieved_rates_interval_ms controls the granularity of our load measurements. Theoretically smaller is better here, except that making this very small is a bit costly, and if it is too small it will behave a bit erratically; 200ms appears to be an acceptable compromise.

Just did a little more testing on my newly outdoor-mounted Zyxel NR7101 with 4K video around 50% to 80% of the timeline.

Notice the big rate drop on step load (lower bandwidth Netflix traffic) after the sustained baseline on zero load at 20 Mbit/s after the 4K video test? I suppose that means my base rate at 20 Mbit/s is too high because a step load around this time of day can cause bufferbloat.

So perhaps my baseline should be more like 10/10? What do you think @moeller0?

I suppose this could be made part of an automated cake-autorate test connection routine: maintain zero load for a while and then start saturating download, and then check to see how well the step load is managed? After all, such step loads could presumably interrupt low bandwidth Zoom/Teams calls?

@hammerjoe Just an idea: perhaps your provider is rate limiting you because your high speed is a burstable rate? Does your provider reserve the right to limit connection speed based on usage, peak times, etc.?

Yes, they have traffic management. Nobody really knows how it works, but they adapt user speeds based on congestion and on how much each user is using.