CAKE w/ Adaptive Bandwidth [October 2021 to September 2022]

Yes, pinging reflectors located on the WAN side will not show LAN-side WiFi bufferbloat. I think redoing his test with internal reflectors should help, and I agree he needs to use purely internal reflectors. This just illustrates that GiGo is still alive and well, or rather that configuring your script has some hidden dragons.

Yes, it still does, but it is important to select the reflectors such that they all live on the remote side of the problematic link.

I'm pinging the downstairs AP (Ethernet-connected) and the upstairs AP (wirelessly connected). Wouldn't a median make more sense in this case? My thinking is that the Ethernet connection is full duplex and won't suffer much. Another approach might be to drop the downstairs AP and ping only the upstairs one, taking the minimum value, I reckon.
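To make the trade-off concrete, here is a minimal sketch of the two aggregation choices. The reflector names and RTT figures are invented for illustration and are not measurements from this thread.

```python
from statistics import median

# Hypothetical RTT samples in ms, one per reflector (values invented for illustration).
rtts_ms = {"downstairs_ap_wired": 0.6, "upstairs_ap_wireless": 14.0}

def aggregate(samples, how):
    """Collapse the per-reflector RTTs into a single figure."""
    values = list(samples.values())
    return min(values) if how == "min" else median(values)

print("min:   ", aggregate(rtts_ms, "min"))     # always picks the wired AP here
print("median:", aggregate(rtts_ms, "median"))  # midpoint of the two samples
```

With only these two reflectors the minimum is always set by the wired AP, so delay on the wireless hop never shows up in it. That is consistent with the suggestion to ping only the upstairs AP if the minimum is kept.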

Hi, I tried the script yesterday on a VDSL connection.
It certainly seems to work and keeps bufferbloat down even under heavy load: 3 Linux ISO downloads + Android TV + a video upload to YouTube, all at the same time, plus a speedtest to see what it can do.

The PC was not given top priority by the firewall's DSCP tagging, and I use layer_cake in SQM.

I have noticed 3 "problems" so far.
The first problem: I added the file that creates the service so that I could enable it via LuCI. When I click "enable" and then "start", the service appears to start, I think. But if I go to another LuCI menu (e.g. SQM) and then come back to the services tab, the service is shown as "disabled", not enabled.
Why is this happening?

Also, nothing is written to the log file.
log

The biggest problem came when I tried to play online.
I was disconnected from the game servers many times during gameplay; I could not finish a single game session. It looks like the internet drops out and reconnects, or there is too much packet loss, etc.

I don't think it is packet loss, because I would have spotted that during the several speedtests. I think the connection drops completely for, say, 2 seconds, which is enough to disconnect me from the game because it is real-time; you all understand what I mean.
The settings were ...



During speedtest from router..

Looks pretty good there, no? Hey @moeller0 we have turned that GiGo into G.I. Joe - he will: 'bust a cap' in that bufferbloat ass!

image

How do you rate the performance in your WiFi case @amteza, and what changes have you made over the default? How did you get a tick of 0.1s? I assumed such low values would not be possible in shell.

It seems the overall simple routine in the shell script works. Perhaps just some tweaking needed now to make it more robust or to implement in C? Any thoughts @moeller0?

@Knomax I don't know why the service file is not launching the script. Is it in /root with 'chmod +x' set? You have set a very low minimum bandwidth which presumably lingers after manual run and then stays there since the service file does not execute script. Just a guess.

@moeller0 any ideas?

The service script I wrote is just:

Based on @Lynx's algorithm, I devised something of my own.
Calculate/update median ping via this simple online algorithm: FAME

if delta > 5ms
  if rx_load > tx_load
    reduce download speed as Lynx dictates (rate_adjust_RTT_spike)
    reduce upload speed by half of (rate_adjust_RTT_spike)
  else
    reduce upload speed as Lynx dictates (rate_adjust_RTT_spike)
    reduce download speed by half of (rate_adjust_RTT_spike)
else
  try to increase bandwidth as Lynx dictates: rate_adjust_load_high
make sure that speed stays between minimum and maximum

Python:

#!/usr/bin/env python3

from multiping import multi_ping
import subprocess,time

def command(argument):
    # Run a command given as one space-separated string;
    # check=True raises CalledProcessError on a non-zero exit status
    return subprocess.run(argument.split(),check=True,capture_output=True)


class RunningPercentile:
    def __init__(self,x,step=1):
        self.step = max(x/2.0,step)
        self.x = x
        #print(x,self.step)

    def push(self, observation):
        if self.x > observation:
            self.x -= self.step
        elif self.x < observation:
            self.x += self.step
        if abs(observation - self.x) < self.step:
            self.step /= 2.0


addrs = ["8.8.8.8", "1.1.1.1"]

ul_if="pppoe-wan" # upload interface
dl_if="ifb4pppoe-wan" # download interface

cur_ul_rate=max_ul_rate=1000
min_ul_rate=500

cur_dl_rate=max_dl_rate=9157
min_dl_rate=4000

tick_duration=1 # seconds to wait between ticks

rate_adjust_RTT_spike=0.05 # how rapidly to reduce bandwidth upon detection of bufferbloat
rate_adjust_load_high=0.01 # how rapidly to increase bandwidth upon high load detected

max_delta_RTT=5 # increase from baseline RTT for detection of bufferbloat

def read_bytes(x,mode):
  with open("/sys/class/net/"+x+"/statistics/"+mode) as f:
    for line in f:
      return(int(line))

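def myping():
  # Return the smallest RTT (in ms) across all reflectors,
  # or 1000 ms if no reflector answered or the ping failed
  try:
    responses, no_responses = multi_ping(addrs,10)
    return 1000*min(responses.values())
  except Exception:
    return 1000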

mean=RunningPercentile(myping())

t_prev_bytes=time.time()
prev_rx_bytes=read_bytes(dl_if,"rx_bytes")
prev_tx_bytes=read_bytes(ul_if,"tx_bytes")
#print(prev_rx_bytes)
#print(prev_tx_bytes)

def update_rates():
  global prev_rx_bytes
  global prev_tx_bytes
  global t_prev_bytes
  global cur_dl_rate
  global cur_ul_rate

  current_ping=myping()
  delta_RTT=current_ping-mean.x
  mean.push(current_ping)
  #print(delta_RTT,mean.x)

  cur_rx_bytes=read_bytes(dl_if,"rx_bytes")
  cur_tx_bytes=read_bytes(ul_if,"tx_bytes")
  t_cur_bytes=time.time()

  rx_load=(8.0/1000.0)*(cur_rx_bytes-prev_rx_bytes)/(t_cur_bytes-t_prev_bytes)*(1.0/cur_dl_rate)
  tx_load=(8.0/1000.0)*(cur_tx_bytes-prev_tx_bytes)/(t_cur_bytes-t_prev_bytes)*(1.0/cur_ul_rate)
  
  t_prev_bytes=t_cur_bytes
  prev_rx_bytes=cur_rx_bytes
  prev_tx_bytes=cur_tx_bytes
  
  if delta_RTT > max_delta_RTT:
    if rx_load > tx_load:
      cur_dl_rate=cur_dl_rate-rate_adjust_RTT_spike*(max_dl_rate-min_dl_rate)
      cur_ul_rate=cur_ul_rate-rate_adjust_RTT_spike*.5*(max_ul_rate-min_ul_rate)
    else:
      cur_dl_rate=cur_dl_rate-rate_adjust_RTT_spike*.5*(max_dl_rate-min_dl_rate)
      cur_ul_rate=cur_ul_rate-rate_adjust_RTT_spike*(max_ul_rate-min_ul_rate)
  else:
    cur_dl_rate=cur_dl_rate + rate_adjust_load_high*(max_dl_rate-min_dl_rate)
    cur_ul_rate=cur_ul_rate + rate_adjust_load_high*(max_ul_rate-min_ul_rate)
  if cur_dl_rate<min_dl_rate:
    cur_dl_rate=min_dl_rate
  if cur_ul_rate<min_ul_rate:
    cur_ul_rate=min_ul_rate
  if cur_dl_rate>max_dl_rate:
    cur_dl_rate=max_dl_rate
  if cur_ul_rate>max_ul_rate:
    cur_ul_rate=max_ul_rate
  # Overwrite the status file with the latest values and close it after each write
  with open("/tmp/autorate.txt","w") as f:
    print("%.2f; %.2f; %.2f; %.2f; %.2f; %.2f; %.2f; %.2f;" % (rx_load,tx_load,mean.x,mean.step,current_ping,delta_RTT,cur_dl_rate,cur_ul_rate), file=f)

while True:
  t_start=time.time()
  update_rates()
  command("tc qdisc change root dev " + ul_if + " cake bandwidth " + str(int(cur_ul_rate)) + "Kbit")
  command("tc qdisc change root dev " + dl_if + " cake bandwidth " + str(int(cur_dl_rate)) + "Kbit")
  t_end=time.time()
  sleep_duration=tick_duration-(t_end-t_start)
  if sleep_duration > 0.0:
    time.sleep(sleep_duration)
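To see how the RunningPercentile estimator in the script behaves without touching the network, here is a small offline sketch. The class is copied from the script above; the helper `track` and the sample RTT series (a 12 ms baseline with occasional 50 ms spikes) are assumptions made up for illustration.

```python
class RunningPercentile:
    """Online median estimate, as in the script above."""
    def __init__(self, x, step=1):
        self.step = max(x / 2.0, step)
        self.x = x

    def push(self, observation):
        # Move the estimate one step towards the observation ...
        if self.x > observation:
            self.x -= self.step
        elif self.x < observation:
            self.x += self.step
        # ... and halve the step once the estimate is within one step of it.
        if abs(observation - self.x) < self.step:
            self.step /= 2.0

def track(samples, start):
    """Feed a list of RTT samples (ms) into a fresh estimator and return it."""
    est = RunningPercentile(start)
    for s in samples:
        est.push(s)
    return est

# Hypothetical RTT series in ms: mostly baseline, with two isolated spikes.
samples = [12.0, 50.0, 12.0, 12.0, 50.0, 12.0, 12.0, 12.0]
est = track(samples, start=12.0)
print("estimate: %.3f ms, step: %.5f" % (est.x, est.step))
```

Note that in this variant the step is only ever halved, never enlarged, so isolated spikes barely move the estimate. The flip side is that the estimate also follows a genuine shift in baseline RTT more slowly the longer the script has been running.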

I had placed it in init.d without chmod +x.

Wrong permissions on the file... the service is working now.
logg
logg2
What about the log file? Its size increases continuously!
If the router runs for 24 hours it can get too big.

Read the readme!

And can you please upload that file somewhere or plot data in Excel?

It would be good to see how the bandwidth and RTT look in a plot like the plots I did.


Terrific work! Looks a lot nicer than the shell given the lack of bc, doesn't it?

Does it work on your connection? What would be the dependencies in OpenWrt for that to work?

How does performance compare with shell?

Any thoughts on efficacy of python as a solution for the general case?

Would you happen to have the skills to implement in C? I would have at one point but haven't programmed in C in over a decade.

I restart it every hour to create a new log file.

0 * * * * /etc/init.d/sqm-autorate restart >/dev/null 2>&1

Logfile

Well that can be used to keep the data or, as it says in the readme, you can disable the output lines.

Here is your data plotted:

Following the first unexpected ping spike, this looks like what I expect to see - bandwidth (green) increases until RTT (orange) spikes occur or load (blue) goes low:

image

This is what the routine is designed to do.

What I am troubled by is that you appear to have ping spikes that are not associated with the bandwidth ramping up. In particular, this (presumably relating to your connection, not the script) looks bad to me:

Why do you get these periodic huge ping spikes with no / very small load?

rx_load  tx_load  b_RTT   RTT     d_RTT   dl_bw   ul_bw
0.20     0.05     11.9    11.96   0.06    30000   3000
0.00     0.01     12.29   50.61   38.71   30000   3000
0.20     0.05     11.93   11.89   -0.4    30000   3000
0.10     0.03     12.29   50.49   38.59   30000   3000
0.00     0.01     11.95   12.92   0.98    30000   3000
0.15     0.04     12.34   50.57   38.62   30000   3000
0.00     0.01     12.35   13.73   1.39    30000   3000

If you have the script not running, and just set CAKE bandwidth to your minimum, and just run ping 8.8.8.8 do you see jumps from your baseline 12ms or so up to 60ms? Those jumps don't look healthy to me. Maybe you need to complain to your ISP.

But I am not an expert.
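For anyone wanting to pick such rows out of a longer log automatically, a rough sketch follows. The rows are the ones from the excerpt above; the function name `idle_spikes` and the 0.5 load cut-off are assumptions for illustration, while the 5 ms RTT threshold is borrowed from max_delta_RTT in the Python script earlier in the thread.

```python
# Rows from the log excerpt above:
# (rx_load, tx_load, baseline_RTT, RTT, delta_RTT, dl_bw, ul_bw)
rows = [
    (0.20, 0.05, 11.90, 11.96, 0.06, 30000, 3000),
    (0.00, 0.01, 12.29, 50.61, 38.71, 30000, 3000),
    (0.20, 0.05, 11.93, 11.89, -0.40, 30000, 3000),
    (0.10, 0.03, 12.29, 50.49, 38.59, 30000, 3000),
    (0.00, 0.01, 11.95, 12.92, 0.98, 30000, 3000),
    (0.15, 0.04, 12.34, 50.57, 38.62, 30000, 3000),
    (0.00, 0.01, 12.35, 13.73, 1.39, 30000, 3000),
]

def idle_spikes(rows, max_delta_rtt=5.0, load_threshold=0.5):
    """Return rows where RTT jumped although neither direction was loaded.

    load_threshold is an assumed cut-off for "low load", not a value
    taken from the script.
    """
    return [
        r for r in rows
        if r[4] > max_delta_rtt and max(r[0], r[1]) < load_threshold
    ]

for r in idle_spikes(rows):
    print("RTT %.2f ms at rx_load %.2f, tx_load %.2f" % (r[3], r[0], r[1]))
```

Spikes that this filter catches cannot be explained by the shaper rate being too high, which is why they point at the link or the router rather than at the script.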

I would first go at trying to clean it up in shell a bit and add a few more bits of information (like a timestamp for each log entry, the differences between old and new rates, and potentially also the per-CPU load on the router). C is not my forte, and I love the flexibility of shell/other interpreted languages, where users can make changes directly on their devices. Yes, shell is somewhat limited (especially POSIX-compliant shell), but we might be able to work around its limitations (like keeping time in units of microseconds and using integer arithmetic instead of floating point in bc/dc).

Well, you can set tick to any old value; whether the cycle time actually meets that setting is another question :wink: This is why I propose to replace:

printf "%14.2f;%14.2f;%14.2f;%14.2f;%14.2f;%14.2f;%14.2f;\n" $rx_load $tx_load $baseline_RTT $RTT $delta_RTT $cur_dl_rate $cur_ul_rate

with

printf "%s;%14.2f;%14.2f;%14.2f;%14.2f;%14.2f;%14.2f;%14.2f;\n" $( date "+%Y%m%dT%H%M%S.%N" ) $rx_load $tx_load $baseline_RTT $RTT $delta_RTT $cur_dl_rate $cur_ul_rate

and

if [ $enable_verbose_output -eq 1 ]; then
        printf "%14s;%14s;%14s;%14s;%14s;%14s;%14s;\n" "rx_load" "tx_load" "baseline_RTT" "RTT" "delta_RTT" "cur_dl_rate" "cur_ul_rate"
fi

with

if [ $enable_verbose_output -eq 1 ]; then
        printf "%14s;%14s;%14s;%14s;%14s;%14s;%14s;%14s;\n" "log_time" "rx_load" "tx_load" "baseline_RTT" "RTT" "delta_RTT" "cur_dl_rate" "cur_ul_rate"
fi

Here is an idea, if you plot rate as ratio between maximum and minimum you can simply add the load curves as additional lines to the plot (or scale everything to %)
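One way to read this suggestion is to map each rate onto a 0..1 scale between its configured minimum and maximum, so it shares an axis with the load values. A minimal sketch under that interpretation; the function name is hypothetical and the example limits are the download values from the Python script earlier in the thread.

```python
def normalise_rate(rate, min_rate, max_rate):
    """Map a shaper rate onto 0..1, where 0 is the minimum and 1 the maximum."""
    if max_rate <= min_rate:
        raise ValueError("max_rate must be greater than min_rate")
    return (rate - min_rate) / (max_rate - min_rate)

# Example limits in Kbit/s, taken from the Python script earlier in the thread.
min_dl_rate, max_dl_rate = 4000, 9157

for rate in (4000, 6578.5, 9157):
    print(rate, "->", normalise_rate(rate, min_dl_rate, max_dl_rate))
```

Multiplying the result by 100 gives the percentage scaling mentioned as the alternative.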

@moeller0 are the ping spikes on the VDSL connection above not reflective of something that cannot be fixed with CAKE? Ping spikes to 50ms every now and then without load seems bad, no?

Yes, these spikes look odd, in that they seem to be rate independent and pretty uniform in magnitude. These might (deep speculation) be related to some periodic work the router does, which, if an RTT probe happens at the same time, introduces such a delay...

I wondered about that too. Hence your suggestion above to add CPU load to the output. Can that always just be read from /proc/loadavg?

I guess we need to cat /proc/stat and then construct the load percentages from the individual fields of the columns of the CPU lines:

cpu  960044 4198 250269 30663225 239954 0 229491 0 0 0
cpu0 508049 2024 122522 15348182 127243 0 65569 0 0 0
cpu1 451995 2174 127747 15315043 112711 0 163922 0 0 0

cpu gives the total over all CPUs, but for things like cake we are interested in the individual CPUs, because cake's CPU (e.g. cpu0) maxed out and the other CPU (e.g. cpu1) idle end up with cpu being reported as 50% idle...
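A rough sketch of how the per-CPU figures could be derived from two /proc/stat snapshots follows. The "before" snapshot reuses the sample lines above; the "after" snapshot is invented so that cpu0 is fully busy and cpu1 fully idle between the two reads. The function names are hypothetical, and counting iowait as idle time is an assumption of this sketch.

```python
def parse_proc_stat(text):
    """Return {cpu_name: (idle_ticks, total_ticks)} for the per-CPU lines."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep "cpu0", "cpu1", ... and skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # First eight counters: user nice system idle iowait irq softirq steal
        ticks = [int(v) for v in fields[1:9]]
        idle = ticks[3] + ticks[4]  # assumption: treat iowait as idle
        result[fields[0]] = (idle, sum(ticks))
    return result

def busy_percent(before, after):
    """Busy share per CPU between two parsed snapshots, in percent."""
    out = {}
    for cpu, (idle0, total0) in before.items():
        idle1, total1 = after[cpu]
        d_total = total1 - total0
        out[cpu] = 100.0 * (1 - (idle1 - idle0) / d_total) if d_total > 0 else 0.0
    return out

BEFORE = """cpu  960044 4198 250269 30663225 239954 0 229491 0 0 0
cpu0 508049 2024 122522 15348182 127243 0 65569 0 0 0
cpu1 451995 2174 127747 15315043 112711 0 163922 0 0 0"""

# Invented second snapshot: cpu0 gains 100 softirq ticks, cpu1 gains 100 idle ticks.
AFTER = """cpu  960044 4198 250269 30663325 239954 0 229591 0 0 0
cpu0 508049 2024 122522 15348182 127243 0 65669 0 0 0
cpu1 451995 2174 127747 15315143 112711 0 163922 0 0 0"""

print(busy_percent(parse_proc_stat(BEFORE), parse_proc_stat(AFTER)))
```

On a router the two snapshots would come from reading /proc/stat once per tick instead of from the literal strings used here.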

Wouldn't irqbalance always solve that? Or only sometimes? I don't understand why that's not enabled by default actually.

@Knomax what router do you use? Is your router CPU getting loaded with something that might explain your ping spikes? Do you do anything funky on it? You are not running a Plex server on it or something, are you? Maybe enable irqbalance anyway just to help spread load over the CPU cores.

opkg update
opkg install irqbalance

Enable by changing 0 to 1 in /etc/config/irqbalance:

config irqbalance 'irqbalance'
        option enabled '1'

This may not be the cause of your issue but, unless I am mistaken, enabling irqbalance can only help under high load.

It should help, but it will not solve everything. For example, on mvebu (at least on the SoC used in the Turris Omnia) it seems Ethernet IRQ processing needs to be done by CPU0, so irqbalance is not going to help with these...

If I had to speculate, I would guess that for single core routers it would be wasted space (so not included in the default set of packages) and even for multicore routers it is not strictly required... It would be convenient though if it came as default on multi-core routers that are not flash-poor. There probably is a really good reason for not including it though that I am just not recognizing.

I've updated the Lua service implementation to use the simpler RTT tracking method proposed by @openwrticon; I really like the FAME algorithm you highlighted.

The service will now consider ping readings below the moving median * 1.2 as eligible for bandwidth increases, and ping readings above the median * 2 will trigger the decrease logic. These numbers might need some tweaking.

See the Lua extract below for an implementation of the fast algorithm for median estimation (FAME) and the simplified updatePingStatistics function.

I've added an 'rtt' config parameter in which the user should specify their unloaded rtt - this defaults to 50 ms.

Testing of this update will be much appreciated.

jeverley/wanmonitor: A lua service providing WAN interface monitoring and SQM autoscaling for OpenWrt routers. (github.com)

function movingMedian(persist, observation)
	if not persist.median or not persist.step then
		persist.median = observation
		persist.step = math.max(math.abs(observation / 2), 1)
	end
	if persist.median > observation then
		persist.median = persist.median - persist.step
	elseif persist.median < observation then
		persist.median = persist.median + persist.step
	end
	if math.abs(observation - persist.median) < persist.step then
		persist.step = persist.step / 2
	end
	return persist.median
end

function updatePingStatistics()
	if not ping.persist then
		ping.clear = 0
		ping.latent = 0
		ping.persist = {}
		if ping.current > rtt then
			movingMedian(ping.persist, rtt)
		end
	end

	ping.median = movingMedian(ping.persist, ping.current)
	ping.limit = ping.median * 2
	ping.target = ping.median * 1.2

	if ping.current > ping.limit then
		ping.clear = 0
		ping.latent = ping.latent + interval
		return
	end

	if ping.current > ping.target then
		return
	end

	ping.clear = ping.clear + interval
	ping.latent = 0
end