CAKE w/ Adaptive Bandwidth

@dlakelan how do I get inbound traffic destined for 192.168.1.1 to go through my veth-lan so that it gets prioritised by CAKE? I think only the outgoing ping request is caught on upload by CAKE on wan, but the incoming response is not caught by CAKE on veth-lan. I don't know how to route traffic destined for the router itself through veth-lan :frowning:

These IP rules catch all inbound LAN traffic but unfortunately not traffic destined for 192.168.1.1:

14000:  from all iif wan lookup veth-lan
14000:  from all iif vpn lookup veth-lan

Even setting priorities to '0' doesn't catch traffic to 192.168.1.1.

How do I fix this? Without it, I presume the ping responses are not going to get prioritised as they should, and so this will paint a bleaker picture than reality?

I think this may explain why your script is throttling my download all the way down to the minimum level I set, even though I know the connection can handle a larger bandwidth without bufferbloat using CAKE.

@shm0, sorry, I should have referenced your script too:

@shm0 do you presently use one of these rate adjusting scripts? Did you tweak yours any further than as posted above? I'm keen to try it too.

The trouble is that with the exception of @lantis1008's solution (which has received a lot of positive reports, but would require a lot of work to adapt for OpenWrt), there seems to be no tried and tested solution with positive reports readily available on OpenWrt to this problem, and yet a lot of users seem to be very much in need of one.

I wonder why @lantis1008's solution has received so much positive feedback, and whether this means that it is the horse that should be backed in terms of getting something that can be made to work well for users on OpenWrt.

I don't think you can, as there is no further routing of packets destined for the device itself. However, this should mean that the ping is LESS delayed; after all, cake can only delay or drop a packet. So I don't think this is the reason for your issue.
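One way to see this on the router itself is `ip route get`: packets addressed to the device resolve through the kernel's local routing table for local delivery, so they never reach a custom rule's veth redirect. A diagnostic sketch (127.0.0.1 stands in here for the router's own 192.168.1.1, just so it works on any machine):

```shell
#!/bin/sh
# Diagnostic sketch: ask the kernel how it would route a packet addressed
# to the device itself. The output shows it resolved as "local" delivery,
# which happens before any custom ip-rule table would redirect it.
ip route get 127.0.0.1
# Typical output: "local 127.0.0.1 dev lo ..."
```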


Thanks a lot for your response. A pity on the one hand, since it means I can't also shape traffic to the router, but that's no biggie. As for your script - ah, good point. So by bypassing CAKE it should get top priority anyway. Although if CAKE knew about it, wouldn't it reduce the priority of everything else? Maybe that's irrelevant.

Any idea why, when running your script during a download, it just throttles down to the minimum bandwidth I set (20Mbit/s), even though if I disable the script and restart SQM so the bandwidth gets set to 30Mbit/s, I still get an A+ on the waveform bufferbloat test?

root@OpenWrt:~# tail /etc/init.d/sqmfeedback-erl/sqmfeedback.erl

    monitor_ifaces([{"tc qdisc change root dev wan cake bandwidth ~BKbit", 25000, 30000, 35000},
                    {"tc qdisc change root dev veth-lan cake bandwidth ~BKbit",20000,30000,50000}],
    ["google.com","one.one.one.one","quad9.net","facebook.com",
     "gstatic.com","cloudflare.com","fbcdn.com","akamai.com","amazon.com"]),
    receive
        _ -> true
    end.

Should I set the ping difference to higher than 10ms, or the number of delayed sites to higher than 2? Any suggestion what to try?

Delay > 10.0 -> 
Rpid ! {delay,Name,Delay,erlang:system_time(seconds)};

if length(RecentSites) > 2 ->

On the waveform bufferbloat test I think I get good performance so long as the 95th percentile is less than 80ms. My normal unloaded ping to my ISP is between 45-55ms.

As an aside, the change commands only need to specify the bandwidth, not all the other arguments, since those are already set, right?

If you want to be less stringent then yes, try 15 or 20 ms and maybe 3 sites


Just as a datapoint under saturating load cake will allow 5ms standing queue per direction, so when both legs of your link are running at capacity you can expect that your RTT (going through cake) increases by 10ms at the very minimum. In my observations often the average/median RTT increase is closer to 20ms under such conditions, so I would certainly try to set something larger than 10ms...


Having set the delay to 20.0ms and the site threshold to 3, things are looking rather better now. The script has converged on 31Mbit/s upload and 44Mbit/s download. This seems reasonable. And waveform reports A+ bufferbloat despite very heavy traffic.

So far so good.

Shortly I expect my ISP to get congested, and so hopefully I will see these converge down to perhaps 25Mbit/s download/upload or so. Fingers crossed.

Will report back.

@moeller0 in the meantime, do you see any issue with the upload part of the ping going through CAKE on wan but the response bypassing CAKE on veth-lan? Also, should I have those pings go through the VPN interface since most of my traffic uses VPN? Or would you say just go through WAN?

As far as I can tell the congestion you want to measure happens on the LTE link, so it should not matter much where you inject the ICMP probe packets into the link. IMHO bypassing cake (for something as small as a few ICMP packets every few seconds) seems about the right thing to do; after all, you are concerned about the LTE link and not about what happens inside of cake or your VPN.

FYI, with no load it scales up to max and then when I start a waveform bufferbloat test it dials back quickly - see the difference between this:

root@OpenWrt:/etc/init.d# tc qdisc ls
qdisc noqueue 0: dev lo root refcnt 2
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1518 target 5ms interval 100ms memory_limit 4Mb ecn drop_batch 64
qdisc noqueue 0: dev lan1 root refcnt 2
qdisc noqueue 0: dev lan2 root refcnt 2
qdisc noqueue 0: dev lan3 root refcnt 2
qdisc noqueue 0: dev lan4 root refcnt 2
qdisc cake 8011: dev wan root refcnt 2 bandwidth 35Mbit besteffort flows nonat nowash no-ack-filter split-gso rtt 50ms noatm overhead 70
qdisc noqueue 0: dev br-lan root refcnt 2
qdisc cake 800b: dev veth-lan root refcnt 2 bandwidth 50Mbit besteffort triple-isolate nonat wash ingress no-ack-filter split-gso rtt 50ms noatm overhead 70
qdisc noqueue 0: dev veth-br root refcnt 2
qdisc noqueue 0: dev vpn root refcnt 2
qdisc noqueue 0: dev wlan0 root refcnt 2
qdisc noqueue 0: dev wlan1 root refcnt 2
qdisc noqueue 0: dev wlan1-1 root refcnt 2

And this:

root@OpenWrt:/etc/init.d# tc qdisc ls
qdisc noqueue 0: dev lo root refcnt 2
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1518 target 5ms interval 100ms memory_limit 4Mb ecn drop_batch 64
qdisc noqueue 0: dev lan1 root refcnt 2
qdisc noqueue 0: dev lan2 root refcnt 2
qdisc noqueue 0: dev lan3 root refcnt 2
qdisc noqueue 0: dev lan4 root refcnt 2
qdisc cake 8011: dev wan root refcnt 2 bandwidth 29868Kbit besteffort flows nonat nowash no-ack-filter split-gso rtt 50ms noatm overhead 70
qdisc noqueue 0: dev br-lan root refcnt 2
qdisc cake 800b: dev veth-lan root refcnt 2 bandwidth 42668Kbit besteffort triple-isolate nonat wash ingress no-ack-filter split-gso rtt 50ms noatm overhead 70
qdisc noqueue 0: dev veth-br root refcnt 2
qdisc noqueue 0: dev vpn root refcnt 2
qdisc noqueue 0: dev wlan0 root refcnt 2
qdisc noqueue 0: dev wlan1 root refcnt 2
qdisc noqueue 0: dev wlan1-1 root refcnt 2

I am pretty happy with this 'impulse response', since this is a brutal test, namely going from zero load to maximum load, and seeing how quickly it reacts. Here it reacts quickly enough to act before the end of the waveform test.

From my observation Netflix traffic is rather spikey. I wonder how well this script will deal with that, i.e. traffic oscillating between 2Mbit/s for an active zoom call and then max bandwidth for each spike of Netflix traffic. In the worst case each spike would result in some periodic lag on the zoom call. I need to protect against that. Not sure how the script would handle this situation. Any thoughts?

I guess each spike will hopefully induce delays caught by the script, which will dial down the bandwidth, but it depends on the relationship between the spikes and the periodicity of the pings. Maybe I need very frequent pings and monitoring. Can we dial those up or is it already about maximum?

BTW on the RT3200 this + SQM + VPN is fine. Plenty of spare RAM and CPU.

Yes, most streaming services do not send at a fixed bit rate with nicely spaced packets, but instead send data for, say, a second of play-out as quickly as possible, only to fall silent until the next burst transmission. By monitoring their internal buffers they know when to request the next batch, and given the relative fill and consumption rates, whether to request higher or lower quality (higher quality correlates with a higher data rate). So most streaming services are spikey.

The script should not care if configured correctly... that is, you will have the same issue users on fixed links have when the shaper sits downstream of the true bottleneck.


I'm guessing it would be better than without the script, but not as good as if you weren't running Netflix. One good thing is at least in the UPSTREAM direction the other meeting participants shouldn't have choppy results from your video/audio. It's mostly that you will have some choppy experience of others.


Interesting point about choppy downstream but smooth upstream, since how I come across is actually what I really care about for business meetings. Thanks for that observation.

With the modification to 20ms / 3 sites the script seems to be working well for me. Still testing, but so far so good. It seems to have coped with some congestion this evening. Delighted to be able to reclaim some bandwidth and still avoid huge bufferbloat.

Anyone else out there using this solution today?

@dlakelan I have some thoughts about your script.

My experience, having tested it plenty so far, is that its convergence under load works very well. So I think that problem has been solved. But I think the approach is incomplete owing to its load blindness. Specifically: increasing bandwidth based on the absence of ping delays under no load seems flawed, because the ping data in this condition is not indicative of spare capacity.

Perhaps with the modification below this script could offer a good solution for the many OpenWrt users wanting autorate-ingress functionality?

To illustrate the problem, the maximum bandwidth my LTE connection ever sees is 70Mbit/s. So I tried increasing the max bandwidth to 60Mbit/s to see what would happen.

Under no load your script will tend to revert to the maximum set upload and download. But the ping data that led there is invalid, because no delays would ever be experienced anyway when there is no load. And thus any impulse load following a period of no load will tend to result in massive bufferbloat until the script reacts.

This does not seem optimal:

How about only converge upwards so long as load is greater than a set threshold? After all, there is no need to provide extra bandwidth if there is no demand for that extra bandwidth, right? I think @lantis1008's script sensibly factors this in.

By providing the extra bandwidth even when it is not called for under no load (greed), aren't we setting the system up for a bufferbloat-related fall once the load happens?

Wouldn't it be better to:

  • [the cautious approach] cautiously increase bandwidth in reaction to demand for load, whilst it is safe to do so (so under no load, the system sits at minimum bandwidth)

rather than:

  • [the greedy approach] greedily increase bandwidth even when there is no demand for it, and then have to react to increased ping upon load (so under no load, the system sits at maximum bandwidth)

How about only allowing an increase in bandwidth when the load is greater than, say, 50%? And perhaps also converging downwards when the load is less than 50%?

So the driving forces for opening and closing would then be:

  • load > 50% AND no delay encountered → increase bandwidth
  • load < 50% OR delay encountered → reduce bandwidth


This would mean that under no load the system is low latency and ready to react to an increased demand for load. It will deal with impulse loads better. Also I think we are always reacting to meaningful ping data.

Obviously keeping things simple seems highly desirable, but the modification proposed herein seems simple enough and potentially of significant benefit.

What do you think?

Would there be an easy way to take into account load in this way (or similar) in this part of your script:

monitor_delays(RepPid, Sites) ->
    receive 
	{delay,Site,Delay,Time} ->
	    NewSites=[{Site,Delay,Time}|Sites],
	    monitor_delays(RepPid,NewSites);
	{timer,_Time} -> 
	    io:format("Checking up on things: ~B\n",[erlang:system_time(seconds)]),
	    Now=erlang:system_time(seconds),
	    RecentSites = [{Site, Del, T} || {Site,Del,T} <- Sites, T > Now-30],
	    io:format("Full Delayed Site List: ~w\n",[Sites]),
	    io:format("Recent Delayed Site List: ~w\n",[RecentSites]),

	    %% use random scaling factors, but make sure down followed
	    %% by up averages slightly less than 1, based on
	    %% simulations this averages around .98... this ensures
	    %% we don't grow too fast.

	    if length(RecentSites) > 2 ->
		    Factor = rand:uniform() * 0.15 + 0.85,
		    RepPid ! {factor,Factor};
	       true ->
		    Factor = rand:uniform() * 0.12 + 1.0,
		    RepPid ! {factor,Factor}
	    end,
	    monitor_delays(RepPid,RecentSites)
    end.

Is there an existing stock system command for obtaining the instantaneous upload/download bytes transferred per second that could be called? Or would rx/tx bytes transferred need to be polled and compared against a timer?

@moeller0 and @lantis1008 I'd also be very interested in your thoughts on this.

Yes, you'd have to constantly poll the current bytes transferred. The thing is, I'd be afraid the script and the TCP bandwidth-finding algorithm might interact, leading to all sorts of weird oscillations. TCP already searches for the right bandwidth; it doesn't just jam packets as fast as possible. Of course both speedtest and streaming sites are tuned to do something different from standard TCP, so they jam packets pretty quickly.

But what about UDP traffic, e.g. for the VPN, or other UDP traffic that doesn't search for the right bandwidth like TCP does?

So I see the script already determines the present time using:

Now=erlang:system_time(seconds),

So how about simply recording the rx/tx bytes since the previous iteration and dividing by the time between iterations? And then using that to determine whether the load is above or below 50%, and proceeding that way? Would that work?

I'm in favor of you doing some testing. I think it should work fine. There's a /sys/ file you can read to see the total transfer on an interface. Here's the one for my desktop's ethernet interface, which is named "lan":

/sys/class/net/lan/statistics/tx_bytes

also rx_bytes
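A minimal sketch of that polling approach, assuming the Linux /sys layout above (the interface name and the one-second interval are illustrative):

```shell
#!/bin/sh
# Sketch: estimate current throughput by sampling the kernel's per-interface
# byte counters twice and dividing the delta by the sampling interval.
IFACE="${1:-lo}"   # illustrative; on the router this would be e.g. wan
INTERVAL=1         # seconds between samples

RX1=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
TX1=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
sleep "$INTERVAL"
RX2=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
TX2=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)

# Convert byte deltas to Kbit/s, the same unit cake's bandwidth uses.
RX_KBPS=$(( (RX2 - RX1) * 8 / 1000 / INTERVAL ))
TX_KBPS=$(( (TX2 - TX1) * 8 / 1000 / INTERVAL ))
echo "rx=${RX_KBPS}Kbit/s tx=${TX_KBPS}Kbit/s"
```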

It's been 2 or almost 3 years since I worked on this script. The best approach is for you to try it out and submit some pull requests I think. I spent a lot of time on the Erlang docs site: https://www.erlang.org/docs

Erlang is a pretty sweet language for doing this kind of thing, but I already had a love for Lisp and Prolog before starting :slight_smile:


The "ideal" situation would be to use Bayesian learning to estimate the current bandwidth, knowing something about "time of day" effects and recent info on bufferbloat... keeping a history of the equilibrium rate throughout the day would be useful.

OK cool, thanks - I will try this.

Does my algorithm:

  • load > 50% AND no delay encountered → increase bandwidth
  • load < 50% OR delay encountered → reduce bandwidth

make sense? Would you propose any change?

So I propose to weave this into (assuming I understand it correctly):

	    if (length(RecentSites) > 2) orelse (Load < 0.5) ->
		    Factor = rand:uniform() * 0.15 + 0.85,
		    RepPid ! {factor,Factor};
	       Load >= 0.5 ->
		    Factor = rand:uniform() * 0.12 + 1.0,
		    RepPid ! {factor,Factor}
	    end,

where the load (Load) is determined from the rx/tx bytes transferred per second (rx/tx bytes since the last iteration divided by the time between iterations), divided by the presently configured bandwidth.
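The proposed gate could be sketched as a small shell helper (the function name, the 50% threshold, and the sample numbers below are all illustrative, not from the actual script):

```shell
#!/bin/sh
# Sketch of the proposed load-gated decision. Inputs: number of recently
# delayed sites, measured throughput in Kbit/s, current shaper rate in Kbit/s.
# Prints "decrease" or "increase".
decide() {
    delayed=$1 measured=$2 shaper=$3
    # Load as an integer percentage of the current shaper rate.
    load=$(( measured * 100 / shaper ))
    if [ "$delayed" -gt 2 ] || [ "$load" -lt 50 ]; then
        echo decrease
    else
        echo increase
    fi
}

decide 0 10000 40000   # idle link, no delays  -> decrease (stay near minimum)
decide 0 30000 40000   # busy link, no delays  -> increase
decide 3 30000 40000   # busy link with delays -> decrease
```

With 40000Kbit currently configured, a measured 10000Kbit is only 25% load, so the first call falls below the 50% gate and the rate is walked down even though no delays were seen.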

If anyone sees any problems or improvements to the above please chip in!

I have my own script to tackle this problem, basing decisions on latency only and ignoring bandwidth usage (for now).

One thing I do which hasn't been mentioned here is use nping to do timestamped pings (ICMP type 13), so that when latency is high I can tell whether it's the upstream or downstream (or both) that needs to be modified.


That sounds very intriguing about nping. Would you mind posting your script here? And would you like to test @dlakelan's script to see whether it beats your own? You just need to install 'erlang' and 'erlang-compiler', run 'erl' in /root, type 'c(sqmfeedback).', and then create the service file:

root@OpenWrt:/etc/init.d# cat sqmfeedback
#!/bin/sh /etc/rc.common
# Copyright (C) 2007 OpenWrt.org

export PATH=/usr/sbin:/usr/bin:/sbin:/bin

START=51
STOP=4

start() {
        erl -pa /root/sqmfeedback -eval 'sqmfeedback:main().' -noshell -detached
}

stop() {
        pgrep erl | xargs kill -9
}

If so, you could join in. You should probably change the delay threshold from 10.0ms to 20.0ms, and the number of delayed sites that triggers a change from 2 to 3.

There are many with this problem, and I feel like some co-ordination could help arrive at a good overall solution, whereas right now everyone has their own DIY hacks to try to address this issue.

From my testing @dlakelan's script works very well under load. At least it does on my 4G LTE connection. But it is let down by opening up under no load.

I feel working with @dlakelan's script may be the best way forward. Or what about yours? @lantis1008's script has received a lot of positive feedback, but adapting for OpenWrt seems too hard.

It would be fantastic to have one horse backed and for multiple people to contribute to perfecting that rather than all these different DIY scripts!

It's 1700 lines of pretty amateurish Perl. I'm not sure it should be seen in public :rofl:

I did look at @dlakelan's script and there was a reason I ended up rolling my own, but I can't remember what that reason was now.