CAKE w/ Adaptive Bandwidth [August 2022 to March 2024]

Found a possible cause of this spiral-down. It's a confusion around rx_bytes_path.

By default, with an ifb interface, it is set to /sys/class/net/${dl_if}/statistics/tx_bytes. In other words, monitoring of the achieved rates uses the rates past the shaper. For TCP, it doesn't really matter, because it will quickly settle down to the rate that it is being shaped to. But media streams over UDP are not that responsive, so the difference can be significant and an overshoot is possible.
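For illustration, the difference is just which counter the achieved download rate is read from (dl_if here would be e.g. ifb4wwan0; this is my paraphrase of the behaviour, not the actual code):

# current default with an ifb: the rate that has already passed the shaper
rx_bytes_path="/sys/class/net/${dl_if}/statistics/tx_bytes"
# what actually arrives on the underlying link
rx_bytes_path="/sys/class/net/wwan0/statistics/rx_bytes"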

Quick demo:

Set both SQM rates to 800 kbps statically, stop cake-autorate. Run these two commands in two terminals in parallel:

iperf3 -u -c speedtest.shinternet.ch -R -t 60

and

r=0
ri=0
while true ; do
    read -r r1 </sys/class/net/wwan0/statistics/rx_bytes
    read -r ri1 < /sys/class/net/ifb4wwan0/statistics/tx_bytes
    echo -e "$(( r1 - r ))\t$(( ri1 - ri ))"
    r=$r1
    ri=$ri1
    sleep 1
done

Result: the iperf server will send 1 Mbit/s of UDP traffic towards your host. The shaper will shape this, though not to the configured 800 kbit/s but to about 500 kbit/s. OK, that doesn't matter for the end result. And the end result, for the monitored rates, is:

135908  66622
134778  66860
132928  66558
134609  66737
134304  67860
141108  66250
135876  66590
134419  66561
135347  67545

In other words, to understand that the download speed of 1 Mbit/s is achieved, despite the shaper being set to a lower value, we really need to use /sys/class/net/wwan0/statistics/rx_bytes.
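For reference, converting the sampled per-second byte deltas above to bit rates (my arithmetic):

echo $(( 135908 * 8 / 1000 ))   # ~1087 kbit/s seen on wwan0 rx_bytes, i.e. the full ~1 Mbit/s stream
echo $((  66622 * 8 / 1000 ))   # ~532 kbit/s seen on ifb4wwan0 tx_bytes, i.e. only the post-shaper rate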

Now see what happens. On bufferbloat, the script sets the shaper rate to 90% of the achieved rate, with this wrong definition of "achieved". But if somebody persistently overloads the link with UDP, it will progressively set the speed to 90%, then notice that the bufferbloat hasn't gone away, and set it to 90% of the achieved rate again (which is now measured behind the shaper, so really 81% of the original), and so on, which makes it too hard to recover from.
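A toy illustration of that compounding (numbers made up; once the shaper itself is the bottleneck, the post-shaper "achieved" rate simply tracks the shaper rate, so every bufferbloat event multiplies the shaper rate by another 0.9):

rate=1000   # kbit/s, starting at the true bottleneck rate
for i in 1 2 3 4 5; do
    rate=$(( rate * 90 / 100 ))
    echo "after bufferbloat event $i: ${rate} kbit/s"
done
# prints 900, 810, 729, 656, 590 ... marching down towards the configured minimum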

I think that we should completely delete special-casing of IFB/veth interfaces when figuring out rx_bytes_path and tx_bytes_path, and always use statistics of the upload interface, unless overridden. That is:

# Initialize rx_bytes_path and tx_bytes_path if not set
if [[ -z "${rx_bytes_path:-}" ]]; then
        rx_bytes_path="/sys/class/net/${ul_if}/statistics/rx_bytes"
fi
if [[ -z "${tx_bytes_path:-}" ]]; then
        tx_bytes_path="/sys/class/net/${ul_if}/statistics/tx_bytes"
fi

Well, I am not sure about the tx_bytes_path change, but I don't know how a setup with ifb or veth would look in this case, and am not sure whether there is a universal rule based solely on the interface type. Yes, I know that this will break someone's setup, but the current heuristic is too smart, in a wrong way, and I believe that for such non-standard cases a manual override is the correct solution. But then the setting and its use case should be documented in cake-autorate_defaults.sh.
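For such a non-standard setup, the manual override I have in mind would be something along these lines in the user config (paths are illustrative, taken from my wwan0 case):

# point the achieved-rate monitoring at the real radio interface,
# bypassing the ifb used for download shaping
rx_bytes_path="/sys/class/net/wwan0/statistics/rx_bytes"
tx_bytes_path="/sys/class/net/wwan0/statistics/tx_bytes"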

P.S. One of the consequences of this bug is that all graphs posted so far are wrong regarding the download speeds - they don't include overshoots over the rate set by the shaper.

P.P.S. From a cybersecurity perspective, if my analysis is correct, this qualifies as a vulnerability: an attacker who can send a UDP stream to the target router with enough bitrate to cause bufferbloat, can cause the reduction of usable bandwidth down to the minimal rate, i.e. far more than without the cake-autorate script. Perhaps we need to make an official announcement here and on GitHub, and maybe even release 1.2.1?


Keep in mind that at the point of shaper reduction we likely still have some data in flight sent at the old rate that will accumulate in the queue and will need some time to drain, so reducing the rate to below the achieved rate IMHO is still the right thing.

Confusion indeed.

As far as I can tell we:
a) understand that the achieved rate is only meaningful for the download direction (unless Linux has direct control over the uplink-bottleneck interface, but in that case we should just enable BQL and would have solved the issue).
b) we want the achieved rate to reflect the actually achievable goodput (modulo the gross rate versus net rate overhead)
c) we use the achieved rate as a helper in deciding which rate to reduce our shaper to (IIRC we take the minimum of our normal reduction step calculation and the achieved rate; see the rough sketch right below)
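Roughly, c) amounts to something like this on a bufferbloat event (a simplification with illustrative 0.9 factors and example values; the real implementation also respects the configured minimum shaper rate):

achieved_rate_kbps=5000   # example: measured achieved rate over the last interval
shaper_rate_kbps=8000     # example: current shaper rate
# take whichever candidate reduction is lower: the one derived from the achieved
# rate or the one derived from the current shaper rate
candidate_from_achieved=$(( achieved_rate_kbps * 90 / 100 ))
candidate_from_shaper=$(( shaper_rate_kbps * 90 / 100 ))
if (( candidate_from_achieved < candidate_from_shaper )); then
    shaper_rate_kbps=${candidate_from_achieved}
else
    shaper_rate_kbps=${candidate_from_shaper}
fi
echo "new shaper rate: ${shaper_rate_kbps} kbit/s"   # 4500 in this example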

Yes this is due to b) above, so I argue this is pretty much as intended.

But due to c), the fact that our shaper might drop a lot of packets does not really matter that much: we only act on the achieved rate if we have evidence that the shaper rate is too high already, and only use the achieved rate if that gets us lower than our normal heuristic. In that case achieved_rate < shaper_rate by necessity, so it is still a useful proxy for what the link can deliver...
In the increase rate direction, we already increase when we are at 75% (is that the current default still?), so again a slight imprecision in the achieved_rate measurements (e.g. from taking the shaper's egress instead of ingress rate) is not going to affect the control loop significantly.
Keep in mind we are dealing with a set of heuristics here, not hard and fast facts...
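For completeness, a similarly rough sketch of the increase side (the ~75% load threshold is the one mentioned above; the step size here is just an arbitrary example, not the actual default):

achieved_rate_kbps=7000   # example: current load
shaper_rate_kbps=8000     # example: current shaper rate
# no bufferbloat and load above ~75% of the current shaper rate -> probe upward
if (( achieved_rate_kbps * 100 >= shaper_rate_kbps * 75 )); then
    shaper_rate_kbps=$(( shaper_rate_kbps * 1025 / 1000 ))   # e.g. +2.5% per adjustment
fi
echo "new shaper rate: ${shaper_rate_kbps} kbit/s"   # 8200 in this example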

To increase the reporting precision here we would need to have additional definitions for the interfaces to collect the traffic data from. Or use something like sqm-scripts does:

# find the ifb device associated with a specific interface, return nothing of no
# ifb is associated with IF
get_ifb_associated_with_if() {
    local CUR_IF
    local CUR_IFB
    local TMP
    CUR_IF=$1
    # Stray ' in the comment is a fix for broken editor syntax highlighting
    CUR_IFB=$( $TC_BINARY -p filter show parent ffff: dev ${CUR_IF} | grep -o -E ifb'[^)\ ]+' )    # '
    sqm_debug "ifb associated with interface ${CUR_IF}: ${CUR_IFB}"

    # we could not detect an associated IFB for CUR_IF
    if [ -z "${CUR_IFB}" ]; then
        TMP=$( $TC_BINARY -p filter show parent ffff: dev ${CUR_IF} )
        if [ ! -z "${TMP}" ]; then
            # oops, there is output but we failed to properly parse it? Ask for a user report
            sqm_error "#---- CUT HERE ----#"
            sqm_error "get_ifb_associated_with_if failed to extrect the ifb name from:"
            sqm_error $( $TC_BINARY -p filter show parent ffff: dev ${CUR_IF} )
            sqm_error "Please report this as an issue at https://github.com/tohojo/sqm-scripts"
            sqm_error "Please copy and paste everything below the cut-here line into your issue report, thanks."
        else
            sqm_debug "Currently no ifb is associated with ${CUR_IF}, this is normal during starting of the sqm system."
        fi
    fi
    echo ${CUR_IFB}
}

to automatically get the underlying true interface (assuming that actually works in your wwan0 case). That is a lot of complication, add to this that due to b) we would need to grab these rates in addition to the rates we currently collect...
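For a quick manual check, the same lookup the function performs can be done by hand (wwan0 here is just the example interface):

tc -p filter show parent ffff: dev wwan0 | grep -o -E 'ifb[^) ]+'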

No, the script sets the rate to min(achieved_rate * factor1, shaper_rate * factor2) at a point at which we know, no matter what the achieved rate over the last interval was, it was too large. So I argue, this being a heuristic, we do not gain all that much by obsessing about shaper ingress/egress rates.

This is effectively a DOS attack, and yes, autorate does not solve this problem; as far as I can tell that is unsolvable from the endpoint... there really is nothing we can do here, even if we reduce the shaper more gently that unrelenting UDP flow will still crowd out all usable traffic. So this is pretty much out of scope for autorate, sorry.

That is assuming that this interface handles both ingress and egress.

I believe that the current heuristic works well enough and your concern (while not wrong) is a case of wanting perfect while we already have "good enough". But hey, humor me, change this on your link and see if that noticeably improves things...

Again, not a "bug" but a design goal: have the achieved rates correlate with measurable goodput.

This is an unavoidable consequence of doing ingress shaping after instead of egress shaping before the bottleneck link. We can be DOSed one way or the other; yes, it is not ideal that an attacker does not need to use > 100% of the link rate but can get away with a bit less*... that is what the minimal rate definitions are there for... I know you dislike them, but they are exactly the kind of back-stop against such shenanigans that we can use....

*) Note that we only persist on low rates if we actually experience bufferbloat, so when our controller arguably should engage. If there is no bufferbloat but a high load percentage we increase the rate again, and if there is no load we will also increase the rate again. So to be constantly limited to the minimum rate we need to have persistent above-threshold latency (effectively, spread around our reflector window), at which point keeping the shaper at the minimum rate is pretty much the right thing to do. I note the reports of "stuck on minimum", so there could be a bug in the implementation or the logic, but on principle I do not see as catastrophic an issue here as you seem to.


Oh boy, this seems complicated. Is the issue that we are setting the shaper rate to 90% of the post-shaper achieved rate rather than 90% of the pre-shaper achieved rate? Am I correct in thinking that ideally we would switch to the latter? But this is ONLY for this special case of setting the shaper rate based on line capacity estimation, right? This might be why setting to 120% makes sense? For the general case of measuring achieved rates for general monitoring and plotting we should go on using post-shaper rates, right? Or should achieved rates in general be pre-shaper, including for general monitoring and plotting?

Does it? The point is we only reduce the rate if we experience increased latency; if that latency increase persists, our shaper rate is too high, and then reducing in slightly larger steps is not going to hurt much. If the bufferbloat subsides, we might have reduced the shaper a bit more than strictly needed, but the normal rate increase rules still apply; that is, we are likely experiencing high load, so we increase the rate again.

That is @patrakov's argument in a nutshell. I am less concerned about this, as all of this is a "tower of heuristics" and gently improving/changing one of the steps is IMHO not going to have a big effect on the whole. His attack model is essentially a sustained, unrelenting flow below the bottleneck rate but above its fair capacity share, which uses up link capacity but is unresponsive to cake's signals, and so will result in pushing the shaper rate down (and successively getting an ever larger share of the ingress capacity). But that IMHO is the problem with unresponsive flows and ingress shaping, and that is something we will not be able to fix from our side. Or to put it differently, if we switch the achieved_rate measurement method as proposed, all our attacker needs to do is gently increase its sending rate again and we are back at square one.
I do not dispute that, as a reference for true achieved throughput over the link, cake's ingress rate is a better proxy than cake's egress rate; I just do not think that this is going to be a big factor. However I am open to be convinced otherwise by data.


No, this does not make sense generally; we really need to set the shaper a bit below the actual bottleneck rate to allow our queue to drain and increased latency to go away... there might be links where the modem or base station needs to see some persistent queue to schedule more capacity to a user, so it might be helpful on special links to try something like 120%, but generally this is the wrong thing to do.

I think so, as that is what correlates (modulo the gross versus net difference) with what users can actually measure via speedtests.

In reality, on normal links there is going to be a small difference between the two, hardly worth worrying about; after all, our main controller really only acts on delta_delay, so we are only talking about the margins here, namely how fast to reduce or increase the shaper rate.

But I do understand that on very slow links this might look a bit more dramatic.

This is a partial misunderstanding, the problem is that you did not complete the thought experiment. Yes, the attacker with an unresponsive flow will be able to cause bufferbloat, and we can do nothing about it. Let's say (sorry for the unreasonably-low numbers, scale them as appropriate) the attacker uses up 1.1 Mbit/s via UDP out of 1 Mbit/s available, and there is also a legitimate TCP flow that would like to use as much as possible. The ISP's router will mix the two, and perhaps drop 54% of the attacker's packets to fit the link bandwidth. Without further shaping, we get 500 Kbit/s of the attack and 500 Kbit/s of legitimate traffic. But with the current cake-autorate, its logic will drive the shaper rate to the minimum, let's say 200 Kbit/s, thus throttling the legitimate traffic, too - that's the first concern. The second concern is the time to recover after the attack ends - overshaping obviously increases it.
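Spelling out the arithmetic in that example (same made-up numbers): the attacker offers 1100 kbit/s into a 1000 kbit/s link; with a roughly fair split at the ISP each flow gets ~500 kbit/s, so the attacker loses about (1100 - 500) / 1100 of its packets:

echo $(( (1100 - 500) * 100 / 1100 ))   # ~54 (% of attack packets dropped upstream)
# yet with the shaper driven down to its 200 kbit/s minimum, the legitimate flow
# is capped at 200 kbit/s even though ~500 kbit/s would have been deliverable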

Another way to view the proposal is: if it is clear that shaping below 90% of the actual incoming rate through wwan0 does not help against bufferbloat, then shaping even further won't help either, so stop persisting and please treat the latency incident as unfixable - i.e. optimize throughput, not latency.

So either we would switch to monitoring achieved rates based on shaper ingress (which might be confusing because speed tests would show values lower than that seen in our monitoring) or additionally monitor rates based on shaper ingress and then use the latter specifically when testing whether to punish on the lower of shaper rate * 0.9 or achieved rate * 0.9?

Regarding the confusing results of speedtests if we log the real achieved rate of wwan0: due to the way TCP works, I don't consider this a big problem, but logging both pre- and post-shaper rates, and only using the real wwan0 achieved rate as the estimate of the link capacity (and in all other decisions), would be a solution.

I would also say that the very use of "shaper rate * 0.9" for ingress is another root cause of the issue. Drop it.

For egress, of course the logic must take the shaper rate into account, as we don't have any estimate how much is dropped/buffered upstream.

But an attacker is just going to flood packets willy nilly; we are pitting a tsunami against a wooden hut and wondering whether we should use beech or ash.

Working out whether to persist or stop as you write is easier said than done. It's always tempting to add in extra heuristics like "in event A, do X; in event B, do Y", but we need something that works all the time in every situation.

But I do find it intellectually unsatisfying that we base the shaper rate = 0.9 * achieved rate on cake egress when it should be cake ingress. Assuming that I am understanding this issue correctly. It's sufficiently irksome that I want to think about how to fix this.

Forget about the attacker then :slight_smile: think about a 900 Kbit/s video stream.

Why would we drop shaper rate * 0.9? That's simple. Bufferbloat detected, so reduce the shaper rate. No issue there. But I do see your issue with basing the shaper rate on the lower of the latter and 0.9 * achieved rate, when the achieved rate is the cake egress and hence less than the actual line capacity given all the packets dropped by cake. The latter bothers me. Maybe we can just track both pre and post rates and only use the pre rates for this special line capacity estimate, the 0.9 * achieved rate case.

That is a genuine DOS attack that will cause an ever growing queue at the ISP's side (until the FIFO overfills). There is literally nothing we can do here except talk to the ISP to throttle that flow for us (unlikely on customer links for low rate flows).

You can't; the responsive traffic will give way, either because of cake dropping or due to FIFO overflows. Let me repeat: we cannot fix DOS from our side, no ifs, no buts.

Nope, TCP will reduce its rate promptly on experiencing drops, the attack traffic will not, so the traffic share even with just our two flows will not be ~50/50. Unless the ISP uses a flow queueing scheduler, the attack traffic will dominate over the responsive traffic independent of what we do on our side. However I accept that the unresponsive traffic will cause problems for our ingress shaper (we still enforce per-flow fairness, and drop the attack traffic harder, but that attack traffic will hog more than its fair share over the bottleneck link).

That is an assumption I consider unrealistic; given enough time, responsive flows will scale back (TCP will go as far down as 2 MTU-sized packets in flight) and the unresponsive traffic will gain the lion's share of the bottleneck capacity.

Only transiently, until the responsive traffic scales back; for a true 50/50 traffic share both flows need to have equal responsiveness (or lack thereof).

This is a red herring; the damage has already been done in the overflowing upstream buffer/queue (we know that exists, as otherwise the delta delay would tell our shaper to open up again).
Let me repeat: we cannot fix your described slow-DOS attack from our side. Taking the pre-shaper rate in our throttle-down calculations will not turn this scenario into something palatable, sorry.

To call something "overshaping" we would need a (preferably closed-form) theory of optimal shaper settings. We only have a set of heuristics that mostly work, so a bit of imprecision is IMHO not as big a deal as you make it out to be.

Keep in mind we do not have instantaneous traffic rates, only rates averaged over an interval, and we need to make some room for the queue to drain... I invite you to test this, but I really think you are barking up the wrong tree.

We have a toggle for that, the minimum rate, and you are free to change the relative shaper increase/decrease factors (just keep in mind that you need to dampen oscillations to avoid resonance phenomena). But really the point is we only shape down and stay down if we experience increased latency, and that is pretty much by design; autorate in essence is trying to give you:
"as much throughput as you can get with your selected acceptable delay"

We have the minimal_rate as a backstop for that: you can tell the controller at what point you stop caring about low latency, please use it... Really, I do not think that our controller is at fault here; it is more that your requirements are simply way more complex than our controller can handle. I also argue that the controller would need a mind reader and/or a time machine/oracle to detect, at the time of bufferbloat detection, whether that bufferbloat is going to be declared post hoc as actionable or ignorable.

You can try that; my prediction is that it is not going to fix/change much. As I tried to describe above, I do not think that @patrakov's scenario describes what actually happens.

No it would not; it would (in non-DOS situations) gently decrease the step-by-step reduction of the shaper rate on bufferbloat, which depending on the actual traffic pattern may or may not help to control bufferbloat better/quicker. If you truly have significant rate-inelastic unresponsive traffic above its "fair" share of the shaper rate, you will experience the problems of traffic shaping on the wrong side of the link. Cake's "ingress" mode tries to adapt to this by dropping harder on under-responsive flows, but for truly unresponsive traffic these drops happen only after the packets have already used up time over the link.

And replace it with what? During active bufferbloat, we know that the current shaper rate is too high, so reducing it is the only sane thing to do. Using achieved_rate is an optimization that allows us to reduce the shaper rate harder if that seems advisable.... just to mention, for egress the achieved_rate is essentially meaningless, hence we set the new shaper_rate to min(achieved_rate * factor1, shaper_rate * factor2)...

I do not think that special casing ingress and egress is going to help much, and it will introduce considerable complexity.

Yes, we can not fix ingress DOS attacks from our side. End of story.

With a 1000 Kbps link rate and flow-fair queueing, 900 Kbps is simply not reliably achievable if there is even one additional greedy flow. autorate cannot:
a) magically make the link capacity higher
b) magically deduce which of the flows you value more

For b) you could use e.g. diffserv4 and put only the video in BestEffort and everything else into Bulk (1/16 of capacity guaranteed): 1000 - 1000/16 = 937.5 Kbps gross rate left for the video.
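A rough illustration of that idea (the interface name and rate are placeholders, and how the non-video traffic would get marked CS1, especially on ingress, is setup-specific and not shown here):

# cake with the four-tier diffserv4 scheme: unmarked (CS0) traffic such as the video
# stays in Best Effort, while traffic marked CS1 lands in the Bulk tier (~1/16 of capacity guaranteed)
tc qdisc replace dev ifb4wwan0 root cake bandwidth 1000kbit diffserv4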

My point is at some point something needs to give, and desiring 900Kbps Video to run smoothly over a flow-fair scheduler with 1000 gross rate is simply not going to fly, no matter how we decide the step size of shaper reductions.

We are not really doing that; we always set it like this:

		# bufferbloat detected, so decrease the rate providing not inside bufferbloat refractory period
		*bb*)
			if (( t_next_rate_us > (t_last_bufferbloat_us+bufferbloat_refractory_period_us) )); then
				adjusted_achieved_rate_kbps=$(( (achieved_rate_kbps*achieved_rate_adjust_down_bufferbloat)/1000 )) 
				adjusted_shaper_rate_kbps=$(( (shaper_rate_kbps*shaper_rate_adjust_down_bufferbloat)/1000 )) 
				shaper_rate_kbps=$(( adjusted_achieved_rate_kbps > min_shaper_rate_kbps && adjusted_achieved_rate_kbps < adjusted_shaper_rate_kbps ? adjusted_achieved_rate_kbps : adjusted_shaper_rate_kbps ))
				t_last_bufferbloat_us=${EPOCHREALTIME/./}
			fi
			;;

We really take min(achieved_rate, shaper_rate) and that works out just fine for both directions. Our rate measurements are not spatio-temporally precise numbers, so we should avoid putting too much stock into them. And I argue we currently do the right thing by using these as optimizations on how to reduce the shaper_rate (but override these if the reduction would be too small).
On egress, to be explicit, achieved_rate can easily be too high, but that does not matter, as then shaper_rate * factor comes into play.
And the principle: as long as bufferbloat persists we step-by-step reduce the shaper rate until we reach the minimum_rate, below which we do not shape. Once the bufferbloat subsides and there is enough load, we will increase the shaper rate again...

No, here is the rub:
ingress/download side:
egress(shaper) <= ingress(shaper) ~= bottleneck_rate (approximate, as the achieved rate over the last interval is not a veridical description of the immediately available rate)

egress/upload side:
ingress(shaper) >= egress(shaper) >= bottleneck_rate

We really want to set bottleneck_rate * factor, but we do not know that, so we need to make a decent guess... and I think we are doing a decent job with a single rule that seems good enough for both download and upload.

Changing this is not going to solve @patrakov's problems at all, because let's face it, there is a point where a link is too slow to be useful or well-controllable... sorry to sound harsh.

You can do that, but please, please try whether this actually improves "perceived performance" before committing it to the main controller; simplicity is a virtue here. I think we all agree that this will only help for sufficiently inelastic traffic.
Why am I opposed to this? Taking the achieved rate was/is a decent improvement as it allows us to drop faster below the true bottleneck rate, which is where we need to be to drain the built-up queue; reducing the shaper less risks not draining the queue and will result in longer bufferbloat epochs.

So based on:

"ingress/download side:

egress(shaper) <= ingress(shaper) ~= bottleneck_rate (approximate, as the achieved rate over the last interval is not a veridical description of the immediately available rate)

egress/upload side:
ingress(shaper) >= egress(shaper) >= bottleneck_rate"

Ideally we would for download use achieved rate as the ingress to the shaper (which we are not doing).

And for upload use achieved rate as the egress from the shaper (which I think we are doing).

As you know I've put time into trying to optimize things for weak LTE links and it seems like here there may be a small further optimisation.

"Taking the achieved rate was/is a decent improvement as it allows us to drop faster below the true bottleneck rate which is where we need to be to drain the build up queue, reducing the shaper less risks not draining the queue and will result in longer bufferbloat epochs."

But I think the correct way around this would be to use the right achieved rate and, say, increase the factor to 0.95, rather than use the wrong achieved rate and keep the factor at 0.9?

Is this testable by just replacing 'ifb' with 'wan' for the download interface? Doesn't this also make sense in that right now we use shaper egress for upload achieved (what is on the wire) and shaper egress for download achieved (not what is on the wire)? Does this also affect the max wire RTT logic we put in for very low shaper rates below 12Mbit/s?

Why would we drop shaper rate * 0.9? Bufferbloat detected so reduce shaper rate.

Bufferbloat detected again due to a non-responsive stream, so the current logic will consider shaper rate * 0.9 again, and (pointlessly, because this form of bufferbloat is unfixable, and we should prioritize bandwidth in cases of unfixable bufferbloat) decrease the shaper rate again. Look, the shaper rate should be just below the bottleneck, and now we have 0.81 * the bottleneck, and then 0.73 * the bottleneck, and so on.

I don't object, but - what's the other use of the post-shaper download rate, besides logging and graphing?

EDIT: already answered, "the max wire RTT logic we put in for very low shaper rates below 12Mbit/s".

And this is something that I disagree with. What we know is that the shaper + feedback mechanism (e.g. TCP) did not prevent the queue build-up. If we know that our shaper is not the bottleneck, then it makes sense to make it the bottleneck by reducing the shaper rate just below the estimated achieved rate (which is an estimation of the upstream bottleneck). But if we know already that our shaper is the bottleneck and in fact drops traffic, and the bufferbloat still exists (because the feedback loop does not work, i.e. there is an unresponsive stream that overloads the upstream bottleneck), then reducing it further is pointless and should not be done, because this instance of bufferbloat is unfixable.

Yes, I understand, but how would we resolve this in a way that works in all cases? We can't distinguish what you refer to as unfixable bufferbloat from fixable bufferbloat. We already set a configurable refractory period that blocks subsequent shaper rate reduction.

If the controller sees bufferbloat it surely has to keep reducing the shaper rate down until the configurable minimum (subject to the refractory periods).

It would be rather dangerous to try reducing by an insufficient amount and then give up on further reductions based on some form of very complicated heuristic, because maybe the subsequent reduction is just what was needed.

Maybe you can propose a heuristic and we can try to shoot it down. If your heuristic survives scrutiny, then let's implement it!

We can! If the shaper rate is already below the true wwan0 achieved rate times 0.9 (i.e. if we know that the shaper is the bottleneck and drops something), then the bufferbloat is unfixable.

Of course this heuristic only applies in the download direction.
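A sketch of how that could look in the bufferbloat branch, download direction only (pre_shaper_achieved_rate_kbps is a hypothetical variable that would have to be fed from the real wwan0 rx_bytes counter; the min-rate guard and the achieved-rate candidate from the code quoted above are omitted for brevity):

pre_shaper_achieved_rate_kbps=900    # hypothetical: rate actually arriving on wwan0
shaper_rate_kbps=700                 # example: current shaper rate
shaper_rate_adjust_down_bufferbloat=900   # example per-mille factor (division by 1000 as in the quoted code)
# reduce only if the shaper is not yet clearly the bottleneck, i.e. still at or above
# 90% of the pre-shaper rate; otherwise hold the rate and treat this bufferbloat as unfixable
if (( shaper_rate_kbps >= (pre_shaper_achieved_rate_kbps * 900) / 1000 )); then
    shaper_rate_kbps=$(( (shaper_rate_kbps * shaper_rate_adjust_down_bufferbloat) / 1000 ))
fi
echo "shaper rate after this event: ${shaper_rate_kbps} kbit/s"   # stays at 700 in this example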


But unless I am mistaken, just because the shaper is dropping traffic does not mean it has become the bottleneck? I mean the shaper rate could still be way too high compared to capacity and it could still be dropping traffic, right? Or is there really some meaningful way we can rely on whether the shaper is dropping traffic or not to help make the decision about the next shaper rate?

I see, next shaper rate = 0.9 * achieved rate, period (well, subject to being greater than the minimum rate and subject to the refractory period). Have you tested?
