CAKE w/ Adaptive Bandwidth [October 2021 to September 2022]

Have you considered the content of the referenced research here:

And here:

I wonder if this is a little more academic than practical. It promises to replace active probing with passive measurement based on information contained in packet headers, and it seems this can be done at the receiver only.

How feasible is it in practice to extract reliable RTT information from packet headers (enough to ascertain a delta to detect bufferbloat)? My understanding of the underlying networking concepts is not good enough, but I imagine there are problems relating to multiple paths, changing paths, and paths starting and stopping as packet streams between different end points come and go.

Would this work on WireGuard encrypted packets?

I had initially wondered if a problem might be that you could have data from one path that was intermittent, and then a burst of Netflix data from another path; if the first path had a discontinuity at the burst of Netflix data, you wouldn't spot the delta in time to address the Netflix burst. To elaborate: say you have data for one particular path that lasts for 5 minutes, so you have a kind of baseline RTT for that path. If that path terminates before bufferbloat develops on a stream using a new path for which you have no baseline, then you can't spot a delta. Or am I missing something?

In this connection, perhaps one nice thing about WireGuard packets is that they are always to and from the same IP. So that path is fixed and if packet headers can be inspected to infer rtt to the WireGuard IP then wouldn't that give a way to ascertain bufferbloat?

Could this compete with active probing to reliable reflectors?

Could it ultimately be adopted within CAKE so the user doesn't even need to enter a bandwidth at all, even for fixed-bandwidth connections? That is, a fully baked CAKE would finally address the bandwidth estimation issue that 'autorate-ingress' did not fully solve; this unaddressed but very important issue seriously compromises the utility of CAKE on its own as a means of bufferbloat control for at least all variable-rate connections.

Assuming the bakers are not all too elderly now.

I would if I could (hopefully this thread shows there is plenty of will from myself and others), but my baking skills are not up to scratch and may well never be.

Or perhaps the recipe is just not quite there yet to resume the baking.

Point remains about unfinished business though.

Curious, why? From my understanding, a proper WiFi stack employing airtime fairness and, if needed, AQL (airtime queue limits) should solve that problem already; or is your goal here to post-hoc fix bad WiFi stacks in the clients?

Nope, that approach works for bump in the wire situations, so can be implemented in intermediary hops AS LONG as the relevant header fields are not encrypted...

As that paper shows, quite feasible assuming RFC 1323 TCP timestamps are used on a sufficient fraction of flows....

Same issue as with your approach, selecting the minimum over a set of flows should work... essentially you can/need to calculate the deltaRTT independently for every flow (or potentially for hash bins in an fq qdisc)
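For concreteness, a toy sketch of that idea (entirely hypothetical; the class and method names are mine, not from arl or cake): track a baseline RTT per flow, compute each flow's delta, and take the minimum delta across flows as the congestion signal, on the assumption that a congested bottleneck delays ALL flows.

```python
# Hypothetical per-flow delta-RTT tracker. Baseline = minimum RTT ever
# seen for that flow; delta = most recent sample minus baseline. The
# minimum delta over all flows is used as the shared congestion signal.

class DeltaRttTracker:
    def __init__(self):
        self.baseline = {}   # flow id -> minimum RTT seen (ms)
        self.current = {}    # flow id -> most recent RTT sample (ms)

    def sample(self, flow_id, rtt_ms):
        self.current[flow_id] = rtt_ms
        prev = self.baseline.get(flow_id, rtt_ms)
        self.baseline[flow_id] = min(prev, rtt_ms)

    def min_delta(self):
        """Smallest (current - baseline) over all flows, or 0.0 if none."""
        deltas = [self.current[f] - self.baseline[f] for f in self.current]
        return min(deltas) if deltas else 0.0

tracker = DeltaRttTracker()
tracker.sample("flow-a", 40.0)   # baseline for flow-a: 40 ms
tracker.sample("flow-b", 25.0)   # baseline for flow-b: 25 ms
tracker.sample("flow-a", 90.0)   # flow-a delta: 50 ms
tracker.sample("flow-b", 60.0)   # flow-b delta: 35 ms
print(tracker.min_delta())       # -> 35.0
```

The minimum (rather than mean or max) is chosen so that one flow whose path genuinely changed does not masquerade as bottleneck congestion.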

Probably not, as these carry no readable timestamps... same is true for other encapsulations without visible timestamps...

I think ARL really uses the same "trick" we employ and selects the minimum of the measured RTTs under the assumption that changes in bottleneck rate will affect ALL flows, but it has been a while since I looked at the arl code and I did not read it deeply, so this might be more my preconception of how arl should work and less a veridical description of how arl actually solves this issue :wink:

So in lieu of timestamps on packets you would need a set-up in which packets are sent in pairs, where on reception of a packet the receiver immediately sends a response and where such packet pairs can easily be identified, then you could store a timestamp when the request packet of a pair passes your qdisc and then when the response packet of a pair returns, giving you the RTT from your measurement node to the endpoint. But that is a lot of work and requires to keep a lot of state, not sure that approach is feasible for production use (then again, I assumed the same for Kathleen Nichols' approach and yet arl happened :wink: )
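To illustrate the state-keeping cost, here is a minimal sketch of such a request/response pairing table (hypothetical; `PairRttProbe` and its fields are made-up names, not from any existing qdisc): timestamp each request as it passes, and on seeing the matching response, the elapsed time is the RTT from the measurement node to the endpoint.

```python
import time

# Hypothetical request/response pairing table. Every outstanding request
# needs an entry until it is matched or aged out -- that per-pair state
# is exactly the cost discussed above.

class PairRttProbe:
    def __init__(self, timeout=2.0):
        self.pending = {}        # pair id -> timestamp when request passed
        self.timeout = timeout   # seconds before a stale entry is ignored

    def on_request(self, pair_id, now=None):
        self.pending[pair_id] = time.monotonic() if now is None else now

    def on_response(self, pair_id, now=None):
        """Return the RTT in seconds, or None if unmatched or too old."""
        now = time.monotonic() if now is None else now
        sent = self.pending.pop(pair_id, None)
        if sent is None or now - sent > self.timeout:
            return None
        return now - sent

probe = PairRttProbe()
probe.on_request("pair-1", now=100.000)
rtt = probe.on_response("pair-1", now=100.035)
print(rtt)  # ~0.035 s
```

In a real implementation the table would also need periodic garbage collection of entries whose responses never arrive, otherwise the state grows without bound.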

Yes, but harder to pull off and more costly... (even if you implement that in kernel you will at least need to detect candidate packets and extract their timestamp and sequence fields).

Probably, but someone would need to implement that... plus, with protocols with encrypted headers gaining popularity and neither IPv4 nor IPv6 carrying a reliable timestamp field, this might end in a situation where relying on readable TCP timestamps gets dicey. In all likelihood that state is still far in the future though.

I tend to be considerably less optimistic about any such approaches... think autorate-ingress: on the face of it quite sensibly designed (similar to sprout), yet in reality not as effective as desired, and, lacking a configured upper and lower bound, quite likely to massively over- or undershoot. It is well possible that autorate-ingress with a hard configurable min and max would be an improvement already; maybe someone should play with that.

The obvious solution to all ingress shaping however is to lean on your ISP so that they tackle bufferbloat on their end.... :wink:

Cake already pushes the envelope on what seemed acceptable for a qdisc to do and is already quite costly; adding even more costly processing might make this a hard sell for the upstream kernel (but that is a secondary problem, it would need to be implemented first).

IMHO a more reasonable approach might be to modify ARL to simply write/update its RTT estimates (say for ~10 flows) somewhere in /sysfs or /proc and use that as fodder for a script like yours...


Thank you very much for the informative response and I do hope you do not mind my cheeky comments too much. Naturally it is up to every individual how they spend their time, but individuals like me can still be cheeky from time to time, right?

Back in the day, CAKE clearly was a work of excellence and pushed the envelope. And clearly it is still of immense value to many including myself today. But perhaps maintenance of such excellence requires a perpetual work in progress to address issues like bandwidth estimation and the excessive CPU utilisation that also seems to be a significant threat vector to CAKE's efficacy and continued privileged status as SQM champion - in particular for gigabit connections. Posts crop up on this forum all the time about struggling to get CAKE to work for those, and users sometimes give up and adopt alternatives, even on the newer devices like the darling of this forum, the RT3200.

As an aside, I wonder if the 'easy cpu speed up modifications' mentioned below, or other such enhancements remain outstanding.

At a shaped rate, it does much better than htb + fq_codel does. There are a lot of easy cpu speed up mods left to make, but we prefer to work on fixing two problematic bits of codel right now… adding other features, and fixing bugs.

https://www.bufferbloat.net/projects/codel/wiki/Cake/#some-of-the-history

I may be sorely mistaken, but it seems that active CAKE development ground to a halt some time ago, and some big issues like bandwidth estimation or even some of the original design goals relating to CPU usage remain outstanding? If pride in the continued excellence of the work product alone isn't a sufficient motivator, is it a question of securing further funding I wonder? Or more willing contributors with the right skillsets?

My take is slightly different, low latency smooth traffic shaping is costly as hell, and both HTB+something or cake spend most of their CPU cycles shaping.
My naive interpretation is that one either works in largish batches with relatively large and sloppy "update intervals", or in smallish batches with tight update intervals. Say we release X KB worth of packets to the real interface every X/rate seconds: if X is large, the interval until the interface needs new data to avoid running dry is also large, and it is easier to get hold of the CPU before that interval expires; if X is small, it is likely that we overrun the interval, letting the interface idle for a while, thereby wasting throughput and increasing delay for all packets in the queue. Either way that interval adds to the latency of the connection, so it comes at a clear latency cost. One can strike a balance in setting X such that the re-fill interval does not get too large, but that only partially mitigates the problem: for tight bufferbloat control we want this interval to be as small as possible, which requires being able to get hold of the CPU at short intervals as well....
One way forward would be to move the traffic shaper into the NIC itself and have that grab/dequeue packets by DMA from an fq queue structure that is still filled/enqueued by the qdisc. I have no idea whether such an approach would actually be feasible...
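To put the batching trade-off in numbers, a toy calculation (the batch sizes and the 50 Mbit/s rate are illustrative, not values from cake's source):

```python
# How often a shaper that releases X bytes per timer tick must be
# rescheduled: interval = X / rate. Large batches give the CPU slack
# but add latency; small batches demand very timely wake-ups.

def refill_interval_us(batch_bytes, rate_bps):
    """Microseconds between shaper wake-ups for a given batch size."""
    return batch_bytes * 8 / rate_bps * 1e6

rate = 50_000_000  # a 50 Mbit/s shaper
for batch in (1514, 8 * 1514, 64 * 1514):  # 1, 8, 64 full-size packets
    print(f"{batch:>6} B batch -> wake every {refill_interval_us(batch, rate):8.1f} us")
```

At 50 Mbit/s a single-packet batch means a wake-up roughly every 242 microseconds; a 64-packet batch stretches that to about 15.5 ms, every microsecond of which is potential queueing delay for whatever sits behind it.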

But the gist is the most costly part of cake has no easy speed-ups waiting to be implemented. About cake's other components I have neither an idea how costly they are individually nor if there is low hanging fruit for the taking.

So cake seems to not see much development because it mostly works as intended and has no clear realistic TODO list (although adding min and max rate to autorate-ingress might be worth playing with).


Well, get a faster router then :wink: doing stuff per-packet at that rate simply is costly... that is why many recent routers use "accelerators" to move processing load from the CPU to some dedicated machinery (sometimes CPUs in their own right, as with the NSS cores of the r7800, IIRC). The problem is, this machinery gets part of its speed-up from being in hardware and part by being less general than the software networking stack in the kernel... so unless the accelerators implement competent AQM with cheap traffic shaping they do not help to fight bufferbloat. Would be nice if one of these would implement cake in "hardware", but I would not hold my breath....


Why is OT. However, in brief:

  • is ath10k-ct (for qca9980) a bad wifi stack? I'm testing an alternative ATF implementation that may work for my device - if you're interested, the patch is in the same branch as the sch_arl commit linked above - see also this thread.
  • does ATF "shape" traffic in the client to AP direction? I already know I can use a TBF on wifi clients to "fix" the "client -> AP netperf issues" I observe - using a TBF on all my clients is not a solution I want. As I indicated above I'm pretty sure a better solution exists, I'm still trying to find it.
  • Why are ath10k users disabling AQL to "fix" (latency?) issues with video conferencing (here, here, and disabling ATF to fix latency issues summarized here)? These may be client specific and I'm not (yet) convinced ATF/AQL is the root cause. Ironically, issues with video conferencing are why I got started on this. I did try disabling AQL, but apparently I didn't test long enough, combined with being distracted by my netperf observations and not knowing that ATF (at least the virtual-time-based scheduler) by default does not work on my ath10k device.

Not really, routes come and go all the time. Sometimes they even oscillate.

As speeds increase, such as 1 Gbps or more, you can be a lot more sloppy with the shaping and still maintain good performance. For example, in 1 ms you can send 1 Mb of data, which is at least 83 packets at 1500-byte MTU.

At 10Gbps it's 830 packets.

But at 10 Mbps it's less than 1 packet. So if you try to maintain jitter on your game stream of around 1 ms, you have to be extremely clever at 10 Mbps and you can be pretty stupid at 1 Gbps.
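The arithmetic above, as a quick script (same numbers, just computed):

```python
# How many 1500-byte packets fit into a 1 ms scheduling window at
# various link rates -- the "sloppiness budget" of the shaper.

MTU_BITS = 1500 * 8  # 12,000 bits per full-size packet

def packets_per_interval(rate_bps, interval_s=0.001):
    return rate_bps * interval_s / MTU_BITS

for rate in (10e6, 1e9, 10e9):
    print(f"{rate/1e6:>6.0f} Mbit/s -> {packets_per_interval(rate):7.1f} packets per ms")
```

This yields roughly 0.8 packets per millisecond at 10 Mbit/s, 83 at 1 Gbit/s, and 833 at 10 Gbit/s, matching the figures quoted in the post.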

At speeds over 500Mbps it seems like something like a token bucket for rate control, hashing to something like 11 qfq classes and then a simple fifo under each class with say 10 packets in the fifo would likely work fine, none of the sophistication of cake is likely strictly needed. Such things can be done in hardware too. It could be built into a switch
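A toy userspace model of that scheme, purely to illustrate the mechanics (the 11 classes and 10-packet FIFOs are the illustrative numbers from the post, not a tested recommendation; a real implementation would live in a qdisc or in switch hardware):

```python
from collections import deque

# Toy model: one token bucket for rate control in front of a small set
# of hash-selected FIFOs with tail drop. No per-flow AQM, no codel --
# deliberately none of cake's sophistication.

NUM_CLASSES = 11
FIFO_DEPTH = 10

class TokenBucketFifos:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)
        self.last = 0.0
        self.fifos = [deque() for _ in range(NUM_CLASSES)]

    def enqueue(self, flow_hash, pkt_len):
        """Hash the flow into a class; tail-drop if that FIFO is full."""
        q = self.fifos[flow_hash % NUM_CLASSES]
        if len(q) >= FIFO_DEPTH:
            return False
        q.append(pkt_len)
        return True

    def dequeue(self, now):
        """Refill tokens, then release the first eligible packet found."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        for q in self.fifos:
            if q and q[0] <= self.tokens:
                self.tokens -= q[0]
                return q.popleft()
        return None

tb = TokenBucketFifos(rate_bytes_per_s=125_000, burst_bytes=3000)  # ~1 Mbit/s
tb.enqueue(flow_hash=42, pkt_len=1500)
print(tb.dequeue(now=0.0))  # -> 1500
```

The hashing gives approximate flow isolation (a bulk flow can only fill its own 10-packet FIFO), while the shallow FIFOs bound the worst-case per-class queueing delay, which at 500+ Mbit/s is already small in absolute terms.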


About WireGuard I just meant I have one WireGuard peer, so a huge amount of my traffic is with a single end point. I imagined that might make RTT assessments from the packet headers easier to work with as a means to inform shaper rate decisions, as compared to a mixture of different streams that come and go and wax and wane. I have the feeling that so many end points coming and going would make things really hard to work with compared to regular periodic pings.

Just speculation on my part though as my understanding is way too limited.

Would be nice if WireGuard facilitated congestion control? I mean since it's end to end wouldn't that be fairly easy to add in? That would be an added benefit to VPN. That you get bufferbloat under control as well as encryption without paying for VPS and as part of relatively cheap VPN package. Albeit that doesn't handle mix of VPN and non-VPN.

Am I mistaken in thinking VPNs are cheaper than VPS? If so (and maybe anyway) I should ditch NordVPN and get VPS. NordVPN gives very high bandwidth VPN without any latency issues. So I don't see any performance hit with my relatively high latency medium bandwidth LTE. They have tons of servers. I wonder if easy to get VPS that would give same performance (I mean I would need to have it run WireGuard to give me encryption - to stop Vodafone's throttling, not to sell drugs - and then I'd ideally have same performance)?

@moeller0 with the LTE <-> VPS end-to-end route is there a different already existing protocol to address this situation that would beat what I have put together already?

pping, Kathleen Nichols' tool you linked to above, relies on having timestamp information in the packets, which is true for some TCP flows (those with RFC 1323 timestamps enabled, which generally is a good idea) but is not true for UDP, and since WireGuard uses UDP exclusively, pping will not help.
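For the curious, this is the header field pping-style passive measurement leans on. A hedged sketch of extracting the RFC 1323 timestamp option (kind 8) from a raw TCP header; matching a TSval seen outbound against the TSecr echoed back yields an RTT sample without any probe traffic. The synthetic header at the bottom is constructed by hand purely for illustration.

```python
import struct

def tcp_timestamps(tcp_header: bytes):
    """Return (TSval, TSecr) from a raw TCP header, or None if absent."""
    data_offset = (tcp_header[12] >> 4) * 4   # header length in bytes
    options = tcp_header[20:data_offset]
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:          # end of option list
            break
        if kind == 1:          # NOP padding
            i += 1
            continue
        length = options[i + 1]
        if length < 2:         # malformed option, bail out
            break
        if kind == 8 and length == 10:
            return struct.unpack("!II", options[i + 2:i + 10])
        i += length
    return None

# Minimal synthetic 32-byte TCP header: data offset 8 (i.e. 32 bytes),
# options = NOP, NOP, timestamp(TSval=1000, TSecr=2000).
header = bytes(12) + bytes([0x80]) + bytes(7) \
    + bytes([1, 1, 8, 10]) + struct.pack("!II", 1000, 2000)
print(tcp_timestamps(header))  # -> (1000, 2000)
```

This is also why encrypted or encapsulated transports defeat the approach: WireGuard's UDP payload hides any inner TCP header, so there is nothing for a middlebox to parse.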

How?

Again how? Wireguard tries to be agnostic here and standard congestion signals (dropped packets) are just "perceived" by the tunneled flows as if there was no tunnel. IMHO the best wireguard can do is not get in the way of end-to-end congestion control (after all, it is generically not guaranteed that the wireguard tunnel is fully e2e).

Yeah congestion control needs to be end to end, so a VPN is the wrong place for this... not sure how VPN versus VPS matters in this respect...

Not that I know of, you all are pretty much at the leading edge of this...

Sorry I am exposing my ignorance here, but my thinking was that WireGuard is necessarily end-to-end in the sense that it sets up a tunnel between two end points. And I was thinking that could be leveraged to facilitate maintenance of low latency, encrypted traffic flow between the end points. But it seems I am missing something. Doesn't one generally want tunnels that are not allowed to get subjected to bloat?

But you are sending traffic from your end devices to servers in the internet, so only part of the path goes through the wireguard tunnel, and hence it is not end to end... because congestion control requires active signalling between sender and receiver.

The only way to do this is if the true bottleneck participates and indicates congestion...

Sure, but quite a lot of us also want a $PONY to go with such a tunnel, and as desirable as that might be, it does not get any more realistic :wink:

That said, all we really need is competent AQM at the bottleneck that does a good job of signaling congestion or looming congestion quickly (and endpoints that respond quickly to that)... doing that at a different place than the true bottleneck requires some gymnastics, as you figured out in the course of this thread :wink:

Has this actually been thought of to some extent - see here:

https://lore.kernel.org/all/CAHmME9oevqa0+pPhSVjNqGFOSZkwctUB2U=19xvebRrxscJFaQ@mail.gmail.com/T/

fq_codel Integration

In order to combat buffer bloat, WireGuard could benefit from integrating the fq_codel algorithm and kernel-library, for managing packet queues and parallelism. There is much related work in the kernel to base this on; in particular, many wireless drivers take the same technique using the same library.

Or are these along slightly different lines?

I would guess that this might help, but only in egress direction... I think you are already harvesting the low hanging fruit by having cake configured such that it will see the flow identity for wireguard encrypted packets via some skb marking... That is sweet but putting fq_codel into wireguard itself would still not solve the "shape for a remote bottleneck" part where the problem is that there is zero back pressure from the bottleneck and without such back pressure the AQM has no idea when/how much to send.

Does that only apply where the bottleneck is outside the tunnel? Won't the bottleneck mostly be inside the tunnel? I mean say between my LTE and a NordVPN server the bottleneck is inside that path.

(WG END POINT) MODEM <-> [BOTTLENECK] <-> ISP
ISP <-> NordVPN (WG END POINT)

NordVPN to rest of the world is way faster.

Does what exactly apply? Sorry I am a bit dense today...

Yes, but what consequence does that imply?

I was thinking about:

I thought this meant it wouldn't work provided the bottleneck is remote (which I thought meant outside the tunnel). And I was thinking that since the bottleneck is within the path between the tunnel end points, the end points could communicate to solve the bufferbloat, since it is a protocol and there is information flow between the end points.

But I'm sure I'm not stitching together enough points in my mind.

With example of:

[Netflix] -> NordVPN [WG END POINT 1] -> MODEM [WG END POINT 2]

If WG tunnel knows or can estimate how much capacity exists between the tunnel end points, can the tunnel not do stuff with the packets to stop Netflix sending too fast. Like end point 1 drops packets before sending them along to end point 2. Or something. Sorry this is probably really stupid and I am no doubt testing your patience. But I am keen to understand this stuff better.

It's not. Or said another way, I may be just as stupid as everyone else.

How could it know? If I follow correctly, WG, as a result of using UDP, has even less chance of knowing what the tunnel throughput limits are. In my simplistic world (where everyone gets a $PONY), if it could know that, then it could drop packets and slow down the tunnel.

EDIT I believe I misstated a mantra above. I think it should be "If you can't measure, you can't control."

OK I think I am getting it.

[Netflix] -> NordVPN [WG END POINT 1] -> MODEM [WG END POINT 2]

During bufferbloat won't packets traverse [WG END POINT 1] much faster than the rate at which packets traverse [WG END POINT 2]. Doesn't that delta indicate [WG END POINT 1] should stop sending so quickly, so that the packet rate of traversal at [WG END POINT 1] matches packet rate of traversal at [WG END POINT 2]?

Because where there is a delta there is queuing. And we don't want queuing.
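Exactly that intuition, as a quick back-of-the-envelope (rates and duration are illustrative): a sustained arrival/drain mismatch accumulates as a standing queue, and queue divided by drain rate is the extra delay every packet pays.

```python
# If traffic arrives at endpoint 1 faster than it drains at endpoint 2,
# the difference piles up as backlog; backlog / drain rate = added delay.

def standing_queue_delay_ms(arrival_bps, drain_bps, duration_s):
    """Extra queueing delay after `duration_s` of sustained mismatch."""
    backlog_bits = max(0.0, (arrival_bps - drain_bps) * duration_s)
    return backlog_bits / drain_bps * 1000

# Sender pushes 25 Mbit/s into a 20 Mbit/s bottleneck for one second:
print(standing_queue_delay_ms(25e6, 20e6, 1.0))  # -> 250.0 ms of bloat
```

Which is why the delta matters: detecting and removing the mismatch quickly is the whole game, since the delay grows linearly for as long as the mismatch persists.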

I appreciate patience here because my understanding is rather limited, but I am eager to understand.


The tunnel endpoints would actively need to do so (actively communicate in a timely fashion about the observed throughput and maybe delay), currently wireguard does not as far as I understand.

Yes it could if it knew the instantaneous capacity, but if we knew that we could simply adjust whatever shaper is used....

But I take it your point is a tunnel would be in a good position to perform the required continuous throughput and delay measurements and distribute that information between the endpoints? With that I agree; now, how to convince Jason to tack this onto wireguard, since it appears outside of wireguard's core mission....

Yes, but each sending wireguard side needs the instantaneous receive rates of the other, which to my knowledge is not something wireguard distributes between the sides.

Yes I was thinking there is already communication between end points so couldn't that be leveraged to kill two birds with one stone.

Is it not at least somewhat congruous with the points about bufferbloat listed here:

Wouldn't it be funky if WG helped encrypted videoconferencing work well by coordinating packet flow rates between end points.

Or at least, since UDP does not include the packet header information needed, WireGuard could put in a timestamp or something to help restore that functionality. Just thinking out loud.