Adjusting CAKE's RTT automatically based on real-world network conditions using DNS request latency

This is another approach to adjusting CAKE's bandwidth and rtt parameters automatically, inspired by and adapted from the cake-autorate project by @Lynx.

Right now it's probably not an all-in-one solution. Other similar implementations are already great, but I'd say they still push users to understand which parameters to change.

The goal is ease of use for anyone who's already familiar with dnscrypt-proxy. From the user's perspective, they should only have to worry about setting the interface names and the min/max UL/DL limits for adjusting CAKE's shaper.

It measures RTT based on the DNS latency for the specific website the user is trying to visit, detects whether there's an increase in latency, and sets the bandwidth and rtt before the user loads the other web assets (HTML, CSS, JS, images, etc.).

Another thing some of our users noticed is that CAKE sometimes causes ping spikes when bandwidth is set to a fixed value. According to the SHAPER PARAMETERS section of the man7 page, "It will automatically burst precisely as much as required to maintain the configured throughput," so I guess that "burst" is causing the ping spikes?

Setting the shaper to unlimited did solve the ping spike issue according to our users, but I believe it won't give the full benefits of CAKE, especially the shaper, no?

Some of our users said that the ping spike issue is mostly solved when the bandwidth and rtt parameters are set dynamically. I believe setting the rtt value based on the real RTT for every DNS request also helps the AQM respond better to latency than using a fixed value does.

The problem is, there's no way for CAKE to know that kind of information unless we provide it from somewhere, in real time. Thus, the easiest way to get the user's RTT at that exact time is to have the DNS server running locally on the user's machine feed the latency information to CAKE, since DNS latency will vary depending on which website the user is trying to visit at that time.

I hope anyone who's familiar with Go, and has the time to try it, can give feedback on whether this implementation is helpful, and perhaps also review the logic in the code, in case anything needs to be adjusted to achieve better results while still keeping it simple for general users.


@galpt seems exciting. Does it work? I like the fact that this would be mostly passive rather than requiring additional probes. Though I wonder whether it will be reliable - are the DNS lookups frequent enough to provide the information needed? In cake-autorate we converged on 20Hz ICMPs (using multiple reflectors).

@moeller0 how about it - using all DNS requests to monitor latency? Will the frequency of the lookups be sufficient?

In any case it’s great to see a new approach. And go is fast.

Cake's rtt keyword changes the CoDel interval it uses, that is, the time persistent delay needs to stay above target (which cake automatically calculates as 5% of rtt). rtt should be in the order of magnitude of the true internet RTT but does not need to be terribly precise. Setting rtt very low will result in a considerably higher likelihood of dropping/marking packets and should mainly affect intra-flow delay and per-flow throughput.
Changing that rapidly based on DNS latency seems like a rather odd thing to do. I have no theory for how that would avoid latency spikes, unless cake is always close to running out of CPU at the configured rates; then this might help by reducing the load through dropping much harder. However, in such a case getting a faster router, or reducing the shaper rate, would be a more direct way of addressing the issue.

That will depend on the frequency of new requests; dnsmasq, after all, maintains a local DNS cache, so the frequency of external DNS requests will be variable. My gut feeling is that this is not a good latency source for cake-autorate, but you would need to look at the statistics on a live router to make a real assessment.

An interesting question is also how much variability there will be in those DNS requests. If one uses DoH or similar, one still mostly queries a single upstream caching name server, so most DNS requests will have a uniform delay, namely the path RTT between the local DNS proxy and the upstream server. That makes it decent to use as a delay probe for autorate (constant baseline delay) but will do little to change the RTT on the fly (which IMHO is not a bad thing, as playing with the codel interval frequently might not be a good idea).
Conceptually, we would like to know the true RTT for the flows currently in each hash bucket so we could optimise the signal schedule for each flow, but since there is no universal non-gameable way to deduce per-flow RTT quickly and cheaply, cake opts for a common RTT value for all flows. I am not convinced that changing that value around aggressively is all that useful, but a point can be made for reducing the common RTT value if most traffic is nearby. Whether to drive this from DNS lookup times is a question I cannot answer fully, but my gut feeling is that there is only a loose coupling between DNS lookup times and the optimal common RTT setting for all flows. But hey, if this performs well enough for its users and they are happy, I am happy too.

It does work.. at least on our server :sweat_smile:
As far as reliability goes, it reacts really fast when there's an increase in the DNS latency of a request and immediately reduces the bandwidth.
By default it tries to increase the bandwidth to the maximum specified by the user, and reduces it by n percent whenever it detects a latency increase. But it needs better logic since it can't rely on frequent lookups, and we're currently testing everything we can think of..

I'm personally trying to find a different approach that might be similar to cake-autorate but at the same time not so similar..(?)

We did try cake-autorate, but it felt like our server got throttled after running it for some time (i.e. usually the speed is in the Gbps range, but somehow it dropped to several Mbps, even with the script stopped), and after we stopped using it for a while, the network speed recovered to Gbps again. I guess it's from the "frequent" lookups or something.. but we weren't sure if that's the case, so maybe it's not..

Cake's built-in autorate-ingress is cool ngl, but unfortunately it doesn't feel like it reacts fast enough (at least while doing speedtests), so most of the time it feels like it slows down the network (i.e. from 100Mbit, I got only 20-30Mbit using it). It works surprisingly well on a busy server that's pushing 30MB-200MB of data every second, though.. but that's almost impossible to see on a typical home network.

The approach I'm currently trying is to make it "intelligent" enough to work for everyone, with as little complexity as possible to be used by the general users.

triple-isolate kind of inspired me too, I guess.
It's not "strict" like the dual- options, but it gets the job done for the general users.

I think users shouldn't even need to specify a Minimum or Base bandwidth limit; if it's going to be "intelligent", it should predict the Minimum and Base bandwidth automatically too.
Since most users know how much speed they're buying from their ISPs, an implementation where users only have to specify the uplink/downlink interface names and the max DL/UL limits as advertised by their ISPs sounds like a good idea to me :slightly_smiling_face:

Btw, it does reduce the bandwidth really fast, but recovering when the bandwidth is already too low (i.e. your maximum limit is 4000mbit and the minimum is set to 1000mbit) does take some time. When many users are actively sending DNS requests and visiting different websites, the DNS latency varies a lot, so it keeps reducing cake's bandwidth until it hits the specified minimum, but then it fails to recover to the maximum because the reductions happen really fast. So I guess faster recovery should be considered too.

Yeah true..

I'm thinking of refactoring the code and the logic right now.
The current code works fine on our server, but I saw it set the rtt to 6us, which doesn't feel right. That's the real data it received from dnscrypt-proxy, so the code does function as expected; it's just that the logic isn't really correct, and we need to work on that.

According to the man7 page, the metro option is "For traffic mostly within a single city," so does that mean it can be used for shaping an Internet link?

If it's not then how about a logic where it sets cake's rtt dynamically, ranging from 100ms - 300ms?

I guess for any latency under 100ms, the code should reset cake's rtt back to 100ms; for anything over 300ms, it should set rtt 300ms; and otherwise just use whatever RTT it got from the DNS latency (i.e. if the DNS latency shows 110ms, use that as cake's rtt)?

What about the satellite 1000ms option? Does it even make sense to set the rtt to more than 300ms? :thinking: The only time I've ever been in that kind of situation was on a public wifi that was heavily throttled; doing a speedtest showed 1000ms+ latency.

Mmh, consider the minimum rates a policy question: below which rate will you value higher throughput over lower latency?
I think each network admin will arrive at a policy decision that works for the local network...


Well, it needs to avoid cache hits... and keep in mind that without a special kernel, rtt should likely be >= 10 ms, as otherwise the jitter of kernel timing will cause cake to throttle throughput too much without delivering the expected low latency...

You can use anything you want, but the choice has consequences :wink:
Cake's 'rtt' keyword configures what CoDel calls the 'interval' that together with the 'target' controls the drop/marking schedule...
If the minimal experienced sojourn time of packets stays > target for a duration of interval, CoDel/cake will enter the drop/mark schedule and drop/mark the next eligible egressing packet, AND at the same time the interval will be reduced, so the next drop/mark will be scheduled sooner... as long as the sojourn times stay above target, subsequent drops will be scheduled at shorter and shorter intervals. Cake and fq_codel maintain an individual interval parameter and drop/mark state per hash bin...
Now, the way this works ideally is we drop/mark a packet and give the affected flow time to actually react to that 'signal', and the quickest we can expect changed behaviour is one RTT after we sent the signal. That is why it is recommended to set rtt/interval to around the typical RTT for most flows, it turns out that getting this parameter absolutely right is less important, but getting the order of magnitude right is more important.
If we set this too low, we will signal flows to 'slow down' too aggressively, which will be visible in single flow transfers not being able to utilise the full link capacity.
If we set this too high, the effect is that inside each hash bucket we will see more intra-flow or self-congestion, but flows staying below their share will not suffer from this. (But note the way cake reports peak and average latency these will be dominated by self congesting flows even if anything else like game traffic just zips along unimpeded).

If you have truly long RTT flows in the 300ms range sure, however for a more common RTT distribution with a peak somewhere between, say 50 and 150 ms, the default 100 works quite well...
(Sidenote: above 200ms TCPs tend to become unhappy; some can push this a bit further, but e.g. for 600ms geostationary satellite latencies people often use tricks to make TCP less unhappy*)

Well, if that 300ms truly reflects the RTT to the target servers, then setting RTT 300 might be justified (assuming that maximal throughput to these servers is desired), but if 100ms is just internal processing delay from the queried DNS server then I would argue 300 would be too much...

Now, if all your important traffic truly has an RTT of ~10ms, by all means set 'rtt 10', as long as you are aware of the trade-off you are making.

Well, if the true RTT is ~1000ms, then setting it to 300 will result in overly aggressive dropping, which in turn will limit the maximal throughput of single flows (which might well be acceptable, as many data transfer applications already use multiple concurrent flows).

On public WiFi there is little you can do from your device anyway :wink:

*) By blatantly lying to the TCP, while being prepared to clean up the consequences of these lies...


Wow, thx for the detailed answer!
I refactored the code almost an hour ago and decided to go with the 100ms - 300ms rtt range :slightly_smiling_face:
Since dnscrypt-cake relies on users' DNS lookups, I think using estimations should make it work similarly to cake-autorate, and I also added a fast bandwidth recovery kind of thing to help it react fast enough..
I guess it won't be as accurate as cake-autorate, but in case folks want to try a new flavor, it should behave similarly with fewer variables to configure :laughing:


And also potentially better able to run on low-end devices since 'go' is a lot faster than 'bash', right? I also like how this is more passive in the sense that it can leverage already existing and necessary packets rather than relying upon ICMPs.

But the problem with 'organic' probe packets is that you can have a high sustained load without any new DNS queries... and you give up the reflector diversity...

Theoretically, it is!
Go, or any other language that's precompiled to machine code, should theoretically be faster than interpreted languages. So if the source code is compiled for the router's arch, performance is pretty much guaranteed. But Go is not as low-level as C, so it has its disadvantages too if we need to do something that changes system settings.

But Bash is everywhere, so it has its own advantages for users who prefer convenience over the benefits of precompiled langs (i.e. concurrency, memory management, etc.) :laughing: Compiling every time there are new changes to the source code is not really convenient, but that's a small price folks have to pay for the guaranteed performance, I guess..

Also, until now I've only had a Xiaomi 4A Gigabit for OpenWrt; moving stuff around is hard most of the time since storage is very limited, so I prefer bash whenever I need to work with this router :sweat_smile:

I also like how this is more passive in the sense that it can leverage already existing and necessary packets rather than relying upon ICMPs.

Yeah, it's heavily inspired by cake's built-in autorate-ingress too, in that it doesn't rely on ICMP-type probes; but then it can only estimate, kind of giving up accuracy compared to frequent probing. So to keep up on that, we either find a way to do really good estimations or find a way to react even faster using something other than the user's DNS latency.


Surprisingly, a single user can send up to 100 DNS queries or even more every time they load a web page (i.e. the ones with lots of images or other web assets). That's before counting the apps outside the browser that also send DNS queries every n seconds (messaging apps, and entertainment apps like instagram, tiktok, etc. do this a lot!). So we're lucky that there should be at least 10 or more new DNS queries per second from each device in a typical home network. Unless the users live alone and only have a limited number of apps installed on their phones or other devices; then we might need to address that too from the very beginning..

As for the reflector stuff, I guess a more passive approach does seem to need a different way of identifying bufferbloat, and this is probably one of the challenging parts of the whole implementation :melting_face:

But that operates on a completely different principle, in that it does not measure 'delay' but inter-packet spacing, which, knowing the packet size, can be directly translated into capacity. While this sounds attractive, the devil is in the details, and it conceptually only works for the download direction... I would put your go approach much closer to cake-autorate than to cake's autorate-ingress, in that it operates based on delay measurements...


Yes, and I am sure most of the time there will be sufficient background chatter, however that is not guaranteed, and especially it is decoupled from throughput loads, like a large download that might do a DNS query once and then go its merry way saturating the link for extended periods of time....


This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.