CAKE w/ Adaptive Bandwidth

Not sure. I used to use WireGuard for everything - hence all UDP - and an appropriate shaper rate made an enormous difference.

Hey! I think this has been a productive and interesting conversation.

That's because the feedback mechanism still exists for the TCP streams that are encapsulated inside your WireGuard tunnel: the shaper drops WireGuard packets, so after decryption some TCP packets are missing and never ACKed, and the sender therefore slows down.

For Google Meet (used at my previous job), yes. I could even turn the video off completely. For Discord, apparently not - though it does appear to reduce the video quality after some time.

Jitsi Meet appears to auto-tune its bandwidth use reasonably well when the other party uses Chrome, but not when they use Firefox.

Anyway, we have so many disagreements about policy that I think the best move for me is to unsubscribe from this thread and quietly use a modified version (or, eventually, a rewrite that I am working on) that does the unsafe things that happen to work on my (too-unusual and unsupportable) LTE connection.

One last P.S. regarding the "stall-detection" feature that was added because of my connection: if you have no other way to test it, and no other users, I would not object to its removal from the official version.

Me telling @patrakov what he already knows is not really helping with his rotten LTE situation... :wink: so I wanted to signal that I am done ranting and back to trying to find ways of making things better.

Because in the end it helps nobody to note that the application he has to use is not really suited to his situation (and he will not be alone in that; I bet a number of developers simply assume X Mbps to be the lowest rate they ever need to deal with, and users on < X links are left out in the cold).

This is different: WireGuard does not need to do congestion control, because all of the traffic it carries already needs to. If a WireGuard packet is dropped, the payload packets it carries are dropped as well, and the respective unencrypted flows are expected to respond to that signal...

The same is true for UDP; the only difference is that the response to encountered congestion is not an integrated part of the protocol stack, and hence applications can and do get this wrong... Using UDP is not a "get out of the requirement to avoid congestion collapse" card for internet protocols/applications :wink:

Well, policy is something you set, and if you are happy with it I will shut up. My beef is with making proposals like "trusting the achieved rate" unconditionally when we have no empirical data supporting that and a lot of already-known conditions under which this is not going to work as desired, because empirically achieved_rate != instantaneous bottleneck_rate.

Well, if you do and it works, by all means let @anon10117369 know; unlike me, he is quite open to exploring new avenues and trying things. (This being his project, you can also simply ignore me :wink: )

For me this triggers when delay probes are lost during low-load conditions, but honestly that probably mainly means I have misconfigured the stall_rate so that it triggers too often... so clearly my fault. The biggest issue with that feature is mostly that it revealed issues in our conceptual concurrency handling, but those are fixed in 2.0 as far as I can tell from the outside. Removing this feature (which can essentially already be disabled) seems not a good idea (in spite of my earlier ranting) compared to making sure it actually works as described :wink:

More experiments with @Lochnair's new and updated tsping binary, as follows.

Timecourse:

Raw CDFs:

This is using:

reflectors=(
94.140.14.15 94.140.14.140 94.140.14.141 94.140.15.15 94.140.15.16 # AdGuard
64.6.64.6 64.6.65.6 156.154.70.1 156.154.70.2 156.154.70.3 156.154.70.4 156.154.70.5 156.154.71.1 156.154.71.2 156.154.71.3 156.154.71.4
208.67.222.222 208.67.220.2 208.67.220.123 208.67.220.220 208.67.222.2 208.67.222.123 # OpenDns
185.228.168.168 185.228.168.9 185.228.168.10 185.228.169.11 185.228.169.9 185.228.169.168 # CleanBrowsing
149.112.112.112 9.9.9.10 9.9.9.11 149.112.112.10 149.112.112.11 # Quad9
)

from @tievolu's list here.

Funny, in the upload OWDs we see the steps from the ICMP timestamp resolution of one millisecond.

I wonder why our reflector recycling code did not (yet?) replace the reflectors with > 200ms RTT?
Would be interesting to see a bidirectional load test.

Do you have flent/netperf installed somewhere?

Also, during the upload tests the upload OWD seems to stay flat, but the download OWD increases ever upward; that looks suspicious.

Exactly what I was wondering. Let me look at the log manually.

REFLECTOR; 2023-03-11-16:02:57; 1678550577.155771; 1678550577.154974; 185.228.169.168; 36160; 114028; 77868; 10000; 7529; 68089; 60560; 10000; 26006; 38005; 11999; 10000; 1100; 1593; 493; 10000
DEBUG; 2023-03-11-16:02:57; 1678550577.158613; Warning: reflector: 185.228.169.168 dl_owd_baseline_us exceeds the minimum by set threshold.
DEBUG; 2023-03-11-16:02:57; 1678550577.160692; Starting: replace_pinger_reflector with PID: 5064
DEBUG; 2023-03-11-16:02:57; 1678550577.166323; replacing reflector: 185.228.169.168 with 208.67.222.2.

185.228.169.168 was kicked.

REFLECTOR; 2023-03-11-16:03:57; 1678550637.363465; 1678550637.361214; 185.228.169.11; 39200; 113017; 73817; 10000; 10329; 81440; 71111; 10000; 26166; 38204; 12038; 10000; 2567; 2905; 338; 10000
DEBUG; 2023-03-11-16:03:57; 1678550637.366672; Warning: reflector: 185.228.169.11 dl_owd_baseline_us exceeds the minimum by set threshold.
DEBUG; 2023-03-11-16:03:57; 1678550637.371758; Starting: replace_pinger_reflector with PID: 5064
DATA; 2023-03-11-16:03:57; 1678550637.379783; 1678550637.379100; 4487; 29908; 89; 85; 1678550637.3622590; 149.112.112.11; 134; 40476; 84000; 10329; 43524; 32400; 26170; 31000; 2706; 4830; 30342; 6; 0; dl_high_bb; ul_high; 5000; 35000
DEBUG; 2023-03-11-16:03:57; 1678550637.381509; replacing reflector: 185.228.169.11 with 156.154.70.4.

And so was 185.228.169.11.

So the replacement code works, right?
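
The warning message implies roughly the following kind of check. This is a hypothetical shell sketch, not the actual cake-autorate code; the threshold name and value, the helper name and its argument list are all made up for illustration:

# Hypothetical sketch of the check the warning above implies - not the real
# cake-autorate implementation; names and the threshold value are invented.
reflector_owd_baseline_delta_thr_us=30000   # assumed threshold (microseconds)

maybe_replace_reflector()
{
	local reflector=${1} dl_owd_baseline_us=${2} min_dl_owd_baseline_us=${3}

	# A reflector whose download OWD baseline sits far above the best
	# (minimum) baseline across all reflectors is a poor reference for
	# detecting bufferbloat, so it gets swapped out for a fresh one.
	if (( dl_owd_baseline_us > min_dl_owd_baseline_us + reflector_owd_baseline_delta_thr_us ))
	then
		echo "Warning: reflector: ${reflector} dl_owd_baseline_us exceeds the minimum by set threshold."
		# ... at this point the replace_pinger_reflector routine seen in the log would take over
	fi
}

# Illustrative call; the numbers are made up:
maybe_replace_reflector "185.228.169.168" 26000 1100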

But yes the plots reveal something a bit dodgy. Any idea what's going on?

@moeller0 ah maybe @Lochnair changed the columns in this new tsping version?

while read -r -u "${pinger_fds[pinger]}" timestamp reflector seq _ _ _ _ dl_owd_ms ul_owd_ms _
root@OpenWrt-1:~/cake-autorate# tsping -D -m 9.9.9.9
Starting tsping 0.2.3 - pinging 1 targets
1678552468.211457,9.9.9.9,0,59668164,59668182,59668182,59668211,47,29,18
1678552468.310463,9.9.9.9,1,59668265,59668282,59668282,59668310,45,28,17
1678552468.410431,9.9.9.9,2,59668365,59668382,59668382,59668410,45,28,17
1678552468.518550,9.9.9.9,3,59668465,59668481,59668481,59668518,53,37,16
1678552468.620138,9.9.9.9,4,59668565,59668587,59668587,59668620,55,33,22
1678552468.745638,9.9.9.9,5,59668666,59668711,59668711,59668745,79,34,45
1678552468.835267,9.9.9.9,6,59668766,59668802,59668802,59668835,69,33,36
1678552468.918931,9.9.9.9,7,59668866,59668890,59668890,59668918,52,28,24
^C

No, that seems OK, right? I don't understand what the very last column ('Finished') relates to.

Here is a plot of the OWDs without those bad CleanBrowsing reflectors:

Just the OWDs:

@moeller0 is my max upload 35Mbit/s just way too low then?

				printf(FMT_OUTPUT, ip, result.sequence, result.originateTime, result.receiveTime, result.transmitTime, result.finishedTime, rtt, down_time, up_time);

So the last column is truly the upload/send direction and the next-to-last the download/receive direction... I guess I would naturally have ordered them up first, then down, but this looks consistent.

But your extraction seems simply to be wrong:

timestamp reflector seq _ _ _ _ dl_owd_ms ul_owd_ms _

should be

timestamp reflector seq _ _ _ _ _ dl_owd_ms ul_owd_ms

Was it changed? Or did I just mess that up from the start? Yikes!


Finished is the timestamp from receiving the return packet sent from the remote side containing the first 3 timestamps again; the final timestamp gets added once the processing for that timestamp request is finished, which I would guess is the rationale for the name. I prefer the following description:

TSPING: result.originateTime, result.receiveTime, result.transmitTime, result.finishedTime
LOGICAL: local_send, remote_receive, remote_send, local_receive.
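
To double-check that mapping against the sample run above, the OWDs can be recomputed from the four raw timestamps of one "-m" line. This is just a quick illustrative shell snippet, not part of cake-autorate; the variable names follow the LOGICAL naming, the sample line is copied from the tsping output above, and the raw "-m" output is comma separated, hence the explicit IFS:

# Recompute the OWDs of one tsping "-m" line to confirm the column order.
line="1678552468.211457,9.9.9.9,0,59668164,59668182,59668182,59668211,47,29,18"
IFS=',' read -r ts ip seq local_send remote_receive remote_send local_receive rtt dl_owd_ms ul_owd_ms <<< "${line}"

echo "ul_owd_ms = remote_receive - local_send  = $(( remote_receive - local_send ))"   # 18, the last column
echo "dl_owd_ms = local_receive  - remote_send = $(( local_receive - remote_send ))"   # 29, the next-to-last column
echo "rtt       = dl_owd_ms + ul_owd_ms        = $(( (local_receive - remote_send) + (remote_receive - local_send) ))"   # 47

That reproduces the 47, 29, 18 at the end of that line, which matches the corrected read order: five ignored fields after seq, then dl_owd_ms, then ul_owd_ms.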

No idea... I am still far away from implementing new ping_wrappers in my version.

With:

while read -r -u "${pinger_fds[pinger]}" timestamp reflector seq _ _ _ _ _ dl_owd_ms ul_owd_ms

Timecourse:

Raw CDFs:

What are we seeing here(?):

... not enough saturation?

There still seems to be something wrong: during downloads both OWDs seem to go up, which is not what we expect.

We are still pulling the UL shaper down even though there seems to be no upload data flowing during the download tests. Yes, we see increased UL OWDs during the download, but I wonder whether these are real or whether we have some piece of code that confuses UL and DL somewhere?

Unless I messed up again, no :wink:
I did change the order for the default output to match "-m" mode, but "-m" format should be the same.

BTW, here is a link:

to a response by Apple's Stuart Cheshire addressing the TCP/UDP question we were partly discussing. I feel I did not make the point I wanted to make as eloquently and clearly as Stuart (no surprise here :wink: ), and I hope that his argument makes things clearer, including why Discord seems to be to blame here (not that that helps; and if autorate can grow a config option that accommodates Discord in situations like his, it should).


@anon10117369 Quick question: have you handled the midnight rollover problem for OWDs in your code?
I recall that was something we ran into in the Lua effort.

Not explicitly. Perhaps it's wishful thinking to expect that the existing baseline tracking and working with deltas will cope with that. @moeller0? Since tsping outputs down and up OWDs, what will those look like across the midnight rollover?

I am sure this will need a little care, but let's postpone tackling that part until we have tsping working well otherwise. By virtue of being close to the UTC time zone, Europeans are unlikely to suffer from this issue immediately. I am not saying to ignore this for good, just to ignore it for now and return to it once the rest of the tsping interaction works smoothly.

Note the issue is twofold:
a) the simple problem of the timestamp cycling back to zero when 86400000 would be expected, resulting in wildly off measurements for our baseline tracking. E.g. 0 - 86400000 = -86400000, which is certainly an unexpectedly large offset that would throw our baseline tracking off course. But that should be relatively easy to ignore or correct (see the sketch below).
b) the other problem is when the clocks of the two endpoints are badly synchronized and hence the timestamps do not "flip" over close in time but with considerable delay; in that case the offset (our baseline) will change and we would need to update our baseline estimate to account for that... but even that should be easy to detect, after all we roughly know which reported timestamp range is potentially problematic...
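
Something along these lines might be enough for a). This is an untested shell sketch, not cake-autorate code; the constant 86400000 is simply the number of milliseconds per day (the ICMP timestamp rollover point), everything else is named for illustration only:

# Untested sketch for case a): ICMP timestamps count milliseconds since
# midnight UTC, so a raw difference jumps by a full day when either clock
# rolls over. Fold such values back into a plausible range.
ms_per_day=86400000
half_day_ms=43200000

unwrap_owd_ms()
{
	# Takes a raw OWD (receive timestamp minus send timestamp) and adds or
	# subtracts one day if the value is off by more than half a day.
	local owd_ms=${1}
	if (( owd_ms < -half_day_ms )); then
		owd_ms=$(( owd_ms + ms_per_day ))
	elif (( owd_ms > half_day_ms )); then
		owd_ms=$(( owd_ms - ms_per_day ))
	fi
	echo "${owd_ms}"
}

# Example: the remote answered just after its midnight rollover, so the raw
# difference 150 - 86399950 = -86399800 unwraps to 200 ms.
unwrap_owd_ms "$(( 150 - 86399950 ))"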


If the raw timestamps aren't being corrected you end up with OWD values that are offset by 86400000 milliseconds in one or both directions, depending on exactly when the reflector's clock resets relative to yours. I deal with four different scenarios:

  1. Our timer has been reset to zero before the request was sent, but the reflector's hasn't
  2. The reflector's timer has been reset to zero before the request was received, but ours hasn't
  3. Our timer resets to zero between sending the request and receiving the response
  4. The reflector's timer resets to zero between receiving the request and sending the response

I handle this in my perl implementation by detecting when an OWD value indicates that the reflector's offset (i.e. the relative difference between the reflector's clock and our clock) has changed by more than the configured ICMP timeout. When that happens I check the values and add 86400000 to the appropriate raw timestamp(s) to try to fix them.

Here's an example from my logs:

Mon Mar  6 23:54:40.882 2023 [12126-0004824721]: ICMP DEBUG: RECEIVE:   WARNING: recv timestamp for "151.80.6.68 34725 13196" too small after applying offet of -441822. Attempting to correct...
Mon Mar  6 23:54:40.883 2023 [12126-0004824721]: ICMP DEBUG: RECEIVE:   WARNING: Local and/or remote timer reset detected and corrected for "151.80.6.68 34725 13196":
Mon Mar  6 23:54:40.883 2023 [12126-0004824721]: ICMP DEBUG: RECEIVE:            Before: ip=151.80.6.68     orig=86080862   recv=122697     tran=122697     end=86080882   ul_time=-85958165   dl_time=85958185    rtt=20
Mon Mar  6 23:54:40.883 2023 [12126-0004824721]: ICMP DEBUG: RECEIVE:             After: ip=151.80.6.68     orig=86080862   recv=86522697   tran=86522697   end=86080882   ul_time=13          dl_time=7           rtt=20

But I work with absolute OWD values and detect the problem when the reflector's offset changes too much. I'm not sure how you'd detect the problem in the bash implementation.


Thanks, yes that is helpful.
I guess 3. and 4. could be dealt with primarily by ignoring such samples, but that really just reduces these cases to special cases of 1. and 2., namely that our baseline estimates change drastically. If they get smaller, we currently should deal with that quickly, but if the apparent baseline increases we will take a while to catch up (which will trade off some throughput but should keep latency fine, though it might result in an extended epoch close to the minimum rates). But I have not actually looked at this closely enough to have more than a hunch and only half-digested information from your post :wink:
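
For the "ignore such samples" idea, something as crude as the following might do. Again only a sketch, not cake-autorate code; the guard-band size and all names are arbitrary assumptions:

# Sketch only: skip samples whose remote timestamps fall too close to the
# midnight rollover, where cases 3./4. (and the messy edges of 1./2.) live.
ms_per_day=86400000
rollover_window_ms=2000   # arbitrary guard band around midnight UTC

near_rollover()
{
	# Succeeds (returns 0) if the given ms-since-midnight timestamp lies
	# within the guard band on either side of the rollover point.
	local ts_ms=${1}
	(( ts_ms < rollover_window_ms || ts_ms > ms_per_day - rollover_window_ms ))
}

# e.g. in the per-sample processing loop one might do:
# if near_rollover "${remote_receive_ms}" || near_rollover "${remote_send_ms}"; then
#	continue    # drop this sample rather than corrupting the baselines
# fi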

Yeah, (1) and (2) are much more common than (3) and (4). I don't think I've ever seen (4) actually happen because the time window is so small, but it is theoretically possible.