CAKE w/ Adaptive Bandwidth [August 2022 to March 2024]

With:

while read -r -u "${pinger_fds[pinger]}" timestamp reflector seq _ _ _ _ _ dl_owd_ms ul_owd_ms

Timecourse: [plot]

Raw CDFs: [plot]

What are we seeing here?

... not enough saturation?

There still seems to be something wrong: during downloads, both OWDs seem to go up, which is not what we expect.

We are still pulling the UL shaper down even though there seems to be no upload data flowing during the download tests. Yes, we see increased UL OWDs during the download, but I wonder whether these are real or whether we have some piece of code that confuses UL and DL somewhere?

Unless I messed up again, no :wink:
I did change the order for the default output to match "-m" mode, but "-m" format should be the same.

BTW, here is a link:

to a response by Apple's Stuart Cheshire addressing the TCP/UDP question we were partly discussing. I feel I did not make the point I wanted to make as eloquently and clearly as Stuart (no surprise here :wink: ), and I hope that his argument makes things clearer, including why Discord seems to be to blame here (not that that helps; if autorate can grow a config option that allows Discord in situations like his, it should).

1 Like

@Lynx Quick question: have you handled the midnight rollover problem for OWDs in your code?
I recall that was something we ran into in the Lua effort.

Not explicitly. Perhaps it's wishful thinking to hope that the existing baseline tracking and working with deltas will cope with that. @moeller0? Since tsping outputs down and up OWDs, what will those look like with the midnight rollover?

I am sure this will need a little care, but let's postpone tackling that part until we have tsping working well otherwise. By virtue of being close to the UTC time zone, Europeans are unlikely to suffer from this issue immediately. I am not saying to ignore this for good, just ignore it for now and return to it once the rest of the tsping interaction works smoothly.

Note the issue is twofold:
a) the simple problem of cycling to zero when 86400000 would be expected, resulting in off measurements with our baseline tracking. E.g. 0 - 86400000 = -86400000, which is certainly an unexpectedly large offset that would throw our baseline tracking off course. But that should be relatively easy to ignore.
b) the other problem is when clocks between the endpoints are badly synchronized and hence the timestamps do not "flip" over close in time but with considerable delay. In that case the offset (our baseline) will change, and we would need to update our baseline estimate to account for that... but even that should be easy to detect; after all, we roughly know which reported timestamp range is potentially problematic...
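For (a), a toy illustration of the wrap arithmetic in bash (the variable names are made up; the only facts used are the 86400000 ms wrap point and a half-day unwrap threshold):

# Timestamps count ms since midnight UTC and wrap at 86400000. A naive
# delta across the wrap is hugely negative; unwrapping by one day when
# the delta is below minus half a day recovers the true difference.
prev_ts=86399990   # sample taken just before midnight
curr_ts=10         # next sample, just after the wrap
delta=$(( curr_ts - prev_ts ))                            # -86399980
(( delta < -43200000 )) && delta=$(( delta + 86400000 ))  # now 20 ms
echo "${delta}"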

1 Like

If the raw timestamps aren't being corrected, you end up with OWD values that are offset by 86400000 milliseconds in one or both directions, depending on exactly when the reflector's clock resets relative to yours. I deal with four different scenarios:

  1. Our timer has been reset to zero before the request was sent, but the reflector's hasn't
  2. The reflector's timer has been reset to zero before the request was received, but ours hasn't
  3. Our timer resets to zero between sending the request and receiving the response
  4. The reflector's timer resets to zero between receiving the request and sending the response

I handle this in my perl implementation by detecting when an OWD value indicates that the reflector's offset (i.e. the relative difference between the reflector's clock and our clock) has changed by more than the configured ICMP timeout. When that happens I check the values and add 86400000 to the appropriate raw timestamp(s) to try and fix them.

Here's an example from my logs:

Mon Mar  6 23:54:40.882 2023 [12126-0004824721]: ICMP DEBUG: RECEIVE:   WARNING: recv timestamp for "151.80.6.68 34725 13196" too small after applying offet of -441822. Attempting to correct...
Mon Mar  6 23:54:40.883 2023 [12126-0004824721]: ICMP DEBUG: RECEIVE:   WARNING: Local and/or remote timer reset detected and corrected for "151.80.6.68 34725 13196":
Mon Mar  6 23:54:40.883 2023 [12126-0004824721]: ICMP DEBUG: RECEIVE:            Before: ip=151.80.6.68     orig=86080862   recv=122697     tran=122697     end=86080882   ul_time=-85958165   dl_time=85958185    rtt=20
Mon Mar  6 23:54:40.883 2023 [12126-0004824721]: ICMP DEBUG: RECEIVE:             After: ip=151.80.6.68     orig=86080862   recv=86522697   tran=86522697   end=86080882   ul_time=13          dl_time=7           rtt=20
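In bash arithmetic, the core of that correction might look roughly like this (a sketch of the idea only; all variable names are hypothetical and the actual perl code differs). The example values are taken from the log above:

# values from the log entry above
orig_ts=86080862 end_ts=86080882   # our send/receive times (ms)
recv_ts=122697 tran_ts=122697      # reflector's receive/transmit times
baseline_offset=441822             # reflector's clock minus ours (the
                                   # log prints the inverse, -441822)
day_ms=86400000
icmp_timeout_ms=1000               # assumed configured ICMP timeout

offset=$(( recv_ts - orig_ts ))
if (( offset < baseline_offset - icmp_timeout_ms ))
then
        # a timer reset made the reflector's timestamps too small:
        # shift them forward by one day before computing the OWDs
        # (the opposite case, our timer resetting, is symmetric)
        recv_ts=$(( recv_ts + day_ms ))
        tran_ts=$(( tran_ts + day_ms ))
fi

ul_owd_ms=$(( recv_ts - orig_ts - baseline_offset ))   # 13, as in the log
dl_owd_ms=$(( end_ts - tran_ts + baseline_offset ))    # 7, as in the log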

But I work with absolute OWD values and detect the problem when the reflector's offset changes too much. I'm not sure how you'd detect the problem in the bash implementation.

1 Like

Thanks, yes that is helpful.
I guess 3. and 4. could be dealt with primarily by ignoring such samples, but that really just reduces these cases to special cases of 1. and 2., namely that our baseline estimates change drastically. If they get smaller, we currently should deal with that quickly, but if the apparent baseline increases we will take a while to catch up (which will trade off some throughput but should keep latency fine, though it might result in an extended epoch close to the minimum rates). But I have not actually looked at this closely enough to have more than a hunch and the only half-digested information from your post :wink:

Yeah, (1) and (2) are much more common than (3) and (4). I don't think I've ever seen (4) actually happen because the time window is so small, but it is theoretically possible.

@patrakov can you remind me why:

root@OpenWrt-1:~# cat /root/cake-autorate/test.sh
#!/bin/bash

while read -r line
do
        echo "$line"
done < <(ping 1.1.1.1)

kills the ping process just fine on exit, but when run from:

root@OpenWrt-1:~# cat /etc/init.d/try
#!/bin/sh /etc/rc.common

START=97
STOP=4
USE_PROCD=1

start_service() {
        procd_open_instance
        procd_set_param command "/root/cake-autorate/test.sh"
        procd_close_instance
}

does not?

I think it's related to procd/systemd interfering with normal process management and things like SIGPIPE?

According to my tests, there are three processes involved: test.sh (parent), test.sh (child), and ping. The child seems to be a subshell, but I don't know why it appears here. Anyway, all it does is wait for the ping process to finish.

procd sends the SIGTERM signal only to the parent process, which indeed terminates. Therefore, the pipe between it and the ping process gets severed from the reader (receiver) end. ping tries to write something there, and, because it is writing to a pipe the other end of which is closed, it gets a SIGPIPE.

Without procd, this signal terminates ping, and then the child test.sh process also finishes waiting for ping to exit, and exits itself, leaving nothing.

With procd, SIGPIPE is ignored, and therefore ping doesn't die, and the test.sh child process continues waiting, in vain. Therefore, two processes remain.

Bad news: in bash, there is no way to restore the "normal" handling of SIGPIPE; this is even documented in the manual page:

Signals ignored upon entry to the shell cannot be trapped or reset.
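That behaviour is easy to demonstrate (a hypothetical transcript; the outer subshell plays procd's role of ignoring SIGPIPE before bash starts):

# the inner bash tries to reset SIGPIPE, but "trap -p" shows that it
# is still ignored, exactly as the man page says
$ ( trap '' PIPE; bash -c 'trap - PIPE; trap -p PIPE' )
trap -- '' SIGPIPE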

1 Like

Terrific analysis! Is there by any chance a way to disable the ignoring of SIGPIPE in procd or systemd?

In any case, it seems that perhaps the safest option is to go on retaining the PID and explicitly killing it even though it's less elegant than the pipe teardown mechanism.

Please ignore systemd. It is not relevant to OpenWrt.

And I have checked the code of procd, and can confirm that it sets SIGPIPE to ignored unconditionally.

This commit seemed desirable:

... but hugely problematic for the reasons you have identified.

No way to disable the ignoring of SIGPIPE in procd then? Even if there were, perhaps relying on that would be foolhardy.

But you can now encapsulate the whole pinger PID locally inside the parser process and nobody else needs to know; you just keep a single PID. Seems easy enough and avoids the need for elaborate process management...

The main loop only needs the maintain and logger PIDs, maintain keeps the parser PIDs, and each parser handles a single binary... seems pretty clean to me, clean enough to not need elaborate process management...
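Something like this, perhaps (a sketch of that structure with made-up names, not the actual cake-autorate code):

#!/bin/bash
# Each parser starts and owns its pinger: the PID stays local, and the
# parser kills the pinger itself on SIGTERM (we cannot rely on SIGPIPE
# under procd), so the caller only ever needs the parser's PID.

parser()
{
        local pinger_pid ping_fd

        exec {ping_fd}< <(ping 1.1.1.1)
        pinger_pid=${!}   # bash sets ${!} to the process substitution's PID

        trap 'kill "${pinger_pid}" 2>/dev/null; exit 0' TERM

        while read -r -u "${ping_fd}" line
        do
                : # parse "${line}" here
        done
}

parser &
parser_pid=${!}   # the only PID anyone else needs to track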

I think we can take some inspiration from mwan3: it is also implemented in shell, runs ping and other subprocesses, and is forced to run in an environment where SIGPIPE does not work (which is, in my opinion, an unsupported environment for any shell).

It's been a while since I posted in this thread. Life and all that jazz. I have been using the new bash implementation ever since the Lua version stopped working for the reasons already explained. As a user seeking solutions, it's a luxury to have these options. As far as I can determine, the recent bash version does seem to do what I want it to do. But you guys need data. I would love to serve up some here next weekend.

1 Like

Hello.

It would be great if compatibility with other Linux distros could be preserved, as seems to be the case so far with my x86 Debian setup.

But please, I don't mean to be disrespectful or presumptuous in any way, asking something like this in an ... OpenWrt forum!

Of course you are OpenWrt devs, so your main goal is to produce OpenWrt-tailored code.

Anyway, thanks for the good work so far, and thanks for not kicking me out even though I'm not an OpenWrt user. (I still have OpenWrt on a wifi access point device, though.)

2 Likes

If that is true then there is our path forward: trap that TERM and then initiate a proper staged shutdown, preferably by first asking each process nicely to shut itself down gracefully, followed by a SIGKILL (and I mean -9 here; the time to ask politely is when sending the shutdown request, so the "talk softly, but carry a big stick" school of process management, if you will).
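For example (a sketch; the child_pids array and the one-second grace period are assumptions):

# Ask each child politely with SIGTERM, allow a grace period, then
# SIGKILL (-9) anything still alive.
staged_shutdown()
{
        local pid
        for pid in "${child_pids[@]}"
        do
                kill -TERM "${pid}" 2>/dev/null
        done

        sleep 1   # grace period

        for pid in "${child_pids[@]}"
        do
                kill -0 "${pid}" 2>/dev/null && kill -KILL "${pid}"
        done
}

trap 'staged_shutdown; exit 0' TERM INT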

1 Like