CAKE w/ Adaptive Bandwidth [October 2021 to September 2022]

It's 1700 lines of pretty amateurish perl. I'm not sure it should be seen in public :rofl:

I did look at @dlakelan's script and there was a reason I ended up rolling my own, but I can't remember what that reason was now.

Why? The code looks nice to me, though Erlang seems funky. Could you be convinced to compare performance?

Could you elaborate on your nping use? That seems very intriguing. Are the outgoing packet and the response timestamped?

In theory, ping allows requesting timestamps from the other end as well. If all clocks are synchronized, that allows one-way delays to be measured, which in turn allows congestion to be assessed per direction.
Last time I tried it I ran into two issues:

  1. only a few sites bothered to return timestamps at all
  2. timestamps returned did not look too reliable

at which point I stopped that line of inquiry, apparently too early, because it seems @tievolu got it to work...

1 Like

Thanks @moeller0. Seems like all the goodies are spread across the different independently formulated DIY scripts :frowning:. Namely:

  • @lantis1008 <-- handles load well (not easy to adapt to OpenWrt)
  • @dlakelan <-- handles ping-based convergence under load well (fails on no load)
  • @tievolu <-- handles directionality with respect to ping (1700 lines)

Does:

  • load > 50% AND no delay encountered → increase bandwidth
  • load < 50% OR delay encountered → reduce bandwidth

In the context of @dlakelan's script, does the approach described above make sense to you? Would you propose any change?
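
For concreteness, the rule I have in mind is roughly this (a pure sketch: "load" would have to be measured utilisation of the current shaper rate, which none of the scripts exposes yet, and "delay_detected" would come from the existing ping logic):

# hypothetical variables: load (% of current shaper rate), delay_detected (0/1),
# rate and step in Kbit/s
if [ "$load" -gt 50 ] && [ "$delay_detected" -eq 0 ]; then
        rate=$((rate + step))    # headroom appears to exist: open up
else
        rate=$((rate - step))    # idle or congested: back off towards minimum
fi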

These are the notes in my script describing how I mitigated the clock synchronization problem for nping:

# nping request and response times will only be accurate if our system clock
# is precisely in sync with the target host's system clock. This will
# normally not be the case. When the NTP daemon has had 12-24 hours to
# synchronize with an NTP server the times will be pretty close, probably
# to within 5-10 milliseconds, but they can be hundreds of milliseconds out
# when we first boot up. Any difference in the clocks will cause the
# results to be offset by that difference, which leads to request and
# response times that are inflated, reduced, or even negative. However,
# we can "fix" the results by applying an offset to compensate for the
# clock difference.

# Obviously we can't know what the time difference between the clocks is,
# but we can estimate it based on the minimum possible request/response
# time. We use the lowest half-round-trip-time seen in the last
# x samples [I'm currently using four hours' worth]. If a request/response is lower than this value,
# we calculate the number of milliseconds required to bump it up to the
# minimum, and apply this offset to both the request and the response.
# We add the offset to the request time, and subtract it from the response
# time, and the offset can be positive or negative. The minimum
# response/request time and offset for each target host are stored in temp
# files, and they are adjusted when necessary.
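
To make that concrete, here is a rough worked example of the reply-side branch (numbers invented for illustration): say the target's clock is 50ms ahead of ours and the true one-way delay is about 10ms each way. The raw figures then come out as request 60ms, reply -40ms, RTT 20ms, so the half-RTT minimum is 10ms.

# reply (-40) is below the 10ms minimum, so derive an offset and apply it to both legs
req=60; rep=-40; min=10; offset=0
[ $((rep - offset)) -lt "$min" ] && offset=$((rep - min))   # offset = -50
req=$((req + offset))                                       # 60 + (-50) = 10ms
rep=$((rep - offset))                                       # -40 - (-50) = 10ms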
1 Like

Here's an example nping Perl script which targets a number of public DNS servers that respond to ICMP type 13 (timestamp) requests and ranks them in order of the quickest response. There are a load more usable IPs included in the script, but they're all too slow from my location in the UK.

Note that this script requires gnu-date to process the timestamps properly, with the location defined here:

my $gnu_date = "/etc/scripts/util/gnu-date";

Unfortunately, installing coreutils-date breaks things that assume the date command is the BusyBox implementation, including OpenWrt system scripts on 19.07 and BearDropper. So I had to store a gnu-date binary in a custom location and use that.

#!/usr/bin/perl -CS

my $ping_count = 0;
my @ping_targets = ();
my $ping_interval = 2;
my $jitter_count = 20;

my $nping_minimum_time = 50;
my %offsets;

my %req_results_from_target;
my %rep_results_from_target;
my %rtt_results_from_target;

my $gnu_date = "/etc/scripts/util/gnu-date";

my $ping_offset = 3;

foreach $arg (@ARGV) {
        if ($arg =~ /-c(\d+)/) {
                $ping_count = $1;
        } elsif ($arg =~ /-i(\d+)/) {
                $ping_interval = $1;
        } else {
                push(@ping_targets, $arg);
        }
}

# Fastest servers for me
push(@ping_targets, "37.252.230.153");
push(@ping_targets, "9.9.9.11");
push(@ping_targets, "74.82.42.42");
push(@ping_targets, "193.19.108.2");
push(@ping_targets, "46.227.200.55");
push(@ping_targets, "194.242.2.2");
push(@ping_targets, "9.9.9.9");
push(@ping_targets, "46.227.200.54");
push(@ping_targets, "208.67.222.222");
push(@ping_targets, "146.112.41.2");
push(@ping_targets, "149.112.112.10");
push(@ping_targets, "208.67.220.220");
push(@ping_targets, "146.112.41.3");
push(@ping_targets, "193.19.108.3");
push(@ping_targets, "146.112.61.106");
push(@ping_targets, "194.242.2.3");
push(@ping_targets, "185.228.168.168");
push(@ping_targets, "185.228.168.10");
push(@ping_targets, "9.9.9.10");
push(@ping_targets, "149.112.112.112");
push(@ping_targets, "212.78.94.40");
push(@ping_targets, "209.250.226.191");
push(@ping_targets, "149.112.112.11");

# Full List
#push(@ping_targets, "37.252.230.153");
#push(@ping_targets, "212.78.94.40");
#push(@ping_targets, "185.64.79.5");
#push(@ping_targets, "94.140.14.15");
#push(@ping_targets, "149.112.121.10");
#push(@ping_targets, "174.138.29.175");
#push(@ping_targets, "27.112.79.80");
#push(@ping_targets, "172.104.93.80");
#push(@ping_targets, "46.227.200.54");
#push(@ping_targets, "146.112.61.106");
#push(@ping_targets, "185.233.106.232");
#push(@ping_targets, "93.104.213.190");
#push(@ping_targets, "130.59.31.251");
#push(@ping_targets, "185.43.135.1");
#push(@ping_targets, "95.216.229.153");
#push(@ping_targets, "149.112.122.10");
#push(@ping_targets, "149.112.122.20");
#push(@ping_targets, "35.231.247.227");
#push(@ping_targets, "149.112.121.20");
#push(@ping_targets, "149.112.112.11");
#push(@ping_targets, "9.9.9.9");
#push(@ping_targets, "46.227.200.55");
#push(@ping_targets, "149.112.112.112");
#push(@ping_targets, "176.9.1.117");
#push(@ping_targets, "213.196.191.96");
#push(@ping_targets, "35.245.234.132");
#push(@ping_targets, "35.230.160.38");
#push(@ping_targets, "185.235.81.1");
#push(@ping_targets, "101.101.101.101");
#push(@ping_targets, "9.9.9.10");
#push(@ping_targets, "193.19.108.3");
#push(@ping_targets, "94.140.14.14");
#push(@ping_targets, "185.134.196.55");
#push(@ping_targets, "185.150.99.255");
#push(@ping_targets, "185.95.218.42");
#push(@ping_targets, "146.255.56.98");
#push(@ping_targets, "5.1.66.255");
#push(@ping_targets, "149.112.121.30");
#push(@ping_targets, "101.102.103.104");
#push(@ping_targets, "139.162.112.47");
#push(@ping_targets, "9.9.9.11");
#push(@ping_targets, "185.228.168.168");
#push(@ping_targets, "194.242.2.3");
#push(@ping_targets, "74.82.42.42");
#push(@ping_targets, "37.252.232.95");
#push(@ping_targets, "94.140.15.15");
#push(@ping_targets, "185.233.107.4");
#push(@ping_targets, "94.140.15.16");
#push(@ping_targets, "185.95.218.43");
#push(@ping_targets, "35.237.220.84");
#push(@ping_targets, "149.112.122.30");
#push(@ping_targets, "194.242.2.2");
#push(@ping_targets, "146.112.41.3");
#push(@ping_targets, "185.228.168.10");
#push(@ping_targets, "146.112.41.2");
#push(@ping_targets, "208.67.220.220");
#push(@ping_targets, "176.9.93.198");
#push(@ping_targets, "193.17.47.1");
#push(@ping_targets, "185.134.196.54");
#push(@ping_targets, "46.239.223.80");
#push(@ping_targets, "149.112.112.10");
#push(@ping_targets, "193.19.108.2");
#push(@ping_targets, "209.250.226.191");
#push(@ping_targets, "208.67.222.222");
#push(@ping_targets, "185.216.27.142");
#push(@ping_targets, "130.59.31.248");
#push(@ping_targets, "79.110.170.43");


if (scalar(@ping_targets) == 0) {
        print("Usage:  ./nping.pl [ping target]\n");
        exit(2);
}

my $pings_done = 0;
while ($pings_done < $ping_count || $ping_count == 0) {
        my @results_to_print = ();

        foreach $ping_target (@ping_targets) {
                my $current_date_utc = `TZ='UTC' $gnu_date "+%b %d 00:00 %Y"`;
                chomp($current_date_utc);

                my $midnight_utc_ms = `TZ='UTC' $gnu_date --date="$current_date_utc" +%s%3N`;
                chomp($midnight_utc_ms);

                my $nping_start_utc_ms = `TZ='UTC' $gnu_date +%s%3N` + $ping_offset;
                chomp($nping_start_utc_ms);

                @nping_output = `nping --icmp --icmp-type 13 -c 1 --delay 0 $ping_target | grep ICMP 2>&1`;

                my $rqst_sent_utc_ms;
                my $rqst_rcvd_utc_ms;
                my $rply_sent_utc_ms;
                my $rply_rcvd_utc_ms;

                foreach $nping_line (@nping_output) {
                        # Sent
                        if ($nping_line =~ /^SENT \((.*)s\) ICMP .*/) {
                                my $nping_sent_time_ms = sprintf("%d", ($1 * 1000));
                                $rqst_sent_utc_ms = $nping_start_utc_ms + $nping_sent_time_ms;
                                next;
                        }

                        # Received
                        if ($nping_line =~ /^RCVD \((.*)s\) ICMP .* recv=(\d+) trans=(\d+)/) {
                                my $rcvd_time_ms = sprintf("%d", ($1 * 1000));
                                $rqst_rcvd_utc_ms = $midnight_utc_ms + $2;
                                $rply_sent_utc_ms = $midnight_utc_ms + $3;
                                $rply_rcvd_utc_ms = $nping_start_utc_ms + $rcvd_time_ms;
                                next;
                        }
                }

                #print("$ping_target ping $pings_done: ");

                my $nping_end_utc_ms = `TZ='UTC' $gnu_date +%s%3N`;
                chomp($nping_end_utc_ms);

                my $rqst_travel_time = "ERROR";
                my $rply_travel_time = "ERROR";
                my $round_trip_time  = "ERROR";

                my $offset = 0;
                if (exists($offsets{$ping_target})) {
                        $offset = $offsets{$ping_target};
                }

                if ($rqst_sent_utc_ms eq "" || $rqst_rcvd_utc_ms eq "" || $rply_sent_utc_ms eq "" || $rply_rcvd_utc_ms eq "") {
                        #print("ERROR: Did not receive valid nping timestamp response\n");
                } else {
                        $rqst_travel_time = $rqst_rcvd_utc_ms - $rqst_sent_utc_ms;
                        $rply_travel_time = $rply_rcvd_utc_ms - $rply_sent_utc_ms;
                        $round_trip_time = $rply_rcvd_utc_ms - $rqst_sent_utc_ms;

                        # Adjust minimum request/response time if necessary
                        my $half_rtt = int($round_trip_time / 2);
                        if ($nping_minimum_time > $half_rtt) {
                                $nping_minimum_time = $half_rtt;
                        }

                        # Check whether the offset needs to be adjusted
                        if ($rqst_travel_time + $offset < $nping_minimum_time) {
                                $offset = $nping_minimum_time - $rqst_travel_time;
                                $offsets{$ping_target} = $offset;
                        } elsif ($rply_travel_time - $offset < $nping_minimum_time) {
                                $offset = $rply_travel_time - $nping_minimum_time;
                                $offsets{$ping_target} = $offset;
                        }

                        # Apply the current offset
                        $rqst_travel_time += $offset;
                        $rply_travel_time -= $offset;

                        # Store the ping results
                        &store_result($ping_target, "req", $rqst_travel_time);
                        &store_result($ping_target, "rep", $rply_travel_time);
                        &store_result($ping_target, "rtt", $round_trip_time);
                }

                push(@results_to_print,
                        "| " . sprintf("%-15s", $ping_target) . "  |  " .
                        sprintf("%15s", $offset) . "  |  " .
                        sprintf("%7s", "$rqst_travel_time") . " | " .
                        sprintf("%7s", &get_ave($ping_target, "req")) . " \x{00B1} " .  sprintf("%-6s", &get_jitter($ping_target, "req")) . "  |  " .
                        sprintf("%7s", "$rply_travel_time") . " | " .
                        sprintf("%7s", &get_ave($ping_target, "rep")) . " \x{00B1} " .  sprintf("%-6s", &get_jitter($ping_target, "rep")) . "  |  " .
                        sprintf("%7s", "$round_trip_time") . " | " .
                        sprintf("%7s", &get_ave($ping_target, "rtt")) . " \x{00B1} " .  sprintf("%-6s", &get_jitter($ping_target, "rtt")) . "  |  " .
                        "\n"
                );

        }

        $pings_done++;

        system("clear");

        print(
                "               " . "     " . "                   " . "----------------------------------------------------------------------------------------------\n" .
                "  " .
                "Pings done = " . sprintf("%-5s", $pings_done) . "  " .
                "RTT/2 min = " . sprintf("%3s", $nping_minimum_time) . "  |  " .
                sprintf("%26s", "Request") . "  |  " .
                sprintf("%26s", "Reply") . "  |  " .
                sprintf("%26s", "Round Trip") . "  |  " .
                "\n" .
                "-------------------------------------------------------------------------------------------------------------------------------------\n" .
                "| " . sprintf("%-15s", "Ping Target") . "  |  " .
                sprintf("%15s", "Offset") . "  |  " .
                sprintf("%7s", "Current") . " | " .
                sprintf("%7s", "Average") . " \x{00B1} " . sprintf("%6s", "jitter") . "  |  " .
                sprintf("%7s", "Current") . " | " .
                sprintf("%7s", "Average") . " \x{00B1} " . sprintf("%6s", "jitter") . "  |  " .
                sprintf("%7s", "Current") . " | " .
                sprintf("%7s", "Average") . " \x{00B1} " . sprintf("%6s", "jitter") . "  |  " .
                "\n" .
                "-------------------------------------------------------------------------------------------------------------------------------------\n"
        );

        @results_to_print = sort {&get_sort_string($a) cmp &get_sort_string($b)} @results_to_print;

        foreach $result (@results_to_print) {
                print $result;
        }

        print("-------------------------------------------------------------------------------------------------------------------------------------\n");

        if ($ping_count != $pings_done) {
                sleep($ping_interval);
        }
}


sub get_results {
        my ($ping_target, $type) = @_;

        if ($type eq "req") {
                if (exists($req_results_from_target{$ping_target})) {
                        return @{$req_results_from_target{$ping_target}};
                }
        } elsif ($type eq "rep") {
                if (exists($rep_results_from_target{$ping_target})) {
                        return @{$rep_results_from_target{$ping_target}};
                }
        } elsif ($type eq "rtt") {
                if (exists($rtt_results_from_target{$ping_target})) {
                        return @{$rtt_results_from_target{$ping_target}};
                }
        } else {
                die("Invalid type in get_results: $type");
        }

        # If we reach here there are no stored results. Return an empty array.
        return ();
}

sub set_results {
        my ($ping_target, $type, @results) = @_;

        if ($type eq "req") {
                @{$req_results_from_target{$ping_target}} = @results;
        } elsif ($type eq "rep") {
                @{$rep_results_from_target{$ping_target}} = @results;
        } elsif ($type eq "rtt") {
                @{$rtt_results_from_target{$ping_target}} = @results;
        } else {
                die("Invalid type in set_results: $type");
        }
}


sub store_result {
        my ($ping_target, $type, $result) = @_;

        my @results = &get_results($ping_target, $type);

        if (scalar(@results) == $jitter_count + 1) {
                shift(@results);
        }

        push(@results, $result);

        &set_results($ping_target, $type, @results);
}

sub get_ave {
        my ($ping_target, $type) = @_;

        my @results = &get_results($ping_target, $type);

        if (scalar(@results) == 0) {
                return "N/A";
        }

        my $total = 0;
        foreach $result (@results) {
                $total += $result;
        }

        #return int($total / scalar(@results));
        return sprintf("%.2f", ($total / scalar(@results)));
}

sub get_jitter {
        my ($ping_target, $type) = @_;

        my @results = &get_results($ping_target, $type);

        if (scalar(@results) == 0) {
                return "N/A";
        }

        my $jitter = 0;
        my $previous = -1;
        foreach $current (@results) {
                if ($previous == -1) {
                        $previous = $current;
                        next;
                } else {
                        $jitter += abs($current - $previous) / (scalar(@results) - 1);
                        $previous = $current;
                }
        }

        return sprintf("%.2f", $jitter);
}

sub get_sort_string {
        my ($result) = @_;

        if ($result =~ /\|.+\|.+\|.+\|.+\|.+\|.+\|.+\|(.+)\|/) {
                return $1;
        }
}

It gives results that look like this:

                                       ----------------------------------------------------------------------------------------------
  Pings done = 15     RTT/2 min =   7  |                     Request  |                       Reply  |                  Round Trip  |
-------------------------------------------------------------------------------------------------------------------------------------
| Ping Target      |           Offset  |  Current | Average ± jitter  |  Current | Average ± jitter  |  Current | Average ± jitter  |
-------------------------------------------------------------------------------------------------------------------------------------
| 74.82.42.42      |               -3  |       12 |    9.79 ± 1.92    |        8 |    8.43 ± 0.85    |       20 |   18.21 ± 1.85    |
| 185.228.168.10   |               10  |       11 |    9.20 ± 2.07    |        9 |    9.20 ± 0.86    |       20 |   18.40 ± 2.21    |
| 149.112.112.112  |                1  |       10 |   10.13 ± 1.57    |        8 |    8.33 ± 0.86    |       18 |   18.47 ± 2.00    |
| 193.19.108.2     |          -187616  |       13 |    9.60 ± 1.57    |       10 |    9.00 ± 1.14    |       23 |   18.60 ± 1.43    |
| 149.112.112.10   |               -3  |       15 |    9.73 ± 1.21    |       10 |    9.27 ± 1.00    |       25 |   19.00 ± 1.79    |
| 208.67.220.220   |                0  |       10 |   10.73 ± 2.86    |        7 |    8.47 ± 0.79    |       17 |   19.20 ± 2.50    |
| 9.9.9.10         |               -3  |       10 |   10.40 ± 3.00    |        9 |    8.87 ± 0.36    |       19 |   19.27 ± 3.07    |
| 194.242.2.2      |          -187615  |        9 |   10.93 ± 2.79    |       10 |    8.33 ± 0.57    |       19 |   19.27 ± 3.21    |
| 9.9.9.11         |               -2  |       12 |   11.13 ± 3.79    |        7 |    8.20 ± 0.57    |       19 |   19.33 ± 3.64    |
| 46.227.200.55    |               -1  |       10 |   11.13 ± 3.64    |        8 |    8.20 ± 0.71    |       18 |   19.33 ± 3.93    |
| 149.112.112.11   |               -2  |       12 |   10.73 ± 2.14    |        9 |    8.67 ± 0.64    |       21 |   19.40 ± 1.93    |
| 9.9.9.9          |               -3  |        8 |   10.87 ± 3.57    |        9 |    8.67 ± 0.36    |       17 |   19.53 ± 3.21    |
| 37.252.230.153   |               -2  |        9 |   11.00 ± 3.43    |        7 |    8.73 ± 0.71    |       16 |   19.73 ± 3.86    |
| 185.228.168.168  |               10  |       16 |   10.67 ± 4.57    |        9 |    9.13 ± 0.57    |       25 |   19.80 ± 5.00    |
| 209.250.226.191  |               -2  |       11 |   11.27 ± 1.07    |        9 |    8.67 ± 0.50    |       20 |   19.93 ± 1.00    |
| 146.112.41.2     |               -2  |        9 |   11.07 ± 3.64    |        8 |    9.13 ± 0.71    |       17 |   20.20 ± 3.79    |
| 212.78.94.40     |                0  |       10 |   11.07 ± 2.93    |        9 |    9.20 ± 0.79    |       19 |   20.27 ± 2.57    |
| 208.67.222.222   |               -2  |       12 |   11.80 ± 2.64    |        8 |    8.53 ± 1.14    |       20 |   20.33 ± 3.36    |
| 193.19.108.3     |          -187615  |       10 |   12.29 ± 4.92    |        8 |    8.14 ± 0.31    |       18 |   20.43 ± 4.77    |
| 146.112.41.3     |               -3  |       17 |   11.33 ± 5.07    |       10 |    9.20 ± 1.14    |       27 |   20.53 ± 4.64    |
| 46.227.200.54    |               -1  |       10 |   12.20 ± 4.71    |        9 |    8.40 ± 0.21    |       19 |   20.60 ± 4.79    |
| 146.112.61.106   |               -2  |       15 |   11.67 ± 4.57    |        8 |    9.00 ± 0.86    |       23 |   20.67 ± 4.43    |
| 194.242.2.3      |          -187616  |       20 |   12.13 ± 4.71    |       10 |    8.80 ± 0.57    |       30 |   20.93 ± 4.29    |
-------------------------------------------------------------------------------------------------------------------------------------

As you can see, some of the servers' system clocks are way off, but the script compensates with the offsets to provide usable results. The offsets are calculated as explained in my last post, just not using four hours' worth of data as my bandwidth adjustment script does.

1 Like

Wow, that seems pretty neat.

Out of interest, what is the context that means you need to adjust bandwidth?

Would you mind giving a rough outline of how your script adjusts bandwidth based on this ping-based info?

I have a Virgin Media cable connection and the available bandwidth can vary a lot at different times of day, especially upstream. I have lots of things that use a lot of bandwidth (work, online backup, etc.), but my kids want to stream video and play games at the same time, and I need to make work video calls, so I need to keep the latency low at all times.

My script runs once a minute as a cron job, working roughly as follows (on the upstream/downstream independently as described before):

Ping the specified target hosts [currently 15 pings to three targets]
 |
 |-> If upstream latency is "good"
 |    '-> If we didn't change the upstream bandwidth recently [last 15 mins if the last change was a decrease, last 5 minutes if it was an increase]
 |        '-> If upstream bandwidth is not at maximum
 |            '-> Increase the upstream bandwidth
 |
 |-> If upstream latency is "bad" [currently defined as all targets recording at least three "bad" request times of 40ms or more]
 |   '-> If internet connection is not completely down
 |       '-> If upstream bandwidth is not at minimum
 |           |-> If upstream bandwidth was recently increased [within the last 5 minutes]
 |           |   '-> Undo last increase
 |           '-> Decrease the upstream bandwidth
 |
 |-> If downstream latency is "good"
 |    '-> If we didn't change the downstream bandwidth recently [last 15 mins if the last change was a decrease, last 5 minutes if it was an increase]
 |        '-> If downstream bandwidth is not at maximum
 |            '-> Increase the downstream bandwidth
 |
 '-> If downstream latency is "bad" [currently defined as all targets recording at least three "bad" response times of 40ms or more]
     '-> If internet connection is not completely down
         '-> If downstream bandwidth is not at minimum
             |-> If downstream bandwidth was recently increased [within the last 5 minutes]
             |   '-> Undo last increase
             '-> Decrease the downstream bandwidth

It works pretty well, with any latency problems generally being resolved within a minute or two, which is sufficient to keep the kids at bay :slight_smile: A quicker response would be nice. I could adapt the script to run constantly as a daemon instead of running once every minute, assessing the last 10-15 pings on a rolling basis, but then I have to worry about leaks and stuff :slight_smile:

EDIT:
This gave me the nudge I needed to make the improvements described above. I have now adapted my script to run constantly as a service, using nping to assess latency to three targets every two seconds. "Bad" latency is defined as request/reply times being above a defined threshold (I use 40ms) for all three targets for at least 3 out of the most recent 10 pings. Otherwise the algorithm remains the same. In my testing, bad latency is usually detected (and the necessary adjustments made) within 10 seconds, and often after just 6 seconds.

1 Like

Here's an example of the logs my script spits out when adjusting the connection:

Wed Oct 13 18:47:00 2021 [20738]: QOS Monitor started
Wed Oct 13 18:47:00 2021 [20738]: Current QOS bandwidth: 390000 Kb/s download, 18392 Kb/s upload
Wed Oct 13 18:47:00 2021 [20738]: Sleeping for 1.693s before starting latency test...
Wed Oct 13 18:47:01 2021 [20738]: Ping count: 15, interval: 2s, UL/DL thresholds: 40ms/40ms, Max bad pings per target: 4
Wed Oct 13 18:47:01 2021 [20738]:           ---------------------------------------------------------------------------
Wed Oct 13 18:47:01 2021 [20738]:           |  Bandwidth usage  |        9.9.9.10 |  129.250.35.251 |  149.112.112.10 |
Wed Oct 13 18:47:01 2021 [20738]:           ---------------------------------------------------------------------------
Wed Oct 13 18:47:01 2021 [20738]:           | TX Mbps | RX Mbps |  Rqst  |  Rply  |  Rqst  |  Rply  |  Rqst  |  Rply  |
Wed Oct 13 18:47:01 2021 [20738]: -------------------------------------------------------------------------------------
Wed Oct 13 18:47:03 2021 [20738]: | Ping  1 |  13.985 | 394.363 |  163 x |   13   |  201 x |   14   |  201 x |   15   |
Wed Oct 13 18:47:05 2021 [20738]: | Ping  2 |  14.200 | 394.281 |  128 x |   16   |  121 x |   14   |  112 x |   16   |
Wed Oct 13 18:47:07 2021 [20738]: | Ping  3 |  14.768 | 396.211 |  309 x |   19   |  311 x |   17   |  302 x |   18   |
Wed Oct 13 18:47:10 2021 [20738]: | Ping  4 |  13.826 | 394.809 |  557 x |   10   |  557 x |    9   |  544 x |   11   |
Wed Oct 13 18:47:12 2021 [20738]: | Ping  5 |  13.553 | 392.750 |  352 x |   11   |  332 x |   10   |  321 x |   10   |
Wed Oct 13 18:47:14 2021 [20738]: | Ping  6 |  15.534 | 396.045 |  469 x |   12   |  506 x |   10   |  501 x |    9   |
Wed Oct 13 18:47:16 2021 [20738]: | Ping  7 |  14.127 | 390.440 |   15   |   10   |   11   |   11   |   15   |    9   |
Wed Oct 13 18:47:18 2021 [20738]: | Ping  8 |  13.764 | 392.792 |   23   |   12   |    8   |   11   |    7   |   13   |
Wed Oct 13 18:47:20 2021 [20738]: | Ping  9 |  13.562 | 387.771 |  191 x |   18   |  163 x |   17   |  142 x |   18   |
Wed Oct 13 18:47:22 2021 [20738]: | Ping 10 |  12.118 | 393.117 |   57 x |   26   |   45 x |   25   |   34   |   25   |
Wed Oct 13 18:47:24 2021 [20738]: | Ping 11 |  12.245 | 390.498 |   39   |   18   |   23   |   16   |   16   |   17   |
Wed Oct 13 18:47:26 2021 [20738]: | Ping 12 |  13.459 | 392.081 |   19   |   18   |  240 x |   15   |  216 x |   16   |
Wed Oct 13 18:47:28 2021 [20738]: | Ping 13 |  12.389 | 392.189 |  197 x |   10   |  185 x |   10   |  178 x |   10   |
Wed Oct 13 18:47:30 2021 [20738]: | Ping 14 |  14.086 | 398.317 |  100 x |   23   |   85 x |   22   |   72 x |   24   |
Wed Oct 13 18:47:32 2021 [20738]: | Ping 15 |  12.254 | 397.621 |   26   |   15   |    8   |   13   |    8   |   15   |
Wed Oct 13 18:47:32 2021 [20738]: -------------------------------------------------------------------------------------
Wed Oct 13 18:47:32 2021 [20738]:                               | NOT OK |   OK   | NOT OK |   OK   | NOT OK |   OK   |
Wed Oct 13 18:47:32 2021 [20738]:                               -------------------------------------------------------
Wed Oct 13 18:47:32 2021 [20738]: Bad upload: 3/3, bad download: 0/3, unresponsive: 0/3  ==> UPLOAD NOT OK | DOWNLOAD OK
Wed Oct 13 18:47:32 2021 [20738]: Download bandwidth is already set to maximum. Taking no action for download.
Wed Oct 13 18:47:32 2021 [20738]: Undoing upload increase from 54s ago. Decreasing upload bandwidth by 9.091% to 16720 Kb/s. Increases disallowed until 19:02:32 (900s).
Wed Oct 13 18:47:32 2021 [20738]: Applying new egress bandwidth 16720 Kb/s to eth0: uci set sqm.eth0.upload=16720 2>&1
Wed Oct 13 18:47:32 2021 [20738]: Applying new egress bandwidth 16720 Kb/s to eth0: tc qdisc change root dev eth0 cake bandwidth 16720Kbit 2>&1
Wed Oct 13 18:47:32 2021 [20738]: New upload bandwidth applied successfully
Wed Oct 13 18:47:32 2021 [20738]: Setting last change for upload: decrease @ 1634147252
Wed Oct 13 18:47:32 2021 [20738]: QOS Monitor finished
1 Like

Thanks, but oh man. This is all a lot to process.

Slightly off topic, but @moeller0 is there an easy and reliable way to detect Zoom or Teams traffic? I am thinking a temporary fix would be to apply CAKE aggressively only when Zoom or Teams traffic is detected.

Looking over this, I think my proposed tweak to take load into account might be much harder than I imagined.

For Zoom I use dnsmasq to populate an ipset, adding any IP resolved for hostnames ending in zoom.us (with a timeout of 1 day). I then have a firewall rule which looks for UDP traffic to/from these IPs on ports 3478, 3479, 8801-8810. I can't remember where I got the port info, but I think it was somewhere reliable.
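
The mechanics are roughly as follows (a sketch only; the set name, timeout and DSCP class are illustrative, on OpenWrt the rule would normally be expressed through the firewall config rather than raw iptables, and only the upload direction is shown):

# dnsmasq option: add every address resolved under zoom.us to the "zoom" ipset
#   ipset=/zoom.us/zoom

# the 1-day timeout comes from the set's default entry timeout
ipset create zoom hash:ip timeout 86400

# match UDP to those addresses on the Zoom media/STUN ports and tag it;
# a mirror rule on source address/ports covers the download direction
iptables -t mangle -A FORWARD -p udp -m set --match-set zoom dst \
        -m multiport --dports 3478,3479,8801:8810 -j DSCP --set-dscp-class AF41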

I create a static ipset for these Teams IP ranges (and prioritize traffic to/from these ranges using firewall rules):

13.107.64.0/18
52.112.0.0/14
52.120.0.0/14

Info from here: https://docs.microsoft.com/en-gb/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#skype-for-business-online-and-microsoft-teams
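
As a sketch, the static set can be built along these lines (set name and marking action are placeholders):

# static hash:net set for the published Teams ranges
ipset create teams hash:net
for net in 13.107.64.0/18 52.112.0.0/14 52.120.0.0/14; do
        ipset add teams "$net"
done

# how the match is prioritised is up to you; DSCP marking shown as one option
iptables -t mangle -A FORWARD -m set --match-set teams dst -j DSCP --set-dscp-class AF41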

1 Like

Maybe it can be much simpler: simply store the time of the last rate change, and if there has been no change in X seconds, reduce the rate by, say, (max - min)/10, letting the rate slowly decay back to your minimum. No actual load measurements are necessary, just a choice of time constant and reduction step size....
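
Something like this on each tick, as a sketch (variable names are placeholders; rates in Kbit to match the tc invocation shown earlier in the thread):

now=$(date +%s)
if [ $((now - last_change)) -ge "$decay_interval" ]; then
        step=$(( (max_rate - min_rate) / 10 ))
        rate=$((rate - step))
        [ "$rate" -lt "$min_rate" ] && rate=$min_rate
        tc qdisc change root dev eth0 cake bandwidth "${rate}Kbit"
        last_change=$now
fi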

1 Like

@dlakelan would that work? I don't understand Erlang well enough to figure this out.

Yeah, I like that hack. If I remember correctly, the guy I originally wrote my script for wanted "reasonable" latency but didn't want to sacrifice a lot of speed because they had a slow connection. So the script as-is tries to move towards the maximum and then decreases when it detects latency issues... your goal is to move towards the minimum so that latency is always low, and to open up bandwidth as needed... it's definitely doable... let me look at the code

1 Like

Yes please! :smiley: ... the script certainly converges well under load, and pretty fast too. It all seems elegant to me, and installation with a service file as above is easy even for inexperienced users. This could presumably even be packaged up if enough users have tested it.

Please can you consider changing:

    monitor_ifaces([{"tc qdisc change root dev eth0.2 cake bandwidth ~BKbit diffserv4 dual-srchost overhead 34 ", 4000, 6000, 8000},
		    {"tc qdisc change root dev ifb4eth0.2 cake bandwidth ~BKbit diffserv4 dual-dsthost nat overhead 34 ingress",15000,30000,35000}],
    ["dns.google.com","one.one.one.one","quad9.net","facebook.com",
     "gstatic.com","cloudflare.com","fbcdn.com","akamai.com","amazon.com"]),

to:

    monitor_ifaces([{"tc qdisc change root dev X cake bandwidth ~BKbit", 4000, 6000, 8000},
		    {"tc qdisc change root dev Y cake bandwidth ~BKbit,15000,30000,35000}],
    ["google.com","one.one.one.one","quad9.net","facebook.com",
     "gstatic.com","cloudflare.com","fbcdn.com","akamai.com","amazon.com"]),

dns.google.com doesn't resolve for me, so I changed it to google.com.

Also, we only need to set the bandwidth in the tc qdisc change command.

Also could you consider changing 10.0ms to 20.0 ms and delays to 3?

Maybe expose these as parameters??

1 Like

.. waits with bated breath!

dns.google.com should resolve as follows, so that's weird:

dns.google.com has address 8.8.4.4
dns.google.com has address 8.8.8.8
dns.google.com has IPv6 address 2001:4860:4860::8888
dns.google.com has IPv6 address 2001:4860:4860::8844

Looking at the code, by default it drifts upwards at each "clock tick", and it drifts downwards if it has seen delays. At first @moeller0's suggestion seemed reasonable, but if the default is to drift downwards in both cases, then it'll never open up the throttle...

So... I still don't think we have an algorithm that "works" for @Lynx's goal. If I understood the goal better I could probably write the algorithm.

I think the "perfect" algorithm would constantly leave the speed 5% below what is actually available. Unfortunately, we don't know what is available! The current algorithm by default starts full throttle, and then declines until delays go away. I think you want an algorithm that starts at minimum throttle and when actual bandwidth gets close to the current limit, opens the throttle until you see delays and backs it off...

Unfortunately, being bandwidth-responsive requires that we add code to monitor BW usage and make decisions on that... I don't really have the time to work on that right now.

1 Like

The convergence under load seems to work well. It's just that it always drifts to max bandwidth under no load, which seems to break things.

Indeed - this is what I am keen to implement.

If load monitoring were easy, then I think this:

  • load > 50% AND no delay encountered → increase bandwidth
  • load < 50% OR delay encountered → reduce bandwidth

would work.

Is there something else not involving load that would work? Perhaps there is no way round this but to monitor load.

Yes, I think you have to monitor load if you're going to make decisions based on load. I don't think this is super hard... you need to spawn off a thread that just reads the rx and tx byte counters on a timer tick... and then have it send the current BW to the monitor_delays thread as a message... then make decisions on that message

but as I said, I don't have time to really develop that right now.

To give you an idea of how the code would work, though: you can see that "monitor_ifaces" is responsible for spawning off all the various workers... so it'd need to spawn a "monitor_bw" thread...

The timer_process thread should now know about the monitor_bw thread and send it a message on a tick the way it currently does for the monitor_delays thread...

The monitor_bw thread should after every tick send monitor_delays a message saying "this is the current BW usage", and then monitor_delays should make a decision about what to do.
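
The counters themselves are easy to get at on Linux; something like this per tick (interface and interval are just examples; in the erlang script this would live inside the new monitor_bw process, e.g. via a file read or os:cmd, and be forwarded as a message):

tx1=$(cat /sys/class/net/eth0/statistics/tx_bytes)
rx1=$(cat /sys/class/net/eth0/statistics/rx_bytes)
sleep 1
tx2=$(cat /sys/class/net/eth0/statistics/tx_bytes)
rx2=$(cat /sys/class/net/eth0/statistics/rx_bytes)
# byte deltas over the 1 s interval converted to kbit/s
echo "tx: $(( (tx2 - tx1) * 8 / 1000 )) kbit/s  rx: $(( (rx2 - rx1) * 8 / 1000 )) kbit/s"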

1 Like