Good. I can tell you that the group that's been developing this code absolutely values your input; we really just haven't been able to assimilate it among all the other bits and bobs. For example, my soft-start feature, which ensures that baselines start out correct, just moved a few lines around and added a couple of sleeps... but it stopped the entire thing from doing anything. It took half an hour to track down and the fix was trivial, but c'est la vie.

I'm not trying to pull moeller0 back in, but for those who are interested, the main reason we chose the minimum as a percentage is that we believe people will have an intuition around "I'm willing to sacrifice up to 25/50/80% of my speed for latency control", while many people will not be knowledgeable enough to do the math to figure out what the consequences of, say, 800kbps vs 1500kbps vs 5000kbps vs 50000kbps are for their connection. It's not possible for us to give a fixed recommended minimum for everyone; as we've seen, some people would like to run this on a connection whose maximum speed is below what I would consider the minimum speed for low latency (3000kbps). I'd rather have those folks at least get a logically clear minimum that's lower than their base rate by default :slight_smile:

If we use a percentage then at least when someone enters say 3500kbps as their baseline rate we will calculate a minimum which is substantially lower than that baseline rate, and not higher!

For those whose base speeds are below about 10Mbps the percentage values should probably be bigger than our default... if you're at 3Mbps you should probably use a minimum of 50-70% or more. Dropping below 1.5Mbps is going to add tremendous quantities of buffering just from the time it takes to send a packet. But if you're at 50Mbps or 500Mbps, the ~20% default is going to work fine for you out of the box in either case, and that means for a whole bunch of people this is a setting they don't have to think about; we think that will result in fewer problems for people.

As usual the hard cases are where bandwidth is scarce, but with LTE connections often offering above 10Mbps and DOCSIS offering 50-500Mbps, there are a lot of people who can stick with a default percentage. As the "typical" rate increases, so do the kinds of things people expect to be able to do. We've already had @_FailSafe get disconnected from a screen-sharing business Zoom session when a random ping spike on his whole connection, to 100ms or so during a period when the connection was relatively idle, caused the rate controller to plummet to 1500kbps, which he had left at its default "placeholder value"... this then caused Zoom to simply quit. Setting his minimum rates to something like 5Mbps up and 50Mbps down would have avoided these problems, but if he didn't do that, what will thousands of less familiar people do? To avoid a lot of pain we decided on percentages, plus maybe some warnings that for slow connections those percentages should be tuned more carefully.

If you want a specific kbps minimum but we only offer percent, you can convert it relatively easily: take your kbps value X, calculate X / base_rate * 100, and use that as your percentage.
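As a quick worked example in Lua (the numbers are hypothetical, borrowed from a typical config):

    -- Hypothetical conversion: a desired 5000 kbps download minimum on a
    -- connection whose download base rate is 49700 kbps.
    local desired_min_kbps = 5000
    local download_base_kbits = 49700
    local min_percent = desired_min_kbps / download_base_kbits * 100
    print(string.format("%.0f", min_percent))  -- ~10, so set download_min_percent '10'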

Yes, upgraded without problems. Here is my data after running for 8 hours.
My config:


config network
        option upload_interface 'eth1'
        option download_interface 'ifb4eth1'
        option upload_base_kbits '10591'
        option download_base_kbits '49700'
        option upload_min_percent '55'
        option download_min_percent '20'

config output
        option log_level 'FATAL'
        option stats_file '/tmp/sqm-autorate.csv'
        option speed_hist_file '/tmp/sqm-speedhist.csv'

config advanced_settings
        option upload_delay_ms '15'
        option download_delay_ms '15'

Here are the plots:
[plots: delaydownecdf, delayupecdf, downhist, uphist]

I feel like it often sits at around 20% of my download speed; the speed rarely reaches 30Mbps. On the other hand, version 0.3.0 could hit 40-60Mbps.

Whoops, something went very wrong in that run. I noticed that I didn't check in some of the work I did on the ewma from my laptop, and getting that properly into the repo should help, but even with that failure it seems the delay stats were borked... at some point within the first 100 seconds of the run your download delay went to 15 seconds and then never recovered. I'm guessing we have an issue with the pings in the start-up phase.


The fractional usage rarely went over 80% and hence there wasn't any "upward pressure" to increase the threshold. When it did go up, it didn't take long before it slammed back down to try to avoid the delay it was experiencing. Unfortunately, because of that initial 15 seconds of delay, we can't read anything useful off the delay graph...
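For readers following along, the control logic being described amounts to: only raise the shaper rate when utilization is high and delay is low, and cut it quickly when delay spikes. Here is a hedged Lua sketch of that decision with made-up names and constants (an illustration of the idea, not the project's actual function):

    -- Illustrative rate-controller step (names and constants are made up).
    local function next_rate(current_rate, load_fraction, delay_ms,
                             delay_threshold_ms, min_rate, max_rate)
        if delay_ms > delay_threshold_ms then
            -- delay spike: slam the rate back down toward the minimum
            return math.max(min_rate, current_rate * 0.8)
        elseif load_fraction > 0.8 then
            -- "upward pressure" only when the link is actually loaded
            return math.min(max_rate, current_rate * 1.05)
        end
        return current_rate  -- idle or lightly loaded: leave the rate alone
    end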

@dlakelan I think this might be related to the ~33% increase in reflector IPs. I am seeing a significant initial load on the CPU right at startup since adding them. I think we need to possibly look at staggering the initial RTT queries or consider dropping that reflector list back down. I imagine this is especially painful on slower CPUs and/or CPUs with low core count.


Understood on your explanation.

Can it be set in the config? I didn't see a 15-second delay.

I am running some services that take CPU at start-up. The device I use is an RPi 4B.

I don't understand the results with version 0.4.0 from the testing/lua-threads branch.

I started it this morning and let it run. Frequently, when I return after a period of inactivity, I find the upload speed setting at my minimum value. If I start to use the link, it works fine, and the ul & dl speed settings ramp up appropriately. Latency stays low, and as long as I keep using the link, I don't notice any problems.

I finally got around to running the Julia plotting package, and saw the plot below. Notice the (blue) speed setting keeps dropping to a low value. Notice, too, the regular spikes in the "Delay through time" chart.

Thoughts? (I also have a bunch of other debugging info)

I’d call this progress. :relaxed: Sometimes the [unintentionally] bad new versions show how good the progress on prior versions was.

On a serious note, Dan already pointed out we missed some code in the v0.4.0 release that needs to get into a new patch level release. Just need some time to get that out the door.


I think initially we ping all the reflectors at 2/second.

In your zoomed-in delay graph, at time t=0 it shows a bit over 1.5x10^4, which means 15000 ms, i.e. > 15 seconds. That is a meaningless number that shouldn't have happened, so it's some kind of start-up bug; @_FailSafe is probably right that us pinging hundreds of reflectors in the first second is causing problems.


Well, I proposed a different design for easing in new reflectors (similar in scope to soft-start, as I understand it) based on evaluating simple update counters per reflector (and evaluating only those reflectors with enough low-load samples in their history). That would not have introduced any more blocking sleeps and hence had very little chance of failing hard, as you can see in a GitHub comment. That is exactly why discussion before implementation has the potential advantage of avoiding pitfalls others have already dug themselves out of in the past... (In "real-time" control, blocking actions like "sleep" are generally problematic, and it is best not to introduce special code paths for rare or corner cases; better to handle them through the common paths and change as little as possible: in the case of soft-start, simply do not change the rates.)
But as you have fixed your controller, you probably know this already.

Yes, but at 3500 * 0.2 = 700 kbps the set thresholds will not work well any more (the serialisation delay for a single full-MTU packet is ~ 1000 * (1500 * 8) / (700 * 1000) = 17.14 ms; with a default threshold of 15 ms that is kind of bad) and the link is more likely to stay at the minimum. No matter how you slice it, users will have to enter meaningful values under some conditions.
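To make that arithmetic easy to reproduce, here is a tiny Lua sketch of the same calculation (not project code):

    -- Serialization delay: time to put one full-MTU packet on the wire at a
    -- given shaper rate. (bytes * 8) bits divided by (kbit/s) comes out in ms.
    local function serialization_delay_ms(mtu_bytes, rate_kbps)
        return (mtu_bytes * 8) / rate_kbps
    end

    print(serialization_delay_ms(1500, 700))   -- ~17.14 ms at 700 kbps
    print(serialization_delay_ms(1500, 3000))  -- ~4 ms at 3000 kbps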

About the percentage, all I want to say is that I do not think your rationale is all that much more convincing to me than @CharlesJC's was, but then not my call to make.

As I said, have a special value for "auto-set" and go wild in automatically selecting whatever you consider a decent default, but do not force the poor souls for whom the auto setting does not work to first measure their absolute minimum rate, convert that into a percentage, and repeat the whole exercise should they change the base_rate again.

Well, apparently the 1500 were selected under the assumption this would be a never needed safety mechanism, and as it turns out not so much. Happens during development, but that IMHO is not a rationale for switching to percentages, but for making sure the minimum rate is still useful. And I bet that if you check with Zoom their recommendation is going to be a speed in absolute units per call, and not a percentage of the link's capacity. Which to me indicates that for selecting a lower limit absolute numbers are more useful. I apparently love a) repeating myself and b) flogging a dead horse, as that ship has sailed.

No, you put in a safe minimum of 5000 to allow Skype/Zoom and maybe warn a user if that is above base rate and instruct them to select something lower and warn them of the consequences.... this is not rocket science and going to percentages of base rate is solving none of the tricky issues here as far as I am concerned.

See, this is the thing: the "some warnings that for slow connections" part is the kicker here, and that will work just as well with absolute minimums as with percentage minimums, but it will be far easier for a user to figure out which absolute minimum rate she/he needs for their must-work applications.

This is insulting, and you know it. As I pointed out, one of the first things autorate-lua does is convert the percentage back into an absolute rate, because that is what the controller operates on.

Are you really piping all reflectors through the code at start-up? If so, how do you space them out? For production I understand you space the default 5 reflectors out by distributing them roughly over one full cycle time, so when you start with more reflectors, do you space them out over more cycles or are you slamming all of them into a single cycle?

Here is what I thought about this: instead of pinging each reflector at a rate of (cycle_time)^-1, ping one reflector every XX ms but cycle through a list of reflectors that is considerably longer, so the per-reflector repetition rate stays nice and low while the temporal coverage stays high, allowing you to effortlessly sample from a larger population.
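A minimal Lua sketch of that round-robin idea (hypothetical names; the real daemon would use its own probe and timing machinery rather than a print and a sleep):

    -- One probe every ping_spacing_ms, walking a long reflector list, so each
    -- individual reflector is only hit every (#reflectors * ping_spacing_ms) ms.
    local socket = require("socket")  -- luasocket, assumed available

    local reflectors = { "9.9.9.9", "1.1.1.1", "8.8.8.8" }  -- imagine 50+ entries
    local ping_spacing_ms = 50
    local i = 0

    while true do
        i = (i % #reflectors) + 1
        print("probe " .. reflectors[i])       -- stand-in for the real RTT probe
        socket.sleep(ping_spacing_ms / 1000)   -- keeps the aggregate probe rate constant
    end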

But with this I am truly out for a week unless there are specific questions, promise.

Yes, I suspect this is due to having adjusted the ewma so that we are a bit more trigger-happy :wink: I would really like to merge in the correct changes to the ewma, which somehow got lost in the transition between working remotely on my laptop and working at home on my desktop, and then have people try out different constants. I think to do this properly we need an actual test protocol.

No, it was actually chosen under the assumption that it was a definitely-needed quantity the user should think about and set, and in real-world testing it turned out to be something everyone ignored and then complained about the controller not working. Since we can't know what people will put for the base rate, we can't put a sensible value in kbps there by default (for example 5000kbps), because hey, there have already been people who want to run this on 2000 or 3000 kbps connections. We can put in a sensible percentage that will likely work for everyone with more than, say, 10Mbps, and we felt that covers a fairly large number of people, so that's what we decided on. It tends to be the case that people "get used to what they can do", and when their speeds drop below 20% of normal they'll notice it, no matter what that level is. A person with 1Gbps will very much notice a drop to 5Mbps.

I apologize if you were offended here. That really was not aimed at you nor intended to be insulting; it was aimed at a number of people who have offered their testing but who maybe really don't understand what the mathematical issues are with percent vs absolute kbps. My impression is that we have a huge range of audience here, including some who don't speak English as a first language, etc.

@_FailSafe I believe it's a great thing to have a huge list of potential reflectors, but yeah, we need to soft-start on the reflectors as well. Right now with say 200 reflectors we're sending 400pps of pings for a few seconds. This is just another thing we haven't yet gotten to fixing but we should do so soon.


@malikshi and others who've tried recent versions: we pushed out 0.4.2, a patch release that adds soft-starting of the reflectors (it starts with 6 good candidates as a seed, then runs the selector thread frequently for 40 cycles to gradually evaluate all the reflectors). It also has the intended changes to the EWMA calculations that weren't picked up for 0.4.0.
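For readers who want a mental model of that reflector soft-start, here is a hedged Lua sketch of the general shape (an illustration with made-up lists and constants, not the 0.4.2 code, which also evaluates candidates before admitting them):

    -- Start from a small seed set and fold the rest of the candidate pool in
    -- over many warm-up cycles instead of probing everything in the first second.
    local seed = { "9.9.9.9", "1.1.1.1", "8.8.8.8", "8.8.4.4", "208.67.222.222", "94.140.14.14" }
    local pool = { "149.112.112.112", "76.76.2.0", "9.9.9.10" }  -- imagine ~200 entries
    local active = {}
    for _, r in ipairs(seed) do active[#active + 1] = r end

    local warmup_cycles = 40
    local per_cycle = math.max(1, math.ceil(#pool / warmup_cycles))
    local next_idx = 1
    for cycle = 1, warmup_cycles do
        for _ = 1, per_cycle do
            if next_idx > #pool then break end
            active[#active + 1] = pool[next_idx]  -- the real selector evaluates RTT first
            next_idx = next_idx + 1
        end
        -- the real selector thread sleeps one cycle here before the next batch
    end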

I believe we skipped 0.4.1 because it was late at night and I made a user error.

0.4.2 is the current version in testing/lua-threads, and we would love to see some testing from anyone who has LTE or other highly variable connections.

In the latest version, you can tune the "sensitivity" of the trigger by changing a single number on the line:

    local fast_factor = ewma_factor(tick_duration, 0.4)

Think of the 0.4 there as a number in seconds related to the length of time it takes to notice an uptick in delay. If a delay lasts for several multiples of this number it will definitely be detected; if it lasts substantially less than this number we will tend to smooth through it and leave the rate alone. 0.4 seconds is a compromise; try adjusting it lower if you want faster detection and higher if you want fewer false positives during short spikes in delay. The number should be bigger than 0.0.
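For intuition, one common way such a factor is derived from a time constant is shown below. This is an assumption about the general technique, not a statement of what the project's ewma_factor() actually computes:

    -- Assumed form of an EWMA smoothing factor derived from a time constant.
    local function ewma_factor(tick_duration, time_constant)
        return math.exp(-tick_duration / time_constant)
    end

    -- With a hypothetical 0.5 s tick, a shorter time constant forgets old samples faster:
    print(ewma_factor(0.5, 0.4))  -- ~0.287 (reacts quickly to delay changes)
    print(ewma_factor(0.5, 2.0))  -- ~0.779 (smooths over short spikes)

    -- Each tick, the smoothed delay would then be updated roughly as:
    --   smoothed = factor * smoothed + (1 - factor) * new_sample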


@aGentti86, @anon50098793, @Nomid, @openwrticon

If you have had a chance to use v0.4.2 and have any feedback/stats, we would love to hear how it’s working out for you. Thanks!


Sorry, I had family stuff yesterday. Since my ISP is not stable at night, I set the minimum to 40% of my download speed and also tried changing upload_delay_ms and download_delay_ms to 30ms.
[plots: delaydownecdf, delayupecdf, downhist, uphist]



If you want the result with the same config as before, I can post it 2-3 hours after this post.


Wow, weird: you have another download delay in the 20-second range. You are the only one I've seen with that kind of data. I wonder what that's about?


Yes, it is weird. Can I ask what the default reflectors are in the advanced config?
But I can say that in version 0.4.2 my download speed can hit 40Mbps if the signal is in a good mood.

You’ll see the default set (starting set) of reflectors here in this commit (lines 1143-1144): https://github.com/Fail-Safe/sqm-autorate/commit/cfc8ca92a716864c63b0c487e18e8ce502600041

Would you mind pinging a handful of the reflectors from that list and letting us know what kind of RTT you see?

Thanks!

My ping to the Quad9 DNS:

PING 9.9.9.9 (9.9.9.9): 56 data bytes
64 bytes from 9.9.9.9: seq=0 ttl=56 time=50.652 ms
64 bytes from 9.9.9.9: seq=1 ttl=56 time=66.252 ms
64 bytes from 9.9.9.9: seq=2 ttl=56 time=76.574 ms
64 bytes from 9.9.9.9: seq=3 ttl=56 time=47.472 ms
64 bytes from 9.9.9.9: seq=4 ttl=56 time=50.492 ms
64 bytes from 9.9.9.9: seq=5 ttl=56 time=64.902 ms
64 bytes from 9.9.9.9: seq=6 ttl=56 time=55.165 ms
64 bytes from 9.9.9.9: seq=7 ttl=56 time=100.363 ms
64 bytes from 9.9.9.9: seq=8 ttl=56 time=62.821 ms
64 bytes from 9.9.9.9: seq=9 ttl=56 time=53.461 ms
64 bytes from 9.9.9.9: seq=10 ttl=56 time=63.678 ms
64 bytes from 9.9.9.9: seq=11 ttl=56 time=67.569 ms
64 bytes from 9.9.9.9: seq=12 ttl=56 time=55.807 ms
64 bytes from 9.9.9.9: seq=13 ttl=56 time=63.739 ms
64 bytes from 9.9.9.9: seq=14 ttl=56 time=71.580 ms
64 bytes from 9.9.9.9: seq=15 ttl=56 time=64.153 ms
64 bytes from 9.9.9.9: seq=16 ttl=56 time=58.384 ms
64 bytes from 9.9.9.9: seq=17 ttl=56 time=62.907 ms
64 bytes from 9.9.9.9: seq=18 ttl=56 time=52.243 ms
64 bytes from 9.9.9.9: seq=19 ttl=56 time=61.939 ms
64 bytes from 9.9.9.9: seq=20 ttl=56 time=64.982 ms
64 bytes from 9.9.9.9: seq=21 ttl=56 time=54.524 ms
64 bytes from 9.9.9.9: seq=22 ttl=56 time=67.812 ms
64 bytes from 9.9.9.9: seq=23 ttl=56 time=64.726 ms
64 bytes from 9.9.9.9: seq=24 ttl=56 time=59.788 ms
64 bytes from 9.9.9.9: seq=25 ttl=56 time=63.635 ms
64 bytes from 9.9.9.9: seq=26 ttl=56 time=58.288 ms
64 bytes from 9.9.9.9: seq=27 ttl=56 time=90.699 ms
64 bytes from 9.9.9.9: seq=28 ttl=56 time=77.113 ms
64 bytes from 9.9.9.9: seq=29 ttl=56 time=81.784 ms
64 bytes from 9.9.9.9: seq=30 ttl=56 time=69.374 ms
64 bytes from 9.9.9.9: seq=31 ttl=56 time=82.339 ms
64 bytes from 9.9.9.9: seq=32 ttl=56 time=52.658 ms
64 bytes from 9.9.9.9: seq=33 ttl=56 time=69.956 ms
64 bytes from 9.9.9.9: seq=34 ttl=56 time=46.488 ms
64 bytes from 9.9.9.9: seq=35 ttl=56 time=78.690 ms
64 bytes from 9.9.9.9: seq=36 ttl=56 time=67.748 ms
64 bytes from 9.9.9.9: seq=37 ttl=56 time=67.473 ms
64 bytes from 9.9.9.9: seq=38 ttl=56 time=68.656 ms
64 bytes from 9.9.9.9: seq=39 ttl=56 time=53.407 ms
64 bytes from 9.9.9.9: seq=40 ttl=56 time=64.725 ms
64 bytes from 9.9.9.9: seq=41 ttl=56 time=57.829 ms
64 bytes from 9.9.9.9: seq=42 ttl=56 time=65.178 ms
64 bytes from 9.9.9.9: seq=43 ttl=56 time=68.610 ms
64 bytes from 9.9.9.9: seq=44 ttl=56 time=65.227 ms
64 bytes from 9.9.9.9: seq=45 ttl=56 time=48.589 ms
64 bytes from 9.9.9.9: seq=46 ttl=56 time=56.608 ms
64 bytes from 9.9.9.9: seq=47 ttl=56 time=75.813 ms
64 bytes from 9.9.9.9: seq=48 ttl=56 time=56.042 ms
64 bytes from 9.9.9.9: seq=49 ttl=56 time=59.190 ms
64 bytes from 9.9.9.9: seq=50 ttl=56 time=59.084 ms
64 bytes from 9.9.9.9: seq=51 ttl=56 time=57.182 ms
64 bytes from 9.9.9.9: seq=52 ttl=56 time=69.697 ms
64 bytes from 9.9.9.9: seq=53 ttl=56 time=66.992 ms
64 bytes from 9.9.9.9: seq=54 ttl=56 time=60.570 ms
64 bytes from 9.9.9.9: seq=55 ttl=56 time=58.221 ms
64 bytes from 9.9.9.9: seq=56 ttl=56 time=60.429 ms
64 bytes from 9.9.9.9: seq=57 ttl=56 time=57.338 ms
64 bytes from 9.9.9.9: seq=58 ttl=56 time=74.769 ms
64 bytes from 9.9.9.9: seq=59 ttl=56 time=58.115 ms
64 bytes from 9.9.9.9: seq=60 ttl=56 time=75.821 ms
64 bytes from 9.9.9.9: seq=61 ttl=56 time=55.861 ms
64 bytes from 9.9.9.9: seq=62 ttl=56 time=68.248 ms
64 bytes from 9.9.9.9: seq=63 ttl=56 time=55.990 ms
64 bytes from 9.9.9.9: seq=64 ttl=56 time=108.650 ms
64 bytes from 9.9.9.9: seq=65 ttl=56 time=69.740 ms
64 bytes from 9.9.9.9: seq=66 ttl=56 time=109.424 ms
64 bytes from 9.9.9.9: seq=67 ttl=56 time=70.276 ms
64 bytes from 9.9.9.9: seq=68 ttl=56 time=55.658 ms
64 bytes from 9.9.9.9: seq=69 ttl=56 time=62.192 ms
64 bytes from 9.9.9.9: seq=70 ttl=56 time=63.509 ms
64 bytes from 9.9.9.9: seq=71 ttl=56 time=71.966 ms
64 bytes from 9.9.9.9: seq=72 ttl=56 time=60.132 ms
64 bytes from 9.9.9.9: seq=73 ttl=56 time=58.058 ms
64 bytes from 9.9.9.9: seq=74 ttl=56 time=71.944 ms
64 bytes from 9.9.9.9: seq=75 ttl=56 time=58.782 ms
64 bytes from 9.9.9.9: seq=76 ttl=56 time=54.274 ms
64 bytes from 9.9.9.9: seq=77 ttl=56 time=58.183 ms
64 bytes from 9.9.9.9: seq=78 ttl=56 time=50.793 ms
64 bytes from 9.9.9.9: seq=79 ttl=56 time=57.939 ms
64 bytes from 9.9.9.9: seq=80 ttl=56 time=70.519 ms
64 bytes from 9.9.9.9: seq=81 ttl=56 time=59.632 ms
64 bytes from 9.9.9.9: seq=82 ttl=56 time=70.749 ms
64 bytes from 9.9.9.9: seq=83 ttl=56 time=59.565 ms
64 bytes from 9.9.9.9: seq=84 ttl=56 time=94.337 ms
64 bytes from 9.9.9.9: seq=85 ttl=56 time=63.998 ms
64 bytes from 9.9.9.9: seq=86 ttl=56 time=55.904 ms
64 bytes from 9.9.9.9: seq=87 ttl=56 time=52.351 ms
64 bytes from 9.9.9.9: seq=88 ttl=56 time=55.557 ms
64 bytes from 9.9.9.9: seq=89 ttl=56 time=57.612 ms
^C
--- 9.9.9.9 ping statistics ---
90 packets transmitted, 90 packets received, 0% packet loss
round-trip min/avg/max = 46.488/64.535/109.424 ms

The weird thing is that you're getting 20s pings right from the start... even before 20s has passed. How does that happen? It can't be a baseline issue, because when we first get a response we set both the baseline and the fast ewma to the same value. I just don't understand how, within the first ~1 or 2 seconds, we can "detect 20s of delay". @_FailSafe, can you figure out a code path that could cause that?