Why does this happen? (sqm, bufferbloat)

Every time, right before the upload test, bufferbloat jumps all the way up into the 2000s (ms), sometimes as high as 6000-7000. It can sometimes take 10 seconds or more for the upload test to start. Other than that it seems fine (not ideal, but okay). That's with SQM on and no one else using the network at that moment. Actually it's roughly the same without SQM, which is weird! I've tried cake + piece_of_cake and fq_codel + simple; that didn't change anything. I've also tried different egress/ingress values (anywhere from 100% all the way down to 40-45%), which didn't change anything either (except for the speeds, of course).

Other related symptoms (I think): things just freeze occasionally, then get fast again. A Google search page would take 10-15 secs to load, after which every subsequent page loads instantly. A YouTube video would get stuck buffering at one point for 10-20 secs, then suddenly buffer almost the entire video super quickly.

Could this be a sign of faulty hardware? Bad router? It's a refurbished Archer A7, btw, plugged directly into the ONT on one end and into the computer on the other. I do know (or at least suspect with reasonable certainty) that this router can't saturate the full bandwidth (300/300) with SQM on, because the hardware is just not there, but it should be able to avoid bufferbloat when I drop both ingress and egress to something like 125000 (so <50%), right?

My guess is that it is almost certainly CPU-bound.
Try starting as low as 5k. If that is OK, work your way up until you get problems, then back off a little.
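
For anyone wanting to step through rates from the router's shell rather than the web UI, here is a minimal sketch. It only prints the commands so you can review them before pasting them in. The section name eth1 is taken from the /etc/config/sqm posted later in this thread and may differ on other setups, and the rate ladder is an arbitrary example, not a recommendation.

```shell
#!/bin/sh
# Print (do not execute) the commands that set both SQM rates to one value.
# $1 = rate in kbit/s, $2 = sqm config section (assumed 'eth1', as in this thread).
sqm_rate_cmds() {
    rate_kbit="$1"
    section="${2:-eth1}"
    echo "uci set sqm.${section}.download='${rate_kbit}'"
    echo "uci set sqm.${section}.upload='${rate_kbit}'"
    echo "uci commit sqm"
    echo "/etc/init.d/sqm restart"
}

# Example ladder (arbitrary steps): apply one step, run a speedtest,
# and only move up while latency under load stays low.
for rate in 5000 10000 20000 40000; do
    echo "# --- ${rate} kbit/s ---"
    sqm_rate_cmds "$rate"
done
```

Once a step starts showing latency spikes, go back to the previous one and try values in between.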

Could you please post a link to the dslreports speedtest's detailed results? That way we can see the bufferbloat measurements in high resolution, and it is less time-consuming than watching a video clip.
Also, please post the output of cat /etc/config/sqm. And please post the output of tc -s qdisc twice: first after a fresh reboot of your router (say a few minutes after the reboot, but without stressing the network), then perform a dslreports speedtest and post the output of tc -s qdisc again, along with a link to the detailed results of that test.

DSLReports log for the speedtest in the video:

0.00s Start testing Fiber
0.00s geo location failed
00.1s Servers available: 11
00.1s pinging 10 locations
05.2s could not reach Newcastle, Delaware, USA http://t68.dslreports.com
05.2s 13ms Silver Spring, MD, USA
05.2s 17ms Nashville, TN, USA
05.2s 25ms Houston, USA
05.2s 63ms Winnipeg, Manitoba, Canada
05.2s 67ms San Jose, USA
05.2s 67ms Los Angeles 2, CA, USA
05.2s 100ms Beaverton, Oregon, USA
05.2s could not reach Dallas, USA http://t59.dslreports.com
05.2s could not reach Kansas City, Missouri, USA http://t50.dslreports.com
05.2s 5 seconds measuring idle buffer bloat
10.7s Trial download normal
10.7s Using GET for upload testing
19.00s  stream0 3.43 megabit San Jose, USA
19.00s  stream1 3.19 megabit Dallas, USA
19.00s  stream2 5.88 megabit Silver Spring, MD, USA
19.00s  stream3 5.54 megabit Silver Spring, MD, USA
19.00s  stream4 4.45 megabit Nashville, TN, USA
19.00s  stream5 4.3 megabit Nashville, TN, USA
19.00s  stream6 3.64 megabit Houston, USA
19.00s  stream7 5.08 megabit Silver Spring, MD, USA
19.00s  stream8 4.91 megabit Silver Spring, MD, USA
19.00s  stream9 5 megabit Silver Spring, MD, USA
19.00s  stream10 4.3 megabit Nashville, TN, USA
19.00s  stream11 3.54 megabit Dallas, USA
19.00s  stream12 4.13 megabit Nashville, TN, USA
19.00s  stream13 3.49 megabit Houston, USA
19.00s  stream14 3.66 megabit Dallas, USA
19.00s  stream15 4.08 megabit Nashville, TN, USA
19.00s  stream16 4.48 megabit Silver Spring, MD, USA
19.00s  stream17 3.78 megabit Nashville, TN, USA
19.00s  stream18 3.76 megabit Houston, USA
19.00s  stream19 3.62 megabit Houston, USA
19.00s  stream20 3.13 megabit San Jose, USA
19.00s  stream21 3.64 megabit Houston, USA
19.00s  stream22 3.69 megabit Houston, USA
19.00s  stream23 3.22 megabit Dallas, USA
19.00s  stream24 2.03 megabit Dallas, USA
19.00s  stream25 3.38 megabit Dallas, USA
19.00s  stream26 3.82 megabit Dallas, USA
19.00s  stream27 3.39 megabit Dallas, USA
19.00s  stream28 2.12 megabit Dallas, USA
19.00s  stream29 2.83 megabit San Jose, USA
19.00s  stream30 2.44 megabit Dallas, USA
19.00s  stream31 2.45 megabit Dallas, USA
31.5s ERROR - latency not idle
31.5s End of download testing.
31.5s Using POST for upload testing
55.8s Upload report:
55.8s  stream0 3.18 megabit San Jose, USA
55.8s  stream1 10.41 megabit Dallas, USA
55.8s  stream2 11.67 megabit Silver Spring, MD, USA
55.8s  stream3 11.39 megabit Silver Spring, MD, USA
55.8s  stream4 9.91 megabit Nashville, TN, USA
55.8s  stream5 10.75 megabit Nashville, TN, USA
55.8s  stream6 5.1 megabit Houston, USA
55.8s  stream7 11.55 megabit Silver Spring, MD, USA
55.8s  stream8 11.31 megabit Silver Spring, MD, USA
55.8s  stream9 11.37 megabit Silver Spring, MD, USA
55.8s  stream10 10.55 megabit Nashville, TN, USA
55.8s  stream11 9.67 megabit Dallas, USA
61.9s End of upload testing
61.9s Recording upload  117.2
61.9s Timer drops: frames=56 total ms=18500 slip=0
61.9s END TEST
64.00s Total megabytes consumed: 471 (down:175 up:295.9)

Re-run with moeller's settings (125000 in/eg):

root@OpenWrt:~# cat /etc/config/sqm
config queue 'eth1'
        option qdisc_advanced '0'
        option debug_logging '0'
        option verbosity '5'
        option linklayer 'ethernet'
        option overhead '44'
        option interface 'eth0.2'
        option enabled '1'
        option download '125000'
        option upload '125000'
        option qdisc 'cake'
        option script 'piece_of_cake.qos'
root@OpenWrt:~# tc -s qdisc
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn
 Sent 191162 bytes 794 pkt (dropped 0, overlimits 0 requeues 1)
 backlog 0b 0p requeues 1
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-lan root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.1 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 8005: dev eth0.2 root refcnt 2 bandwidth 125Mbit besteffort triple-isolate nonat nowash no-ack-filter split-gso rtt 100.0ms noatm overhead 44
 Sent 61997 bytes 328 pkt (dropped 0, overlimits 21 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 14080b of 6250000b
 capacity estimate: 125Mbit
 min/max network layer size:           28 /    1500
 min/max overhead-adjusted size:       72 /    1544
 average network hdr offset:           10

                  Tin 0
  thresh        125Mbit
  target          5.0ms
  interval      100.0ms
  pk_delay        108us
  av_delay         16us
  sp_delay         15us
  backlog            0b
  pkts              328
  bytes           61997
  way_inds            0
  way_miss          125
  way_cols            0
  drops               0
  marks               0
  ack_drop            0
  sp_flows            1
  bk_flows            1
  un_flows            0
  max_len          4829
  quantum          1514

qdisc ingress ffff: dev eth0.2 parent ffff:fff1 ----------------
 Sent 113369 bytes 318 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan0 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 8006: dev ifb4eth0.2 root refcnt 2 bandwidth 125Mbit besteffort triple-isolate nonat wash no-ack-filter split-gso rtt 100.0ms noatm overhead 44
 Sent 117793 bytes 318 pkt (dropped 0, overlimits 81 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 11904b of 6250000b
 capacity estimate: 125Mbit
 min/max network layer size:           48 /    1500
 min/max overhead-adjusted size:       92 /    1544
 average network hdr offset:           10

                  Tin 0
  thresh        125Mbit
  target          5.0ms
  interval      100.0ms
  pk_delay        311us
  av_delay         31us
  sp_delay         14us
  backlog            0b
  pkts              318
  bytes          117793
  way_inds            0
  way_miss          118
  way_cols            0
  drops               0
  marks               0
  ack_drop            0
  sp_flows            1
  bk_flows            1
  un_flows            0
  max_len          1514
  quantum          1514
root@OpenWrt:~# tc -s qdisc
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 0: dev eth0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn
 Sent 461513576 bytes 461896 pkt (dropped 0, overlimits 0 requeues 20)
 backlog 0b 0p requeues 20
  maxpacket 1514 drop_overlimit 0 new_flow_count 76 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-lan root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.1 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 8005: dev eth0.2 root refcnt 2 bandwidth 125Mbit besteffort triple-isolate nonat nowash no-ack-filter split-gso rtt 100.0ms noatm overhead 44
 Sent 283594359 bytes 249883 pkt (dropped 37, overlimits 168025 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 184512b of 6250000b
 capacity estimate: 125Mbit
 min/max network layer size:           28 /    1500
 min/max overhead-adjusted size:       72 /    1544
 average network hdr offset:           14

                  Tin 0
  thresh        125Mbit
  target          5.0ms
  interval      100.0ms
  pk_delay        627us
  av_delay        327us
  sp_delay         20us
  backlog            0b
  pkts           249920
  bytes       283650377
  way_inds            0
  way_miss          298
  way_cols            0
  drops              37
  marks               0
  ack_drop            0
  sp_flows            1
  bk_flows            1
  un_flows            0
  max_len          4829
  quantum          1514

qdisc ingress ffff: dev eth0.2 parent ffff:fff1 ----------------
 Sent 177068824 bytes 212674 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan0 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 8006: dev ifb4eth0.2 root refcnt 2 bandwidth 125Mbit besteffort triple-isolate nonat wash no-ack-filter split-gso rtt 100.0ms noatm overhead 44
 Sent 178835032 bytes 211874 pkt (dropped 800, overlimits 81980 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 880896b of 6250000b
 capacity estimate: 125Mbit
 min/max network layer size:           48 /    1500
 min/max overhead-adjusted size:       92 /    1544
 average network hdr offset:           14

                  Tin 0
  thresh        125Mbit
  target          5.0ms
  interval      100.0ms
  pk_delay         28us
  av_delay         10us
  sp_delay          6us
  backlog            0b
  pkts           212674
  bytes       180046232
  way_inds            0
  way_miss          290
  way_cols            0
  drops             800
  marks               0
  ack_drop            0
  sp_flows            2
  bk_flows            1
  un_flows            0
  max_len          1514
  quantum          1514
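
The two tc -s qdisc dumps above are long, so here is a rough helper to boil each one down to a line per qdisc. It is only a sketch and assumes the header and "Sent ..." line layout shown in the output above; other tc versions may format these lines differently.

```shell
#!/bin/sh
# Summarise `tc -s qdisc` output read on stdin: one line per qdisc with its
# packet, drop and overlimit counters.
# Usage on the router:  tc -s qdisc | qdisc_summary
qdisc_summary() {
    awk '
        # Header line, e.g. "qdisc cake 8005: dev eth0.2 root refcnt 2 ..."
        $1 == "qdisc" { kind = $2; dev = $5; next }
        # Counter line, e.g. "Sent 283594359 bytes 249883 pkt (dropped 37, overlimits 168025 requeues 0)"
        $1 == "Sent" && kind != "" {
            dropped = $7
            sub(/,/, "", dropped)   # strip the trailing comma after the drop count
            printf "%s %s pkts=%s dropped=%s overlimits=%s\n", dev, kind, $4, dropped, $9
            kind = ""
        }
    '
}
```

Run on the second dump, this would show the egress cake instance on eth0.2 with 37 drops and the ingress instance on ifb4eth0.2 with 800 drops, both well under one percent of the packets sent.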

DSLReports after a reboot

Also, here are DSLReports results with SQM disabled, if that's of any help. Grade B in this one.

Another re-run with SQM disabled almost right after 1st one - Grade C

Another re-run with SQM disabled almost right after 2nd one - Grade D

Also a quick note: eth0.2 is wan,wan6 for me.

5K (in/eg) run 1

5K (in/eg) run 2

There did indeed seem to be less of a pause before the upload test and fewer spikes in general.

Ran fine at 20K

At 40K it started spiking again, and the big spike before upload returned.

40K? I thought the A7 was supposed to be a decent budget router haha

Could it be faulty hardware (since it's a refurb), or is that unlikely? It's a huge generalization, but my observation is usually when it comes to electronics, it either works or it doesn't. And this looks like an in-betweener.

It is. Don't forget broadband technology has advanced tremendously over the last couple of years.
A good budget router will handle ADSL/VDSL-type speeds, i.e. up to roughly tens of megabits per second. That is absolutely fine for most uses, home and office, including video streaming, but it cannot make use of feeds in the range of hundreds of megabits per second available with modern high-speed fiber-to-the-building offerings. To make use of these speeds you need a high-grade (expensive), top-of-the-range router... ;)

It's not even 40K. I still get frequent spikes and a big spike before upload. It's more like 30K for a consistently good Bufferbloat score (but a Quality score hovering between C and D..).

If this is indeed an A7 thing (is this a certainty, based on the configs I posted above?), what's the cheapest router right now that'll give me a stable connection with SQM and that'll, if not saturate my bandwidth, at least give me about 150Mbps (so like, ~50% of it)?

I might go back to stock and test it out on that (never did before flashing OpenWRT) before doing that though.

Try with fq_codel/simple.qos, repeat the tests and see if there is an improvement.

Two I have had success with are:

  1. Ubiquiti ER-X
    https://openwrt.org/toh/ubiquiti/ubiquiti_edgerouter_x_er-x_ka
    This is quite good but getting a little long in the tooth and I suspect (perhaps incorrectly, but due to the scarcity of supplies recently) that it is approaching end of life by the manufacturer. Probably cheapest if you can get one.

  2. GL.iNet GL-MV1000 (Brume)
    https://openwrt.org/toh/gl.inet/gl.inet_gl-mv1000_brume
    Superb with loads of ooooomph. But quite a bit more expensive


fq_codel/simple.qos gave nigh on identical results

Thanks for the suggestions. I do need wireless on the router. The Brume one does seem to have a wireless model, but it's 2.4 only - which is fine actually - but I would rather not spend that much on a router. It's very cute-looking though! Super small.

And also, I'm just wondering.. Why do I need SQM in the first place? Why am I having bufferbloat issues on a symmetrical 300/300 fiber connection - even when I'm the only one using the network? People are saying this is something I shouldn't even need in this situation.

And are the symptoms I described earlier a result of bufferbloat/misconfigured SQM?

Whether you need SQM is really your decision. Personally, I very much like how SQM keeps my link usable for latency-sensitive interactive uses even under saturating loads, and I also like its per-internal-IP fairness mode, which shares my limited WAN speed reasonably predictably between the concurrently active hosts. But those are just my subjective reasons; without SQM I see latency increases of >= 100ms under load, which immediately take the fun out of VoIP and videoconferencing, and even ssh sessions start to feel sluggish. But none of this need be true for your link, and even if your latency numbers were identical, that does not mean that your subjective assessment would match mine.
In other words, to SQM or not to SQM is a policy decision you need to make for your own network.
I will have a look at your posted numbers later/tomorrow when I have access to a real computer....


Looking at the bufferbloat tests in all of these, I have to say they all look like something went wrong with the tests, and hence they are not very diagnostic of anything.

If you have a linux host, you could try the following script I cobbled together (not as nice as the dslreports test, but as I said cobbled together):

#!/bin/bash

# show what is going to be run
cat "$0"
echo ""

SESSION_DATETIME=$( date "+%Y%m%dT%H%M%S" )
MTR_INTERVAL_SECS=0.2
MTR_HOST_IP="8.8.8.8"

echo "Starting unidirectional speedtest"
echo ""

# idle RTT
mtr -ezb4w -i ${MTR_INTERVAL_SECS} -c 100 ${MTR_HOST_IP} > mtr_idle_${SESSION_DATETIME}.out ; 
# loaded RTTs
mtr -ezb4w -i ${MTR_INTERVAL_SECS} -c 150 ${MTR_HOST_IP} > mtr_loaded_${SESSION_DATETIME}.out & 
# the speedtest (only used to generate load)
speedtest-cli > speedtest_${SESSION_DATETIME}.out

# wait for background job to finish
wait
echo "Unidirectional speedtests finished..."
echo ""


echo "RTT to ${MTR_HOST_IP} idle:"
cat ./mtr_idle_${SESSION_DATETIME}.out
echo ""

echo "Speedtest.net results:"
cat ./speedtest_${SESSION_DATETIME}.out
echo ""

echo "RTT to ${MTR_HOST_IP} unidirectional loads:"
cat ./mtr_loaded_${SESSION_DATETIME}.out
echo ""



echo "Starting bidirectional speedtest"
echo ""


## loaded RTTs
mtr -ezb4w -i ${MTR_INTERVAL_SECS} -c 150 ${MTR_HOST_IP} > mtr_fully_loaded_${SESSION_DATETIME}.out & 

# the speedtests (download in the background, upload in the foreground)
speedtest-cli --no-upload > speedtest_no-upload_${SESSION_DATETIME}.out &
speedtest-cli --no-download > speedtest_no-download_${SESSION_DATETIME}.out

# wait for background job to finish
wait
echo "Bidirectional speedtests finished..."
echo ""

echo "Speedtest.net result DOWN:"
cat ./speedtest_no-upload_${SESSION_DATETIME}.out
echo ""

echo "Speedtest.net results UP:"
cat ./speedtest_no-download_${SESSION_DATETIME}.out
echo ""

echo "RTT to ${MTR_HOST_IP} bidirectional load:"
cat ./mtr_fully_loaded_${SESSION_DATETIME}.out
echo ""

exit 0

Just copy this to a file (e.g. combined_mtr_speedtest.sh), make it executable and call it like:
sudo ./combined_mtr_speedtest.sh

Note that this requires both mtr and speedtest-cli to be installed. speedtest-cli is really only needed to generate load; the relevant data is in the mtr results, where best, average, worst and standard deviation (to the final hop) let you eyeball the latency distribution and get an idea of how much latency under load the speedtest causes for the sparse-ish mtr probes.
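
If eyeballing the reports gets tedious, the final-hop numbers can be pulled out with a few lines of awk. This is only a sketch: it assumes mtr's default report columns, where every hop line ends in Loss%, Snt, Last, Avg, Best, Wrst, StDev, and that hop lines start with the hop number followed by a dot. If your mtr build prints different columns, adjust the field offsets.

```shell
#!/bin/sh
# Print the last hop's latency statistics from an mtr report read on stdin.
# Assumed column order at the end of each hop line:
#   Loss%  Snt  Last  Avg  Best  Wrst  StDev
mtr_last_hop() {
    awk '
        # Remember the most recent hop line, e.g. "  9.|-- 8.8.8.8  0.0%  100 ..."
        /^ *[0-9]+\./ && NF >= 7 { line = $0 }
        END {
            n = split(line, f, " ")
            if (n >= 7)
                printf "avg=%s best=%s worst=%s stdev=%s\n", f[n-3], f[n-2], f[n-1], f[n]
        }
    '
}
```

Feed it the idle and the loaded report in turn (the mtr_idle_*.out and mtr_loaded_*.out files the script writes) and compare the avg and worst values; the difference is the latency added under load.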


Sorry I think I worded my question badly. By "Why do I need it?" I meant, why am I still getting bad latency and bufferbloat score when a decently high-bandwidth symmetrical fiber connection should technically serve as SQM by itself? Isn't the whole idea that bufferbloat and latency occur when the network is oversaturated? How does it manage to get oversaturated?

To paraphrase: to my understanding (and I am an absolute ignoramus, so could be 110% wrong), I shouldn't need SQM and shouldn't be getting bad bufferbloat and latency even without SQM in this situation. But I still am. Why could that be the case?

I'll try the script! Thank you very much for that. It's a bummer all the tests look faulty. Will this work from a VM? I don't have a linux machine unfortunately..

Because your router is not fast enough.

Oh so even without SQM it's struggling? I could try plugging directly into the ONT and testing that.


When plugged straight into the ONT it's somehow all over the place. Anywhere from B to D. I think I even got F once lol.

I've also tried the ISP's router (much larger in size than the A7), and that one got F.

I agree, but if the world were perfect nobody would need SQM ;). The issue really is over-sized and under-managed buffers. And these sub-optimal buffers typically show up at speed transitions: your internet is 300/300, but your internal networking is probably 1000/1000 (and for the downlink the internet side is >> 1000), so you are still likely to experience some degree of bufferbloat (but as before, whether you consider that actionable or not is your decision to make).
There are techniques in the Linux kernel, like BQL for ethernet adapters and AQL for wifi adapters, that help to get some buffers better managed (and qdiscs like fq or fq_codel help a lot), but unless these techniques percolate into the devices in front of the actual bottlenecks, like DSL modems/DSLAMs, DOCSIS modems/CMTSs and PON OLTs/ONTs, bufferbloat will still be an issue...
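
To put a rough number on "over-sized buffers": a full buffer adds a delay equal to its size divided by the rate at which it drains. The sketch below does that arithmetic. The 1 MB buffer is a made-up figure for illustration, not a measured value for any device in this thread.

```shell
#!/bin/sh
# Worst-case queueing delay in ms when a buffer of $1 bytes is completely full
# and drains at $2 Mbit/s:  delay = bytes * 8 / (Mbit/s * 10^6) seconds.
queue_delay_ms() {
    awk -v bytes="$1" -v mbit="$2" \
        'BEGIN { printf "%.1f\n", bytes * 8 / (mbit * 1000000) * 1000 }'
}

# Hypothetical 1 MB unmanaged buffer in front of a 300 Mbit/s bottleneck
queue_delay_ms 1000000 300
# The same buffer in front of a 30 Mbit/s bottleneck
queue_delay_ms 1000000 30
```

At 300 Mbit/s that hypothetical buffer is worth about 26.7 ms when full, and ten times that at 30 Mbit/s, which is why the same buffer hurts far more on slower links and why an AQM that keeps the standing queue short matters more than the raw link speed.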
