SQM makes bufferbloat significantly worse

There are fixes for the bufferbloat issues on the WRT AC Series (for example, @davidc502's builds have hardware buffer patches applied)

How big a problem does the WRT series have with SQM? I have a WRT1200AC and I get an A on the bufferbloat test with SQM. I'm using 17.01.4 r3560. My connection is 150/10.

@moeller0 I'm with Andrews & Arnold in the UK, they're a highly technical ISP so I have direct access to the modem, as well as upstream graphs of continuous line monitoring, and I can even change the 'profile' used by the backhaul carrier for the line. Unfortunately I don't know enough about network hardware to make the most of this!

My ADSL modem (ZyXEL VMG1312-B10D) doesn't seem to offer that much in the way of tweaking ADSL parameters or in diagnostics. Looking at it this morning (when line quality has been a bit better) I get the following from the "DSL Statistics" (I can try and 'stress' the line and then look again if you think that would help)

    ADSL Training Status:   Showtime
                    Mode:   ADSL2 Annex A
            Traffic Type:   ATM Mode
             Link Uptime:   0 day: 16 hours: 29 minutes
       ADSL Port Details       Upstream         Downstream
               Line Rate:      0.405 Mbps        2.772 Mbps
    Actual Net Data Rate:      0.372 Mbps        2.743 Mbps
          Trellis Coding:         ON                ON
              SNR Margin:       14.5 dB            8.3 dB
            Actual Delay:          4 ms             11 ms
          Transmit Power:       12.2 dBm          16.6 dBm
           Receive Power:        4.7 dBm           4.7 dBm
              Actual INP:        0.5 symbols       0.0 symbols
Attainable Net Data Rate:      0.440 Mbps        3.148 Mbps

            ADSL Counters

           Downstream        Upstream
Since Link time = 29 min 27 sec
FEC:		1178		11
CRC:		52		0
ES:			35		0
SES:		0		0
UAS:		0		0
LOS:		0		0
LOF:		0		0
LOM:		0		0
Retr:		0
HostInitRetr:	0
FailedRetr:	0
Latest 15 minutes time = 14 min 50 sec
FEC:		24		0
CRC:		0		0
ES:			0		0
SES:		0		0
UAS:		0		0
LOS:		0		0
LOF:		0		0
LOM:		0		0
Retr:		0
HostInitRetr:	0
FailedRetr:	0
Previous 15 minutes time = 15 min 0 sec
FEC:		8		0
CRC:		1		0
ES:   		1		0
SES:		0		0
UAS:		0		0
LOS:		0		0
LOF:		0		0
LOM:		0		0
Retr:		N/A
HostInitRetr:	N/A
FailedRetr:	N/A
Latest 1 day time = 16 hours 29 min 50 sec
FEC:		1178		11
CRC:		52		0
ES:			35		0
SES:		0		0
UAS:		22		22
LOS:		0		0
LOF:		0		0
LOM:		0		0
Retr:		0
HostInitRetr:	0
FailedRetr:	0
Previous 1 day time = 0 sec
FEC:		0		0
CRC:		0		0
ES:			0		0
SES:		0		0
UAS:		0		0
LOS:		0		0
LOF:		0		0
LOM:		0		0
Retr:		0
HostInitRetr:	0
FailedRetr:	0
Total time = 16 hours 29 min 50 sec
FEC:		1178		11
CRC:		52		0

The modem traffic status currently shows no errors or dropped packets.

All available DSL profiles and capabilities on the modem are currently enabled, including bitswap.

Upstream, my provider currently artificially caps the line to 95% of capacity, as they find this improves VoIP performance (I have a hardware VoIP phone that actually works quite well most of the time), although I can turn the cap off myself if I wish. Upstream also applies TCPFix (I have the option to also apply MRUFix, LCPFix and FastTimeout) and limits the MTU to 1492 because my backhaul doesn't support baby-jumbo frames, so I can't use a full MTU of 1500 after ATM encapsulation. However, I notice my modem is still set to an MTU of 1500. Could this be a problem, or will the upstream line settings overrule it?

The backhaul is with TalkTalk (TT) and currently is using the SI16_6_24M_1M profile which means Annex: A, Adaption: Dynamic, Interleaving: 16, SNR: 6dB, Max Downstream: 24Mbs, Max Upstream: 1Mbs.

Monitoring of my line shows U Attn to be rock steady at 36dB, D Attn rock steady at 60dB and D Margin wandering all over the place between 4.5 and 8.5dB. I have to say that at the times when the line has seemed to be behaving really well from my point of view, I have observed D Margin to be rock steady. However, when I mentioned this to my provider they didn't think it was important.

My line traffic graph yesterday when I posted, looked like this:


The state of the line between 10:00 and 14:00 shows when an upload was ongoing without the LEDE router (using the ZyXEL as both modem and router). The graph doesn't show "the situation on the ground", though: because of bufferbloat at my router's end, the latency actually perceived on my computers was in the thousands of milliseconds. At around 14:00 I switched the ZyXEL to bridge mode and introduced the Linksys running LEDE with SQM enabled to tackle the bufferbloat, but as can be seen on the graph things just got worse. The large gap between 16:30 and 17:30 is me switching the ZyXEL back to its modem+router configuration and taking the LEDE router out of the loop. As can be seen, afterwards there was still significant latency on the line.

My provider got back to me this morning, saying:

Thanks for your email. Yes can see from 14:00 - 17:00 yesterday a
difference in activity.

Firstly a few stats about your broadband circuit...

Estimated line length 4174m

Estimated standard ADSL download speed 1,644→3,983kb/s.

Inherent latency = PING REDACTED (REDACTED) 56(84) bytes of
64 bytes from REDACTED: icmp_seq=1 ttl=59 time=39.3 ms
64 bytes from REDACTED: icmp_seq=2 ttl=59 time=40.1 ms

The Zyxel has negotiated..

DownstreamRate: 2.743000Mb/s
UpstreamRate:   0.372000Mb/s

Looking at the last 60 days, though your circuit is long, it does appear
quite stable as the signal to noise ratio has remained stable and the
packet error rate is very low.

Most cloud services have an ability to cap the up-sync rate to alleviate
the asymmetric properties of the signalling as 0.372Mb/s is easily
saturated causing the latency you are experiencing.

If you need greater capacity we would advise moving to FTTC

So they don't seem to have found a problem with the line (in the UK they do actually have a reputation for being quite good at fault location).

Any suggestions on parameter tweaks I can make to improve things would certainly be appreciated!

So, when configuring SQM what values did you put in for the upload and download speeds?

EDIT: I ask because with high levels of bufferbloat the test results can fail to reflect how much data is going down the line. One reason is that TCP will reset connections if there is too long a delay for an ACK, and if you're getting something like 5 second delays, that's definitely too long. Based on the numbers you just posted I'd recommend 2400 kbps down and 300 kbps up as starting points for your SQM testing. If that doesn't work well, back it off to 2000 and 250 perhaps.
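
To put those starting points in context, here is a small sketch that expresses the suggested shaper rates as a percentage of the modem's negotiated net data rates quoted earlier in the thread (2743 kbps down, 372 kbps up). The function name is made up for illustration, and the 48/53 factor is just the generic ATM cell payload ratio, used here as a rough ceiling rather than anything measured on this line.

```python
# Negotiated net data rates from the modem statistics, in kbps.
SYNC_DOWN_KBPS = 2743
SYNC_UP_KBPS = 372

# ATM carries 48 payload bytes in every 53-byte cell.
ATM_PAYLOAD_RATIO = 48 / 53


def percent_of_sync(shaper_kbps, sync_kbps):
    """Return the shaper rate as a percentage of the sync rate."""
    return 100.0 * shaper_kbps / sync_kbps


if __name__ == "__main__":
    # The two pairs of starting values suggested in the post above.
    for down, up in [(2400, 300), (2000, 250)]:
        print(f"{down}/{up} kbps -> "
              f"{percent_of_sync(down, SYNC_DOWN_KBPS):.1f}% down, "
              f"{percent_of_sync(up, SYNC_UP_KBPS):.1f}% up")
    # Rough ceiling once ATM cell headers are subtracted (assumption:
    # no other per-packet overhead is accounted for here).
    print(f"ATM payload ceiling: "
          f"{SYNC_DOWN_KBPS * ATM_PAYLOAD_RATIO:.0f} kbps down, "
          f"{SYNC_UP_KBPS * ATM_PAYLOAD_RATIO:.0f} kbps up")
```

So the first suggestion sits at roughly 87.5% / 80.6% of sync, and the fallback at roughly 72.9% / 67.2%, both below the ATM payload ceiling.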

Excellent, A&A truly seems to be one of the few ISPs with loads of "clue". I wish I could get their service or find a similarly technically oriented ISP over here (Germany).
What profiles are you offered for the backhaul? (Since I am hearing of this concept for the first time, I will most likely not be able to help, but I certainly am curious now ;))

These are exactly what I wanted to see, thanks. Stressing the line should not matter, as long as we are talking about just using the line. If you used additional RF sources to disturb the physical voltage signal on the line, that would be different :wink: (and then my advice to improve your situation would simply be "just do not use the RF noise injectors").

I believe that in Zyxel nomenclature these are the bandwidth values your modem negotiated with the DSLAM, so these will be the physical limits on your ADSL link (but there might be bottlenecks past that).

These show that your ADSL link is pretty nice and clean (at least in the 16.5 hours since the last link negotiation). Just as a reminder, FEC counts are corrected errors; they will not affect latency and/or bandwidth at all (and you really want to see a few, as otherwise this would indicate that you are leaving potential bandwidth unused). The CRC errors all resulted in a dropped packet and hence are potentially bad, but 52 in 16.5 hours can be safely ignored (opinions differ, but most users will not notice a low single-digit number of CRC errors per minute). The 35 errored seconds (ES) are also totally unproblematic.
In short your actual link seems quite fine.
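
As a quick sanity check on that judgement, the CRC count can be turned into a per-minute rate and compared with the "low single digit per minute" rule of thumb. This is only a sketch; the helper name is invented, and the inputs are the downstream CRC count and uptime from the posted counters.

```python
def errors_per_minute(error_count, hours, minutes=0, seconds=0):
    """Average number of errors per minute over the given uptime."""
    total_minutes = hours * 60 + minutes + seconds / 60
    return error_count / total_minutes


if __name__ == "__main__":
    # 52 downstream CRC errors in 16 hours 29 min 50 sec of link uptime.
    rate = errors_per_minute(52, hours=16, minutes=29, seconds=50)
    print(f"{rate:.3f} CRC errors per minute")
```

That works out to about 0.05 CRC errors per minute, far below the level at which most users would notice anything.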

I do not know what half of those mean... The MTU thing is not so much about ATM, but rather about the 8 byte PPPoE header that "eats" into the MTU (that is, you are using the full MTU of 1500 to your ISP, but 8 of those 1500 bytes will only exist between your modem/router and the ISP, so the MTU towards all hosts upstream of your ISP needs to be reduced by these 8 bytes).

No, it should not be a problem (and the modem actually sees MTU 1500 packets); it is the PPPoE client and server that need to set the MTU to 1492, so that seems fine. Wrong MTU settings should, for IPv4, result in lots of fragmentation that would reduce the bandwidth and might cause bufferbloat, but that should also be easy to diagnose from a packet capture.
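
The MTU arithmetic is simple enough to write down. The sketch below assumes a plain 1500-byte Ethernet MTU, the 8-byte PPPoE overhead discussed above, and TCP/IP headers without options; the function names are illustrative only.

```python
ETHERNET_MTU = 1500
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol field
IPV4_HEADER = 20     # without options
IPV6_HEADER = 40     # fixed header, no extension headers
TCP_HEADER = 20      # without options


def pppoe_mtu(link_mtu=ETHERNET_MTU):
    """MTU left for IP packets once the PPPoE overhead is subtracted."""
    return link_mtu - PPPOE_OVERHEAD


def tcp_mss(mtu, ip_header):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - ip_header - TCP_HEADER


if __name__ == "__main__":
    mtu = pppoe_mtu()
    print("PPPoE MTU:", mtu)
    print("TCP MSS over IPv4:", tcp_mss(mtu, IPV4_HEADER))
    print("TCP MSS over IPv6:", tcp_mss(mtu, IPV6_HEADER))
```

With these assumptions the PPPoE MTU is 1492, giving a TCP MSS of 1452 over IPv4 and 1432 over IPv6.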

That is a description of the ADSL link's properties, but since your modem counters look sane this should not be your problem.

It would be interesting to correlate the modem's error counter changes with these "D Margin" fluctuations. If they wobble again, please have repeated looks at the error counters and see whether the CRC errors increase.
These CRC errors will trigger retransmits, which cost bandwidth but should not affect latency under load, so I am not sure whether this really is part of your core problem.

Nice, it seems my interpretation of your line stats is reasonable :wink: So that leaves only a few places for the bufferbloat to actually occur. It could be the router, the modem (if there is a bug in its bridge mode) or the DSLAM's uplink...

Could you run a speedtest next time you see the issues, with a concurrent "mtr" (if on Windows, try WinMTR), please? Mtr is sort of a continuous traceroute that will show the last, average, best and worst RTT to all hops between the computer running mtr and the target IP address. What can be diagnostic is if the RTTs from a specific hop onward all increase under load. traceroute/mtr results are a bit tricky to interpret, but I am sure that if you post the results here in the thread we can try to make sense of the data.
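
To make the "from a specific hop onward" idea concrete, here is a rough sketch that compares per-hop average RTTs from an idle run and a loaded run and reports the first hop from which every later hop is also slower. The function name, the 100 ms threshold and the sample numbers are all made up for illustration; they are not taken from any real mtr output.

```python
def first_bloated_hop(idle_avg_ms, loaded_avg_ms, threshold_ms=100.0):
    """Return the 1-based number of the first hop from which the average RTT
    of that hop and every later hop rises by more than threshold_ms under
    load, or None if there is no such hop."""
    increases = [loaded - idle
                 for idle, loaded in zip(idle_avg_ms, loaded_avg_ms)]
    for index in range(len(increases)):
        if all(delta > threshold_ms for delta in increases[index:]):
            return index + 1
    return None


if __name__ == "__main__":
    # Hypothetical per-hop averages (ms), idle versus under load.
    idle = [0.7, 40.0, 41.0, 42.0]
    loaded = [0.7, 900.0, 910.0, 905.0]
    print("Latency under load starts at hop:", first_bloated_hop(idle, loaded))
```

Requiring all later hops to be slower as well filters out a single intermediate router that merely answers probes slowly, which is a common source of misleading mtr readings.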

Just tried davidc502’s builds.

Works amazingly well, before I got a few spikes of 200ms on the test but it still got me around an A on the test. Now the spikes only reach 10ms.

Here is without the traffic shaper

I am using cake + piece of cake.

OP, you should give his builds a try.

Okay so I've run some more tests, and done all the following things:

  • Reflashed the Linksys router to use davidc502's patched builds as suggested by @JW0914

  • Based my initial upstream/downstream numbers off the modem's negotiated line values as suggested by @dlakelan

  • Reduced the MTU down to 1280 as suggested by @fuller

I've done all of these things in stages and at each point tested the line from a wired connection with no other clients. As suggested by @moeller0 I've run a concurrent "mtr" while running the speed tests in each case. The results are given at the end of this post.

The lowdown, though, is that it still doesn't help, even after tweaking the numbers and also trying fq_codel instead of cake. However, I'm starting to be convinced from the way that the speed test behaves that SQM really isn't working properly. My impression was that it was supposed to prevent upload bandwidth going much above the thresholds you set, but what I see in the speed tests is that the upload speed always initially ramps right up beyond 1 Mbps. This, unsurprisingly, causes the line to completely saturate and the buffers to fill up, so that the bandwidth then decays exponentially, sometimes to less than 10 kbps by the end of the test. Unfortunately the saved test results don't seem to show the historical data for this; they only give the final measured upload speed. I've tried progressively lowering the upstream value in SQM, but this just continues to happen right up to a sudden cutoff, at which point the whole thing just breaks instead and the speed test won't even run.

The only way I have been able to instead get a steady, almost flat, upload speed with the tests has been paradoxically to tell SQM to greatly curtail the download speeds. This results in steady upload speeds over 100 kbps but of course totally destroys download. I've shown this too in the results below.

If anyone can offer any ideas as to what on earth might be going on, I'd be very grateful. I know my line isn't fast but it seems both my ISP and people here now think that it ought to be capable of something around 300 kbps upload :frowning:

Anyway here are the results...

Control Test (Zyxel as modem and router, connection completely idle)

mtr results after running mtr for 5 minutes:

REDACTED-HOSTNAME (           2018-04-13T14:14:07+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1.                       0.0%   304    0.8   0.7   0.6   1.8   0.1
 2. bottomless.aa.net.uk              0.0%   303   40.7  41.2  37.1 163.7   9.7
 3. n.aimless.thn.aa.net.uk           0.0%   303   42.4  41.8  37.4 211.3  14.7
 4. google1.lonap.net                 0.0%   303   40.1  41.8  37.5 190.2  11.7
 5.                   0.0%   303   39.8  42.0  37.7 213.1  13.3
 6.                   0.3%   303   42.9  43.0  38.3 192.8  13.0
 7. google-public-dns-a.google.com    0.0%   303   39.0  41.7  37.5 196.1  10.9

Control under load (Zyxel as modem and router, speedtest running)

REDACTED-HOSTNAME (           2018-04-13T14:16:55+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1.                       0.0%    80    0.7   0.7   0.5   1.0   0.1
 2. bottomless.aa.net.uk             11.4%    80  7983. 3272.  38.8 7983. 2857.
 3. n.aimless.thn.aa.net.uk          11.4%    80  7843. 3268.  38.3 7843. 2837.
 4. google1.lonap.net                 8.9%    80  7704. 3375.  37.9 7730. 2855.
 5.                  11.4%    80  7866. 3316.  38.7 7878. 2871.
 6.                  15.2%    79  8014. 3311.  40.2 8026. 2911.
 7. google-public-dns-a.google.com   14.1%    79  8065. 3270.  37.7 8065. 2836.

Zyxel in bridge mode, Linksys running patched LEDE build by davidc502, no SQM

REDACTED-HOSTNAME (        2018-04-13T14:32:41+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. LEDE-ROUTER                      0.0%    56    0.5   0.4   0.3   0.6   0.0
 2. bottomless.aa.net.uk             19.6%    56  9322. 1897.  36.2 9322. 2820.
 3. n.aimless.thn.aa.net.uk          16.1%    56  9202. 2249.  35.6 9369. 3172.
 4. google1.lonap.net                18.2%    56  9491. 1982.  37.3 9491. 2932.
 5.                  16.4%    55  9565. 2160.  36.0 9565. 3111.
 6.                  16.4%    55  9630. 2179.  36.5 9630. 3111.
 7. google-public-dns-a.google.com   16.4%    55  9540. 2200.  35.9 9540. 3110.

With SQM (cake queue discipline, downstream set to 2200 upstream to 300)

REDACTED-HOSTNAME (        2018-04-13T14:40:00+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. LEDE-ROUTER                       0.0%    58    0.4   0.4   0.3   0.5   0.0
 2. bottomless.aa.net.uk              7.0%    57  4941. 1319.  35.6 4941. 1660.
 3. n.aimless.thn.aa.net.uk           7.1%    57  4793. 1265.  36.6 4793. 1609.
 4. google1.lonap.net                 7.1%    57  4803. 1283.  36.5 4803. 1614.
 5.                   7.1%    57  4815. 1290.  36.2 4815. 1623.
 6.                   7.1%    57  4842. 1302.  36.8 4842. 1635.
 7. google-public-dns-a.google.com    7.1%    57  4857. 1315.  37.1 4857. 1646.

With SQM (downstream unfiltered, upstream 200)

REDACTED-HOSTNAME (        2018-04-13T15:14:28+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. LEDE-ROUTER                       0.0%    68    0.5   0.4   0.3   0.5   0.0
 2. bottomless.aa.net.uk              0.0%    68   59.0 136.2  39.2 798.0 178.1
 3. n.aimless.thn.aa.net.uk           0.0%    68   49.8 152.7  38.1 844.5 182.6
 4. google1.lonap.net                 0.0%    68   90.2 175.0  41.7 884.1 195.2
 5.                   0.0%    68   68.1 174.1  39.9 921.5 198.1
 6.                   0.0%    68   57.7 186.8  43.1 972.2 205.5
 7. google-public-dns-a.google.com    0.0%    67   94.3 196.1  40.7 956.9 212.1

SQM (downstream 2200 upstream 300) with MTU 1280

REDACTED-HOSTNAME (        2018-04-13T15:23:52+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. LEDE-ROUTER                       0.0%    56    0.4   0.4   0.3   0.8   0.1
 2. bottomless.aa.net.uk              7.1%    56  4077. 992.0  45.9 4077. 1337.
 3. n.aimless.thn.aa.net.uk           7.1%    56  4126. 1008.  43.5 4126. 1351.
 4. google1.lonap.net                 7.3%    55  4008. 963.4  44.9 4008. 1298.
 5.                   7.3%    55  4020. 975.9  46.3 4020. 1310.
 6.                   7.3%    55  4018. 983.6  47.8 4018. 1318.
 7. google-public-dns-a.google.com    7.3%    55  4069. 995.6  44.4 4069. 1331.

SQM (downstream 300 upstream unfiltered) MTU 1280

Note upstream unfiltered now, but downstream severely limited.

REDACTED-HOSTNAME (        2018-04-13T15:45:40+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. LEDE-ROUTER                       0.0%    45    0.4   0.4   0.3   0.5   0.0
 2. bottomless.aa.net.uk              0.0%    45   40.2  73.0  37.5 209.1  49.0
 3. n.aimless.thn.aa.net.uk           0.0%    45   39.8  72.6  38.1 331.1  59.8
 4. google1.lonap.net                 0.0%    45   41.5  81.3  37.1 453.7  72.5
 5.                   0.0%    45   44.2  83.1  37.8 579.7  90.3
 6.                   0.0%    44   41.5  79.8  37.8 472.5  76.9
 7. google-public-dns-a.google.com    0.0%    44   40.8  78.5  39.5 334.0  61.2

SQM (downstream 800 upstream unfiltered) MTU 1280

Again just limiting downstream.

REDACTED-HOSTNAME (        2018-04-13T15:49:03+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. LEDE-ROUTER                       0.0%    56    0.4   0.4   0.3   0.6   0.0
 2. bottomless.aa.net.uk              0.0%    56   39.4 117.0  37.9 495.8 133.2
 3. n.aimless.thn.aa.net.uk           1.8%    56   38.8 119.7  37.9 595.8 143.3
 4. google1.lonap.net                 0.0%    56   42.1 120.3  37.8 470.1 131.2
 5.                   0.0%    56   44.1 129.0  37.5 587.0 146.0
 6.                   0.0%    56   44.9 133.4  40.0 590.2 151.3
 7. google-public-dns-a.google.com    0.0%    56   40.1 125.5  38.1 493.5 139.7

Once again many thanks to everyone for all their input so far!

Where to get those patches please?

I see it says you're using a large number of streams 8 to 16. On such a slow connection this may be a problem. Try changing settings to 1 stream both directions, or maybe 3 at most. The feedback effect may be too slow given your speed to get all the streams tuned right... Resulting in the stall you see.
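
A back-of-the-envelope sketch of why many streams hurt on a link this slow: if the upstream shaper rate is split equally between the streams (an idealised assumption), each stream can only send one full-size packet every so often. The 1500-byte packet size and the function name are assumptions for illustration; 300 kbps is the upstream shaper rate used in this thread.

```python
def per_stream_packet_interval(link_kbps, streams, packet_bytes=1500):
    """Seconds between full-size packets for one stream, assuming the
    link rate is shared equally between all streams."""
    per_stream_bps = link_kbps * 1000 / streams
    return packet_bytes * 8 / per_stream_bps


if __name__ == "__main__":
    for streams in (1, 3, 8, 16):
        interval = per_stream_packet_interval(300, streams)
        print(f"{streams:2d} streams: one packet per stream "
              f"every {interval * 1000:.0f} ms")
```

At 16 streams each flow gets a packet through only about every 640 ms, many times the idle RTT of roughly 40 ms, so each sender receives very sparse feedback. At 3 streams the interval drops to about 120 ms.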

I don't recall where those were pulled from, so I've sent an invite to @davidc502

Excellent, thank you for this thorough data collection!

Okay, you could try to test this hypothesis by installing iftop on the router ("opkg update ; opkg install iftop") and running "iftop -i pppoe-wan" while the speedtest is running. This should show you the more or less instantaneous WAN bandwidth in use; if this spikes above your set value, we have confirmation that the shaper is misbehaving.

These mtr results indicate that your unloaded RTT of ~41 ms is mainly caused between the first and second hop. The worst case results also seem to corroborate this (no proof though, as there is no guarantee that the worst case samples are from around the same time; it might just be coincidence). Also, with your uplink you should expect a serialization delay for full MTU packets of 1544*(53/48)*8 / (300000) * 1000 ≈ 45.46 ms.
All the mtr results seem to support the hypothesis that the issue occurs between hop 1 and hop 2. Unfortunately that "single mtr hop" most likely consists of multiple "hops" on TalkTalk's / Openreach's network. I also agree that unmitigated bufferbloat on the LEDE router itself should have exactly the same phenotype.
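
The serialization-delay estimate in this post can be reproduced with a few lines of Python. The first function mirrors the formula as written (scaling the packet size by 53/48); the second is a variant that rounds up to whole ATM cells, since the last cell of a packet is padded. The 1544-byte packet size and the 300000 bit/s rate are taken from the post as-is; the function names are illustrative.

```python
import math

ATM_CELL_BYTES = 53     # size of one ATM cell on the wire
ATM_PAYLOAD_BYTES = 48  # payload bytes carried per cell


def serialization_delay_ms_approx(packet_bytes, rate_bps):
    """Approximation used in the post: inflate the size by 53/48."""
    wire_bytes = packet_bytes * (ATM_CELL_BYTES / ATM_PAYLOAD_BYTES)
    return wire_bytes * 8 / rate_bps * 1000


def serialization_delay_ms_cells(packet_bytes, rate_bps):
    """Whole-cell variant: a packet always occupies an integer number of cells."""
    cells = math.ceil(packet_bytes / ATM_PAYLOAD_BYTES)
    return cells * ATM_CELL_BYTES * 8 / rate_bps * 1000


if __name__ == "__main__":
    print(f"approx: {serialization_delay_ms_approx(1544, 300000):.2f} ms")
    print(f"cells:  {serialization_delay_ms_cells(1544, 300000):.2f} ms")
```

The approximation gives about 45.46 ms and the whole-cell variant about 46.64 ms (33 cells), so either way a single full-size packet on this uplink takes roughly as long to send as the whole idle round trip.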

So one problem with this test is that the buffers from the downstream test are still so bloated that they do not empty before the upstream test starts, so the upstream test never has a chance, as it starts with overly full buffers on the ACK return path.

But this shows that upload SQM does not seem to work. Now, as much as MTU 1280 should not have a big effect, I would recommend doing an SQM run (downstream 1800, upstream 250) with MTU 1500/1492. The idea would be to try to find a regime in which a) the download test does not overfill the massively oversized buffers and b) the upstream test also stays below the pain threshold. The MSS clamping (MTU 1280) is a bit suspect, and going from 1500 to 1280 does not buy you that much (that would be different with MTU 576, but that is only expected to work with IPv4; IPv6 can break, as there the RFCs require at least MTU 1280 capability).

This is a very keen observation, I wonder how I missed that. In my experience dslreports always automatically reduced the number of streams to match the configured bandwidth (but that always was with working sqm, so this might reflect the fact that your link's heroic over-buffering just out-lasts dslreports capability test).

I haven't applied those hardware patches in a very long time. I've searched both of my Linux VMs and my physical server, and I don't have them anymore. From what I remember they were for kernel 3.x, and when we went to kernel 4 they broke and were not fixed (from what I recall).

Currently the only patch I'm having to apply is to the wifi driver, and it isn't a patch, it is more of a hack. This is where AMSDU is disabled for the 1900acx, acs, and 1200ac units on mwlwifi driver. User Starcms is the one who actually recommended the hack to me last year.

Bottom line is bufferbloat should be similar between the daily snapshots and davidc502 builds from an RJ45 LAN perspective. However, latency should be improved for those on WiFi on one of the units described above, especially in gaming situations.

Hope that helps to clarify.

Interesting, I tested back to back. I guess it's just because of the updated LEDE builds then.

I may have found the patches... Take a look. Hardware Buffer Management


"Since mvneta driver supports using hardware buffer management (BM), in
order to use it, board files have to be adjusted accordingly. This commit
enables BM on AXP-DB and AXP-GP in same manner - because number of ports
on those boards is the same as number of possible pools, each port is
supposed to use single pool for all kind of packets."

Thank you all again. It turns out the number of streams in the test really did seem to be the problem, as pointed out by @dlakelan. I reduced the test to three streams and the results started to get much better.

Using SQM with cake, an upstream value of 300 and a downstream value between 2200 and 2400, I can now get a bufferbloat score between A and B (more often A), download speeds of 1.8-2.0 Mbps and upload of 180-220 kbps.

I've also confirmed that I can simultaneously stream video, upload data and still have a responsive SSH session (at least all from the same client), although upload speeds do get slower under these conditions. Which is exactly what I wanted. I still need to test this works with multiple clients connected to the router, but so far very promising.

Thanks again everyone for your help.

This is odd in that TCP should scale back if it finds insufficient bandwidth, but I guess a combination of effects like (potentially) using an initial window of 10 packets and a quickly increasing RTT will make TCP unable to cope well with that condition. I will try to update the speedtest recommendations accordingly, thanks.
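
A worst-case sketch of the initial-window effect mentioned here: if every stream sends a full initial window at once and nothing is dropped, the time to drain that burst through the bottleneck grows linearly with the stream count. The 10-packet window is the value floated in this post; the 1500-byte packet size, the simultaneous start and the function name are assumptions for illustration.

```python
def initial_burst_drain_seconds(streams, link_kbps,
                                initial_window=10, packet_bytes=1500):
    """Seconds needed to drain one initial window per stream through the
    bottleneck, assuming all streams start at once and nothing is dropped."""
    burst_bits = streams * initial_window * packet_bytes * 8
    return burst_bits / (link_kbps * 1000)


if __name__ == "__main__":
    for streams in (1, 3, 16):
        seconds = initial_burst_drain_seconds(streams, link_kbps=300)
        print(f"{streams:2d} streams: {seconds:.1f} s of queue at 300 kbps")
```

Under these assumptions 16 streams put about 6.4 s worth of data in flight before any feedback arrives, versus about 1.2 s for 3 streams. A shaper with AQM would drop or mark packets long before that, so treat this as an upper bound on the burst rather than a prediction.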

thank you

Hmm maybe also give the sqm/cake ack filter feature a try too?

I also had ADSL with a 3 Mbps cap, but discovered an Unlimited Sprint 4G LTE plan. Now I get 45 Mbps down / 12 Mbps up with a 30 ms ping, using a Sierra Wireless MC7455 card in a USB adapter with the SIM out of the provided hotspot. If you have a tower not too far away, it might be worth a look.

And do you run with SQM enabled? If so, what settings?