Troubleshooting SQM for Comcast Gigabit

Hello all,

I'm trying to stabilize SQM on Comcast's DOCSIS 3.1 Gigabit tier (1000 Mbit down, 35 Mbit up), but cannot seem to get the download speeds high and stable. I did some troubleshooting on the LEDE IRC channel with shm0, but we were not able to find the best blend of bandwidth to latency under load.

My modem is a Motorola MB8600, and the router is a whitebox x86 machine with an AMD Athlon 5350 CPU, 4 GB RAM, and an Intel i350-T2 dual-port NIC, running 17.01.4 x86_64. Without SQM I can achieve 930 Mbit (about 80% idle, observed with top -d 1). With SQM enabled, no matter what I do, I hit a wall at roughly 750 Mbit despite 50% idle, and it comes with a penalty of high latency.

I see some pretty high latency spikes in DSLReports using 750000/35000 with cake and layer_cake enabled on the WAN interface.

Ingress settings are nat dual-dsthost docsis, and egress settings are nat dual-srchost ack-filter docsis.
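
For anyone wondering what the docsis keyword in those option strings changes: per the tc-cake man page it is shorthand for overhead 18 mpu 64 noatm. Below is a minimal Python sketch of that per-packet accounting, assuming the packet length is measured at the IP layer. The helper names (accounted_size, goodput_fraction) are made up for illustration and are not cake's actual code.

```python
DOCSIS_OVERHEAD = 18  # bytes added per packet by cake's "docsis" keyword
DOCSIS_MPU = 64       # minimum size any packet is accounted as


def accounted_size(ip_len, overhead=DOCSIS_OVERHEAD, mpu=DOCSIS_MPU):
    """Size in bytes the shaper charges for one packet of ip_len bytes."""
    return max(ip_len + overhead, mpu)


def goodput_fraction(ip_len, overhead=DOCSIS_OVERHEAD, mpu=DOCSIS_MPU):
    """Fraction of the shaped rate that carries IP bytes at this packet size."""
    return ip_len / accounted_size(ip_len, overhead, mpu)


# Print accounted size and goodput share for a few typical packet sizes.
for size in (40, 576, 1500):
    print(size, accounted_size(size), round(goodput_fraction(size), 3))
```

Under these assumptions a full-size 1500-byte packet is charged as 1518 bytes and a 40-byte ACK as 64 bytes, so the keyword matters most on links that carry a lot of small packets.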

Anything higher results in more bufferbloat. Things look a little better when experimenting at 450000/35000, where the bufferbloat evens out.

The spikes are still quite evident, and I'm not sure what else to try. Everything else is at defaults. Using irqbalance reduces idle utilization, but it neither raises the wall nor reduces the spikes.

I can't seem to max out 32/32 streams in DSLReports - that seems to cause the browser test to go funky, and I have to reduce it down to 20/20 for it to properly behave.

Is there anything I can do to tune this further? Could the hardware be the limiting factor?

DOCSIS 3.1 has head-end bufferbloat "mitigation" and Comcast has upgraded their head ends in many locations. You may be having the two fighting with each other.

https://www.cablelabs.com/how-docsis-3-1-reduces-latency-with-active-queue-management/

That could definitely be an issue. The other thing is that testing a gigabit link is somewhat hard in general. With 20 streams it should probably be OK, but it can actually stress the machine you are running the test on. There is an option somewhere in the dslreports settings to prevent it from writing data to disk, which can help.

Speed Test suggests moving the browser cache directory to a RAM disk to eliminate the maximum download speed being limited to the maximum disk write speed.

Okay, I would definitely work on fixing that, otherwise you might confuse a browser stall with problems on your DOCSIS link. Here is a link that might help https://www.dslreports.com/forum/r29994522-speedtest-gigabit-testing-results-debugging

I would also like to see the results for simple.qos/fq_codel with the 750000/35000 shaper settings. Simple.qos tends to keep latency better constrained, at a higher bandwidth cost. That said, your issue is not chronic, which makes sqm itself an unlikely culprit. There must be something else that interacts with sqm's shaper and yields these odd latency spikes.

Oh, it would also be interesting to see a test with 0/35000 as shaper settings, which effectively disables sqm in the download direction, and a test with sqm completely disabled. There is a slight chance that you might not need sqm at all, or at least not on the expensive download side (shaping 35 Mbps should be no issue on your hardware; lowly MIPS CPUs could do that, say, 10 years ago).

I would agree: don't try to control the download, just run SQM on the upload side.

Thank you all for your help!

Here's cake results at 0/35000:

Here’s SQM set to 750000/35000 fq_codel and simple.qos:

Regarding the browser stall, I'm not sure if my client machine would be encountering issues. Using latest stable chrome and i7-6700k, 32 gb ram, intel i219-v nic. I can definitely test browsers if that’s suspect, however.

Looks like you ran at least one of those tests over HTTPS, based on the badge having https on it. I'd recommend against that, since encryption needs a lot of CPU.

These both look terrible. The first encountered a stall somewhere in the download, the draining of which spread well into the short idle period between upload and download. You are testing with a wired connection and no additional traffic, I assume?
The simple.qos test also looks suspicious, especially the upload. On most hardware, HTB+fq_codel running out of CPU cycles results in less throughput than expected but well-controlled latency. In your example we see bad throughput with incredibly bad upload latency (and honestly, for the uplink I would have guessed that DOCSIS 3.1's mandatory PIE AQM should have avoided something this bad).

Could we maybe start fresh with a dslreports speedtest without sqm with a wired connection between computer and router and a wired connection between router and modem and with all other known sources of traffic turned off? The idea is to establish a baseline first and then see how to improve this.

Then based on the dubious simple.qos results I would be interested in the results with sqm set to 350/25 Mbps, as that should certainly be in reach for your router's cpu...

Also, absolutely TURN OFF HTTPS on the test. It's unlikely that even your i7 is going to encrypt/decrypt a gigabit/s without having some effect; you want the browser/client to just push packets.

50% idle on a dual-core system is one core at 100% and the other idle. Are you sure no one core is maxing out?
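
One way to check this is to look at per-core counters instead of the averaged idle figure. Here is a rough Python sketch that takes two snapshots of /proc/stat and reports busy and softirq share per core. It assumes Python is available on the box (it is not installed by default on LEDE), and the function names are made up for illustration. The same arithmetic can be done by hand from two reads of /proc/stat.

```python
import time


def parse_cpu_times(text):
    """Return {cpu_name: [jiffy counters]} for each per-core line of /proc/stat."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 ..."; skip the aggregate "cpu" line.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            times[fields[0]] = [int(v) for v in fields[1:]]
    return times


def per_core_usage(before, after):
    """Return {cpu_name: (busy_pct, softirq_pct)} between two snapshots."""
    usage = {}
    for cpu, new in after.items():
        delta = [n - o for n, o in zip(new, before[cpu])]
        # First 8 fields: user nice system idle iowait irq softirq steal.
        # Guest time is already included in user, so it is left out here.
        total = sum(delta[:8])
        if total == 0:
            continue
        idle = delta[3] + delta[4]
        usage[cpu] = (100.0 * (total - idle) / total, 100.0 * delta[6] / total)
    return usage


def report(interval=1.0, path="/proc/stat"):
    """Sample /proc/stat twice, interval seconds apart, and print per-core load."""
    with open(path) as f:
        first = parse_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_times(f.read())
    for cpu, (busy, softirq) in sorted(per_core_usage(first, second).items()):
        print(f"{cpu}: {busy:5.1f}% busy, {softirq:5.1f}% softirq")
```

Calling report() while a speed test is running prints one line per core. A single core sitting near 100% busy, mostly in softirq, while the others idle would suggest the shaper is limited by one core rather than by total CPU.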

Hello All,

All tests are cat 6a from modem to router, router to desktop, no other activity.

Baseline - https://www.dslreports.com/speedtest/30794937

350000/25000 Cake layer_cake - https://www.dslreports.com/speedtest/30795350

350000/25000 fq_codel simple.qos - https://www.dslreports.com/speedtest/30795441

I have an HP Pro 6200 SFF to add a little bit more horsepower to the equation, and below are its results, using the same i350-t2 NIC:

Baseline - https://www.dslreports.com/speedtest/30796170

750000/35000 Cake layer_cake - https://www.dslreports.com/speedtest/30796364

750000/35000 fq_codel simple.qos - https://www.dslreports.com/speedtest/30796318

350000/25000 Cake layer_cake - https://www.dslreports.com/speedtest/30796449

350000/25000 fq_codel simple.qos - https://www.dslreports.com/speedtest/30796507

I also have a Linksys WRT1200AC and did some additional testing, using the same modem hardwired to it, and from the router to the desktop.

Baseline - https://www.dslreports.com/speedtest/30797267

750000/35000 Cake layer_cake - https://www.dslreports.com/speedtest/30797840

750000/35000 fq_codel simple.qos - https://www.dslreports.com/speedtest/30797886

500000/35000 Cake layer_cake - https://www.dslreports.com/speedtest/30797755

500000/35000 fq_codel simple.qos - https://www.dslreports.com/speedtest/30797804

350000/25000 Cake layer_cake - https://www.dslreports.com/speedtest/30797634

350000/25000 fq_codel simple.qos - https://www.dslreports.com/speedtest/30797717

Some pretty strange results...

I observed that ECN was enabled on upload, so I disabled it and saw the latency spikes disappear completely from the upload side. Furthermore, my desktop was running the OS patches that mitigate Meltdown/Spectre. I reinstalled Win10 1709, and the cake results then looked a lot more proper than I had anticipated.

For those that have Comcast gigabit, I've been able to use 820000/40000 with success.

I'm running ingress with:

nat dual-dsthost docsis

And run egress with:

nat dual-srchost docsis