SQM on Netgear R7800: choppy audio/video (Teksavvy)

Okay, so in theory I would expect at full saturation an average added delay equal to the sum of the target values from cake's statistics, which would add up to 10ms. But in practice I often see more like double the target sum, so 2 * (5 + 5) = 20ms in your case, and then the observed 15 to 35ms seems in the right ballpark. So that seems not great but okay (especially since cake, when the CPU is overburdened, will keep the bandwidth up at the cost of a little added latency under load, while HTB+fq_codel as in simplest.qos will keep the latency low at the cost of reduced bandwidth under load). So you could test this by trying simplest.qos+fq_codel...
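
(For reference, switching scripts can be done from the shell; a minimal sketch, assuming the SQM queue section is named 'wan' as in the config quoted later in this thread:)

    uci set sqm.wan.script='simplest.qos'
    uci set sqm.wan.qdisc='fq_codel'
    uci commit sqm
    /etc/init.d/sqm restart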

In addition you might want to log into your router via SSH while you run your tests and look at the output of "top -d 1", which will give you a snapshot of your router's load every second. If idle hits zero or constantly hovers near zero you might be CPU cycle limited (in that case I would also expect the sirq value to be relatively high). But at 50/10 most not-too-old routers should cope, one would hope...
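
(For logging rather than watching interactively, BusyBox top also has a batch mode; flags are per BusyBox and availability may vary by build:)

    top -b -n 5 -d 1 | grep 'CPU:'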

Best Regards

Thx for the hint. Unfortunately simplest.qos+fq_codel provides worse latency: at 25/5 it is over 20 ms, while with cake it is 12..13 ms.

This router has a dual-core CPU at 1.7 GHz, and at 25/5 "top" reports >70% idle. At 45/9 it is ~50% idle and 35..50% sirq. It starts being single-core bound around 40/8 or less: I guess SQM is running on a single core?
At 35/7 latency drops to ~15..20ms again (with torrents and GRO disabled) with >60% CPU idle.
At 30/6 ping latency is < 15 ms and 70% CPU idle, 25% sirq.

I did not realize SQM would be so CPU intensive and this router has one of the most powerful CPUs...

50% idle probably means that one core is maxed out and the other completely idle (in top, hit '1' to have it show each core separately); even 60% idle is pretty tight.

It's very possible that you are running out of CPU here.

David Lang

'1' does not seem to be implemented, but yeah looks like it is getting single-core bound pretty quickly.

that's the number 1, not the letter L

David Lang

Yes, I know. Does it work for you on a LEDE router? I have found that lots of standard Linux tools have limited functionality on routers.

I may be compiling in a full version of top instead of a BusyBox version or something like that.
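
(For anyone who wants the full-featured top with the per-CPU '1' toggle, the procps-ng variant can be installed alongside BusyBox's; the package name below is from current OpenWrt and is my assumption for LEDE-era builds:)

    opkg update
    opkg install procps-ng-top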

Like I commented on your question in my own build thread, HTB as used by simple and simplest seems to perform weakly on the dual-core R7800, especially with kernel 4.4, which is used for ipq806x in LEDE 17.01.

The main reason for the weak "simple" performance with HTB+fq_codel actually seems to be HTB, not fq_codel itself. It is also possible to use "simplest_tbf", which avoids HTB by using TBF but still uses fq_codel; that simplest_tbf performed much better than simple (at least with kernel 4.4).
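
(Switching to it is the same one-line change as earlier in the thread; the script file name simplest_tbf.qos is my assumption based on how the other sqm-scripts are named:)

    uci set sqm.wan.script='simplest_tbf.qos'
    uci commit sqm
    /etc/init.d/sqm restart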

An extensive performance comparison of the codel and cake qdiscs on the R7800 can be found at https://github.com/tohojo/sqm-scripts/issues/48
A good summary is in https://github.com/tohojo/sqm-scripts/issues/48#issuecomment-270168000

The adoption of HTB burst in SQM and the move to kernel 4.9 have since helped HTB/fq_codel performance somewhat (in LEDE master).

So, irqbalance has made a huge difference: now with heavy torrenting, the ping latency is ~20 ms (vs. the 11 ms ideal) at 45/9 (the link is 50/10). Setting the SQM speeds to 47.5/9.5 makes the latency go to 50..100 ms and higher.
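
(For anyone wanting to check whether interrupts actually spread across both cores after enabling irqbalance, /proc/interrupts shows per-CPU counts without extra packages; interrupt names vary by platform:)

    grep -i eth /proc/interrupts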

@moeller0, now that I am no longer CPU bound, how can I further improve SQM on my router? Do I still need to disable GRO/LRO/GSO/TSO? My current SQM settings are below.

config queue 'wan'
	option interface 'pppoe-wan'
	option debug_logging '0'
	option verbosity '5'
	option linklayer 'ethernet'
	option qdisc_advanced '1'
	option qdisc_really_really_advanced '1'
	option iqdisc_opts 'nat dual-dsthost'
	option eqdisc_opts 'nat dual-srchost'
	option squash_dscp '1'
	option squash_ingress '1'
	option ingress_ecn 'ECN'
	option egress_ecn 'NOECN'
	option overhead '34'
	option enabled '1'
	option qdisc 'cake'
	option script 'piece_of_cake.qos'
	option download '45000'
	option upload '9000'

Great that irqbalance helped you, but be cautious with it: Dissent1 noticed some problems when it was active early in the boot. I did not achieve any magic improvement with it myself, so I am normally not running it.

You should read the whole irqbalance discussion ending at Netgear R7800 exploration (IPQ8065, QCA9984) - #72 by hnyman

Could you try to probe the thresholds for both directions independently? Say, start with 45/9 and increase the egress step-wise until you figure out between which two values the latency-under-load increase starts to rise steeply, then repeat the same for ingress. With a bit of luck you will end up with a better feel for the trade-off you are selecting between bandwidth sacrifice and increased latency under load. Please note that for ingress shaping it might be worthwhile to also test with multiple ingress streams, as the shaper is more approximate there and will show more bufferbloat with higher numbers of data flows. There is a development in cake that might make it less dependent on the number of concurrent flows (at a small cost in total throughput); watch for the "ingress" keyword to appear...
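
(A minimal sketch of such a probe from the router shell; the ping target is a placeholder, and a saturating upload must be running in parallel for the numbers to mean anything:)

    # Step the egress shaper rate and record latency under load.
    for rate in 9000 9250 9500 9750; do
        uci set sqm.wan.upload="$rate"
        uci commit sqm
        /etc/init.d/sqm restart
        echo "upload = $rate kbit/s"
        ping -c 30 8.8.8.8 | tail -n 1   # min/avg/max summary line
    done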

If bufferbloat is under control I would recommend leaving the offloads alone: a) cake AFAIK will segment giant packets to avoid too much lumpiness in dequeueing, and b) techniques like GRO and GSO help your router deal better with high-traffic situations.
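
(If you still want to inspect the current offload state, ethtool, an optional package on OpenWrt, reports it per interface; eth0 is a placeholder:)

    ethtool -k eth0 | grep -E 'segmentation|receive-offload'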

Regarding your config, I would probably add "mpu 64" to both eqdisc_opts and iqdisc_opts. I would also add option "linklayer_adaptation_mechanism 'default'", and if that does not work, option "linklayer_adaptation_mechanism 'cake'", as otherwise mpu 64 will not work at all.

But first check whether mpu is listed in the output of:
"tc qdisc add root cake help"
If it does not give usage information for mpu, refrain from adding it to the qdisc_opts...
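
(The help text typically goes to stderr, so a quick sketch for spotting the keyword is:)

    tc qdisc add root cake help 2>&1 | grep -i mpu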

I hope that helps

Best Regards

Looks like I just need to copy your configuration from an earlier post. I will try that. Thx for the advice.

Well, :wink: not really, you still should test out what bandwidth settings you are comfortable with. I believe the bufferbloat/bandwidth-sacrifice trade-off is a policy decision where every user will have a (slightly) different preference. So just play around until you are happy; I just want to help you get there....

Appreciate your help. Can you give an example of how to do the above? I do not believe you have that in your configuration above.

True, try:
option iqdisc_opts 'nat dual-dsthost mpu 64'
option eqdisc_opts 'nat dual-srchost mpu 64'

The rationale is that VDSL2 typically uses full Ethernet frames including the FCS, and hence inherits Ethernet's minimum packet size of 64 bytes.
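
(To confirm the option took effect after an SQM restart, inspect the live qdisc; whether mpu appears in the output depends on the tc/cake build:)

    tc -s qdisc show dev pppoe-wan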

Best Regards

Looks like I found the limit of my CPU: I cannot get more than 42M down no matter what download speed I configure above that value. Pings are 12..15..20 ms (up from 11 ms) while heavy torrents are running and CPU utilization is at 60..75% across two cores.

Also, pings remain at 11 ms while the dslreports speed test is running with 32 concurrent downloads.

A steep price to pay (16% of bandwidth) for improved latency, but I am hoping that newer versions will address this.

@moeller0, thanks for your help.

These are my current settings if someone else is interested:

config queue 'wan'
	option debug_logging '0'
	option verbosity '5'
	option enabled '1'
	option interface 'pppoe-wan'
	option download '45000'
	option upload '9000'
	option linklayer 'ethernet'
	option overhead '34'
	option linklayer_advanced '1'
	option tcMTU '2047'
	option tcTSIZE '128'
	option tcMPU '64'
	option linklayer_adaptation_mechanism 'default'
	option qdisc 'cake'
	option script 'layer_cake.qos'
	option qdisc_advanced '1'
	option ingress_ecn 'ECN'
	option egress_ecn 'NOECN'
	option qdisc_really_really_advanced '1'
	option iqdisc_opts 'nat dual-dsthost mpu 64'
	option eqdisc_opts 'nat dual-srchost mpu 64'
	option squash_dscp '0'
	option squash_ingress '0'

Well, a correction is in order: the high CPU usage was caused by the torrent, not SQM. With a regular 32-stream dslreports test I can get very close to 50M/10M while maintaining awesome ping latency and still having >80% CPU idle. I am happy with these results. Thx again to everyone who helped along the way.

Erm, you are running the torrent application on the router?

No, not on the router. I have a Linux PC with two (1G) NICs connected to separate (1G) ports on the router. Torrents were running over one interface and pings over the other (two LXC containers). I had over 100 torrents of different Ubuntu flavours downloading at the same time. I guess Transmission opened so many sockets/streams that it put a huge strain on the router CPU.

Oh, and the torrent traffic caused ~1,000 context switches per second on the router, while normally the value is around 300 per second.

EDIT: Actually, I just noticed a spike to 6K context switches per sec on the chart...
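
(For anyone wanting to reproduce this measurement without extra packages, a minimal sketch using /proc/stat, where ctxt is a cumulative counter:)

    a=$(awk '/^ctxt/ {print $2}' /proc/stat); sleep 1
    b=$(awk '/^ctxt/ {print $2}' /proc/stat)
    echo "$((b - a)) context switches/s"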