Okay, so in theory I would expect, at full saturation, an added delay on average equal to the sum of the target values from cake's statistics, which would add up to 10ms. In practice, though, I often see something more like double the target sum, which would be 2 * (5 + 5) = 20ms in your case, so the observed 15 to 35ms seems to be in the right ballpark. That is not great but okay (especially since cake, when the CPU is overburdened, will keep the bandwidth up at the cost of a little added latency under load, whereas HTB+fq_codel as in simplest.qos will keep the latency low at the cost of reduced bandwidth under load). So you could test this by trying simplest.qos+fq_codel...
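The back-of-the-envelope estimate above can be written as a tiny helper. This is only a sketch of the arithmetic: the function name and the "fudge" factor are hypothetical, and the factor of 2 is the rule of thumb from the post, not a property of cake.

```python
def expected_added_delay_ms(egress_target_ms: float, ingress_target_ms: float,
                            fudge: float = 1.0) -> float:
    """Rough estimate of the extra latency under full saturation.

    Assumption (rule of thumb from the discussion): in theory the added delay
    is about the sum of cake's per-direction target values; in practice it is
    often observed to be around twice that, hence the optional fudge factor.
    """
    return fudge * (egress_target_ms + ingress_target_ms)


if __name__ == "__main__":
    theory = expected_added_delay_ms(5, 5)                # sum of the two 5ms targets
    practice = expected_added_delay_ms(5, 5, fudge=2.0)   # observed ~2x rule of thumb
    print(f"theory: {theory} ms, typical in practice: {practice} ms")
```

Measured latency increases well beyond the "practice" figure are a hint that something else (for example CPU starvation) is going on.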
In addition, you might want to log into your router via SSH while you run your tests and look at the output of "top -d 1", which will give you a snapshot of your router's load every second. If idle hits zero or constantly hovers near zero, you might be CPU cycle limited (in that case I would also expect the sirq value to be relatively high). But at 50/10 most not-too-old routers should cope, one would hope...
Thx for the hint. Unfortunately simplest.qos+fq_codel provides worse latency: at 25/5 it is over 20ms while with cake it is 12..13 ms.
This router has a dual-core CPU at 1.7GHz, and at 25/5 "top" reports >70% idle. At 45/9 it is ~50% idle and 35..50% sirq. It starts being single-core bound at around 40/8 or less: I guess SQM is running on a single core?
At 35/7 latency drops to ~15..20ms again (with torrents and GRO disabled) with >60% CPU idle.
At 30/6 ping latency is < 15 ms and 70% CPU idle, 25% sirq.
I did not realize SQM would be so CPU intensive, and this router has one of the most powerful CPUs...
50% idle probably means that one core is maxed out and the other completely idle (in top, hit '1' to have it show each core separately); even 60% idle is pretty tight.
It's very possible that you are running out of CPU here.
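If you want per-core numbers in a form you can log alongside your speed tests, a minimal sketch is to read the per-core jiffy counters from /proc/stat twice and compute idle and softirq (sirq) percentages, similar to what top shows after pressing '1'. This assumes a Linux kernel with the usual /proc/stat layout (user nice system idle iowait irq softirq steal ...) and a Python 3 interpreter on the router or test box; the function names and the 10% "possibly CPU bound" threshold are hypothetical.

```python
import os
import time


def parse_per_core(stat_text):
    """Parse /proc/stat text into {core: (idle, softirq, total)} jiffy counters."""
    cores = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-core lines are "cpuN user nice system idle iowait irq softirq steal ...";
        # the aggregate "cpu" line and non-CPU lines are skipped.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:9]]  # user..steal; guest time is already in user
        idle = values[3] + values[4]            # idle + iowait
        softirq = values[6]
        cores[fields[0]] = (idle, softirq, sum(values))
    return cores


def usage_percent(before, after):
    """Return {core: (idle_percent, sirq_percent)} between two snapshots."""
    result = {}
    for core, (idle0, sirq0, total0) in before.items():
        idle1, sirq1, total1 = after[core]
        dt = total1 - total0
        if dt <= 0:
            continue  # no time elapsed for this core; skip it
        result[core] = (100.0 * (idle1 - idle0) / dt, 100.0 * (sirq1 - sirq0) / dt)
    return result


def sample(interval=1.0):
    """Take two /proc/stat snapshots `interval` seconds apart."""
    with open("/proc/stat") as f:
        before = parse_per_core(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_per_core(f.read())
    return usage_percent(before, after)


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    for core, (idle, sirq) in sorted(sample().items()):
        # Hypothetical threshold: a core with <10% idle may be limiting the shaper.
        flag = "  <-- possibly CPU bound" if idle < 10 else ""
        print(f"{core}: idle {idle:5.1f}%  sirq {sirq:5.1f}%{flag}")
```

The point of looking per core rather than at the total is exactly the case discussed above: ~50% total idle on a dual-core box can mean one core at 0% idle doing all the softirq work.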
Like I commented on your question in my own build thread, HTB, which is used by simple and simplest, seems to perform weakly on the dual-core R7800, especially with kernel 4.4, which is used for ipq806x in LEDE 17.01.
The main reason for the weak "simple" performance with HTB+fq_codel actually seems to be HTB, not fq_codel itself. It is also possible to use "simplest_tbf", which avoids HTB by using TBF but still normally uses fq_codel. simplest_tbf performed much better than simple (at least with kernel 4.4).
So, irqbalance has made a huge difference: now with heavy torrenting, the ping latency is ~20ms (vs 11ms ideal) at 45/9 (the link is 50/10). Setting SQM speed to 47.5/9.5 makes the latency go to 50..100ms and higher.
@moeller0, now that I am no longer CPU bound, how can I further improve SQM on my router? Do I still need to disable GRO/LRO/GSO/TSO? My current SQM settings are below.
Great that irqbalance helped you. But be cautious with it: Dissent1 noticed some problems when it was active very early during boot. I did not get any magic improvement from it myself, so I normally do not run it.
Could you try to probe the thresholds for both directions independently? Say, start with 45/9 and increase the egress rate step-wise until you figure out between which two values the latency increase under load starts to rise steeply, then repeat the same for ingress. With a bit of luck you will end up with a better feel for the trade-off you are selecting between bandwidth sacrifice and latency-under-load increase. Please note that for ingress shaping it might be worthwhile to also test with multiple ingress streams, as the shaper is more approximate there and will show more bufferbloat with higher numbers of data flows. There is a development in cake that might make cake less dependent on the number of concurrent flows (at a small cost in total throughput); watch for the "ingress" keyword to appear...
If bufferbloat is under control, I would recommend leaving the offloads alone: a) cake, AFAIK, will segment giant packets to avoid too much lumpiness in dequeueing, and b) techniques like GRO and GSO help your router deal better with high-traffic situations.
Regarding your config, I would probably add "mpu 64" to both eqdisc_opts and iqdisc_opts. I would also add option "linklayer_adaptation_mechanism 'default'", and if that does not work, option "linklayer_adaptation_mechanism 'cake'", as otherwise mpu 64 will not work at all.
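To make the step-wise probing a bit more systematic, you could record, for each shaper setting you try, the latency increase under load you measured, and then look for the step with the steepest rise. The sketch below does just that; the function name and the "largest latency increase per Mbps of extra rate" criterion are assumptions of mine, and the measurements themselves have to come from your own tests (one direction at a time, with the other direction held fixed).

```python
def find_knee(measurements):
    """Return (rate_low, rate_high) between which latency under load rises most steeply.

    measurements: iterable of (shaper_rate_mbps, latency_increase_ms) tuples,
    e.g. collected by stepping the egress (or ingress) rate upwards while the
    other direction stays fixed. Duplicate rates are ignored.
    """
    points = sorted(measurements)
    if len(points) < 2:
        raise ValueError("need at least two measurements")
    best = None
    best_slope = float("-inf")
    for (r0, l0), (r1, l1) in zip(points, points[1:]):
        if r1 == r0:
            continue  # same rate measured twice; no slope to compute
        slope = (l1 - l0) / (r1 - r0)  # ms of extra latency per Mbps of extra rate
        if slope > best_slope:
            best_slope, best = slope, (r0, r1)
    if best is None:
        raise ValueError("need at least two distinct rates")
    return best
```

Setting the shaper at (or a bit below) the lower rate of the returned pair is one way to pick a point on the trade-off curve; with noisy measurements it helps to repeat each step a few times and use the median.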
But first check whether mpu is listed in the output of:
"tc qdisc add root cake help"
If it does not give usage information for mpu, refrain from adding it to the qdisc_opts...
Well, not really; you still should test which bandwidth settings you are comfortable with. I believe the bufferbloat/bandwidth-sacrifice trade-off is a policy decision where every user will have a (slightly) different preference. So just play around until you are happy; I just want to help you get there...
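If you would rather script that check, here is a minimal sketch that runs the same tc command and looks for an "mpu" keyword in its usage text. Assumptions: Python 3.7+ is available, tc is on the PATH, and (as is usual for tc's qdisc help) the usage text may go to stderr with a non-zero exit status, so both streams are captured and the exit code is not treated as failure. The function names are hypothetical.

```python
import subprocess


def cake_supports_mpu(help_text):
    """True if the cake usage text mentions the 'mpu' keyword as a separate token."""
    return any(token.strip("[]|,") == "mpu" for token in help_text.split())


def cake_help_text():
    """Run the help command from the post and return whatever it printed."""
    # No check=True: tc typically exits non-zero after printing usage.
    proc = subprocess.run(["tc", "qdisc", "add", "root", "cake", "help"],
                          capture_output=True, text=True)
    return proc.stdout + proc.stderr


if __name__ == "__main__":
    try:
        text = cake_help_text()
    except FileNotFoundError:
        print("tc not found on this system")
    else:
        print("mpu supported" if cake_supports_mpu(text) else "no mpu in cake usage")
```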
Looks like I found the limit of my CPU: I cannot get more than 42M down, no matter what download speed above that value I configure. The pings are 12..15..20ms (up from 11ms) while heavy torrents are running, with CPU utilization at 60..75% across both cores.
Also pings remain at 11 ms while dslreport speed test is running with 32 concurrent downloads.
A steep price to pay (16% of bandwidth) for improved latency, but I am hoping that newer versions will address this.
Well, a correction is in order: the high CPU usage was caused by the torrents, not SQM. With a regular 32-stream dslreport test I can get very close to 50M/10M while maintaining awesome ping latency and still having >80% idle CPU. I am happy with these results. Thx again to everyone who helped along the way.
No, not on the router. I have a Linux PC with two (1G) NICs connected to separate (1G) ports on the router. Torrents were running over one interface and pings over the other (two LXC containers). I had over 100 torrents of different Ubuntu flavours downloading at the same time. I guess Transmission opened so many sockets/streams that it put a huge strain on the router's CPU.
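For reference, the 16% figure is simply the shortfall of the achieved rate relative to the nominal link rate, i.e. 1 - 42/50. A trivial sketch (hypothetical helper name) that can also be used to express shaper settings such as 45/50 or 47.5/50 as a fraction of the link:

```python
def bandwidth_sacrifice(achieved_mbps: float, link_mbps: float) -> float:
    """Fraction of the nominal link rate that is not being used."""
    if link_mbps <= 0:
        raise ValueError("link rate must be positive")
    return 1.0 - achieved_mbps / link_mbps


if __name__ == "__main__":
    print(f"{bandwidth_sacrifice(42, 50):.0%}")   # 42M achieved on a 50M link
    print(f"{bandwidth_sacrifice(45, 50):.0%}")   # shaper set to 45M on a 50M link
```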