As I commented on your question in my own build thread, HTB, which is used by simple and simplest, seems to perform poorly on the dual-core R7800, especially with kernel 4.4, which is used for ipq806x in LEDE 17.01.
The main reason for the weak "simple" performance with HTB+fq_codel actually seems to be HTB, not fq_codel itself. It is also possible to use "simplest_tbf", which avoids HTB by using TBF but still normally uses fq_codel. simplest_tbf performed much better than simple (at least with kernel 4.4).
So, irqbalance has made a huge difference: now with heavy torrenting, the ping latency is ~20ms (vs 11ms ideal) at 45/9 (the link is 50/10). Setting SQM speed to 47.5/9.5 makes the latency go to 50..100ms and higher.
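For reference, the 45/9 and 47.5/9.5 settings are 90% and 95% of the 50/10 link. Here is a minimal Python sketch of that arithmetic, assuming the shaper rates are entered in kbit/s; the function name `shaper_rates_kbit` is made up for illustration and is not part of SQM.

```python
def shaper_rates_kbit(link_down_mbit, link_up_mbit, fraction):
    """Return (download, upload) shaper rates in kbit/s.

    fraction is the share of the nominal link rate handed to the shaper,
    e.g. 0.90 for 45/9 on a 50/10 link. Hypothetical helper, not an SQM API.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    down = round(link_down_mbit * 1000 * fraction)
    up = round(link_up_mbit * 1000 * fraction)
    return down, up


# The two settings mentioned above, for a 50/10 Mbit/s link.
print(shaper_rates_kbit(50, 10, 0.90))  # (45000, 9000)
print(shaper_rates_kbit(50, 10, 0.95))  # (47500, 9500)
```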
@moeller0, now that I am no longer CPU bound, how can I further improve SQM on my router? Do I still need to disable gro/lro/gso/tso? My current SQM settings are below.
Great that irqbalance helped you. But be cautious with it. Dissent1 noticed some problems when it was active very early during boot. I did not achieve any magic improvement with it myself, so I am normally not running it.
Could you try to probe the thresholds for both directions independently? Say, start with 45/9 and increase the egress rate step-wise until you figure out between which two values the latency increase under load starts to rise steeply, then repeat the same for ingress. With a bit of luck you will end up with a better feel for the trade-off you are selecting between bandwidth sacrifice and latency-under-load increase. Please note that for ingress shaping it might be worthwhile to also test with multiple ingress streams, as the ingress shaper is more approximate and will show more bufferbloat with higher numbers of data flows. There is a development in cake that might make it less dependent on the number of concurrent flows (at a small cost in total throughput); watch for the "ingress" keyword to appear...
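The step-wise probing described above can be sketched roughly as follows, in Python. This is only an illustration under assumptions: you note the loaded ping for each shaper rate by hand, and you pick your own tolerance for "rises steeply". The function name `find_knee` and the sample numbers are hypothetical, not real measurements.

```python
def find_knee(samples, baseline_ms, max_increase_ms):
    """Find the two adjacent rates between which latency under load takes off.

    samples: iterable of (rate_mbit, loaded_latency_ms) pairs, one per test run.
    baseline_ms: idle ping latency.
    max_increase_ms: largest latency increase over baseline you still accept.

    Returns (last_good_rate, first_bad_rate). last_good_rate is None when even
    the lowest tested rate is already too slow; returns None when no tested
    rate exceeds the tolerance.
    """
    last_good = None
    for rate, latency in sorted(samples):
        if latency - baseline_ms > max_increase_ms:
            return (last_good, rate)
        last_good = rate
    return None


# Hypothetical egress runs: rate in Mbit/s, ping under load in ms.
egress_runs = [(9.0, 14.0), (9.25, 15.0), (9.5, 60.0), (9.75, 120.0)]
print(find_knee(egress_runs, baseline_ms=11.0, max_increase_ms=10.0))
```

Run it once with the egress results and once with the ingress results; the threshold for each direction lies between the two returned rates.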
If bufferbloat is under control, I would recommend leaving the offloads alone: a) cake AFAIK will segment giant packets to avoid too much lumpiness in dequeueing, and b) techniques like GRO and GSO help your router better deal with high traffic situations.
Regarding your config, I would probably add "mpu 64" to both eqdisc_opts and iqdisc_opts. I would also add option linklayer_adaptation_mechanism 'default' and, if that does not work, option linklayer_adaptation_mechanism 'cake', as otherwise mpu 64 will not work at all.
But first check whether mpu is listed in the output of:
"tc qdisc add root cake help"
If it does not give usage information for mpu, refrain from adding it to the qdisc_opts...
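If you would rather not eyeball the help text, the check can be scripted. This is a rough Python sketch, assuming Python is available wherever you run it (often not the case on the router itself, in which case you can paste the help text into a string instead). The names `help_mentions` and `cake_help_text` are made up here, and the sample text in the usage line is a placeholder, not real tc output.

```python
import re
import subprocess


def help_mentions(help_text, keyword):
    """Return True if keyword appears as a whole word in the help text."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, help_text) is not None


def cake_help_text():
    """Run the command from the post and return everything it printed.

    stdout and stderr are concatenated because usage text may go to either.
    """
    result = subprocess.run(
        ["tc", "qdisc", "add", "root", "cake", "help"],
        capture_output=True,
        text=True,
    )
    return result.stdout + result.stderr


# Placeholder text for illustration only; use cake_help_text() on a real system.
sample = "Usage: ... cake [ bandwidth RATE ] [ mpu N ] ..."
print(help_mentions(sample, "mpu"))
```

Only add "mpu 64" to the qdisc_opts when the check comes back true for the real help output.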
Well, not really, you should still test out what bandwidth settings you are comfortable with. I believe the bufferbloat/bandwidth sacrifice trade-off is a policy decision where every user will have a (slightly) different preference. So just play around until you are happy. I just want to help with getting there....
Looks like I found the limit of my CPU: I cannot get more than 42M down no matter what download speed I configure above that value. Pings are 12..15..20 ms (up from 11 ms) while heavy torrents are running, with CPU utilization at 60..75% across the two cores.
Also, pings remain at 11 ms while the dslreports speed test is running with 32 concurrent downloads.
A steep price to pay (16% bandwidth) for improved latency, but I am hoping that newer versions will address this.
Well, a correction is in order: the high CPU usage was caused by the torrents, not SQM. With the regular 32-stream dslreports test I can get very close to 50M/10M while maintaining awesome ping latency and still having >80% idle CPU. I am happy with these results. Thx again to everyone who helped along the way.
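The 16% figure is simply the 8M shortfall relative to the 50M link. A one-function Python sketch of that arithmetic, with a helper name (`bandwidth_sacrifice_pct`) invented for illustration:

```python
def bandwidth_sacrifice_pct(link_mbit, achieved_mbit):
    """Percentage of the nominal link rate given up at the achieved rate."""
    return 100 * (link_mbit - achieved_mbit) / link_mbit


# 42M achieved on the 50M downlink.
print(bandwidth_sacrifice_pct(50, 42))  # 16.0
```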
No, not on the router. I have a Linux PC with two (1G) NICs connected to separate (1G) ports on the router. Torrents were running over one interface and pings were running over the other (two LXC containers). I had over 100 torrents of different Ubuntu flavours downloading at the same time. I guess Transmission opened so many sockets/streams that it put a huge strain on the router CPU.