Cake-mq - backport of multi-core capable CAKE implementation to 25.12 branch

I keep reading that SQM is not compatible with "SW offloading". Has that changed?

It depends, see Build for Netgear R7800 - #4015 by moeller0 for an explanation.

Not necessarily - this is very much a home router (it's a 2023 design based on a Mediatek MT7981 Filogic 820):

. /usr/share/libubox/jshn.sh && json_load "$(ubus call system board)" && json_get_var model model && echo "Board model: $model" ; cd /sys/class/net && for i in */queues ; do echo -n "${i%/queues} transmit queues: " ; ls -d ${i}/tx* | wc -l ; done | grep -v ' 1$'

gives the following output:

Board model: D-Link AQUILA PRO AI M30 A1
eth0 transmit queues: 16
internet transmit queues: 16

If eth0 is your WAN port, then you're lucky! For many devices it's just a CPU port connected to an internal switch.

?

On this particular model there are two Ethernet devices: eth0 (2.5G) is the link to a 5-port switch chip, and internet is a dedicated WAN port (1000BASE-T). Both have 16 tx queues.
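
For readability, the queue-counting one-liner from earlier in the thread can be sketched as a small shell function. This is only an illustration: the function name count_tx_queues and its optional directory argument are my own additions (the argument exists so the function can be pointed at a test tree), and it assumes the usual /sys/class/net/<dev>/queues/tx-<n> sysfs layout.

```shell
#!/bin/sh
# Hypothetical helper: print every network device that exposes more than
# one transmit queue, in the same format as the one-liner in this thread.
# $1 is an optional sysfs root (defaults to /sys/class/net).
count_tx_queues() {
    net_root="${1:-/sys/class/net}"
    for dev in "$net_root"/*; do
        # Skip entries that have no queues directory.
        [ -d "$dev/queues" ] || continue
        n=0
        for q in "$dev"/queues/tx-*; do
            # The glob stays literal when nothing matches, so test existence.
            if [ -e "$q" ]; then
                n=$((n + 1))
            fi
        done
        # Single-queue devices are filtered out, like grep -v ' 1$' did.
        if [ "$n" -gt 1 ]; then
            echo "${dev##*/} transmit queues: $n"
        fi
    done
    return 0
}
```

Calling count_tx_queues without arguments on the router should list the same devices as the one-liner, minus the board model line.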

OK, I finally had time to update to a snapshot:

root@linksys-mx4200v1-1:~# ls -d /sys/class/net/wan/queues/tx-* | wc -l
4
root@linksys-mx4200v1-1:~# ls -d /sys/class/net/lan1/queues/tx-* | wc -l
4
root@linksys-mx4200v1-1:~# echo 500000 > /sys/kernel/debug/cake_mq/sync_time_ns

root@linksys-mx4200v1-1:~# cat /etc/config/sqm

config queue 'eth1'
        option enabled '1'
        option interface 'lan3'
        option download '92160'
        option upload '92160'
        option qdisc 'cake'
        option script 'piece_of_cake.qos'
        option linklayer 'ethernet'
        option use_mq '1'
        option debug_logging '0'
        option verbosity '5'
        option overhead '42'

During a speedtest, htop shows 100% utilization on cores 0 and 3, but 0% on cores 1 and 2. I'm also getting 50/60 on my 100/100 connection, with latency +56ms/+39ms. Something is definitely wrong, given that single-core cake uses 15% CPU and gets 90/90 with <5ms/<5ms.
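
To cross-check an uneven core load like the one htop shows, one option is to look at how the network softirq counters are spread across CPUs. The sketch below parses /proc/softirqs and prints the NET_TX and NET_RX counters per CPU. The function name softirq_per_cpu and its optional file argument are my own, and this only shows where softirq work lands; it does not by itself explain why cake_mq behaves this way.

```shell
#!/bin/sh
# Hypothetical helper: print per-CPU NET_TX and NET_RX softirq counters.
# $1 is an optional input file (defaults to /proc/softirqs).
softirq_per_cpu() {
    file="${1:-/proc/softirqs}"
    awk '
        # First line is the header: CPU0 CPU1 ...
        NR == 1 {
            for (i = 1; i <= NF; i++) cpu[i] = $i
            ncpu = NF
            next
        }
        # Rows look like "NET_TX:  123  0  0  456".
        $1 == "NET_TX:" || $1 == "NET_RX:" {
            name = $1
            sub(/:$/, "", name)
            for (i = 1; i <= ncpu; i++)
                printf "%s %s %s\n", cpu[i], name, $(i + 1)
        }
    ' "$file"
}
```

The counters are cumulative since boot, so it is more telling to run softirq_per_cpu once before and once after a speedtest and compare the two outputs.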

I think one thing has become clear: cake_mq really depends on proper multiqueue support in the NIC. Unlike in the enterprise NIC market, that has not been an important factor for low-cost router SoCs, so even when these support multi-queue, they often lack the performance characteristics needed to make cake_mq really fly.

Is there any reason why cake_mq depends on the number of TX queues in the NIC, rather than just the number of CPU cores?

I believe yes: cake_mq expects the NIC to be able to steer packets into its different queues (each attached to and serviced by a different CPU). I'm not sure this is the only possible design, but it seems to have evolved naturally from multiqueue NICs... I believe @tohojo and @jkoeppeler will be able to explain this much better than my quite vague beliefs allow.

In general, the question is how the CPU hands network packets to the hardware. This is done through TX rings (ring buffers) in memory (RAM): the CPU enqueues a packet, and the hardware pulls that packet into its internal engine and eventually sends it. While doing this you might also touch some other data structures, gather statistics, etc. To ensure correct behavior on multicore systems you need to synchronize these accesses, which is typically done with a lock/mutex. Taking a lock can be an expensive operation, especially if many cores try to take it at the same time (because the CPU needs to ensure that all cores have the same view of the current value [locked or not locked]). To scale efficiently, it's best to avoid these locks by having additional TX rings, so that every CPU can access its own ring and does not need to synchronize for access. The qdisc model allows you, through the MQ qdisc, to attach a qdisc to each queue. Each qdisc is again guarded by a root-qdisc lock. (Contention on this lock is what we try to avoid with cake_mq, while still enforcing the rate limit by synchronizing the cores/qdisc instances.)
Alternatively, you could rewrite cake so that it does not rely on the root-qdisc lock, and instead make everything inside it thread-safe, meaning it functions correctly while many cores are handling the cake data structures. But if you cannot access the hardware concurrently (because you only have one ring), that still hinders your scalability.
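
One way to look at how CPUs are mapped to transmit queues on a given device is the kernel's XPS (Transmit Packet Steering) setting, which is exposed per queue as a hexadecimal CPU bitmask in the xps_cpus sysfs file. The sketch below just prints that mask for every tx queue. The function name show_xps_map is my own, the file only exists on kernels built with XPS support, and the thread does not say whether cake_mq depends on this mapping, so treat it as a diagnostic aid only.

```shell
#!/bin/sh
# Hypothetical helper: print the XPS CPU bitmask of each transmit queue.
# $1 is the device directory, e.g. /sys/class/net/eth0.
show_xps_map() {
    dev_dir="$1"
    for q in "$dev_dir"/queues/tx-*; do
        # Skip queues without a readable xps_cpus file (no XPS support,
        # or the glob did not match anything).
        [ -r "$q/xps_cpus" ] || continue
        mask=$(cat "$q/xps_cpus" 2>/dev/null) || continue
        printf '%s %s\n' "${q##*/}" "$mask"
    done
    return 0
}
```

For example, show_xps_map /sys/class/net/eth0 prints one line per queue; a mask of 1 means CPU 0, 2 means CPU 1, 4 means CPU 2, and so on, while a mask of 0 means no XPS mapping is configured for that queue.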

I finally got some routers that are multi-queue capable, but the performance implications are often not that clear (not yet :wink:). I believe that in some cases, even if you see multiple queues per port, let's say 4 queues each for eth0 and eth1, this does not necessarily mean the hardware has 8 tx rings; it could still have 4 that are shared across ports. I am unsure at this point what the performance implications are. So in that sense, we are still working on it, but it may take some time.