Need help load balancing SQM on NanoPi R5C

I have a NanoPi R5C running the latest OpenWrt snapshot. The device has a quad-core Cortex-A55 processor, and I'm trying to balance the load across cores when running SQM. My SQM settings: interface eth1, 375000/375000 (kbit/s), cake, piece_of_cake.qos.
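
For reference, the equivalent /etc/config/sqm stanza looks roughly like this (my sketch of the sqm-scripts UCI format; the section name 'eth1' is arbitrary):

config queue 'eth1'
        option interface 'eth1'
        option download '375000'
        option upload '375000'
        option qdisc 'cake'
        option script 'piece_of_cake.qos'
        option enabled '1'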

Here are the default settings for this device:

root@OpenWrt:~# grep eth /proc/interrupts
 73:          0    4191499          0          0       MSI 134742016 Edge      eth0
 74:          0          0    4221050          0       MSI 268959744 Edge      eth1

root@OpenWrt:~# cat /proc/irq/73/smp_affinity
2
root@OpenWrt:~# cat /proc/irq/74/smp_affinity
4
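
(Aside: smp_affinity and rps_cpus are hexadecimal CPU bitmasks where bit N selects CPU N, so 2 = CPU1 and 4 = CPU2. A quick decode loop for the BusyBox shell; the mask here is just an example value:)

root@OpenWrt:~# mask=0x9
root@OpenWrt:~# for cpu in 0 1 2 3; do
>   [ $(( (mask >> cpu) & 1 )) -eq 1 ] && echo "CPU$cpu selected"
> done
CPU0 selected
CPU3 selected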

root@OpenWrt:~# cat /sys/class/net/eth0/queues/rx-0/rps_cpus
0
root@OpenWrt:~# cat /sys/class/net/eth1/queues/rx-0/rps_cpus
0
  1. Default settings test

I use the Waveform bufferbloat test while monitoring the CPU usage with htop.

CPU 0: 0%
CPU 1: 65%
CPU 2: 100%
CPU 3: 0%

There is no load at all on CPU 0 and CPU 3. I am not sure what a value of 0 means for /sys/class/net/eth1/queues/rx-0/rps_cpus. Which CPU core does it use?

  2. Pin each queue to a different core

I applied the following change, moving eth1's RPS to CPU0 (mask 1) and eth0's to CPU3 (mask 8), the two cores that were idle with the defaults:

root@OpenWrt:~# echo 1 > /sys/class/net/eth1/queues/rx-0/rps_cpus
root@OpenWrt:~# echo 8 > /sys/class/net/eth0/queues/rx-0/rps_cpus

CPU 0: 90%
CPU 1: 23%
CPU 2: 30%
CPU 3: 50%

All cores now have load, but during the download test the speed fluctuates quite a bit and sometimes dips below 200 Mbps. This did not happen with the default settings.

  3. Pin queues to core 0 and core 3
root@OpenWrt:~# echo 9 > /sys/class/net/eth0/queues/rx-0/rps_cpus
root@OpenWrt:~# echo 9 > /sys/class/net/eth1/queues/rx-0/rps_cpus

CPU 0: 85%
CPU 1: 23%
CPU 2: 30%
CPU 3: 50%

All cores have load and the download speeds are more stable.

  4. Pin queues to all cores (mask F = 15)
root@OpenWrt:~# echo F > /sys/class/net/eth0/queues/rx-0/rps_cpus
root@OpenWrt:~# echo F > /sys/class/net/eth1/queues/rx-0/rps_cpus

CPU 0: 40%
CPU 1: 75%
CPU 2: 75%
CPU 3: 40%

All cores have load and the download speeds are more stable.

Based on my tests I have some questions:

  1. What does the value of 0 mean for /queues/rx-0/rps_cpus?
  2. Even though the load was more balanced in the last configuration, the latency under load was similar to the stock setup where only two cores were used. Why is that, if the results end up the same? Does the core clock go down when more cores are active?
  3. I thought SQM was single-threaded, so I was surprised to see the utilization spread more evenly with the different settings (see the tc check after this list).
  4. What would be the best config to use here? I am not planning to run many services on the router, so routing and SQM will be the biggest consumers.
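
(Regarding question 3: one way to see that there are two shaper instances is to list the qdiscs. sqm-scripts shapes egress directly on eth1 and ingress on an IFB device, which I believe it names ifb4eth1 — that name is an assumption on my part. Each should show a cake qdisc:)

root@OpenWrt:~# tc -s qdisc show dev eth1
root@OpenWrt:~# tc -s qdisc show dev ifb4eth1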

Thanks!

I found a SUSE KB article about Receive Packet Steering (RPS):

By default, all bits will be 0 disabling RPS, and therefore the CPU which
handles the interrupt will also process the packet.

The article recommends this to improve performance:

On non-NUMA machines, all CPUs can be used, and excluding the
CPU handling the network interface can boost performance if the
interrupt rate is very high.

I could not find out whether the Rockchip SoC has a NUMA architecture, but I doubt it. Based on this KB, I think the best performance would come from excluding the cores that handle the network interrupts (CPU1 for eth0 and CPU2 for eth1).
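
(Worked out in the shell: start from the full mask 0xF and clear the two IRQ bits, which yields the mask to set below:)

root@OpenWrt:~# printf '%x\n' $(( 0xF & ~0x2 & ~0x4 ))
9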

root@OpenWrt:~# echo 9 > /sys/class/net/eth0/queues/rx-0/rps_cpus
root@OpenWrt:~# echo 9 > /sys/class/net/eth1/queues/rx-0/rps_cpus
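
These sysfs values reset on reboot, so to keep them I would append the same two lines to /etc/rc.local, which OpenWrt runs at the end of boot (a sketch; paths exactly as above):

# /etc/rc.local -- re-apply the RPS masks at boot
echo 9 > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 9 > /sys/class/net/eth1/queues/rx-0/rps_cpus
exit 0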

This is similar to what the NanoPi R4S wiki entry suggests; in that case the CPU cores are heterogeneous, but I think the reasoning overlaps.

Typically you use one sqm instance each for ingress and egress traffic, and these two instances can run on separate CPUs. And yes, each instance currently is single threaded...

Interrupt load for networking tends not to be too big an issue for decent NICs and lowish rates like 1 Gbps, so keeping interrupt processing isolated/separated from other network processing is not always a win, so you need to try this out...
BTW (aimed at casual readers, not the OP): for 4 CPUs, to allow processing on all of them you IMHO need to set the value to
1 (CPU0) + 2 (CPU1) + 4 (CPU2) + 8 (CPU3) = 15;
a value of 9 means 1 + 8, i.e. CPU0 and CPU3.

Thanks for your thoughts!

Interrupt load for networking tends not to be too big an issue for decent NICs and lowish rates like 1 Gbps, so keeping interrupt processing isolated/separated from other network processing is not always a win, so you need to try this out...

I’m curious why that would be the case. Even if the load is low, isn't it better to give SQM its own separate cores?

Does having the interrupts on the same core as SQM save the CPU from some sort of context switching or data transfer (I'm not sure if the cache is shared)? That might explain why I got less consistent performance when pinning each queue to a separate core, even compared to the base case where RPS is disabled.

If you have cores to spare, maybe, maybe not. This is one of the things that needs to be optimized empirically... as I carefully argued, it is "not always a win". On a dual-core router that method results in both the ingress and egress cake instances sharing a single CPU (if the WAN and LAN interfaces are processed by CPU0, as is the case for mvebu), which is rarely a good thing...

Yes, if you have enough CPUs that might be true...

No real data, if I would need to speculate I would guess that this might be helpful for cache memory, but as I said that is speculation...

As I said, whether or not to avoid the interrupt processing CPUs is hard to predict and should be tuned individually for each set-up...


Thank you! Appreciate your patience and responses to this thread and the other as well 🙂

Not sure that is always the case. The same core can process both hard and soft interrupts if there is room on the CPU. What I think you will do in this case is kill “cache locality” (I believe that is what it is called): you either have to send the data to the other core or have it reload the data again. In short, there is a benefit in reusing the same cores, and in only spreading to other ones when the load is heavy.
