Packet steering should permit multiple cores to be used for devices such as the RPi 4

Hi OpenWrt developers, I found that with the following configuration I was able to achieve consistently higher throughput on my Raspberry Pi 4 when it is running as a router, without any noticeable change to latency (SQM remained running) and without the load being spread in an odd fashion across all of its cores.

  1. Enable irqbalance:
irqbalance.irqbalance=irqbalance
irqbalance.irqbalance.enabled='1'
  2. Enable packet steering and then modify /usr/libexec/network/packet-steering.uc like so:
        rx_queue ??= "rx-*";
        let queues = glob(`/sys/class/net/${dev}/queues/${rx_queue}/rps_cpus`);
        let val = cpu_mask(cpu);
+       if (dev == "eth1")
+               val = 5;

(Ideally I would assign the value of 5 based upon the driver in use instead of matching directly against dev == "eth1", but this was the easiest solution.)
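
For reference, here is a minimal shell sketch of how the driver behind an interface could be looked up (eth1 is just my interface name; the same sysfs path could be read from inside the .uc script):

# resolve the kernel driver bound to an interface via its sysfs symlink
driver="$(basename "$(readlink /sys/class/net/eth1/device/driver)")"
echo "$driver"    # prints e.g. r8152 for my USB WAN adapter

The override in the script could then be keyed on that driver name instead of the interface name.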

What this does is enable cores 0 & 2 (mask 0x5 = binary 101) for eth1, which in my case is the WAN and an r8152 device. I also happen to have another USB Ethernet-style device for a backup link (cdc_ether).
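
For anyone who wants to check or reproduce this by hand before patching the script, the mask can be read and written directly in sysfs (eth1 and rx-0 are from my setup; the packet-steering script will rewrite the value the next time it runs):

cat /sys/class/net/eth1/queues/rx-0/rps_cpus        # current RPS CPU mask for this queue
echo 5 > /sys/class/net/eth1/queues/rx-0/rps_cpus   # 0x5 = CPU0 + CPU2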

For some reason, despite not manually tweaking anything beyond enabling irqbalance, packet steering improves performance on my RPi 4 router ... I think soft and hard IRQs still sometimes tend towards CPU2. Specifying that all cores can handle packets via RPS doesn't seem to help: although the load is distributed across more cores, throughput drops.

           CPU0       CPU1       CPU2       CPU3       
 11:     397818      55838     225440      60740     GICv2  30 Level     arch_timer
 14:      18034          0          0          0     GICv2  65 Level     fe00b880.mailbox
 15:         30          0          0          0     GICv2 114 Level     DMA IRQ
 26:          0          0          0          0     GICv2 175 Level     PCIe PME, aerdrv
 27:       1956     627533          0          0     GICv2 189 Level     eth0
 28:       3725          0          0     808164     GICv2 190 Level     eth0
 29:     593900          0          0          0  BRCM STB PCIe MSI 524288 Edge      xhci_hcd
 30:      12923          0          0          0     GICv2 158 Level     mmc1, mmc0
 31:          1          0          0          0     GICv2  66 Level     VCHIQ doorbell
 32:         10          0          0          0     GICv2 153 Level     uart-pl011
IPI0:      2929       3109       3001       3439       Rescheduling interrupts
IPI1:    220321      31515     140398      42191       Function call interrupts
IPI2:         0          0          0          0       CPU stop interrupts
IPI3:         0          0          0          0       CPU stop (for crash dump) interrupts
IPI4:         0          0          0          0       Timer broadcast interrupts
IPI5:     28467      16594      14669      19909       IRQ work interrupts
IPI6:         0          0          0          0       CPU wake-up interrupts
Err:          0

I'll be honest and say that I am not entirely sure whether my setup is "correct", but the idea I had was that:

  • CPU0 is for eth1 and eth2 (both via USB); the related xhci_hcd IRQ's CPU affinity is f, but it seems almost entirely pinned to CPU0
  • CPU1 handles eth0 IRQ 27 (GICv2 189 Level)
  • CPU3 handles eth0 IRQ 28 (GICv2 190 Level)
  • CPU2 handles arch_timer (more so than other cores) + misc other things

So it made sense to me to add CPU2 to eth1's rps_cpus, spreading the rx load for USB-received IRQs (which land on CPU0) across CPU0 & CPU2.
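
For completeness, if you wanted to pin these explicitly rather than leave it to irqbalance, the affinities can be set through /proc; the IRQ numbers below are the ones from my /proc/interrupts output above and will differ on other systems:

echo 2 > /proc/irq/27/smp_affinity    # eth0 IRQ 27 -> CPU1 (mask 0x2)
echo 8 > /proc/irq/28/smp_affinity    # eth0 IRQ 28 -> CPU3 (mask 0x8)
echo 1 > /proc/irq/29/smp_affinity    # xhci_hcd (USB NICs) -> CPU0 (mask 0x1)
cat /proc/irq/27/smp_affinity         # read back to verify

(Note that irqbalance may move them again unless those IRQs are excluded from balancing.)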

Here is what /proc/softirqs looks like for my router -

cat /proc/softirqs 
                    CPU0       CPU1       CPU2       CPU3       
          HI:     154546          0          0          0
       TIMER:      33277      21143      12542      22958
      NET_TX:     397203      15010     150211      60152
      NET_RX:     579858     832927     282525    1690024
       BLOCK:          0          0          0          0
    IRQ_POLL:          0          0          0          0
     TASKLET:     738565        856     160987      16464
       SCHED:      74418      51104      42057      48432
     HRTIMER:          0          0          0          0
         RCU:      55674      43136      45572      39496

In short, it would be great to be able to specify an overriding CPU mask for a given interface's rx and tx flows when packet steering is enabled.

You can set the mask to zero so that packets stay on the CPU that took the IRQ, avoiding the cache misses you get from moving packets between CPU cores. YMMV.

Also check /proc/net/softnet_stat; the 2nd column counts packets dropped in the stack (backlog full) and the 3rd counts time_squeeze events.
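
A quick sketch to label those columns per CPU (values stay in hex; on current kernels column 1 is packets processed, column 2 is dropped, column 3 is time_squeeze):

awk '{ printf "cpu%d processed=0x%s dropped=0x%s time_squeeze=0x%s\n", NR-1, $1, $2, $3 }' /proc/net/softnet_stat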


Yes, of course. However, in the RPi 4 case it might be because I also have SQM enabled that it seems to help to spread eth1 soft IRQs across CPU0 and CPU2. Additionally, if rx or tx has multiple queues, my understanding is that you should leave rps_cpus and xps_cpus as 0.
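
To check whether an interface actually has multiple hardware queues (and is therefore probably best left at the default masks), the queue directories can simply be listed (eth0 here is just an example):

ls /sys/class/net/eth0/queues/
# a single rx-0/tx-0 pair means one queue; multiple rx-N/tx-N entries mean a multi-queue device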

cat /proc/net/softnet_stat 
00239a46 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0003eefb 00000000 00000000 00000000 00000000 00000000
0000175a 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000000 00000000
0010be60 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0002177a 00000000 00000000 00000002 00000000 00000000
000014ba 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000000b5 00000000 00000000 00000003 00000000 00000000

Anti-Example

cat /proc/net/softnet_stat 
0008c383 00000000 00000003 00000000 00000000 00000000 00000000 00000000 00000000 000043f9 00000000 00000000 00000000 00000000 00000000
00000a6b 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000000 00000000
0009eea6 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00003fd4 00000000 00000000 00000002 00000000 00000000
00000553 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000003 00000000 00000000

On "normal" network cards like i(x)gb(e) each queue is steered by irq, so you dont need extended steering. But you can manually throw 2 interrupts to different cores, and for each permit 2 independed cpus for steering.