Wired routing performance on rpi4 with 21.02.3

If you have been struggling to improve the wired routing performance of your pi4 because of poor softirq handling, this post is for you. I'd like to thank earlier posts from the likes of @dlakelan and @julienrobin28 for giving me the basics to build on.

I have a 1GB rpi4 with the 1Gb r8152 based tplink USB ethernet adapter. Because of the homekit bug with 21.02.3, I knew I'd have to find a way to make this work on my 1Gb fiber link without firewall software offloading. My ISP is lame, and uses PPPoE, so there's an extra network encapsulation layer to complicate matters. I didn't reach this point until after putting the rpi4 in service as my gateway router, so I didn't have a proper isolated test bench. All bandwidth testing was done with speedtest, but my main concern was with irq management.

I was shocked to realize that an additional 'bug' with 21.02 appears to be a total loss of softirq load balancing across CPU cores. My 19.07.10 routers (gl.inet b-1300, RB750Gr3) have no problem nearly saturating all cores with softirq, but not my rpi4 on 21.02.3. This was devastating because the rpi4 was finally supposed to bring all the CPU I could need to handle 1Gb full duplex.

The 2 most promising config tweaks I read about were installing the irqbalance package and configuring packet_steering (global section of /etc/config/network) thusly:

config globals 'globals'
        option ula_prefix 'fdb9:b67b:4224::/48'
        option packet_steering yes
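
One way to check whether packet steering actually took effect is to read back the per-queue CPU masks that the kernel exposes under /sys/class/net. This is only a minimal sketch, assuming a POSIX shell such as OpenWrt's ash; the function name show_queue_masks and its optional root-directory argument are made up for illustration:

```shell
# show_queue_masks [ROOT]
# Print every rps_cpus / xps_cpus file under ROOT together with its
# current value (a hex CPU bitmask). ROOT defaults to /sys/class/net;
# the argument exists only so the function can be pointed elsewhere.
show_queue_masks() {
    root="${1:-/sys/class/net}"
    for f in "$root"/*/queues/*/rps_cpus "$root"/*/queues/*/xps_cpus; do
        # Skip unexpanded globs and unreadable entries.
        [ -r "$f" ] || continue
        printf '%s = %s\n' "${f#"$root"/}" "$(cat "$f" 2>/dev/null)"
    done
}
```

Running show_queue_masks with no arguments lists the live values. An rps_cpus mask of 0 means receive steering is off for that queue.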

Neither one had the desired effect. I was always limited to 600-700 Mbps with a single core nearly maxed out on softirq load. Checking 'cat /proc/interrupts' showed all interrupts were occurring on CPU0.
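
The /proc/interrupts check can be condensed into per-CPU totals, which makes a lopsided distribution easier to spot. A rough sketch, assuming a POSIX shell with awk available; the function name irq_cpu_totals and its two optional arguments (a regex to filter lines, and an alternate input file) are hypothetical:

```shell
# irq_cpu_totals [PATTERN] [FILE]
# Sum the per-CPU interrupt counts of all lines in FILE that match
# PATTERN (an awk regex). Defaults: every line of /proc/interrupts.
irq_cpu_totals() {
    pattern="${1:-.}"
    file="${2:-/proc/interrupts}"
    awk -v pat="$pattern" '
        NR == 1 {                       # header row: CPU0 CPU1 ...
            ncpu = NF
            for (i = 1; i <= NF; i++) name[i] = $i
            next
        }
        $0 ~ pat {
            # Field 1 is the IRQ label; the next ncpu fields are counts.
            for (i = 1; i <= ncpu; i++)
                if ($(i + 1) ~ /^[0-9]+$/) total[i] += $(i + 1)
        }
        END {
            for (i = 1; i <= ncpu; i++)
                printf "%s %.0f\n", name[i], total[i]
        }
    ' "$file"
}
```

For example, irq_cpu_totals eth0 prints one line per CPU with the summed counts of the IRQs whose description mentions eth0.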
So I started messing around with CPU affinity for both IRQs and network device rx/tx queues. It turns out that I can't modify the CPU affinity for the USB IRQ (my eth1) or the tx queue for eth1. Things didn't play out the way one would expect, but through trial and error I got to a place where I can get over 900 Mbps up and down (tested one direction at a time, though; I could probably get close to that in both directions at once), and softirq load is reduced overall, spread over 2 cores, and rarely goes above 50% for a single core. I settled on these few tweaks, which I dropped into a script that I call from /etc/rc.local, but you could just copy these lines into /etc/rc.local too. Just make sure to put them above the 'exit 0' line.

# set cpu affinity for eth0 (irq31,32) from 0-3 to 1-3
echo '1-3' > /proc/irq/31/smp_affinity_list
echo '1-3' > /proc/irq/32/smp_affinity_list

# receive queues for both eth0 and eth1
# (values are hex CPU bitmasks: 2 = CPU1, 3 = CPU0+CPU1):
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 3 > /sys/class/net/eth1/queues/rx-0/rps_cpus

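# eth0 has 5 transmit queues for some reason
# note: xps_cpus takes a hex CPU bitmask, not a CPU number, so these are
# 3 = CPU0+CPU1, 2 = CPU1, 1 = CPU0, 0 = no CPUs (XPS off for that queue):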
echo 3 > /sys/class/net/eth0/queues/tx-0/xps_cpus
echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus
echo 1 > /sys/class/net/eth0/queues/tx-2/xps_cpus
echo 0 > /sys/class/net/eth0/queues/tx-3/xps_cpus
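
Since rps_cpus and xps_cpus expect hex bitmasks rather than CPU numbers, a small helper can do the conversion. A minimal sketch for a POSIX shell; the name cpus_to_mask is made up for illustration, and it is only meant for small core counts like the pi4's four cores:

```shell
# cpus_to_mask CPU...
# Print the hex bitmask that selects the given CPU numbers.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        # Set the bit whose position equals the CPU number.
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}
```

With this, cpus_to_mask 0 1 gives the mask for CPU0 and CPU1, and something like echo "$(cpus_to_mask 2)" > /sys/class/net/eth0/queues/rx-0/rps_cpus would steer that queue to CPU2. Pinning one tx queue per core would use the masks 1, 2, 4 and 8, but I have not tested that variant.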

If this works for you, great. If you have a better solution, feel free to drop a comment.