OK, to follow up: I figured out how to enable Receive Packet Steering (RPS) and turned it on for my ethernet devices:
echo 2 > /sys/class/net/eth1/queues/rx-0/rps_cpus
echo 1 > /sys/class/net/eth0/queues/rx-0/rps_cpus
so that CPU1 handles received packets on eth1 (the USB NIC) and CPU0 handles received packets on eth0.
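For anyone following along: rps_cpus takes a hexadecimal CPU bitmask, where bit n selects CPU n (1 = CPU0, 2 = CPU1, 4 = CPU2, 8 = CPU3). So, for example, to steer an interface's receive processing onto CPU2 and CPU3 instead, the mask would be c:

echo c > /sys/class/net/eth1/queues/rx-0/rps_cpus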
The result was 65% CPU idle, with one ksoftirqd thread (ksoftirqd/2, oddly) using 100% of a core, and 35.5% softirq shown in top -d 1.
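To see which CPUs that softirq time actually lands on, a per-CPU view helps; assuming the sysstat package is installed:

mpstat -P ALL 1

(pressing 1 inside top gives a similar per-CPU breakdown)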
Bandwidth is now ~830 Mbps:
[ 5] 5.00-6.00 sec 99.2 MBytes 832 Mbits/sec
[ 5] 6.00-7.00 sec 99.0 MBytes 830 Mbits/sec
[ 5] 7.00-8.00 sec 99.0 MBytes 830 Mbits/sec
[ 5] 8.00-9.00 sec 99.0 MBytes 830 Mbits/sec
So that feels like progress...
I bumped this to 860 Mbps or so by adjusting the interrupt affinity for eth0, and briefly saw about 920 at one point, so the potential to route at line speed is there. At no point does the device go below about 50% idle, so there are basically two cores sitting unused even while routing almost a full gigabit.
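The affinity adjustment was something along these lines (a sketch, not my exact commands; IRQ 48 is eth0's busy interrupt in the /proc/interrupts dump below):

echo 2 > /proc/irq/48/smp_affinity    # mask 0x2 = CPU1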
EDIT: further info.
I installed the simple nftables firewall from my other thread "QoS and nftables ... some findings to share", and added a custom HFSC-based shaper on both eth0 and eth1 that I have used before. With those in place, the Pi will route 575 Mbps, showing 67% idle and 34% softirq.
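The shaper is roughly this shape (a minimal sketch, not my actual script; the rate and interface are placeholders):

tc qdisc add dev eth0 root handle 1: hfsc default 10
tc class add dev eth0 parent 1: classid 1:10 hfsc sc rate 500mbit ul rate 500mbit
tc qdisc add dev eth0 parent 1:10 fq_codel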
The big issue seems to be a failure to multi-thread the packet handling. I would have thought the two ethernet devices would split across two CPUs and we'd see higher throughput, but apparently not.
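One thing worth checking is how many hardware queues each NIC exposes; if an interface has only a single RX queue, its interrupt can only fire on one CPU at a time, and RPS is the software workaround:

ls /sys/class/net/eth0/queues/
# a single rx-0 entry means one RX queue, hence one CPU taking that IRQ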
Here is the /proc/interrupts output:
           CPU0       CPU1       CPU2       CPU3
 17:          0          0          0          0  GICv2  29 Level  arch_timer
 18:      87012    5643314      31811      46223  GICv2  30 Level  arch_timer
 23:        339          0          0          0  GICv2 114 Level  DMA IRQ
 31:       3560          0          0          0  GICv2  65 Level  fe00b880.mailbox
 34:       6554          0          0          0  GICv2 153 Level  uart-pl011
 37:          0          0          0          0  GICv2  72 Level  dwc_otg, dwc_otg_pcd, dwc_otg_hcd:usb3
 38:          0          0          0          0  GICv2 169 Level  brcmstb_thermal
 39:      25015          0          0          0  GICv2 158 Level  mmc1, mmc0
 45:          0          0          0          0  GICv2 106 Level  v3d
 47:    4251576          0          0          0  GICv2 189 Level  eth0
 48:   20076230          0          0          0  GICv2 190 Level  eth0
 54:         51          0          0          0  GICv2  66 Level  VCHIQ doorbell
 55:          0          0          0          0  GICv2 175 Level  PCIe PME, aerdrv
 56:    8839918          0          0          0  Brcm_MSI 524288 Edge  xhci_hcd
FIQ:                                              usb_fiq
IPI0:         0          0          0          0  CPU wakeup interrupts
IPI1:         0          0          0          0  Timer broadcast interrupts
IPI2:      9669      13473     199712      95848  Rescheduling interrupts
IPI3:       474       2470       1309       1222  Function call interrupts
IPI4:         0          0          0          0  CPU stop interrupts
IPI5:     27560       1708       1753       2936  IRQ work interrupts
IPI6:          0          0          0          0  completion interrupts
Err:          0
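Notably, all the hot sources (eth0's IRQs 47/48 and the USB controller's IRQ 56) land on CPU0. A quick way to check where each of them is currently allowed to run:

for irq in 47 48 56; do
    printf '%s: ' $irq; cat /proc/irq/$irq/smp_affinity_list
done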
I wonder if someone like @moeller0 has an idea (or knows someone who does) about how to improve the interrupt handling and speed up the shaping, etc.