RPi4 routing performance numbers

Interestingly, I went to my main router and ran a speed test while watching netdata. It uses way more CPU during the download phase of a speedtest than it does during the upload phase. This appears to be due to a dramatically higher interrupt rate (softirq rates of about 115k/s during the download phase versus about 47k/s during the upload phase).

I don't really understand this, because every packet my PC sends has to be received by the router and then sent by the router, and every packet my PC receives has to be received by the router and then sent by the router. So it's not like it's receiving fewer packets during upload than it's receiving during download or anything like that. And it's not due to the shaper, because it's actually the hardware interrupts that are higher during the download phase (and disabling the shaper had little effect).
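For anyone who wants to compare the two phases without netdata, here is a minimal sketch of how the softirq rates could be sampled directly. It assumes a Linux box with Python available and reads `/proc/softirqs`; the function names are my own for illustration, and it only covers softirqs (hardware interrupts live in `/proc/interrupts`, which has a different layout).

```python
import time


def parse_softirqs(text):
    """Parse /proc/softirqs text into {softirq_name: [count_per_cpu, ...]}."""
    counts = {}
    for line in text.splitlines()[1:]:  # first line is the CPU header
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "NAME: n n n" row
        counts[name.strip()] = [int(x) for x in rest.split()]
    return counts


def rates(before, after, seconds):
    """Per-second rate of each softirq type, summed over all CPUs."""
    return {
        name: (sum(after[name]) - sum(before[name])) / seconds
        for name in after
        if name in before
    }


def sample(path="/proc/softirqs", interval=1.0):
    """Take two snapshots `interval` seconds apart and return the rates."""
    with open(path) as f:
        before = parse_softirqs(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_softirqs(f.read())
    return rates(before, after, interval)
```

Calling `sample()` once during the download phase and once during the upload phase, and comparing the `NET_RX` and `NET_TX` entries, should show whether the difference sits on the receive side or the transmit side.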

What would make the download phase of a speedtest produce dramatically more interrupts? (Tagging some people who might have some knowledge here: @moeller0, who seems to be the latency expert, and @jeff, who did a bunch of benchmarking recently. Feel free to tag a few others if you can think of someone who knows the nitty gritty of this stuff.) If I could figure this out I might be able to tune things on both my main router and the Pi to keep the interrupt rate down and maybe get a bunch more performance out of either of them.

Network topology for main router:

ISP Device -> Smart Switch <-bonded interface w vlans-> Router 
                   |
           More switches (but how do you *know* she is a switch?) ---> PC

EDIT: Further data....

I added a second iperf3 instance on my main router, so I can send from my laptop to two separate servers, and set the HFSC scheduler to rate limit at 940Mbps. The overall rate goes to 433*2 = 866Mbps with shaping, and the CPU usage drops, leaving about 55% idle. Strangely, ksoftirqd is using about 45% CPU under this load, whereas it's using about 100% CPU when trying to route a single very fast stream (which it does at around 720Mbps). It seems that with two separate streams things calm down, either from spreading the load among different CPUs, or from chunking the scheduling of the qdiscs into fewer interrupts, or something like that. In any case, I think it's safe to say that under realistic routing loads the Pi can handle 500Mbps with shaping without a problem, and with tuning (possibly default tuning under OpenWrt) could handle 700+Mbps single stream, and at least 850Mbps aggregated across multiple streams.
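One way to check the "spreading the load among different CPUs" guess would be to look at how the `NET_RX` softirq counts grow per CPU during a test. This is only a rough sketch, assuming two snapshots of `/proc/softirqs` text taken before and after a run; the helper names are hypothetical.

```python
def per_cpu_counts(text, name="NET_RX"):
    """Return the per-CPU counters for one softirq row of /proc/softirqs text."""
    for line in text.splitlines()[1:]:  # first line is the CPU header
        label, sep, rest = line.partition(":")
        if sep and label.strip() == name:
            return [int(x) for x in rest.split()]
    raise KeyError(name)


def cpu_shares(before, after):
    """Fraction of the new softirqs that landed on each CPU."""
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    return [d / total if total else 0.0 for d in deltas]
```

If the single-stream run puts nearly all of the share on one CPU while the two-stream run splits it, that would support the load-spreading explanation; if both look the same, the difference is more likely in how the work is batched.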

All this while using about 50% of its processor power... So you could run a squid proxy, or a NAS based on a USB3 spinning disk, or some kind of network monitoring with the remaining CPU.