Yes, I can confirm your observations: my R2S, R4S, and R6S all give me very low numbers, which contradicts some YouTubers' real-world tests (for example, the R4S was tested at ~800 Mbps, while the R6S can reach ~1 Gbps with FriendlyElec's build). At first I thought it was a CPU affinity issue, but changing the CPU affinity did not fix it. The author of wg-bench also noticed that this issue happens on big.LITTLE ARM platforms, so yes, it could be a kernel bug.
Thanks for the GitHub issue link; I just added my comments there as well.
tbh I'm not interested in WireGuard, and the issue is more likely WireGuard-specific. Anyway the Kconfig change looks lovely, feel free to create a PR.
This thread is WireGuard-related, but my R5C tops out around 410 Mbps with CAKE, which also seems a bit slow for a quad-core A55 at 2 GHz. But regardless, interest is noted.
On the NanoPi R6S, if you just run the wg-bench test on the FriendlyWrt firmware, the result is about 1.04 Gbits/sec.
But if, after starting the test (./benchmark.sh -R), you use the LuCI interface at the same time (open and switch tabs, or switch to the live graphs of processor load or temperature), the result is very surprising: almost 3.5 Gbits/sec!
I repeated this experiment several times on the FriendlyWrt-2024-12-09 firmware -> https://github.com/friendlyarm/Actions-FriendlyWrt/releases
Try to reproduce the test conditions I described.
Obviously, this is the same kernel configuration error as in OpenWrt, only with a different kernel configuration flag -> CONFIG_PREEMPT_VOLUNTARY=y.
Maybe this is a consequence of the R6S's big.LITTLE CPU setup?
Quad-core ARM Cortex-A76 (up to 2.4 GHz) and quad-core Cortex-A55 CPU (up to 1.8 GHz)
So maybe it depends on which CPUs WireGuard ends up running on?
I did not notice any change in the per-core CPU load. It is not as if only one cluster is working: all 8 cores are loaded, as the load graphs show. The only difference is that when the test gives 1 Gbit/s the cores are not heavily loaded, and when the result is 3.5 Gbit/s the core load is about 80%.
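For anyone who wants numbers rather than the LuCI graphs, here is a rough Python sketch that computes per-core busy percentages from two snapshots of /proc/stat. The function names are made up for this post, and it assumes Python is available on the device (it is not part of a stock image). It counts idle and iowait as idle time and everything else, including softirq, as busy.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-core line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 ..."; skip the aggregate "cpu" line
        # and everything that is not a CPU line (intr, ctxt, ...).
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Use user, nice, system, idle, iowait, irq, softirq, steal.
        # Guest time is already included in user/nice, so it is left out.
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values)
        cores[fields[0]] = (total - idle, total)
    return cores


def per_core_busy(before, after):
    """Busy percentage per core between two /proc/stat snapshots (as text)."""
    first = parse_proc_stat(before)
    second = parse_proc_stat(after)
    result = {}
    for name, (busy1, total1) in second.items():
        if name not in first:
            continue
        busy0, total0 = first[name]
        elapsed = total1 - total0
        result[name] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


def read_proc_stat(path="/proc/stat"):
    with open(path) as handle:
        return handle.read()


def main(interval=1.0):
    """Print one busy percentage per core, sampled over `interval` seconds."""
    first = read_proc_stat()
    time.sleep(interval)
    second = read_proc_stat()
    usage = per_core_busy(first, second)
    for name in sorted(usage, key=lambda n: int(n[3:])):
        print(f"{name}: {usage[name]:5.1f}% busy")
```

Calling main() while ./benchmark.sh -R is running, once in the 1 Gbit/s case and once in the 3.5 Gbit/s case, should show whether the load really is spread over both clusters in both cases.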
I do wonder if this is core-clock/memory subsystem scaling going on, or sub-optimal memory saturation causing a bottleneck at that transaction insertion rate.
On one RPi3 that I use as a remote node, to get Tailscale/WireGuard running at ~100 Mbit line speed on such a cheap device, I cap the CPU min and max frequencies between 900 MHz and 1.2 GHz, but also keep the DDR3 memory scaled up and overclocked via config.txt.
If I don't peg the DDR and uncore/memory frequency, I see 100% CPU utilization with lower throughput. That is because the CPU stalls while waiting for data, which keeps the CPU 'loaded' without it making forward progress at the same rate.
Since your R6S has a far newer microarchitecture, it might not be apples to apples, but I am certain there has to be a frequency sweet spot where you could boost the floor and observe a consistent 3.5 Gbit WG rate at 80% OS CPU utilization. Opening the GUI and using LuCI might be causing those clocks to bump naturally, but that leaves things to chance.
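To test the "boost the floor" idea instead of relying on LuCI activity, something like the following Python sketch could be used. It reads the standard Linux cpufreq sysfs files per policy (typically one policy per cluster on big.LITTLE SoCs) and can raise scaling_min_freq. The function names and the fraction parameter are my own invention, writing to sysfs needs root, and this only touches the CPU clocks, not the memory or uncore frequency mentioned above.

```python
from pathlib import Path

# Standard cpufreq sysfs location on Linux; one policyN directory per
# frequency domain (usually one per CPU cluster).
CPUFREQ_ROOT = Path("/sys/devices/system/cpu/cpufreq")


def read_policies(root=CPUFREQ_ROOT):
    """Return one dict per cpufreq policy with its CPUs, governor and clocks (kHz)."""
    policies = []
    for policy in sorted(Path(root).glob("policy*")):
        def read(name, base=policy):
            return (base / name).read_text().strip()

        policies.append({
            "policy": policy.name,
            "cpus": read("related_cpus"),
            "governor": read("scaling_governor"),
            "cur_khz": int(read("scaling_cur_freq")),
            "min_khz": int(read("scaling_min_freq")),
            "max_khz": int(read("scaling_max_freq")),
        })
    return policies


def raise_floor(root=CPUFREQ_ROOT, fraction=1.0):
    """Set each policy's scaling_min_freq to `fraction` of its scaling_max_freq.

    Needs root on a real system. The kernel clamps the value to what the
    hardware supports.
    """
    for policy in sorted(Path(root).glob("policy*")):
        max_khz = int((policy / "scaling_max_freq").read_text())
        (policy / "scaling_min_freq").write_text(str(int(max_khz * fraction)))
```

A possible experiment: print read_policies() during a 1 Gbit/s run and during a 3.5 Gbit/s run and compare cur_khz, then call raise_floor() and rerun the benchmark without touching LuCI. If the result stays near 3.5 Gbit/s, clock scaling is a likely suspect; if not, the cause is somewhere else.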