NanoPi R4S and Gigabit SQM

I recently bought an R5S hoping to use it as a gigabit router with some future-proofing thanks to its 2.5G LAN ports. I then realized that although the R5S uses a newer SoC, it has four low-power cores (A55), whereas the R4S has two high-performance cores (A72) and four efficiency cores (A53). I did some benchmarking and, as expected, the R5S handled gigabit routing just fine. However, when I enabled SQM, CPU usage skyrocketed and throughput dropped (basically equivalent to the dual-core A53 in my RT3200), implying that the SoC, despite being newer, is not as performant as the one in the R4S.

I have read that the RPi4 is good for gigabit routing (4 A72 cores). Does this extend to the R4S as well, or is x86 the only real option for gigabit routing? Has anyone had experience with SQM on the R4S?

Thanks!

Biggest problem with SQM on the R4S at gigabit line rate is that the lower-speed cores aren't quite up to the task of handling interrupts and nowhere near powerful enough for line-rate shaping. The balanced (high) speeds of the Pi 4 cores make a difference, but at the expense of slightly higher WAN latency when using a USB NIC and less convenient sysupgrades (due to missing kernel modules).

Which SQM are you using? I recall cake was more CPU-intensive than fq_codel. For a reference point, my RPi4 does cake/piece_of_cake on an asymmetric 900000 kbit/s down / 300000 kbit/s up link with no bufferbloat (Waveform test). CPU usage on a single core does hit the high nineties in htop.

Care to report the throughput you achieved with SQM on the R5S? It would be nice to document the OpenWrt version you used as well.

P.S.: As far as I know, Raspberry Pi 4Bs achieve gigabit shaping mostly with large, MTU-sized packets; I am not sure what their actual packets-per-second (PPS) limit is. For most normal use cases this is somewhat moot, but it would be interesting to test, as the same PPS limit would likely apply to 2.5 Gbps Ethernet and might allow predicting the achievable SQM rates for faster NICs.
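To make the PPS argument concrete, here is a rough sketch of the arithmetic. It assumes plain Ethernet framing (38 bytes of per-packet overhead on the wire) and that a shaper is limited by a fixed packets-per-second budget; the function names are made up for illustration and the numbers are not measurements of any device.

```python
# Per-packet Ethernet overhead on the wire, in bytes:
# 14 (header) + 4 (FCS) + 8 (preamble/SFD) + 12 (inter-frame gap).
ETH_WIRE_OVERHEAD = 38


def packets_per_second(rate_bps, ip_packet_bytes):
    """Packets per second needed to fill rate_bps with packets of one size."""
    wire_bits = (ip_packet_bytes + ETH_WIRE_OVERHEAD) * 8
    return rate_bps / wire_bits


def rate_for_pps(pps_limit, ip_packet_bytes):
    """Gross wire rate (bit/s) reachable under a fixed PPS budget."""
    return pps_limit * (ip_packet_bytes + ETH_WIRE_OVERHEAD) * 8


if __name__ == "__main__":
    # How many packets per second 1 Gbps means at a few packet sizes.
    for size in (1500, 576, 100):
        pps = packets_per_second(1e9, size)
        print(f"{size:5d}-byte packets at 1 Gbps: {pps:,.0f} pps")
```

Under these assumptions, a router that only just manages 1 Gbps with 1500-byte packets has a budget of roughly 81,000 pps. Plugging that budget into `rate_for_pps` with the same packet size gives about 1 Gbps again no matter how fast the NIC is, which is why a measured PPS limit would help predict shaping rates on 2.5 Gbps hardware.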

It seems pretty clear that in the router market, with its relatively thin margins, cost is a major factor in manufacturers' component choices (if we are generous, power consumption might be as well*). That in turn means newer devices are not necessarily more performant than older ones**, but given the target market of the NanoPi R series I am a bit puzzled that the R5S is not a clear winner.

As your numbers indicate, routing at ~1 Gbps is well within reach of the R5S and, IIRC, the R4S; the question is rather whether they can also do fancier things on a packet-by-packet basis at that speed. In theory, high-performance ARM cores will work just as well as "high performance"*** x86, but getting a router that uses such cores is hard or expensive (think of buying a current M1/M2 Apple Mac mini only to install Linux and use it as a router).
In case you go the x86 route, keep in mind that older x86 architectures like Sandy Bridge/Ivy Bridge tend to have a high sustained energy consumption, which can be quite unattractive even though these machines are available second-hand at attractive prices. There are threads in the forum about which second-hand x86 devices/appliances make decent routers, should you want to switch to x86.

*) But since the manufacturers do not pay for the operational cost and power numbers are not advertised aggressively, I have some doubts.
**) Many commercial routers make up for the lack of a powerful CPU with dedicated hardware-acceleration offload engines that make normal internet access fast in spite of weak-ish CPUs.
***) Traffic shaping at 1 Gbps does not need that high-performance an x86 CPU; a number of Atom-based SoCs are also up to the task.

Often true: cake does more than HTB+fq_codel, but at a cost. It is not fully clear what the cause is, so no improvements for cake are on the horizon. That said, switching to fq_codel/simple.qos (using HTB as shaper and fq_codel as scheduler+AQM) is a viable option if not all of cake's additional features are needed.

Here are the results I got from the Waveform Bufferbloat Test. I believe that network variation is a big factor as my results early morning are far more consistent and don't show a big drop.

All tests were run on Verizon Fios 940/880 Mbps. I used FriendlyWRT 21.02 (FriendlyWRT 22.03 randomly stops routing packets, and the early work on getting mainline OpenWrt working doesn't yet have SQM):

No SQM (748.9/593.1)

CAKE + layer_cake (735.3/226.0)

CAKE + piece_of_cake (780.0/233.0)

fq_codel + piece_of_cake (421.0/249.3)

Due to the weird result of SQM outperforming no SQM, I ran a second No SQM test (808.7/180.0).

Hope this helps!

Post your /etc/config/sqm.
For reference, my Pi 4's:

config queue 'eth1'
  option debug_logging '0'
  option verbosity '5'
  option upload '300000'
  option qdisc 'cake'
  option script 'piece_of_cake.qos'
  option interface 'eth1'
  option linklayer 'ethernet'
  option overhead '0'
  option download '900000'
  option enabled '1'

It looks like this for CAKE + piece of cake:

config queue 'eth1'
	option linklayer 'none'
	option interface 'eth0'
	option download '900000'
	option upload '850000'
	option debug_logging '0'
	option verbosity '5'
	option script 'piece_of_cake.qos'
	option enabled '1'
	option qdisc 'cake'

Some thoughts from reading about throughput performance on the RxS devices.
Test throughput _through_ the device; don't run the test with the R4S as an endpoint.
Default settings might put IRQs, queues and other processes on the same core.
Use top or another appropriate tool to watch per-CPU load during the tests.
Look at assigning processes to specific cores, or try irqbalance.
More info on the device page:

Lots of discussion on this topic:
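To follow up on the per-CPU load and core-assignment suggestions above, here is a minimal sketch that computes per-core busy percentages from two `/proc/stat` snapshots and builds the hex bitmask that `/proc/irq/<n>/smp_affinity` expects. It assumes Python is installed on the router (it is not by default on OpenWrt), and the function names are made up for illustration.

```python
import time


def parse_proc_stat(text):
    """Map each per-core line of /proc/stat to (busy_ticks, total_ticks)."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep "cpu0", "cpu1", ...; skip the aggregate "cpu" line and the rest.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user nice system idle iowait irq softirq steal (guest is part of user)
        ticks = [int(v) for v in fields[1:9]]
        idle = ticks[3] + ticks[4]  # idle + iowait
        total = sum(ticks)
        cores[fields[0]] = (total - idle, total)
    return cores


def per_cpu_load(before, after):
    """Percent busy per core between two parse_proc_stat() snapshots."""
    loads = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        loads[cpu] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return loads


def sample_load(interval=1.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return per-core load."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return per_cpu_load(before, after)


def affinity_mask(cpus):
    """Hex bitmask in the format /proc/irq/<n>/smp_affinity accepts."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")
```

Calling `sample_load()` while a speed test is running shows whether one core is saturated while the others idle. `affinity_mask([4, 5])` returns `30`; if the A72 cores show up as cpu4 and cpu5 on your R4S (an assumption worth checking in `/proc/cpuinfo`), that value could be written to the NIC's `/proc/irq/<n>/smp_affinity`, with the IRQ number taken from `/proc/interrupts`.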