For those interested, below is a routing performance test comparing the NanoPi R4S and the NanoPi R5S, with and without SQM.
Apart from enabling "Software Flow Offloading," all OpenWrt settings were at their default values.
What I found odd is that with cake / piece_of_cake, the download speed was CPU-limited, yet only one core was at 100% while the other cores were idle or lightly loaded. I suspect that OpenWrt's SQM is not multithreaded, which would explain this behavior.
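If anyone wants to check the single-core saturation on their own device, here is a rough sketch of how per-core load can be measured by diffing two snapshots of /proc/stat. It assumes Python 3 is available on the router (it is not part of a default OpenWrt install), and the function names are my own, not part of any OpenWrt tool.

```python
import time


def parse_proc_stat(text):
    """Return {core_name: (busy_jiffies, total_jiffies)} for each per-core line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines start with "cpu0", "cpu1", ...; skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
        # Guest time is already counted in user/nice, so only the first 8 are summed.
        total = sum(values[:8])
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        cores[fields[0]] = (total - idle, total)
    return cores


def per_core_busy(before, after):
    """Return {core_name: busy_percent} between two /proc/stat snapshots."""
    first = parse_proc_stat(before)
    second = parse_proc_stat(after)
    usage = {}
    for name, (busy_b, total_b) in second.items():
        if name not in first:
            continue
        busy_a, total_a = first[name]
        elapsed = total_b - total_a
        usage[name] = 100.0 * (busy_b - busy_a) / elapsed if elapsed > 0 else 0.0
    return usage


def sample_live(interval=1.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return per-core busy percent."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return per_core_busy(before, after)
```

Calling `sample_live()` while a speed test is running should show one core near 100% and the rest close to idle if the shaper really is bound to a single core.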