Squeezing blood from a stone: SQM at 600+ Mbps on IPQ4018 without NSS?

Hi everyone,
I have a silly question. Or maybe not so silly.

We build mesh routers for crisis response (refugee camps, disaster zones, festivals). Our hardware is the 8dev Jalapeno / MeshPoint.One - IPQ4018, quad-core Cortex-A7 @ 717 MHz, 256 MB RAM. Old hardware by today's standards, but it's what's deployed in the field and we can't swap it out.

We've been doing a deep dive into SQM performance on this platform and found something interesting that I'd love the community's input on.

THE SITUATION

IPQ4018 has NO NSS cores.
So all packet processing is software-only on the ARM cores.

Current performance:

  • Raw forwarding (no SQM): ~780 Mbps
  • With software flow offload: ~950 Mbps
  • With CAKE (single-queue): ~200-250 Mbps <-- the problem
  • With fq_codel (no shaping): ~600-800 Mbps

CAKE on a single core is the bottleneck. Meanwhile 3 cores sit mostly idle. Classic.

WHAT I FOUND

  1. The IPQ4018 EDMA driver exposes 4 TX queues per netdev (EDMA_NETDEV_TX_QUEUE = 4 in edma.h, confirmed in both legacy essedma and new IPQESS drivers).

  2. CAKE_MQ (merged into net-next for Linux 7.0) creates one CAKE instance per hardware TX queue, distributing the work across cores.

  3. In theory: 4 TX queues x 4 CPU cores = CAKE distributed across all cores = potentially 600-800 Mbps with full QoS.

Has anyone tested CAKE_MQ on IPQ40xx hardware? Does the EDMA driver's multi-queue implementation actually distribute softirq processing across cores, or does it all end up on core 0 anyway?

THE BANDWIDTH PROBLEM

Our use case makes bandwidth estimation... interesting:

  • WiFi mesh backhaul: anywhere from 10 to 800 Mbps depending on distance, interference, weather, number of mesh hops
  • WiFi AP with clients: 5 to 150 simultaneous users, signal quality varies wildly
  • WAN uplink: often Starlink or cellular, bandwidth oscillates throughout the day

We can't hardcode a bandwidth value for CAKE because nothing is fixed.

We know about cake-autorate and it looks promising for the WAN side.

But for the mesh backhaul and AP interfaces, we're relying on kernel fq_codel + mac80211 per-station fq_codel + AQL, with no explicit shaping.
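Since that driver-level path has no tc-visible qdisc, the only handles are the mac80211 debugfs knobs. A sketch for inspecting them, assuming debugfs is mounted, the kernel has CONFIG_MAC80211_DEBUGFS, and phy0 is the radio in question (exact file names vary by kernel version):

```shell
# Inspect mac80211 fq_codel + AQL state (requires CONFIG_MAC80211_DEBUGFS;
# paths below assume phy0 and may vary by kernel version).
PHY=/sys/kernel/debug/ieee80211/phy0

cat "$PHY"/aqm              # global fq parameters (limit, quantum, codel knobs)
cat "$PHY"/aql_txq_limit    # per-AC airtime queue limits, in microseconds
cat "$PHY"/aql_threshold    # total pending airtime before AQL engages

# Hypothetical tweak: lower the AC_BE limits (format: "<AC> <low> <high>").
# Left commented out on purpose - defaults are usually the right answer.
# echo "2 2500 6000" > "$PHY"/aql_txq_limit
```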

Question: Is there a simple approach for periodic bandwidth measurement + SQM reconfiguration? Something like:

  1. Detect low-traffic period
  2. Run quick bandwidth probe (iperf3 to next mesh hop?)
  3. Reconfigure CAKE bandwidth parameter
  4. Repeat every few hours
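The steps above could be a short cron-able script. This is a minimal sketch under stated assumptions: iperf3 running in server mode on the next mesh hop (the 10.0.0.1 address is hypothetical), CAKE already installed as the single root qdisc on the device, and OpenWrt's jsonfilter available for parsing the iperf3 JSON.

```shell
#!/bin/sh
# Periodic bandwidth probe + CAKE reconfiguration (sketch, not production code).
# Assumptions: iperf3 -s on the next mesh hop (PEER is hypothetical), cake is
# the root qdisc on $DEV, jsonfilter is available (stock on OpenWrt).
DEV=eth0
PEER=10.0.0.1        # next mesh hop, hypothetical
MARGIN=90            # shape to 90% of the measured rate

# 1. Skip the probe if the link is already busy (rough 1-second rate check).
tx1=$(cat /sys/class/net/"$DEV"/statistics/tx_bytes); sleep 1
tx2=$(cat /sys/class/net/"$DEV"/statistics/tx_bytes)
[ $(( (tx2 - tx1) * 8 / 1000000 )) -gt 50 ] && exit 0   # >50 Mbit/s in use

# 2. Quick probe: 5-second iperf3 run with JSON output.
bps=$(iperf3 -c "$PEER" -t 5 -J | \
      jsonfilter -e '@.end.sum_received.bits_per_second')
[ -z "$bps" ] && exit 1

# 3. Reconfigure cake to MARGIN% of the measured rate (strip any decimals).
kbit=$(( ${bps%.*} * MARGIN / 100 / 1000 ))
tc qdisc change dev "$DEV" root cake bandwidth "${kbit}kbit"
```

The busy-check threshold and the 90% margin are guesses that would need tuning per deployment; the probe itself also eats the link for five seconds, which is exactly why step 1 matters.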

Or is this overengineering, and is fq_codel without shaping genuinely "good enough" for mesh links where bandwidth is unknown?

WHAT I’M DOING NOW

Our current stack (all available today on OpenWrt 24.10):

  • Gateway WAN: CAKE besteffort + cake-autorate (only interface with explicit shaping)
  • Mesh backhaul: kernel fq_codel (default, zero config)
  • WiFi AP: mac80211 per-station fq_codel + AQL (driver-level)
  • LAN: kernel fq_codel (default)
  • IRQ affinity + RPS/XPS distributed across all 4 cores
  • NAPI budget tuned to 1000
  • CPU governor: performance
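For reference, the IRQ affinity and RPS/XPS spreading in that list boils down to a handful of proc/sysfs writes. A sketch with hypothetical IRQ numbers (the real ones are in /proc/interrupts; masks are CPU bitmaps, so 1=cpu0 ... 8=cpu3):

```shell
# Pin each EDMA IRQ to its own core. IRQ numbers 30-33 are placeholders;
# read the actual ones from /proc/interrupts.
for pair in "30 1" "31 2" "32 4" "33 8"; do
    set -- $pair
    echo "$2" > /proc/irq/"$1"/smp_affinity
done

# RPS: allow any of the 4 cores to run receive processing (mask f = cores 0-3).
for q in /sys/class/net/eth0/queues/rx-*/rps_cpus; do
    echo f > "$q"
done

# XPS: pin each TX queue to one core so all 4 queues stay busy in parallel.
i=0
for q in /sys/class/net/eth0/queues/tx-*/xps_cpus; do
    printf '%x\n' $((1 << i)) > "$q"
    i=$((i + 1))
done
```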

This gets us to approximately 300-350 Mbps with CAKE on WAN. We're hoping CAKE_MQ on OpenWrt 25.12 will push that to 600+.

SPECIFIC QUESTIONS

1. CAKE_MQ on IPQ40xx: Has anyone tested it? Does the EDMA multi-queue actually distribute across cores?

2. SFE + egress qdisc: We confirmed from source that SFE calls dev_queue_xmit() which preserves the egress qdisc. Anyone running SFE + CAKE/fq_codel in production on IPQ40xx? Any gotchas beyond the known ingress/IFB issue?
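For completeness, this is the usual egress-plus-IFB recipe that question 2 is really about: egress CAKE sits on the normal dev_queue_xmit() path, while the IFB redirect is the part the known ingress/offload issue bites, so it is the piece to verify with offload on vs. off. Rates and device names below are illustrative.

```shell
# Egress: plain cake on the WAN device - stays on the dev_queue_xmit()
# path that SFE preserves (per our reading of the source).
tc qdisc replace dev eth0 root cake bandwidth 80mbit besteffort

# Ingress: redirect into an IFB device and shape there. This is where
# software flow offload is known to cause trouble, so compare shaped
# rates with offload enabled vs. disabled.
ip link add ifb0 type ifb 2>/dev/null
ip link set ifb0 up
tc qdisc replace dev eth0 handle ffff: ingress
tc filter replace dev eth0 parent ffff: protocol all matchall \
    action mirred egress redirect dev ifb0
tc qdisc replace dev ifb0 root cake bandwidth 75mbit besteffort ingress
```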

3. Mesh QoS without bandwidth knowledge: For WiFi mesh links where bandwidth varies 10-800 Mbps, is fq_codel genuinely the right answer? Or are we leaving performance on the table?

4. cake-autorate on Starlink: Anyone running this combination? How well does it adapt to Starlink's bandwidth variations?

5. Am I missing something obvious? Any IPQ4018-specific optimizations we haven't considered?

We've documented our full analysis including per-interface recommendations, CPU impact measurements, and community network research. Happy to share the write-up privately if anyone is interested.

A NOTE ON DAVE TAHT

I want to say something personal here. Dave was a mentor to me. During the years we were building MeshPoint - mesh routers for refugee camps along the Croatian border in 2015-2016 - Dave was incredibly generous with his time and knowledge. He answered emails within hours, jumped on calls whenever I asked, and never once made me feel like my questions were too basic. He genuinely cared about getting networks right for the people who needed them most.

The fact that fq_codel + AQL "just works" on our mesh nodes without any configuration - that's Dave's legacy in every packet we forward. We're trying to build on that foundation for crisis response networks where connectivity saves lives.

The 25.12 dedication is well deserved. Rest in peace, Dave.

Thanks for any insights. Happy to share our benchmark data and test methodology if useful.


Per the "Cake-mq - backport of multi-core capable CAKE implementation to 25.12 branch" thread:

You may just have to upgrade to 25.12.0

I’m running SQM on an IPQ4018 with 25.12.0 and it did seem slightly faster.


Very simply said: no. Your expectations exceed what ipq40xx can give you - and multi-core SQM will not magically triple your throughput either; the keyword there is 'slightly', not three-fold.


Kind of - it is 1 TX and 1 RX queue+interrupt per core, and plain irqbalance does pretty well.
RPS/XPS is quite futile, as packets come in/out on all cores anyway.

CAKE runs explicitly on one core; if it shapes both input and output, then on two...
And your numbers are not really meaningful: input and output are independent, and you don't say a single thing about your subscription's capacity or guaranteed rate.

You also did not mention a single baseline test without QoS.
Do one now and post the result link:
https://www.waveform.com/tools/bufferbloat

If it is 600 Mbps and grade A without effort, no QoS will improve it on a low-to-midrange SoC.

@slh @anon63541380 Thanks for the reality check.

You're right - 600+ Mbps with CAKE was wishful thinking. After going through the CAKE_MQ thread and @da_anton's IPQ4019 results (~170 Mbps, all cores maxed), the ceiling is clearly ~200-250 Mbps for CAKE on Cortex-A7. The title was aspirational, not a measurement.

The numbers I posted (780 Mbps, 950 Mbps) were estimates for raw forwarding without SQM, not shaped throughput. That wasn't clear - my fault. We don't have actual hardware benchmarks yet, those are coming once we flash 25.12 on our boards.

On EDMA IRQ distribution - interesting. We looked into the driver source and noticed the new ipqess driver (25.12) seems to handle this differently from the old essedma driver. Still figuring out the details on that one.

Some context on what we're doing: we build mesh networks for crisis response and remote communities. Field deployments use whatever WAN is available - 4G, Starlink, WiFi uplinks, 5G. Bandwidth changes constantly, which makes static CAKE bandwidth settings tricky.

What I'd really love to hear from people who've been deep in this:

  • What's worth optimizing vs. a waste of time on IPQ40xx? We don't want to chase diminishing returns.
  • Is there a middle ground between 'fq_codel with no shaping' (~950 Mbps but no rate control) and 'full CAKE' (~200 Mbps with complete bufferbloat control)?
  • If you were thinking outside the box and had full control over the software stack on this hardware - what would you try?

We're genuinely curious. We'll post real benchmarks once we have hardware running 25.12.

Good point. I did run some quick tests before but didn't save the results properly. I'll repeat with the Waveform bufferbloat test over the weekend and post the results here.

Our main concern isn't actually the 1 Gbps office line - it's field deployments on 4G/Starlink (30-100 Mbps variable) where the WAN is the bottleneck and bufferbloat is a real problem.