How does (software) flow_offloading interact with SQM?

Lots of conflicting info online.

Well, software flow offloading (or any accelerator technology, for that matter) bypasses various kernel subsystems to achieve higher overall throughput. This conflicts with shapers and queue management to some extent, as these need access to all packets of a flow, not just the initial handshake ones.
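For context, software flow offloading in OpenWrt is toggled through the firewall defaults section. A sketch (option names as in recent OpenWrt releases; verify against your version before relying on it):

```
# Enable software flow offloading (netfilter flowtable) in OpenWrt.
# Established flows then skip most of the netfilter path, which is
# why a shaper on the same device may see less of the traffic.
uci set firewall.@defaults[0].flow_offloading='1'
# Hardware offload additionally requires NIC/driver support:
# uci set firewall.@defaults[0].flow_offloading_hw='1'
uci commit firewall
/etc/init.d/firewall restart
```

The same toggles appear in LuCI under Network → Firewall as "Software flow offloading" and "Hardware flow offloading".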

Is this correct? Can someone give a high-level summary of how flow offloading works as opposed to SQM? Can they coexist at all in OpenWrt?

I did some informal testing and found that SQM (fq_codel) "alongside" software flow offloading (flow_offloading, not flow_offloading_hw) raised my downstream from around 120 Mbps to 140 Mbps. However, it seemed to come at the cost of 0-15 ms of additional bufferbloat latency versus fq_codel on its own. In addition, I'm still not sure whether fq_codel was active at all, based on my reading that they "conflict" with each other. My bufferbloat tests are far better with "both" active than with both disabled: my loaded latency was 200-400 ms before enabling them.
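One way to check whether fq_codel is actually handling traffic is to look at the qdisc statistics while the link is loaded. A diagnostic sketch, assuming SQM is attached to eth1 (adjust the device name to your setup):

```
# Show the qdiscs actually attached to the SQM interface, with stats.
tc -s qdisc show dev eth1
# For the ingress side, sqm-scripts typically shapes on an IFB device:
tc -s qdisc show dev ifb4eth1
# If fq_codel (or cake) is doing real work, its "drops"/"marks"
# counters should grow under sustained load; counters that stay at
# zero while the link is saturated suggest traffic is bypassing it.
```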

No offloading:

Yes offloading:

More reading:


Wish this would see more development ... :no_mouth:

I found a couple of interesting threads:

I am not a developer, but if any are reading this, what do you think about this?
We are approaching territory where SQM becomes impossible on all but the highest-end hardware... You need a beast of a router to work with gigabit Ethernet (let's not even talk about higher speeds; 5 or even 10 gigabit is straight up impossible).

Not really:
a) most ISP routers are shipped with wholly underpowered CPUs for the links they are deployed on, and make up for that by using some off-load engines. Even relatively old x86 devices manage a gigabit in software quite well and are not that power hungry. But sure, most convenient all-in-one routers are out of their league at such fast link rates.
b) 5-10 Gbps, honestly, why does one need that right now? Apart from folks constantly sinking and sourcing large bodies of data (that is the one application), I see little practical improvement in switching from 100/40 to 250/40, let alone to 1000 or even "10000" Mbps. Well, it might give you bragging rights, but for that it really does not matter whether the contracted speed is actually usable :wink:
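To put rough numbers on the CPU-cost argument: at a given line rate the shaper has a fixed time budget per packet. A back-of-the-envelope sketch, assuming full-size 1500-byte packets (real traffic mixes contain many small packets and are therefore worse):

```shell
RATE_BPS=1000000000                # 1 Gbps line rate
PKT_BITS=$((1500 * 8))             # one full-size packet, in bits
PPS=$((RATE_BPS / PKT_BITS))       # packets per second at that rate
NS_PER_PKT=$((1000000000 / PPS))   # per-packet time budget in ns
echo "pps=$PPS budget=${NS_PER_PKT}ns"   # ~83333 pps, ~12000 ns (12 us) per packet
# At 10 Gbps the budget shrinks tenfold to ~1.2 us per packet, which
# is why software shaping gets hard without fast cores or offload help.
```
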

Personally, I used my 100/40 link for a long time shaped down to 49/29 IIRC, as that was the SQM limit for my old router; in my testing, for my workloads, sacrificing ~50% of the aggregate link speed was less of an issue than not having SQM. I ended up switching to another router that has no problems with SQM at 100/40, but my point is: just because a contract says X Mbps, there is no requirement to actually run the shaper close to that rate. One might even be able to save some money by ordering a rate that is within one's router's SQM "limit". But that is clearly subjective; if for your use cases raw rate is more important than acceptable responsiveness under load, I am happy to accept that.
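As a concrete illustration of shaping below the contract rate, an OpenWrt sqm-scripts section for the 100/40-shaped-to-49/29 setup described above might look like this (interface name is an example; rates are in kbit/s):

```
# /etc/config/sqm -- shape a 100/40 link down to 49/29
config queue 'wan'
        option enabled '1'
        option interface 'eth1'    # WAN device, adjust to your setup
        option download '49000'    # ingress limit, kbit/s
        option upload '29000'      # egress limit, kbit/s
        option qdisc 'fq_codel'
        option script 'simple.qos'
```

The usual guidance is to set the shaper somewhat below the measured link rate so the queue builds in the router (where fq_codel can manage it) rather than in the ISP's equipment.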

How are ISPs able to do SQM on their weak routers... is the hardware special in any way?

Like dedicated chips for hardware SQM, or are they using the same hardware that we all have in our routers?

I think this should be the focus of development going forward...
We are already struggling with gigabit speeds, the pinned post being "buy a new router if you want to do SQM"... Once everyone is on 5 to 10 gigabit, will only x86_64 be able to do that?

You already can't in my country :stuck_out_tongue: Multiple ISPs offer 2 Gbps as the cheapest package available.

Well, most ISPs simply do not offer SQM at all... for others, like the NSS cores, there are at least fq_codel and shaper implementations that allow some sort of SQM configuration to use the offload engine (though the NSS offload system uses two general-purpose CPUs, not dedicated ASICs for specific network functionality).

Yeah, as far as I know that does not really exist in that form (I hope I am simply misinformed). HOWEVER, the most costly part of SQM is the actual traffic shaping; if that could be off-loaded generically to hardware and still combined with a software qdisc, that would go a long way...

Sure, go ahead and develop in that direction :wink: I am sure you will find a lot of interested users...

Well, nobody forces you to buy a 5-10 Gbps link, and even if you do, you can operate your traffic shaper at lower speeds... I would guess that Apple's M1/M2 CPUs or Amazon's Graviton2 would have little problem with 5-10 Gbps, but these are likely way more expensive than x86 solutions (which, unless you need multi-dozen-core Xeon CPUs, are pretty pricey themselves).

Well, OK, but you can still operate a traffic shaper at, say, 100/100 even on a 2 Gbps link :wink: