Shaping performance

It would be nice to have a table showing maximum shaping performance by CPU, similar to the OpenSSL benchmarks in the wiki. For many, addressing bufferbloat is one of the key reasons to use OpenWrt/LEDE, but it is tricky to find out how much CPU power you need to cover your internet connection speed. Some of this information is buried in the recommendation threads, but it's not super easy to find.

If there is a clear method for testing this, that would help others contribute the benchmark data. There may be differences between fq_codel and cake.

Note: this is different from the bufferbloat score you get from the DSL Reports speed test. I want to know the maximum bandwidth a given CPU can handle with SQM enabled.

There is also a lot of confusion surrounding whether HW NAT acceleration is compatible with SQM. It seems that it isn't, but this question keeps coming up. This should be clarified.


HW NAT is not compatible with SQM according to @nbd here: Hardware NAT For LEDE - #264 by nbd

Maybe a column in the device list?
But it would be hard to keep accurate due to code/performance flux.

Rule of thumb, imo:

  • number of cores doesn't really matter
  • a ~1 GHz ARM will do up to a few hundred Mbit/s
  • for more, use something with a fan (x86 with an Intel NIC)

While that might sound nice, it's basically impossible to 'crowdsource' this benchmarking data. Yes, it's easy to get anecdotal data like "works for my 50 MBit/s connection", but that doesn't give you any advice for 100, 150 or 300 MBit/s lines - and the number of users with fully symmetric 1 GBit/s connections is way too low to gather information for a broader range of devices.

Measuring effective routing speed isn't as easy as running iperf or flent, given that you have to take additional overhead (e.g. PPP(oE) for xDSL and some fibre contracts; setting up a pppd is something not many users can or will do for benchmarking) into account and need sufficiently powerful endpoints for your testing (wireless or smartphones just don't cut it). Simulating bufferbloat is even harder than this. Vendors have the infrastructure to do this, LEDE developers and very advanced users might be able to accomplish this as well, but you won't be able to gather broad device coverage and comparisons from this relatively small group (no one can buy 'every' device on the market) - and even there I'd expect relatively large variations in the benchmarking results caused by slightly different test setups.


ooooh yes 🙂

Incompatible is a strong word here. Hardware-assisted flow-offloading should not break an SQM setup; however, the SQM flows can't actually be offloaded (as they need more processing by the netfilter code than the fast paths provide). In theory, flow-offloading is 'simple': it allows subsequent traffic to bypass large parts of the netfilter code, which can provide a pretty decent speed boost. However, SQM needs to keep close track of all data passed through it, delaying it by exactly the time needed to achieve constant throughput without overloading the connection. This does need full treatment (slow path) from the kernel.

So yes, assuming fully working (as designed) flow-offloading, you can enable SQM just fine, but you will inevitably lose the performance bump you'd get without SQM. That obviously doesn't rule out bugs, which are bound to happen in very new code (like flow offloading).
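For anyone wondering where this fast path is actually toggled: on current OpenWrt it lives in the firewall defaults section. The fragment below is only a sketch of that config (option names from the fw3/fw4 firewall packages; whether hardware offload does anything depends on driver support for your SoC):

```
# /etc/config/firewall (defaults section, illustrative values)
config defaults
        option flow_offloading '1'      # software flow offloading (fast path)
        option flow_offloading_hw '1'   # hardware offload, only if the driver supports it
```

As discussed above, any traffic that goes through an SQM shaper still takes the slow path regardless of these settings.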

Thanks for the tips, guys. So maybe the experts can chime in on what various platforms can accomplish; I'd be happy to create the wiki given the information. The rule of thumb is nice - I wonder if there's something like X Mbps per Y GHz for ARM and A Mbps per B GHz on MIPS/x86/etc.

I thought one of the advantages of the Mediatek MT7621AT devices (e.g. Edgerouter X, DIR-860L) was the fact that it was dual core. Did I miss something there? This would also be nice to clarify.

It really isn't that simple; pure CPU performance (MHz/GHz) doesn't have a direct relation to the I/O performance of a SoC. mt7621 is a prime example of this: its CPU performance is far from stellar, but it can still do routing at close to 1 GBit/s line speed (routing != SQM, != VPN, != anything that needs actual CPU performance in addition to 'pure' routing/NAT). IPQ806x on the other hand has very fast krait 300 cores (~ARM Cortex A15), but it cannot compete with mt7621 in terms of routing throughput (this may change if the nss cores can be integrated into a hardware flow-offload driver); this changes as soon as CPU-intensive loads (e.g. VPN) enter the picture though.

Good point @slh, especially the VPN part which can obviously have a major impact on CPU utilization. I vaguely remember reading that cache size was an important factor in SQM performance. So, you gave me some info about routing performance, but I'm specifically interested in SQM performance. Basically, what CPU do I need to shape a given internet connection speed. I think this will help people choose a router that is appropriate for their needs.

It's hard to crowd-source this information, but I agree it'd be a really useful thing to have in the table of hardware.

If I could get a moderate size dataset of hardware and max bandwidth shape-able with that hardware, I could build a pretty simple predictive model and then estimate the max shapeable bandwidth for other hardware.

It'd probably be good enough to have 3 or 4 data points for each of the major architectures: MIPS, MediaTek, ARM, x86... and then extrapolate from there to other hardware.

EDIT: suggestions for a way to set up a CSV file that people could add their data to?

What about a measurement criterion of: at what throughput rate do you no longer have some % of idle CPU time? I.e., when you hit 0% idle time, at that speed you are saturated by the bandwidth/SQM-plus-everything-else workload.

I do this by watching it in top while I'm running a DSLReports test. Speed is controlled by setting the limits in SQM. Kinda basic and clunky, and probably rather susceptible to outside variables, but easy... and able to check my C7's limits, since I have more bandwidth than a C7 can handle under SQM.
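For anyone reproducing this: the limits being referred to live in /etc/config/sqm (from the sqm-scripts package). A rough sketch, with the interface name and rates as placeholders you would replace with your own:

```
# /etc/config/sqm (illustrative values only)
config queue 'eth1'
        option enabled '1'
        option interface 'eth1'            # your WAN interface
        option download '140000'           # kbit/s, shaper limit for downstream
        option upload '35000'              # kbit/s, shaper limit for upstream
        option qdisc 'cake'
        option script 'piece_of_cake.qos'
```

Lowering `download`/`upload` until the router stops hitting 0% idle during a speed test is exactly the "clunky but easy" method described above.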

By this method, I think I can say an Archer C7 can handle more than 340 Mbit down / 35 Mbit up while NAT routing (it never runs out of idle; that's the max I can get out of the cable connection), around 140 Mbit running cake SQM through Ethernet only, or 120 Mbit through the wifi AP.

So, what do you think of the top and 0%-idle-time method? Anything wrong with it, ways to improve on it? It should be easy for many to do; creating conditions that saturate the router might be a little harder, but less lazy types than myself could set up a few iperf boxes...
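As a sketch of the same idea without eyeballing top: the idle percentage can be computed from two samples of the aggregate `cpu` line in /proc/stat. This assumes a Linux box with a Python interpreter (many routers won't have one, in which case watching top is still the way to go):

```python
"""Estimate overall CPU idle percentage between two /proc/stat samples.

Hedged sketch of the 'watch idle time during a speed test' method:
sample the aggregate 'cpu' line twice and compute what fraction of
the elapsed jiffies were spent idle (idle + iowait).
"""
import time

def parse_cpu_line(line):
    # /proc/stat 'cpu' line fields: user nice system idle iowait irq softirq ...
    fields = [int(x) for x in line.split()[1:]]
    idle = fields[3] + fields[4]   # idle + iowait jiffies
    return idle, sum(fields)

def idle_percent(sample_a, sample_b):
    # Percentage of jiffies spent idle between the two samples.
    idle_a, total_a = parse_cpu_line(sample_a)
    idle_b, total_b = parse_cpu_line(sample_b)
    d_total = total_b - total_a
    if d_total == 0:
        return 100.0
    return 100.0 * (idle_b - idle_a) / d_total

if __name__ == "__main__":
    # Run this on the router while a speed test saturates the link.
    with open("/proc/stat") as f:
        a = f.readline()
    time.sleep(1)
    with open("/proc/stat") as f:
        b = f.readline()
    print("idle: %.1f%%" % idle_percent(a, b))
```

Running it in a loop during a DSLReports test and recording the minimum reading gives the "smallest idle percentage" number suggested for the data set.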

In the context of SQM performance, it is.
Sure, multiple cores are good for doing multiple things, but the amount of work in our SQM task that can run in parallel is somewhat limited, because it often operates on the same data.
Also, 'transferring' a task from one CPU to another costs time in which it could have done 'real work'.
It varies from setup to setup, but disabling hyperthreading (and irqbalance) seems to be a longstanding recommendation for router machines.

I see. Very helpful. Thanks.

Yes, the measurement criterion of the bandwidth at which idle hits, say, 5% or less while running SQM would be a good candidate for a "max bandwidth".

Here's a suggested data format:

RouterName, CPUType, CPUMHz, BWmbpsdown, BWmbpsup, Idlepercent

If people run a speed test on their line, whatever their line is, and add a data point to a table formatted as a CSV with those columns, then with sufficient data points it's possible for me to build a statistical model that estimates the BWmbps at which Idlepercent goes to zero. It doesn't even require people to have a connection with enough bandwidth to make it go to zero; as long as I have a variety of data points between, say, 50% and 25%, the extrapolation to zero is probably "good enough" for most people's decision making.
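To make the extrapolation idea concrete, here is a toy version of such a model: fit a straight line idle% = a + b * bandwidth to the contributed points for one CPU class, then solve for the bandwidth where idle% reaches zero. The sample numbers are made up purely to illustrate the fit, not real measurements:

```python
"""Toy extrapolation sketch: least-squares fit of idle% vs. total
shaped bandwidth, then solve for the bandwidth at which idle% = 0.
Column names follow the suggested CSV format; a real model would
pool data per architecture and be more careful than this."""

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x (pure Python, no numpy).
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def max_shapeable_bandwidth(bw_points, idle_points):
    # Extrapolate to the bandwidth at which idle time hits 0%.
    a, b = fit_line(bw_points, idle_points)
    return -a / b   # solve a + b*bw = 0

if __name__ == "__main__":
    # Hypothetical samples: (BWmbpsdown + BWmbpsup, Idlepercent)
    bw   = [50, 100, 150, 200]
    idle = [75, 50, 25, 0]   # perfectly linear toy data
    print(max_shapeable_bandwidth(bw, idle))  # -> 200.0
```

Note that points far from saturation barely constrain the slope, which is why measurements below ~50% idle are the valuable ones.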

Put BWmbps in decimal values, so if you have 300 kbps, enter 0.300.

@tmomas can we get another one of those "wiki" posts with an editable table like this to let people put together a data set?

How shall we name it? "Shaping performance statistics"?

You'll want to be sure to have clear instructions on how to test.
Also, I believe the up vs. down bandwidth split with SQM may not be relevant. It's more like a total shaping bandwidth the router can handle, regardless of how it's distributed up vs. down.

In essence this seems to be correct, but as far as I can tell downstream shaping is a bit more computationally expensive, as the ifb device adds some additional cost (meaning downstream plus upstream shaping will probably have a lower total shaped bandwidth than upstream shaping alone; not that this situation is that relevant in real life...).

I think "shaping performance statistics (contribute your data)" would be a reasonable name, or something like it.

I'll see if I can write up some kind of suggested text for how to do the test. It would be good if there's a way to get the CPU usage data from LuCI rather than only via ssh.

I fear that many people will not have internet connections fast enough to max out their CPU and contribute bad data. Same goes for properly configuring SQM. As alluded to earlier, I think this is a bit tricky to test for. Are there any experts we can consult for this data? Maybe the bufferbloat mailing list? Or maybe they have a suggestion on how to benchmark? I know @richb-hanover has some scripts for testing here: https://github.com/richb-hanover/OpenWrtScripts but I think there's still the problems related to settings and bandwidth.

No, with a proper statistical model (and this is something I'm an expert in), the people who aren't bringing their CPU anywhere near its max performance will simply be ignored by the model; it's fine to have them in the data set, they won't hurt anything. The important thing is getting at least a few people who stress the CPU at least somewhat - less than 50% idle, less than 25% even better. Also, I'll be able to pool across multiple routers that have similar architectures. It'll be fine.

I think it should be enough to ssh into the router, run top -d 1, and then from a computer on the wired LAN run a DSLReports speed test; record the smallest idle percentage during the test, along with the up and down bandwidth measurements given by the test. All preferably while the rest of the network is quiet.

If we can get a few of you to put in some preliminary data here to see how well doing that works, it'd be helpful before we try the "wiki". Just post your data in the format given above, in separate comments.
