I bought a fortigate 50e as a toy.
As far as I know, this mvebu SoC is quite powerful; I have seen forum posts claiming it can achieve 1 Gbps NAT and 500 Mbps cake, but I have also seen posts saying it cannot even handle cake at 330 Mbps.
I ran benchmarks with iperf3, varying the number of concurrent connections, and on this device increasing the number of concurrent connections immediately caused SQM performance to drop.
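For reference, a sketch of this kind of test sweep (the server address and the exact flow counts are my assumptions, not from the post):

```shell
# Sweep the number of parallel TCP flows against an iperf3 server
# on the far side of the shaper (address is hypothetical).
for flows in 1 10 50 128; do
    iperf3 -c 192.168.1.100 -P "$flows" -t 30
done
```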
To make an assessment of that you need to actually monitor CPU usage (per CPU, especially on the CPU(s) handling the software interrupts for the cake qdisc(s)).
I think it would be helpful if you could capture tc -s qdisc snapshots for all tests (say, 10 seconds into the test) to see what statistics cake reports, as well as CPU usage data from around the same time in the test.
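A minimal way to grab both snapshots mid-test might look like this (the interface name and the timing are assumptions):

```shell
# ~10 s into the iperf3 run, capture cake's statistics plus two
# per-CPU load samples (diff the /proc/stat jiffies for 1 s of load).
sleep 10
tc -s qdisc show dev eth0 > cake-stats.txt        # eth0 is assumed
grep -E '^cpu[0-9]' /proc/stat > cpu-before.txt
sleep 1
grep -E '^cpu[0-9]' /proc/stat > cpu-after.txt
```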
We see that the number of drops increases a lot between these tests... and I notice you have "ingress" mode activated. Ingress mode is great in that it basically equalizes the ingress rate into the shaper instead of the egress rate out of the shaper, but it also becomes increasingly aggressive the more flows you employ. You also carry a number of additional "expensive" keywords, like diffserv4, dual-dsthost, and nat... Don't get me wrong, these keywords all offer valuable behaviour, but they are not exactly free of cost.
I would like to ask you to repeat the test without the ingress keyword, as ingress mode is likely to have an effect here. (We also need to tackle your 100% CPU load issue somehow, as that makes it somewhat hard to draw conclusions about what is limiting cake's throughput.)
Also, from my own testing with a dual-core mvebu device (Turris Omnia), the most I got it to shape reliably and robustly was 550/550 Mbps, and only after manually editing the receive packet steering script (this defaulted to keeping softirq processing off the CPU handling the NIC's hardware interrupts, but with a dual-core CPU that ended up moving both SQM instances onto CPU1, and with bidirectionally saturating tests these then compete for the same CPU cycles).
These can help (if configured correctly) to free up CPU cycles that then can be used by qdiscs like cake...
Mmh, there is a way to see on which CPU interrupts are handled:
if you run this after a test and post the output here, we might be able to see how the processing is distributed over your CPUs.
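The command itself was elided above; presumably it is the standard per-CPU interrupt counters (the mvneta driver name is my assumption for this SoC):

```shell
# Per-CPU interrupt counters; the NIC rows show which CPU services
# the hardware interrupts (mvneta is the usual mvebu NIC driver).
cat /proc/interrupts | grep -i -e CPU -e eth -e mvneta
```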
But your result already indicates that CPU load is part of the problem... which again is no surprise, as 900 Mbps is outside the reliably achievable shaper rate for that SoC.
OK, I note that both dual-dsthost and triple-isolate involve two-level hashing, so these might be related to the cost. However, there is clearly another cost involved, and that is the set-associative hashing: cake needs to do more work when there are potential hash collisions. And we see a clear increase of way_inds and way_miss for the 50-flow test...
I would recommend to test:
- flows instead of dual-dsthost or triple-isolate
- besteffort instead of diffserv4
- nonat instead of nat (with flows, nat becomes useless)
- egress instead of ingress
and if you really want to push it try no-split-gso instead of split-gso
EDIT: you could also try flowblind instead of flows, as that will take the whole fq scheduling out of the equation (to test whether it is the set-associative hashing that limits throughput at high active flow counts).
This way you will pare down cake's bells and whistles in the service of trying to maximize throughput.
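Put together, the pared-down configuration might look like this (the interface name and shaper rate are assumptions; this would replace whatever qdisc SQM installed):

```shell
# Maximally stripped-down cake instance for throughput testing:
# no host isolation, no diffserv tiers, no NAT lookups, plain
# egress rate handling, and GSO super-packets left intact.
tc qdisc replace dev eth0 root cake bandwidth 500Mbit \
    besteffort flows nonat egress no-split-gso

# To additionally take fq scheduling out of the equation:
# tc qdisc replace dev eth0 root cake bandwidth 500Mbit \
#     besteffort flowblind nonat egress no-split-gso
```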
HOWEVER since you already report cake being at 100% CPU load even with a single flow you will need to do these tests with different numbers of flows to see whether these cost reductions make some difference when the CPU debt is not already insurmountably large...
Also try this with setting the shaper to say 500 Mbps and see whether this makes a difference....
This might not be ideal (unless you can push cake onto CPU1, but I seem to recall that there are limits on which CPUs can handle which interrupts on mvebu), but my limited understanding seems to indicate that cake runs on the same CPU as the NIC interrupts, which for a single cake instance on a dual-core SoC is not ideal... (once you shape both directions it makes sense again to put one shaper onto CPU0, but that is not the current test).
There is no difference between 900 Mbps and 500 Mbps. Probably both numbers exceed the ~330-340 Mbps limit, so they do not seem to have any effect.
Other options made no difference (nat/nonat, ingress/egress).
But flows made a difference (446 Mbps). As far as I know this works like fq_codel without host isolation, right? I could see that host isolation was an expensive option.
Is there a way to change the interrupt?
Changing the smp_affinity option gave an error (write error: I/O error).
I am looking for information now.
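For context, the failing step presumably looked something like this (the IRQ number is a made-up example; on mvebu the per-CPU NIC interrupts typically reject affinity changes):

```shell
# Find the NIC's IRQ number (interface name is an assumption):
grep eth0 /proc/interrupts

# Try to move IRQ 38 to CPU1 (bitmask 2 = CPU1); on mvebu this
# typically fails with "write error: I/O error":
echo 2 > /proc/irq/38/smp_affinity
```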
Yes, the dual and triple isolation options are quite helpful, but they clearly do not come for free; even then you still seem to be deep underwater CPU-wise... Maybe set the shaper to 300-400 Mbps?
And if you feel you want to see a real train wreck use MSS clamping on the NAT device and force packet size down to say 250 bytes...
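A sketch of such an MSS clamp with the usual netfilter rule (the 250-byte value is from the suggestion above; the exact chain depends on your firewall setup):

```shell
# Clamp the TCP MSS on forwarded SYN packets to 250 bytes, forcing
# every bulk flow down to tiny packets. This multiplies the per-packet
# work for the shaper -- a stress test, not a production setting.
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --set-mss 250
```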
So one thing to keep in mind here is that shaper rates are gross rates; with your configuration, cake will assume the traditional 14 bytes of per-packet overhead that Linux accounts for.
So for local IPv4/TCP without any IP or TCP options, you can expect the following goodput (which is what iperf3 reports, as far as I know):
900 * ((1500-20-20)/(1514)) = 867.9 Mbps % already a bit limited with a single flow
500 * ((1500-20-20)/(1514)) = 482.2 Mbps % ~reached with 1 to 10 flows, clearly missed with >= 50 flows
400 * ((1500-20-20)/(1514)) = 385.7 Mbps % missed with 128
350 * ((1500-20-20)/(1514)) = 337.5 Mbps % near-miss, partial success
300 * ((1500-20-20)/(1514)) = 289.3 Mbps % yepp, for 128 flows under your testing conditions you need to lower the shaper rate down to 300 Mbps...
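The calculations above can be reproduced mechanically; a small sketch in awk:

```shell
# Goodput = shaper rate * (payload / on-the-wire frame size), where the
# payload is 1500 - 20 (IPv4 header) - 20 (TCP header) and the frame is
# 1514 bytes (1500 MTU + 14 bytes Ethernet, as Linux accounts it).
for rate in 900 500 400 350 300; do
    awk -v r="$rate" 'BEGIN { printf "%d -> %.1f Mbps\n", r, r*(1500-20-20)/1514 }'
done
```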
The question is IMHO always how robust and reliable one wants/needs the shaper to perform.
But I also note that 128 parallel bulk flows is not a typical load for a home network... though it is not a completely unrealistic scenario either; think torrenting/seeding a popular Ubuntu image, or doing highly parallelized S3 uploads. Maybe @dtaht wants to take this as an additional point to ponder for his mqcake/cakemq proposal?
I think with mvebu the NIC interrupts basically are what they are and cannot be changed; the only thing possible is to move the qdisc processing to CPU1, which I think is typically achieved by enabling packet steering.
Here are my "notes" on what I did in my testing (enable RPS to both CPUs; that was testing with one shaper per direction, so I wanted to spread them over both CPUs):
## TURRIS omnia receive side scaling:
# point RPS and XPS at both CPUs (bitmask 3 = CPU0 + CPU1):
for file in /sys/class/net/*; do
    echo 3 > "$file/queues/rx-0/rps_cpus"
    echo 3 > "$file/queues/tx-0/xps_cpus"
done

# verify what is actually set:
for file in /sys/class/net/*; do
    echo "$file RX rps_cpus: $(cat "$file/queues/rx-0/rps_cpus")"
    # xps_cpus does not work with recent enough OpenWrt (21 in my case)
    #echo "$file TX xps_cpus: $(cat "$file/queues/tx-0/xps_cpus")"
done