What is the realistic number of concurrent flows in cake?

I bought a fortigate 50e as a toy.
As far as I know, this mvebu SoC is quite powerful, and I have seen forum posts saying it has achieved 1 Gbps NAT and 500 Mbps with cake. However, I have also seen posts saying it cannot even handle cake at 330 Mbps.
I benchmarked it with iperf3 while varying the number of concurrent connections, and on this device SQM throughput drops as soon as the connection count increases.

1 connection : 830 Mbps
5 connections : 550 Mbps
10 connections : 524 Mbps
20 connections : 466 Mbps
50 connections : 422 Mbps
100 connections : 358 Mbps
128 connections : 336 Mbps (max of iperf3)

So, is it fair to treat the 128-connection result as the actual SQM limit this CPU can handle in home use?

Edit: These are download speed tests. I get better speeds in the upload tests.

To make an assessment of that you need to actually monitor CPU usage (per CPU, especially on the CPU(s) handling the software interrupts for the cake qdisc(s)).
I think it would be helpful if you captured tc -s qdisc snapshots for all tests (say, 10 seconds into each test) to see what statistics cake reports, along with CPU usage data from around the same point in the test.
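Something along these lines should capture both at a fixed point in the test (just a rough sketch, started right after iperf3; adjust paths and timing to taste):

# start this right after starting iperf3
(
    sleep 10
    tc -s qdisc > /tmp/cake_10s.txt
    # two /proc/stat samples one second apart; the per-CPU busy time is the
    # difference between the cpuN lines of the two samples
    grep '^cpu' /proc/stat >> /tmp/cake_10s.txt
    sleep 1
    grep '^cpu' /proc/stat >> /tmp/cake_10s.txt
) &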

Do you run iperf3 on the device itself and a client?

In all tests, one core shows 100% usage.
It's a pretty simple test, but I feel the difference in results is huge.

qdisc information for 1, 10, and 50 connections.

connection : 1

root@OpenWrt:~# tc -s qdisc show dev ifb4br-wan
qdisc cake 801e: root refcnt 2 bandwidth 900Mbit diffserv4 dual-dsthost nat nowash ingress no-ack-filter split-gso rtt 100ms raw overhead 0
 Sent 1056223305 bytes 697743 pkt (dropped 0, overlimits 11338 requeues 0)
 backlog 269492b 178p requeues 0
 memory used: 520612b of 15140Kb
 capacity estimate: 900Mbit
 min/max network layer size:           60 /    1514
 min/max overhead-adjusted size:       60 /    1514
 average network hdr offset:           14

                   Bulk  Best Effort        Video        Voice
  thresh      56250Kbit      900Mbit      450Mbit      225Mbit
  target            5ms          5ms          5ms          5ms
  interval        100ms        100ms        100ms        100ms
  pk_delay          0us       2.49ms        343us          3us
  av_delay          0us       1.81ms         12us          0us
  sp_delay          0us       1.74ms         12us          0us
  backlog            0b      269492b           0b           0b
  pkts                0       697898           19            4
  bytes               0   1056489255         3302          240
  way_inds            0            0            0            0
  way_miss            0           25            2            1
  way_cols            0            0            0            0
  drops               0            0            0            0
  marks               0            0            0            0
  ack_drop            0            0            0            0
  sp_flows            0            0            1            0
  bk_flows            0            1            0            0
  un_flows            0            0            0            0
  max_len             0        66616          262           60
  quantum          1514         1514         1514         1514


connections : 10

root@OpenWrt:~# tc -s qdisc show dev ifb4br-wan
qdisc cake 801e: root refcnt 2 bandwidth 900Mbit diffserv4 dual-dsthost nat nowash ingress no-ack-filter split-gso rtt 100ms raw overhead 0
 Sent 1616044276 bytes 1067728 pkt (dropped 208, overlimits 12678 requeues 0)
 backlog 322482b 213p requeues 0
 memory used: 751396b of 15140Kb
 capacity estimate: 900Mbit
 min/max network layer size:           60 /    1514
 min/max overhead-adjusted size:       60 /    1514
 average network hdr offset:           14

                   Bulk  Best Effort        Video        Voice
  thresh      56250Kbit      900Mbit      450Mbit      225Mbit
  target            5ms          5ms          5ms          5ms
  interval        100ms        100ms        100ms        100ms
  pk_delay          0us       6.94ms        608us          3us
  av_delay          0us       4.95ms         71us          0us
  sp_delay          0us       3.76ms         60us          0us
  backlog            0b      322482b           0b           0b
  pkts                0      1068080           64            5
  bytes               0   1616664962        16408          300
  way_inds            0            0            0            0
  way_miss            0           51            3            1
  way_cols            0            0            0            0
  drops               0          208            0            0
  marks               0            0            0            0
  ack_drop            0            0            0            0
  sp_flows            0            0            1            0
  bk_flows            0           10            0            0
  un_flows            0            0            0            0
  max_len             0        66616          790           60
  quantum          1514         1514         1514         1514



connections : 50

root@OpenWrt:~# tc -s qdisc show dev ifb4br-wan
qdisc cake 801e: root refcnt 2 bandwidth 900Mbit diffserv4 dual-dsthost nat nowash ingress no-ack-filter split-gso rtt 100ms raw overhead 0
 Sent 2264814730 bytes 1496485 pkt (dropped 5073, overlimits 12952 requeues 0)
 backlog 228614b 151p requeues 0
 memory used: 900184b of 15140Kb
 capacity estimate: 900Mbit
 min/max network layer size:           60 /    1514
 min/max overhead-adjusted size:       60 /    1514
 average network hdr offset:           14

                   Bulk  Best Effort        Video        Voice
  thresh      56250Kbit      900Mbit      450Mbit      225Mbit
  target            5ms          5ms          5ms          5ms
  interval        100ms        100ms        100ms        100ms
  pk_delay          0us        7.8ms        753us        150us
  av_delay          0us       5.11ms        222us          2us
  sp_delay          0us        3.1ms        162us          2us
  backlog            0b      228614b           0b           0b
  pkts                0      1501488          215            6
  bytes               0   2272645312        78194          360
  way_inds            0        10767            0            0
  way_miss            0          105            3            1
  way_cols            0            0            0            0
  drops               0         5073            0            0
  marks               0            0            0            0
  ack_drop            0            0            0            0
  sp_flows            0            0            1            1
  bk_flows            0           50            0            0
  un_flows            0            0            0            0
  max_len             0        66616         2108           60
  quantum          1514         1514         1514         1514

j4105 x86 OpenWrt box (iperf3 server) > Fortigate 50E OpenWrt router (NAT, SQM on the WAN bridge port br-wan) > laptop (iperf3 client)

Thanks, I misread what you wrote originally then. Have you enabled packet steering?

Packet steering and irqbalance have not been tested yet. But these have nothing to do with sqm, right?

One core at 100% suggests that there is a chance.

It's not difficult. I just turned on packet steering.
It's a little better. In my 50-connection test I got 515 Mbps (vs. 422 Mbps without packet steering).
Core 1 is used 100%, and core 2 is used 33%.
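For reference, on recent OpenWrt releases this should also be toggleable from the CLI (assuming the default 'globals' section exists in /etc/config/network):

uci set network.globals.packet_steering='1'
uci commit network
/etc/init.d/network restart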

We see that the number of drops increases a lot between these tests... and I notice you have "ingress" mode activated. Ingress mode is great in that it basically equalizes the ingress rate into the shaper instead of the egress rate out of the shaper, but it also becomes increasingly aggressive the more flows you employ... You also carry a number of additional "expensive" keywords, like diffserv4, dual-dsthost, nat... Don't get me wrong, these keywords all offer valuable behaviour, but they are not exactly free of cost.

I would like to ask you to repeat the test without the ingress keyword, as ingress mode is likely to have an effect here. (We also need to tackle your 100% CPU load issue somehow, as that makes it somewhat hard to come to conclusions about what is limiting cake's throughput.)

Also, from my own testing with a dual-core mvebu device (Turris Omnia), the most I got it to shape reliably and robustly was 550/550 Mbps, and only after manually editing the receive packet steering script (this defaulted to keeping softirq processing off the CPU handling the NIC's hardware interrupts, but on a dual-core CPU that ended up moving both sqm instances onto CPU1, so with bidirectionally saturating tests they compete for the same CPU cycles).

These can help (if configured correctly) to free up CPU cycles that then can be used by qdiscs like cake...

Mmh, there is a way to see on which CPUs interrupts and softirqs are handled:

cat /proc/softirqs
cat /proc/interrupts

If you run these after a test and post the output here, we might be able to see how the processing is distributed over your CPUs.

But your result already indicates that CPU load is part of the problem... which again is no surprise, as 900 Mbps is outside the reliably achievable shaper range for that SoC.

Sorry, I was having dinner. Let's do some more tests.
First of all, cake's options seem to have little effect on throughput.

connections : 128

diffserv4 dual-dsthost nat nowash ingress : 338 Mbps
diffserv4 dual-dsthost nonat nowash : 345 Mbps
diffserv4 triple-isolate nonat nowash : 342 Mbps
besteffort triple-isolate nonat wash : 339 Mbps
root@OpenWrt:~# cat /proc/softirqs
                    CPU0       CPU1
          HI:          0          0
       TIMER:      26091       8341
      NET_TX:      96299         75
      NET_RX:     192988        166
       BLOCK:          0          0
    IRQ_POLL:          0          0
     TASKLET:      84707          1
       SCHED:     133546     131713
     HRTIMER:          0          0
         RCU:      12284      14221
root@OpenWrt:~# cat /proc/interrupts
           CPU0       CPU1
 25:          0          0     GIC-0  27 Edge      gt
 26:     312719     216989     GIC-0  29 Edge      twd
 27:          0          0      MPIC   5 Level     armada_370_xp_per_cpu_tick
 29:        174          0     GIC-0  34 Level     mv64xxx_i2c
 30:         12          0     GIC-0  44 Level     ttyS0
 40:          0          0     GIC-0  41 Level     f1020300.watchdog
 44:          0          0     GIC-0  96 Level     f1020300.watchdog
 45:     141667          0      MPIC   8 Level     eth0
 46:      92306          0      MPIC  10 Level     eth1
 47:          0          0      MPIC  12 Level     eth2
 48:          0          0     GIC-0  51 Level     f1090000.crypto
 49:          0          0     GIC-0  52 Level     f1090000.crypto
 50:          0          0     GIC-0  53 Level     f10a3800.rtc
 51:          0          0     GIC-0  48 Level     xhci-hcd:usb1
 53:          2          0     GIC-0  54 Level     f1060800.xor
 54:          2          0     GIC-0  97 Level     f1060900.xor
 55:          1          0  f1018100.gpio  20 Level     f1072004.mdio-mii:00
 56:          0          0  f1018140.gpio   9 Level     f1072004.mdio-mii:01
 60:          0          0  mv88e6xxx-g1   3 Edge      mv88e6xxx-f1072004.mdio-mii:02-g1-atu-prob
 62:          0          0  mv88e6xxx-g1   5 Edge      mv88e6xxx-f1072004.mdio-mii:02-g1-vtu-prob
 64:          1          3  mv88e6xxx-g1   7 Edge      mv88e6xxx-f1072004.mdio-mii:02-g2
 66:          0          0  mv88e6xxx-g2   0 Edge      mv88e6xxx-1:00
 67:          0          0  mv88e6xxx-g2   1 Edge      mv88e6xxx-1:01
 68:          0          0  mv88e6xxx-g2   2 Edge      mv88e6xxx-1:02
 69:          0          0  mv88e6xxx-g2   3 Edge      mv88e6xxx-1:03
 70:          1          3  mv88e6xxx-g2   4 Edge      mv88e6xxx-1:04
 81:          0          0  mv88e6xxx-g2  15 Edge      mv88e6xxx-f1072004.mdio-mii:02-watchdog
 82:          0          0  f1018140.gpio  22 Edge      gpio-keys
IPI0:          0          1  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:       1462       1663  Rescheduling interrupts
IPI3:      80008      48114  Function call interrupts
IPI4:          0          0  CPU stop interrupts
IPI5:          0          0  IRQ work interrupts
IPI6:          0          0  completion interrupts
Err:          0

OK, I note that both dual-dsthost and triple-isolate involve two-level hashing, so these might contribute to the cost. However, there is clearly another cost involved, namely the set-associative hashing: cake needs to do more work when there are potential hash collisions, and we see a clear increase of way_inds and way_miss in the 50-flow test...

I would recommend testing:
flows instead of dual-dsthost or triple-isolate
besteffort instead of diffserv4
nonat instead of nat (with flows, nat becomes useless anyway)
egress instead of ingress
and if you really want to push it, try
no-split-gso instead of split-gso

EDIT: you could also try:
flowblind instead of flows, as that takes the whole fq scheduling out of the equation (to test whether it is the set-associative hashing that limits throughput at high active flow counts).

This way you will pare down cake's bells and whistles in the service of trying to maximize throughput.

HOWEVER, since you already report cake being at 100% CPU load even with a single flow, you will need to do these tests with different numbers of flows to see whether these cost reductions make a difference when the CPU debt is not already insurmountably large...

Also try this with the shaper set to, say, 500 Mbps and see whether that makes a difference...
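For a quick test you could apply such a pared-down variant directly with tc, something like the sketch below (SQM will re-apply its own configuration on the next restart; the ifb name is taken from your output above):

tc qdisc replace dev ifb4br-wan root cake bandwidth 500Mbit besteffort flows nonat nowash egress no-split-gso
# and the most stripped-down variant, with flow queueing taken out entirely:
tc qdisc replace dev ifb4br-wan root cake bandwidth 500Mbit besteffort flowblind nonat nowash egress no-split-gso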

This might not be ideal (unless you can push cake onto CPU1, but I seem to recall that there are limits on which CPUs can handle which interrupts on mvebu), but my limited understanding is that cake runs on the same CPU as the NIC interrupts, which for a single cake instance on a dual-core SoC is not ideal... (once you shape both directions it makes sense again to put one shaper onto CPU0, but that is not the current test).

There is no difference between 900 Mbps and 500 Mbps. Both values probably exceed the 330~340 Mbps limit, so they do not seem to have any effect.

Other options made no difference (nat/nonat, ingress/egress).
But flows made a difference (446 Mbps). As far as I know that works like fq_codel without host isolation, right? So I can see that host isolation is an expensive option.

Is there a way to change the interrupt affinity?
Writing to the smp_affinity option gave an error (write error: I/O error).
I am looking for information now.
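What I tried was along these lines (IRQ 45 is eth0 in the /proc/interrupts output above, and mask 2 would mean CPU1):

echo 2 > /proc/irq/45/smp_affinity    # try to move eth0's IRQ to CPU1
# -> write error: I/O error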


At 128 for sure, but at 1, 10, 50 connections?

Yes, the dual and triple isolation options are quite helpful, but they clearly do not come for free; even then you seem to still be deep underwater CPU-wise... Maybe set the shaper to 300-400 Mbps?
And if you feel you want to see a real train wreck, use MSS clamping on the NAT device and force the packet size down to, say, 250 bytes...
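For reference, on an iptables-based NAT box such a clamp would look roughly like this (an MSS of 210 plus 20 B IPv4 and 20 B TCP headers gives ~250-byte packets; on an nftables/fw4 build you would need the equivalent nft rule instead):

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 210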

BTW, thank you so much for doing these tests, it is a great idea to try to figure out which parts of cake carry which costs (depending on the conditions).

900Mbps 1 connection : 852 Mbps
900Mbps 10 connections : 535 Mbps
900Mbps 50 connections : 425 Mbps
900Mbps 128 connections : 341 Mbps

500Mbps 1 connection : 481 Mbps
500Mbps 10 connections : 480 Mbps
500Mbps 50 connections : 429 Mbps
500Mbps 128 connections : 341 Mbps

300Mbps 128 connections : 289 Mbps
330Mbps 128 connections : 317 Mbps
340Mbps 128 connections : 325 Mbps
350Mbps 128 connections : 332 Mbps
400Mbps 128 connections : 331 Mbps


So one thing to keep in mind here is that shaper rates are gross rates; with your configuration, cake will assume the traditional 14 bytes of Ethernet header that Linux accounts for.
So for local IPv4/TCP without any IP or TCP options you can expect the following goodput (which is what iperf3 reports, as far as I know):

900 * ((1500-20-20)/(1514)) = 867.9 Mbps % already a bit limited with a single flow
500 * ((1500-20-20)/(1514)) = 482.2 Mbps % ~reached with 1 to 10 flows, clearly missed with >= 50 flows
400 * ((1500-20-20)/(1514)) = 385.7 Mbps % missed with 128
350 * ((1500-20-20)/(1514)) = 337.5 Mbps % near-miss, partial success
300 * ((1500-20-20)/(1514)) = 289.3 Mbps % yepp, for 128 flows under your testing conditions you need to lower the shaper rate down to 300 Mbps...
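A quick shell helper for the same calculation at other shaper rates (assuming plain IPv4/TCP over Ethernet, i.e. 1460 payload bytes per 1514 bytes that cake accounts for):

goodput() { awk -v rate="$1" 'BEGIN { printf "%.1f Mbps\n", rate * (1500 - 20 - 20) / 1514 }'; }
goodput 350    # -> 337.5 Mbps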

The question is IMHO always how robust and reliable one wants/needs the shaper to perform.
But I also note that 128 parallel bulk flows is not a typical load for a home network... but it is also not a completely unrealistic scenario either; think torrenting/seeding a popular ubuntu image or doing highly parallelized S3 uploads. Maybe @dtaht wants to take this as an additional point to ponder for his mqcake/cakemq proposal?

I think with mvebu the NIC interrupts are basically fixed and cannot be changed; the only thing possible is to move the qdisc processing to CPU1, and I think this is typically achieved by enabling packet steering.

Here are my "notes" what I did in my testing (enable RPS to both CPUs, but that was testing with one shaper per direction, so I wanted to spread them over both CPUs):

## TURRIS omnia receive side scaling:
# "3" is a hex CPU mask selecting CPU0 and CPU1
for file in /sys/class/net/*
do
    echo 3 > "$file/queues/rx-0/rps_cpus"
    echo 3 > "$file/queues/tx-0/xps_cpus"
done

# check:
for file in /sys/class/net/*
do
    echo "$file RX rps_cpus"
    cat "$file/queues/rx-0/rps_cpus"
    # xps_cpus does not work with recent enough OpenWrt (21 in my case)
    #echo "$file TX xps_cpus"
    #cat "$file/queues/tx-0/xps_cpus"
done

Here is my current output:

root@turris:~# for file in /sys/class/net/*
> do
> echo $file RX rps_cpus
> cat $file"/queues/rx-0/rps_cpus"
> # xps_cpus does not work with recent enough OpenWrt (21 in my case)
> #echo $file TX xps_cpus
> #cat $file"/queues/tx-0/xps_cpus"
> done
/sys/class/net/br-guest-turris RX rps_cpus
0
/sys/class/net/br-lan RX rps_cpus
0
/sys/class/net/eth0 RX rps_cpus
3
/sys/class/net/eth1 RX rps_cpus
3
/sys/class/net/eth2 RX rps_cpus
3
/sys/class/net/eth2.42 RX rps_cpus
0
/sys/class/net/eth2.7 RX rps_cpus
0
/sys/class/net/ifb4pppoe-wan RX rps_cpus
0
/sys/class/net/ip6tnl0 RX rps_cpus
0
/sys/class/net/lan0 RX rps_cpus
0
/sys/class/net/lan1 RX rps_cpus
0
/sys/class/net/lan2 RX rps_cpus
0
/sys/class/net/lan3 RX rps_cpus
0
/sys/class/net/lan4 RX rps_cpus
0
/sys/class/net/lo RX rps_cpus
0
/sys/class/net/pppoe-wan RX rps_cpus
0
/sys/class/net/sit0 RX rps_cpus
0
/sys/class/net/tun_turris RX rps_cpus
0
/sys/class/net/wlan0 RX rps_cpus
3
/sys/class/net/wlan1 RX rps_cpus
3

Packet steering seems to be the only option that improves performance. Running the irqbalance package made no difference.

350 Mbps with Packet Steering : 337 Mbps
400 Mbps with Packet Steering : 384 Mbps
450 Mbps with Packet Steering : 394 Mbps

As a result, on mvebu (Armada 385), cake appears to have a limit of 330~340 Mbps without packet steering and around 400 Mbps with packet steering.

And this was a test with 128 connections in iperf3, so with several hundred simultaneous connections moving files you would see lower performance, but that would be a rare case.
