Fq_codel with simplest_tbf outperforming cake with piece_of_cake

Have been a "loyal" user of cake and piece_of_cake for several years.

However, testing fq_codel with simplest_tbf tonight may have changed that.

Testing on Waveform with an Archer C7 v2 running the 21.02.1 release version.

Connection is 50/50 fiber.

With cake and piece_of_cake, I was getting around 9 to 11 ms of latency on download, and 0 ms on upload.

Speed was in the neighborhood of 44-45 Mbps for both download and upload.

So an "A" from Waveform...OK.

With fq_codel and simplest_tbf, latency on download dropped to 1 ms (and again 0 ms on upload).

Speed went up to 47 Mbps for both download and upload...and an "A+".

Similar results after several runs.

Almost spit coffee on my keyboard...

Egress and ingress are shaped at 49 Mbps, and Per Packet Overhead is set to 18.
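For anyone curious, those settings roughly correspond to an /etc/config/sqm along these lines (just a sketch; the interface name is an example, use your actual WAN device):

  config queue 'wan'
          option enabled '1'
          option interface 'eth0.2'        # WAN device (example)
          option download '49000'          # ingress shaping, kbit/s
          option upload '49000'            # egress shaping, kbit/s
          option qdisc 'fq_codel'
          option script 'simplest_tbf.qos'
          option linklayer 'ethernet'
          option overhead '18'             # per-packet overhead, bytes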

Not sure how this will work out in the long run, but I'll be interested to see what happens.

Seems to be a little kinder to my 720 MHz CPU as well...

Peaked at around 75% on the download test.

With cake and piece_of_cake, I've seen it close to 90%.

So there are potentially two things at play here:
a) TBF+fq_codel is computationally much less intensive than cake (cake does more, so the added cycles are well spent, but it means that on a low-end (by today's standards) router like yours, cake might not allow you to reach ~50/50, i.e. 100 Mbps combined)
b) cake's failure mode when it runs out of CPU is to let latency under load increase somewhat more than HTB+fq_codel does (where the consequence of running out of CPU is mostly less throughput); I have not tested TBF+fq_codel in that respect myself, but I assume it will behave similarly to HTB+fq_codel

Given the lower speeds you see with cake and the higher speeds with TBF, I think a) and b) might easily explain your observation.
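To make the structural difference concrete, here is a rough sketch of what the two sqm-scripts set up under the hood on the egress side (device name and the tbf burst/latency values are just placeholders, not the exact parameters the scripts compute):

  # TBF + fq_codel: a plain token-bucket shaper with fq_codel attached below it
  tc qdisc add dev eth0.2 root handle 1: tbf rate 49mbit burst 3000 latency 300ms
  tc qdisc add dev eth0.2 parent 1:1 handle 110: fq_codel

  # cake: shaping, flow isolation, overhead compensation and AQM in a single qdisc
  tc qdisc replace dev eth0.2 root cake bandwidth 49mbit overhead 18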

Yes, one more point in support of the CPU-overload hypothesis, especially since cake will already be unhappy if it does not get the CPU fast enough. That 90% load number is an average over time, meaning the CPU is fully pegged for a fraction of the time and idle the rest, and cake gets somewhat unhappy if the CPU is busy with something else at the moment cake should run, so cake has to wait.

Given the speedtests it also looks like your cake barely reaches the desired speed, so I would try cake set to, say, 40/40, or, for testing purposes, shed some significant load with 25/25. If latency under load gets significantly better, that would confirm this hypothesis experimentally.
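If you use luci-app-sqm, just lower the numbers there; from the shell it would be roughly this (assuming the queue is the first sqm section):

  uci set sqm.@queue[0].download='40000'   # kbit/s
  uci set sqm.@queue[0].upload='40000'     # kbit/s
  uci commit sqm
  /etc/init.d/sqm restart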

Agree.

I was quite surprised...

I'll test it.

It appears to me that something in CAKE has become a LOT less CPU efficient over the last few releases.

I have experienced CAKE performance in general degrading significantly over the last releases on a slightly faster 2C/4T CPU (MT7621AT). With 19.07 I was getting ~175 Mbps, with 21.02 ~140 Mbps and with a recent 5.10 kernel snapshot ~110 Mbps tops.

I've tried all combinations of packet steering and irqbalance active/inactive with no substantial variation in performance. Same device, same SQM settings. CPU utilization usually shows at least one "core" (thread) on each real core pegged high in htop during Waveform tests - no surprise there, it's running out of CPU. I have not yet explored manually assigning IRQs to cores (instead of leaving it to irqbalance), but the trend over the last releases is there regardless.
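If I do get around to that, my understanding is it boils down to reading /proc/interrupts and writing a CPU bitmask to the IRQ's smp_affinity, roughly like this (the IRQ number is only an example, I'd have to look up the real one on the ER-X):

  cat /proc/interrupts                  # find the ethernet controller's IRQ number
  echo 2 > /proc/irq/22/smp_affinity    # example: pin IRQ 22 to CPU1 (bitmask 0x2)
  cat /proc/irq/22/smp_affinity         # verify the new affinity mask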

Thanks for the report, my router is still on kernel 4.XX and OpenWrt 19, so I might simply not see that fall-out in my own testing yet.

Now, since cake has always been CPU-hungry, I assume it is not cake itself that changed, but maybe something else got more resource-hungry, and now sharing a core with cake does not work as well as it did.

If you could pinpoint a commit that introduced the changed behavior it would be most excellent.
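In case you (or anyone) ever feel adventurous, the mechanics would be the usual git bisect run against the OpenWrt source tree; just a sketch, and admittedly painful, since every step means building and flashing a test image:

  git clone https://git.openwrt.org/openwrt/openwrt.git && cd openwrt
  git bisect start
  git bisect bad v21.02.1      # SQM throughput is worse here
  git bisect good v19.07.8     # SQM throughput was fine here
  # build and flash an image for the commit git suggests, test, then mark it:
  git bisect good              # ...or "git bisect bad", and repeat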

Recap -

Testing a 50/50 connection with an Archer C7 v2 (720 MHz CPU) running 21.02.1

Fq_codel with simplest_tbf shaped at 49
CPU peak - 86%

Cake with piece_of_cake shaped at 49
CPU peak - 93%

Cake with piece_of_cake shaped at 40
CPU peak - 79%

Cake with piece_of_cake shaped at 25
CPU peak - 54%

I would say this supports our hypothesis that cake at 50/50 is too much for your router with the current kernel. Now you need to figure out what you prefer: higher throughput with fewer bells and whistles (TBF), or the full cake experience at lower throughput (I would guess that 30, maybe 35 Mbps should still work while keeping latency close to the unloaded ~6 ms).

I feel a bit guilty as I write this, as it truly would be most excellent, but I don't have the intestinal fortitude to attempt that much bisection and flashing. I had a suspicion it was a DSA gift that keeps on giving with MT7621, until I saw reports of CAKE slowdown on non-DSA targets. I'm just throwing random idea darts at a board, but I have also wondered if it has something to do with CAKE going upstream as of the 4.19 kernel, since 19.07 is on 4.14 while 21.02 and later are on 5.4 and later. Perhaps the pre-21.02 CAKE package had better-optimized source code, or, as a separate package, was compiled with more aggressive compiler flags than are used when CAKE is built with the kernel? I don't know - I'm just speculating with no basis from testing. I've never even tried building OpenWrt myself yet, just to give you an idea of my expertise level (low).

OldNavyGuy's testing does indeed look like a CPU cycle shortage. One of my ISP options recently offered too good a rate to pass up for 2 years of 400/20 service, so I took them up on it. With SQM on my ER-X gateway that's only 100/20 with recent snapshots, plenty fast enough for me, but just knowing that extra bandwidth could be had (it's like a disease, isn't it? LOL)... I'm really thinking hard now about a NanoPi R4S, and then either turning the ER-X into an over-spec'd managed switch or stuffing the slightly larger GS308T smart switch I have on hand into my too-tiny telecom cabinet. But the ER-X is already mounted in there, so... lazy :wink:

Out of curiosity, I rolled back to 19.07.8, and will run the tests again later tonight.

19.07.8 results...

Fq_codel with simplest_tbf shaped at 49
CPU peak - 79% (7% less than 21.02.1)

Cake with piece_of_cake shaped at 49
CPU peak - 73% (20% less than 21.02.1)

Cake with piece_of_cake shaped at 40
CPU peak - 67% (12% less than 21.02.1)

Cake with piece_of_cake shaped at 25
CPU peak - 49% (5% less than 21.02.1)

So the differences are not that huge (6 ms of added average delay on a download might still be down to chance). The CPU load however does seem to indicate that something got more expensive in OpenWrt > 19, but it is unclear what. Could be anything from sch_cake, to the kernel in general, to more stuff running in parallel by default...

BTW, even if I switch from OpenWrt 19 to 21.02 or even a snapshot, I do not really expect to see these issues, as my turris omnia was capable of traffic shaping at ~550/550 Mbps of bidirectionally saturating traffic, but my access link is only 116/40 so I will not easily notice drops in traffic-shaping throughput.

htop (at rest) 19.07.8 -

root@OpenWrt:~# htop

  CPU[|||||                                                           6.5%]   Tasks: 17, 0 thr; 1 running
  Mem[||||||||||||||||||||                                      21.3M/122M]   Load average: 0.48 0.44 0.18
  Swp[                                                               0K/0K]   Uptime: 00:02:37

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 2744 root       20   0  2148  1916  1020 R  2.0  1.5  0:00.76 htop
 2735 root       20   0  1144   836   740 S  1.3  0.7  0:00.86 /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
 2077 root       20   0  1780  1280  1144 S  0.7  1.0  0:01.59 /usr/sbin/hostapd -s -P /var/run/wifi-phy1.pid -B /var/run/hostapd-phy1.conf
 2277 root       20   0  1780  1288  1144 S  0.7  1.0  0:00.53 /usr/sbin/hostapd -s -P /var/run/wifi-phy0.pid -B /var/run/hostapd-phy0.conf
    1 root       20   0  1560   920   788 S  0.0  0.7  0:01.18 /sbin/procd
  553 root       20   0  1220   800   728 S  0.0  0.6  0:00.12 /sbin/ubusd
  554 root       20   0   920   600   568 S  0.0  0.5  0:00.05 /sbin/askfirst /usr/libexec/login.sh
  571 root       20   0  1024   716   672 S  0.0  0.6  0:01.20 /sbin/urngd
  958 root       20   0  2032  1112   948 S  0.0  0.9  0:00.22 /sbin/rpcd -s /var/run/ubus.sock -t 30
 1048 root       20   0  1076   824   788 S  0.0  0.7  0:00.00 /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
 1105 root       20   0  1744  1040   884 S  0.0  0.8  0:00.23 /sbin/netifd
 1366 root       20   0  1312   832   756 S  0.0  0.7  0:00.05 /usr/sbin/uhttpd -f -h /www -r OpenWrt -x /cgi-bin -t 60 -T 30 -k 20 -A 1 -n 3 -N 100 -R -p 0.
 1592 root       20   0  1208   892   856 S  0.0  0.7  0:00.00 udhcpc -p /var/run/udhcpc-eth0.2.pid -s /lib/netifd/dhcp.script -f -t 0 -i eth0.2 -x hostname:
 1976 root        5 -15  1212  1028   988 S  0.0  0.8  0:00.00 /usr/sbin/ntpd -n -N -l -S /usr/sbin/ntpd-hotplug -p time.nist.gov
 2661 dnsmasq    20   0  1344   920   832 S  0.0  0.7  0:00.05 /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.
 2721 root       20   0  1244   760   692 S  0.0  0.6  0:00.00 /sbin/logd -S 64
 2736 root       20   0  1216   904   856 S  0.0  0.7  0:00.01 -ash

htop (at rest) 21.02.1 -

root@OpenWrt:~# htop

  CPU[||||                         7.1%] Tasks: 17, 0 thr, 30 kthr; 1 running
  Mem[||||||||||||||         29.2M/121M] Load average: 0.04 0.28 0.16
  Swp[                            0K/0K] Uptime: 00:05:18

  PID USER       PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 4038 root        20   0  2192  1952  1060 R  3.6  1.6  0:00.50 htop
    1 root        20   0  1628   996   848 S  0.0  0.8  0:01.08 /sbin/procd
  639 ubus        20   0  1260   844   768 S  0.0  0.7  0:00.13 /sbin/ubusd
  640 root        20   0   940   624   592 S  0.0  0.5  0:00.03 /sbin/askfirst /
  674 root        20   0  1044   672   628 S  0.0  0.5  0:01.33 /sbin/urngd
 1166 logd        20   0  1264   816   712 S  0.0  0.7  0:00.08 /sbin/logd -S 64
 1235 root        20   0  2084  1160  1000 S  0.0  0.9  0:00.25 /sbin/rpcd -s /v
 1454 root        20   0  1148   896   856 S  0.0  0.7  0:00.00 /usr/sbin/dropbe
 1552 root        20   0  4316  2016  1840 R  0.0  1.6  0:09.35 /usr/sbin/hostap
 1553 root        20   0  4192  1688  1584 S  0.0  1.4  0:00.14 /usr/sbin/wpa_su
 1615 root        20   0  1784  1088   940 S  0.0  0.9  0:00.26 /sbin/netifd
 1970 root        20   0  3936  1984  1848 S  0.0  1.6  0:00.31 /usr/sbin/uhttpd
 2154 root        20   0  1252   936   900 S  0.0  0.8  0:00.00 udhcpc -p /var/r
 2894 root         5 -15  1256   944   904 S  0.0  0.8  0:00.01 /usr/sbin/ntpd -
 3994 dnsmasq     20   0  1396   972   888 S  0.0  0.8  0:00.04 /usr/sbin/dnsmas
 4028 root        20   0  1208   904   808 S  0.0  0.7  0:00.81 /usr/sbin/dropbe
 4029 root        20   0  1260  1008   964 S  0.0  0.8  0:00.01 -ash

Thanks!

I was not thinking of persistently higher CPU load, but rather that a few processing steps, e.g. somewhere in the kernel's network stack, might have gotten a bit slower; you would only see these higher costs when there is actual network load, say when comparing speedtests with SQM disabled between the two versions. (I like fast.com as a load generator for such tests, as in its configuration options you can pin the test down by setting minimum and maximum to the same values, say 4 parallel flows and 30 seconds duration.) And as always, monitoring %SIRQ and %IDLE would be the most relevant (you can configure htop to use color codes for the different components of the CPU load).
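A quick way to watch those two numbers during a test on OpenWrt, without installing anything extra, is busybox top in batch mode (sample count and interval are arbitrary, and the exact options depend on how busybox was built):

  # one CPU summary line per second for 60 seconds while the speedtest runs
  top -b -n 60 -d 1 | grep '^CPU:'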

As you might notice, I have no useful hypothesis yet about what might have changed, so this is a bit of a "fishing expedition".

Actually things have been working fairly well in 21.02.1.

I thought the results overall were interesting... one thing I noticed is htop showing 30 kernel threads in 21.02.1, versus none shown in 19.07.8.

No clue what those are doing, or how it may (or may not) affect performance.

Appreciate your feedback.

Thanks!

Thanks for doing those tests. You are seeing a lot less difference in CAKE SQM throughput capability (or CPU usage) on your Archer C7 than I with my ER-X from 19.07 to 21.02 and current snapshot.

I'm using VLANs on my ER-X to manage segregated lan, iot and guest networks on different subnets, distributed over wired backhaul to all-in-ones set up as dumb APs (and did the same on 19.07). DSA could yet be a possible culprit on MT7621. The DSA migration has been a bit bumpy.

I think, but I could be wrong here, that with DSA there are now new pitfalls where one ends up doing things in software instead of in hardware simply by the choice of how one configures a desired bridge... The beauty of DSA is that with it a switch is configured using the same tools as a software bridge would be, but that means one needs to take care not to end up with a software bridge by accident. That, or I misunderstood something (quite likely)...
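For reference, the simple 21.02/DSA-style bridge plus VLAN config in /etc/config/network looks roughly like the following (port names and VLAN ID are just examples, and whether the VLAN filtering actually lands in the switch hardware depends on the target and driver, so treat it as a sketch):

  config device
          option name 'br-lan'
          option type 'bridge'
          list ports 'lan1'
          list ports 'lan2'
          list ports 'lan3'

  config bridge-vlan
          option device 'br-lan'
          option vlan '10'          # e.g. an iot VLAN
          list ports 'lan1:t'       # tagged towards a dumb AP
          list ports 'lan2:u*'      # untagged + PVID on this port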

I'll add taking a second look at my bridge set-up to my to do list. I was so happy to get things working using the "mini-DSA tutorial" thread shortly after it was first posted that I just stopped there. It's probably past time I re-read that whole thread for nuggets of wisdom added after I got my set-up working.

I do measure just over 500 Mbps routed over my lan with iperf3 using no offloading (software or hardware), which seems about right for an ER-X. However, that doesn't mean I haven't fallen into some DSA software pit or other I need to straighten out. Appreciate the tip.
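For anyone wanting to reproduce that kind of number: I just run iperf3 as a server on a wired host in one subnet and the client in another, so the traffic is actually routed through the ER-X (addresses are placeholders):

  # server, on a host behind the router
  iperf3 -s

  # client, on a host in a different subnet so the stream crosses the router
  iperf3 -c 192.168.1.100 -t 30        # one direction
  iperf3 -c 192.168.1.100 -t 30 -R     # -R reverses it (server sends)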

I did a "factory reset" on a recent kernel 5.10 snapshot, installed only SQM and set up a simple direct to WAN network with a hardwired PC. Same results. Not getting better than 100-110 Mbps with CAKE on the MT7621 ER-X.

So that at least rules out my VLAN set up as the culprit for the slow down.
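The steps were essentially just a reset and a minimal SQM install, roughly (the needed kmods come in as dependencies):

  firstboot -y && reboot           # wipe settings back to defaults
  # after reboot and basic WAN setup:
  opkg update
  opkg install luci-app-sqm        # pulls in sqm-scripts and required kmods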

I also think it's a bit odd that I can't get anywhere near line rate for upload. I am on a 400/20 plan that is slightly over-provisioned on upload to ~23 Mbps, and the best I can get with CAKE before latency starts getting bad fast is around 13-15 Mbps. This isn't new - I just forgot to mention it earlier in this thread.

Weird thing about the upload is that while fq_codel/simple can handle download up to ~150 Mbps, it is actually slower than CAKE on upload (in the 7-9 Mbps range).

I'm using a 24x8 DOCSIS 3.0 MB7621 modem. All 24 downstream channels are bonded and locked on QAM256 but only 3 upstream channels are bonded and they are ATDMA channels.