fq_codel_fast helpers, tests, and testers?

An overload would be an unresponsive flow. 'ping -f -s 1500 somewhere', for example, can be used to do you in, or you can use a udp flood from a variety of tools like netperf, iperf, or pktgen. Or use any of the udp_flood tests in flent --list-tests. I've long meant to combine these into a rrul + flood test more directly...

It turns out ce_threshold was not a failed experiment at Google, but is used for their dctcp-like implementation of bbr (we think). I was going to just rename it to sce_threshold before the l4s/sce debate blew up in the ietf. My principal reason for wanting to fiddle with it is to look at the cpu impact of mangling the ecn field that often, and secondarily to explore the impact of gso on or off in fq_codel. gso/gro "superpackets" have a tendency to bloat the aqm and hurt fq, but also save on cpu. Third is to actually try out the existing sce-enabled tcps....

As for fixing tc, simplest way is to just copy the tc/*fq_codel.c file and change it to look for fq_codel_fast on the callback. Honestly I figured it was gonna be saner to come up with another name for the qdisc in the first place. fqc perhaps. Although, if you say it aloud, it isn't pg-13. :slight_smile: REALLY BAD at names here, feel free to suggest a better name!

flent has means to track and plot cpu stats on all the hosts on the path. There is a flent package for openwrt with the needed tools. You set up .ssh/config to get to 'em and do a --te=cpu_stats_hosts=hostA,hostB,hostC

The tools have a tendency to heisenbug the tests, and may not have been tested in a while. A new bug introduced in tc's batch facility breaks the c version of the tc monitoring tool (there's a shell version also).

As for the statistical significance of your findings so far, well, it's barely above the noise...

Yer not hijacking the thread at all.

I really LIKE explaining this stuff to people with different perspectives. I've been at this so long that I don't know what other people know, and having a chance to write it down, and get a little help
on exploring new ideas, and making stuff faster in general, really helps. I kind of have a lonely life in this; the last time I got to discuss M/M/1 queue theory was a year or three back... I really wish more folk had essentially memorized https://archive.org/details/queueingsystems01klei and https://archive.org/details/queueingsystems02klei as covid-bloat is queue theory 101....

The original idea for fq_codel fast was to speed it up as much as possible, then layer in two needed
ideas, and then... find some way to create a multi-core shaper. htb has severe scaling problems, ebpf is not particularly fast, and I thought, at the time, that I was starting to get an inkling as to how to make
a multicore shaper "just work", then got distracted by other things. The timing here is good in that we have time to play here and in the wifi stack before the next major openwrt release, so long as we get enough folk helping out.

I hope to get back on this seriously by the weekend. I just cleared a ton of space in my lab to get
the ath10k and mt76 stuff back online along with a new x86 embedded product, updated a couple servers to ubuntu 20.04 (and then blew the power circuit I was on....)

I actually tried that first (I also had to make one edit to the Makefile and several in iproute2/include/uapi/pkt_sched.h - which looks the same as the one in your repo btw). My modded "tc" compiles, installs, and runs, but doesn't like fq_codel_fast. That's when I went back and found the ce_threshold commit in your repo (probably should have looked there first - live and learn).

I still think I have a shot at making tc work better with fq_codel_fast but anything I come up with will be hackish for now (especially if there are two variants of pkt_sched.h). Constructive criticism, change requests, or other oversight is welcome.

Even for a "proof of concept" that might not go anywhere, I think it is time and I would like to make fq_codel_fast distinct from fq_codel. I'll take a shot at it and use fqcf as a temporary name. The name is cosmetic and can be changed any time.

EDIT: I will not change the name. There would be substantial changes to your repo not to mention your repo's name. Obviously you can change the name at any time - I'll do my best to keep up. I will attempt to separate it a bit more from fq_codel than it is now.

You should know that for personal privacy reasons, I choose not to identify myself in public online forums. The consequence is that my contributions are not directly included in openwrt. With respect to "fqcf," I'm not bothered about attribution to me, so if I do end up making a useful contribution, any form of "copy and paste" is acceptable (a respected and known developer should review it).

Thank you for taking the time to explain. I'm happy to learn and participate. I expect others will join in as their interest and time permits.

it creates a fq_codel_fast.so lib you need to package or copy over....

Also, although I had removed ce_threshold in an earlier commit, I put it back in the commit for the gso splitting stuff.

For reference, in your existing test setup, for measuring cpu, you might want to try cake. Over on the cake mailing list toke just put out a new patch that might speed it up a bit, and certainly does wonders for wireguard.... If yer feeling really ambitious, give dualpi a shot.

ya, but it doesn't look like it's all there...

I'm about to "manually" revert it to put it back such that it matches fq_codel. Here is why:

After editing pkt_sched.h to match for both fq_codel_fast and tc, and making some name updates in fq_codel_fast (adding _fast where applicable), both tc and fq_codel_fast compile and install. However, based on:

r7500v2 # tc qdisc add dev eth0 root handle 1: mq
r7500v2 # tc qdisc add dev eth0 parent 1:1 fq_codel
r7500v2 # tc qdisc
qdisc noqueue 0: dev lo root refcnt 2 
qdisc mq 1: dev eth0 root 
qdisc fq_codel 800a: dev eth0 parent 1:1 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn 
r7500v2 # tc qdisc replace dev eth0 parent 1:1 fq_codel_fast
RTNETLINK answers: Invalid argument
r7500v2 # tc qdisc replace dev eth0 parent 1:1 fq_codel_fast ce_threshold 2.5ms
RTNETLINK answers: Invalid argument
r7500v2 # tc qdisc replace dev eth0 parent 1:1 fq_codel_fast ce_threshol
What is "ce_threshol"?
Usage: ... fq_codel_fast [ limit PACKETS ]
[ flows NUMBER ]
[ memory_limit BYTES ]
[ target TIME ]
[ interval TIME ]
[ quantum BYTES ] [ [no]ecn ]
[ ce_threshold TIME ]

I think it's failing due to the still missing ce_threshold components in sch_fq_codel_fast.c, codel.h, and codel_impl.h.

I'll put it back and see where that gets me. Assuming this fixes the above issue but you don't want something that I put back (i.e. ce_mark), we can go from that point...

EDIT0: I'll double check the install libs as well, thx for the suggestion.

I checked this: there is no *.so for any qdisc. tc looks like a static bin, and "strings tc | grep codel" returns fq_codel_fast strings. There is an "m_xt.so" (xtables) installed with tc, but I haven't found anything here yet.

EDIT1: ok, I finally noticed "sce" vs "ce"... I'll have to look it over again. Variables with "ce_threshold" in the name got me confused.

If there is a problem with sch_fq_codel_fast.c or your headers, I haven't found it yet. Google results for "RTNETLINK answers: Invalid argument" suggest checking that the "kernel supports" the qdisc - but this is unhelpful given that I have the module and it's loaded.

nitpick: ping -f is not an unresponsive flood.
from manpage "outputs packets as fast as they come back or one hundred times per second, whichever is more"

it is responsive to latency for rtt <= 10ms and plateaus at 100pps for rtt >= 10ms

ok, I might have stumbled on a clue.

tc works with fq_codel except for

r7500v2 # tc qdisc change dev eth0 root fq_codel limit 10241
r7500v2 # tc qdisc
qdisc noqueue 0: dev lo root refcnt 2 
qdisc fq_codel 8014: dev eth0 root refcnt 9 limit 10241p flows 1024 quantum 1514 target 5.0ms interval 99us memory_limit 4Mb ecn 
r7500v2 # tc qdisc change dev eth0 root fq_codel flows 1024
RTNETLINK answers: Invalid argument

so I think I'll try some printk's in sch_fq_codel_fast init and change functions.

perhaps a minor bug carried over from sch_fq_codel...

EDIT: changing

err = nla_parse_nested(tb, TCA_FQ_CODEL_FAST_MAX, opt,
                       fq_codel_fast_policy, NULL);

to

err = nla_parse_nested_deprecated(tb, TCA_FQ_CODEL_FAST_MAX, opt,
                                  fq_codel_fast_policy, NULL);

EDIT: it looks like using nla_parse_nested_deprecated is all that is needed - presumably because newer kernels made nla_parse_nested do strict validation, with the old behavior kept under the _deprecated name (the now edited out entry in this post regarding a kernel panic was my bad):

r7500v2 # tc qdisc change dev eth0 root fq_codel_fast ce_threshold 2.5ms
r7500v2 # tc qdisc
qdisc noqueue 0: dev lo root refcnt 2 
qdisc fq_codel_fast 8004: dev eth0 root refcnt 9 limit 10240p flows 1024 quantum 1514 target 5.0ms ce_threshold 2.5ms interval 100.0ms memory_limit 4Mb ecn 

I sure hope someone with more kernel experience joins in soon... clearly I'm kinda slow at it.

I have to be "domestic" for a bit. Will return in a day or two (distracted by other interests).

@anon98444528 I have a bit of spare time this weekend. Could we maybe spin up a private repo for a while, since yer shy, and see if I can get caught up with you?

Hmm, my public repo with the work I've done to date is here. I should have mentioned that the tc and fq_codel_fast updates were pushed to that repo last week. Feel free to clone, fork, and/or cherry pick from that into your own.

I'm not sure a private repo is necessary but I can try that if it is easier for you. Otherwise, I'm happy to track your public repo and post whatever I do into my own so it is available for anyone to access.

FWIW, I have tried ce_threshold 2.5ms with the modified tc and fq_codel_fast. No crashes but I can't (yet) determine if it is doing something (a knowledge gap issue on my part). I'm still working through flent/netperf and my home network test setup.

I thought a private repo was easier for you. Me, I prefer to do the maximum amount of work in public.

The two things under test for that are the cpu cost and network impact of gso/gro splitting, and what happens when you scribble on the ECT(1) bit a lot on normal traffic (both cpu and network effects).

To actually see a result from fiddling with that bit, transports that are aware of it need to be used. The SCE related work for that is here: https://github.com/chromi/sce

The competition for the bit is called L4S in the IETF.

However, in both cases, fiddling with the ECT(1) bit has cpu costs and network impacts (most recently found and fixed was a bug in vpn decapsulations), especially on cheap embedded hw, that nobody has taken a hard look at yet...

... and it's easier (and safer) to test SCE than L4S, as the present L4S implementation is too scary to use.

ha, every time I read the l4s vs the sce link I realize my knowledge "gap" is more of a canyon. It will take me time.

after looking over your tc-adv repo and what is in chromi's iproute-sce, I should probably review what I did for tc to see if it's "sane"

BTW, is there a way to pass generic netperf "test-specific" parameters through flent? I'd like to add something like -- -P 5002 to flent's call to netperf so I can navigate through various firewalls without having to completely drop them.

EDIT: nvmd, it's easy enough to hack flent to get this for one data socket, but for tests like rrul which run several simultaneous data sockets, it's not really worth the effort (both to modify flent and to configure the firewall on the box running netserver).

there is also a metric ton of "newspeak" with the l4s drafts. It makes me very crazy to navigate it, and is intentionally misleading to non-experts. Producing a glossary of translations has been on my mind.

"Scalable congestion control" - who could be against something that's scalable? Except that the implementation... isn't.

"Low Latency Queue" - which is what a queue theorist would call a "priority queue". But because the L4S folk are so scared of net neutrality folk, they went around renaming things like this... I note that I have zero problem with an admission controlled "priority queue"...

"Classic Traffic" - all known forms of traffic today

And so on.... It is double-plus ungood that l4s has got as far as it has in a SDO without any running code until recently.

I note that the L4S folks also believe that they can offer a priority queue without admission control, which is at best naive and overly optimistic, at worst simply nuts....

hmm, flent --te=cpu_stats_hosts=root@<router.ip> isn't working for me.

Perhaps I'm missing something?

On the router I have

r7500v2 # opkg list | grep flent
flent-tools - 1.2.2-1

EDIT0: opkg install coreutils-sleep takes care of the sleep invalid number error (due to busybox sleep not supporting times less than one second). Plots of cpu stats are still empty tho - sigh, always something to do.

EDIT1: so it looks like the "cpu stats" data is going to stderr

EDIT2: htop shows the cpus as being under load during a flent test and I can get cpu use from /proc/stat...

EDIT3: flent using cpu_stats_hosts returns:

(p382.ob) [12] $ flent -H XXX.XXX.45.127 --te=cpu_stats_hosts=root@XXX.XXX.43.26 rrul > flent-$(date +"%Y%m%d-%H%M").out 2>&1
(p382.ob) [15] $ more flent-20200608-0707.out 
Started Flent 1.3.2 using Python 3.8.2.
Starting rrul test. Expected run time: 70 seconds.
WARNING: Command produced no valid data.
Runner class: CpuStatsRunner
Command: /bin/bash /home/n/.pyenv/versions/3.8.2/envs/p382.ob/lib/python3.8/site-packages/flent/scripts/stat_iterate.sh -I 0.20 -c 350 -H root@XXX.XXX.43.26
Return code: 0
Stdout: ---
Time: 1591614468.
Time: 1591614468.
41 42 0.0238095
Time: 1591614468.
42 43 0.0232558
Data file written to ./rrul-2020-06-08T070745.324701.flent.gz.                  
Summary of rrul test run from 2020-06-08 11:07:45.324701                        
                             avg       median          # data pts               
 Ping (ms) ICMP   :         8.36         8.12 ms              350               
 Ping (ms) UDP BE :         8.67         8.40 ms              350               
 Ping (ms) UDP BK :         8.91         8.69 ms              350               
 Ping (ms) UDP EF :         8.94         9.03 ms              350               
 Ping (ms) avg    :         8.72          N/A ms              350               
 TCP download BE  :       142.66       140.07 Mbits/s         350               
 TCP download BK  :       140.73       137.50 Mbits/s         350               
 TCP download CS5 :       146.57       149.64 Mbits/s         350               
 TCP download EF  :       127.06       124.16 Mbits/s         350               
 TCP download avg :       139.25          N/A Mbits/s         350               
 TCP download sum :       557.02          N/A Mbits/s         350               
 TCP totals       :      1157.45          N/A Mbits/s         350               
 TCP upload BE    :       156.58       161.03 Mbits/s         350               
 TCP upload BK    :       148.67       155.88 Mbits/s         350               
 TCP upload CS5   :       146.64       153.31 Mbits/s         350               
 TCP upload EF    :       148.54       157.00 Mbits/s         350               
 TCP upload avg   :       150.11          N/A Mbits/s         350               
 TCP upload sum   :       600.43          N/A Mbits/s         350               

so it looks like stat_iterate.sh is doing what it is supposed to do (return code 0) but upon returning a result to flent (runners.py, class ProcessRunner, method run at line 584):

        self.result = self.parse(self.out, self.err)                            
        if not self.result and not self.silent:                                 
            logger.warning("Command produced no valid data.",                   
                           extra={'runner': self})                              

flent takes issue with self.result... I'd use pdb but I have no idea yet where to put a breakpoint to catch this.

You get that error because Flent can't parse the output. In this case because the timestamp is missing the fractional second part.

Installing coreutils-date should fix this; I've bumped the flent-tools package version and added this and coreutils-sleep as dependencies.

Just a quick follow up. I did end up testing with flent after the above fixes but the results are not straightforward to me (I may be making it more complicated than it needs to be).

Regardless, shortly after some initial tests, my family requested "the better AP" be returned to regular service so I've stopped for now.

Ironically, it seems new issues related to the wifi on my r7500v2 have come up on master, likely from multiple sources (ath10k-ct firmware, mac80211 upgrades, hostapd changes, etc.), not least my own configuration tweaking (some adjustments to my VLAN config).

Regarding the testing with flent that I did manage to do, I'm uncertain about being able to measure a 5% cpu benefit "on overload" when the r7500v2 cpu is "maxed out" and I'm not somehow holding throughput constant. For example, I seem to reach a cpu limit (rather than say the switch or nics) if using "one" core (irqbalance turned off). However, if I turned on irqbalance, I can get an extra ~50 mbit/s through the r7500v2.

I hope to come back to this, but it could be a year or more given the way school for my children and COVID-19 are looking ATM.

Any news on fq_codel_fast? @dtaht