Fq_codel_fast helpers, tests, and testers?

Well, the elephant in the room is typically the actual shaper, so either HTB, TBF, or cake's built-in shaper. Once those are operating, the small changes in efficiency of the fq leaf qdiscs easily disappear into the noise. I am not saying they cannot be measured, just that you should not expect miracles in typical shaping situations. That said, I am quite curious how it compares against the references.
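
For concreteness, this is roughly the shape of the thing under discussion; a minimal sketch with a made-up interface and rate (sqm's simple.qos builds something along these lines, plus an ifb for the ingress side):

# HTB does the actual shaping, fq_codel_fast sits underneath as the leaf
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 300mbit
# note: an unpatched tc can attach fq_codel_fast by name but cannot pass it options
tc qdisc add dev eth0 parent 1:10 handle 110: fq_codel_fast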


The more I try to test on nmrh's cheap and cheerful home network (while the family is using the network), the more I think I am not going to demonstrate anything useful.

Here is my latest attempt, mostly to orient myself, and shared in the event others find parts of it helpful.

I don't recommend drawing any conclusions from the results below nor do I recommend testing this way.

### on a laptop running ubuntu: setting "350 up/down rate limit" & iperf server
sudo modprobe ifb numifbs=1
sudo ip link set dev ifb0 up
sudo tc qdisc add dev eno1 handle ffff: ingress
sudo tc filter add dev eno1 parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev ifb0
sudo tc qdisc add dev ifb0 root tbf rate 350mbit burst 1mbit latency 400ms
sudo tc qdisc add dev eno1 root tbf rate 350mbit burst 1mbit latency 400ms

iperf -s
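
For completeness, sanity checks and teardown for that laptop setup look roughly like this (just housekeeping, not part of the results above):

# confirm the shaper and the ingress redirect are in place
tc -s qdisc show dev eno1
tc -s qdisc show dev ifb0
tc filter show dev eno1 parent ffff:
# teardown between runs
sudo tc qdisc del dev eno1 root
sudo tc qdisc del dev eno1 ingress
sudo tc qdisc del dev ifb0 root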

### on r7500v2, mpstat and iperf -c commands for test results below
mpstat 2 30 > cpu-$(date +"%Y%m%d-%H%M").log&
iperf -c XXX.XXX.40.10 -d -t60 -i10
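
To reduce each mpstat log to one number afterwards, the %idle can be pulled from the "Average: all" line; a throwaway helper along these lines (not part of the results):

# %idle is the last column of mpstat's "Average: all" summary line
for f in cpu-*.log; do
    idle=$(awk '$1=="Average:" && $2=="all" {print $NF}' "$f")
    echo "$f idle=$idle"
done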

### laptop: 350 up/down shaping; r7500v2: sqm, fq_codel_fast, cpu-20200520-2144.log; 
### Note I got the sqm up/down limits from a prior test but see below for a test without sqm

root@r7500v2:~# cat /etc/config/sqm

config queue 'eth1'
        option qdisc_advanced '0'
        option interface 'eth0.2'
        option debug_logging '0'
        option verbosity '5'
        option linklayer 'ethernet'
        option overhead '22'
        option script 'simple.qos'
        option download '180000'
        option upload '300000'
        option qdisc 'fq_codel_fast'
        option enabled '1'

[  4] local XXX.XXX.40.10 port 5001 connected with XXX.XXX.40.5 port 56008
------------------------------------------------------------
Client connecting to XXX.XXX.40.5, TCP port 5001
TCP window size:  204 KByte (default)
------------------------------------------------------------
[  6] local XXX.XXX.40.10 port 52728 connected with XXX.XXX.40.5 port 5001
[  6]  0.0-60.0 sec  1.21 GBytes   173 Mbits/sec
[  4]  0.0-60.0 sec  2.06 GBytes   294 Mbits/sec

Linux 5.4.41 (r7500v2)  05/20/20        _armv7l_        (2 CPU)

21:44:24     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    2.59    0.02   20.25    0.00    0.00   14.90  0.00    0.00    0.00   62.25

### laptop: 350 up/down shaping; r7500v2: sqm, fq_codel, cpu-20200520-2200.log
root@r7500v2:~# cat /etc/config/sqm

config queue 'eth1'
        option qdisc_advanced '0'
        option interface 'eth0.2'
        option debug_logging '0'
        option verbosity '5'
        option linklayer 'ethernet'
        option overhead '22'
        option script 'simple.qos'
        option download '180000'
        option upload '300000'
        option qdisc 'fq_codel'
        option enabled '1'

[  4] local XXX.XXX.40.10 port 5001 connected with XXX.XXX.40.5 port 56010
------------------------------------------------------------
Client connecting to XXX.XXX.40.5, TCP port 5001
TCP window size:  204 KByte (default)
------------------------------------------------------------
[  6] local XXX.XXX.40.10 port 52912 connected with XXX.XXX.40.5 port 5001
[  6]  0.0-60.0 sec  1.21 GBytes   173 Mbits/sec
[  4]  0.0-60.1 sec  2.05 GBytes   294 Mbits/sec

Linux 5.4.41 (r7500v2)  05/20/20        _armv7l_        (2 CPU)

22:00:54     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    2.35    0.03   19.86    0.00    0.00   15.39    0.00    0.00    0.00   62.37

### laptop: 350 up/down shaping; r7500v2: no sqm, fq_codel, cpu-20200520-2219.log

[  4] local XXX.XXX.40.10 port 5001 connected with XXX.XXX.40.5 port 56012
------------------------------------------------------------
Client connecting to XXX.XXX.40.5, TCP port 5001
TCP window size:  246 KByte (default)
------------------------------------------------------------
[  6] local XXX.XXX.40.10 port 52982 connected with XXX.XXX.40.5 port 5001
[  6]  0.0-60.0 sec  1.38 GBytes   198 Mbits/sec
[  4]  0.0-60.1 sec  2.34 GBytes   334 Mbits/sec
Linux 5.4.41 (r7500v2)  05/20/20        _armv7l_        (2 CPU)

22:19:12     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    2.03    0.00   29.46    0.00    0.00   27.13    0.00    0.00    0.00   41.37

### laptop: no shaping; r7500v2: fq_codel_fast, no sqm, cpu-20200520-2242.log

[  4] local XXX.XXX.40.10 port 5001 connected with XXX.XXX.40.5 port 56016
------------------------------------------------------------
Client connecting to XXX.XXX.40.5, TCP port 5001
TCP window size:  187 KByte (default)
------------------------------------------------------------
[  6] local XXX.XXX.40.10 port 53094 connected with XXX.XXX.40.5 port 5001
[  4]  0.0-60.0 sec  4.64 GBytes   664 Mbits/sec
[  6]  0.0-60.0 sec  4.44 GBytes   636 Mbits/sec

Linux 5.4.41 (r7500v2)  05/20/20        _armv7l_        (2 CPU)

22:42:18     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    4.26    0.01   41.45    0.00    0.00   46.50    0.00    0.00    0.00    7.78

### laptop: no shaping; r7500v2: fq_codel, no sqm, cpu-20200520-2232.log

[  4] local XXX.XXX.40.10 port 5001 connected with XXX.XXX.40.5 port 56014
------------------------------------------------------------
Client connecting to XXX.XXX.40.5, TCP port 5001
TCP window size:  382 KByte (default)
------------------------------------------------------------
[  6] local XXX.XXX.40.10 port 52990 connected with XXX.XXX.40.5 port 5001
[ ID] Interval       Transfer     Bandwidth
[  6]  0.0-60.0 sec  5.32 GBytes   762 Mbits/sec
[  4]  0.0-60.0 sec  4.16 GBytes   596 Mbits/sec

Linux 5.4.41 (r7500v2)  05/20/20        _armv7l_        (2 CPU)

22:32:44     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    4.90    0.02   43.11    0.00    0.00   48.10    0.00    0.00    0.00    3.87

Asking myself why I got the results above leads me to believe that testing between a laptop and the GW/AP with sqm does not simulate "the internet" well at all. Perhaps there are not enough buffers between the two devices, perhaps I made a poor choice of configuration to rate limit the connection, or all of the above.
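
One thing I may try later is adding some delay on the laptop so the path looks less like a LAN, e.g. swapping the plain tbf for netem's built-in rate plus delay (an untested sketch, numbers picked out of thin air):

# ~20ms each way so TCP behaves more like it would across an ISP path
sudo tc qdisc replace dev ifb0 root netem rate 350mbit delay 20ms limit 10000
sudo tc qdisc replace dev eno1 root netem rate 350mbit delay 20ms limit 10000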

Comparing fq_codel with fq_codel_fast without sqm on a home network like this might be informative, but only under more controlled conditions (no family using the AP that serves as the switch between the laptop and the r7500v2) and with some statistics (lots of replicate experiments).


My current level of (mis)understanding:

I am curious about the following.

Since it looks like the return value in fq_codel_drop was set to avoid calling this part of the code in fq_codel_enqueue:

        if (ret == idx) {                                                
                qdisc_tree_reduce_backlog(sch, prev_qlen - 1,                   
                                          prev_backlog - pkt_len);              
                return NET_XMIT_CN;                                             
        }                                                                       

I did the following to see what would happen if that part of the code was called (I think as intended in fq_codel for dropping from the current flow):

--- a/sch_fq_codel_fast.c                                                       
+++ b/sch_fq_codel_fast.c                                                       
@@ -148,16 +148,16 @@ static unsigned int fq_codel_drop(struct                  
        sch->qstats.drops += i;                                                 
        sch->qstats.backlog -= len;                                             
        sch->q.qlen -= i;                                                       
-       idx = 1055; // just ignore for now                                      
+       // idx = 1055; // just ignore for now                                   
        // idx = (q->flows - q->fat_flow) >> FIXME SOME_CORRECT_DEFINE          
-       return idx;                                                             
+       return 0; // success                                                    
 }                                                                              
                                                                                
 static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch,            
                            struct sk_buff **to_free)                           
 {                                                                              
        struct fq_codel_sched_data *q = qdisc_priv(sch);                        
-       unsigned int idx, prev_backlog, prev_qlen;                              
+       unsigned int idx, prev_backlog, prev_qlen, drop_current_flow = 0;       
        struct fq_codel_flow *flow;                                             
        int uninitialized_var(ret);                                             
        unsigned int pkt_len;                                                   
@@ -215,6 +215,7 @@ static int fq_codel_enqueue(struct sk_bu                    
        if(flow->backlog > q->fat_backlog) {                                    
                q->fat_flow = flow;                                             
                q->fat_backlog = flow->backlog;                                 
+               drop_current_flow = 1;                                          
        }                                                                       
                                                                                
        if (list_empty(&flow->flowchain)) {                                     
@@ -248,9 +249,10 @@ static int fq_codel_enqueue(struct sk_bu                   
         * If we dropped a packet for this flow, return NET_XMIT_CN,            
         * but in this case, our parents wont increase their backlogs.          
         */                                                                     
-       if (ret == idx) {                                                       
+       if (drop_current_flow) {                                                
                qdisc_tree_reduce_backlog(sch, prev_qlen - 1,                   
                                          prev_backlog - pkt_len);              
+               printk("dropping current flow idx: %u", idx);                   
                return NET_XMIT_CN;                                             
        }                                                                       
        qdisc_tree_reduce_backlog(sch, prev_qlen, prev_backlog);                

With the above patch, fq_codel_fast builds, loads, and runs on my r7500v2...

I can get a few drops (say 376 according to tc -d -s qdisc show dev ifb4eth0.2 after some runs with iperf) using fq_codel_fast but apparently not enough to trigger dropping from the current flow (i.e. no messages in dmesg). I guess with 1024 flows this should be expected.
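
If I wanted to provoke that path rather than wait for it, an unresponsive UDP flood well above the shaped rate should blow through the 10240-packet limit and force drops; something like the following (untested, rate picked out of thin air; the laptop side would need "iperf -s -u"):

# flood the shaped upload with more UDP than the shaper will pass
iperf -c XXX.XXX.40.10 -u -b 500M -t 30 -i 5
# then check for the printk and the per-qdisc drop counters
dmesg | tail
tc -d -s qdisc show dev eth0.2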

Since I appear to be spamming the OP's thread (and I'm not the person he's looking for), I'm going to take a break from this for a while; hopefully someone else will take an interest.

HTH


I got rather busy reopening Lupin Lodge for the Memorial Day weekend and haven't gotten back to this.

Yes, dropping from the current flow seemed like a statistically rare event and possibly not worth optimizing for. I wanted speed, speed, speed.

So, like, it doesn't crash? Can you post some tc -s output, and also try ce_threshold 2.5ms?


No worries, I just don't want to "take over" your thread by making too many posts that may or may not be what you're after. I'm not bothered if you don't respond for long periods.

No, I have not crashed it yet, but give me time. Like most users, I'm good at breaking programs from good coders. (I did think about setting the flows to 1 from 1024 - I'm pretty sure that would make a mess of things.)

Below are some tc stats from more "casual" tests. I do plan on trying flent as I expect its output will be of more use to you, but I'm still playing around with iperf and mpstat ATM.

I'd like to see if I can keep the iperf throughput constant enough and still measure a change in cpu utilization between fq_codel and fq_codel_fast. Unfortunately, my stats are no better than my coding. I know enough to recognize this as a multivariate problem in which the responses (cpu usage and iperf throughput) are correlated. Making a statistically sound comparison (like a t-test) between them requires some effort. I don't suppose flent is set up for this? Anyway, I'm hoping that keeping the iperf throughput sufficiently constant is enough to make a meaningful comparison of cpu usage.

BTW, while I can set ce_threshold to 2.5ms for fq_codel, I cannot do it for fq_codel_fast. I think this constraint is due to the "tc" command, which likely needs to be upgraded to be aware of fq_codel_fast. If that is not too difficult, I can make the change myself - but it will take me time.

EDIT: WRT ce_threshold, there is also this:

commit e7e3d08831ed3ebe5afb8b77f94a2e47fd4ccce2
Author: dave taht <dave.taht@gmail.com>
Date:   Wed Aug 29 00:26:10 2018 +0000

    Get rid of ce_threshold
    
    This was a failed experiment at google.

Details about the "casual" test are below.

r7500v2 # tc -s qdisc show dev ifb4eth0.2
qdisc htb 1: root refcnt 2 r2q 10 default 0x10 direct_packets_stat 0 direct_qlen 32
 Sent 9553980594 bytes 7316928 pkt (dropped 59, overlimits 396547 requeues 0) 
 backlog 0b 0p requeues 0
qdisc fq_codel_fast 110: parent 1:10 [Unknown qdisc, optlen=72] 
 Sent 9553980594 bytes 7316928 pkt (dropped 59, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0

# tests between r7500v2 in GW/AP mode (firewall enabled, SQM enabled, one 5 
# GHz radio on but nothing connected, etc) with its WAN port connected to a 
# switch.  Also connected to the switch is a laptop running "iperf -s".  (I'm 
# still playing with tbf on the laptop as described above, but I don't think I 
# need it and will likely stop unless I want to use "netem" for emulating 
# packet loss or something.)

# results above generated from several iperf commands similar to:
r7500v2 # mpstat -P ALL 2 30 > cpu-$(date +"%Y%m%d-%H%M").log&
r7500v2 # iperf -c XXX.XXX.45.137 -d -t60 -i10
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to XXX.XXX.45.137, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[  4] local XXX.XXX.45.101 port 52476 connected with XXX.XXX.45.137 port 5001
[  5] local XXX.XXX.45.101 port 5001 connected with XXX.XXX.45.137 port 34160
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   566 MBytes   475 Mbits/sec
[  5]  0.0-10.0 sec   516 MBytes   433 Mbits/sec
[  4] 10.0-20.0 sec   576 MBytes   483 Mbits/sec
[  5] 10.0-20.0 sec   514 MBytes   431 Mbits/sec
[  4] 20.0-30.0 sec   580 MBytes   486 Mbits/sec
[  5] 20.0-30.0 sec   513 MBytes   431 Mbits/sec
[  4] 30.0-40.0 sec   579 MBytes   486 Mbits/sec
[  5] 30.0-40.0 sec   533 MBytes   447 Mbits/sec
[  4] 40.0-50.0 sec   583 MBytes   489 Mbits/sec
[  5] 40.0-50.0 sec   504 MBytes   423 Mbits/sec
[  4] 50.0-60.0 sec   582 MBytes   488 Mbits/sec
[  4]  0.0-60.0 sec  3.38 GBytes   484 Mbits/sec
[  5] 50.0-60.0 sec   478 MBytes   401 Mbits/sec
[  5]  0.0-60.0 sec  2.99 GBytes   428 Mbits/sec
[SUM]  0.0-60.0 sec  3.49 GBytes   500 Mbits/sec
[3]-  Done                       mpstat -P ALL 2 30 1>cpu-$(...).log

# at this iperf throughput with SQM, the r7500v2 cpu idle is ~20%  so it is getting
# close to its limits under these conditions.

r7500v2 # cat /etc/config/sqm
config queue 'eth1'
	option qdisc_advanced '0'
	option interface 'eth0.2'
	option debug_logging '0'
	option verbosity '5'
	option linklayer 'ethernet'
	option overhead '22'
	option script 'simple.qos'
	option download '500000'
	option upload '500000'
	option qdisc 'fq_codel_fast'
	option enabled '1'


Less casual, but not rigorous

Tested between the r7500v2 WAN port (GW/AP, no SQM, iperf -c <laptop.ip> -t60 -i10) and a laptop using a tbf qdisc to control the iperf rate (laptop running iperf -s). The cpu % idle is an average over 50 seconds of 2-second samples, started 10 seconds after the iperf -c test begins, i.e. on the r7500v2:
sleep 10; mpstat -P ALL -o JSON 2 25 > cpu-${date}.log
From prior testing I observed that the cpu % idle tended to settle at a nominal value about 4-6 seconds after starting the iperf test.

id       Cpu (%idle)   iperf (mbps)   qdisc       id      Cpu (%idle)   iperf (mbps)   qdisc
1343     48.98         526            fq_codel    1417    46.11         526            fq_codel_fast
1353     52.98         526            fq_codel    1419    47.3          526            fq_codel_fast
1355     51.5          526            fq_codel    1423    50.65         526            fq_codel_fast
1356     51.5          526            fq_codel    1424    47.25         526            fq_codel_fast

max      53.0          526                                50.7          526
ave      51.2          526                                47.8          526
min      49.0          526                                46.1          526
stdev    1.7           0                                  2.0           0
95% CI   2.6           -                                  3.1           -
count    4             4                                  4             4

# tc -s qdisc for fq_codel:
r7500v2 # cat qdisc-stats-20200527-1358.log 
qdisc noqueue 0: dev lo root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc mq 0: dev eth0 root 
 Sent 54544731437 bytes 36093582 pkt (dropped 0, overlimits 0 requeues 4210) 
 backlog 0b 0p requeues 4210
qdisc fq_codel 0: dev eth0 parent :1 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn 
 Sent 54544731437 bytes 36093582 pkt (dropped 0, overlimits 0 requeues 4210) 
 backlog 0b 0p requeues 4210
  maxpacket 1514 drop_overlimit 0 new_flow_count 8814 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc mq 0: dev eth1 root 
 Sent 14346 bytes 84 pkt (dropped 0, overlimits 0 requeues 1) 
 backlog 0b 0p requeues 1
qdisc fq_codel 0: dev eth1 parent :1 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn 
 Sent 14346 bytes 84 pkt (dropped 0, overlimits 0 requeues 1) 
 backlog 0b 0p requeues 1
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-lan root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth1.1 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.2 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan0 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0

# tc -s qdisc for fq_codel_fast:
r7500v2 # cat qdisc-stats-20200527-1435.log 
qdisc noqueue 0: dev lo root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc mq 1: dev eth0 root 
 Sent 16514454481 bytes 10911281 pkt (dropped 0, overlimits 0 requeues 325) 
 backlog 0b 0p requeues 325
qdisc fq_codel_fast 8009: dev eth0 parent 1:1 [Unknown qdisc, optlen=72] 
 Sent 16514454481 bytes 10911281 pkt (dropped 0, overlimits 0 requeues 325) 
 backlog 0b 0p requeues 325
qdisc mq 0: dev eth1 root 
 Sent 15216 bytes 89 pkt (dropped 0, overlimits 0 requeues 1) 
 backlog 0b 0p requeues 1
qdisc fq_codel 0: dev eth1 parent :1 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn 
 Sent 15216 bytes 89 pkt (dropped 0, overlimits 0 requeues 1) 
 backlog 0b 0p requeues 1
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-lan root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth1.1 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev eth0.2 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan0 root refcnt 2 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0

For 8 observations in which iperf is held at 526 mbps during each observation, cpu % idle is 47.8 +/- 3.1 when using fq_codel_fast compared to 51.2 +/- 2.6 cpu % idle for fq_codel. This difference is not statistically significant. I'm guessing this may be an indication of what you mean by:

More than 4 observations per qdisc are required to show anything tho.

Would you mind describing what you mean by overload and how that might be achieved in a test? I'm going to take a look at flent now so feel free to describe testing "on overload" with flent...


An overload would be an unresponsive flow. 'ping -f -s 1500 somewhere', for example, can be used to do you in, or you can use a UDP flood from a variety of tools like netperf, iperf, or pktgen. Or use any of the udp_flood tests in flent --list-tests. I have long meant to combine these into a rrul + flood test more directly...
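
For instance, sticking with flent (hostname made up; the udp_flood tests assume netperf/netserver on both ends):

# see which flood tests this flent build ships with
flent --list-tests | grep -i flood
# run one of them against a netserver host
flent udp_flood -H netperf.example.lan -l 60 -t "fq_codel_fast overload"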

It turns out ce_threshold was not a failed experiment at google but used for their dctcp-like implementation of bbr (we think). I was going to just rename it to sce_threshold before the l4s/sce debate blew up in the ietf. My principal reason for wanting to fiddle with it is to look at the cpu impact of mangling the ecn field that often, and secondarily to explore the impact of gso on or off in fq_codel. gso/gro "superpackets" have a tendency to bloat the aqm and hurt fq, but also save on cpu. Third is to actually try out the existing sce-enabled tcps....

As for fixing tc, the simplest way is to just copy the tc/*fq_codel.c file and change it to look for fq_codel_fast on the callback. Honestly I figured it was gonna be saner to come up with another name for the qdisc in the first place. fqc perhaps. Although, if you say it aloud, it isn't pg-13. :) REALLY BAD at names here, feel free to suggest a better name!
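
Mechanically, that amounts to something like the following sketch (it assumes the matching TCA_FQ_CODEL_FAST_* names are already present in iproute2's copy of pkt_sched.h, which is the part that has to be hand-edited):

# clone the fq_codel helper and point it at the new qdisc name
cd iproute2/tc
cp q_fq_codel.c q_fq_codel_fast.c
sed -i 's/fq_codel/fq_codel_fast/g; s/FQ_CODEL/FQ_CODEL_FAST/g' q_fq_codel_fast.c
# then list q_fq_codel_fast.o next to q_fq_codel.o in tc's Makefile and rebuild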

flent has means to track and plot cpu stats on all the hosts on the path. There is a flent package for openwrt with the needed tools. You set up .ssh/config to get to 'em and do a --te=cpu_stats_hosts=hostA,hostB,hostC
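
A full invocation looks roughly like this (hostnames are placeholders; the .ssh/config aliases need passwordless login):

flent rrul -H netperf.example.lan -l 60 -p all_scaled \
      --te=cpu_stats_hosts=router.lan,laptop.lan \
      -t "fq_codel_fast" -o fq_codel_fast_rrul.png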

The tools have a tendency to heisenbug the tests, and may not have been tested in a while. We had a new bug introduced in tc's batch facility that breaks the C version of the tc monitoring tool (there's a shell version also).

As for the statistical significance of your findings so far, well, it's barely above the noise...


Yer not hijacking the thread at all.

I really LIKE explaining this stuff to people with different perspectives. I've been at this so long that I don't know what other people know, and having a chance to write it down, get a little help exploring new ideas, and make stuff faster in general really helps. I kind of have a lonely life in that the last time I got to discuss m/m/1 queue theory was a year or three back... I really wish more folk had essentially memorized https://archive.org/details/queueingsystems01klei and https://archive.org/details/queueingsystems02klei, as covid-bloat is queue theory 101....

The original idea for fq_codel_fast was to speed it up as much as possible, then layer in two needed ideas, and then... find some way to create a multi-core shaper. htb has severe scaling problems, ebpf is not particularly fast, and I thought, at the time, that I was starting to get an inkling as to how to make a multicore shaper "just work", then got distracted by other things. The timing here is good in that we have time to play here and in the wifi stack before the next major openwrt release, so long as we get enough folk helping out.

I hope to get back on this seriously by the weekend. I just cleared a ton of space in my lab to get the ath10k and mt76 stuff back online along with a new x86 embedded product, updated a couple of servers to ubuntu 20.04 (and then blew the power circuit I was on....)


I actually tried that first (I also had to make one edit to the Makefile and several in iproute2/include/uapi/pkt_sched.h - which looks the same as the one in your repo btw). My modded "tc" compiles, installs and runs but didn't like fq_codel_fast. That's when I went back and found the ce_threshold commit in your repo (probably should have looked there first - live and learn).

I still think I have a shot at making tc work better with fq_codel_fast but anything I come up with will be hackish for now (especially if there are two variants of pkt_sched.h). Constructive criticism, change requests, or other oversight is welcome.

Even for a "proof of concept" that might not go anywhere, I think it is time and I would like to make fq_codel_fast distinct from fq_codel. I'll take a shot at it and use fqcf as a temporary name. The name is cosmetic and can be changed any time.

EDIT: I will not change the name. It would mean substantial changes to your repo, not to mention your repo's name. Obviously you can change the name at any time - I'll do my best to keep up. I will attempt to separate it a bit more from fq_codel than it is now.

You should know that for personal privacy reasons, I choose not to identify myself in public online forums. The consequence is that my contributions are not directly included in openwrt. With respect to "fqcf," I'm not bothered about attribution to me, so if I do end up making a useful contribution, any form of "copy and paste" is acceptable (a respected and known developer should review it).


Thank you for taking the time to explain. I'm happy to learn and participate. I expect others will join in as their interest and time permits.


It creates an fq_codel_fast.so lib you need to package or copy over....


Also, although I had removed ce_threshold in an earlier commit, I put it back in the commit for the gso splitting stuff.

For reference, in your existing test setup, for measuring cpu, you might want to try cake. Over on the cake mailing list toke just put out a new patch that might speed it up a bit, and certainly does wonders for wireguard.... If yer feeling really ambitious, give dualpi a shot.
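
On the sqm side that would just be two option changes relative to the config shown earlier (assuming the packages that provide cake and piece_of_cake.qos are installed):

# in /etc/config/sqm
        option script 'piece_of_cake.qos'
        option qdisc 'cake'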

Ya, but it doesn't look like it's all there...

I'm about to "manually" revert it to put it back such that it matches fq_codel. Here is why:

After editing pkt_sched.h to match for both fq_codel_fast and tc, and making some name updates in fq_codel_fast (adding _fast where applicable), both tc and fq_codel_fast compile and install. However, based on:

r7500v2 # tc qdisc add dev eth0 root handle 1: mq
r7500v2 # tc qdisc add dev eth0 parent 1:1 fq_codel
r7500v2 # tc qdisc
qdisc noqueue 0: dev lo root refcnt 2 
qdisc mq 1: dev eth0 root 
qdisc fq_codel 800a: dev eth0 parent 1:1 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn 
r7500v2 # tc qdisc replace dev eth0 parent 1:1 fq_codel_fast
RTNETLINK answers: Invalid argument
r7500v2 # tc qdisc replace dev eth0 parent 1:1 fq_codel_fast ce_threshold 2.5ms
RTNETLINK answers: Invalid argument
r7500v2 # tc qdisc replace dev eth0 parent 1:1 fq_codel_fast ce_threshol
What is "ce_threshol"?
Usage: ... fq_codel_fast [ limit PACKETS ]
[ flows NUMBER ]
[ memory_limit BYTES ]
[ target TIME ]
[ interval TIME ]
[ quantum BYTES ] [ [no]ecn ]
[ ce_threshold TIME ]

I think it's failing due to the still-missing ce_threshold components in sch_fq_codel_fast.c, codel.h, and codel_impl.h.

I'll put it back and see where that gets me. Assuming this fixes the above issue but you don't want something that I put back (i.e. ce_mark), we can go from that point...

EDIT0: I'll double check the install libs as well, thx for the suggestion.

I checked this: there is no *.so for any qdisc; tc looks like a static bin, and "strings tc | grep codel" returns fq_codel_fast strings. There is an "m_xt.so" (xtables) installed with tc, but I haven't found anything there yet.

EDIT1: OK, I finally noticed "sce" vs "ce"... I'll have to look it over again. Variables with "ce_threshold" in the name got me confused.

If there is a problem with sch_fq_codel_fast.c or your headers, I haven't found it yet. Google results for "RTNETLINK answers: Invalid argument" suggest checking that the "kernel supports" the qdisc - but this is unhelpful given that I have the module and it's loaded.

Nitpick: ping -f is not an unresponsive flood. From the manpage: "outputs packets as fast as they come back or one hundred times per second, whichever is more".

edit: it is responsive to latency and plateaus at 100pps for rtt <= 10ms

OK, I might have stumbled on a clue.

tc works with fq_codel except for:

r7500v2 # tc qdisc change dev eth0 root fq_codel limit 10241
r7500v2 # tc qdisc
qdisc noqueue 0: dev lo root refcnt 2 
qdisc fq_codel 8014: dev eth0 root refcnt 9 limit 10241p flows 1024 quantum 1514 target 5.0ms interval 99us memory_limit 4Mb ecn 
...
r7500v2 # tc qdisc change dev eth0 root fq_codel flows 1024
RTNETLINK answers: Invalid argument

So I think I'll try some printk's in the sch_fq_codel_fast init and change functions.

Perhaps a minor bug carried over from sch_fq_codel...

EDIT: changing

err = nla_parse_nested(tb, TCA_FQ_CODEL_FAST_MAX, opt,      
                       fq_codel_fast_policy, NULL);

to

err = nla_parse_nested_deprecated(tb, TCA_FQ_CODEL_FAST_MAX, opt,       
                                  fq_codel_fast_policy, NULL);          

EDIT: it looks like switching to nla_parse_nested_deprecated is all that is needed (the now-edited-out entry in this post regarding a kernel panic was my bad):

r7500v2 # tc qdisc change dev eth0 root fq_codel_fast ce_threshold 2.5ms
r7500v2 # tc qdisc
qdisc noqueue 0: dev lo root refcnt 2 
qdisc fq_codel_fast 8004: dev eth0 root refcnt 9 limit 10240p flows 1024 quantum 1514 target 5.0ms ce_threshold 2.5ms interval 100.0ms memory_limit 4Mb ecn 

I sure hope someone with more kernel experience joins in soon... clearly I'm kinda slow at it.

I have to be "domestic" for a bit. Will return in a day or two (distracted by other interests).

@anon98444528 I have a bit of spare time this weekend. Could we maybe set up a private repo for a while, since yer shy, and see if I can get caught up with you?

Hmm, my public repo with the work I've done to date is here. I should have mentioned that the tc and fq_codel_fast updates were pushed to that repo last week. Feel free to clone, fork, and/or cherry pick from that into your own.

I'm not sure a private repo is necessary but I can try that if it is easier for you. Otherwise, I'm happy to track your public repo and post whatever I do into my own so it is available for anyone to access.

FWIW, I have tried ce_threshold 2.5ms with the modified tc and fq_codel_fast. No crashes but I can't (yet) determine if it is doing something (a knowledge gap issue on my part). I'm still working through flent/netperf and my home network test setup.

I thought a private repo was easier for you. Me, I prefer to do the maximum amount of work in public.

The two things under test for that are the cpu cost and network impact of gso/gro splitting, and what happens when you scribble on the ECT(1) bit a lot on normal traffic (both cpu and network effects).

To actually see a result from fiddling with that bit transports that are aware of it need to be used. The SCE related work for that is here: https://github.com/chromi/sce

The competition for the bit is called L4S in the IETF.

However, in both cases, fiddling with the ECT(1) bit has cpu costs and network impacts (most recently found and fixed was a bug in vpn decapsulations), especially on cheap embedded hw, that nobody has taken a hard look at yet...

... and it's easier (and safer) to test SCE than L4S, as the present L4S implementation is too scary to use.


Ha, every time I read the l4s vs sce link I realize my knowledge "gap" is more of a canyon. It will take me time.

After looking over your tc-adv repo and what is in chromi's iproute-sce, I should probably review what I did for tc to see if it's "sane".

BTW, is there a way to pass generic netperf "test-specific" parameters through flent? I'd like to add something like "-- -P 5002" to flent's call to netperf so I can navigate through various firewalls without having to completely drop them.

EDIT: nvmd, it's easy enough to hack flent to get this for one data socket, but for tests like rrul which run several simultaneous data sockets, it's not really worth the effort (both to modify flent and to configure the firewall on the box running netserver).