Beyond fq_codel & cake

I'll try that as well. I couldn't get hwsim to work properly. hwsim should make it easier to reproduce the results.

The copied code allowed me to keep the same list implementation. I'll check if we have the same functions/macros for the regular list.

I'll try to find those patches.

Flent is on my list of things to set up and use. The local lab environment with a router and two hosts is also on that list.

The testing with fast.com was just a basic sanity check to see whether mac80211 still works with the patch.

fq_pie seemed to drop traffic and to behave much like RED overall. That's just my perception, based on the fact that it cut the observed rates down to about 100 Mbps. It's just another qdisc to add to the list so we can compare the behaviour of the various algorithms and implementations.

We're discussing a few different algorithms and qdiscs on this thread: fq_impl used by mac80211, fq_codel and cake. Let's go through them one by one.

mac80211 from the Linux kernel uses fq_impl with 8192 virtual queues assigned dynamically to tins and a quantum of 300 bytes. When a tin hashes a flow to a virtual queue that is already used by another tin, the packets of that flow are put into the default_flow of this tin instead.
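As a rough illustration of that fallback, here's a userspace sketch of the classification step; the struct layout and the names (owner, classify) are simplified stand-ins for the fq_impl code, not the actual kernel identifiers.

struct tin;

struct flow {
	struct tin *owner;	/* tin currently using this virtual queue */
	/* per-flow packet queue, backlog, deficit, ... */
};

struct tin {
	struct flow default_flow;	/* shared bulk queue for colliding flows */
};

/* Hash the flow to one of n_flows shared virtual queues; if that queue is
 * already owned by a different tin, accept the collision and fall back to
 * this tin's default_flow, which is policed by a single codel instance. */
static struct flow *classify(struct flow *flows, unsigned int n_flows,
			     struct tin *tin, unsigned int hash)
{
	struct flow *flow = &flows[hash % n_flows];

	if (flow->owner && flow->owner != tin)
		flow = &tin->default_flow;

	if (!flow->owner)
		flow->owner = tin;

	return flow;
}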

The quantum of 300 bytes forces a flow to go through the list several times, until its deficit becomes positive, before the next packet can be dequeued from that virtual queue. This comes on top of what the SQF paper calls longest queue drop (dropping packets from the queue with the most bytes in its backlog) and on top of the codel control code, which can drop a lot of packets. The fq_impl code also borrows the DRR+ scheme from fq_codel, which forces new flows that have gone empty to cycle through the list of old flows. The result is a dequeue function which is quite far from O(1), even if we ignore the codel drops and the longest queue drop.
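To make the cost of the small quantum concrete, here is a self-contained userspace program that mimics only the DRR+ "deficit <= 0: add a quantum and rotate to the tail" rule (an illustration, not kernel code). With a 300-byte quantum and full-size 1514-byte packets a flow settles at roughly five rotations per dequeued packet, while an MTU-sized quantum needs only one.

#include <stdio.h>

/* Count how many times a flow gets skipped and rotated to the tail of the
 * old_flows list (deficit <= 0 -> deficit += quantum, move to tail) before
 * a packet of pkt_len bytes can be dequeued on the following visit. */
static int rotations_for_next_packet(int *deficit, int quantum, int pkt_len)
{
	int rotations = 0;

	while (*deficit <= 0) {
		*deficit += quantum;
		rotations++;
	}
	*deficit -= pkt_len;	/* the packet goes out on the next visit */
	return rotations;
}

int main(void)
{
	int deficit_small = 0, deficit_mtu = 0;
	int i;

	/* mac80211's 300-byte quantum: roughly five rotations per packet
	 * once the deficit has gone negative. */
	for (i = 0; i < 4; i++)
		printf("quantum 300:  packet %d needs %d rotations\n", i,
		       rotations_for_next_packet(&deficit_small, 300, 1514));

	/* An MTU-sized quantum needs a single rotation per packet. */
	for (i = 0; i < 4; i++)
		printf("quantum 1514: packet %d needs %d rotations\n", i,
		       rotations_for_next_packet(&deficit_mtu, 1514, 1514));
	return 0;
}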

Dropping packets from the longest queue and the codel control code can stay as they are. We could improve the dequeueing code to avoid loops where we increment the deficit repeatedly and to not pick up new empty flows which are going through the queue of old flows.

Another issue is that the tin accepts collisions too eagerly. A busy AP is likely to end up with a lot of flows in the default_flow of every tin. This bulk virtual queue is controlled by a single codel instance.

fq_codel takes a similar approach to fq_impl. New flows that have gone empty can also be selected during dequeue when walking the lists. fq_codel uses an MTU-sized quantum by default, which reduces its complexity somewhat because a flow has to go through the list only twice before the next packet is dequeued: once to top up the deficit by a quantum and once to actually dequeue a packet.

cake is the most complex of the three and I didn't spend as much time looking at it. It has a further refined DRR implementation. Since it also does bandwidth shaping and has multiple tins for different classes of traffic, among other features, I'd just leave it out for now.

All three of these implementations share some weaknesses. Backlogged flows with MTU-sized packets can end up in the same virtual queues as sparse flows with small packets. Busy routers handling a large number of flows at the same time will end up with fewer sparse flows because of these collisions. Cake mitigates this to some extent with its 8-way associative hashing and its multiple tins. The virtual queues have to move from the head to the tail of the list of old flows before packets can be dequeued. DRR, DRR+ and DRR++ aren't SQF.

We could change a few things:

  1. implement SQF so the new_flows list isn't needed, or augment DRR+ with SQF
  2. add strict 4-way associative hashing to mac80211, with flows going to the default_flow virtual queue only when a collision has to be accepted (see the sketch after this list)
  3. try to make the DRR part closer to O(1) for quantum values smaller than the MTU of the interface
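A minimal userspace sketch of item 2, assuming a 4-slot bucket; the bucket size and all the names here (vq, vq_lookup, SET_WAYS) are illustrative choices of mine, not existing fq_impl identifiers. The flow key hashes to one bucket, a matching or free slot is reused, and default_flow is only used when the whole bucket is taken by other flows.

#include <stdint.h>
#include <stddef.h>

#define SET_WAYS 4

struct vq {
	uint32_t flow_key;	/* 0 means the slot is free */
	/* packet queue, backlog, deficit, ... */
};

/* table holds n_buckets * SET_WAYS virtual queues.  Reuse a matching or
 * free slot in the flow's bucket; fall back to the shared default_flow
 * only when every slot is held by another flow. */
static struct vq *vq_lookup(struct vq *table, size_t n_buckets,
			    uint32_t hash, uint32_t flow_key,
			    struct vq *default_flow)
{
	struct vq *bucket = &table[(hash % n_buckets) * SET_WAYS];
	struct vq *free_slot = NULL;
	int i;

	for (i = 0; i < SET_WAYS; i++) {
		if (bucket[i].flow_key == flow_key)
			return &bucket[i];		/* existing flow */
		if (!free_slot && bucket[i].flow_key == 0)
			free_slot = &bucket[i];		/* remember a free slot */
	}

	if (free_slot) {
		free_slot->flow_key = flow_key;		/* claim the free slot */
		return free_slot;
	}

	return default_flow;	/* bucket is full: accept the collision */
}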

I just threw it into the mix to compare it with the other implementations. This just confirms that the theory behind it isn't solid either, not just the implementation.

These papers have some good ideas.

http://docenti.ing.unipi.it/g.stea/papers/Tecnhical%20Report%20Oct%202003.pdf

I do not believe that the harm here is not classifying a flow as sparse, but that in a collision between unequally fast flows the slower one will have to suffer the drop schedule elicited by the faster one... For fq_codel the stochastic remedy is to increase the number of hash bins. For cake this is not as clear, because a) in cake the number of bins is a compile-time, not a run-time option as in fq_codel, and b) cake will always allow 1 or 2 (I do not recall exactly) packets per hash bin, so the pure round-robin service delay will easily exceed the target with a sufficiently high number of flows on a sufficiently slow link. I am sure you are aware of all of that already.

Again, I am not saying PIE and its derivatives do not work at all, just that they are not without their own issues (including the fact that changing the delay target apparently also requires changing other parameters in lock-step, hardly user friendly).

I am hopping on a plane shortly, expect sparse replies for the next few days, if at all. There's also been some discussion of new cake features over here: https://github.com/rchac/LibreQoS/issues/121#issuecomment-1261210995

The impact of a larger number of virtual queues is certainly something to test. It's easy to see how a large enough number of flows would push the round-robin service delay past codel's 5 ms target. Perhaps we'll gather some data from my tests with a large number of flows.

My concern in this context was that the sparse flows with smaller and fewer packets end up being put in the backlogged virtual queues of heavy flows. This would be even worse in the scenario you've mentioned with 1024 virtual queues backlogged with bittorrent transfers. The sparse flows would collide with the backlogged flows. Codel would most likely drop a lot of packets.
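A back-of-the-envelope number for that scenario, assuming all 1024 virtual queues stay backlogged with full-size 1514-byte packets and an effective link rate of 600 Mbit/s (both figures are assumptions of mine): a single DRR round already takes

\[ t_{\text{round}} \approx \frac{1024 \times 1514 \times 8\ \text{bit}}{600\ \text{Mbit/s}} \approx 20.7\ \text{ms} \]

which is well above codel's 5 ms target before any sparse flow gets a second chance to transmit.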

Perhaps an algorithm similar to SQF will help with this.
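As a sketch of what an SQF-style choice could look like (a userspace illustration under my own naming, not an existing implementation): serve the non-empty virtual queue with the smallest byte backlog, so sparse flows are transmitted ahead of heavily backlogged ones. The linear scan is for clarity only; a real implementation would keep flows in a structure ordered by backlog.

#include <stddef.h>
#include <stdint.h>

struct vq {
	uint32_t backlog;	/* queued bytes in this virtual queue */
	/* packet list, flow state, ... */
};

/* Pick the non-empty virtual queue with the least backlogged bytes. */
static struct vq *sqf_pick(struct vq *flows, size_t n)
{
	struct vq *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (flows[i].backlog == 0)
			continue;
		if (!best || flows[i].backlog < best->backlog)
			best = &flows[i];
	}

	return best;	/* NULL when all queues are empty */
}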

Different algorithms and smaller data structures should help. Better scheduling decisions on the client device and on the home routers should help. There are bloated and poorly managed buffers in upstream equipment at most ISPs. Some connections such as LTE and wifi have highly variable rates. Transmitting the packets of sparser flows and the packets of flows with the least backlogged bytes of data first would give us a chance to have a better overall experience.

The LibreQoS and ISP-like setups are already different. Perhaps it's possible to implement fine-grained locking in mq to get an mq_fq_codel qdisc of some kind.

How could we modify the drivers or the kernel to use hardware timestamping? I'm not sure how we can convince those NICs to set timestamps on packets.

I've refactored the previous patch for fq_impl to be used with the 5.15 wifi backports. It builds without any warnings. It's not tested yet.

diff --git a/include/net/fq.h b/include/net/fq.h
index 2eccbbd..8f76f67 100644
--- a/include/net/fq.h
+++ b/include/net/fq.h
@@ -27,7 +27,7 @@ struct fq_tin;
 struct fq_flow {
 	struct fq_tin *tin;
 	struct list_head flowchain;
-	struct sk_buff_head queue;
+	struct list_head queue;
 	u32 backlog;
 	int deficit;
 };
diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index a5f67a2..7a5c22c 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -51,9 +51,10 @@ static struct sk_buff *fq_flow_dequeue(struct fq *fq,
 
 	lockdep_assert_held(&fq->lock);
 
-	skb = __skb_dequeue(&flow->queue);
+	skb = list_first_entry(&flow->queue, struct sk_buff, list);
 	if (!skb)
 		return NULL;
+	list_del(&skb->list);
 
 	fq_adjust_removal(fq, flow, skb);
 
@@ -66,21 +67,22 @@ static int fq_flow_drop(struct fq *fq, struct fq_flow *flow,
 	unsigned int packets = 0, bytes = 0, truesize = 0;
 	struct fq_tin *tin = flow->tin;
 	struct sk_buff *skb;
-	int pending;
+	u32 pending;
 
 	lockdep_assert_held(&fq->lock);
 
-	pending = min_t(int, 32, skb_queue_len(&flow->queue) / 2);
+	pending = min_t(u32, 32*1514, flow->backlog >> 1);
 	do {
-		skb = __skb_dequeue(&flow->queue);
+	        skb = list_first_entry(&flow->queue, struct sk_buff, list);
 		if (!skb)
 			break;
+		list_del(&skb->list);
 
 		packets++;
 		bytes += skb->len;
 		truesize += skb->truesize;
 		free_func(fq, tin, flow, skb);
-	} while (packets < pending);
+	} while (bytes < pending);
 
 	__fq_adjust_removal(fq, flow, packets, bytes, truesize);
 
@@ -226,7 +228,7 @@ static void fq_tin_enqueue(struct fq *fq,
 			      &tin->new_flows);
 	}
 
-	__skb_queue_tail(&flow->queue, skb);
+	list_add_tail(&skb->list, &flow->queue);
 	oom = (fq->memory_usage > fq->memory_limit);
 	while (fq->backlog > fq->limit || oom) {
 		flow = fq_find_fattest_flow(fq);
@@ -252,15 +254,17 @@ static void fq_flow_filter(struct fq *fq,
 			   fq_skb_free_t free_func)
 {
 	struct fq_tin *tin = flow->tin;
-	struct sk_buff *skb, *tmp;
+	struct sk_buff *skb;
+	struct list_head *tmp;
 
 	lockdep_assert_held(&fq->lock);
 
-	skb_queue_walk_safe(&flow->queue, skb, tmp) {
+	list_for_each(tmp, &flow->queue) {
+		skb = list_entry(tmp, struct sk_buff, list);
 		if (!filter_func(fq, tin, flow, skb, filter_data))
 			continue;
 
-		__skb_unlink(skb, &flow->queue);
+		list_del(&skb->list);
 		fq_adjust_removal(fq, flow, skb);
 		free_func(fq, tin, flow, skb);
 	}
@@ -331,7 +335,7 @@ static void fq_tin_reset(struct fq *fq,
 static void fq_flow_init(struct fq_flow *flow)
 {
 	INIT_LIST_HEAD(&flow->flowchain);
-	__skb_queue_head_init(&flow->queue);
+	INIT_LIST_HEAD(&flow->queue);
 }
 
 static void fq_tin_init(struct fq_tin *tin)
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 50025ff..dc616bc 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -3347,7 +3347,7 @@ static bool ieee80211_amsdu_aggregate(struct ieee80211_sub_if_data *sdata,
 
 	tin = &txqi->tin;
 	flow = fq_flow_classify(fq, tin, flow_idx, skb);
-	head = skb_peek_tail(&flow->queue);
+	head = list_first_entry(&flow->queue, struct sk_buff, list);
 	if (!head || skb_is_gso(head))
 		goto out;
 


I tested it out of curiosity. Unfortunately, the WLAN interface fails and causes the device to reboot.


All this is beyond me, but I do have a question. Are you designing for the devices of today, or the devices of 2+ years from now? When you factor in the time it takes for a project to be developed and tested, it could be 2 years before it's available and stable in most devices. Or a more informed time estimate could be made by looking at how long codel and CAKE took.

More specifically, most OpenWRT users seem to use hardware that is already 1-5 years old, due to how cheap it is on the used market. So 2 years from now the typical OpenWRT device will be one that was made sometime in 2019-2023. Gigabit, too, still seems like it will be a luxury in the US for the next 2 years.


Thank you for testing the changes. I'm sorry the patch wasn't functional. I'll make sure to test the changes from now on.

The next version of the patch will be tested.

The network stack gains new features all the time. New standards appear. Some devices have three wireless radios in them. Some of the ARM routers still can't handle shaping and AQM via CAKE at 1 Gbps. Shaping a 300/200 Mbps connection with CAKE wasn't easy on ARM hardware not that long ago.

Let's assume that the new routers are fast enough to handle 2 Gbps of traffic while shaping both ways with CAKE. That'd be great. We still have a problem when we switch to 10 Gbps and 25 Gbps. These routers can't handle such connections. The CPUs have to process a much larger number of packets. Even the fastest x86-64 CPUs have tight budgets for the time they can spend on each packet on 50 Gbps and 100 Gbps connections. CPUs aren't getting faster quickly enough to make up for the increased number of packets and for the increasingly complex network/OS code. CPUs and network interfaces also do their work in a bursty manner. That's good for overall processing throughput, but not for latency and consistency. Who wants 1 Gbps of throughput with an additional 50-150 ms of unnecessary latency?
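To put a number on that budget, assuming full-size 1500-byte packets on a single core (an assumption; small packets make it far tighter):

\[ \frac{1500 \times 8\ \text{bit}}{100\ \text{Gbit/s}} = 120\ \text{ns per packet} \]

At 50 Gbit/s that's still only about 240 ns for everything the stack has to do with each packet.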

We need to make the algorithms better and to optimize the code. Say we spend N operations for a given number of packets per second. Cutting that to N/3 operations for the same number of packets improves performance, and if the CPU is the bottleneck we might be able to handle roughly three times as many packets on the same hardware.

We're not in control of what happens when the packets leave our devices. Better local scheduling choices can lead to better overall outcomes. We've seen this already with fq_codel and CAKE.


This patch for mac80211 works for me:

diff --git a/include/net/fq.h b/include/net/fq.h
index 2eccbbd..8f76f67 100644
--- a/include/net/fq.h
+++ b/include/net/fq.h
@@ -27,7 +27,7 @@ struct fq_tin;
 struct fq_flow {
 	struct fq_tin *tin;
 	struct list_head flowchain;
-	struct sk_buff_head queue;
+	struct list_head queue;
 	u32 backlog;
 	int deficit;
 };
diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index a5f67a2..3d37dd3 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -51,9 +51,12 @@ static struct sk_buff *fq_flow_dequeue(struct fq *fq,
 
 	lockdep_assert_held(&fq->lock);
 
-	skb = __skb_dequeue(&flow->queue);
+	if (list_empty(&flow->queue))
+		return NULL;
+	skb = list_first_entry(&flow->queue, struct sk_buff, list);
 	if (!skb)
 		return NULL;
+	list_del(&skb->list);
 
 	fq_adjust_removal(fq, flow, skb);
 
@@ -66,21 +69,22 @@ static int fq_flow_drop(struct fq *fq, struct fq_flow *flow,
 	unsigned int packets = 0, bytes = 0, truesize = 0;
 	struct fq_tin *tin = flow->tin;
 	struct sk_buff *skb;
-	int pending;
+	u32 pending;
 
 	lockdep_assert_held(&fq->lock);
 
-	pending = min_t(int, 32, skb_queue_len(&flow->queue) / 2);
+	pending = min_t(u32, 32*1514, flow->backlog >> 1);
 	do {
-		skb = __skb_dequeue(&flow->queue);
+	        skb = list_first_entry(&flow->queue, struct sk_buff, list);
 		if (!skb)
 			break;
+		list_del(&skb->list);
 
 		packets++;
 		bytes += skb->len;
 		truesize += skb->truesize;
 		free_func(fq, tin, flow, skb);
-	} while (packets < pending);
+	} while (bytes < pending);
 
 	__fq_adjust_removal(fq, flow, packets, bytes, truesize);
 
@@ -226,7 +230,7 @@ static void fq_tin_enqueue(struct fq *fq,
 			      &tin->new_flows);
 	}
 
-	__skb_queue_tail(&flow->queue, skb);
+	list_add_tail(&skb->list, &flow->queue);
 	oom = (fq->memory_usage > fq->memory_limit);
 	while (fq->backlog > fq->limit || oom) {
 		flow = fq_find_fattest_flow(fq);
@@ -252,15 +256,17 @@ static void fq_flow_filter(struct fq *fq,
 			   fq_skb_free_t free_func)
 {
 	struct fq_tin *tin = flow->tin;
-	struct sk_buff *skb, *tmp;
+	struct sk_buff *skb;
+	struct list_head *tmp, *pos;
 
 	lockdep_assert_held(&fq->lock);
 
-	skb_queue_walk_safe(&flow->queue, skb, tmp) {
+	list_for_each_safe(pos, tmp, &flow->queue) {
+		skb = list_entry(pos, struct sk_buff, list);
 		if (!filter_func(fq, tin, flow, skb, filter_data))
 			continue;
 
-		__skb_unlink(skb, &flow->queue);
+		list_del(&skb->list);
 		fq_adjust_removal(fq, flow, skb);
 		free_func(fq, tin, flow, skb);
 	}
@@ -331,7 +337,7 @@ static void fq_tin_reset(struct fq *fq,
 static void fq_flow_init(struct fq_flow *flow)
 {
 	INIT_LIST_HEAD(&flow->flowchain);
-	__skb_queue_head_init(&flow->queue);
+	INIT_LIST_HEAD(&flow->queue);
 }
 
 static void fq_tin_init(struct fq_tin *tin)
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 50025ff..2fcf263 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -3347,7 +3347,9 @@ static bool ieee80211_amsdu_aggregate(struct ieee80211_sub_if_data *sdata,
 
 	tin = &txqi->tin;
 	flow = fq_flow_classify(fq, tin, flow_idx, skb);
-	head = skb_peek_tail(&flow->queue);
+	if (list_empty(&flow->queue))
+		goto out;
+	head = list_first_entry(&flow->queue, struct sk_buff, list);
 	if (!head || skb_is_gso(head))
 		goto out;
 

I'll post again when I have something to share.

I will also test it and come back to you to confirm whether it works here.

I did some benchmarks with MikroTik's software flow offloading (FastTrack). That seems to highlight the performance difference between CAKE, fq_codel and codel pretty well. codel shapes at about 55% of the unshaped 756 Mbps rate. This is on a rather weak ARM Cortex-A7 running at up to 896 MHz. I wonder how an ARM Cortex-A53 at up to 1.8 GHz will scale. That's what some of MikroTik's new Wi-Fi 6 products use.

I don't understand your methodology on that link.

The best way to run cake and fq_codel in a gbit environment is just as an interface queue, replacing the default sfq or fifo that mikrotik provides. What's the comparison between:

"No queues with FastTrack: 942 Mbp" (this is just a short fifo)

and that?

I try to say "line" or "native" rate, rather than "unshaped".

Personally I don't have much experience beyond trying things out, but I've found what worked well for me, especially on the client side: the sysctl parameters from https://wiki.archlinux.org/title/Sysctl#Networking together with fq_codel and BBR congestion control. Worth a try. My whole network, router and clients, is set up with the same config, and it made the biggest difference on wireless clients.

Personally I adjust /etc/rc.local and /etc/sysctl.conf and don't run QoS. To make sure everything runs I use crontab -e with "@reboot root /etc/rc.local" and turn it into a script, just to be safe. In some cases settings don't get applied, so adding a delay like sleep 10 helps. There are many more optimizations to be found in the Arch wiki and in general to apply to OpenWrt, and many of the parameters are explained in the Linux kernel documentation. After applying the more general hacks I see LuCI being much more responsive, much lower CPU usage and an overall better experience. Just an example: https://www.kernel.org/doc/Documentation/sysctl/vm.txt

Have fun.