Beyond fq_codel & cake

If a NIC implements a traffic shaper in hardware and BQL or similar backpressure, and the kernel supports that no kernel changes required... the biggest cost in SQM is invariably the traffic shaper component. It is not that AQM and fq-scheduler are free, but they appear considerably less demanding.

Not sure more queues will help, as I tried to show above more queues comes with its own cost. We can argue however how many queues are enough :wink:

There has not been a halfway competitive MIPS core announced in the last decade I know of that would be suitable for a router, that decline is probably not unrelated to the slow and steady rise of ARM, but probably not caused by it alone.

Sure, one option is simply to use these devices inside their capability envelope. I had been using my MIPS wndr3700v2 for I think 2 years on a 100/40 Mbps link with the shaper set to ~49/32 as that was the maximum it could reliably traffic shape at, it turned out the network was considerably more responsive with the lowered capacity but competent scheduling and AQM than at higher un-shaped capacity but with out SQM. It would not have made much difference if a clever trick would have allowed me to shape 10 Mbps higher...

Maybe, maybe not, the mipsen had a good run, it might be time to let go at targeting new development at those :wink:

Cake offers per-internal-IP fairness which would restrict most of the fall-out to the machine running the torrent client in our example, no qosify required. Someone on the forum tested that with ~500 greedy TCP flows and reported acceptable performance on different machines....

Probably.

1 Like

Thanks! My post lacks many details, tests and diagrams. I need to set up a proper environment for testing. The goal is to use two hosts and one router for testing. That should avoid any ISP bottlenecks or any transient behaviour.

I'm not familiar with xdp and the intel NIC chips with such facilities. Perhaps you know of one such chip? I've seen RISC-V SoCs with networking implemented in FPGA. Would it be far fetched to grab one such design to add similar features to it? It's most likely non-trivial work. We still need a cheap FPGA board with two network ports.

Do you happen to know some good pseudocode implementations of any of those algorithms? I'd start looking at those. We might be able to come up with some better implementations. I can imagine there are a few people who'd want to try that, not just me.

My impression right now is that CAKE's COBALT is good enough to control flows. The DRR part of it is something which could be improved, both in terms of cache misses and fairness.

Cache misses I have nothing to say on, but what fairness deficiencies do you observe? The last time this came up it turned out that with the ingress keyword cake actually tends to equalize its ingress rate, but that can result in more unequal egress rate.

1 Like

The cache misses are certainly a factor for the MIPS devices with gigabit NICs used for Internet connections faster than 100-300 Mbps. Cake in diffserv4 mode has 4096 total queues (flows). The queue's flow struct was 72 or 80 bytes the last time I checked for 64bit architectures. Since the backlogged flows are dequeued and appended to the tail (assuming backlogged flows), this has an impact on the cache. A different implementation might avoid that.

Regarding the ingress keyword, it's something to test with and without to compare. Again, I need to prepare a proper environment to use for testing.

As for timing: Microchip is adding patches for their LAN966x to add MQPRIO. Seems more people are adding (parts of) HW QoS to their drivers. lan96xx mqpio and taprio patch

1 Like

strict priority queues are a basically broken idea. It looks like people are reimplementing ideas that didn't work all that well in 80s, because they are easy, with 8 levels of service based on the TOS bits (CS0-CS7), which was obsoleted in the late 90s....

In looking over that patch...

IEEE 802.1Qbv in itself, is less broken, but...

1 Like

There's actual code for pie and fq_codel for p4. Hit google scholar ...

1 Like

Is that because you want CAKE to be able to be used in non-private, commercial contexts?

In any case, let me know if you need any help in this area as I specialize in invalidating (albeit European) patents, and I'm entirely familiar with ascertaining what does and does not fall within the scope of a claim. Many poorly drafted patents present straightforward workaround opportunities.

1 Like

I guess if there is either admission control and/or decreasing capacity share with increasing priority one might end up with a workable system, but pure strict priority is truly harsh and unforgiving.

I wanted the fq_codel related algorithms to be shipped in hardware.

1 Like

That is why I am hoping that for some of our limited (low-end) targets we could use the WRR and rate limiting features from the hardware (Qos) and compliment that with more active software queue management.

I appreciate that SP queues are "broken" and yet, now we finally see some development to actually use those hardware features. For sure those people are much better aware of the limitation and possibilities than I am.

I am hoping they can be used sanely as well! Thing is, my own ability to carry things forward is in decline, I don't have the mental energy or financial resources I had pre-covid, and at the moment, I'm mostly just trying to pass essential knowledge and analysis along so others can take up the flag. The love, energy and enthusiasm others have for our outputs to date IS encouraging, but at one level I often think it best that others go ahead, try new things, and ignore the past, as it's the best way to learn! So many of the best things we've done came from papers that couldn't be published.

One of the most key papers for me, came from discovering the extended paper on SFQ, and vigorously implementing all that was impossible (on a 68020 processor in 1992), then observing the results: http://www2.rdrop.com/users/paulmck/scalability/paper/sfq.2002.06.04.pdf

SFQRED was a huge stepping stone towards fq_codel, that everyone's forgotten now. I wish I'd written that up... and "RED in a ifferent light", key also: http://mirrors.bufferbloat.net/~jg/RelevantPapers/Red_in_a_different_light.pdf

Hysterically (@tohojo) I just discovered we cited the wrong paper for SFQ in rfc8290! Just SQF https://perso.telecom-paristech.fr/bonald/Publications_files/BMO2011.pdf

In all things in life, I recommend a road not taken, so long as you can keep a roof over your head.

3 Likes

there are a lot of good ideas in this: https://dl.acm.org/doi/pdf/10.1145/3387514.3406591

1 Like

Also, it seems likely that many of a new generation of ethernet devices missed putting "BQL" in there. There's a mini-tutorial over here:

http://www.taht.net/~d/broadcom_aug9_2018.pdf

Your contributions have an impact. More people should be aware and appreciate that.

The SQF paper is missing many details. The presented algorithm seems to state only that the shortest flow is chosen. The paper seems to propose a packet scheduler similar to FQ which selects the shortest flow to dequeue from and the longest flow to drop from.

An implementation of SQF with codel could probably work just fine without the longest queue drop. Codel should control the queues. The only scenario when packets would be dropped outside of codel is when running out of memory, just like in fq_codel and cake. I'm not sure what the algorithm's loop would look like. Would it just go through the queues from shortest to the longest, let codel do its work, send a packet and continue?

We're not forced to implement SQF as it is described there. Buckets could be used If SQF must be avoided because of the patent. Flows would be put in buckets based on the amount of bytes they have in the backlog. This would mean that it might or it might not be the shortest flow. It'd be close to being the shortest flow.

It's difficult to understand why they'd patent this. It's difficult enough to convince people to implement proper fair schedulers with AQM and proper shaping where it makes sense.

There are a few places in the kernel where the sizes of the data structures can be reduced. A reduction of 8-16 bytes in a flow's data structure could lead to a performance improvement on embedded hardware with limited memory bandwidth and really small CPU caches. These changes could even allow a flow's struct to fit inside a single cache line.

The fq code from the kernel (net/include/fq.h and net/include/fq_imp.h) is used for mac80211. This is the struct:

struct fq_flow {
	struct fq_tin *tin;
	struct list_head flowchain;
	struct sk_buff_head queue;
	u32 backlog;
	int deficit;
};

The sk_buff_head queue wastes 8 bytes with two fields: a spin lock and a queue length (in packets). Both of these are unused by the mac80211 tx.c code and by fq_impl.h. These queues are handled only privately by mac80211 and by the fq code. They're not touched by other code. There are other spinlocks in place which are actually used.

This field could be replaced.

-       struct sk_buff_head queue;
+       struct sk_buff    *head;
+       struct sk_buff    *tail;

fq is configured to use 8192 queues for mac80211. The change would lower the total size of these flows by 64 KB.

A few macros and functions used by fq_impl.h and mac80211/tx.c need to be rewritten. __skb_dequeue, __skb_queue_tail, skb_queue_walk_safe, __sbk_unlink, skb_peek_tail. I'll try to get these changes to work properly. More changes are required.

Others may be able to come up with other improvements for parts of the kernel's code. There are people out there with more kernel hacking experience.

Perhaps adding ACK thinning code into mac80211 is worth a shot. Having it on the client side would also help. Wifi stations might see an improvement in throughput since wifi is only half-duplex.