fq_codel and cake have been available for some time, and cake has received many improvements since its release. Neither appears to be adequate for consumer embedded routers with MIPS or ARM processors on symmetric gigabit links. Wireless routers have even more work to do, since they handle NAT and also act as access points. These devices can't handle 1 Gbps of traffic when using cake.
Software offloading can't work when using ingress shaping. Hardware offloading is also not available on many routers.
Another problem with shaping and AQM is latency. Network equipment from the provider's network and ONTs can also introduce latency/bloat. More changes and improvements are required. Reducing the time required to process packets can help.
This post is meant to be a place to brainstorm. The goals are to discuss ways to reduce latency, increase throughput and improve fairness.
Increasing throughput for fq_codel has been explored here: https://github.com/dtaht/fq_codel_fast. Removing unneeded fields from structs and shrinking the structs should help increase throughput for fq_codel and cake. The CPUs found in these routers have small caches for data and instructions. A significant reduction in cache misses can have a huge impact on performance.
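To make the struct-shrinking point concrete, here is a minimal sketch using Python's ctypes to mimic C struct layout. The field names are hypothetical and are not taken from fq_codel or cake, and the 32-byte cache line is an assumption that should be adjusted for the actual CPU. It only shows that reordering fields can remove padding, so that more per-flow state fits in one cache line.

```python
import ctypes

class FlowPadded(ctypes.Structure):
    # Hypothetical per-flow state with small fields scattered between
    # large ones, which forces the compiler to insert padding.
    _fields_ = [
        ("flags", ctypes.c_uint8),
        ("backlog_bytes", ctypes.c_uint64),
        ("dropped", ctypes.c_uint8),
        ("deficit", ctypes.c_int32),
    ]

class FlowReordered(ctypes.Structure):
    # Same fields, sorted from largest to smallest alignment.
    _fields_ = [
        ("backlog_bytes", ctypes.c_uint64),
        ("deficit", ctypes.c_int32),
        ("flags", ctypes.c_uint8),
        ("dropped", ctypes.c_uint8),
    ]

CACHE_LINE = 32  # assumption: adjust to the target CPU's cache line size

def per_cache_line(struct_type, line=CACHE_LINE):
    """How many whole structs of this type fit in one cache line."""
    return line // ctypes.sizeof(struct_type)

if __name__ == "__main__":
    for t in (FlowPadded, FlowReordered):
        print(t.__name__, ctypes.sizeof(t), per_cache_line(t))
```

On a typical 64-bit build the two sizes should come out as 24 and 16 bytes. For the real kernel structs, a tool such as pahole reports the actual holes and padding.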
Both cake and fq_codel are DRR based, even though they have greatly improved on the original algorithm. Alternatives to DRR might be worth exploring for both. fq_codel might be easier to experiment with because its code is simpler. What algorithms would be worth exploring? An algorithm that doesn't remove items from the head of a list only to append them to the tail may achieve better performance and improved fairness.
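For reference, the head-to-tail rotation in question looks roughly like the following. This is a simplified model of DRR with fq_codel-style deficit accounting (send while the deficit is positive, top up and rotate when it is zero or negative). It assumes a single flow list and leaves out CoDel and the new/old flow split, so it is a sketch for discussion, not the kernel algorithm. The `drr` function and the flow names are made up for the example.

```python
from collections import deque

def drr(flows, quantum):
    """Simplified DRR. flows maps a flow name to a list of packet lengths.

    Returns (sent, rotations): the transmit order as (name, length) pairs
    and the number of times a flow was moved from head to tail.
    """
    active = deque(flows)                       # flow names, head is served next
    queues = {name: deque(pkts) for name, pkts in flows.items()}
    deficit = {name: quantum for name in flows}
    sent = []
    rotations = 0
    while active:
        name = active[0]
        if deficit[name] <= 0:
            # Out of credit: top up and move the flow to the tail.
            deficit[name] += quantum
            active.rotate(-1)
            rotations += 1
            continue
        if not queues[name]:
            # Nothing left to send: drop the flow from the active list.
            active.popleft()
            continue
        length = queues[name].popleft()
        deficit[name] -= length
        sent.append((name, length))
    return sent, rotations

if __name__ == "__main__":
    order, rotations = drr({"bulk": [1500, 1500], "voip": [200, 200, 200]}, 300)
    print(order, rotations)
```

Every rotation is a list manipulation that sends no packet, which is the overhead an alternative algorithm would try to avoid.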
fq_codel lacks GSO splitting, NAT awareness and the improved hashing that cake has. It might be interesting to benchmark a modified fq_codel implementation.
Using cake on a PPPoE interface still leaves us with fq_codel on the physical network interface. Perhaps a benchmark with a local PPPoE server and client would answer the question: is it better to have cake on the PPPoE device or on the actual physical device?
Cake could be improved for x86-64 machines to increase the number of queues per tin. This should avoid collisions and improve fairness for networks with more flows. It might also need to probe for unused flows less often via the 8-way associative hashing if the tin has more queues. The 8-way associative hashing could probably be improved to avoid packet reordering when flows become empty. I'm not sure how that could be implemented.
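As a basis for discussion, here is a rough sketch of set-associative flow lookup and of how a larger table reduces collisions. It is a simplified model, not cake's code: the class, its method names and the collision handling are assumptions, and the 1024-queue default reflects my understanding of cake's per-tin queue count.

```python
class SetAssocTable:
    """Simplified set-associative flow table (hypothetical model)."""

    def __init__(self, queues=1024, ways=8):
        self.tags = [None] * queues    # flow hash currently owning each queue
        self.backlog = [0] * queues    # packets queued in each queue
        self.ways = ways
        self.collisions = 0

    def lookup(self, flow_hash):
        idx = flow_hash % len(self.tags)
        base = idx - idx % self.ways   # first queue of this set
        # Probe the whole set, starting at the directly mapped queue.
        probe = [base + (idx - base + i) % self.ways for i in range(self.ways)]
        for q in probe:                # 1. queue already owned by this flow
            if self.tags[q] == flow_hash:
                return q
        for q in probe:                # 2. claim an empty queue in the set
            if self.backlog[q] == 0:
                self.tags[q] = flow_hash
                return q
        self.collisions += 1           # 3. set is full: share a queue
        return idx

    def enqueue(self, flow_hash):
        q = self.lookup(flow_hash)
        self.backlog[q] += 1
        return q

if __name__ == "__main__":
    hashes = [i * 1024 for i in range(9)]   # nine flows landing in one set
    for size in (1024, 4096):
        table = SetAssocTable(queues=size)
        print(size, [table.enqueue(h) for h in hashes], table.collisions)
```

With 1024 queues the ninth flow has to share a queue, while with 4096 queues the same hashes spread over several sets and nothing collides.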
mac80211 uses a different implementation of fair queueing which has a quantum of 300 bytes - see https://elixir.bootlin.com/linux/latest/source/include/net/fq_impl.h#L353. This leads to a few rounds of going through the list of flows before the packet can be transmitted. This is probably not a problem for x86-64 based access points. The smaller embedded ARM based and MIPS based routers may have a hard time going above a certain rate. Perhaps a quantum of 600 would help a little while still being somewhat fair towards small packets.
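A quick back-of-the-envelope model of the quantum effect, assuming fq_codel-style deficit accounting (a flow is topped up by one quantum and moved to the back of the list whenever its deficit is zero or negative). The function name is made up for the example, and it ignores airtime fairness and everything else mac80211 does on top of this.

```python
def rotations_per_packet(quantum, packet_len, packets=1000):
    """Average number of top-up/rotate steps a backlogged flow needs per packet."""
    deficit = quantum
    rotations = 0
    for _ in range(packets):
        deficit -= packet_len          # transmit one packet
        while deficit <= 0:            # flow is skipped until credit is positive
            deficit += quantum
            rotations += 1
    return rotations / packets

if __name__ == "__main__":
    for quantum in (300, 600):
        print(quantum, rotations_per_packet(quantum, 1500))
```

In this model a flow sending 1500-byte packets is rotated 5 times per packet with a quantum of 300 and 2.5 times on average with a quantum of 600, so doubling the quantum roughly halves the list-walking overhead for full-size packets.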