Tc/flow classifier + mq (and fq_codel)

Hi!
In the debloat.sh script from dtaht,
(https://github.com/dtaht/deBloat/blob/master/src/debloat.sh)
he shows how to use the flow classifier to distribute flows across multiple hardware queues.
But I can't get this working :disappointed:
First I set up mq + fq_codel:

tc qdisc replace dev eth0 handle 1: root mq
tc qdisc replace dev eth0 parent 1:1 handle 110 fq_codel limit 1024 target 1ms interval 20ms
tc qdisc replace dev eth0 parent 1:2 handle 120 fq_codel limit 1024 target 1ms interval 20ms
tc qdisc replace dev eth0 parent 1:3 handle 130 fq_codel limit 1024 target 1ms interval 20ms
tc qdisc replace dev eth0 parent 1:4 handle 140 fq_codel limit 1024 target 1ms interval 20ms 
tc qdisc replace dev eth0 parent 1:5 handle 150 fq_codel limit 1024 target 1ms interval 20ms
tc qdisc replace dev eth0 parent 1:6 handle 160 fq_codel limit 1024 target 1ms interval 20ms
tc qdisc replace dev eth0 parent 1:7 handle 170 fq_codel limit 1024 target 1ms interval 20ms
tc qdisc replace dev eth0 parent 1:8 handle 180 fq_codel limit 1024 target 1ms interval 20ms

tc qdisc show dev eth0

qdisc mq 1: root
qdisc fq_codel 170: parent 1:7 limit 1024p flows 1024 quantum 1514 target 999us interval 20.0ms memory_limit 32Mb ecn
qdisc fq_codel 120: parent 1:2 limit 1024p flows 1024 quantum 1514 target 999us interval 20.0ms memory_limit 32Mb ecn
qdisc fq_codel 150: parent 1:5 limit 1024p flows 1024 quantum 1514 target 999us interval 20.0ms memory_limit 32Mb ecn
qdisc fq_codel 180: parent 1:8 limit 1024p flows 1024 quantum 1514 target 999us interval 20.0ms memory_limit 32Mb ecn
qdisc fq_codel 130: parent 1:3 limit 1024p flows 1024 quantum 1514 target 999us interval 20.0ms memory_limit 32Mb ecn
qdisc fq_codel 160: parent 1:6 limit 1024p flows 1024 quantum 1514 target 999us interval 20.0ms memory_limit 32Mb ecn
qdisc fq_codel 110: parent 1:1 limit 1024p flows 1024 quantum 1514 target 999us interval 20.0ms memory_limit 32Mb ecn
qdisc fq_codel 140: parent 1:4 limit 1024p flows 1024 quantum 1514 target 999us interval 20.0ms memory_limit 32Mb ecn

(Why the target is now 999us instead of the 1ms I requested, I don't know. Maybe a rounding issue in tc's time-to-tick conversion?)

tc -g class show dev eth0

+---(1:8) mq
+---(1:7) mq
+---(1:6) mq
+---(1:5) mq
+---(1:4) mq
+---(1:3) mq
+---(1:2) mq
+---(1:1) mq

Then I try to add the flow filter to eth0

tc filter add dev eth0 root protocol ip prio 10 flow hash keys src,dst,proto,proto-src,proto-dst divisor 8 baseclass 1:1

Results in:

RTNETLINK answers: Invalid argument
We have an error talking to the kernel, -1

Switching out root for parent 1: also does not work.
(root is wrong there, I guess, because tc -g class show dev eth0 root gives no output,
but tc -g class show dev eth0 parent 1: does.)
The flow classifier is built into the kernel.
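(For reference, one way to verify that — assuming the kernel was built with IKCONFIG_PROC so /proc/config.gz exists; CONFIG_NET_CLS_FLOW is the relevant symbol:)

zcat /proc/config.gz | grep NET_CLS_FLOW
lsmod | grep cls_flow   # in case it was built as a module instead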

Why do you want to do this? What's wrong with SQM?
OpenWrt is not Debian.

Copyright 2012 M D Taht. Released into the public domain.

This script is presently targetted to go into
/etc/network/ifup.d on debian derived systems.

Because I want to make use of the hardware queues of my device.

It should work on OpenWrt too.

Oy that code is ancient. Don't do that. If you want to use hw
mq just make your default qdisc be fq_codel and mq should pick it up automatically.

(this is not an sqm question, if you want to run at line rate)
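(Sketch of that, assuming eth0 — net.core.default_qdisc is the real sysctl; recreating the root makes mq re-attach its per-queue children:)

sysctl -w net.core.default_qdisc=fq_codel
tc qdisc replace dev eth0 root mq
tc qdisc show dev eth0   # every 1:n child should now be fq_codel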

A mildly better way to configure fq_codel on mq is to set the flows variable to 1024/x where x is the number of hw queues. But that would involve figuring out the filter command again, which I'm not up for today.
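(Just the flows part, as a sketch — assumes 8 hw queues on eth0, so flows = 1024/8 = 128 per sub-qdisc; the filter side is left out:)

NQUEUES=8
FLOWS=$((1024 / NQUEUES))   # 128 flow buckets per hw queue
tc qdisc replace dev eth0 handle 1: root mq
for i in $(seq 1 $NQUEUES); do
    tc qdisc replace dev eth0 parent 1:$i fq_codel limit 1024 flows $FLOWS target 1ms interval 20ms
done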

It does. But the tc qdisc show output is almost the same:
mq as the root qdisc and eight fq_codel sub-qdiscs.
But only one of the eight fq_codel queues is used (per the tc stats output).
So the flow classifier is still needed, I guess...
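(The check, for reference — the per-child counters in the stats output show which hw queue actually carries traffic:)

tc -s qdisc show dev eth0
# if only one child's "Sent" bytes/packets grow, only one hw queue is in use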
//edit
Kernel 4.19.74

net: sched: fix reordering issues

[ Upstream commit b88dd52c62bb5c5d58f0963287f41fd084352c57 ]

Whenever MQ is not used on a multiqueue device, we experience
serious reordering problems. Bisection found the cited
commit.

The issue can be described this way :

- A single qdisc hierarchy is shared by all transmit queues.
  (eg : tc qdisc replace dev eth0 root fq_codel)

- When/if try_bulk_dequeue_skb_slow() dequeues a packet targetting
  a different transmit queue than the one used to build a packet train,
  we stop building the current list and save the 'bad' skb (P1) in a
  special queue. (bad_txq)

- When dequeue_skb() calls qdisc_dequeue_skb_bad_txq() and finds this
  skb (P1), it checks if the associated transmit queues is still in frozen
  state. If the queue is still blocked (by BQL or NIC tx ring full),
  we leave the skb in bad_txq and return NULL.

- dequeue_skb() calls q->dequeue() to get another packet (P2)

  The other packet can target the problematic queue (that we found
  in frozen state for the bad_txq packet), but another cpu just ran
  TX completion and made room in the txq that is now ready to accept
  new packets.

- Packet P2 is sent while P1 is still held in bad_txq, P1 might be sent
  at next round. In practice P2 is the lead of a big packet train
  (P2,P3,P4 ...) filling the BQL budget and delaying P1 by many packets :/

To solve this problem, we have to block the dequeue process as long
as the first packet in bad_txq can not be sent. Reordering issues
disappear and no side effects have been seen.

Interesting.
This implies that fq_codel itself is multi-queue aware.
So replacing mq with a plain fq_codel root should work, then.
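That would be the single-qdisc setup the commit message itself gives as an example:

tc qdisc replace dev eth0 root fq_codel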

Judging from these results, it seems fq_codel without mq is not as fruitful.
https://blog.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-networking-qdisc-fastabend.pdf

Well,
on WRT* devices there seems to be not much difference when using mq + fq_codel.
Looking at the tc stats output, only one hardware queue seems to get used most of the time.
(Removing the tx queue workaround patch doesn't make a big difference.)
I haven't found a way to make mq use more queues.
Maybe someone else has an idea?
So the best bet is XPS/RPS, which take effect later in processing (?) and do the queue assignment in software?
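(A sketch of what that could look like — the sysfs paths are the standard XPS/RPS knobs, but the CPU masks assume a 2-core box and are only an example:)

# map tx queues alternately to CPU0 (mask 0x1) and CPU1 (mask 0x2)
i=0
for q in /sys/class/net/eth0/queues/tx-*; do
    printf '%x' $(( 1 << (i % 2) )) > "$q/xps_cpus"
    i=$((i + 1))
done
# let both CPUs process incoming packets for every rx queue (mask 0x3)
for q in /sys/class/net/eth0/queues/rx-*; do
    echo 3 > "$q/rps_cpus"
done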

According to line 699, it is forced to use only a single queue unless the code is edited.

I guess the comment wasn't updated?

Heh. I remember trying this back in the day. But I don't remember the result. Either it crashed... or it was a strict priority queue implemented in hw with no way to turn it off.

A trip down memory lane: https://bugs.openwrt.org/index.php?do=details&task_id=294

This gave rise to this patch that has been with mvebu devices using mvneta since that year (I think it was kernel 4.4 at the time?), surviving to kernel 5.4:

There's been a hypothesis that switching from swconfig to DSA may mean this patch isn't necessary anymore, but that is best left to be answered by @anomeome or by https://github.com/dengqf6 since I barely have time anymore to test anything: https://github.com/openwrt/openwrt/pull/2935

damned if I know. test.

the flent rrul test will show starvation, assuming the hw does fixed priorities.
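(For example — netserver.example.com is just a placeholder for a netperf server:)

flent rrul -p all_scaled -l 60 -H netserver.example.com -t mq-test -o rrul.png
# with fixed hw priorities, one traffic class should visibly starve the others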

Regarding the multi-queue issue: I found this recently, which gives a great pointer to why it doesn't work...
