AQL and the ath10k is *lovely*

on osx:

sudo netstat -l your_device -qq

and I'm calling it a day. New theory - the rpi (openwrt.lan) has something rate limiting the performance of its stack.

Before calling it a day - see below. May I ask what you expect to find in the output of that command?

netstat output (too long to paste directly)

Going to enjoy Father's day for a little while. :wink:

heh. I gave you the wrong command. -qq -I (that's an I, not an L) should show the native fq_codel stats on the osx box. and even that might be the wrong command... gimme a sec

That makes more sense now, I should have realised; here you go:

@reaper$ ➜  ~ sudo netstat -I en0 -qq
en0:
     [ sched:  FQ_CODEL  qlength:    0/128 ]
     [ pkts:  124486499  bytes: 170983491083  dropped pkts:  44041 bytes: 65229549 ]
=====================================================
     [ pri: VO (1)	srv_cl: 0x400180	quantum: 605	drr_max: 8 ]
     [ queued pkts: 0	bytes: 0 ]
     [ dequeued pkts: 33377	bytes: 6044211 ]
     [ budget: 0	target qdelay: 10.00 msec	update interval:100.00 msec ]
     [ flow control: 0	feedback: 0	stalls: 0	failed: 0 	overwhelming: 0 ]
     [ drop overflow: 0	early: 0	memfail: 0	duprexmt:0 ]
     [ flows total: 0	new: 0	old: 0 ]
     [ throttle on: 0	off: 0	drop: 0 ]
     [ compressible pkts: 0 compressed pkts: 0]
=====================================================
     [ pri: VI (2)	srv_cl: 0x380100	quantum: 3028	drr_max: 6 ]
     [ queued pkts: 0	bytes: 0 ]
     [ dequeued pkts: 5842684	bytes: 8182467304 ]
     [ budget: 0	target qdelay: 10.00 msec	update interval:100.00 msec ]
     [ flow control: 355	feedback: 355	stalls: 12	failed: 0 	overwhelming: 0 ]
     [ drop overflow: 0	early: 0	memfail: 0	duprexmt:0 ]
     [ flows total: 0	new: 0	old: 0 ]
     [ throttle on: 0	off: 0	drop: 0 ]
     [ compressible pkts: 0 compressed pkts: 0]
=====================================================
     [ pri: BE (7)	srv_cl: 0x0	quantum: 1514	drr_max: 4 ]
     [ queued pkts: 0	bytes: 0 ]
     [ dequeued pkts: 118254467	bytes: 162519175698 ]
     [ budget: 0	target qdelay: 10.00 msec	update interval:100.00 msec ]
     [ flow control: 9243	feedback: 9243	stalls: 40	failed: 0 	overwhelming: 0 ]
     [ drop overflow: 0	early: 43041	memfail: 0	duprexmt:0 ]
     [ flows total: 0	new: 0	old: 0 ]
     [ throttle on: 0	off: 0	drop: 0 ]
     [ compressible pkts: 0 compressed pkts: 0]
=====================================================
     [ pri: BK (8)	srv_cl: 0x100080	quantum: 1514	drr_max: 2 ]
     [ queued pkts: 0	bytes: 0 ]
     [ dequeued pkts: 355971	bytes: 275803870 ]
     [ budget: 0	target qdelay: 10.00 msec	update interval:100.00 msec ]
     [ flow control: 3	feedback: 3	stalls: 0	failed: 0 	overwhelming: 0 ]
     [ drop overflow: 0	early: 0	memfail: 0	duprexmt:0 ]
     [ flows total: 0	new: 0	old: 0 ]
     [ throttle on: 0	off: 0	drop: 0 ]
     [ compressible pkts: 83604 compressed pkts: 34500]

enjoy your Father's Day. It is Labor Day here... and we have labored mightily.

did you patch the kernel or the mac80211 package?

wmm_ac_be_txop_limit=32 - is that what you see on the osx box?

this was a good result from aug 4:

a tcpdump of 2 flows up would be useful also.

I patched package/kernel/mac80211.
Update: 1:38 AEDT - I just realised my mistake; recompiling and getting it ready for re-testing early tomorrow morning.

Yes, Sir, see below:

But today I didn't use the parameters below; I only modified txop=32 and used burst=0.

tx_queue_data2_aifs=1
tx_queue_data2_cwmin=7
tx_queue_data2_cwmax=15
tx_queue_data2_burst=3.0

I will do it tomorrow.

Another round.
WMM test parameters:

tx_queue_data2_aifs=3
tx_queue_data2_cwmin=15
tx_queue_data2_cwmax=63
tx_queue_data2_burst=0

wmm_ac_be_txop_limit=0

AQL test parameters:

root@nanohd-downstairs:~# cat /sys/kernel/debug/ieee80211/phy1/aql_txq_limit
AC	AQL limit low	AQL limit high
VO	5000		12000
VI	5000		12000
BE	5000		12000
BK	5000		12000
root@nanohd-downstairs:~# cat /sys/kernel/debug/ieee80211/phy1/aql_threshold
24000
root@nanohd-downstairs:~# cat /sys/kernel/debug/ieee80211/phy1/aql_enable
1
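For anyone reproducing this: the per-AC limits can be changed through the same debugfs file. If I remember the mac80211 write format correctly (an assumption - check your kernel's net/mac80211/debugfs.c), it takes "<ac> <low> <high>" with AC indices 0=VO, 1=VI, 2=BE, 3=BK, e.g.:

# assumed write format: "<ac> <limit_low_us> <limit_high_us>" - sets BE to 2000/2000
echo "2 2000 2000" > /sys/kernel/debug/ieee80211/phy1/aql_txq_limit
# the threshold is a single value in airtime microseconds
echo 24000 > /sys/kernel/debug/ieee80211/phy1/aql_threshold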

flent rrul_be test:

flent tcpn[up,down] 2-threaded test:


yuck.

w/aql disabled?


Sadly, nope. You can see this in the parameters list I posted before the graphs.

Where were we in July?

"At least it doesn't crash."

And, the last one for today.

WMM test parameters:

tx_queue_data2_aifs=3
tx_queue_data2_cwmin=15
tx_queue_data2_cwmax=63
tx_queue_data2_burst=0

wmm_ac_be_txop_limit=0

AQL test parameters:

root@nanohd-downstairs:~# cat /sys/kernel/debug/ieee80211/phy1/aql_txq_limit
AC	AQL limit low	AQL limit high
VO	2000		2000
VI	2000		2000
BE	2000		2000
BK	2000		2000
root@nanohd-downstairs:~# cat /sys/kernel/debug/ieee80211/phy1/aql_threshold
24000
root@nanohd-downstairs:~# cat /sys/kernel/debug/ieee80211/phy1/aql_enable
1

Please note that aql_threshold is effectively nullified here: it only controls when a queue switches between the high and low AQL limits, and with both limits set to the same value the switch changes nothing.

flent rrul_be 300 s test graph:

flent tcpnup test with 1, 2, 4, 8 and 16 threads ping cdf graph:

flent tcpndown test with 1, 2, 4, 8 and 16 threads ping cdf graph:

And as usual, click here to download including a tcpdump of 2-threaded up and down captures.

I think we are onto something, but why is downloading so good and uploading so bad?! Can you think of anything @dtaht?


We aren't winning the election often enough. Advertising a txop of 2ms or less in the beacon might help. Nice to see progress (without y'all's help and interest I'd have given up multiple times on these fronts. Actually, I DO give up periodically in the hope that a zen-like moment would yield inspiration, or someone else would have the inspiration).
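If I have the hostapd units right (an assumption worth double-checking), wmm_ac_*_txop_limit is expressed in 32 µs units, so a ~2 ms best-effort txop would look something like:

# 64 * 32 µs ≈ 2.05 ms - a sketch of the intent, not a tested value
wmm_ac_be_txop_limit=64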

I still need to look at these packet captures closely.

A key thing to remember is we got about 1/3 ms RTT with no load. What we achieve now is pretty miserable compared to that.

Fiddling with NAPI_POLL_WEIGHT=16 might pull some latency out of the ethernet interfaces.

On the osx upload side, there are no drops, essentially. My guess is all the latency is coming from a fixed length queue there, and when they run out, they push back on the stack.


Two ways to test that - I can't find a way to lock the mcs rate in osx with a few searches of the web, but narrowing the channel width at the AP to HT20 might be revealing. Going from VHT80 to HT20 would probably quadruple the latency observed from the osx upload if that theory is correct.
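On an OpenWrt AP that would be roughly the following (radio1 is a guess - use whichever radio serves this BSS):

uci set wireless.radio1.htmode='HT20'
uci commit wireless
wifi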

Polling the netstat -qq -I the_interface and dumping that somewhere during that upload test might be revealing also.

for i in `seq 1 60`
do
	netstat -qq -I the_interface
	sleep 1
done > fq_codel_osx.log

I will try this.

By the way, I just did a new quick test on Waveform, and one with Apple's tool, letting the WLAN be the choking point... quite an improvement on both tests; previously the results were +20-40 ms and < 1000 RPM.

==== SUMMARY ====
Upload capacity: 34.504 Mbps
Download capacity: 418.589 Mbps
Upload flows: 12
Download flows: 12
Responsiveness: High (3157 RPM)
Base RTT: 11
Start: 6/9/2022, 5:34:57 am
End: 6/9/2022, 5:35:07 am
OS Version: Version 12.5.1 (Build 21G83)
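(The summary above looks like the default output of the responsiveness tool built into macOS 12+; presumably it was produced by simply running, with no arguments:

networkQuality

which loads the link up and down in parallel and reports responsiveness in RPM.)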

Let me run the tests before starting my daily fight against the Gravity. Stay tuned.


Here you go. I ran two sets of tests, upload and download from 1 to 16 threads with the AP in HT20 mode. In parallel, I ran a version of your script to capture fq_codel stats.

You can download all the files by clicking here.
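A timestamped variant of that polling loop might look like the sketch below (en0 and the file name are just examples, not necessarily what I ran; the actual output is in the archive):

for i in `seq 1 60`
do
	date '+%s'
	netstat -qq -I en0
	sleep 1
done > fq_codel_osx_ht20.log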

Do note that I didn't see the same kind of "contention" between upload and download. I think it might be that we are not hitting the download limit, so latency does not increase. Another of my naïve queries, I reckon.

I will patch this next. Test to come tomorrow, though, Dave.

Update: I changed my mind; I will fight against Gravity in the arvo. See below.

WMM test parameters:

tx_queue_data2_aifs=3
tx_queue_data2_cwmin=15
tx_queue_data2_cwmax=63
tx_queue_data2_burst=0

wmm_ac_be_txop_limit=0

AQL test parameters:

root@nanohd-downstairs:~# cat /sys/kernel/debug/ieee80211/phy1/aql_txq_limit
AC	AQL limit low	AQL limit high
VO	2000		2000
VI	2000		2000
BE	2000		2000
BK	2000		2000
root@nanohd-downstairs:~# cat /sys/kernel/debug/ieee80211/phy1/aql_threshold
24000
root@nanohd-downstairs:~# cat /sys/kernel/debug/ieee80211/phy1/aql_enable
1

Kernel patches:

MS2TIME(8)
NAPI_POLL_WEIGHT=16

flent rrul_be 300 s test graph:

flent tcpnup test with 1, 2, 4, 8 and 16 threads ping cdf (median ≈70 ms) graph:

flent tcpndown test with 1, 2, 4, 8 and 16 threads ping cdf (median ≈8 ms) graph:

Click me to download flent data and tcpdump capture.

Note that I didn't see any kernel warning; hence the driver is not calling netif_napi_add() with a weight value higher than the one defined (16).

        if (weight > NAPI_POLL_WEIGHT)
                netdev_err_once(dev, "%s() called with weight %d\n", __func__,
                                weight);

The napi poll weight affects the whole system. You should see an increase in context switches due to it. It's been 64 for two decades, and I've always felt it was too high for modern (especially arm) multicore systems. It's better to do less work, more often, to have a more fluid experience.

Since your test result was identical, it's not clear if it was actually applied. A printk in init from the mt76 ethernet driver printing out what it's set to would validate that it changed.
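A lower-tech sanity check (the paths below are from memory of the OpenWrt build layout, so treat this as a sketch): grep the patched kernel tree for the definition, and watch dmesg on the router for the over-weight warning quoted above.

grep -rn "define NAPI_POLL_WEIGHT" build_dir/target-*/linux-*/linux-*/include/linux/netdevice.h
dmesg | grep "called with weight"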

For all I know, 8 is closer to a good number on arm.

Very happy with the download result (we were at VHT80(?) 430Mbit though - what were we getting before? 80Mbit mt76 vs 120 OSX or so...). To me, the ratio we've had since the beginning of this thread STILL points maybe to an ampdu sizing or mcs-rate problem more than anything else, since we now have plenty of cpu left over.

FQ is working just fine and dandy with these reduced AQL values. I still don't get why aql needs to be enabled at all... (could you do another test with aql disabled with HT20?)
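(Presumably that just means flipping the debugfs knob shown earlier, something like:

echo 0 > /sys/kernel/debug/ieee80211/phy1/aql_enable

and re-running the flent tests.)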

If you don't mind, I'd like to stay at HT20 for a while - makes the packet captures smaller! I ended up with 9GB on swap on the last cap....

To make 'em bigger... to be able to actually look at what's on the wireless... sigh... do you have a usb stick available for the router? Does a wifi monitoring interface work on this chipset with tcpdump?

I just swapped some email with @sultannaml - working on the latest mt79 chipset - who told me:

Other mt76 changes I made AP-side include making mt7915 construct larger A-MSDUs
when A-MSDUs are hardware offloaded (which it has been for over a year now), and
working around a weird firmware bug where mt7915 couldn't transmit frames with
MCS 10-11 when using 160 MHz bandwidth with 2 spatial streams over DFS spectrum.

I also made a handful of other changes to mt76 to fix mt7922 bugs and
performance issues, such as how mt7922 would never TX at 160 MHz bandwidth to my
mt7915 AP out of the box — TX would be limited to 80 MHz — despite working just
fine with a Broadcom AP.

[1] https://github.com/openwrt/openwrt/commit/f338f76a66a50d201ae57c98852aa9c74e9e278a
[2] https://github.com/kerneltoast/kernel_x86_laptop/commit/ca89780690f7492c2d357e0ed2213a1d027341ae

You went from VHT80 or 160 down to 20?


Can I have a quick pointer to it, please?

I read those patches. I've moved from VHT80 to HT20 here. My devices are MT7621 CPUs with MT7615e chipsets.