AQL and the ath10k is *lovely*

Hm, have we discovered that some WiFi problems are related to 5.15?

I'm referring to the last three posts in PR 4036, but I think others above are running into issues on 5.15 that are unrelated to the AQL troubleshooting. No clue if what I observed is related since I never figured out why I had issues on 5.15. I can start a new thread if you want.

oh ok, 5.15 + swconfig = WiFi slowdown o.O (sorry for the OT, but I wonder if the time is ripe to make the switch)

For me it slows down with DSA as well, i.e. it's "just" a kernel 5.15 WiFi issue.

(a bit OT, but I hope we won't have to bisect stuff again...)
(in theory I have patches for every kernel version between 5.10 and 5.15, so I wonder if we can manage to find something)

can you take a packet capture during a stress test? tcpdump -i your_server_interface -s 128 -w wtf.cap

I typically take those apart via "tcptrace -G", pick a large flow via ls -l, and then view it with xplot.org - but if you can just put the capture up somewhere....

At a distance too, if possible.

To get even more extreme, there's a thing called an aircap, but that requires starting to capture at association.

For me the AQM statistics look OK for the station:

cat /sys/kernel/debug/ieee80211/phy1/netdev:wlan1-2/stations/#mac#/aqm

target 19999us interval 99999us ecn yes
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets flags
0 2 0 0 32368 105558 0 4197 6 2595065230 1807317 0x0(RUN)
1 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
2 3 0 0 1172 0 0 0 1 295790 1225 0x0(RUN)
3 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
4 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
5 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
6 0 0 0 1 0 0 0 0 373 1 0x0(RUN)
7 0 0 0 2 0 0 0 0 322 2 0x0(RUN)
8 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
9 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
10 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
11 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
12 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
13 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
14 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
15 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
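Eyeballing the drops/marks columns across snapshots like these gets tedious. As an illustrative sketch (not something from the thread), a small awk filter can summarize a dump per access class, assuming the column layout shown above (drops in column 6, marks in column 7, tx-packets in column 11); the summarize_aqm name and the heredoc sample are mine.

```shell
# Sketch: sum drops, marks and tx-packets per AC from an aqm snapshot.
# In real use, pipe in the debugfs file instead of the heredoc, e.g.:
#   summarize_aqm < /sys/kernel/debug/ieee80211/phy1/netdev:wlan1-2/stations/#mac#/aqm
summarize_aqm() {
    awk 'NF >= 12 && $1 ~ /^[0-9]+$/ {   # keep only the per-TID rows
            drops[$2] += $6; marks[$2] += $7; pkts[$2] += $11
         }
         END {
            for (ac = 0; ac <= 3; ac++)
                printf "AC %d: drops=%d marks=%d tx-packets=%d\n", ac, drops[ac], marks[ac], pkts[ac]
         }'
}

# Example with two of the rows from the dump above:
summarize_aqm <<'EOF'
target 19999us interval 99999us ecn yes
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets flags
0 2 0 0 32368 105558 0 4197 6 2595065230 1807317 0x0(RUN)
2 3 0 0 1172 0 0 0 1 295790 1225 0x0(RUN)
EOF
```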

After a 10 s UDP iperf run with a 30% drop rate at 60 Mbit/s:

target 19999us interval 99999us ecn yes
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets flags
0 2 0 0 32461 108896 0 4616 6 2659436543 1852671 0x0(RUN)
1 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
2 3 0 0 1194 0 0 0 1 301226 1247 0x0(RUN)
3 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
4 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
5 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
6 0 0 0 1 0 0 0 0 373 1 0x0(RUN)
7 0 0 0 2 0 0 0 0 322 2 0x0(RUN)
8 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
9 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
10 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
11 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
12 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
13 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
14 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
15 0 0 0 0 0 0 0 0 0 0 0x0(RUN)

But for the device itself I don't see any real traffic:

cat /sys/kernel/debug/ieee80211/phy1/netdev:wlan1-2/aqm
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 90372 9 0 0 755 11493055 100566

After a 10 s UDP iperf run with a 30% drop rate at 60 Mbit/s:

cat /sys/kernel/debug/ieee80211/phy1/netdev:wlan1-2/aqm
ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 0 0 90376 9 0 0 755 11493408 100570

The fairness seems to be OK: the UDP flood does not block a TCP connection, and the throughput for both seems to be the same.

It's reassuring to finally see some drops here. You can see marks if you do a
sysctl -w net.ipv4.tcp_ecn=1 on both the server and the client: a UDP flood will drop,
while a TCP connection will mark.
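For reference, a sketch of flipping that switch on a typical Linux box; the sysctl name is the standard one, and the value meanings in the comments come from the kernel's ip-sysctl documentation, not from this thread:

```shell
# One-off, on both the iperf server and the client:
sysctl -w net.ipv4.tcp_ecn=1

# Persistent variant, e.g. in /etc/sysctl.d/99-ecn.conf:
#   net.ipv4.tcp_ecn = 1
#
# Values: 0 = ECN off, 1 = request and accept ECN on all TCP
# connections, 2 = accept ECN only when the peer requests it
# (the kernel default).
```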

Still, a packet capture would be best.

I think there is one strange thing. As far as I can remember, the AQL queue depth was always roughly around 12000 during the iperf UDP "flood" test. Now it changes rapidly and at random between 0 and 12000. Was something changed?

I've pushed the airtime fairness rework to master, along with some more improvements (commits 6d49a25804d78d639e08a67c86b26991ce6485d8 and 7bf5233a2a2b1c81fae2f3a85bb971ef2a7f9b27).
The quantum change in the last commit might help with cases where the AQL budget isn't fully utilized.


Is it possible that the previously published fixes caused link errors due to a missing definition in the mac80211.h file?

I added the patch (below), and will test.

--- a/include/net/mac80211.h
+++ b/include/net/mac80211.h
@@ -153,6 +153,7 @@
 	IEEE80211_AC_BE		= 2,
 	IEEE80211_AC_BK		= 3,
 };
+#define IEEE80211_NUM_ACS	4
 
 /**
  * struct ieee80211_tx_queue_params - transmit queue configuration

OpenWrt 18.06 had the above definition; in 19.07 and later it is gone, having been moved to the file linux/ieee80211.h. The IEEE80211_NUM_ACS definition is used in the files cfg.c and main.c (information from the patches), but cfg.c includes only linux/ieee80211.h while main.c includes only net/mac80211.h. Sorry if I'm misunderstanding the topic or thinking about it wrong - I'm not a programmer, although I have some basics.

EDIT: First tests are very similar to my best results from stable on the 5.4.154 kernel (OpenWrt 21.02.1; NoTengoBattery v3.01 image) - link

EDIT2: Probably patches 336 and 337 will also be helpful in some way...
No bugs and better speed so far; ping is probably better with a small number of clients (I won't test that today because it's Sunday and there are only 3 devices connected).


I also encountered your errors on the R7800 when running a 22.03 image from May 8 2022 (i.e. without any of Felix's recent commits), with both AQL and ATF enabled.

dmesg | grep -i tx
[   25.527953] debugfs: File 'c2c_tx' in directory 'stats' already present!
[ 2566.673679] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode
[ 2566.673712] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode
[ 2566.680934] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode
[ 2566.689057] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode
[ 2566.697231] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode
[ 2566.705384] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode
[ 2566.713473] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode
[ 2566.721696] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode
[ 2566.729846] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode
[ 2566.738031] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode

# uname -a
Linux 5.10.113 #0 SMP Sun May 8 13:53:12 2022 armv7l GNU/Linux

# cat /etc/os-release 
NAME="OpenWrt"
VERSION="22.03-SNAPSHOT"
ID="openwrt"
ID_LIKE="lede openwrt"
PRETTY_NAME="OpenWrt 22.03-SNAPSHOT"
VERSION_ID="22.03-snapshot"
HOME_URL="https://openwrt.org/"
BUG_URL="https://bugs.openwrt.org/"
SUPPORT_URL="https://forum.openwrt.org/"
BUILD_ID="r19338+20-ae64d0624c"
OPENWRT_BOARD="ipq806x/generic"
OPENWRT_ARCH="arm_cortex-a15_neon-vfpv4"
OPENWRT_TAINTS="busybox"
OPENWRT_DEVICE_MANUFACTURER="OpenWrt"
OPENWRT_DEVICE_MANUFACTURER_URL="https://openwrt.org/"
OPENWRT_DEVICE_PRODUCT="Generic"
OPENWRT_DEVICE_REVISION="v0"
OPENWRT_RELEASE="OpenWrt 22.03-SNAPSHOT r19338+20-ae64d0624c"

Maybe the problems with airtime fairness, pings, speed or other things are not so much problems with airtime itself, but with omissions or unnecessary corrections... Maybe patches like link are not needed at all (clarity sometimes not indicated), and, for example, link - old junk from prehistory - may be helpful in development. To solve the problem you sometimes have to dig deeper. I would like to use the second patch from that post :wink: - similar elements are in the cfg.c file (IEEE80211_NUM_ACS), but I don't know if it makes sense to play with it.

Hi Dave,

Here are some plots for you. The latency plots for the original AQL/VTBA look smoother than for the new AQL/RR. However, AQL/VTBA may exhibit extremely high latency in corner cases (e.g. clients leaving the WiFi network). Over the next several days I will test whether that high-latency problem still occurs in those corner cases with the new AQL/RR.

There were 12 associated 5 GHz clients. 3 of them were streaming YouTube and the rest were mostly idling.

For all these tests, I used vanilla ath10k instead of ath10k-ct.

Edit 1: Looking at the plots more carefully, the latency for the previous AQL/VTBA may look smoother only because its latency scale range is bigger, which flattens the fluctuations to some extent.

Edit 2: As we can see, the much bigger latency scale range in the plots for the previous AQL/VTBA implementation is caused by random, drastic latency spikes during the test runs; the scale had to grow to accommodate them. In other words, the previous AQL/VTBA may look superficially good, but it is not necessarily better than the new AQL/RR. I will try much longer runs (each of the plots shown covers about 70 seconds) to see whether the large latency spikes are suppressed in the new AQL/RR.


Master snapshot (Sunday 26 2022) with Felix' commits (revert back to deficit round-robin scheduler)

======
1st try:
rr_tcp_download_1

rr_tcp_upload_1

======
2nd try:
rr_tcp_download_2

rr_tcp_upload_2




22.03 snapshot (May 8 2022) with default AQL/ATF enabled

======
1st try:

vtba_tcp_download_1

vtba_tcp_upload_1

=======
2nd try:

vtba_tcp_download_2

vtba_tcp_upload_2



The following plots were from longer run times: 6 minutes per download and per upload.

Image: Master snapshot (Sunday 26 2022) with Felix' commits (revert back to deficit round-robin scheduler)

tcp_download

tcp_upload


Hi Felix,

I have tried the master snapshot with all your most recent commits, and so far I have not observed any extreme latency (which could last a few minutes) when clients were leaving the WiFi network. I also no longer see log messages like the following in such cases.

Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.211061] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.218685] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7

However, I still see some of the following log messages with the latest commits (they also appeared in past builds prior to your recent commits). Unfortunately, I have not yet been able to associate them with any particular event.

[14509.223048] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode
[14509.223080] ath10k_pci 0000:01:00.0: received unexpected tx_fetch_ind event: in push mode

Thanks so much for all your great contributions to the betterment of FOSS.


Not answering on behalf of @nbd here, but the recent commits are all in the mac80211 code, while the logs shown are from the ath10k driver. So the issues and logs thrown by ath10k will continue until their cause has been identified and fixed.

For me, my R7800 (running 21.02 and the mainline ath10k driver) and my E8450 (running master) are not experiencing high latency often (I still see it once in a while), but that is no different from before the switch to the VTBA scheduler, so it looks like the patch is working. I have yet to try the latest patches committed into master, only those in @nbd's staging tree. Since then, additional fixes have been committed into master along with a later mt76 commit, so I'll be trying that on the E8450 and my R7800 as well.


Thanks for testing. I have now backported the changes to the 22.03 branch as well, since it's an obvious improvement compared to the VTBA code.


thx for using flent!!

Those look a great deal better, and also show a clear periodicity in the longer trace. It looks to me like a 200-second test will suffice to show that. (As to why it is there, I don't know - multicast? probes? other traffic? the AQM acting up? the rate controller? gravitational flux?)

You can get finer granularity with --step-size=.05. For comparison plots it helps if you either scale them the same or put the two plots on the same graph with different names, via -t the_test_conditions when you take them.

flent-tools, running on the router (set up over ssh), can poll the AQM stats. @toke_dk?


See how much less noisy my data was 6 years ago?

I only had 2 machines at close range on that network, and that was only achieving 60 Mbit/s. I am inclined to suggest lowering the max advertised rate to see if the periodicity remains. I started off this whole bug by only allowing MCS1 - MCS4 (I think!)
