AQL and the ath10k is *lovely*

Thanks, @sjpacket. The NULL pointer dereference crash that you pointed out also seems to be caused by OOM. It happens in a completely unrelated place, so I don't think it's directly related to my patch, except for the fact that my patch might slightly change memory usage patterns.
How is latency/throughput with my patch on mainline ath10k?


Hi @vochong,
regarding ATF and AQL: AQL is essential for making ATF work well. If AQL significantly reduces throughput, then that's an important issue which needs to be fixed. Please let me know how well the current version with my patches works for you with AQL and ATF enabled.


A few questions about your testing:

  • When using iperf3 for your tests, are you streaming data from the AP out to the clients, vice versa, or trying both? Both ways can be interesting, but it's my understanding that AQL/ATF provides the most benefit (it's used the most?) for streaming data out from the AP to the clients.
  • Where are the iperf3 server(s) running (i.e. AP, client via wire to AP, etc)? If you're running your iperf3 servers on the AP, please keep an eye on your AP cpu usage when observing throughput. I don't recall ever observing throughput drops when other clients connect with only one client streaming. I'll see if I can reproduce that on my device.
  • When using the latest set of patches, is ATF disabled in the ath10k-ct driver (this is more a question for @nbd i.e. should it be)?
  • You reported your device as "IPQ8064, QCA9984, QCA9980". Given the rates you can get with your wifi clients, I'm assuming you're testing on 5GHz and that 5GHz is (only?) the QCA9984 - is that correct? I ask as my ipq8064 device has QCA9980 for 2.4 and 5GHz.

I've built images but I can't get them on my AP to test quickly due to it being in use by family most of the time.

When I do multi-client tests, I prefer to use a netperf server running on a client wired to the AP. A single netperf server will let you stream data in both directions (server to clients, clients to server). The downside of netperf is that there does not seem to be a client for Windows (I think running it from WSL will work, though).

Given your reported wifi rates above, a single netperf server might saturate even a 1 Gbps wired connection with 3+ clients. Also, even if the iperf/netperf server(s) are on a separate device, keeping an eye on your AP cpu is a good idea if testing with 3+ clients.

In the past I've looked into the flent rtt_fair test (mentioned by dtaht above), but I'm uncertain it would show much when connecting to multiple servers outside your local network if you have a slow ISP connection (like mine). Perhaps there is a way to set up one's own local servers for this test? (Again, not a question for you.)

Thank you for testing without the Apple devices. I suggested doing that as it's possible there are other issues with the ath10k-ct driver/firmware that result in latency symptoms. Crashing due to OOM is not one of those symptoms, though.

As soon as I get an opportunity to test, I'll report back. Likely at least a couple of days to do that.

If I understand this correctly, the server instances really are machines (symbolic dns names or numeric IP addresses) which have netserver running and a firewall that allows accessing netperf from the outside. So you could simply distribute multiple servers in your internal network? The bigger challenge is that most local nodes will have similar RTTs but the first question about RTT fairness is 'are rates roughly equal at equal RTT', so this might be a decent test bed.


What I am hoping for with more detailed testing is to see a detailed pattern of stalls that lead to an AHA! moment. I fear aircaps will also be needed. While we are establishing some groundwork for the crashes, what form of WPA is on or off for these tests?

For the historical record, this regression on this thread is turning out harder than http://blog.cerowrt.org/post/crypto_fq_bug/ was... :frowning:

(Again, we have other problems, but in the "working" non -ct code, do you see a reasonable pattern of codel marks or drops?) This is what a single tcp upload looks like when I've locked the wifi to 6Mbits, using tcptrace -G and xplot.org on the resulting packet captures. Locking the wifi to a slow rate allows for testing of the wifi vs an external server over a faster ISP.

[image: 1flow]

A nice consistent set of drops and sack responses. I tend to ALSO use ecn as a debugging tool (is codel working?), sysctl -w net.ipv4.tcp_ecn=1 on both the server and client, and that ends up looking like this:

[image: ecn]

You can actually capture the CE marks if your capture point is after the router, but merely seeing CWR and no sacks is usually indicator enough. These plots are from a known working ath9k laptop to ath10k based turris, running an older version of openwrt.
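
Since the ECN trick only works when both ends really have it switched on, a quick check like the one below can save a confusing capture. This is only a sketch: ecn_mode is a made-up helper name, and the optional file argument exists purely so the function can be tried against a scratch file. By default it reads the standard Linux sysctl path /proc/sys/net/ipv4/tcp_ecn.

```shell
#!/bin/sh
# Describe the tcp_ecn mode stored in FILE (defaults to the live sysctl).
# ecn_mode is a hypothetical helper, not part of any package.
ecn_mode() {
    file=${1:-/proc/sys/net/ipv4/tcp_ecn}
    case "$(cat "$file")" in
        0) echo "off" ;;
        1) echo "on: requested on outgoing connections, accepted on incoming" ;;
        2) echo "passive: accepted on incoming only (Linux default)" ;;
        *) echo "unknown" ;;
    esac
}
```

Running ecn_mode on both the server and the client before capturing should report "on" for each once sysctl -w net.ipv4.tcp_ecn=1 has been applied.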


It's never just one bug. Dumb AP with linksys e8450 / Belkin RT3200 occasionaly client associates but no connectivity - #49 by Lynx


Hi Felix @nbd, thank you for your work in fixing AQL and also maintaining the mt76 driver. I know this thread is mainly about ath10k, but some people are reporting connectivity issues on the mt76 driver that are supposedly fixed by disabling AQL, as mentioned at Belkin RT3200/Linksys E8450 WiFi AX discussion - #2359 by jj86.

Does mt76 use AQL and does it directly do ATF like ath9k? Does the Virtual Time-Based Airtime scheduler (that is yet to be reverted in OpenWrt 22.03) affect mt76 like it affects ath10k?

I am using a Belkin RT3200 (mt7622, mt7915) with 5GHz 802.11ax AP mode and sometimes I have an issue with my Google Pixel 6 phone that occurs randomly (with nothing related in the openwrt system log), wherein wifi remains connected but no data loads in any browser or app. Toggling wifi off/on on the phone brings back connectivity. Not sure if that is related to AQL.

I can offer a step back re ath10k vs -ct firmware - with every release update of 21.02 (.1, .2, .3 ...) I've updated my Archer A7-us router, and every time had to uninstall ath10k-ct firmware and kmod-ct and use ath10k (non -ct flavors) -- due to blatant network stuttering.

Sorry, no further details on that stuttering are available, except that the A7 had 4 wifi clients (2 5GHz, 2 2.4GHz) and was a simple home access point network; and that when I'd hit stuttering my first thought was ... 'yes, just updated to the next release of openwrt'.

I've since replaced the A7 with a Netgear Nighthawk X4S R7800 -- with some of the same issues and solution.

Suggestion: stick with non -ct ath10k bits when digging into these bugs (plural use deliberate) or the multiple use case scenarios may be unworkable. I'm not trying to throw rocks here ... just that I've had issues with -ct on multiple devices over a significant span of time.

In addition to reverting to ath10k firmware/kmod, my R7800 runs much better with 21.02.1 than with the 21.02.3 I initially tried. round robin.

Hi @ka2107,
mt76 uses AQL, but I suspect that if disabling it fixes some issues, it's only hiding the real bug. Not sure how much the VTBA scheduler affects it other than producing very bursty tx behavior.
I'm also looking into a driver regression related to wifi-6 that seems to result in tx stalls in my tests. That one is unrelated to AQL or ATF though.


@Mpilon,

Yes, I have a similar experience. The non-ct ath10k + 21.02.1 seems to be the most reliable combination.


@ka2107

Regarding "wifi remains connected but no data loads in any browser or app".

Did you notice whether someone had just left the house (with their phone) before you ran into the problem? I had the same problem as the bug described below. Please note that the AQL disabling workaround only helped reduce the frequency of the problem. It did not fully get rid of it.

WiFi stalls for a few minutes after the 21.02.2 update · Issue #9455 · openwrt/openwrt · GitHub

Device: Netgear R7800 (IPQ8065, QCA9984)
WiFi driver + firmware: mainline (not CT)
Band: 5 GHz

When some devices such as phones leave the network, other devices left on the wifi network experience high latency and heavy packet loss for a few minutes.

This started happening after the update to 21.02.2, it was very stable before.

Others are reporting the same issue, see this thread:
https://forum.openwrt.org/t/ipq806x-nss-drivers/12613/2557

Workaround that appears to help:

echo 0 > /sys/kernel/debug/ieee80211/phy0/aql_enable
echo 0 > /sys/kernel/debug/ieee80211/phy1/aql_enable

Potentially related:
https://lists.infradead.org/pipermail/ath10k/2022-February/013341.html

I suspect it is some of the backports in hostapd/mac80211 that were introduced between 21.02.1 and 21.02.2

Not sure whether it affects only ath10k or other devices also.
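
For anyone wanting to try the workaround on every radio at once, and to flip it back afterwards, here is a minimal sketch. set_aql is a hypothetical helper name, not something shipped with OpenWrt, and the second argument is only there so the loop can be pointed at a test directory instead of debugfs. It assumes debugfs is mounted at /sys/kernel/debug, as in the commands above.

```shell
#!/bin/sh
# Write VALUE (0 = off, 1 = on) to aql_enable for every phy under BASE.
# BASE defaults to the debugfs path used in the workaround above.
# set_aql is a made-up name for illustration.
set_aql() {
    value=$1
    base=${2:-/sys/kernel/debug/ieee80211}
    for f in "$base"/phy*/aql_enable; do
        [ -e "$f" ] || continue    # glob did not match, or this phy has no AQL knob
        echo "$value" > "$f" && echo "$f <- $value"
    done
}
```

set_aql 0 turns AQL off on all phys and set_aql 1 turns it back on. These are debugfs values, so they do not survive a reboot.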

+1 I have been banging on about this for some time.

I have seen many times iPhone connected to WiFi but no data. It reminds me of the mesh issues where reported Wi-Fi connection ul/dl bw would go down very low and the mesh node would become unresponsive even when it was shown as being still connected.

Whatever the cause I am hopeful that this will get resolved somehow to help improve the WAF with me using OpenWrt + RT3200. Didn't happen with broadcom-based Asus router.

Please try the latest mt76 update that I just pushed, preferably with the ATF changes from my staging tree.

@sjpacket,

Can you please post your image "openwrt 22.03 + ath10k + latest patches" somewhere so other people and I can use the same image to report any observations/issues to Felix? It's better to use the same image so we can prevent any other potential differences from affecting our observations. I think the task for us in validating Felix's fixes is to confirm that the following hold. Consistency in our observations is the key to helping Felix nail down the right culprit.

  • Disassociated WIFI clients (someone with phone leaving the house) don't cause very high latency or huge packet losses on currently associated clients (this is very easily observable during online meetings or games)

  • Multiple associated clients do not negatively affect the throughput.

  • No excessive memory consumption or crash over a few days.

Thanks!


Thank you for the mt76 update. I have flashed the latest MASTER SNAPSHOT in my Belkin RT3200

OpenWrt SNAPSHOT r19873-a703f9ed0b / LuCI Master git-22.167.28356-8effea5

which contains the mt76 version https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=a703f9ed0b02896eb7bec51d5b5f809c01bc20e0

Since this issue occurs randomly, I do not have a reliable way to test it. If the issue occurs again with this snapshot build, I will report here.

I look forward to seeing the mt76 updates in OpenWrt 22.03 branch as well so that the next RC or stable release contains the fixes. The only reason I am using MASTER SNAPSHOT instead of 22.03-SNAPSHOT or 22.03-RC build is because of updated mt76 version.

EDIT: I am hoping that this issue is fixed in ath10k as well as I have a R7800 (currently running 22.03-rc4 with ath10k non-ct firmware and ath10k-smallbuffers non-ct driver) being used in a different location. I can only connect to the R7800 from the WAN side and therefore cannot test the R7800 WiFi myself. The users at that location are not familiar with OpenWrt but they will let me know if they face any connectivity / lag issues with R7800's WiFi. However they haven't reported any issues till now.


It's just me and it's a single AP (Belkin RT3200) and pretty much a single client (Google Pixel 6) situation.

Apart from my Pixel 6, I also have:

  1. Laptop with "Intel Wireless-AC 9560 160MHz" WLAN card: Windows 11 22H2 and Arch Linux x86_64. Personal device.

  2. Laptop with "Intel Wireless-AC 8265" WLAN card: Windows 10 21H2. Work device.

However both the above devices are connected most of the time via Ethernet to the RT3200 and I almost never connect WiFi on them.

My WiFi config:
ISP: USA, Comcast Xfinity, DOCSIS 3.1, 50 Mbps download, 10 Mbps upload
Channel 165 (5 GHz)
HE20 (802.11ax / WiFi 6)
Beacon Interval = 100

SSID_1 (for Personal devices)
DTIM Period = 5
WPA3-SAE (not WPA2/WPA3 mixed mode), only hash-to-element H2E, not hunting-and-pecking
802.11w = Mandatory / Required

SSID_2 (for Work device / Guests)
DTIM Period = 5
WPA2, AES/CCMP
802.11w = Disabled

I posted my /etc/config/wireless at 802.11r Fast Transition how to understand that FT works? - #105 by ka2107 if you are interested.


I am unable to tinker with my router, as it's used in a production environment, but maybe an additional data point is useful. Especially since this is NOT an ath10k device, which points to this issue not being device specific.

I am using an rt-ac57u (MediaTek MT7603EN + MediaTek MT7612EN WiFi chips), and 21.02.1 is the last stable release for me. Version 21.02.2 will randomly start to experience extremely high latency and packet loss on 5GHz WiFi, and WiFi needs to be restarted to fix it. Furthermore, 2.4GHz WiFi needs to be disabled altogether under 21.02.1 because it will quite frequently result in a kernel oops that sometimes even takes down the whole router with it.

My only gripe with 21.02.1 is that it sometimes runs into a memory leak (doesn't slowly happen over time, RAM usage is normal, but suddenly balloons very quickly when the bug is triggered). But other than that, it's MUCH better than 21.02.2 or later versions are.

I have seen more people mention that 21.02.1 is the last stable release for them, so looking at the commits that happened between 21.02.1 and 21.02.2 is probably a good starting point to look for this bug.


I have two R7800s that have max 3 clients that connect simultaneously on 5GHz and 2 clients on 2.4GHz.
I left them running with all the latest patches (I mean patches pushed up to this one) and they still haven't given any error. Currently both are on their 5th day. There are no Mac iOS devices.
Apart from both R7800s, I've tried the same master build on three other R7800s that usually have more than 5 clients sometimes reaching up to 10 devices. All of them quickly got
kern.warn kernel: ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 76 tid 0
in the log and the 5GHz WLAN became inoperable. No throughput until it was restarted. There are several iOS devices on two of the routers and none on the last one. Unfortunately I cannot closely monitor the routers with the iOS devices.
I know it's a dumb shot in the dark but I wonder why fewer clients do not cause the error to appear. At least so far. Maybe others can test this too.
Since this commit the Wi-Fi 5GHz is completely unusable on the three routers.
Huge latency, very low throughput and frequent errors "no connection" for all 5GHz clients on the routers with more than 5 clients.
Before this commit the above error shows up in the log but the WLANs continue working.
All routers use WPA2/WPA3 encryption. Most of the WLAN clients are WPA3 capable.
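
To compare routers (few clients vs. many) it may help to put a number on how often this warning shows up. A rough sketch, assuming the log lines look exactly like the one quoted above. count_txq_failures is a made-up name, and it reads log text on stdin so it works with logread output or a saved log file.

```shell
#!/bin/sh
# Count "failed to lookup txq" warnings per ath10k PCI device.
# Reads log text on stdin; prints "<pci-address> <count>" per device.
# count_txq_failures is a hypothetical helper for illustration.
count_txq_failures() {
    awk '/failed to lookup txq for peer_id/ {
             for (i = 1; i < NF; i++)
                 if ($i == "ath10k_pci") {
                     dev = $(i + 1)
                     sub(/:$/, "", dev)    # drop the trailing colon after the address
                     n[dev]++
                 }
         }
         END { for (d in n) print d, n[d] }' | sort
}
```

On the router this would be used as: logread | count_txq_failures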


I just tried to "baseline" my r7500v2 to see if I can reproduce any of your symptoms before more testing (i.e. I have not yet loaded an image with nbd's commit - hopefully I can do that later today).

The short answer is my set up is rock solid atm as far as I can see with the testing I've done.

Some details:

  • master build from a little more than a month ago (uptime on this build is 41+ days)
  • ath10k-ct with the htt firmware (not the "full" version)
  • AP only setup, 2 vlans, swconfig (not DSA); tried the netperf server on the r7500v2 AP, on my router (i.e. through the vlan), and one on a wired ubuntu box on the r7500v2 switch.
  • testing with 6+ wifi clients on 5GHz channel 36 (one macbook air, an iphone 7, a couple of android devices, and two ubuntu laptops)
  • AQL is enabled, ATF is not

I used irtt to monitor rtt times with the irtt server on the router (through the vlan). Unloaded rtt is about 3-10 ms (apple seems to like to idle at 10 ms, ubuntu will report 3-4 ms rtt on average). Loaded (using netperf -t TCP_MAERTS ... to stream data out from the AP to the wifi clients) rtt averages about 20 ms. Over the course of 30 minutes of testing, maybe once I saw a ~156 ms rtt which didn't seem to correlate with any of the disconnect/reconnecting I tried with the other clients.

Netperf using the macbook (alone - multiclient netperfs will see a drop in throughput but that is consistent) can hit 540 Mbps (when the netperf server is on the r7500v2) but typically is just under 500 Mbps. I never saw a dip in throughput that correlated with reconnecting/disconnecting clients (I went from only one client to the max 6 and then back down again several times). What did pop out at me is that one r7500v2 cpu is maxed out. The other cpu is at 40% when running the netperf server on it, about 30% with the netperf server on a wired device (one cpu was always pegged at these speeds). If you really are using an ipq8064 device (like mine), then some of your throughput observations might be cpu limited/related.
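
Since one pegged core is easy to miss in an averaged CPU figure, here is a small sketch for getting per-core numbers out of two /proc/stat snapshots. cpu_busy is a hypothetical helper. It treats idle and iowait as idle time and everything else (including softirq, where much of the wifi/network work tends to land) as busy.

```shell
#!/bin/sh
# Print per-core busy percentage between two snapshots of /proc/stat.
# Columns after the cpuN label: user nice system idle iowait irq softirq steal ...
# cpu_busy is a made-up name; $1 = earlier snapshot, $2 = later snapshot.
cpu_busy() {
    awk '
        /^cpu[0-9]+ / {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle = $5 + $6                      # idle + iowait
            if (NR == FNR) {                    # first file: remember the baseline
                t0[$1] = total; i0[$1] = idle
                next
            }
            dt = total - t0[$1]; di = idle - i0[$1]
            if (dt > 0) printf "%s %.0f%%\n", $1, 100 * (dt - di) / dt
        }
    ' "$1" "$2"
}
```

Something like: cat /proc/stat > /tmp/s0; sleep 5; cat /proc/stat > /tmp/s1; cpu_busy /tmp/s0 /tmp/s1 while a netperf run is in progress should show whether a single core is the bottleneck.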

HTH
