AQL and the ath10k is *lovely*

+1 I have been banging on about this for some time.

I have often seen an iPhone connected to WiFi but passing no data. It reminds me of the mesh issues where the reported Wi-Fi upload/download bandwidth would drop very low and the mesh node would become unresponsive even though it was still shown as connected.

Whatever the cause, I am hopeful that this will get resolved somehow to help improve the WAF with me using OpenWrt + RT3200. It didn't happen with a Broadcom-based Asus router.

Please try the latest mt76 update that I just pushed, preferably with the ATF changes from my staging tree.

@sjpacket,

Can you please post your image ("openwrt 22.03 + ath10k + latest patches") somewhere so that other people and I can use the same image to report observations and issues to Felix? Using the same image prevents other potential differences from affecting our observations. I think our task in validating Felix's fixes is to confirm whether the following main issues still exist. Consistency in our observations is key to helping Felix nail down the right culprit.

  • Disassociated WIFI clients (someone with phone leaving the house) don't cause very high latency or huge packet losses on currently associated clients (this is very easily observable during online meetings or games)

  • Multiple associated clients do not negatively affect the throughput.

  • No excessive memory consumption or crash over a few days.

Thanks!
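To make the first checklist item comparable between testers, a capture of `ping` output taken on an associated client while another client leaves could be summarised as below. This is only a minimal sketch: the function name and the 100 ms spike threshold are arbitrary assumptions, and it expects reply lines in the usual iputils or BusyBox style (`seq=` or `icmp_seq=`, followed by `time=... ms`).

```python
import re

# Matches reply lines such as:
#   64 bytes from 192.168.1.1: icmp_seq=7 ttl=64 time=3.21 ms   (iputils)
#   64 bytes from 192.168.1.1: seq=7 ttl=64 time=3.21 ms        (BusyBox)
REPLY = re.compile(r"(?:icmp_)?seq=(\d+).*?time=([\d.]+) ms")


def summarize_ping(lines, spike_ms=100.0):
    """Return (lost, spikes) for a sequence of ping output lines.

    lost   -- sequence numbers missing between the first and last reply seen
    spikes -- (seq, rtt_ms) pairs whose RTT exceeds spike_ms
              (spike_ms is an arbitrary threshold chosen for illustration)
    """
    rtts = {}
    for line in lines:
        match = REPLY.search(line)
        if match:
            rtts[int(match.group(1))] = float(match.group(2))
    if not rtts:
        return [], []
    lost = [s for s in range(min(rtts), max(rtts) + 1) if s not in rtts]
    spikes = [(s, rtts[s]) for s in sorted(rtts) if rtts[s] > spike_ms]
    return lost, spikes
```

Feeding it the saved output of a long-running ping, with a note of when a phone left the house, would show whether losses and spikes cluster around that moment.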

Thank you for the mt76 update. I have flashed the latest MASTER SNAPSHOT in my Belkin RT3200

OpenWrt SNAPSHOT r19873-a703f9ed0b / LuCI Master git-22.167.28356-8effea5

which contains the mt76 version https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=a703f9ed0b02896eb7bec51d5b5f809c01bc20e0

Since this issue occurs randomly, I do not have a reliable way to test it. If the issue occurs again with this snapshot build, I will report here.

I look forward to seeing the mt76 updates in OpenWrt 22.03 branch as well so that the next RC or stable release contains the fixes. The only reason I am using MASTER SNAPSHOT instead of 22.03-SNAPSHOT or 22.03-RC build is because of updated mt76 version.

EDIT: I am hoping that this issue is fixed in ath10k as well as I have a R7800 (currently running 22.03-rc4 with ath10k non-ct firmware and ath10k-smallbuffers non-ct driver) being used in a different location. I can only connect to the R7800 from the WAN side and therefore cannot test the R7800 WiFi myself. The users at that location are not familiar with OpenWrt but they will let me know if they face any connectivity / lag issues with R7800's WiFi. However they haven't reported any issues till now.

It's just me and it's a single AP (Belkin RT3200) and pretty much a single client (Google Pixel 6) situation.

Apart from my Pixel 6, I also have:

  1. Laptop with "Intel Wireless-AC 9560 160MHz" WLAN card: Windows 11 22H2 and Arch Linux x86_64. Personal device.

  2. Laptop with "Intel Wireless-AC 8265" WLAN card: Windows 10 21H2. Work device.

However both the above devices are connected most of the time via Ethernet to the RT3200 and I almost never connect WiFi on them.

My WiFi config:
ISP: USA, Comcast Xfinity, DOCSIS 3.1, 50 Mbps download, 10 Mbps upload
Channel 165 (5 GHz)
HE20 (802.11ax / WiFi 6)
Beacon Interval = 100

SSID_1 (for Personal devices)
DTIM Period = 5
WPA3-SAE (not WPA2/WPA3 mixed mode), only hash-to-element H2E, not hunting-and-pecking
802.11w = Mandatory / Required

SSID_2 (for Work device / Guests)
DTIM Period = 5
WPA2, AES/CCMP
802.11w = Disabled

I posted my /etc/config/wireless at 802.11r Fast Transition how to understand that FT works? - #105 by ka2107 if you are interested.

I am unable to tinker with my router, as it's used in a production environment, but maybe an additional data point is useful, especially since this is NOT an ath10k device, which suggests that this issue is not device specific.

I am using an rt-ac57u (MediaTek MT7603EN + MediaTek MT7612EN WiFi chips), and 21.02.1 is the last stable release for me. Version 21.02.2 will randomly start to experience extremely high latency and packet loss on 5GHz WiFi, and WiFi needs to be restarted to fix it. Furthermore, 2.4GHz WiFi needs to be disabled altogether under 21.02.1 because it quite frequently results in a kernel oops that sometimes even takes down the whole router with it.

My only gripe with 21.02.1 is that it sometimes runs into a memory leak (doesn't slowly happen over time, RAM usage is normal, but suddenly balloons very quickly when the bug is triggered). But other than that, it's MUCH better than 21.02.2 or later versions are.

I have seen more people mention that 21.02.1 is the last stable release for them, so looking at the commits that happened between 21.02.1 and 21.02.2 is probably a good starting point to look for this bug.

I have two R7800s that have max 3 clients that connect simultaneously on 5GHz and 2 clients on 2.4GHz.
I left them running with all the latest patches (I mean patches pushed up to this one) and they still haven't shown any errors. Both are currently on their fifth day of uptime. There are no Mac or iOS devices.
Apart from these two R7800s, I've tried the same master build on three other R7800s that usually have more than 5 clients, sometimes reaching up to 10 devices. All of them quickly logged
kern.warn kernel: ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 76 tid 0
and the 5GHz WLAN became inoperable, with no throughput until it was restarted. There are several iOS devices on two of the routers and none on the last one. Unfortunately I cannot closely monitor the routers with the iOS devices.
I know it's a dumb shot in the dark, but I wonder why fewer clients do not cause the error to appear, at least so far. Maybe others can test this too.
Since this commit, 5GHz Wi-Fi is completely unusable on the three routers with more than 5 clients: huge latency, very low throughput and frequent "no connection" errors for all 5GHz clients.
Before this commit, the above error showed up in the log but the WLANs continued working.
All routers use WPA2/WPA3 encryption. Most of the WLAN clients are WPA3 capable.
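For anyone who wants to compare how often that warning appears on lightly and heavily loaded routers, the log could be tallied with something like the following. It is a minimal sketch that assumes the log lines look exactly like the one quoted above; the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern taken from the warning quoted in the post above.
TXQ_ERR = re.compile(
    r"ath10k_pci (\S+): failed to lookup txq for peer_id (\d+) tid (\d+)"
)


def count_txq_failures(log_lines):
    """Count 'failed to lookup txq' warnings per (pci_device, peer_id, tid)."""
    counts = Counter()
    for line in log_lines:
        match = TXQ_ERR.search(line)
        if match:
            key = (match.group(1), int(match.group(2)), int(match.group(3)))
            counts[key] += 1
    return counts
```

The input could be the saved output of `logread` from each router; comparing the totals against the number of associated clients at the time would help confirm or rule out the client-count theory.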

I just tried to "baseline" my r7500v2 to see if i can reproduce any of your symptoms before more testing (i.e. I have not yet loaded an image with nbd's commit - hopefully i can do that later today).

The short answer is my set up is rock solid atm as far as I can see with the testing I've done.

Some details:

  • master build from a little more than a month ago (uptime on this build is 41+ days)
  • ath10k-ct with the htt firmware (not the "full" version)
  • AP only setup, 2 vlans, swconfig (not DSA); tried the netperf server on the r7500v2 AP, on my router (i.e. through the vlan), and one on a wired ubuntu box on the r7500v2 switch.
  • testing with 6+ wifi clients on 5GHz channel 36 (one macbook air, an iphone 7, a couple of android devices, and two ubuntu laptops)
  • AQL is enabled, ATF is not

I used irtt to monitor rtt times with the irtt server on the router (through the vlan). Unloaded rtt is about 3-10 ms (apple seems to like to idle at 10 ms, ubuntu will report 3-4 ms rtt on average). Loaded (using netperf -t TCP_MAERTS ... to stream data out from the AP to the wifi clients) rtt averages about 20 ms. Over the course of 30 minutes of testing, maybe once i saw a ~156 ms rtt which didn't seem to correlate with any of the disconnect/reconnecting I tried with the other clients.
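The comparison described above (idle RTT versus RTT while netperf is streaming) boils down to two numbers per run. A minimal sketch, assuming the RTT samples have already been extracted from the irtt output into lists of milliseconds; the function name is hypothetical:

```python
from statistics import mean


def load_penalty(unloaded_ms, loaded_ms):
    """Return (mean RTT increase under load, worst loaded RTT), both in ms."""
    return mean(loaded_ms) - mean(unloaded_ms), max(loaded_ms)
```

Reporting both values keeps a single outlier, like the one spike mentioned above, visible next to the average.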

Netperf using the macbook (alone - multiclient netperfs will see a drop in throughput but that is consistent) can hit 540 mbps (when the netperf server is on the r7500v2) but typically is just under 500 mbps. I never saw a dip in throughput that correlated with reconnecting/disconnecting clients (I went from only one client to the max 6 and then back down again several times). What did pop out at me is one r7500v2 cpu is maxed out. The other cpu is at 40% when running the netperf server on it, about 30% with the netperf server on a wired device (one cpu was always pegged at these speeds). If you really are using an ipq8064 device (like mine), then some of your throughput observations might be cpu limited/related.

HTH

While I do not have any Apple devices myself (yet), I have family living in another country with the following Apple devices:

  1. Macbook Air (2017, Intel) running macOS 12 Monterey
  2. iPhone 8 Plus running iOS 15
  3. iPhone SE (1st Gen) running iOS 15
  4. iPad Mini 2 running iOS 12

My family complained about WiFi issues and while doing web search on WiFi issues in Apple devices, I came across AWDL protocol that is used by newer macOS and iOS / iPadOS for Airdrop, Airplay, Universal Control etc.

Apple seems to create a peer-to-peer WiFi network (like Ad-Hoc / IBSS) between Apple devices, but using its own proprietary AWDL protocol. The problem is not the AWDL protocol itself, but rather that Apple uses the existing WiFi channels 6 (2.4 GHz band, 20 MHz width), 44 (5 GHz band, 40 MHz width) or 149 (5 GHz band, 40 MHz width), which causes issues for existing APs and clients that use regular WiFi Infrastructure mode on those channels.

The solution is to either disable AWDL (easier in macOS, harder/impossible in iOS / iPadOS) or use WiFi Channels other than 6, 44, 149 for your regular WiFi network.

I have set the Router/AP (Belkin RT3200) at that location to use channels 11 (2.4 GHz) and 36 (5 GHz) and I have not heard any complaints about WiFi after that change.
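The channel rule above can be written down as a tiny filter. This is only a sketch: the channel set is simply the one listed in this post, not an authoritative list, and the function name is made up. It compares primary channel numbers only and ignores channel width, so a wide channel whose span covers 44 or 149 would still pass.

```python
# Channels this post says AWDL uses (taken from the text above, not verified here).
AWDL_CHANNELS = {6, 44, 149}


def awdl_safe(candidates):
    """Return the candidate primary channels that are not in AWDL_CHANNELS."""
    return [ch for ch in candidates if ch not in AWDL_CHANNELS]
```

For example, passing the channels 1, 6, 11, 36, 44, 149 and 165 leaves 1, 11, 36 and 165 as candidates.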

https://fusecommunity.fortinet.com/blogs/jim1/2021/03/19/appletv-airplay-and-awdl-and-wifi

https://medium.com/@mariociabarra/wifried-ios-8-wifi-performance-issues-3029a164ce94

https://medium.com/@mariociabarra/wifriedx-in-depth-look-at-yosemite-wifi-and-awdl-airdrop-41a93eb22e48

Hi @anon98444528,

Just for clarification, your setup is:

  1. master branch
  2. kernel 5.15
  3. ath10k-ct-htt (PKG_SOURCE_VERSION:=f808496fcc6b1f68942914117aebf8b3f8d52bb3)
  4. you are running netperf server on the AP
  5. How did you disable ATF?
  6. What was the distance between macbook and the AP?
  7. Is the macbook air 2x2 ac?
  8. Are you running the AP 5GHz radio at 40MHz or 80MHz channel width?

Thank you so much for testing! It gives me a good baseline for my test setup. Just to confirm: I am using ipq8064.

1 yes. built about 41 or 42 days ago, up continuously since booting it.
2 kernel 5.10
3

r7500v2 # opkg list-installed | grep ath10k
ath10k-board-qca99x0 - 20220209-1
ath10k-firmware-qca99x0-ct-htt - 2020-11-08-1
kmod-ath10k-ct - 5.10.104+2021-11-28-dc350bbf-6

4 tested with netperf server on AP (but also did replicate tests with netperf server on clients wired to the AP). I generally get faster speeds with the netperf server on the AP for one client tests.
5 I don't disable ATF. ATF with the VTBS was never supported on the r7500v2 (QCA 9980 wifi). Hence I consider my results above consistent with what quarky observes - disable ATF VTBS and you don't see latency issues. (But yes, if you search around the forums you will find that I can now get ATF with the VTBS, I believe with an implementation similar to the one for mt76 devices, and I can enable/disable it at runtime via the ath10k-ct fwcfg api. Like quarky with his mt76 based device, I have not observed latency issues when I use this, but I've only done limited testing.)
6 all clients tested with line of sight, maybe 1.5 meters away
7 the 2019 macbook air has a broadcom bcm4355 wifi card (good luck finding specs - I haven't)
8 80MHz

FWIW I characterize one of my "wifi issues" as throughput jitter: rapid changes in throughput when multiple clients stream simultaneously and intermittently (which is why your results got my attention). I believe this causes issues with video conferencing but I can't show it conclusively yet. This started about the time AQL got implemented but I don't know how much of it is related to AQL (and the AQL round-robin implementation). I do think this is rate related. Speed kills, especially on 5GHz in wifi challenged locations (10+ m, through walls). Slow down clients and the network gets better for all.

I will test nbd's patches to make sure they function on my device, but I suspect the latency issues are more of a QCA9984 issue. I don't want to add too much to the noise this thread is currently getting.

Still waiting for children to stop gaming so I can test. Might have to wait until they go off to college at this point.

HTH

EDIT your QCA9984 wifi card does support ATF, your QCA9980 likely does not. The ath10k-ct driver likely can handle this, but it may be that the mac80211 implementation chokes on this - you may have to get into the weeds to know for sure.

@anon98444528

As you mentioned, your r7500v2 does not support ATF/VTBA and you did not notice any latency issue, just as quarky saw no more latency issues after rolling back from VTBA to round-robin. Can you please show the output of "iw phy | grep -i fairness" on your r7500v2?

I wish Felix would implement the ATF setting in a way that we can enable/disable it on the fly via some kernel setting (e.g. /sys/kernel/debug/ieee80211/phyX/atf_enable/disable, or something like that). That would allow people to try and use the method that is best suited/most reliable for their WIFI network. I do believe that various combinations of different WIFI clients do play a role in triggering different problems as reported by people in their WIFI network.

That already exists at /sys/kernel/debug/ieee80211/phy0/airtime_flags where you can disable/enable airtime fairness at both the TX and RX level.
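A small helper for interpreting the value read from that file might look like this. It is a sketch under one assumption that should be checked against the kernel source for your build: that bit 0 enables TX airtime accounting and bit 1 enables RX airtime accounting. The function name is hypothetical.

```python
# Assumed bit layout (verify against your mac80211 version before relying on it).
AIRTIME_USE_TX = 0x1  # bit 0: account TX airtime
AIRTIME_USE_RX = 0x2  # bit 1: account RX airtime


def describe_airtime_flags(text):
    """Decode the text read from airtime_flags into TX/RX booleans."""
    value = int(text, 0)  # accepts decimal ("3") or prefixed hex ("0x3")
    return {
        "tx": bool(value & AIRTIME_USE_TX),
        "rx": bool(value & AIRTIME_USE_RX),
    }
```

Under that assumption, a value of 3 means both directions are accounted, and writing 0 to the file clears both.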

@Mushoz: Thanks a lot!

With the current code, does disabling ATF imply a switch back to the round-robin scheduler?

I do not know the exact details, but as far as I know it should. Probably best to wait for a confirmation from someone that does know the details, though :slight_smile:

From what I can understand of the ATF code, the flags control the use of TX and RX airtime accounting. If set, the TX/RX time will be added to the ATF airtime consumption of a client transmit queue, which will then be used to determine if the client's transmit queue has used up its share of transmit time. So clearing the flags means the transmit queue will always be eligible for transmit, as the code will think that no airtime has been consumed.
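A toy model of that reading of the code, to make the consequence of clearing the flags concrete. This is not the mac80211 implementation; the class, the fixed budget and the flag values are all invented for illustration.

```python
USE_TX = 0x1  # illustrative flag values, assumed for this sketch
USE_RX = 0x2


class TxqAccount:
    """Toy airtime account for one client transmit queue."""

    def __init__(self, budget_us):
        self.budget_us = budget_us  # share of airtime before the queue must wait
        self.used_us = 0            # airtime charged so far

    def report(self, tx_us, rx_us, flags):
        """Charge airtime only for the directions enabled in flags."""
        if flags & USE_TX:
            self.used_us += tx_us
        if flags & USE_RX:
            self.used_us += rx_us

    def eligible(self):
        """A queue stays eligible while it has not used up its budget."""
        return self.used_us < self.budget_us
```

With both flags cleared, `report()` never charges anything, so `eligible()` stays true no matter how much airtime the client actually used, which matches the description above.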

But because the new VTAF queue data structure is organised in a red-black tree structure instead of a circular queue buffer (i.e. round-robin), I think the behaviour will be different. Only way is to test it out I suppose.
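To illustrate why the two data structures can behave differently, here is a toy comparison. It is a sketch only: a binary heap (`heapq`) stands in for the red-black tree, per-client weights are left out, each client is given a made-up fixed airtime cost per transmission, and none of this is taken from the kernel code.

```python
import heapq
from collections import deque


def round_robin(costs, turns):
    """Serve clients in circular order; return the order in which they were served."""
    queue = deque(sorted(costs))
    order = []
    for _ in range(turns):
        client = queue.popleft()
        order.append(client)
        queue.append(client)  # back of the line, regardless of airtime used
    return order


def virtual_time(costs, turns):
    """Always serve the client with the least accumulated airtime."""
    heap = [(0, name) for name in sorted(costs)]
    heapq.heapify(heap)
    order = []
    for _ in range(turns):
        used, client = heapq.heappop(heap)
        order.append(client)
        # Re-insert keyed by the airtime the client has now consumed.
        heapq.heappush(heap, (used + costs[client], client))
    return order
```

With one client that costs three times as much airtime per transmission as the other, the round-robin version alternates strictly, while the ordered version gives the cheaper client more turns so that accumulated airtime stays roughly balanced. If accounting is switched off, every key stays at zero and the ordering depends only on tie-breaking, which is one reason the result may not match a true round-robin queue.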

Already did above and in another thread (my posts can be a bit verbose).

EDIT: this command

cat /sys/kernel/debug/ieee80211/phy0/ath10k/wmi_services | grep -E '(AIRTIME|PEER_STATS)'

is useful if you start digging into different devices/drivers.
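When comparing several devices it can help to turn that output into a small table. The sketch below assumes each line of `wmi_services` consists of a service name followed by either `enabled` or `-`; that format and the function name are assumptions, so check them against the actual output on your device.

```python
def wmi_service_states(text, keywords=("AIRTIME", "PEER_STATS")):
    """Map matching WMI service names to True (enabled) or False."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect exactly two columns: service name and its state.
        if len(parts) == 2 and any(k in parts[0] for k in keywords):
            states[parts[0]] = parts[1] == "enabled"
    return states
```

Running it on the saved output from two routers and comparing the resulting dictionaries shows at a glance which firmware advertises airtime reporting or peer stats.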

If anyone is interested, skim through this thread starting at Dec 21. There are some details there about how ATF is different between devices (and drivers), how to disable it if you compile your own driver (but nbd already showed you how to do that above - see also this post in this thread one to two years ago...), how to get ATF on the r7500v2 with the ath10k-ct and how to enable/disable ATF via the ath10k-ct fwcfg api (but you will have to adapt that patch for the r7800 and compile your own driver). This thread is not a guide - just me trying to figure out why wifi sucks.

FWIW AQL/ATF can make wifi better - but it is a bit bugged atm.

As mentioned above, /sys/kernel/debug/ieee80211/phy0/airtime_flags apparently does not disable it or revert it to the round-robin scheduler, nor does playing with the ATF settings in hostapd (currently). I hope this changes in the future as well.

From yesterday's tests on master 5.15.45 with the latest patches from @nbd added, I can say that it is better: it no longer throws errors in the log with more than 5 active clients and thus does not block traffic. However, the downside I've noticed over the past few days is mediocre throughput at long distances (in my case a heavily suppressed room, 10 meters from the device).

There is another way to increase or decrease the range - the board.bin files used for calibration along with the router's firmware.bin. These board.bin files can also (depending on their settings) degrade or improve not only the transfer rate, but also the access times and the maximum range. Unfortunately I haven't found a way to edit those files, but only a dozen or so already made, which differ in the above properties.

The tests from the first paragraph used the firmware and kmod without -ct and the newest standard board.bin file. At 1-2 meters, transfers were 38-42Mbps with ping 23-27, probably the best so far given that I have an ea6350v3. With a cable the ping is 14-17 and transfers reach up to 45. At 10 meters it is worse: the ping is 25 to 32 (which is understandable) but transfers drop to 5-8Mbps. The range has also decreased dramatically.

Repeating the same tests after replacing the board.bin file with the one I used previously (a prepared file found on the web) improves the maximum range by up to 100%, and thus the speed at long distances, but worsens the speed by up to 50% at short distances. On the previous stable release (21.02), with the same board.bin file, the results from far away were comparable with those from nearby, i.e. about 20-25Mbps, although the pings did differ. I believe that AQL is not solely, or even mostly, to blame for the poor wifi results.

r7500v2 (QCA9980),
ath10k-ct with htt firmware,
master at commit 24e27bec9 (jun 21),
plus your commit 8c042341e4b (all patches)

Builds, loads, and runs fine so far. I've only tested a few minutes but I'll let this one run unless there is another change to test later.

I cannot reproduce @sjpacket's throughput reduction observations upon devices disconnecting or reconnecting either with or without your latest commit. Throughput on a device doing netperf/iperf will change drastically when other clients transmit data; however, throughput always recovers for me if the data transfer from other clients is transient.

I also do not have the latency symptoms others report with QCA9984 either with or without your latest commit - not surprising given my device does not support ATF and the VTBS (without modification at least).

HTH

Not sure if this can help to get you started (but I suspect you might have already seen that post). What to change in those files is likely the real challenge. Good luck.
