AQL and the ath10k is *lovely*

It is not on CT; it is on the mainline ath10k firmware and driver. It may have to do with devices leaving the network, e.g. a phone leaving the house. It feels like the AP then gets stuck, perhaps still trying to send to that device, but after a few minutes the device is dropped from the AP and the rest of the devices seem to recover.

Have we confirmed whether this WiFi stall issue occurs only on the ath10k (old, mainline) driver and not on the ath10k-ct drivers?

I do wonder if a simple fix would be to bypass AQL for devices with very low signal... I mean, the problem really seems to occur when a device gets disconnected because it's too far from the router. In that specific case the algorithm tries to optimize everything and fails, as the device never actually responds with the correct stats... @quarky any hint about this?

I would like to do some testing. Is it possible to reduce the CoDel target to 10 ms (it is currently 20 ms)? I know this might cause unnecessary drops, but I'm running a base rate of 24 Mbps without issues, this is only for testing, and I don't mind reduced throughput at this stage.

Can it be done by changing parameters through debugfs?

Sorry to bring this back to life. Was this change made at compile time? Or is it exposed as a configurable setting?

Well, my test with my E8450 didn't unveil anything in particular, though. I do notice that when my device is connected to the 2.4GHz interface, it tends to lag more than on the 5GHz interface. Maybe we should have different AQL limits for interfaces with different speeds?

When I test on my R7800, the signal strength is strong and I'm using the 5GHz interface only. After a few days of uptime, with clients connecting and disconnecting, I hit extreme lag, and only a restart of the WiFi or the router fixes it.

So it looks like a combination of AQL and the new virtual time-based scheduler, and it looks particularly bad with the ath10k driver. Switching back to the round-robin scheduler stopped all complaints about Wi-Fi connectivity for the R7800 (21 days uptime and counting).

So at the moment, the root cause is still a mystery, well, at least for me.

Are we sure the different algorithm is the only change made to the code?

Yup, that's the only change I did to my R7800 with the latest pull from 21.02. AQL is enabled.

Edit: In addition to my NSS additions that is.

I think I may have stumbled across why the new virtual time-based airtime scheduler is causing problems for ath10k routers. I think the issue may be traced to one method (ath10k_mac_op_wake_tx_queue) in the ath10k driver that tries to schedule packets to wireless clients.

From what I can understand of this method, mac80211 calls it to make a transmit queue (txq) active in the firmware so that the firmware can start transmitting the packets in that queue. Instead of activating the txq that mac80211 passed in, the method looks for another txq and activates that one. It also updates the firmware with accounting information about the txq's transmit duration, and it does so for both the txq that mac80211 passed in and the new txq that it found.

So the net effect is that the firmware may try to send packets for a txq that mac80211 did not intend to wake or that has nothing to send, artificially starving the txqs that do have data to transmit because of the incorrect accounting. This probably will not affect routers with a small number of clients. It will probably be more pronounced with more clients or as uptime grows. That may explain why the WiFi interface needed to be restarted every once in a while.
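
To make the mismatch concrete, here is a toy model in plain C. It is only a sketch of the behaviour as described in this post, not the ath10k or mac80211 code: the names toy_txq, toy_hw, toy_wake_pick_next and toy_wake_requested are made up for illustration, and the "scheduler" is assumed to be a simple cursor that cycles over all queues, which is much cruder than what mac80211 really does.

```c
#include <assert.h>

#define NUM_TXQ 3

/* Toy stand-in for a per-client transmit queue (not struct ieee80211_txq). */
struct toy_txq {
	int backlog; /* packets waiting to be sent */
	int sent;    /* packets handed to the pretend firmware */
	int updates; /* times this queue's state was reported to the firmware */
};

/* Toy stand-in for the device: a few queues plus a scheduler cursor. */
struct toy_hw {
	struct toy_txq q[NUM_TXQ];
	int cursor; /* assumed: index of the queue the scheduler hands out next */
};

/* Push one packet from queue idx if it has any, then report its state. */
static void toy_push_and_update(struct toy_hw *hw, int idx)
{
	assert(idx >= 0 && idx < NUM_TXQ);
	if (hw->q[idx].backlog > 0) {
		hw->q[idx].backlog--;
		hw->q[idx].sent++;
	}
	hw->q[idx].updates++;
}

/*
 * Flow as described in the post: report the requested queue, then serve
 * whichever queue the scheduler cursor yields instead.
 * Returns the index of the queue that was actually served.
 */
int toy_wake_pick_next(struct toy_hw *hw, int requested)
{
	int picked = hw->cursor;

	assert(requested >= 0 && requested < NUM_TXQ);
	hw->q[requested].updates++;
	hw->cursor = (hw->cursor + 1) % NUM_TXQ;
	toy_push_and_update(hw, picked);
	return picked;
}

/* Alternative flow: serve only the queue that was requested. */
int toy_wake_requested(struct toy_hw *hw, int requested)
{
	toy_push_and_update(hw, requested);
	return requested;
}
```

In this model, if only queue 2 has a backlog and it is woken three times starting with the cursor at 0, toy_wake_pick_next serves queues 0, 1 and 2 in turn, so queue 2 sends one packet while being reported four times. toy_wake_requested sends three packets from queue 2 and never touches the idle queues.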

I think the following patch should solve the problem.

--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -4601,17 +4601,15 @@ static void ath10k_mac_op_wake_tx_queue(
 {
 	struct ath10k *ar = hw->priv;
 	int ret;
-	u8 ac;
 
-	ath10k_htt_tx_txq_update(hw, txq);
 	if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
 		return;
 
-	ac = txq->ac;
-	ieee80211_txq_schedule_start(hw, ac);
-	txq = ieee80211_next_txq(hw, ac);
 	if (!txq)
-		goto out;
+		return;
+
+	if (!ieee80211_txq_may_transmit(hw, txq))
+		return;
 
 	while (ath10k_mac_tx_can_push(hw, txq)) {
 		ret = ath10k_mac_tx_push_txq(hw, txq);
@@ -4620,8 +4618,6 @@ static void ath10k_mac_op_wake_tx_queue(
 	}
 	ieee80211_return_txq(hw, txq, false);
 	ath10k_htt_tx_txq_update(hw, txq);
-out:
-	ieee80211_txq_schedule_end(hw, ac);
 }
 
 /* Must not be called with conf_mutex held as workers can use that also. */

My R7800 is currently not available for testing. Is anyone willing to test the patch? @KONG, fancy a try, as you seem to have a setup that could simulate this problem?

I can give it a test this evening (in ~5 hours or so) on my R7800

@quarky I am using a Belkin RT3200, which uses MediaTek mt76 WiFi. I am using OpenWrt 22.03-rc1. Do you know if mt76 is also affected by an issue with AQL and the virtual time-based airtime scheduler, similar to ath10k?

To all that have Wi-Fi issues: have you checked the WLAN statistics graphs? Are there any unusual records for signal quality drops, PHY rates, or associated clients at the time you experienced the issues?

My test with the Linksys E8450 suggests it's not affected.

Just to follow up on my last post, I did create a build with ath10k and this patch applied, and it built without issues and flashed fine onto my R7800. I was experiencing issues, however, with the build in that maybe 5 or so of my IoT devices were not connecting. I don't know if the root cause of that was the patch, or if it was just switching to ath10k - as I usually just use ath10k-ct. Long story short though is that I don't think I'll be of much use to you in testing this, as I've already switched back to my "-ct" build.

No worries. Appreciate you taking the time to help test the patch.

Can't we test this on ath10k-ct too?

On an ea6350v3 with kmod-ath10k-ct and 3.6.140 firmware, on a snapshot build (5.10.111 kernel), it does not work :) ... probably.

Looks like the ct driver has the same issue. The patch should apply cleanly to the ct driver as well.

Applies without errors on latest master.

Are you trying it on the ct driver?