I would like to do some testing. Is it possible to reduce the CoDel target to 10 ms (it is currently 20 ms)? I know this might cause unnecessary drops, but I'm running a base rate of 24 Mbps without issues, this is only for testing, and I don't mind reduced throughput at this stage.
Can it be done by changing parameters through debugfs?
Well, my test with my E8450 didn't unveil anything in particular, though. I do notice that when my device is connected to the 2.4GHz interface, it tends to lag more than on the 5GHz interface. Maybe we should have different AQL limits for interfaces with different speeds?
When I'm testing on my R7800, the signal strength is strong and I'm testing on the 5GHz interface only. After a few days of uptime, with clients connecting and disconnecting, I hit extreme lag, and only a restart of WiFi or the router will fix it.
So it looks like a combination of AQL and the new virtual time-based scheduler, and it looks particularly bad with the ath10k driver. Switching back to the round-robin scheduler stopped all complaints about Wi-Fi connectivity for the R7800 (21 days uptime and counting).
So at the moment, the root cause is still a mystery, at least for me.
I think I may have stumbled across why the new virtual time-based airtime scheduler is causing problems for ath10k routers. I think the issue may be traced to one method (ath10k_mac_op_wake_tx_queue) in the ath10k driver that tries to schedule packets to wireless clients.
From what I can understand of this method, it will be called by mac80211 to make a transmit queue (txq) active in the firmware so that it can start transmitting packets in that queue. Instead of activating the txq that mac80211 wants, it looks for another txq and activates it. This method also updates the firmware with accounting information regarding transmit duration of the txq. Now this method updates the accounting information for both the txq that mac80211 sends in and also the new txq that it found.
So the net effect is that the firmware may try to send packets for a txq that mac80211 did not intend, or that has nothing to send, while artificially starving the txq that does have data to transmit, due to incorrect accounting. This probably will not affect routers with a small number of clients. It will probably be more pronounced with more clients or as uptime increases. That may explain why the WiFi interface needed to be restarted every once in a while.
I think the following patch should solve the problem.
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -4601,17 +4601,15 @@ static void ath10k_mac_op_wake_tx_queue(
 {
 	struct ath10k *ar = hw->priv;
 	int ret;
-	u8 ac;
-	ath10k_htt_tx_txq_update(hw, txq);
 	if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
 		return;
-	ac = txq->ac;
-	ieee80211_txq_schedule_start(hw, ac);
-	txq = ieee80211_next_txq(hw, ac);
 	if (!txq)
-		goto out;
+		return;
+
+	if (!ieee80211_txq_may_transmit(hw, txq))
+		return;
 	while (ath10k_mac_tx_can_push(hw, txq)) {
 		ret = ath10k_mac_tx_push_txq(hw, txq);
@@ -4620,8 +4618,6 @@ static void ath10k_mac_op_wake_tx_queue(
 	}
 	ieee80211_return_txq(hw, txq, false);
 	ath10k_htt_tx_txq_update(hw, txq);
-out:
-	ieee80211_txq_schedule_end(hw, ac);
 }
 /* Must not be called with conf_mutex held as workers can use that also. */
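To make the suspected mismatch easier to picture, here is a toy model in plain C. It is only a sketch of the reasoning above, not driver code: every name in it (toy_txq, toy_sched, toy_next_txq, toy_wake_current, toy_wake_patched) is made up for illustration, the scheduler is a trivial round-robin that is deliberately set up to pick a different queue than the one that was woken, and the airtime check that the patch adds via ieee80211_txq_may_transmit is left out.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TXQ 3

/* Toy stand-in for a transmit queue; NOT the real struct ieee80211_txq. */
struct toy_txq {
	int backlog;    /* packets still waiting in this queue */
	int pushed;     /* packets handed to the pretend firmware */
	int fw_updates; /* times the firmware-side state was refreshed */
};

/* Toy scheduler state; hypothetical, for illustration only. */
struct toy_sched {
	struct toy_txq q[NUM_TXQ];
	size_t next;    /* queue the scheduler would hand out next */
};

/* Hands out the scheduler's own candidate, which need not be the woken queue. */
static struct toy_txq *toy_next_txq(struct toy_sched *s)
{
	struct toy_txq *t;

	assert(s != NULL);
	t = &s->q[s->next];
	s->next = (s->next + 1) % NUM_TXQ;
	return t;
}

/* Moves the whole backlog of one queue to the pretend firmware. */
static void toy_push_all(struct toy_txq *t)
{
	while (t->backlog > 0) {
		t->backlog--;
		t->pushed++;
	}
}

/* Models the handler as described in the post: the woken queue is
 * accounted for, but a different queue is picked and serviced. */
static void toy_wake_current(struct toy_sched *s, struct toy_txq *woken)
{
	struct toy_txq *picked;

	woken->fw_updates++;
	picked = toy_next_txq(s);
	toy_push_all(picked);
	picked->fw_updates++;
}

/* Models the idea of the patch: service and account for the woken queue only. */
static void toy_wake_patched(struct toy_sched *s, struct toy_txq *woken)
{
	(void)s; /* the scheduler is not consulted for a different queue */
	toy_push_all(woken);
	woken->fw_updates++;
}
```

If queue 1 holds four packets and is the one being woken while the toy scheduler points at the empty queue 0, toy_wake_current leaves queue 1's backlog untouched and still records a firmware update for both queues, whereas toy_wake_patched drains queue 1 and touches nothing else. Whether the real driver and firmware behave like this is exactly what needs testing on hardware.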
My R7800 is currently not available for testing. Is anyone willing to test the patch? @KONG, fancy a try, as you seem to have a setup that could simulate this problem?
@quarky I am using a Belkin RT3200, which uses MediaTek mt76 WiFi, running OpenWrt 22.03-rc1. Do you know if mt76 is also affected by the AQL virtual time-based airtime scheduler issue, similar to ath10k?
To all who have Wi-Fi issues: have you checked the WLAN statistics graphs? Are there any unusual records for signal quality drops, PHY rates, or associated clients at the time you experienced the issues?
Just to follow up on my last post, I did create a build with ath10k and this patch applied, and it built without issues and flashed fine onto my R7800. I was experiencing issues, however, with the build in that maybe 5 or so of my IoT devices were not connecting. I don't know if the root cause of that was the patch, or if it was just switching to ath10k - as I usually just use ath10k-ct. Long story short though is that I don't think I'll be of much use to you in testing this, as I've already switched back to my "-ct" build.