Netgear R7800 exploration (IPQ8065, QCA9984)

I saw a similar problem on a Xiaomi R3D: the WiFi upload speed was below 100 Mbps.
We tracked it down to the txpower being set too high.
Try setting the txpower to no higher than 23 dBm.
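
A minimal sketch of how that cap could be applied from the shell. The phy name is an assumption and differs per device. iw takes the limit in mBm (dBm x 100), so the helper only does the conversion and prints the command rather than running it:

```shell
# Print (do not run) the iw command that caps txpower at a given dBm value.
# The phy name passed in is an assumption; check `iw phy` for the real one.
txpower_limit_cmd() {
    phy="$1"
    dbm="$2"
    # iw expects mBm, i.e. dBm * 100
    echo "iw phy $phy set txpower limit $((dbm * 100))"
}
```

For example, txpower_limit_cmd phy0 23 prints "iw phy phy0 set txpower limit 2300". On OpenWrt the persistent equivalent would be the txpower option (in dBm) of the matching wifi-device section in /etc/config/wireless.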

My device is the R7500v2, which is IPQ8064 with QCA9980. So yes, different from the R7800.

@quarky, yes, this is an R7800 thread (and I appreciate everyone being patient with me when I post here), but other ipq806x users might not see this file at all, depending on their WiFi chipset. Airtime fairness (on OpenWrt) will likely still be present and working (EDIT: see this); only some functionality is lost if the ath10k WiFi driver does not set NL80211_EXT_FEATURE_AIRTIME_FAIRNESS. Interested ath10k users can run

cat /sys/kernel/debug/ieee80211/phy0/ath10k/wmi_services | grep -E '(AIRTIME|PEER_STATS)'

If no line shows "enabled" (one is enough), the file you mention will not show up and iw will not work for ATF (only that part). I'll submit a bug report to OpenWrt for this (EDIT: I'm holding off on this for now). I did submit one to Candela, but I think that is the wrong place to fix it.
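
A small sketch that wraps the same check so a script can act on the result. It reads the wmi_services text on stdin and makes no assumption about the file format beyond the "enabled" marker mentioned above:

```shell
# Reads wmi_services text on stdin and prints the AIRTIME / PEER_STATS
# service lines that are marked "enabled".
# Exit status is 0 if at least one such line was found, non-zero otherwise.
atf_services_enabled() {
    grep -E 'AIRTIME|PEER_STATS' | grep 'enabled'
}
```

Usage would be along the lines of: atf_services_enabled < /sys/kernel/debug/ieee80211/phy0/ath10k/wmi_services || echo "no ATF-related service enabled"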

There is something to that - I haven't written much about my wifi observations wrt rates but they definitely matter - this is not the right thread for it tho.


From what I understand from the mac80211 source code, this is Wi-Fi chipset independent. The file is created by mac80211, so all drivers using mac80211 will have this debugfs file. Whether the Wi-Fi driver sets the ATF flag is another matter.

Trust me :). Or try un-setting NL80211_EXT_FEATURE_AIRTIME_FAIRNESS and you will see that iw can't set ATF weights and some ATF debug files disappear.

But you are correct: ATF is still present and (I think) working with or without that variable being set; see this.


it seems i'm ok..

Ping statistics for 192.168.1.1:
    Packets: Sent = 1000, Received = 1000,
    Lost = 0 (0% loss),
Approximate round trip times in milliseconds:
    Minimum = 1ms, Maximum =  58ms, Average =  8ms

A couple of notes on how ATF (is supposed to) work:

  • It will account airtime from transmissions in both directions to each station, but it can only throttle traffic in the AP->client direction.
  • It is only active for drivers that opt-in to it by setting that EXT_FEATURE flag you mentioned above; just running iw phy should tell you if it's enabled or not (look for AIRTIME_FAIRNESS in the "supported extended features" list at the end of the output for each phy)
  • For ath10k in particular there's an odd interaction with the firmware scheduler in some cases: newer ath10k chipsets will switch to 'pull/push mode' in some cases where the firmware has its own notion of which stations to schedule when. Unfortunately, this being in firmware, I don't have a lot of insight into how this actually works, but it may be what's causing the issues with the virt time-based scheduler. I had an email exchange with someone who noticed this recently, will quote my reply to them below.

(Note that the below only applies to the pull/push mode of ath10k and the context was slightly different; but may be relevant here anyway):

So the main change with the virtual-time scheduler that's relevant here (I think) is that ieee80211_txq_may_transmit() is better at saying no now. The assumption for this mode of operation (which is only used by ath10k, BTW), is that the firmware will cycle through all the scheduled stations, ask the system if each of them is allowed to send (through ieee80211_txq_may_transmit()), and if the answer is no, move on to the next one.

Now, the "move on to the next one" bit is central here; if the firmware just keeps asking about the same station (or a subset of all scheduled stations), it will keep getting a "no" if there's another station that is "behind" (in terms of fairness) until that other station catches up. Whereas with the old code, the round-robin scheduler could artificially advance the other stations, allowing them to transmit (but effectively disabling the fairness).

The latency spikes you describe sound like they could be caused by this throttling mechanism. Now, that throttling is obviously not supposed to happen; we're only supposed to enforce fairness between active stations, so if a station does not have any data outstanding (at the mac80211 level), it should be removed from the schedule, and the other stations should be allowed to continue.

So the question is what it is that goes wrong here; my guess is that it is one of the following:

  • The firmware has its own notion of which subset of stations it wants to schedule, so it gets deadlocked with the fairness mechanism as explained above (I don't have a lot of insight into how the push/pull mechanism of ath10k is actually supposed to work).

  • There's a bug somewhere so stations with no outstanding packets don't get de-scheduled properly (and so are still part of the rotation blocking progress).
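
The stall described above can be illustrated with a toy model. This is not mac80211 or ath10k code: the two stations, their virtual times, the step of 10 and the polling loop are all made up for illustration. A station may transmit only if it is not ahead of the other one, and the hypothetical firmware either polls both stations or only one of them:

```shell
# Toy model only; none of this is mac80211 or ath10k code.
# Two stations, A and B, both have data queued. Each has a virtual time (vt);
# a station may transmit only if it is not ahead of the other one.
may_transmit() {
    case "$1" in
        A) [ "$vt_A" -le "$vt_B" ] ;;
        B) [ "$vt_B" -le "$vt_A" ] ;;
    esac
}

# $1 = stations the hypothetical firmware polls each round ("A" or "A B")
# $2 = number of polling rounds
simulate() {
    vt_A=100; vt_B=0          # A has used more airtime, so B is "behind"
    sent_A=0; sent_B=0
    round=0
    while [ "$round" -lt "$2" ]; do
        for sta in $1; do
            if may_transmit "$sta"; then
                # transmitting costs airtime, which advances virtual time
                case "$sta" in
                    A) vt_A=$((vt_A + 10)); sent_A=$((sent_A + 1)) ;;
                    B) vt_B=$((vt_B + 10)); sent_B=$((sent_B + 1)) ;;
                esac
            fi
        done
        round=$((round + 1))
    done
    echo "A=$sent_A B=$sent_B"
}
```

In this model, simulate "A" 20 prints "A=0 B=0": A is refused every time because B is behind and never gets asked. simulate "A B" 20 prints "A=10 B=20": B catches up during the first ten rounds, after which both stations send in every round.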


8320 packets transmitted, 8318 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 1.323/2.937/36.368/1.142 ms

Perhaps. Note, others have found latency spikes that are client/os dependent (here for one example that i believe predates the patch you are investigating).

The ieee80211_txq_may_transmit() API is the one I've modified in my R7800 firmware and am currently testing. And yes, only ath10k calls this API, going by the scan I did of the source tree. I added debug printouts to that API and found that the difference between the old API (based on the round-robin algorithm) and the new one is that when txqi->schedule_order is NULL, the old API returns true to the caller (i.e. ath10k) while the new API returns false, effectively throttling the transmission. And this condition happens a lot.

Besides this change, I also noticed that the behaviour of ieee80211_return_txq() changed. The old API tries to reschedule the returned txq, while the new API unschedules it. I have to admit that I do not understand why this change was made, so I'm testing the code behaviour piecemeal. This API is called by the ath10k and mt76 drivers.

Ah, this is interesting! Nice find :)

So the reason it returns false is that this shouldn't happen, in theory. If it does happen, maybe that's the culprit; it might be because the firmware is asking about TXQs that only have packets outstanding in the firmware but not in the host queues? Could you add a check and see if txq_has_queue(txq) is really false when this condition happens, and also check the AQL backlog (air_info->aql_tx_pending) for that TXQ (when schedule_order is NULL)?

This should be benign and is mostly a difference in data structures: the RR schedule removes a queue entirely from the rotation when handing it off to the driver (with next_txq), whereas the virt-time scheduler keeps the queue on the rbtree while it's being handled by the driver...


So I was just about to submit a bug report about this. I'll hold off.

For ipq8064 and qca9980, this flag is not set by either the ct or the non-ct drivers. I can force it and ATF does seem to work, at least in the AP->client direction; however, I don't know if this should be set. Note that AQL still shows up. I assume it's OK to have AQL enabled but not ATF.

BTW, if there is basic testing that can be done on my device, I'm willing to do so.

Thank you for the response.

Will add in this check and report back.

There is no such file. But there is another file:

root@Xiaomi-R3D:~# cat /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/ec:41:18:f5:25:82/airtime
RX: 0 us
TX: 30669 us
Weight: 256
Deficit: VO: -7 us VI: 256 us BE: 57 us BK: 1 us
root@Xiaomi-R3D:~# iw phy
....
Wiphy phy0
...
        max # scan plans: 1
        max scan plan interval: -1
        max scan plan iterations: 0
        Maximum associated stations in AP mode: 32
        Supported extended features:
                * [ VHT_IBSS ]: VHT-IBSS
                * [ RRM ]: RRM
                * [ SET_SCAN_DWELL ]: scan dwell setting
                * [ CQM_RSSI_LIST ]: multiple CQM_RSSI_THOLD records
                * [ CONTROL_PORT_OVER_NL80211 ]: control port over nl80211
                * [ TXQS ]: FQ-CoDel-enabled intermediate TXQs
                * [ AIRTIME_FAIRNESS ]: airtime fairness scheduling
                * [ AQL ]: Airtime Queue Limits (AQL)
                * [ CONTROL_PORT_NO_PREAUTH ]: disable pre-auth over nl80211 control port support
                * [ DEL_IBSS_STA ]: deletion of IBSS station support
                * [ SCAN_FREQ_KHZ ]: scan on kHz frequency support
                * [ CONTROL_PORT_OVER_NL80211_TX_STATUS ]: tx status for nl80211 control port support
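
A small sketch for pulling just the ATF line out of output like the above, one result per phy. It only relies on the "Wiphy" header lines and the "[ AIRTIME_FAIRNESS ]" entry that appear in the output itself:

```shell
# Reads `iw phy` output on stdin and prints "<phy> yes" or "<phy> no"
# depending on whether AIRTIME_FAIRNESS is listed for that phy.
atf_per_phy() {
    awk '
        /^Wiphy / {
            if (phy != "") print phy, (atf ? "yes" : "no")
            phy = $2
            atf = 0
        }
        /\[ AIRTIME_FAIRNESS \]/ { atf = 1 }
        END { if (phy != "") print phy, (atf ? "yes" : "no") }
    '
}
```

Typical use would be: iw phy | atf_per_phy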

Your router is running the old round-robin scheduler, not the new virtual time-based scheduler.

I couldn't check it today, but it seems I have the new scheduler, right?

root@RUTTO:~# cat /sys/kernel/debug/ieee80211/phy0/airtime
        VO         VI         BE         BK
Virt-t  63891101   34643286   1920972764 5013180
Weight  512        256        256        256

Let me know if I should check something, since strangely my WiFi seems to work :)
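
A rough heuristic for telling the two schedulers apart, based only on what was observed in this thread: a per-phy airtime file with a "Virt-t" row on the build with the new scheduler, and no such file on the older one. The labels it prints are made up for this sketch, and other builds may behave differently:

```shell
# Heuristic from this thread only, not an official interface:
#   per-phy airtime file with a "Virt-t" row -> virtual-time scheduler
#   file missing                             -> likely the round-robin one
# $1 = path, e.g. /sys/kernel/debug/ieee80211/phy0/airtime
airtime_scheduler() {
    file="$1"
    if [ -r "$file" ] && grep -q '^Virt-t' "$file"; then
        echo "virtual-time"
    elif [ -e "$file" ]; then
        echo "unknown"
    else
        echo "round-robin"
    fi
}
```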

@quarky when you get degraded speeds over WiFi, what do the RX/TX rates say in LuCI? I got low speeds on one of the clients I tested; when I checked, the TX rate was low at the time of testing.
I switched to another client and the speeds were fine, and so was the TX rate.

Looks like the new airtime scheduler is working great for you! It must be my setup and combination of Wi-Fi clients that's causing it. I did disable the 2.4 GHz Wi-Fi interface, though I don't think it matters in this instance.

If memory serves, the PHY rate is the full 80 MHz 2x2 rate, i.e. 866 Mbps. Latency is through the roof, though. I don't think speed/throughput is the issue; it is the latency that is bad.

@tohojo I have some logs from my R7800 output after adding the txq_has_queue() check and reading the aql_tx_pending value from air_info.

I can see the following:

  1. All logs printed (when schedule_order is NULL) have txq_has_queue() returning false;
  2. The majority of those have a non-zero aql_tx_pending for the txq that's passed in.

Sample output of the logs:

[74527.141093] net_ratelimit: 1363 callbacks suppressed
[74527.141100] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=4]
[74527.145239] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=4]
[74527.152900] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=4]
[74527.161374] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=0]
[74585.316336] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=4]
[74585.316419] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=4]
[74585.323181] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=4]
[74585.331140] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=4]
[74585.339706] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=0]
[74585.346975] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=4]
[74585.354857] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=12]
[74585.362665] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=12]
[74585.370962] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=0]
[74585.378648] ieee80211_txq_may_transmit: &txqi->schedule_order [has_queue=0, tx_pend=8]

The above should be representative of the throttled logs I would think.
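
A quick way to quantify "majority" from such a log is sketched below. It assumes the exact message format of the debug printout shown above and simply counts entries with and without a pending AQL backlog:

```shell
# Reads kernel log text on stdin and counts the schedule_order debug lines,
# and how many of them report a non-zero tx_pend value.
summarize_tx_pend() {
    awk '
        /schedule_order \[has_queue=/ {
            if (match($0, /tx_pend=[0-9]+/)) {
                total++
                # skip the 8 characters of "tx_pend=" to get the number
                if (substr($0, RSTART + 8, RLENGTH - 8) + 0 > 0) nonzero++
            }
        }
        END { printf "total=%d nonzero=%d\n", total, nonzero }
    '
}
```

Fed via something like dmesg | summarize_tx_pend, the 14 sample lines above would give total=14 nonzero=11.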

I also changed the behaviour of the ieee80211_txq_may_transmit() API to also return true under the following conditions:

  1. When ieee80211_txq_airtime_check() returns false. I figured we should allow transmit if there's no need to check for airtime?
  2. When txqi->txq.sta is NULL (this is consistent with the old API behaviour)

The above two conditions do happen occasionally, but by far the most common logs I see are from the schedule_order NULL (or, to be precise, RB_EMPTY_NODE) condition.

With all 3 changes added, it doesn't seem to affect my R7800's Wi-Fi so far.


@quarky are these code changes on a public git repo? I'd like to dig into this a bit more and perhaps try this myself. No worries if not.

It's on my local build machine. The changes are trivial tho. Here's the patch:

--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -4135,19 +4135,35 @@ bool ieee80211_txq_may_transmit(struct i
        struct airtime_sched_info *air_sched;
        struct airtime_info *air_info;
        struct rb_node *node = NULL;
-       bool ret = false;
+       bool ret = true;
        u64 now;


-       if (!ieee80211_txq_airtime_check(hw, txq))
-               return false;
+       if (!ieee80211_txq_airtime_check(hw, txq)) {
+               air_info = to_airtime_info(txq);
+               net_info_ratelimited("%s: airtime_check [has_queue=%d, tx_pend=%d]\n",
+                       __func__, txq_has_queue(txq), atomic_read(&air_info->aql_tx_pending));
+               return true;
+       }

        air_sched = &local->airtime[txq->ac];
        spin_lock_bh(&air_sched->lock);

-       if (RB_EMPTY_NODE(&txqi->schedule_order))
+       if (!txqi->txq.sta) {
+               air_info = to_airtime_info(txq);
+               net_info_ratelimited("%s: txqi->txq.sta [has_queue=%d, tx_pend=%d]\n",
+                       __func__, txq_has_queue(txq), atomic_read(&air_info->aql_tx_pending));
                goto out;
+       }

+       if (RB_EMPTY_NODE(&txqi->schedule_order)) {
+               air_info = to_airtime_info(txq);
+               net_info_ratelimited("%s: &txqi->schedule_order [has_queue=%d, tx_pend=%d]\n",
+                       __func__, txq_has_queue(txq), atomic_read(&air_info->aql_tx_pending));
+               goto out;
+       }
+
+       ret = false;
        now = ktime_get_coarse_boottime_ns();

        /* Like in ieee80211_next_txq(), make sure the first station in the