Netgear R7800 exploration (IPQ8065, QCA9984)

Not much input from me. I use wifi for my mobile devices, but the high-speed volumes are from my PC with wired connection.

Is this limited to the CT firmware/drivers or not?
I'm on OpenWrt 21.02-SNAPSHOT r16416-ecab623a38 with the "old" ath10k firmware and drivers, uptime 3 days (I lost power at home), and Wi-Fi performance doesn't seem to have any issues.

I'm not using the ath10k-ct drivers as I wanted the encap offload capabilities that ath10k supports.

I'm not sure whether OpenWrt 21.02-SNAPSHOT r16416-ecab623a38 was built with or without the patch that I suspect is causing the issue. If the snapshot was built after this commit, then it should affect Wi-Fi stability after some time.

When the router reboots, or if the Wi-Fi interfaces are restarted, throughput would be good. After prolonged use, latency starts to climb until it becomes unusable. At that point I either have to reboot the router, or somehow log in to the router and trigger a Wi-Fi restart.

On this, sadly, I can't help you, but we can ask @ACwifidude (I'm using his repo).
Or can I check this from a live system?

Edit to add: I can see those patch files in ACwifidude's repo, so it seems I have them.

Can you guess after how many days this should happen? I'm not planning any reboot this week, and I'm using my Wi-Fi daily, so if it's a matter of one week I can see if things get bad.

If you see this file, then your router should be using the virtual time-based airtime scheduler, instead of the older round-robin scheduler:

/sys/kernel/debug/ieee80211/phy0/airtime

And the output should give you something like this:

         VO         VI         BE         BK
 Virt-t  19723      1262118    179701010  1660
 Weight  256        256        256        256
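For anyone who wants to log these counters over time rather than eyeball them, here is a minimal Python sketch that parses a table of this shape. It assumes the layout is exactly as shown above (a header row of access categories followed by labelled rows); the function name `parse_airtime_table` is made up for illustration, and Python may not be installed on a stock OpenWrt image, so you may need to copy the text to another machine.

```python
def parse_airtime_table(text):
    """Parse the per-AC table into {row_label: {access_category: value}}."""
    rows = {}
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if header is None:
            # First non-empty line names the access categories (VO, VI, BE, BK).
            header = fields
            continue
        label, values = fields[0], fields[1:]
        rows[label] = dict(zip(header, (int(v) for v in values)))
    return rows


# Sample text copied from the output shown above.
SAMPLE = """\
         VO         VI         BE         BK
 Virt-t  19723      1262118    179701010  1660
 Weight  256        256        256        256
"""

table = parse_airtime_table(SAMPLE)
print(table["Virt-t"]["BE"])
print(table["Weight"])
```

On the router itself the text would come from reading /sys/kernel/debug/ieee80211/phy0/airtime instead of the `SAMPLE` string.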

From memory, I start to see issues as soon as 3 days into use. When I play online games on my iPad, I frequently see Wi-Fi issues as well, and that would be soon after a restart. Online games do not need high bandwidth, but high latency will kill gameplay. Also from memory, I have been seeing this since Nov. 2021.

I'm not sure if that patch is causing the issue, though. Reverting my latest build to the old round-robin scheduler solves my online game issue, and it has been stable for a week.

Now I'm testing the new time-based scheduler, modified so that its behaviour is consistent with the old round-robin one. I'll report back in about a week on how it goes.

One test you can try is to continuously ping the router from a wireless client that is connected to it. If you see occasional ping spikes into the 100s to 1000s of ms, then you have the issue as well.
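If you save the ping output to a file, a short script can pick out the spikes for you. This is only a sketch: the 100 ms threshold comes from the rule of thumb above, the regex assumes the usual `time=… ms` reply format (it also accepts the Windows `time=1ms` / `time<1ms` style), and the sample replies are invented for illustration.

```python
import re

# Matches "time=2.9 ms", "time=412ms" and "time<1ms".
RTT_RE = re.compile(r"time[=<]\s*([\d.]+)\s*ms")


def extract_rtts(ping_output):
    """Return every round-trip time found in the ping output, in ms."""
    return [float(m.group(1)) for m in RTT_RE.finditer(ping_output)]


def find_spikes(rtts, threshold_ms=100.0):
    """Return (reply_index, rtt) pairs at or above the threshold."""
    return [(i, rtt) for i, rtt in enumerate(rtts) if rtt >= threshold_ms]


# Invented sample replies, not taken from a real capture.
SAMPLE = """\
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=2.9 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=3.1 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=412 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=64 time=2.7 ms
"""

rtts = extract_rtts(SAMPLE)
spikes = find_spikes(rtts)
print(f"{len(spikes)} spike(s) out of {len(rtts)} replies: {spikes}")
```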

I observed a similar problem on a Xiaomi R3D. The Wi-Fi upload speed was less than 100 Mbps.
We managed to find the reason: the txpower was set too high.
Try setting the txpower to no higher than 23 dBm.

My device is the R7500v2, which is ipq8064 and qca9980. So yes, different from the R7800.

@quarky, yes, this is an R7800 thread (and I appreciate everyone being patient with me when I post here), but other ipq806x users might not see this file at all, depending on their Wi-Fi chipset. Airtime fairness (on OpenWrt) will likely still be present and working (EDIT: see this); just some functionality is lost if the ath10k Wi-Fi driver does not set NL80211_EXT_FEATURE_AIRTIME_FAIRNESS. Interested ath10k users can run:

cat /sys/kernel/debug/ieee80211/phy0/ath10k/wmi_services | grep -E '(AIRTIME|PEER_STATS)'

If they don't see any "enabled" (one is enough), the file you mention will not show up and iw will not function with respect to ATF (only). I'll submit a bug report to OpenWrt for this (EDIT: I'm holding off on this for now). I did submit one to Candela, but I think that is the wrong place to fix it.
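The same check as the grep can be scripted. The sketch below assumes each line of wmi_services is a service name followed by a status column reading "enabled" when the firmware advertises it; the service names and the "-" placeholder in the sample are illustrative assumptions, not output copied from a real device, and `enabled_services` is a made-up helper name.

```python
def enabled_services(wmi_services_text, keywords=("AIRTIME", "PEER_STATS")):
    """Return names of services matching a keyword whose status is 'enabled'."""
    found = []
    for line in wmi_services_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, status = fields[0], fields[-1]
        if status == "enabled" and any(k in name for k in keywords):
            found.append(name)
    return found


# Illustrative sample; the real file lists many more services.
SAMPLE = """\
WMI_SERVICE_BEACON_OFFLOAD               enabled
WMI_SERVICE_PEER_STATS                   enabled
WMI_SERVICE_REPORT_AIRTIME               -
"""

matches = enabled_services(SAMPLE)
# One enabled match is enough, per the explanation above.
print("ATF-related service enabled:", bool(matches), matches)
```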

There is something to that. I haven't written much about my Wi-Fi observations with respect to rates, but they definitely matter. This is not the right thread for it, though.

From what I understand from the mac80211 source code, this is Wi-Fi chipset independent. The file is created by mac80211, so all drivers using mac80211 will have this debugfs file. Whether the Wi-Fi driver sets the ATF flag is another matter.

Trust me :). Or try un-setting NL80211_EXT_FEATURE_AIRTIME_FAIRNESS and you will see that iw can't set ATF weights and some ATF debug files disappear.

But you are correct that ATF is still present and (I think) working with or without that flag being set; see this.

It seems I'm OK.

Ping statistics for 192.168.1.1:
    Packets: Sent = 1000, Received = 1000,
    Lost = 0 (0% loss),
Approximate round trip times in milliseconds:
    Minimum = 1ms, Maximum = 58ms, Average = 8ms

A couple of notes on how ATF (is supposed to) work:

  • It will account airtime from transmissions in both directions to each station, but it can only throttle traffic in the AP->client direction.
  • It is only active for drivers that opt-in to it by setting that EXT_FEATURE flag you mentioned above; just running iw phy should tell you if it's enabled or not (look for AIRTIME_FAIRNESS in the "supported extended features" list at the end of the output for each phy)
  • For ath10k in particular there's an odd interaction with the firmware scheduler in some cases: newer ath10k chipsets will switch to 'pull/push mode' in some cases where the firmware has its own notion of which stations to schedule when. Unfortunately, this being in firmware, I don't have a lot of insight into how this actually works, but it may be what's causing the issues with the virt time-based scheduler. I had an email exchange with someone who noticed this recently, will quote my reply to them below.
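The `iw phy` check from the second point can be scripted if you have several radios. This sketch assumes the output format seen in typical `iw phy` listings: a `Wiphy phyN` line per radio and extended features printed as `* [ NAME ]: description`. The sample text is trimmed and partly invented (the second phy is hypothetical), and `extended_features_by_phy` is a made-up helper name.

```python
import re

FEATURE_RE = re.compile(r"\*\s*\[\s*(\w+)\s*\]")


def extended_features_by_phy(iw_output):
    """Map each 'Wiphy phyN' section to the set of bracketed feature names."""
    features = {}
    current = None
    for line in iw_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Wiphy "):
            current = stripped.split()[1]
            features[current] = set()
        elif current is not None:
            m = FEATURE_RE.match(stripped)
            if m:
                features[current].add(m.group(1))
    return features


# Trimmed, partly invented sample; phy1 is hypothetical.
SAMPLE = """\
Wiphy phy0
        Supported extended features:
                * [ TXQS ]: FQ-CoDel-enabled intermediate TXQs
                * [ AIRTIME_FAIRNESS ]: airtime fairness scheduling
                * [ AQL ]: Airtime Queue Limits (AQL)
Wiphy phy1
        Supported extended features:
                * [ TXQS ]: FQ-CoDel-enabled intermediate TXQs
                * [ AQL ]: Airtime Queue Limits (AQL)
"""

for phy, feats in extended_features_by_phy(SAMPLE).items():
    print(phy, "ATF enabled" if "AIRTIME_FAIRNESS" in feats else "ATF not enabled")
```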

(Note that the below only applies to the pull/push mode of ath10k and the context was slightly different; but may be relevant here anyway):

So the main change with the virtual-time scheduler that's relevant here (I think) is that ieee80211_txq_may_transmit() is better at saying no now. The assumption for this mode of operation (which is only used by ath10k, BTW), is that the firmware will cycle through all the scheduled stations, ask the system if each of them is allowed to send (through ieee80211_txq_may_transmit()), and if the answer is no, move on to the next one.

Now, the "move on to the next one" bit is central here; if the firmware just keeps asking about the same station (or a subset of all scheduled stations), it will keep getting a "no" if there's another station that is "behind" (in terms of fairness) until that other station catches up. Whereas with the old code, the round-robin scheduler could artificially advance the other stations, allowing them to transmit (but effectively disabling the fairness).
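To make the difference concrete, here is a deliberately simplified toy model in Python. It is not the mac80211 code: the class, the method names, and the rule "a station may transmit only if no scheduled station has a lower virtual time" are assumptions chosen to mirror the description above. The firmware is modelled as polling only station A while station B is scheduled but behind.

```python
class ToyScheduler:
    """Toy fairness model; virtual time stands in for airtime used per weight."""

    def __init__(self, virt_times):
        self.virt = dict(virt_times)

    def may_transmit_virtual_time(self, sta):
        # Strict fairness: say "no" while any scheduled station is behind `sta`.
        return self.virt[sta] <= min(self.virt.values())

    def may_transmit_round_robin(self, sta):
        # Toy stand-in for the old behaviour: stations that are behind get
        # artificially advanced, so the asked-about station is allowed through.
        for other in self.virt:
            if self.virt[other] < self.virt[sta]:
                self.virt[other] = self.virt[sta]
        return True

    def charge(self, sta, airtime):
        self.virt[sta] += airtime


def poll_same_station(check_name, polls=5):
    """Firmware asks about station A only; count how often it may send."""
    sched = ToyScheduler({"A": 100, "B": 0})  # B is behind and never polled
    check = getattr(sched, check_name)
    grants = 0
    for _ in range(polls):
        if check("A"):
            sched.charge("A", 10)
            grants += 1
    return grants


print("virtual-time grants:", poll_same_station("may_transmit_virtual_time"))
print("round-robin grants: ", poll_same_station("may_transmit_round_robin"))
```

In this toy, the virtual-time check never grants A a transmission because B never catches up, while the round-robin stand-in grants every poll at the cost of fairness.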

The latency spikes you describe sound like they could be caused by this throttling mechanism. Now, that throttling is obviously not supposed to happen; we're only supposed to enforce fairness between active stations, so if a station does not have any data outstanding (at the mac80211 level), it should be removed from the schedule, and the other stations should be allowed to continue.

So the question is what it is that goes wrong here; my guess is that it is one of the following:

  • The firmware has its own notion of which subset of stations it wants to schedule, so it gets deadlocked with the fairness mechanism as explained above (I don't have a lot of insight into how the push/pull mechanism of ath10k is actually supposed to work).

  • There's a bug somewhere so stations with no outstanding packets don't get de-scheduled properly (and so are still part of the rotation blocking progress).

8320 packets transmitted, 8318 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 1.323/2.937/36.368/1.142 ms

Perhaps. Note, others have found latency spikes that are client/os dependent (here for one example that i believe predates the patch you are investigating).

The ieee80211_txq_may_transmit() API is the one that I've modified for my R7800 firmware, and I am currently testing it. And yes, only ath10k calls this API, according to the scan I did of the source tree. I added debug printouts to that API and found that the difference between the old (round-robin based) and new versions is that when the txqi->schedule_order structure is NULL, the old API returns true to the caller (i.e. ath10k) while the new API returns false, effectively throttling the transmission. And this condition happens a lot.
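As a rough illustration of why that one return value matters, here is a toy Python sketch of the two behaviours as described above. It is not the kernel code: `ToyTxq`, its fields, and both functions are hypothetical stand-ins, with `scheduled=False` playing the role of the "schedule_order is NULL" case, and the query list is invented.

```python
from dataclasses import dataclass


@dataclass
class ToyTxq:
    name: str
    scheduled: bool      # False stands in for the "schedule_order is NULL" case
    fair_share_ok: bool  # what the fairness check would say if scheduled


def may_transmit_old(txq):
    # Old behaviour as described: an unscheduled queue is simply allowed.
    if not txq.scheduled:
        return True
    return txq.fair_share_ok


def may_transmit_new(txq):
    # New behaviour as described: an unscheduled queue is refused.
    if not txq.scheduled:
        return False
    return txq.fair_share_ok


# Invented sequence of firmware queries.
queries = [
    ToyTxq("sta1", scheduled=True, fair_share_ok=True),
    ToyTxq("sta2", scheduled=False, fair_share_ok=True),
    ToyTxq("sta3", scheduled=False, fair_share_ok=True),
    ToyTxq("sta4", scheduled=True, fair_share_ok=False),
]

denied_old = [q.name for q in queries if not may_transmit_old(q)]
denied_new = [q.name for q in queries if not may_transmit_new(q)]
print("denied by old check:", denied_old)
print("denied by new check:", denied_new)
```

If the unscheduled case is as frequent as the debug printouts suggest, the extra refusals in the second list would show up as throttled transmissions.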

Besides this change, I also noticed that the behaviour of ieee80211_return_txq() changed. The old API tries to reschedule the returned txq, while the new API unschedules it. I have to admit that I do not understand why this change was made, so I'm testing the code behaviour piecemeal. This API is called by the ath10k and mt76 drivers.

Ah, this is interesting! Nice find :)

So the reason it returns false is that this shouldn't happen, in theory. If it does happen, maybe that's the culprit; it might be because the firmware is asking about TXQs that only have packets outstanding in the firmware but not in the host queues. Could you add a check and see if txq_has_queue(txq) is really false when this condition happens, and also check the AQL backlog (air_info->aql_tx_pending) for that TXQ (when schedule_order is NULL)?

This should be benign and is mostly a difference in data structures: the RR schedule removes a queue entirely from the rotation when handing it off to the driver (with next_txq), whereas the virt-time scheduler keeps the queue on the rbtree while it's being handled by the driver...

So I was just about to submit a bug report about this. I'll hold off.

For ipq8064 and qca9980, this flag is not set by either the CT or non-CT drivers. I can force it and ATF does seem to work, at least in the AP->client direction; however, I don't know if this should be set. Note that AQL still shows up. I assume it's OK to have AQL enabled but not ATF.

BTW, if there is basic testing that can be done on my device, I'm willing to do it.

Thank you for the response.

Will add this check and report back.

There is no such file. But there is another file:

root@Xiaomi-R3D:~# cat /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/ec:41:18:f5:25:82/airtime
RX: 0 us
TX: 30669 us
Weight: 256
Deficit: VO: -7 us VI: 256 us BE: 57 us BK: 1 us
root@Xiaomi-R3D:~# iw phy
....
Wiphy phy0
...
        max # scan plans: 1
        max scan plan interval: -1
        max scan plan iterations: 0
        Maximum associated stations in AP mode: 32
        Supported extended features:
                * [ VHT_IBSS ]: VHT-IBSS
                * [ RRM ]: RRM
                * [ SET_SCAN_DWELL ]: scan dwell setting
                * [ CQM_RSSI_LIST ]: multiple CQM_RSSI_THOLD records
                * [ CONTROL_PORT_OVER_NL80211 ]: control port over nl80211
                * [ TXQS ]: FQ-CoDel-enabled intermediate TXQs
                * [ AIRTIME_FAIRNESS ]: airtime fairness scheduling
                * [ AQL ]: Airtime Queue Limits (AQL)
                * [ CONTROL_PORT_NO_PREAUTH ]: disable pre-auth over nl80211 control port support
                * [ DEL_IBSS_STA ]: deletion of IBSS station support
                * [ SCAN_FREQ_KHZ ]: scan on kHz frequency support
                * [ CONTROL_PORT_OVER_NL80211_TX_STATUS ]: tx status for nl80211 control port support

Your router is running the old round-robin scheduler, not the new virtual time-based scheduler.

I couldn't check it today, but it seems I have the new scheduler, right?

root@RUTTO:~# cat /sys/kernel/debug/ieee80211/phy0/airtime
        VO         VI         BE         BK
Virt-t  63891101   34643286   1920972764 5013180
Weight  512        256        256        256

Let me know if I should check something, since strangely my Wi-Fi seems to work :)