Netgear R7800 exploration (IPQ8065, QCA9984)

I'm trying to understand the results I get.

I'm saying that simultaneous netperf runs from clients to the netperf server through that AP suck (i.e. one client's netperf can stop completely). No amount of adjusting ATF weights seems to make a difference in this case.

If I run simultaneous netperf streams from the netperf server to the clients, I don't see one of the clients come to a grinding halt, and I can adjust the ATF weights (but only using a custom mod to the ath10k-ct driver for my device due to a bug) and see the netperf rates go down (or up) as I'd expect.

I may be wrong here, but I believe ATF can only control what's happening when the clients are downloading.

That thought had occurred to me, but I'm a non-expert and documentation is sparse. Still, I suspect these observations may correlate with ATF being implemented.

I.e. netperf from clients to the netperf server through an AP did not use to suck (at least not this badly) before ATF.

Maybe @tohojo can provide some input on the inner workings of ATF and also take a look at your test results to see if there's something wrong.

This thread (I've been trying to figure this one out for some time). Skip to the latest tests starting here.

Sorry, Flent is too much trouble atm, but if this turns out to be something worth digging into, I'll go further.

Crashed again this morning. Same stack trace in the ramoops, but this time the error was to do with kernel paging. So again "random errors" potentially caused by the issues with frequency jumping/scaling.

I've just set the governors to performance and will try that out. Unfortunately it's a wait-and-see type of error, so I guess if I don't have a crash in 1-2 weeks, that's a pretty good sign it works as a workaround.
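For anyone wanting to try the same workaround, here is a minimal sketch of pinning every CPU core to the performance governor through the standard Linux cpufreq sysfs files. It assumes Python is installed on the router (it is not by default on OpenWrt) and that the script runs as root; the function name `set_governor` and the `cpu_root` parameter are made up for illustration. The setting does not survive a reboot.

```python
from pathlib import Path


def set_governor(governor="performance", cpu_root="/sys/devices/system/cpu"):
    """Write `governor` to every cpuN/cpufreq/scaling_governor under cpu_root.

    Returns the list of files that were updated. `cpu_root` is a parameter
    only so the function can be pointed at a scratch directory for testing.
    """
    updated = []
    # cpu[0-9]* matches cpu0, cpu1, ... but not sibling dirs like cpufreq/ or cpuidle/
    for path in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        path.write_text(governor + "\n")
        updated.append(str(path))
    return updated
```

Calling `set_governor()` with no arguments targets the real sysfs tree. Writing the same string into each `scaling_governor` file from a shell has the same effect.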

Are you testing the master branch or the 21.02 branch?

I have reverted the ATF algo for both. It's easier for 21.02, as it's just removing patch files. For master, I have to create another patch file to patch the old code back in.

master - I have some ability to work out merge conflicts and minor runtime errors, provided I don't have to go too deep into it. Anything you can provide would be nice.

For the record, I'm not interested in (permanently) going back from ATF/AQL. I do want to eliminate them as possible contributing factors or, if they are involved, perhaps that might help to make them better.

At the least I'd like to understand why my wifi behaves as it does - e.g. I no longer test buffer bloat via wifi - it's almost always "bad" now if multiple clients are on the network. The same test from wired clients looks fine, if not great. Buffer bloat tested from wifi used to be great for me - a long time ago now.

Ah ... IC.

I did not study how I can disable ATF. From what I can understand from the mac80211 sources, it looks like ATF is embedded into it, and will be difficult to yank it out.

So far tho, it looks like for the master branch, since Jun 2021, the mac80211 with ath10k's ATF behaviour has changed. This change for the 21.02 branch happened in Nov 2021. This change made my R7800's Wi-Fi unusable after a few days of up-time.

Anyway, from my understanding, it will be impossible to make Wi-Fi perform with consistently low buffer bloat, as it largely depends on how 'noisy' it is at your location. For me, at any one time, I can see more than 10 APs broadcasting on the same channel as my AP, at least for the non-DFS 80MHz channels (36, 149). Also, ATF only really affects the AP -> Client direction. It cannot really control Client -> AP, other than by signalling to the clients that they can send to the AP.

I think that the change I suspected to be causing issues for me may be causing issues for you as well. From what I understood of that change, there's a mac80211 API that the ath10k driver uses that dictates whether ath10k may transmit to clients, and that change somehow is not sending traffic if it comes from the AP. I suspect this is causing issues with the mac80211 MAC signalling. Why it causes issues after a few days for me, I have no idea.

Alright, I'll keep following what you do.

In the meantime, I'm going to try some old builds that might predate the first time I started to notice this (Aug. 2020). My hope is that I can show I just didn't test enough with earlier builds with multiple clients (i.e. the "problem" has always been there, I just noticed it when more clients were on the AP).

Obviously I don't care so much about a buffer bloat test - I only tried testing due to family members complaining about video conferencing which didn't use to be a problem.

Some counter that increases for every bad report until it reaches an unrealistic value that compromises the entire network?

I would suggest trying builds from as early as June 2021, as this commit in master is definitely causing problems for ath10k.

Could be. It's just that I'm not familiar enough with mac80211 and ath10k to conclude what the root cause is.

@quarky
As for the r7800 shouldn't more people have this problem with wifi slowing down?
I can't confirm it myself right now as I'm struggling with reboots.

@hnyman Any input here? Is wifi slowing down over time anything you noticed "recently"?

@anon98444528
The d7800 is obviously different than the r7800 so your test case scenario is a bit harder....

I'm wondering this myself. But I guess most folks are not running the latest builds? From what I can tell, 21.02 should have started behaving weirdly since Nov 2021, while master should have started behaving weirdly since July 2021. My R7800 started having reboot fits as well, since Nov 2021 if memory serves, which is when I started flashing my R7800 with newer 21.02 builds.

Not much input from me. I use wifi for my mobile devices, but the high-speed volumes are from my PC with wired connection.

Is this limited to CT firmware/drivers or not?
I'm on OpenWrt 21.02-SNAPSHOT r16416-ecab623a38 with "old" ath10k firmware and drivers, uptime 3 days (I lost power in my home), and wifi performance doesn't seem to have any issue.

I'm not using the ath10k-ct drivers as I wanted the encap offload capabilities that ath10k supports.

I'm not sure if OpenWrt 21.02-SNAPSHOT r16416-ecab623a38 is built with or without the patch that I suspect is causing the issue. If the snapshot build was done after this commit, then it should affect Wi-Fi stability after some time.

When the router reboots, or if the Wi-Fi interfaces are restarted, throughput would be good. After prolonged use, latency starts to climb until it becomes unusable. At that point I either have to reboot the router, or somehow log in to the router and trigger a Wi-Fi restart.

On this, sadly I can't help you, but we can ask @ACwifidude (I'm using his repo).
Or can I check this from a live system?

Edit to add: I can see those patch files in acwifidude's repo, so it seems I have them:

Can you guess after how many days this should happen? I'm not planning any reboot this week, and I'm using my wifi daily, so if it's a matter of one week I can see if things get bad.

If you see this file, then your router should be using the virtual time-based airtime scheduler, instead of the older round-robin scheduler:

/sys/kernel/debug/ieee80211/phy0/airtime

And the output should give you something like this:

         VO         VI         BE         BK
 Virt-t  19723      1262118    179701010  1660
 Weight  256        256        256        256
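To compare these counters over time (e.g. once a day while waiting for the slowdown), the table can be turned into a dictionary. This is only a sketch that handles the layout quoted above; the real debugfs file may contain other lines, which this parser simply skips. The names `parse_airtime` and `AC_NAMES` are made up for illustration.

```python
AC_NAMES = ["VO", "VI", "BE", "BK"]  # access categories, in the order shown above


def parse_airtime(text):
    """Parse the quoted table into {row_label: {access_category: int}}.

    Lines before the VO/VI/BE/BK header, and rows that are not one label
    followed by four integers, are ignored.
    """
    header = None
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if fields == AC_NAMES:
            header = fields
        elif (header and len(fields) == len(header) + 1
              and all(f.isdigit() for f in fields[1:])):
            table[fields[0]] = dict(zip(header, map(int, fields[1:])))
    return table


SAMPLE = """\
         VO         VI         BE         BK
 Virt-t  19723      1262118    179701010  1660
 Weight  256        256        256        256
"""

stats = parse_airtime(SAMPLE)
print(stats["Virt-t"]["BE"], stats["Weight"]["VO"])
```

On a live router the text would come from reading `/sys/kernel/debug/ieee80211/phy0/airtime`, which needs root and a mounted debugfs.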

From memory, I started to see issues as soon as 3 days into use. When I play online games on my iPad, I frequently see Wi-Fi issues as well, and that can be soon after a restart. The online games do not need high bandwidth, but high latencies will kill game play. Also from memory, I started seeing this in Nov. 2021.

I'm not sure if that patch is causing the issue tho. Reverting my latest build to the old round-robin scheduler solved my online game issue, and it has been stable for a week.

Now I'm also testing the new time-based scheduler with behaviour that is consistent with the old round-robin one. I'll report back in about a week on how it goes.

One test you can try is to continuously ping the router from a wireless client that is connected to it. If you see occasional ping spikes into the 100s to 1000s of ms, then you have the issue as well.
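If you'd rather not watch the ping output by eye, a saved log can be scanned for spikes. The sketch below assumes the usual `time=... ms` field that both Linux iputils and BusyBox ping print per reply; the 100 ms threshold, the sample log, and the names `find_spikes` and `RTT_RE` are illustrative choices, not anything from a tool.

```python
import re

# Matches the per-reply round-trip time, e.g. "time=2.31 ms" or "time<1 ms"
RTT_RE = re.compile(r"time[=<]([0-9.]+) ?ms")


def find_spikes(ping_output, threshold_ms=100.0):
    """Return every round-trip time (in ms) at or above threshold_ms."""
    rtts = [float(m.group(1)) for m in RTT_RE.finditer(ping_output)]
    return [rtt for rtt in rtts if rtt >= threshold_ms]


# Made-up log for illustration only, not a measurement from the thread.
SAMPLE_LOG = """\
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=2.31 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=153 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=1204 ms
"""

print(find_spikes(SAMPLE_LOG))
```

Lost replies do not show up as spikes here, since they have no `time=` field, so it is worth checking the packet loss summary at the end of the ping run as well.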
