Netgear R7800 exploration (IPQ8065, QCA9984)

Hmm, I thought that went "if it's not broken, break it." Perhaps that's why I have so much trouble.

You know, if you switch back to ct you can turn the AQL on or off, switch between virtual ATF and round robin, or even try the ATF algo proposed by @castiel652, all at the flick of a fwcfg variable and rmmod ath10k_pci && modprobe ath10k_pci.

Just saying, if you're going to complicate your life, do it right.

Haha ... I'm more inclined for stability rather than cutting edge. That's why my R7800 is on 21.02.

Something weird has been going on since I upgraded to the official 21.02.2 build: there is at least one device (a Pixel phone) that, when it leaves the house, causes the wifi network to suffer very high latency and packet loss for a few minutes, making it unusable for the rest of the clients. Then it either recovers on its own, or I can force it to recover by logging in to the R7800 (via wired, which still works fine) and typing wifi, which restarts wifi.

The log looks like this (MAC redacted):

Wed Mar  9 09:45:50 2022 daemon.notice hostapd: wlan0: AP-STA-DISCONNECTED xx:xx:xx:xx:xx:xx
Wed Mar  9 09:45:50 2022 daemon.info hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due to inactivity
Wed Mar  9 09:45:51 2022 daemon.info hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.197590] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.197631] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.203782] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.211061] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.218685] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.225663] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.232951] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.240228] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7
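To see at a glance which peer and TID the warnings belong to, a saved copy of the log can be tallied with a few lines of Python. This is only a rough sketch: the function name count_txq_failures and the SAMPLE_LOG excerpt are made up for illustration, and the regular expression assumes the message format shown in the log above.

```python
import re
from collections import Counter

# Matches the ath10k warning format seen in the log above (assumed stable).
TXQ_WARNING = re.compile(
    r"ath10k_pci (?P<dev>\S+): failed to lookup txq "
    r"for peer_id (?P<peer>\d+) tid (?P<tid>\d+)"
)


def count_txq_failures(lines):
    """Count 'failed to lookup txq' warnings per (peer_id, tid) pair."""
    counts = Counter()
    for line in lines:
        match = TXQ_WARNING.search(line)
        if match:
            counts[(int(match["peer"]), int(match["tid"]))] += 1
    return counts


# Hypothetical excerpt, copied from the log lines quoted in this post.
SAMPLE_LOG = """\
Wed Mar  9 09:45:51 2022 daemon.info hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.197590] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.197631] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 0
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.218685] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7
Wed Mar  9 09:45:51 2022 kern.warn kernel: [150007.225663] ath10k_pci 0000:01:00.0: failed to lookup txq for peer_id 16 tid 7
"""

if __name__ == "__main__":
    counts = count_txq_failures(SAMPLE_LOG.splitlines())
    for (peer, tid), n in sorted(counts.items()):
        print(f"peer_id={peer} tid={tid}: {n} warnings")
```

Every warning in the excerpt points at the same peer_id, which fits the picture of a single station that has just been removed.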

Note that I always run mainline driver + firmware (not CT). This setup was solid on 21.02.1 and before, problems started with the upgrade to 21.02.2. I'll go back to 21.02.1 to make sure that version is still solid.

Does this ring any bells? Potential hostapd issues, which get patched with backports?

WAG: Android devices like IPv6. Is software offloading activated? If it is, you missed the fine print.

There is no IPv6 on my LAN and software offloading is not enabled, the R7800 just runs in AP mode.

This solution may help. No harm trying it out IMHO.

Sorry to interrupt the majestic workflow of this topic. Really, I'm a noob and hope not to disturb too much.
I'd only like to know whether a stable build with NSS enabled exists somewhere, and whether you would suggest it to an average user like me (just a simple opkg update, install and setup).
The only NSS builds I stumbled on are master and snapshot builds, and I'm really not familiar enough with OpenWrt to feel comfortable.
Thank you so much for your immense knowledge and work.

Feel free to try it out. You can flash back and forth from stable.

It's simply using mac80211 to calculate airtime.
Not an algo proposed by me.

I suppose I could use "implementation" in place of "algo," but perhaps it's best just to reference your patch.

EDIT: castiel652's patch is only for ath10k-ct and they originally suggested it for the r7500v2 which apparently does not support virtual time based ATF without such a modification. I'm not sure if/how well it will run on other ipq806x systems like the r7800.

I'm currently running an adapted version of this patch on the r7500v2 which allows me to use the ath10k-ct fwcfg API to "turn on/off" this alternate airtime calculation and, I hope, "turns off" any other airtime calculation when using castiel652's patch.

Should I work on nss for 5.15 or continue my war of pushing patches upstream?

  • nss for 5.15
  • send patch upstream

The current state of pushing the patch upstream is:

  • working on tcsr (low hope I will manage to find a way)
  • pushed gcc fixes for rpm (merged)
  • proposed dtsi changes
  • proposed spm patch
  • working on improving qca8k
  • cpufreq driver is a mystery... no idea if it will ever be merged
  • have to refresh all the krait scale driver
  • work on a correct nss scale driver

For nss

  • drop all the qsdk shittery and investigate how to enable nss offload directly in the gmac driver
  • make all the offload code work on 5.15 (for 5.10 there is already some code and minimal support, but I'm full of exams to do and I hope @ACwifidude can work on adding support for 5.10 with openwrt 22.0, which will be based on 5.10)

Hard choice, if you figure out why some fw rules require br-lan to be in promisc mode and fix the bug that makes the router crash after a day or two because of this, I'd say go for NSS. But do whatever you want really.

I still have to investigate it, and I don't know if it's a placebo, but currently my router lasts 7 days...
Also I'm not sure about something... (that I can't actually test)

In the original firmware they set some clk controlled by rpm to the normal value... this can be idle, normal or turbo... I have no idea if the clk is set to turbo by default... I wonder if I should test that as well... Aside from that, the crash dump can only be caused by a hardware defect or some problem in the mux

5.10 with NSS is working well for most ipq806x devices. 22.xx should be easy to support since it is branching off here shortly.

If upstream is a pain I'd love to get 5.15 working, at least in a custom build sense. Especially if we eventually have access to newer qsdk versions.

oh you actually use the qsdk11 firmware?

The problem with upstreaming patches is that it's slow and sometimes they ask for very strange changes...

Looking at the .bin files I think I made an error and I’m still on 10.0. Need to change the naming convention back.

:frowning: I advise putting some effort into switching to the "leaked" qsdk11, so at least we don't use a too ancient version

@robimarko funny story for you

I'm doing some work with the clk drivers and I found something that is incredible...

image

In all this time we were lucky that the driver was shit and just ignored the provided clocks...
PXO_SRC is not defined in the gcc driver but only in the include...
I'm converting everything to a sane implementation and I just noticed this... I had to enable earlycon to discover it... never expected a definition that wrong...

So in short

  • The documentation documents something that nobody ever followed.
  • The DTS contains a phandle to something never defined in gcc
  • The driver ignores the clock and works because it has hardcoded parent names

Now that I'm converting everything to the parent_data API and dropping every hardcoded value, all this shit comes up LOL...

Anyway, fun story... I'm playing with the mux and it's so funny setting the core to source its clk from the pxo clock (that is 25 MHz). It's too fun how slowly the system works, but it still does work. AHAHAH

Also the real problem now is how to fix this mess... this bad value was pushed upstream and they already said that doing this kind of change is a NONO... considering that fixing the driver causes the router to panic and not boot at all... I'm stuck now...
Testing whether I can manage to add a pxo definition to the gcc driver, but no idea if it will work...

That just screams QCA, it would be way too easy otherwise.
BTW, I still have not figured out how the IPQ40xx SDCC clock was ever supposed to work. One of its parents is XO, a fixed reference clock that gets divided down for the 140 and 400 kHz rates. That really upsets clk_set_rate(): it gets NULL back from the function that is supposed to find the topmost parent that needs changing, and then it just throws an error and that's it.
And it all boils down to fixed clocks not having a determine_rate or round_rate op, since they are obviously just fixed.
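The failure mode described here can be sketched with a toy model in Python. This is not the kernel clock framework, just an illustration under loud assumptions: ToyClock, find_top_clock_to_change and set_rate are invented names, the 48 MHz XO rate is an illustrative value only, and the model assumes the SDCC clock forwards rate requests to its parent.

```python
class ToyClock:
    """Toy clock node for illustration; not the kernel's clock framework."""

    def __init__(self, name, rate_hz, parent=None,
                 can_round=False, forwards_to_parent=False):
        self.name = name
        self.rate_hz = rate_hz
        self.parent = parent
        # A fixed clock has no op to negotiate a rate, so can_round is False.
        self.can_round = can_round
        # Assumption: this clock passes rate requests up to its parent.
        self.forwards_to_parent = forwards_to_parent


def find_top_clock_to_change(clk):
    """Walk up the chain; return the clock that must change, or None."""
    if clk.can_round:
        return clk
    if clk.parent is None or not clk.forwards_to_parent:
        return None  # dead end, e.g. a fixed reference clock
    return find_top_clock_to_change(clk.parent)


def set_rate(clk, wanted_hz):
    """Set a rate, failing when no clock in the chain can be changed."""
    top = find_top_clock_to_change(clk)
    if top is None:
        raise ValueError(
            f"{clk.name}: no clock in the chain can provide {wanted_hz} Hz"
        )
    top.rate_hz = wanted_hz  # toy: pretend the hardware was reprogrammed
    return top


if __name__ == "__main__":
    xo = ToyClock("xo", 48_000_000)  # fixed reference, illustrative rate
    sdcc = ToyClock("sdcc", 48_000_000, parent=xo, forwards_to_parent=True)
    try:
        set_rate(sdcc, 400_000)
    except ValueError as err:
        print("error:", err)
```

In this sketch the request ends up at the fixed clock, the lookup returns None, and the caller can only give up with an error, which is the behaviour complained about above.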

Is that PXO even needed?
Because if not, just get rid of it.