Netgear R7800 exploration (IPQ8065, QCA9984)

Ansuel · March 16, 2022, 2:30am

mhhh then why do we have wfi disabled in dts? spc is disabled for r7800 btw

this is the current state... I don't remember whether the r7800 failed to work with cpu_spc set to okay
(i fixed all the spm status tho)

anon98444528 · March 16, 2022, 10:46am

I can't answer that. Since wfi seems to work for all ipq806x but spc only works for certain devices, my suggested method to handle that was this. IIRC you chose a different route to accomplish the same thing in the dts - both dts methods work.

Ansuel · March 16, 2022, 3:41pm

Ok, something is not right here... the arm wfi state is initialized and set anyway, even if it's not declared in dts (set to status = disabled).
And on top of that, the system actually does a power collapse instead of a simple wfi (aka clock gating)... wth????
It looks like the spm is all wrong here... or I'm missing something (again, this is very complex, even more complex than mux and clocks o.O)

anon98444528 · March 16, 2022, 3:59pm

My recollection is that wfi is "enabled" by adding qcom,apq8064-saw2-v1.1-cpu to the saw node compatible entry (i.e. no dts entry is needed to enable or disable). status = disabled was only for spc and ipq8065 so that the r7800 would not crash on boot (one can disable spc at run time otherwise).
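One way to check whether a running system's device tree actually carries that compatible string is to search the tree the kernel exports under /sys/firmware/devicetree/base. A minimal sketch, assuming a POSIX shell with find and grep; the helper name find_compatible is made up for illustration and is not part of OpenWrt:

```shell
# Print every device-tree "compatible" property that contains a string.
# $1: string to look for, e.g. qcom,apq8064-saw2-v1.1-cpu
# $2: device-tree root (defaults to the tree exported by the running kernel)
find_compatible() {
    # "compatible" is a list of NUL-separated strings; grep -l only needs
    # to report which property files contain the string.
    find "${2:-/sys/firmware/devicetree/base}" -name compatible \
        -exec grep -lF -- "$1" {} \;
}
```

Running `find_compatible qcom,apq8064-saw2-v1.1-cpu` on the router should list the saw nodes that have the entry, or print nothing if it is absent.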

That's interesting. I don't think anyone noticed that before.

As to spm being all wrong, I guess that wouldn't surprise me given what you found from your efforts with the cpu freq and nss. It is complex - the hard coded (and not well documented) values for ipq8064 in the spm.c are still a mystery to me.

I suspect those that did the original coding were faced with time constraints (as well as lack of proper documentation) and just did the best they could with what they had. It would be nice if more documentation was released so that those that have the time and interest can improve it.

Ansuel · March 16, 2022, 4:11pm

the thing is that wfi is an arm standard and should still be enabled only through the common arm,idle-state compatible. (yet somehow, with no idle-state node defined, we have wfi set on the system)

The bad thing is the logic. arm,idle-state declares a generic wfi (clock gating) state (in the generic code there is a comment that says that if a driver needs to do something special for wfi, the generic wfi idle state can be overridden).

In theory this is what should happen, but instead we override wfi and set the enter ops to the spc function... so in reality it looks like we enter spc instead of wfi.

A big test would be to work out how to disable the idle state and check whether this makes any difference to the stability of the system.
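Before disabling anything, it may help to see which idle states are actually registered and how often each one is entered. A rough sketch that dumps the standard cpuidle sysfs attributes (name, usage, time, disable) per CPU; the helper name list_idle_states is hypothetical:

```shell
# List every cpuidle state of every CPU.
# $1: sysfs CPU directory (defaults to the standard Linux location).
list_idle_states() {
    base="${1:-/sys/devices/system/cpu}"
    for state in "$base"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -d "$state" ] || continue   # glob did not match: no cpuidle states
        # state is .../cpuN/cpuidle/stateM, so go two levels up for cpuN
        cpu=$(basename "$(dirname "$(dirname "$state")")")
        printf '%s %s name=%s usage=%s time_us=%s disable=%s\n' \
            "$cpu" "$(basename "$state")" \
            "$(cat "$state/name")" "$(cat "$state/usage")" \
            "$(cat "$state/time")" "$(cat "$state/disable")"
    done
}
```

Comparing the usage counters before and after a test run would show whether a state that is supposed to be disabled is still being entered.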

Ansuel · March 16, 2022, 4:15pm

Well, let's try with

echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state0/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state0/disable
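The same idea as a loop, so that it covers every CPU and can be undone by writing 0 back. This is only a sketch; set_idle_state is a made-up helper name, and the optional third argument exists just so the function can be pointed at a different directory:

```shell
# Set the "disable" flag of one cpuidle state on every CPU.
# $1: state index (e.g. 0)
# $2: 1 to disable the state, 0 to re-enable it
# $3: sysfs CPU directory (defaults to the standard Linux location)
set_idle_state() {
    idx="$1"; flag="$2"; base="${3:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpuidle/state"$idx"/disable; do
        [ -e "$f" ] || continue       # CPU has no such state
        echo "$flag" > "$f"
        echo "$f = $(cat "$f")"       # read back to confirm the write stuck
    done
}
```

`set_idle_state 0 1` would disable state0 everywhere and `set_idle_state 0 0` would restore it. The setting lives in sysfs only, so it does not survive a reboot.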

If only i didn't run the router overclocked....

root@Ansuel-Router:~# cat /sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table

   From  :    To
         :    384000    600000    800000   1000000   1400000   1725000   1900000
   384000:         0         7         4         5         6         1         7
   600000:        10         0    555351     56829         2         2     28099
   800000:         8    559528         0     45665         1         6      7418
  1000000:         5     50766     50630         0        17         5      5804
  1400000:         3         0         4        16         0        17         8
  1725000:         1         2         2         9        10         0        32
  1900000:         3     29991      6634      4703        12        25         0
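A table like this is easier to compare between runs once each row is summed. A small sketch, assuming an awk is available (BusyBox awk should do); sum_transitions is a hypothetical helper name:

```shell
# Sum the transitions out of each frequency in a cpufreq trans_table.
# $1: path to a trans_table file (defaults to cpu0's).
sum_transitions() {
    awk '
        /^ *[0-9]+:/ {                 # data rows start with "<freq>:"
            freq = $1; sub(":", "", freq)
            n = 0
            for (i = 2; i <= NF; i++) n += $i
            printf "%s %d\n", freq, n
            total += n
        }
        END { printf "total %d\n", total }
    ' "${1:-/sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table}"
}
```

In the table above, the 600000/800000 pair clearly dominates (555351 transitions one way, 559528 the other), and the per-row sums make that kind of pattern visible at a glance.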

anon98444528 · March 16, 2022, 4:16pm

From our 4.19 experience, there is still a fall back cpu idle so I think you can safely test. I would not try to bypass the fall back cpu idle tho if that's your intention. I think something needs to be there to keep the cpu "happy" when it has nothing to do.

anon98444528 · March 16, 2022, 5:34pm

Your understanding about how the cpu's for ipq806x should work has increased considerably over the past 2.5 years.

If you have not already done so, try looking back at what we did trying to understand what changed wrt cpuidle between 4.14 and 4.19. In particular, this post and a few below that.

Given your observations above, it sounds like the "past fix" might only be appropriate for 8064 cpu's and not 8065. It seems like something more sophisticated is needed to properly enable cpu idle for both 8064 and 8065 cpu's.

Ansuel · March 16, 2022, 6:04pm

the old days of blogic working on qca8k, wonderful days.... why was I late to the party, god damn...

Anyway, to me the whole implementation looks wrong:

boot cpu (cpu0) has a different sequence to enter power collapse
the other cpu uses the apq8064 sequence
the l2 sequence is missing, and I think it is required, since l2 needs to be set to power retention
wfi is set to enter power collapse when entering idle (wrong function, I assume?)
the wfi assembly instructions are missing; I have to check whether they are present in the kernel
the spc assembly instructions are missing

I don't think there is a difference between ipq8064 and ipq8065 themselves, just in how the bootloader initializes things. It could be that the ipq8064 bootloader initializes things correctly and prevents a crash, while the ipq8065 one doesn't. But in any case, in theory spc should never be triggered, as it wasn't enabled in the original code. It could be that it's simply not supported. If that's the case, the crashes could really be caused by an incorrect configuration of the idle states... to be more precise, l2 going to sleep instead of into power retention, which causes freezes / kernel panics.

anon98444528 · March 16, 2022, 6:25pm

@nmrh
There is an (undocumented?) different cpu "bring up" mechanism specified in the qcom-ipq8064.dtsi

Not sure if the other one would help. In the absence of documentation, experimentation is likely your only option.

Don't be too quick to discard spc. It might reduce the heat load on the cpu's making it more likely that users can overclock. If you like conspiracy theories, I remember a case of a computer manufacturer disabling the math co-processor on their cpu's (i.e. all cpu's were the same, they had to spend additional money in manufacturing to disable the co-processor on a fraction of the total produced) just so they could sell the versions with it enabled for a higher margin.

How many users would buy a r7800 if they could safely overclock a r7500v2 to 1.7GHz? How many users will buy the next model if they can safely overclock a r7800 to 1.9GHz? Anyway glad to see you looking at this again and keep an eye on your temperatures when you test with disabled wfi/spc cpuidle.
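For watching temperatures during such a test, the thermal zones exposed in sysfs can be polled. A minimal sketch, assuming the kernel registers its sensors under /sys/class/thermal and reports millidegrees Celsius as usual; print_temps is a made-up helper name:

```shell
# Print each thermal zone's temperature in degrees Celsius.
# $1: thermal class directory (defaults to the standard Linux location).
print_temps() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone[0-9]*; do
        [ -r "$zone/temp" ] || continue
        milli=$(cat "$zone/temp")      # sysfs value is in millidegrees C
        echo "$(basename "$zone"): $((milli / 1000)) C"
    done
}
```

Something like `while true; do print_temps; sleep 10; done` in a second ssh session would give a running log while the idle states are disabled.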

slh · March 16, 2022, 11:52pm

Much of the clock rate bump between ipq8064 and ipq8065 seems to be down to cooling to begin with, we can see several ipq8064 devices (e.g. ASRock g10) without any heat sinks at all, while ipq8065 needs them (and still runs hot).

sppmaster · March 17, 2022, 3:23pm

Can I add a bug for the R7800 to this thread?
I've posted my final findings here.

The issue was reported on Github

github.com/openwrt/openwrt

FS#4221 - R7800 poor download rate when client connects with 100FD

opened 10:13AM - 09 Jan 22 UTC

openwrt-bot

target/ipq806x flyspray bug

*Ernie63:* Netgear Nighthawk X4S R7800 with OpenWrt 21.02.1 freshly installed … ISP line is 250/25. Wan port is 1G (auto). Steps to reproduce: Laptop connected on LAN-side with 1G (auto): Ookla speedtest is 250/25. Laptop connected on LAN-side with 100FD: Ookla speedtest is only 30/25. This performance degration only occurs, when the R7800 is routing and natting the traffic. "Switched only" traffic inside the LAN is not affected. With stock ROM the download rate is as expected: Laptop connected on LAN-side with 100FD: Ookla speedtest is 80/25.

I hope the developers will be able to fix this bug soon.

Ansuel · March 17, 2022, 5:26pm

if you can build your own image... did you test this with the dsa driver?

sppmaster · March 17, 2022, 5:37pm

Yesterday I tried @hnyman's test-DSA build, but I had no laptop with a cable connection to test it fully with iperf3. I tested with an Android TV box that only has a 100Mbps NIC, and I think the results were bad. I can build my own image, but I may need some help with the DSA part. I don't know whether the low WAN speed is caused by swconfig or by the DSA driver. Maybe more users can test and confirm. It's really simple and only takes a few minutes.

quarky · March 18, 2022, 12:17am

Incidentally, there’s also a similar recent issue with the mt7622 DSA drivers that causes link speed to throttle down to the lowest client link speed. On my E8450, upload speed is limited to 10Mbps when the TV connected to it is in standby mode, where its LAN port switches to a 10Mbps link speed. Disconnecting the TV gives me back full gigabit speed.

I sort of traced it down to the code changes where the switch port auto learning feature was disabled.

Wonder if both issues are similar.

sppmaster · March 18, 2022, 8:49am

@quarky @Ansuel
Quarky, is this the same issue you were talking about?

github.com/openwrt/openwrt

MT7622: Fix FDB learning bugs when VLAN filtering is enabled causes performance loss

opened 09:22PM - 07 Mar 22 UTC

dietcoke73

kernel target/mediatek bug

Fix FDB learning bugs when VLAN filtering is enabled was merged in ee6ba216d8ba1…b02154c287e64d709a8bc7b0054 but causes a performance drop on LAN ports. Multiple users discussing this issue in Forum topic https://forum.openwrt.org/t/belkin-rt3200-linksys-e8450-wifi-ax-discussion/94302/1785 TL;DR summary - when mulitple devices are connected to the switch, the performance of the switch drops to the lowest speed of the connected devices. So for example a 1000mbps capable device receives data only at the 10mbps data rate. Reverting this commit ee6ba216d8ba1b02154c287e64d709a8bc7b0054 tested by multiple users corrects the problem as the cost of reintroducing the original issue.

quarky · March 18, 2022, 9:16am

Yup, this is the issue I was referring to.

quarky · March 18, 2022, 9:25am

Did you happen to try 21.02 builds without NSS acceleration? I remember you encountered performance issues with the 21.02 NSS builds?

I've now replaced my E8450 (planning to 'fix' the mt7530.c driver) with my R7800 running my custom 21.02 builds with NSS acceleration enabled. I do not see any performance hit even with a 10Mbps link connected to the R7800.

sppmaster · March 18, 2022, 9:59am

I only see WAN performance degradation when one or more devices are connected at 100Mbps and there is LAN traffic between LAN-connected devices at the same time (many users probably don't notice this because two tricky conditions have to be present at once). At least one of the devices involved in the LAN traffic has to be connected at 100Mbps. I simulated this with a 1Gbps laptop connected with a 100Mbps cable and with a device that has a 100Mbps NIC.
If the LAN traffic is between two devices and both are at 1Gbps the WAN performance degradation doesn't occur. I see full WAN performance in this case. My link is 1Gbps/700Mbps.
To reproduce this, I run iperf3 with a predefined throughput of only 60Mbps between a laptop connected at 100Mbps (this is the culprit!) and a PC connected at 1Gbps. The result: the PC can only download/upload from/to WAN at really low speeds, usually just 15-20Mbps and only rarely 100-200Mbps, with ping above 100ms. This depends on the current LAN traffic. Larger LAN transfers (such as several 4K LAN streams totalling 100-200Mbps) leave the devices almost completely unable to send to or receive from WAN.
But if several devices are connected to an additional switch (gigabit, as in my case) and even one 100Mbps device among them is involved in moderate LAN traffic at the same time, then all devices see a huge WAN performance drop, down to as little as 5-6Mbps.
If there is a third device (probably a fourth one too) connected at 1Gbps on another R7800 port, it can download/upload from/to WAN at full speed as long as it isn't currently taking part in LAN transfers with a 100Mbps device. This holds even if another 1Gbps/100Mbps pair of devices is being affected by the bug at the same time.

I have to stress that two conditions must be met at the same time to reproduce the WAN performance drop: a client connected at 100Mbps (let's call it the "Problem Client"), and LAN traffic between the "Problem Client" and any other device connected to the LAN by cable.
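The LAN load for this reproduction can be sketched roughly as follows. It assumes `iperf3 -s` is already running on the 1Gbps PC and that the function is run on the 100Mbps "Problem Client"; the address in the usage note and the helper name lan_load are hypothetical, while -c, -b and -t are regular iperf3 client options:

```shell
# Generate rate-limited LAN traffic toward an iperf3 server.
# $1: server address
# $2: target bitrate (default 60M, as in the description above)
# $3: duration in seconds (default 60)
# IPERF3 can be overridden (e.g. IPERF3=echo) to preview the arguments
# without sending any traffic.
lan_load() {
    ${IPERF3:-iperf3} -c "$1" -b "${2:-60M}" -t "${3:-60}"
}
```

With `lan_load 192.168.1.10` running, a WAN speed test started on the PC at the same time should show whether the drop occurs.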

shelterx · March 21, 2022, 7:57am

Weird bugs that surfaced recently...
I'm certain now that the CPU frequency scaling bug can be worked around by setting a higher min clock; I have 23 days uptime with a min clock of 800MHz.

EDIT: Now uptime is 34d 16h 37m 0s
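For anyone wanting to try the same workaround, raising the minimum clock comes down to writing scaling_min_freq for each cpufreq policy. A sketch under the assumption that the policies show up under /sys/devices/system/cpu/cpufreq; set_min_freq is a made-up helper name, and the value is given in kHz:

```shell
# Raise the minimum CPU frequency on every cpufreq policy.
# $1: frequency in kHz (e.g. 800000 for 800MHz)
# $2: cpufreq directory (defaults to the standard Linux location)
set_min_freq() {
    khz="$1"; base="${2:-/sys/devices/system/cpu/cpufreq}"
    for p in "$base"/policy[0-9]*; do
        [ -w "$p/scaling_min_freq" ] || continue
        echo "$khz" > "$p/scaling_min_freq"
        # read back: the kernel may clamp the value to a supported step
        echo "$(basename "$p"): min=$(cat "$p/scaling_min_freq")"
    done
}
```

`set_min_freq 800000` would match the 800MHz setting mentioned above. The value is lost on reboot, so it would have to be reapplied from a startup script to stick.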