Netgear R7800 exploration (IPQ8065, QCA9984)

anon98444528 · March 16, 2022, 4:16pm

From our 4.19 experience, there is still a fallback cpu idle, so I think you can safely test. I would not try to bypass the fallback cpu idle though, if that's your intention. I think something needs to be there to keep the CPU "happy" when it has nothing to do.

anon98444528 · March 16, 2022, 5:34pm

Your understanding about how the cpu's for ipq806x should work has increased considerably over the past 2.5 years.

If you have not already done so, try looking back at what we did trying to understand what changed wrt cpuidle between 4.14 and 4.19. In particular, this post and a few below that.

Given your observations above, it sounds like the "past fix" might only be appropriate for 8064 cpu's and not 8065. It seems like something more sophisticated is needed to properly enable cpu idle for both 8064 and 8065 cpu's.

Ansuel · March 16, 2022, 6:04pm

The old days of blogic working on qca8k, wonderful days... why was I late to the party, god damn...

Anyway, to me the whole implementation looks wrong:

the boot CPU (cpu0) has a different sequence to enter power collapse
the other CPUs use the apq8064 sequence
the L2 sequence is missing, and I think it is required, as L2 needs to be set to power retention
WFI is set to enter power collapse when entering idle (wrong function, I assume?)
the WFI assembly instructions are missing; I have to check whether they are present in the kernel
the SPC assembly instructions are missing

I don't think there is a difference between ipq8064 and ipq8065 themselves, only in how the bootloader initializes things. It could be that ipq8064 initializes things correctly, preventing a crash, and ipq8065 doesn't. But in any case, in theory SPC should never be triggered, as it wasn't enabled in the original code. It could be that it's just not supported. If that's the case, the crashes could really be caused by an incorrect configuration of the idle states; to be more precise, L2 going to sleep instead of into power retention, which causes freezes / kernel panics.
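For anyone who wants to check which idle states the running kernel actually exposes before experimenting, here is a minimal Python sketch that reads the standard Linux cpuidle sysfs files (`name` and `disable` under each `cpuN/cpuidle/stateX` directory). The function name and the `cpu_root` parameter are just illustrative, and which state names show up on ipq806x depends on the kernel build, so treat it as a starting point rather than a description of this target.

```python
from pathlib import Path


def list_cpuidle_states(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: [(state_dir, state_name, disabled), ...]} from sysfs.

    cpu_root is a parameter only so the function can be pointed at a copy of
    the tree; on a live system the default is the usual location.
    """
    result = {}
    for cpu_dir in sorted(Path(cpu_root).glob("cpu[0-9]*")):
        cpuidle_dir = cpu_dir / "cpuidle"
        if not cpuidle_dir.is_dir():
            continue  # no cpuidle driver registered for this CPU
        states = []
        for state_dir in sorted(cpuidle_dir.glob("state[0-9]*")):
            name = (state_dir / "name").read_text().strip()
            # "disable" holds 0 when the state may be used, non-zero otherwise
            disabled = (state_dir / "disable").read_text().strip() != "0"
            states.append((state_dir.name, name, disabled))
        if states:
            result[cpu_dir.name] = states
    return result
```

Calling `list_cpuidle_states()` on the router and printing the result shows, per CPU, which states exist and whether any of them is currently disabled.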

anon98444528 · March 16, 2022, 6:25pm

@nmrh
There is an (undocumented?) different CPU "bring up" mechanism specified in the qcom-ipq8064.dtsi

Not sure if the other one would help. In the absence of documentation, experimentation is likely your only option.

Don't be too quick to discard spc. It might reduce the heat load on the cpu's making it more likely that users can overclock. If you like conspiracy theories, I remember a case of a computer manufacturer disabling the math co-processor on their cpu's (i.e. all cpu's were the same, they had to spend additional money in manufacturing to disable the co-processor on a fraction of the total produced) just so they could sell the versions with it enabled for a higher margin.

How many users would buy a r7800 if they could safely overclock a r7500v2 to 1.7GHz? How many users will buy the next model if they can safely overclock a r7800 to 1.9GHz? Anyway glad to see you looking at this again and keep an eye on your temperatures when you test with disabled wfi/spc cpuidle.
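Since the advice is to watch temperatures while testing, here is a small Python sketch that reads the generic Linux thermal sysfs interface, where each `thermal_zoneN/temp` file reports millidegrees Celsius. The function name and the `thermal_root` parameter are illustrative; how many zones exist and which sensor each one maps to on the R7800 is not something this sketch knows.

```python
from pathlib import Path


def read_temperatures(thermal_root="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    temps = {}
    for zone in sorted(Path(thermal_root).glob("thermal_zone[0-9]*")):
        try:
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read right now
        temps[zone.name] = millideg / 1000.0
    return temps
```

Polling this every few seconds while the router idles and then under load gives a rough before/after comparison for any cpuidle change.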

slh · March 16, 2022, 11:52pm

Much of the clock rate bump between ipq8064 and ipq8065 seems to be down to cooling to begin with; we can see several ipq8064 devices (e.g. ASRock G10) without any heat sinks at all, while ipq8065 needs them (and still runs hot).

sppmaster · March 17, 2022, 3:23pm

Can I add an R7800 bug to this thread?
I've posted my final findings here.

The issue was reported on GitHub:

github.com/openwrt/openwrt

FS#4221 - R7800 poor download rate when client connects with 100FD

opened 10:13AM - 09 Jan 22 UTC

openwrt-bot

target/ipq806x flyspray bug

*Ernie63:* Netgear Nighthawk X4S R7800 with OpenWrt 21.02.1 freshly installed … ISP line is 250/25. Wan port is 1G (auto). Steps to reproduce: Laptop connected on LAN-side with 1G (auto): Ookla speedtest is 250/25. Laptop connected on LAN-side with 100FD: Ookla speedtest is only 30/25. This performance degradation only occurs when the R7800 is routing and natting the traffic. "Switched only" traffic inside the LAN is not affected. With stock ROM the download rate is as expected: Laptop connected on LAN-side with 100FD: Ookla speedtest is 80/25.

I hope the developers will be able to fix this bug soon.

Ansuel · March 17, 2022, 5:26pm

if you can build your own image... did you test this with the dsa driver?

sppmaster · March 17, 2022, 5:37pm

Yesterday I tried @hnyman's test-DSA build, but I had no laptop with a cable connection to fully test using iperf3. I tested with an Android TV box that has only a 100Mbps NIC, and I think the results were bad. I can build my own image, but I may need some help regarding the DSA part. I don't know whether the reason for the low WAN speed is the swconfig or the DSA driver. Maybe more users can test and confirm. It's really simple and would take a few minutes.

quarky · March 18, 2022, 12:17am

Incidentally, there’s also a similar recent issue with the mt7622 DSA drivers that’s causing link speed to throttle down to the lowest client link speed. For my E8450, upload speed is limited to 10mbps when the TV connected to it is in standby mode, where its LAN port switches to a 10mbps link speed. Disconnecting the TV gives me back full gigabit speed.

I sort of traced it down to the code changes where the switch port auto learning feature was disabled.

Wonder if both issues are similar.

sppmaster · March 18, 2022, 8:49am

@quarky @Ansuel
Quarky, is this the same issue you talk about?

github.com/openwrt/openwrt

MT7622: Fix FDB learning bugs when VLAN filtering is enabled causes performance loss

opened 09:22PM - 07 Mar 22 UTC

dietcoke73

kernel target/mediatek bug

Fix FDB learning bugs when VLAN filtering is enabled was merged in ee6ba216d8ba1…b02154c287e64d709a8bc7b0054 but causes a performance drop on LAN ports. Multiple users discussing this issue in Forum topic https://forum.openwrt.org/t/belkin-rt3200-linksys-e8450-wifi-ax-discussion/94302/1785 TL;DR summary - when multiple devices are connected to the switch, the performance of the switch drops to the lowest speed of the connected devices. So for example a 1000mbps capable device receives data only at the 10mbps data rate. Reverting this commit ee6ba216d8ba1b02154c287e64d709a8bc7b0054 tested by multiple users corrects the problem at the cost of reintroducing the original issue.

quarky · March 18, 2022, 9:16am

Yup, this is the issue I was referring to.

quarky · March 18, 2022, 9:25am

Did you happen to try 21.02 builds without NSS acceleration? I remember you encountered performance issues with the 21.02 NSS builds?

I've now replaced my E8450 (planning to 'fix' the mt7530.c driver) with my R7800 running my custom 21.02 builds with NSS acceleration enabled. I do not see any performance hit even with a 10mbps link connected to the R7800.

sppmaster · March 18, 2022, 9:59am

I only see WAN performance degradation when one or more devices are connected at 100Mbps and there is LAN traffic between LAN-connected devices at the same time (probably because both of these tricky conditions have to be present at once, many users do not notice this). At least one of the devices involved in the LAN traffic has to be connected at 100Mbps. I simulated this both with a 1Gbps laptop connected through a 100Mbps cable and with a device that has a 100Mbps NIC.
If the LAN traffic is between two devices that are both at 1Gbps, the WAN performance degradation doesn't occur; I see full WAN performance in this case. My link is 1Gbps/700Mbps.
To reproduce this, I run iperf3 with a predefined throughput of only 60Mbps between a laptop connected at 100Mbps (this is the culprit!) and a PC connected at 1Gbps. The result: the PC can only download/upload from/to WAN at really low speeds, typically 15-20Mbps and only rarely 100-200Mbps, with ping above 100ms. This depends on the current LAN traffic. Larger LAN transfers (such as several 4K LAN streams totalling 100-200Mbps) leave the devices almost completely unable to receive from or send to WAN.
But if several devices are connected to an additional switch (gigabit, in my case) and even one of them is a 100Mbps device involved in even moderate LAN traffic at the same time, then all devices see a huge WAN performance drop, down to as little as 5-6Mbps.
If a third device (probably a fourth one too) is connected at 1Gbps on another R7800 port, it can download/upload from/to WAN at full speed as long as it is not itself taking part in LAN transfers with a 100Mbps-only device. This holds even while another 1Gbps/100Mbps pair of devices is affected by the bug at the same time.

I have to stress that two conditions must both be met at the same time in order to reproduce the WAN performance drop: a client connected at 100Mbps (let's call it the "Problem Client"), and LAN traffic between the "Problem Client" and any other device connected to the LAN by cable.
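For anyone who wants to script the LAN-traffic half of this reproduction, here is a rough Python sketch that only builds the iperf3 client command line with a capped bitrate. The function name, the 30 second default duration and the server address in the usage note are placeholders, not values from the report; `-c`, `-b`, `-t` and `-R` are regular iperf3 client options.

```python
def iperf3_client_cmd(server, bitrate_mbps=60, seconds=30, reverse=False):
    """Build an iperf3 client command that caps the stream at bitrate_mbps.

    Run it on the 100Mbps "Problem Client" (or on its 1Gbps peer) while
    `iperf3 -s` is running on the other machine.
    """
    cmd = ["iperf3", "-c", server, "-b", f"{bitrate_mbps}M", "-t", str(seconds)]
    if reverse:
        cmd.append("-R")  # server sends, client receives
    return cmd
```

For example, `subprocess.run(iperf3_client_cmd("192.168.1.10"))` (placeholder address) keeps a 60Mbps LAN stream going while a WAN speed test is started from the 1Gbps PC.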

shelterx · March 21, 2022, 7:57am

Weird bugs that surfaced recently...
I'm certain now that the CPU frequency scaling bug can be worked around by setting a higher min clock; I have 23 days of uptime with a min clock of 800MHz.

EDIT: Now uptime is 34d 16h 37m 0s
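A minimal sketch of the min-clock workaround, written in Python for illustration (on the router itself the same write is normally a one-line `echo` in the local startup script). It assumes the standard cpufreq sysfs layout with one `policyN` directory per frequency domain and values in kHz; the function name and the `cpufreq_root` parameter are made up for this example, and 800000 simply mirrors the 800MHz mentioned above.

```python
from pathlib import Path


def set_min_freq_khz(khz, cpufreq_root="/sys/devices/system/cpu/cpufreq"):
    """Write khz to scaling_min_freq of every cpufreq policy; return the paths written."""
    written = []
    for policy in sorted(Path(cpufreq_root).glob("policy[0-9]*")):
        target = policy / "scaling_min_freq"
        target.write_text(f"{khz}\n")  # cpufreq sysfs values are in kHz
        written.append(str(target))
    return written
```

Calling `set_min_freq_khz(800000)` as root raises the floor for every policy; the setting does not survive a reboot, which is why it belongs in a startup script.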

shelterx · March 21, 2022, 5:58pm

Well, WiFi interfaces just went into disabled state now, I had to restart them, the router didn't reboot tho'.

[2069691.801451] device wlan0 left promiscuous mode
[2069691.801588] br-lan: port 3(wlan0) entered disabled state
[2069691.890988] ath10k_pci 0000:01:00.0: could not get mac80211 beacon
[2069692.095787] ath10k_pci 0000:01:00.0: could not get mac80211 beacon
[2069692.300577] ath10k_pci 0000:01:00.0: could not get mac80211 beacon
[2069692.505379] ath10k_pci 0000:01:00.0: could not get mac80211 beacon
[2069692.519704] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
[2069692.519778] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
[2069692.525600] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
[2069692.616753] device wlan1 left promiscuous mode
[2069692.616916] br-lan: port 2(wlan1) entered disabled state
[2069692.770377] ath10k_pci 0001:01:00.0: could not get mac80211 beacon
[2069692.808612] ath10k_pci 0001:01:00.0: peer-unmap-event: unknown peer id 1
[2069692.808724] ath10k_pci 0001:01:00.0: peer-unmap-event: unknown peer id 1

EDIT:
DFS shouldn't cause wifi to be disabled right? It's weird that both interfaces went down.
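To check how close together the interfaces dropped, here is a small Python sketch that pulls the bridge "entered disabled state" events out of dmesg-style lines like the ones above. The regular expression is written against exactly this log format and is an assumption about what other kernels print.

```python
import re

# Matches e.g. "[2069691.801588] br-lan: port 3(wlan0) entered disabled state"
DISABLED_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+(\S+): port (\d+)\((\S+)\) entered disabled state"
)


def disabled_ports(lines):
    """Return [(timestamp_seconds, bridge, port_number, interface), ...]."""
    events = []
    for line in lines:
        m = DISABLED_RE.match(line.strip())
        if m:
            events.append((float(m.group(1)), m.group(2), int(m.group(3)), m.group(4)))
    return events
```

Fed the excerpt above, it reports wlan0 and wlan1 leaving br-lan less than a second apart.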

quarky · March 22, 2022, 1:21pm

If I'm not wrong, it will, if you set the interface to a DFS channel and radar was detected on that channel. Was the channel fixed, or was it set to auto?

shelterx · March 22, 2022, 2:16pm

Fixed... But it's still weird that 2.4GHz went down.
I'll change 5GHz back to non-DFS.

dpvb · March 26, 2022, 2:59am

How do I get 160 MHz working for my 5 GHz wifi? I'm running hnyman's latest stable build.

hnyman · March 26, 2022, 6:30am

For me VHT160 works quite normally: with the FI country code, both channel 36 and channel 100 work. DFS detection takes a minute at startup, but completes nicely.
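The startup delay follows from the channel layout: a 160 MHz block is eight adjacent 20 MHz channels, so even a block starting at channel 36 reaches into the DFS range and has to pass the radar check first. Below is a small Python sketch of that arithmetic. It assumes the commonly used DFS ranges (channels 52-64 and 100-144) and only the two 160 MHz blocks starting at 36 and 100; the exact rules depend on the regulatory domain.

```python
# 5 GHz channels that commonly require DFS (assumption: typical ETSI/FCC ranges)
DFS_CHANNELS = set(range(52, 65, 4)) | set(range(100, 145, 4))


def channels_in_160mhz_block(primary):
    """Return the eight 20 MHz channels of the 160 MHz block containing primary."""
    for start in (36, 100):
        block = list(range(start, start + 32, 4))
        if primary in block:
            return block
    raise ValueError(f"channel {primary} is not in a 160 MHz block known to this sketch")


def block_needs_dfs(primary):
    """True if any 20 MHz channel of the 160 MHz block is a DFS channel."""
    return any(ch in DFS_CHANNELS for ch in channels_in_160mhz_block(primary))
```

Both `block_needs_dfs(36)` and `block_needs_dfs(100)` return True, which matches DFS detection running at startup for either choice.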

hawkeye217 · March 27, 2022, 3:12am

New OpenWrt user here (but an experienced software dev) on an XR500 (same hardware as the R7800), and I had this same __krait_mux_set_sel crash twice in the last 24 hours. I'm running the latest snapshot from 2022-03-24.

When it crashed last night, it happened shortly after a decent amount of load followed by a very low amount of load. So if the CPU core frequency ramped downward, perhaps that's what triggered the crash.

I had not set up any min frequency parameters in the local startup script, but it seems like that might be the best workaround at this point from what others have said?

Let me know if there's any way I can help track down the issue.