Netgear R7800 exploration (IPQ8065, QCA9984)

From our 4.19 experience, there is still a fallback cpuidle state, so I think you can safely test. I would not try to bypass the fallback cpuidle though, if that's your intention. I think something needs to be there to keep the CPU "happy" when it has nothing to do.

Your understanding of how the CPUs for ipq806x should work has increased considerably over the past 2.5 years.

If you have not already done so, try looking back at what we did trying to understand what changed wrt cpuidle between 4.14 and 4.19. In particular, this post and a few below that.

Given your observations above, it sounds like the "past fix" might only be appropriate for 8064 CPUs and not 8065. It seems like something more sophisticated is needed to properly enable cpuidle for both 8064 and 8065 CPUs.

The old days of blogic working on qca8k, wonderful days... why was I late to the party, god damn...

Anyway, to me the whole implementation looks wrong:

  • the boot cpu (cpu0) has a different sequence to enter power collapse
  • the other CPUs have the sequence of the apq8064
  • the l2 sequence is missing, and I think it is required as l2 needs to be set to power retention
  • wfi is set to enter power collapse when entering idle (I assume wrong function?)
  • missing wfi assembly instructions; have to check if they are present in the kernel
  • missing spc assembly instructions

I don't think there is a difference between ipq8064 and ipq8065, just in how the bootloader inits stuff. It could be that ipq8064 inits stuff correctly, preventing a crash, and ipq8065 doesn't. But in any case, in theory spc should never be triggered, as it wasn't enabled in the original code. It could be that it's just not supported. If that's the case, the crashes could really be caused by an incorrect configuration of the idle states... to be more precise, l2 going to sleep instead of into power retention, and this causes freezes / kernel panics.

@nmrh
There is an (undocumented?) different CPU "bring up" mechanism specified in the qcom-ipq8064.dtsi

Not sure if the other one would help. In the absence of documentation, experimentation is likely your only option.

Don't be too quick to discard spc. It might reduce the heat load on the CPUs, making it more likely that users can overclock. If you like conspiracy theories, I remember a case of a computer manufacturer disabling the math co-processor on their CPUs (i.e. all CPUs were the same, they had to spend additional money in manufacturing to disable the co-processor on a fraction of the total produced) just so they could sell the versions with it enabled for a higher margin.

How many users would buy a r7800 if they could safely overclock a r7500v2 to 1.7GHz? How many users will buy the next model if they can safely overclock a r7800 to 1.9GHz? Anyway glad to see you looking at this again and keep an eye on your temperatures when you test with disabled wfi/spc cpuidle.

Much of the clock rate bump between ipq8064 and ipq8065 seems to be down to cooling to begin with, we can see several ipq8064 devices (e.g. ASRock g10) without any heat sinks at all, while ipq8065 needs them (and still runs hot).

Can I add a bug for the R7800 to this thread?
I've posted my final findings here.

The issue was reported on Github

I hope the developers will be able to fix this bug soon.

if you can build your own image... did you test this with the dsa driver?

Yesterday I tried @hnyman's test-DSA build, but I had no laptop with a cable connection to fully test using iperf3. I tested with an Android TV box that has only a 100Mbps NIC, and I think the results were bad. I can build my own image, but I may need some help regarding the DSA part. I don't know if the reason for the low WAN speed is the swconfig or the DSA driver. Maybe more users can test and confirm. It's really simple and would take a few minutes.

Incidentally, there’s also a similar issue with the mt7622 DSA drivers recently that’s causing link speed to throttle down to the lowest client link speed. For my E8450, upload speed is limited to 10Mbps when the TV connected to it is in standby mode, where its LAN port switches to 10Mbps link speed. Disconnecting the TV gives me back full gigabit speed.

I sort of traced it down to the code changes where the switch port auto learning feature was disabled.

Wonder if both issues are similar.

@quarky @Ansuel
Quarky, is this the same issue you talk about?

Yup, this is the issue I was referring to.

Did you happen to try 21.02 builds without NSS acceleration? I remember you encountered performance issues with the 21.02 NSS builds?

I've now replaced my E8450 (planning to 'fix' the mt7530.c driver) with my R7800 running my custom 21.02 builds with NSS acceleration enabled. I do not see any performance hit even with a 10Mbps link connected to the R7800.

I only see WAN performance degradation when one or more devices are connected at 100Mbps and there is LAN traffic between LAN-connected devices at the same time (probably because two tricky conditions need to be present at the same time, many users do not notice this). At least one of the devices involved in the LAN traffic must be connected at 100Mbps. I simulated this with a 1Gbps laptop connected with a 100Mbps cable and with a device that has a 100Mbps NIC.
If the LAN traffic is between two devices and both are at 1Gbps, the WAN performance degradation doesn't occur. I see full WAN performance in this case. My link is 1Gbps/700Mbps.
To reproduce this, I run iperf3 with a predefined throughput of only 60Mbps between a laptop connected at 100Mbps (this is the culprit!) and a PC connected at 1Gbps. The result: the PC can only download/upload from/to WAN at really low speeds, just 15-20 to rarely 100-200 Mbps, with ping above 100ms. This depends on the current LAN traffic. Larger LAN transfers (such as several 4K LAN streams totalling 100-200Mbps) cause an almost complete inability for the devices to receive/send to WAN.
But if several devices are connected to an additional switch (gigabit, as in my case) and a 100Mbps device (even just one among them) is involved in even moderate LAN traffic at the same time, then all devices see a huge WAN performance drop, down to even 5-6Mbps.
If there is a third device (probably a fourth one too) connected at 1Gbps on another R7800 port, it can download/upload from/to WAN at full speed as long as it doesn't currently take part in LAN transfers with a 100Mbps device. This holds even if another 1Gbps/100Mbps pair of devices is affected by the bug at the same time.

I have to stress that two conditions must be met at the same time in order to reproduce the WAN performance drop: a client connected at 100Mbps (let's call it the "Problem Client"), and LAN traffic between the "Problem Client" and any other device connected to the LAN by cable.
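
For anyone who wants to try the reproduction described above, here is a minimal sketch that just builds the iperf3 client command with a fixed 60Mbps bitrate. The function name, the example IP address and the 60-second duration are my own placeholders, not something from the original report; it assumes an iperf3 server (`iperf3 -s`) is already running on the gigabit PC and that the command is run on the client linked at 100Mbps.

```python
import shlex


def iperf3_repro_command(server_ip, mbps=60, seconds=60, reverse=False):
    """Build an iperf3 client command capped at a fixed bitrate.

    server_ip, mbps and seconds are illustrative parameters (assumptions);
    only the 60 Mbps figure comes from the report above.
    """
    if mbps <= 0 or seconds <= 0:
        raise ValueError("mbps and seconds must be positive")
    argv = [
        "iperf3",
        "-c", server_ip,      # client mode, connect to the iperf3 server
        "-b", f"{mbps}M",     # target bitrate in Mbit/s
        "-t", str(seconds),   # test duration in seconds
    ]
    if reverse:
        argv.append("-R")     # reverse mode: server sends, client receives
    return argv


if __name__ == "__main__":
    # Hypothetical address of the gigabit PC running `iperf3 -s`.
    print(shlex.join(iperf3_repro_command("192.168.1.20")))
```

While this LAN transfer is running, a WAN speed test from the gigabit PC should show whether the drop occurs.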

Weird bugs that surfaced recently...
I'm certain now that the CPU frequency scaling bug can be worked around by setting a higher min clock. I have 23 days of uptime with a min clock of 800MHz.

EDIT: Now uptime is 34d 16h 37m 0s
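
A rough sketch of the min clock workaround, in case it helps someone: it writes the minimum frequency to `scaling_min_freq` for every cpufreq policy found under the standard Linux sysfs location. The function name and the `cpufreq_root` parameter are my own (the parameter is only there so it can be pointed at a test directory), and it assumes Python is available on the router and that it runs as root. Whether the policies appear as `policy0`, `policy1` on a given build is also an assumption worth checking.

```python
from pathlib import Path


def set_min_freq(khz, cpufreq_root="/sys/devices/system/cpu/cpufreq"):
    """Write khz to scaling_min_freq of every cpufreq policy.

    sysfs cpufreq values are in kHz, so 800 MHz is 800000.
    Returns the list of files that were written.
    """
    written = []
    for policy in sorted(Path(cpufreq_root).glob("policy*")):
        target = policy / "scaling_min_freq"
        if target.exists():
            target.write_text(f"{khz}\n")
            written.append(str(target))
    return written


# Example (needs root on the router): set_min_freq(800_000)
```

The setting does not survive a reboot, so it would have to be applied again at startup.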

Well, WiFi interfaces just went into disabled state now, I had to restart them, the router didn't reboot tho'.

[2069691.801451] device wlan0 left promiscuous mode
[2069691.801588] br-lan: port 3(wlan0) entered disabled state
[2069691.890988] ath10k_pci 0000:01:00.0: could not get mac80211 beacon
[2069692.095787] ath10k_pci 0000:01:00.0: could not get mac80211 beacon
[2069692.300577] ath10k_pci 0000:01:00.0: could not get mac80211 beacon
[2069692.505379] ath10k_pci 0000:01:00.0: could not get mac80211 beacon
[2069692.519704] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
[2069692.519778] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
[2069692.525600] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
[2069692.616753] device wlan1 left promiscuous mode
[2069692.616916] br-lan: port 2(wlan1) entered disabled state
[2069692.770377] ath10k_pci 0001:01:00.0: could not get mac80211 beacon
[2069692.808612] ath10k_pci 0001:01:00.0: peer-unmap-event: unknown peer id 1
[2069692.808724] ath10k_pci 0001:01:00.0: peer-unmap-event: unknown peer id 1
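
To compare logs like the one above between incidents, a small sketch that counts ath10k_pci messages per PCI device might be handy. The regular expression only assumes the `[timestamp] ath10k_pci <pci address>: <message>` shape visible in the lines above; the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# [2069691.890988] ath10k_pci 0000:01:00.0: could not get mac80211 beacon
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+ath10k_pci\s+(?P<dev>[0-9a-f:.]+):\s+(?P<msg>.*)$"
)


def summarize_ath10k(log_text):
    """Count ath10k_pci messages, keyed by (pci_device, message)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("dev"), match.group("msg"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage sketch: dmesg | python3 this_script.py
    for (dev, msg), n in sorted(summarize_ath10k(sys.stdin.read()).items()):
        print(f"{dev}  {n:3d}x  {msg}")
```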

EDIT:
DFS shouldn't cause WiFi to be disabled, right? It's weird that both interfaces went down.

It will, if I'm not wrong, if you set the interface to a DFS channel and radar was detected for that channel. Was the channel set to a fixed channel or was it auto?

Fixed... But still weird that 2.4GHz went down.
I'll change 5GHz back to non-DFS.

How do I get 160 MHz working for my 5 GHz WiFi? I'm running hnyman's latest stable build.

For me VHT160 works quite normally, with FI country code, both channel 36 and 100 work for me. DFS detection takes a minute at startup, but completes nicely.

New OpenWrt user here (but experienced software dev) on an XR500 (same hardware as the R7800), and I had this same __krait_mux_set_sel crash twice in the last 24 hours. I'm running the latest snapshot from 2022-03-24.

When it crashed last night, it happened shortly after a decent amount of load followed by a very low amount of load. So if the CPU core frequency ramped downward, perhaps that's what triggered the crash.

I had not set up any min frequency parameters in the local startup script, but it seems like that might be the best workaround at this point from what others have said?

Let me know if there's any way I can help track down the issue.
