Netgear R7800 exploration (IPQ8065, QCA9984)

Feel free to switch to a stable firmware whenever you want. Don't feel forced. Anyway, thanks for the logs, they are really useful. They confirm that there is a problem with mux switching.
Now I need to find a way to reproduce it by hammering the code.

up 2+ days with the updated pr10703. No issues.

I see a couple of similar reports of the crash @D43m0n reported with k515+pr10703, from around the February time frame, in this thread. These are for the r7800 (not sure which of them might be using NSS).

Have there been any reports on an ipq8064 based system?

@anon98444528 you are on r7500, right? So ipq8064.

r7500v2, ipq8064. I don't recall any ipq8064 user complaining about this crash and my stability has been phenomenal. Maybe I've just had good luck and others haven't reported or been able to capture a crash log.

Otherwise, it makes me wonder if there is something specific to ipq8065 that causes this.

@anon98444528 imagine if the problem is only related to ipq8065...

Anyway, I noticed there is a driver for L1/L2 cache stats... I will try to port it to the new kernel so we can check if we trigger some error...

After 3 days up, I started to "play" with wifi (multiple-client netperf, both to the AP and to the router through a wire and VLAN). The AP remained stable aside from losing the 5G connection on one device (which I was able to get back).

It is a challenge to get my device's CPUs up to 1.4 GHz by stressing wifi alone. Even with netserver running on the device and two devices using it (one client can do 580 Mbps in the client-to-device direction), I only see one CPU infrequently hitting 1.4 GHz. The other just stays at 800 MHz (the minimum I allow).
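
For anyone who wants to watch the per-CPU clocks during a test like this, here is a minimal sketch in Python. It assumes the standard Linux cpufreq sysfs layout (`scaling_cur_freq`, reported in kHz) and a Python interpreter on the box or on a host reading the values over ssh; the function name is just illustrative.

```python
from pathlib import Path


def read_cpu_freqs_mhz(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: current frequency in MHz} for CPUs that expose cpufreq.

    sysfs_root is a parameter only so the function can be pointed at a
    copy of the tree; on a live system the default should be what you want.
    """
    freqs = {}
    for cpu_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*")):
        freq_file = cpu_dir / "cpufreq" / "scaling_cur_freq"
        if not freq_file.is_file():
            continue  # CPU offline or no cpufreq driver bound
        khz = int(freq_file.read_text().strip())  # sysfs reports kHz
        freqs[cpu_dir.name] = khz / 1000.0
    return freqs
```

Calling `read_cpu_freqs_mhz()` in a loop with a one second sleep while netperf runs should show whether both cores actually leave the minimum frequency.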

I did reboot the AP just now so I can play with wifi interrupts. Once you have a patch for cache stats, I'll use cpu "hogs" and/or the mbw tool to stress it a bit.

I have been encouraging folk to try running the flent rtt_fair test to multiple devices over wifi to verify that the fq-codel code is scaling correctly to multiple stations, and trying to improve that behavior on multiple threads here...

I know. There may still be something funky with wifi/networking on kernel 5.15. I can't put my finger on it just yet, plus I'm playing with moving ath10k_pci from PCI-MSI controlled interrupts back to GIC and using irqbalance. All this tweaking might confound results.

When I go back to 5.10, I'll run an rtt_fair from a wire client out to several wifi clients and post the result in an AQL related thread. I've run this before so I'll have something to compare with in the event recent changes make a difference.

@ansuel, I'm getting funky results when "playing" around.

In earlier 5.15 tests, whichever way I tried a "high speed" wifi test, I was rate limited to about half the rate I could get on 5.10 and one CPU was pegged. That seems "fixed," except I've noticed that sometimes (not always), when doing a similar high speed test from a wifi client through the WAN interface to my router box (2 VLANs), I see the same symptom (rate limited and one CPU pegged). I need to dig more into this and see if I can find a way to consistently reproduce it. A wired LAN client does 950+ Mbps to the router through the same VLANs.

The other funky thing is what happens when I switch my wifi driver interrupts over from PCI-MSI to GIC and use irqbalance. A "high speed" wifi netperf run will start out on cpu0: both CPUs go from 0.8 GHz to 1.4 GHz, throughput is ~540 Mbps, and cpu0 is 95-100% utilized.

After a few seconds, it looks like irqbalance switches the process to cpu1: both CPU frequencies drop from 1.4 GHz to 0.8 GHz, wifi speed drops to, say, 350 Mbps, and cpu1 is pegged at 100%. After a few seconds on cpu1, the process is moved back to cpu0, frequencies go up to 1.4 GHz, netperf speed returns to 540 Mbps, and the cycle repeats.
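
To check whether the interrupts really bounce between cores like this, one option is to diff two snapshots of `/proc/interrupts`. The Python sketch below is only an illustration: the sample text is made up (not taken from any real router), and the function names are my own.

```python
# Made-up sample in the usual /proc/interrupts layout, for illustration only.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1
 67:       1200        300     GIC-0 100 Level     ath10k_pci
 68:         10      45000     GIC-0 101 Level     eth0
"""


def irq_counts_by_cpu(interrupts_text, name_substring):
    """Return {irq: {cpu: count}} for lines containing name_substring."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        if name_substring not in line:
            continue
        fields = line.split()
        counts = fields[1:1 + len(cpus)]
        # Skip summary rows that do not carry one counter per CPU.
        if len(counts) != len(cpus) or not all(c.isdigit() for c in counts):
            continue
        result[fields[0].rstrip(":")] = dict(zip(cpus, map(int, counts)))
    return result


def count_delta(before, after):
    """Per-CPU interrupt growth between two snapshots of the same IRQs."""
    return {
        irq: {cpu: after[irq][cpu] - before[irq].get(cpu, 0) for cpu in after[irq]}
        for irq in after
        if irq in before
    }
```

Taking a snapshot every second with `irq_counts_by_cpu(open("/proc/interrupts").read(), "ath10k")` and passing consecutive pairs to `count_delta` should show which CPU is servicing the wifi interrupts at each point in the cycle.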

Anyway, I know I'm playing around with non-standard configurations, so maybe it's nothing to worry about. Just thought you might like to know.

@anon98444528 I wonder if you can do me a favor... I found a strange bit set for the mux related to the L2 cache... It is not documented, and it is either set by the bootloader or already set from the start... I wonder if you can test a firmware and give me the output of those values :smiley:

Of course. What do I need to do?

You should build a custom image with the code I provide and give me the syslog. It will be very chatty, so be aware that the firmware will be slow: get the data and then revert to an old build. Can I pass you the files? Ideally the quickest way is to replace the files in the kernel build_dir... (I'm too lazy to provide a proper patch)

pass the file(s) and some basic config instructions if needed (or point me to a gist/git repo to fetch it)

EDIT: is it time to start an ipq806x kernel 5.15 testing forum thread?

these are the files

just replace them in drivers/clk/qcom/ and recompile

Ideally you should capture the output from serial. It's OK if you are not able to get the full syslog from the start. Thanks a lot for the help!

This is to understand whether ipq8065 does something strange and ipq8064 doesn't have that option set.

A child is gaming for the next 45 min to an hour. May not be able to report back for about 12-14 hours.

np, take your time... In the meantime I think I will push an experimental commit with a cleaned-up patch with no debug logging.

Anyway, we need to find other users like @D43m0n who can run a firmware that crashes every 3-4 days and can help test each change I make. The idea now is to set everything up as close to stock as possible...


For example, we have the lpl reg set... that wasn't set on the original firmware. Also the div2 regs are broken and look to reset the mux configuration... that can be problematic... (and is another bug to fix)

@hnyman I pushed a new commit with some patches. Also, do you know if there were any users on your custom firmware topic who suffered from stability problems?

Built and booted fine. Output from the serial console is here.

I'm back on a 5.10 build, but I can test further if needed (and you're patient).

HTH

Thanks a lot... nope, same thing! Also, your router for some reason has bit 5 set for the L2 mux... We know what bit 4 is, but have no idea what bit 5 is... You don't suffer from the crash, so it's not the culprit.
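
For anyone comparing their own dumps, a tiny Python sketch for picking out which bits are set in a register value may help. The value `0x30` below is purely hypothetical (chosen only because it has bits 4 and 5 set); it is not the actual register content from the log.

```python
def is_bit_set(value, bit):
    """True if the given bit position (0 = least significant) is set."""
    return (value >> bit) & 1 == 1


def set_bit_positions(value):
    """List every bit position that is set in a non-negative integer."""
    return [bit for bit in range(value.bit_length()) if is_bit_set(value, bit)]


# Hypothetical example value with bits 4 and 5 set, not a real register dump.
EXAMPLE_VALUE = 0x30
EXAMPLE_BITS = set_bit_positions(EXAMPLE_VALUE)  # [4, 5]
```

Running `set_bit_positions` on the value printed by the debug build on an ipq8064 and on an ipq8065 board makes it easy to see at a glance whether bit 5 differs between them.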

Anyway, another big thing we don't have is a delay in the regulator when setting the voltage... In the original firmware there is a 10us delay to let the voltage stabilize... We don't have anything like that... I was prepared to write a generic implementation for it, but guess what? It's already there, so it's really just a simple op that I need to implement... Another free fix that should improve things...

Again, all of these things may only be relevant to ipq8065, as it uses a different regulator/CPU and so can be more sensitive to this kind of difference...
