Netgear R7800 exploration (IPQ8065, QCA9984)

Ansuel · February 3, 2022, 8:12pm

thx for the report... I should have fixed it.

facboy · February 3, 2022, 8:53pm

384mhz. i'll try it for a bit, though my r7800 sees reduced duty now as it is mainly used as a WAP - i'm trying out a nanopi r4 as the main router.

quarky · February 3, 2022, 10:00pm

qca-rfs is not compatible with ipq806x. It requires the ess switch which ipq806x is not using, IIRC.

anon98444528 · February 3, 2022, 11:29pm

I've already pulled this build off and gone back to a k510+dsa (pr4036 only).

I did a netperf from a 5 GHz wifi client that is capable of doing 450+ mbps to a linux box attached via wire to the r7500v2 (wan port, using VLANs) and, on the 5.15 pr 4748 build, I could only do 200 mbps at best. One r7500v2 cpu is maxed out and the other is at 50+%. Going back to a k510+dsa build, I regain the netperf rates for this client with much, much less cpu usage.

I seem to recall that vlans might be an issue - so perhaps nothing new here.

Unfortunately, I confounded these results by also changing from wpad (full) to wpad-wolfssl (full, pr4748 build), iw to iw-full, ath10k-ct htt-full to ath10k-htt. I don't think these changes would explain the excessive cpu usage with pr 4748.

EDIT: Not to mention that my 5.10 builds are from aug-nov time frame last year - I'll update a 5.10 build with pr 4036 to eliminate any such differences.

Ansuel · February 4, 2022, 12:10am

@anon98444528 can you test this https://github.com/Ansuel/openwrt/tree/5.15-improve (there is an extra commit)

anon98444528 · February 4, 2022, 12:22am

i will, but i'm going to do an updated 5.10+dsa build first as a baseline otherwise I'll always be second guessing the results.

May take a few days, as the r7500v2 will be in use for a bit.

Ansuel · February 4, 2022, 12:23am

If you want to have some fun in both tests, comment out ds->pcs_poll = true; in qca8k.c

Anyway from what I observed the slowdown seems to be with the internal phy part.
The mdio patch was introduced to enable the assisted learning for multicpu, which requires multiple writes/reads to access the fdb... with this we now do it in one go directly. Probably the mdio is faster for the phy part... We can totally do it both ways

I didn't trust the phy part and it would make sense if it does cause some slowdown (not 200mbps but should be fixed by the extra patch i posted)
In theory disabling the pcs_poll should give our perf back... at worst we can just disable the eth mgmt for the phy mdio part... the mdio part is now optimized and uses one write/read instead of 3 so should be fine...

In fact... @nmrh thx for the bench that's exactly what i needed... As it looks stable now...

tell me if you need some instructions on how to disable eth mgmt for the phy

anon98444528 · February 4, 2022, 12:32am

best tell me what you want. But sleep first, it's got to be late for you and I won't get to this quickly.

Ansuel · February 4, 2022, 12:39am

Anyway, to disable the eth mgmt for the phy mdio part:

in qca8k_internal_mdio_write and qca8k_internal_mdio_read just comment out the qca8k_phy_eth_command part and the if.

This way phy will use mdio instead of eth mgmt

quarky · February 4, 2022, 12:50am

Looks like the CPU core undergoing a frequency change should not be context switching then?

Without a definite way to test this 'bug', it'll be difficult to fix, as the issue doesn't manifest itself often. My R7800 can sometimes go weeks before rebooting by itself. I always thought it was caused by the NSS cores, but now it looks like it's due to the change in the Krait core's frequency? With the NSS cores, the Krait CPU will likely be mostly idle (for LAN based traffic anyway) and it would likely be more prone to wild swings of the CPU core frequency. Maybe the frequency change should be done in discrete steps (e.g. 600MHz -> 800MHz -> 1.4GHz -> 1.7GHz), instead of one big jump, like from 600MHz to 1.7GHz?

You think it's feasible to write a test kernel module to change the CPU core frequency to try simulating this bug?

Ansuel · February 4, 2022, 1:04am

I think a good way to test this would be to hammer the mux change freq logic. Or run 2 threads and make the cpu freq change to random values AT THE SAME TIME and stress this very hard. Nothing incremental...
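A rough userspace take on the "2 threads, random values, same time" idea, written here as a Python sketch rather than a kernel module. It assumes the standard cpufreq sysfs files (scaling_governor, scaling_available_frequencies, scaling_setspeed) are exposed for both cores and that the userspace governor is built in; the function names and the base parameter are made up for illustration.

```python
import os
import random
import threading


def read_available(cpu_dir):
    """Return the frequencies (kHz) listed in one CPU's cpufreq directory."""
    with open(os.path.join(cpu_dir, "scaling_available_frequencies")) as f:
        return [int(x) for x in f.read().split()]


def hammer(cpu_dir, freqs, iterations, barrier, rng):
    """Write random target frequencies, in lockstep with the other thread."""
    setspeed = os.path.join(cpu_dir, "scaling_setspeed")
    for _ in range(iterations):
        target = rng.choice(freqs)
        barrier.wait()  # release all threads at roughly the same moment
        with open(setspeed, "w") as f:
            f.write(str(target))


def stress(base="/sys/devices/system/cpu", cpus=(0, 1), iterations=100000, seed=None):
    """Select the userspace governor on each core, then hammer them together."""
    barrier = threading.Barrier(len(cpus))
    threads = []
    for n, cpu in enumerate(cpus):
        cpu_dir = os.path.join(base, f"cpu{cpu}", "cpufreq")
        with open(os.path.join(cpu_dir, "scaling_governor"), "w") as f:
            f.write("userspace")
        freqs = read_available(cpu_dir)
        # One RNG per thread so both cores pick independent random targets.
        rng = random.Random(None if seed is None else seed + n)
        threads.append(threading.Thread(
            target=hammer, args=(cpu_dir, freqs, iterations, barrier, rng)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling stress() as root on the router would run it against the real sysfs tree. The barrier only lines the two writes up approximately, so this is a stress aid and not a precise reproducer.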

Also in that scheme i posted, the l2 cache is absent, and some comments from qsdk said that the l2 can't run at idle freq when the cpu core is NOT at idle, which makes me suspect that l2 freq is connected to one of the 2 core muxes... Considering that most of the time the glitches are about memory... wonder if it's the cache that is scaled to an abnormal clk and produces all sorts of random operations/data...

Anyway i will check now if that "coordinated clk" series applies so we can check if the clk logic is just wrong and has some flaw.

about the kernel module... in theory it should be easy to write one, with an extra point to collect tons of data about the muxes and clk states... And check if they are correct. Also adding some debug values to the krait notifier wouldn't hurt.

quarky · February 4, 2022, 1:29am

The L2 cache is shared between the two Krait cores right? This would be tricky since both cores can operate at different frequencies.

Ansuel · February 4, 2022, 1:32am

yes... the current cpufreq driver addresses this and scales the l2 based on the max freq across the 2 cores...
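To make the "max freq across the 2 cores" rule concrete, here is a small Python sketch. The threshold table is hypothetical (placeholder numbers, not taken from the driver or the device tree); only the idea of picking the L2 rate from the faster core comes from the post.

```python
import bisect

# HYPOTHETICAL table of (minimum CPU kHz, L2 kHz); the real thresholds and
# rates live in the driver/device tree and may differ.
L2_STEPS = [
    (0, 384_000),
    (600_000, 1_000_000),
    (1_000_000, 1_200_000),
]


def l2_target_khz(cpu_freqs_khz, steps=L2_STEPS):
    """Pick the L2 rate from the fastest core, ignoring the slower one."""
    fastest = max(cpu_freqs_khz)
    thresholds = [threshold for threshold, _ in steps]
    # Last step whose threshold is <= the fastest core frequency.
    idx = bisect.bisect_right(thresholds, fastest) - 1
    return steps[idx][1]
```

With this placeholder table, one core at 384000 kHz and the other at 1725000 kHz would put the L2 at the top step, because only the faster core matters.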

Anyway I managed to rebase the coordinated clks series, now i just need to understand how to change the krait clk to use it...
the amount of changes done to rockchip looks scary tho...
https://patchwork.kernel.org/project/linux-clk/patch/20190305044936.22267-7-dbasehore@chromium.org/
anyway from

[    0.736660] L2 @ QSB rate. Forcing new rate.
[    0.737277] L2 @ 384000 KHz
[    0.737741] CPU0 @ 800000 KHz
[    0.737767] CPU1 @ QSB rate. Forcing new rate.
[    0.738445] CPU1 @ 384000 KHz

looks like l2 is connected to the cpu0 mux as qsb is set only one time for l2 and cpu0

I mean another way to test this would be to just remove all the safe_sel logic and check the crash log produced...
If there are more crashes, then we add some debug logging to the clk so we can check what happens.

the krait-cc driver has lots of info in the comments... the code is a mess btw... very ancient qcom stuff self reviewed/accepted... this thing would be NACKed directly today...

quarky · February 4, 2022, 2:01am

Could we be encountering the edge case where the CPU core clock gets scaled to, say, 1.7GHz while the L2 cache clock stays at a lower clock frequency, and this corrupts the data that the 1.7GHz core is fetching from the L2 cache? Would the solution be as simple as getting the L2 cache clock to go up first before setting the CPU core clock?
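The ordering being asked about can be sketched like this in Python. It is a single-core simplification of the idea (raise the L2 before the core speeds up, lower it only after the core has slowed down) and is not a description of what the krait driver actually does; plan_transition is a made-up name.

```python
def plan_transition(cpu_now, cpu_next, l2_now, l2_next):
    """Return ordered (clock, rate_khz) steps so the L2 never lags a faster core."""
    steps = []
    if l2_next > l2_now:
        steps.append(("l2", l2_next))   # raise the cache before the core speeds up
    if cpu_next != cpu_now:
        steps.append(("cpu", cpu_next))
    if l2_next < l2_now:
        steps.append(("l2", l2_next))   # lower the cache only after the core slowed down
    return steps
```

Scaling up therefore yields the L2 step first, scaling down yields the CPU step first, and a request that changes nothing yields an empty plan.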

There were reports that 800MHz min CPU clock is somehow stable. I suppose the L2 cache clock at 800MHz CPU clock is compatible with 1.7GHz CPU clock?

Peacefuleight · February 4, 2022, 2:06am

On my side with the archer c2600, it runs, but every time I reboot the router, the settings are reset. Even after I restore my settings from a backup, the router is reset again after that too.

anon98444528 · February 4, 2022, 2:15am

are you upgrading from a non-dsa build? If so, you may have to update your configs in /etc/config for the change. Not sure exactly what tho as i've been using dsa for a while now.

EDIT: From a community build for the r7800, i used this to get started with what needed to be changed for DSA config migration. See also the "mini-tutorial for dsa network config." This is from some time ago and likely not entirely relevant for your device. Luci seems to do a good job so perhaps just start with a totally new config - just do the basics first and work your way into it. I find dsa easier than swconfig but i might be... different.

Ansuel · February 4, 2022, 2:16am

If i'm not wrong we scale l2 first and then core. Anyway, a fun free time project: understanding all the mess that is qsb, pll and hfpll... first thing i'm noticing is that we never use pll and only source freq from qsb and hfpll.

I need to find the series that introduced support for all of this to understand the logic of this driver...

Ok first piece of the puzzle...

Due to the design, QSB and the top PLL are always a fixed rate and thus
only support one frequency each.

And from there...

When switching rates we can't leave the CPU clocked by the HFPLL because
we need to turn off the output of the PLL when changing its frequency.
This means we have to switch over to the secondary mux and use one of the
fixed sources. This is why we need something like the safe parent patch.
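The sequence in that quote can be modelled with a toy Python class. This is only a simulation of the described steps (park the CPU on a fixed source, turn the HFPLL off, reprogram it, turn it on, switch back); the class name, the step labels and the safe_khz default are hypothetical placeholders and nothing here touches real hardware.

```python
class KraitClockModel:
    """Toy model of one CPU clock: a mux fed by an HFPLL or a fixed safe source."""

    def __init__(self, hfpll_khz, safe_khz=384_000):  # safe_khz is a placeholder
        self.hfpll_khz = hfpll_khz
        self.safe_khz = safe_khz
        self.hfpll_on = True
        self.parent = "hfpll"
        self.log = []

    def cpu_khz(self):
        """Rate the CPU currently sees; fails if it is fed by a disabled PLL."""
        if self.parent == "hfpll":
            if not self.hfpll_on:
                raise RuntimeError("CPU clocked from a disabled HFPLL")
            return self.hfpll_khz
        return self.safe_khz

    def _step(self, name):
        # Record each step together with the rate the CPU runs at afterwards.
        self.log.append((name, self.cpu_khz()))

    def set_rate(self, new_khz):
        """Change the HFPLL rate while the CPU is parked on the safe parent."""
        self.parent = "safe"
        self._step("mux -> safe parent")
        self.hfpll_on = False
        self._step("hfpll off")
        self.hfpll_khz = new_khz
        self._step("hfpll reprogrammed")
        self.hfpll_on = True
        self._step("hfpll on")
        self.parent = "hfpll"
        self._step("mux -> hfpll")
```

Skipping the first step in this model makes cpu_khz() raise as soon as the HFPLL is turned off, which is the situation the safe parent patch is meant to avoid.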

Peacefuleight · February 4, 2022, 2:51am

I have been using dsa builds (5.10 builds) for a while.
And I sysupgraded but did not keep the settings.

anon98444528 · February 4, 2022, 2:54am

This is different than my experience with the r7500v2 (very similar device to the r7800). Not sure what is happening for you - my best suggestions are to try a forum search related to configurations not being retained for your device, or to just start a new config and see if very simple changes (like just the host name) persist between reboots.

anon98444528 · February 4, 2022, 6:27pm

k510+pr4036 (pcs_poll commented out), 450+mbps netperf, cpu0 80%, cpu1 50%
k515+pr4036+pr4748+cherry pick e972109 5.15-improve (pcs_poll commented out) 200 mbps netperf max, cpu0 100%, cpu1 50+%

not sure i got the workflow right for the k515 build to pick up the patch you indicate - you tell me.

I'll try disabling the "mdio for phy" as you suggest above on the k515 build as described above and test that.

EDIT: k510+pr4036 (pcs_poll not commented out) 450+mbps netperf, cpu0 80%, cpu1 50% i.e. no detectable change with or without pcs_poll.
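For anyone wanting per-core numbers like the cpu0/cpu1 figures above without eyeballing top, a Python sketch that compares two /proc/stat snapshots follows. It assumes the usual field order (user nice system idle iowait irq softirq steal ...) and counts idle plus iowait as not busy; the function names are made up for illustration.

```python
def parse_proc_stat(text):
    """Map 'cpuN' -> (busy_jiffies, total_jiffies) from /proc/stat content."""
    out = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-core lines only; skip the aggregate "cpu" line and the rest.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values[:8])  # user..steal; guest time is already part of user
        out[fields[0]] = (total - idle, total)
    return out


def utilization(before, after):
    """Return {'cpuN': percent busy} between two /proc/stat snapshots."""
    a = parse_proc_stat(before)
    b = parse_proc_stat(after)
    result = {}
    for cpu, (busy_a, total_a) in a.items():
        if cpu not in b:
            continue
        busy = b[cpu][0] - busy_a
        total = b[cpu][1] - total_a
        result[cpu] = 100.0 * busy / total if total else 0.0
    return result
```

Reading /proc/stat once right before the netperf run and once right after, then passing both strings to utilization(), gives the average load per core over the run.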