Netgear R7800 performance drop in 23.05.0-rc1

I installed 23.05.0-rc1 today on my R7800. Network throughput for ethernet-connected LAN devices dropped substantially.
I reinstalled 22.03.5 and performance went back to normal.
ISP is Verizon Fios (Boston MA area), provisioned at 300Mbps symmetric.

A sample of results from https://www.speakeasy.net/speedtest/ is below. I see similar degradation with the Ookla Speedtest app, and in the bufferbloat test at https://www.waveform.com/tools/bufferbloat.

Switching back and forth between 22.03.5 and 23.05.0-rc1 consistently shows 23.05.0-rc1 with a throughput loss.

How can I help investigate further?

Release       Client timestamp     Download      Upload        Latency   Jitter   Test server
23.05.0-rc1   06/11/23 08:08 AM    125.98 Mbps   133.58 Mbps   57 ms     6 ms     speedtest-server.starry.com.prod.hosts.ooklaserver.net
22.03.5       06/11/23 08:13 AM    292.92 Mbps   272.50 Mbps   7 ms      2 ms     speedtest-server.starry.com.prod.hosts.ooklaserver.net
  • Did you keep settings?
  • Was packet steering enabled?
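
For reference, packet steering is a global option in /etc/config/network, so it can be checked and toggled from the CLI roughly as below (a minimal sketch, assuming current OpenWrt defaults for the uci path):

    # Print 1 if packet steering is enabled (errors out if the option is unset)
    uci get network.globals.packet_steering

    # Disable it for an A/B comparison, then apply the network config
    uci set network.globals.packet_steering='0'
    uci commit network
    /etc/init.d/network restart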

It is likely related to the CPU/L2 cache frequency scaling changes in kernels 5.15 and 6.1.
@ansuel has been trying to fix it.

Yes, the problem is with the cache... we bisected this in an issue.

I should consider adding that patch, or maybe just set the cache to always run at max speed...


So for now, just sit tight while you awesome developers work on the (known) issue?

Thanks.

Eh, I really need to make a decision on which hack is worse...

@hnyman can you help me test a patch, or have you moved to ipq807x? (The patch is simple: it just comments out the L2 cache OPPs to force the cache to run at max frequency all the time.) (I still don't trust L2 scaling at all.)

I am all in favor of running at max frequency.

I also run the performance governor for CPU frequency, and it hardly affects the energy bill or the temperature.

I am away at the moment and for the coming weeks, so I cannot test.
I only brought a Dynalink DL-WRX36 with me.


Sure, I can test it. Just send the patch.

(I have been running your https://github.com/openwrt/openwrt/commit/849ea62e4340f331b95384187f3ea03f55d0559d, but I guess that for testing I need to remove it and have just your "new patch" on top of a normal clean repo.)

How much have you used that commit? Are you still testing it? Any uptime?

Anyway, this is the patch (it replaces an already present patch that just disables the cache OPP).


Ok, I have only used my R7800 sporadically, but no crashes with that patch.

I made a build for master with that.

Did you really mean an otherwise unchanged target/, with just this one 107-xx patch slightly changed?

That would mean that the two devfreq lines also stay "disabled":

-# CONFIG_ARM_IPQ806X_FAB_DEVFREQ is not set
-# CONFIG_ARM_KRAIT_CACHE_DEVFREQ is not set

I am asking because I seem to get somewhat crazy flent results with "your new patch on clean master", and I wonder what the reason is.

With 23.05 and your 849ea62 applied, I get smooth flent results and low latency.
150/130 Mbit SQM, simple fq_codel:

Ondemand cpufreq 23.05:
(note: flent graphs show 1/4 of the speed, as it uses 4 traffic types simultaneously)

Performance cpufreq 23.05:

Using performance provides a clear benefit and much smoother throughput than ondemand at this speed level.
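
(For context, the "150/130 Mbit SQM, simple fq_codel" shaper above corresponds roughly to an sqm-scripts setup like the sketch below; the section name and WAN interface are assumptions, not taken from the actual config.)

    # Rough SQM shaper equivalent (sqm-scripts / luci-app-sqm); the 'wan'
    # section name and 'eth0.2' interface are placeholders.
    uci set sqm.wan=queue
    uci set sqm.wan.enabled='1'
    uci set sqm.wan.interface='eth0.2'
    uci set sqm.wan.download='150000'   # kbit/s
    uci set sqm.wan.upload='130000'     # kbit/s
    uci set sqm.wan.qdisc='fq_codel'
    uci set sqm.wan.script='simple.qos'
    uci commit sqm
    /etc/init.d/sqm restart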

But with "master with this new test patch" the results are screwy and latency is high:
Ondemand:
(flent graph removed)

Performance:
(flent graph removed)

So I wonder whether the reason is the new patch, whether I applied it right (clean repo, just the small change), whether it was just a test fluke, or whether there is something in master that affects the R7800 badly.

I will likely compile a new master version just for verification, possibly also with the old 849ea62 (which seemed to work).

But I wonder if the easiest approach would just be "performance as default", and just forget about scaling. (In any case, scaling handles bursty traffic badly.)

Nice stats anyway. Sorry, the devfreq driver has to be enabled.

Apparently my guess yesterday was wrong...

So, it was not "clean repo + this new patch", but "clean repo + the two devfreq lines from 849ea62 + this patch", right?


OK, enabling those two config lines helped smooth things out.
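
(For anyone following along: "enabling those two config lines" means flipping the two devfreq symbols quoted earlier from "is not set" to enabled in the ipq806x target kernel config. A minimal sketch of one way to do that; the config-5.15 file path, and =y rather than =m, are assumptions:)

    # Enable the fab and Krait L2 cache devfreq drivers in the target kernel config
    CFG=target/linux/ipq806x/config-5.15   # path assumed; adjust to your tree
    sed -i 's/^# CONFIG_ARM_IPQ806X_FAB_DEVFREQ is not set$/CONFIG_ARM_IPQ806X_FAB_DEVFREQ=y/' "$CFG"
    sed -i 's/^# CONFIG_ARM_KRAIT_CACHE_DEVFREQ is not set$/CONFIG_ARM_KRAIT_CACHE_DEVFREQ=y/' "$CFG"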

There is not much difference compared to the 849ea62 build on 23.05.

master with the new patch + 2 config lines:
ondemand:

performance:

Still, the "performance" governor provides much smoother throughput and lower latency.


So far your patch has been stable for me.

I suggest that we

  • enable the cache top speed via your patch, and
  • change the default governor to "performance"
    (either via the cpufreq init script, or
    via the ipq806x kernel config file, in which case the cpufreq init script hack could also be removed, since it only serves the ondemand governor;
    a rough sketch of both routes is below).
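
Sketch of the two routes (a minimal illustration only: the sysfs path is the standard cpufreq interface, the Kconfig symbol is the upstream default-governor option, and where exactly it would land in the ipq806x target config is an assumption):

    # Route 1: select the governor at runtime, e.g. from an init script or rc.local
    for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$gov"
    done

    # Route 2: make "performance" the built-in kernel default instead, via the
    # upstream Kconfig symbol in the target kernel config (exact file assumed):
    #   CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y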

For me, with the R7800, the performance governor increases the idle CPU temp by 2 degrees, from 52°C to 54°C.
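
(For anyone wanting to compare on their own unit, both readings come straight from sysfs; the thermal zone index is an assumption and may differ per board:)

    # SoC temperature in millidegrees Celsius
    cat /sys/class/thermal/thermal_zone0/temp
    # Per-core CPU frequency, to confirm the performance governor pins it at max
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq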


Well, if we intend to use the performance governor, then we can keep the cache OPP levels...

My concern is not the CPU temp but the regulators' temps and overall system stress... but honestly, fk this... I will just set performance by default and be done with it... the system is just not stable... and we have a similar problem with mvebu.


Aside from the hiccups with reverting problematic stuff, I pushed the change and backported it to 23.05.


Performance with a main/master test build, with the new kernel 6.1 and DSA: