R7800 cache scaling issue

facboy · September 7, 2019, 10:56am

I've been playing around with testing the frequency scaling on my R7800. Seeing as we seem to think there's an issue with the L2 cache scaling, I used a simple memory bandwidth benchmark (mbw) to try out various transitions through the speed bins: https://github.com/facboy/openwrt-r7800-freq-test

I'd be interested in other people's results. This is from my build, which is largely hnyman's build with a patched version of https://github.com/openwrt/openwrt/pull/2280 applied; I haven't tried with e.g. a vanilla hnyman build yet.

root@worstwrt:~/bin/freq-test# ./test_mbw.sh

*** My defaults (ondemand)

AVG     Method: MEMCPY  Elapsed: 0.00154        MiB: 1.00000    Copy: 648.803 MiB/s
AVG     Method: DUMB    Elapsed: 0.00576        MiB: 1.00000    Copy: 173.656 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00153        MiB: 1.00000    Copy: 652.443 MiB/s

*** Setting performance governor

Setting scaling_max_freq to 384000
AVG     Method: MEMCPY  Elapsed: 0.00290        MiB: 1.00000    Copy: 345.304 MiB/s
AVG     Method: DUMB    Elapsed: 0.02870        MiB: 1.00000    Copy: 34.843 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00282        MiB: 1.00000    Copy: 355.051 MiB/s

Setting scaling_max_freq to 600000
AVG     Method: MEMCPY  Elapsed: 0.00254        MiB: 1.00000    Copy: 393.391 MiB/s
AVG     Method: DUMB    Elapsed: 0.01808        MiB: 1.00000    Copy: 55.304 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00255        MiB: 1.00000    Copy: 391.803 MiB/s

Setting scaling_max_freq to 800000
AVG     Method: MEMCPY  Elapsed: 0.00185        MiB: 1.00000    Copy: 540.044 MiB/s
AVG     Method: DUMB    Elapsed: 0.01165        MiB: 1.00000    Copy: 85.823 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00185        MiB: 1.00000    Copy: 540.511 MiB/s

Setting scaling_max_freq to 1000000
AVG     Method: MEMCPY  Elapsed: 0.00177        MiB: 1.00000    Copy: 564.557 MiB/s
AVG     Method: DUMB    Elapsed: 0.00947        MiB: 1.00000    Copy: 105.619 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00182        MiB: 1.00000    Copy: 548.908 MiB/s

Setting scaling_max_freq to 1400000
AVG     Method: MEMCPY  Elapsed: 0.00163        MiB: 1.00000    Copy: 615.006 MiB/s
AVG     Method: DUMB    Elapsed: 0.00677        MiB: 1.00000    Copy: 147.721 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00161        MiB: 1.00000    Copy: 619.387 MiB/s

Setting scaling_max_freq to 1725000
AVG     Method: MEMCPY  Elapsed: 0.00251        MiB: 1.00000    Copy: 398.279 MiB/s
AVG     Method: DUMB    Elapsed: 0.00587        MiB: 1.00000    Copy: 170.216 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00229        MiB: 1.00000    Copy: 436.758 MiB/s

*** Now seems to be busted

Setting scaling_max_freq to 800000
AVG     Method: MEMCPY  Elapsed: 0.00241        MiB: 1.00000    Copy: 415.628 MiB/s
AVG     Method: DUMB    Elapsed: 0.01181        MiB: 1.00000    Copy: 84.680 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00241        MiB: 1.00000    Copy: 414.542 MiB/s

Setting scaling_max_freq to 1000000
AVG     Method: MEMCPY  Elapsed: 0.00237        MiB: 1.00000    Copy: 421.745 MiB/s
AVG     Method: DUMB    Elapsed: 0.00955        MiB: 1.00000    Copy: 104.723 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00241        MiB: 1.00000    Copy: 415.800 MiB/s

Setting scaling_max_freq to 1400000
AVG     Method: MEMCPY  Elapsed: 0.00229        MiB: 1.00000    Copy: 435.787 MiB/s
AVG     Method: DUMB    Elapsed: 0.00705        MiB: 1.00000    Copy: 141.832 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00229        MiB: 1.00000    Copy: 436.186 MiB/s

Setting scaling_max_freq to 1725000
AVG     Method: MEMCPY  Elapsed: 0.00227        MiB: 1.00000    Copy: 440.199 MiB/s
AVG     Method: DUMB    Elapsed: 0.00596        MiB: 1.00000    Copy: 167.706 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00229        MiB: 1.00000    Copy: 436.987 MiB/s

*** Restored by setting to 600 first

Setting initial scaling_max_freq to 600000
Setting scaling_max_freq to 1000000
AVG     Method: MEMCPY  Elapsed: 0.00183        MiB: 1.00000    Copy: 546.001 MiB/s
AVG     Method: DUMB    Elapsed: 0.00949        MiB: 1.00000    Copy: 105.374 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00186        MiB: 1.00000    Copy: 537.057 MiB/s

Setting initial scaling_max_freq to 600000
Setting scaling_max_freq to 1400000
AVG     Method: MEMCPY  Elapsed: 0.00158        MiB: 1.00000    Copy: 634.357 MiB/s
AVG     Method: DUMB    Elapsed: 0.00688        MiB: 1.00000    Copy: 145.319 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00161        MiB: 1.00000    Copy: 619.655 MiB/s

Setting initial scaling_max_freq to 600000
Setting scaling_max_freq to 1725000
AVG     Method: MEMCPY  Elapsed: 0.00156        MiB: 1.00000    Copy: 641.519 MiB/s
AVG     Method: DUMB    Elapsed: 0.00572        MiB: 1.00000    Copy: 174.706 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00156        MiB: 1.00000    Copy: 639.550 MiB/s

Setting initial scaling_max_freq to 600000
Setting scaling_max_freq to 1400000
AVG     Method: MEMCPY  Elapsed: 0.00185        MiB: 1.00000    Copy: 541.389 MiB/s
AVG     Method: DUMB    Elapsed: 0.00689        MiB: 1.00000    Copy: 145.140 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00162        MiB: 1.00000    Copy: 617.627 MiB/s

*** Breaks when jumping from 1000

Setting initial scaling_max_freq to 1000000
Setting scaling_max_freq to 1000000
AVG     Method: MEMCPY  Elapsed: 0.00178        MiB: 1.00000    Copy: 560.852 MiB/s
AVG     Method: DUMB    Elapsed: 0.00957        MiB: 1.00000    Copy: 104.471 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00181        MiB: 1.00000    Copy: 551.359 MiB/s

Setting initial scaling_max_freq to 1000000
Setting scaling_max_freq to 1400000
AVG     Method: MEMCPY  Elapsed: 0.00160        MiB: 1.00000    Copy: 626.881 MiB/s
AVG     Method: DUMB    Elapsed: 0.00682        MiB: 1.00000    Copy: 146.660 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00160        MiB: 1.00000    Copy: 625.078 MiB/s

Setting initial scaling_max_freq to 1000000
Setting scaling_max_freq to 1725000
AVG     Method: MEMCPY  Elapsed: 0.00229        MiB: 1.00000    Copy: 435.939 MiB/s
AVG     Method: DUMB    Elapsed: 0.00619        MiB: 1.00000    Copy: 161.540 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00227        MiB: 1.00000    Copy: 440.529 MiB/s

Setting back to my defaults (ondemand)
Setting scaling_max_freq to 1725000
AVG     Method: MEMCPY  Elapsed: 0.00231        MiB: 1.00000    Copy: 432.769 MiB/s
AVG     Method: DUMB    Elapsed: 0.00598        MiB: 1.00000    Copy: 167.090 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00226        MiB: 1.00000    Copy: 442.693 MiB/s

fantom-x · September 7, 2019, 11:47am

Did you see the comment there that the patch slows down the router?

facboy · September 7, 2019, 11:59am

yes, i was trying to work out why. i am going to flash a vanilla hnyman build and see what the results are like.

facboy · September 8, 2019, 2:19pm

hmm...so it doesn't seem like there are any errors being thrown by the functions to set the L2 regulator voltage/L2 clock rate. As far as I can tell it just seems like it can't transition cleanly from the middle L2 bin (1 GHz) to the top L2 bin (1.2 GHz). If I drop back down to the idle bin (384 MHz) in between and then go back up to the target bin, it seems to be fine.
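
A minimal userspace sketch of that workaround, assuming the performance governor (so the CPU runs at scaling_max_freq) and assuming scaling_min_freq is already at or below the low step. The function name `set_freq_via_low`, the `CPUFREQ_ROOT` and `SETTLE_SECS` variables, and the one-second settle delay are illustrative assumptions, not taken from the test repository; the 600000 default follows the "Restored by setting to 600 first" run in the log above.

```shell
#!/bin/sh
# Sketch only: step the CPU down to a low frequency before raising it to the
# target. Names and defaults here are illustrative assumptions.
CPUFREQ_ROOT="${CPUFREQ_ROOT:-/sys/devices/system/cpu}"

set_freq_via_low() {
    target="$1"          # target frequency in kHz, e.g. 1725000
    low="${2:-600000}"   # low step in kHz; 600000 is the value used in the log

    echo "Setting initial scaling_max_freq to $low"
    for policy in "$CPUFREQ_ROOT"/cpu[0-9]*/cpufreq; do
        [ -w "$policy/scaling_max_freq" ] || continue
        echo performance > "$policy/scaling_governor"
        echo "$low" > "$policy/scaling_max_freq"
    done

    sleep "${SETTLE_SECS:-1}"   # assumed settle time, not a measured requirement

    echo "Setting scaling_max_freq to $target"
    for policy in "$CPUFREQ_ROOT"/cpu[0-9]*/cpufreq; do
        [ -w "$policy/scaling_max_freq" ] || continue
        echo "$target" > "$policy/scaling_max_freq"
    done
}
```

Calling `set_freq_via_low 1725000` on the router would go to 600000 first and then to 1725000 on every CPU that exposes a cpufreq directory.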

fantom-x · September 8, 2019, 2:42pm

What is the definition of fine? Do you see a performance improvement?

facboy · September 8, 2019, 3:50pm

in the memory bandwidth, yes. when it doesn't transition correctly the memory bandwidth never gets above about 400-410 MiB/s. if it does transition correctly the memory bandwidth is 630-670 MiB/s with the CPU frequency at 1.7 GHz.

whether this will translate into performance improvements in eg SQM i couldn't say, my line isn't fast enough to tax it.
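
To make the slow/fast distinction checkable from a script, the MEMCPY average can be pulled out of the mbw output and compared against a cut-off. This is a rough sketch: the helper names are made up, and the 500 MiB/s threshold is an assumed midpoint between the two ranges quoted above, so it only makes sense with the CPU at 1725000.

```shell
#!/bin/sh
# Print the MEMCPY average in MiB/s from mbw output read on stdin.
# Relies on the "AVG  Method: MEMCPY ... Copy: <n> MiB/s" line format
# shown in the first post, where the copy rate is the 9th field.
mbw_memcpy_avg() {
    awk '$1 == "AVG" && $3 == "MEMCPY" { print $9 }'
}

# Classify a MiB/s value as "slow" or "fast".
# The second argument is the threshold; 500 is an assumed default.
l2_state() {
    awk -v v="$1" -v t="${2:-500}" \
        'BEGIN { print ((v + 0 < t + 0) ? "slow" : "fast") }'
}
```

For example, `l2_state 436.758` prints `slow` and `l2_state 641.519` prints `fast`, matching the two 1725000 runs in the log.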

facboy · September 8, 2019, 11:54pm

hmm...i might have fixed it. it seems like a bit of a hack: i just reset the L2 cache to the idle freq prior to setting it to the target freq (https://github.com/facboy/openwrt/tree/pr2280). now the memory bandwidth seems to scale up and down as expected.

i made some other changes too, including removing the caching just to eliminate race conditions as a potential source of the problem. i could put it back, but this is well outside my area of knowledge - should we be concerned about thread-safety when storing the frequencies in static variables?

fantom-x · September 9, 2019, 2:22am

Check out the original PR: it did not have caching. I added it because I suspected lock contention could cause higher CPU usage.
One way to test it is to run a speed test with 32 streams and cake and watch CPU usage: the increase of 30% was quite noticeable and the cores were at 100% with a 50M downlink.

facboy · September 9, 2019, 8:11am

To be honest I've never seen the 100% behaviour on master or on any of 'my' changes on my branch. I have been running a speedtest with cake and 32 concurrent up/down streams on my 70/20 link. When the L2 cache is in the invalid 'slow' state, top shows about 60% sirq; when it is in the correct 'fast' state it's about 30% sirq. After my patch it seems to scale the L2 cache correctly for all transitions. I tested various transitions using the performance governor; it's a bit harder with the ondemand governor, but I didn't observe any 'slow' behaviour.
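
For anyone who wants a number rather than eyeballing top, the softirq share can be estimated from two samples of the aggregate `cpu` line in /proc/stat. This is a sketch with made-up helper names; it sums the user through steal columns as the total, which may differ slightly from how top computes its sirq figure.

```shell
#!/bin/sh
# softirq_pct "<cpu line before>" "<cpu line after>"
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
# so softirq is awk field 8 and the total is the sum of fields 2 to 9.
softirq_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            if (NR == 1) { t0 = total; s0 = $8 } else { t1 = total; s1 = $8 }
        }
        END {
            if (t1 > t0) { printf "%.1f\n", 100 * (s1 - s0) / (t1 - t0) } else { print "n/a" }
        }'
}

# Sample /proc/stat over an interval (default 5 seconds, an arbitrary choice).
sample_softirq() {
    before=$(head -n 1 /proc/stat)
    sleep "${1:-5}"
    after=$(head -n 1 /proc/stat)
    softirq_pct "$before" "$after"
}
```

Running `sample_softirq 10` during a speed test prints the softirq percentage over those ten seconds, averaged across both cores.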

Chief · September 9, 2019, 10:48am

Did you grep the current openwrt kernel sources for these properties in your patch?

"l2-cpufreq"
"fab-scaling"

are not used by current qcom drivers, these are properties used in newer qcom drivers:-)
The patch is just a placebo.
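
A check of this kind could be sketched as below. The function name and the directory argument are hypothetical; point it at wherever the patches or the unpacked kernel sources live in your tree.

```shell
#!/bin/sh
# find_dt_props DIR: list files under DIR that mention either property.
# Exits non-zero when nothing matches, as grep does.
find_dt_props() {
    grep -rl -e 'l2-cpufreq' -e 'fab-scaling' "$1"
}
```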

facboy · September 9, 2019, 11:10am

not sure what you mean, the patch doesn't rely on them being used by the kernel sources, the patch adds the code that uses them.

one of the points of this exercise was to come up with something that could reliably measure any changes, not hand-wavey 'it feels faster'. hence the use of mbw. whether the increased memory bandwidth translates into tangible increases in the real world i couldn't say, but the bandwidth fixes are real and not a 'placebo'.

Chief · September 9, 2019, 11:59am

Sorry my bad, something got screwed up when looking at the pr, this is against master not 19.07.
I checked again, I recall some other observation with some of the different eval boards I worked on some time ago, one change in those patches would explain that behavior.

facboy · September 9, 2019, 1:02pm

which behaviour specifically?

Pedro · September 9, 2019, 6:34pm

I can try this patch on my r7800.

0049-cpufreq-dt-support-l2-cache-and-rpm-clocks-scaling.patch

This is the only thing that I need to add to my master clone, right?

Pedro · September 9, 2019, 6:39pm

my master clone has this file instead:
0049-PM-OPP-Support-adjusting-OPP-voltages-at-runtime.patch

facboy · September 9, 2019, 10:58pm

um, i think you should apply the patch as a patchset, not cherry-pick individual files. the patch deletes that file.

https://github.com/facboy/openwrt/compare/master...facboy:pr2280.patch

Pedro · September 10, 2019, 12:02am

I applied the patch to a clean master clone, built it and flashed it, maintaining my configuration.

the configuration sets the cpu governor to performance

when using piece_of_cake.qos with cake, I see no difference with or without the patch, meaning:

one CPU gets to 95-100% without saturating my 200 Mbit downlink (as shown by htop).
this shows in top as 45% utilization by the ksoftirqd process.
the maximum download speed I can get is about 13000 KB/s.

If I use a less CPU-intensive shaper (simplest.qos with fq_codel) I can get 22000 KB/s.
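
For comparison against the 200 Mbit line rate, the KB/s figures can be converted to Mbit/s. A small sketch, assuming KB here means 1000 bytes (if the tool actually reports KiB, the results are about 2.4% higher); `kbps_to_mbit` is a hypothetical helper name.

```shell
#!/bin/sh
# Convert a rate in KB/s (assumed 1 KB = 1000 bytes) to Mbit/s, rounded.
kbps_to_mbit() {
    awk -v kb="$1" 'BEGIN { printf "%.0f\n", kb * 8 / 1000 }'
}
```

`kbps_to_mbit 13000` prints 104 and `kbps_to_mbit 22000` prints 176, so under that assumption cake tops out at roughly half the line rate and fq_codel gets close to it.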

facboy · September 10, 2019, 8:48am

do you have this in your config? it's from KONG's build: https://www.desipro.de/openwrt/sources/startup

for me at least, i found that this does actually distribute the load across both cores, without it the sirq does all end up on one core. i also have his additions to qcom-ipq8064.dtsi on my local build but i am not sure what, if anything, they do.

	# utilize both CPU cores for network processing (3 = bitmask for CPU0 + CPU1)
	for file in /sys/class/net/*
	do
		echo 3 > "$file/queues/rx-0/rps_cpus"
		echo 3 > "$file/queues/tx-0/xps_cpus"
	done

that said, mine uses about 25-30% sirq on a 70Mbit link, so if it scales linearly i expect using both cores we could probably saturate your 200Mbit link, but we're not going to get anywhere near the claimed 450Mbps (and now 650Mbps) that KONG's build can do. i have no idea what changes he has made to make that possible.
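
The value 3 written to rps_cpus and xps_cpus in the loop above is a hexadecimal CPU bitmask: bit 0 is CPU0 and bit 1 is CPU1, so 3 selects both cores of the R7800. A small sketch for building such a mask from a list of CPU numbers; `cpu_mask` is a made-up helper, not part of KONG's script.

```shell
#!/bin/sh
# cpu_mask CPU...: print the hex bitmask with one bit set per listed CPU.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}
```

`cpu_mask 0 1` prints 3, while `cpu_mask 1` prints 2, which would steer packet processing to the second core only.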

Pedro · September 10, 2019, 10:01am

wow!
that change made piece_of_cake.qos go from around 13000 KB/s to 20500 KB/s with about 80% CPU utilization on both cores!!!!

Pedro · September 10, 2019, 10:18am