facboy
1
I've been playing around with testing the frequency scaling on my R7800. Since we seem to think there's an issue with the L2-cache scaling, I used a simple memory bandwidth benchmark to try out various transitions through the speed bins: https://github.com/facboy/openwrt-r7800-freq-test
I'd be interested in other people's results. This is from my build, which is largely hnyman's build with a patched https://github.com/openwrt/openwrt/pull/2280; I haven't tried with e.g. a vanilla hnyman build yet.
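The gist of the script is just to walk scaling_max_freq through the bins under the performance governor and run mbw after each transition. A simplified sketch (the real script is in the repo; the exact sysfs paths and mbw invocation here are assumptions):

# pin the governor so only scaling_max_freq decides the speed bin
for cpu in /sys/devices/system/cpu/cpu[01]/cpufreq; do
    echo performance > "$cpu/scaling_governor"
done

set_freq() {
    for cpu in /sys/devices/system/cpu/cpu[01]/cpufreq; do
        echo "$1" > "$cpu/scaling_max_freq"
    done
}

set_freq 600000          # known-good starting bin
set_freq 1725000         # the transition under test
mbw -n 10 1 | grep AVG   # average copy bandwidth over a 1 MiB buffer

Anyway, the actual run: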
root@worstwrt:~/bin/freq-test# ./test_mbw.sh
*** My defaults (ondemand)
AVG Method: MEMCPY Elapsed: 0.00154 MiB: 1.00000 Copy: 648.803 MiB/s
AVG Method: DUMB Elapsed: 0.00576 MiB: 1.00000 Copy: 173.656 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00153 MiB: 1.00000 Copy: 652.443 MiB/s
*** Setting performance governor
Setting scaling_max_freq to 384000
AVG Method: MEMCPY Elapsed: 0.00290 MiB: 1.00000 Copy: 345.304 MiB/s
AVG Method: DUMB Elapsed: 0.02870 MiB: 1.00000 Copy: 34.843 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00282 MiB: 1.00000 Copy: 355.051 MiB/s
Setting scaling_max_freq to 600000
AVG Method: MEMCPY Elapsed: 0.00254 MiB: 1.00000 Copy: 393.391 MiB/s
AVG Method: DUMB Elapsed: 0.01808 MiB: 1.00000 Copy: 55.304 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00255 MiB: 1.00000 Copy: 391.803 MiB/s
Setting scaling_max_freq to 800000
AVG Method: MEMCPY Elapsed: 0.00185 MiB: 1.00000 Copy: 540.044 MiB/s
AVG Method: DUMB Elapsed: 0.01165 MiB: 1.00000 Copy: 85.823 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00185 MiB: 1.00000 Copy: 540.511 MiB/s
Setting scaling_max_freq to 1000000
AVG Method: MEMCPY Elapsed: 0.00177 MiB: 1.00000 Copy: 564.557 MiB/s
AVG Method: DUMB Elapsed: 0.00947 MiB: 1.00000 Copy: 105.619 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00182 MiB: 1.00000 Copy: 548.908 MiB/s
Setting scaling_max_freq to 1400000
AVG Method: MEMCPY Elapsed: 0.00163 MiB: 1.00000 Copy: 615.006 MiB/s
AVG Method: DUMB Elapsed: 0.00677 MiB: 1.00000 Copy: 147.721 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00161 MiB: 1.00000 Copy: 619.387 MiB/s
Setting scaling_max_freq to 1725000
AVG Method: MEMCPY Elapsed: 0.00251 MiB: 1.00000 Copy: 398.279 MiB/s
AVG Method: DUMB Elapsed: 0.00587 MiB: 1.00000 Copy: 170.216 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00229 MiB: 1.00000 Copy: 436.758 MiB/s
*** Now seems to be busted
Setting scaling_max_freq to 800000
AVG Method: MEMCPY Elapsed: 0.00241 MiB: 1.00000 Copy: 415.628 MiB/s
AVG Method: DUMB Elapsed: 0.01181 MiB: 1.00000 Copy: 84.680 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00241 MiB: 1.00000 Copy: 414.542 MiB/s
Setting scaling_max_freq to 1000000
AVG Method: MEMCPY Elapsed: 0.00237 MiB: 1.00000 Copy: 421.745 MiB/s
AVG Method: DUMB Elapsed: 0.00955 MiB: 1.00000 Copy: 104.723 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00241 MiB: 1.00000 Copy: 415.800 MiB/s
Setting scaling_max_freq to 1400000
AVG Method: MEMCPY Elapsed: 0.00229 MiB: 1.00000 Copy: 435.787 MiB/s
AVG Method: DUMB Elapsed: 0.00705 MiB: 1.00000 Copy: 141.832 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00229 MiB: 1.00000 Copy: 436.186 MiB/s
Setting scaling_max_freq to 1725000
AVG Method: MEMCPY Elapsed: 0.00227 MiB: 1.00000 Copy: 440.199 MiB/s
AVG Method: DUMB Elapsed: 0.00596 MiB: 1.00000 Copy: 167.706 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00229 MiB: 1.00000 Copy: 436.987 MiB/s
*** Restored by setting to 600 first
Setting initial scaling_max_freq to 600000
Setting scaling_max_freq to 1000000
AVG Method: MEMCPY Elapsed: 0.00183 MiB: 1.00000 Copy: 546.001 MiB/s
AVG Method: DUMB Elapsed: 0.00949 MiB: 1.00000 Copy: 105.374 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00186 MiB: 1.00000 Copy: 537.057 MiB/s
Setting initial scaling_max_freq to 600000
Setting scaling_max_freq to 1400000
AVG Method: MEMCPY Elapsed: 0.00158 MiB: 1.00000 Copy: 634.357 MiB/s
AVG Method: DUMB Elapsed: 0.00688 MiB: 1.00000 Copy: 145.319 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00161 MiB: 1.00000 Copy: 619.655 MiB/s
Setting initial scaling_max_freq to 600000
Setting scaling_max_freq to 1725000
AVG Method: MEMCPY Elapsed: 0.00156 MiB: 1.00000 Copy: 641.519 MiB/s
AVG Method: DUMB Elapsed: 0.00572 MiB: 1.00000 Copy: 174.706 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00156 MiB: 1.00000 Copy: 639.550 MiB/s
Setting initial scaling_max_freq to 600000
Setting scaling_max_freq to 1400000
AVG Method: MEMCPY Elapsed: 0.00185 MiB: 1.00000 Copy: 541.389 MiB/s
AVG Method: DUMB Elapsed: 0.00689 MiB: 1.00000 Copy: 145.140 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00162 MiB: 1.00000 Copy: 617.627 MiB/s
*** Breaks when jumping from 1000
Setting initial scaling_max_freq to 1000000
Setting scaling_max_freq to 1000000
AVG Method: MEMCPY Elapsed: 0.00178 MiB: 1.00000 Copy: 560.852 MiB/s
AVG Method: DUMB Elapsed: 0.00957 MiB: 1.00000 Copy: 104.471 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00181 MiB: 1.00000 Copy: 551.359 MiB/s
Setting initial scaling_max_freq to 1000000
Setting scaling_max_freq to 1400000
AVG Method: MEMCPY Elapsed: 0.00160 MiB: 1.00000 Copy: 626.881 MiB/s
AVG Method: DUMB Elapsed: 0.00682 MiB: 1.00000 Copy: 146.660 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00160 MiB: 1.00000 Copy: 625.078 MiB/s
Setting initial scaling_max_freq to 1000000
Setting scaling_max_freq to 1725000
AVG Method: MEMCPY Elapsed: 0.00229 MiB: 1.00000 Copy: 435.939 MiB/s
AVG Method: DUMB Elapsed: 0.00619 MiB: 1.00000 Copy: 161.540 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00227 MiB: 1.00000 Copy: 440.529 MiB/s
Setting back to my defaults (ondemand)
Setting scaling_max_freq to 1725000
AVG Method: MEMCPY Elapsed: 0.00231 MiB: 1.00000 Copy: 432.769 MiB/s
AVG Method: DUMB Elapsed: 0.00598 MiB: 1.00000 Copy: 167.090 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00226 MiB: 1.00000 Copy: 442.693 MiB/s
Did you see the comment there that the patch slows down the router?
facboy
3
Yes, I was trying to work out why. I'm going to flash a vanilla hnyman build and see what the results are like.
facboy
4
Hmm... so there don't seem to be any errors thrown by the functions that set the L2 regulator voltage/L2 clock rate. As far as I can tell it just can't transition cleanly from the middle L2 bin (1 GHz) to the top L2 bin (1.2 GHz). If I drop back down to the idle bin (384 MHz) in between and then back up to the target bin, it seems to be fine.
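In sysfs terms, the "Restored by setting to 600 first" runs above boil down to this (paths assumed; cpu0 shown):

FREQ=/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
echo 600000  > "$FREQ"   # presumably a CPU bin that maps to the idle (384 MHz) L2 bin
echo 1725000 > "$FREQ"   # now the jump to the top L2 bin comes out clean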
What is the definition of fine? Do you see a performance improvement?
facboy
6
In the memory bandwidth, yes. When it doesn't transition correctly, the memory bandwidth never gets above about 400-410 MiB/s; when it does transition correctly, it's 630-670 MiB/s with the CPU frequency at 1.7 GHz.
Whether this will translate into performance improvements in e.g. SQM I couldn't say; my line isn't fast enough to tax it.
facboy
7
Hmm... I might have fixed it. It seems like a bit of a hack: I just reset the L2 cache to the idle frequency before setting it to the target frequency: https://github.com/facboy/openwrt/tree/pr2280. Now the memory bandwidth seems to scale up and down as expected.
I made some other changes too, including removing the caching, just to eliminate race conditions as a potential source of the problem. I could put it back, but this is well outside my area of knowledge: should we be concerned about thread-safety when storing the frequencies in static variables?
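Something like this exercises every bin-to-bin transition under the performance governor (a sketch, cpu0 only; the frequencies are the R7800's bins from the run above):

for src in 384000 600000 800000 1000000 1400000 1725000; do
    for dst in 384000 600000 800000 1000000 1400000 1725000; do
        echo "$src" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
        echo "$dst" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
        echo "$src -> $dst:"
        mbw -n 5 1 | grep AVG   # bandwidth should track $dst, not get stuck low
    done
done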
Check out the original PR: it did not have caching. I added it as I suspected lock contention could cause higher CPU usage.
One way to test it is to run a speed test with 32 streams and cake and watch CPU usage: the increase of 30% was quite noticeable, and the cores were at 100% with a 50 Mbit downlink.
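For example, with iperf3 (the server address is a placeholder) while watching top/htop on the router:

# 32 parallel download streams through the shaper for 30 seconds
iperf3 -c 192.0.2.1 -P 32 -R -t 30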
facboy
9
To be honest, I've never seen the 100% behaviour on master or on 'my' various changes on my branch. I have been running a speed test with cake and 32 concurrent up/down streams on my 70/20 link. When the L2 cache is in the invalid 'slow' state, top shows about 60% sirq; when it is in the correct 'fast' state, it's about 30% sirq. After my patch it seems to correctly scale the L2 cache for all transitions. I tested the various transitions using the performance governor; it's a bit harder to test with the ondemand governor, but I didn't observe any 'slow' behaviour.
Chief
10
Did you grep the current OpenWrt kernel sources for these properties in your patch?
"l2-cpufreq"
"fab-scaling"
They are not used by the current qcom drivers; these are properties used in newer qcom drivers. :-)
The patch is just a placebo.
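For example, from the top of an OpenWrt checkout (the paths here are the usual layout; adjust for your tree):

# search the target's patches/DTS files and the unpacked kernel tree
grep -rn -e l2-cpufreq -e fab-scaling target/linux/ipq806x/
grep -rn -e l2-cpufreq -e fab-scaling build_dir/target-*/linux-ipq806x*/linux-*/drivers/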
facboy
11
Not sure what you mean; the patch doesn't rely on them being used by the existing kernel sources, the patch adds the code that uses them.
One of the points of this exercise was to come up with something that could reliably measure any changes, not a hand-wavy 'it feels faster'; hence the use of mbw. Whether the increased memory bandwidth translates into tangible increases in the real world I couldn't say, but the bandwidth fixes are real and not a 'placebo'.
Chief
12
Sorry, my bad: something got screwed up when I was looking at the PR; this is against master, not 19.07.
I checked again. I recall another observation from some of the different eval boards I worked on some time ago; one change in those patches would explain that behavior.
facboy
13
Which behaviour, specifically?
Pedro
14
I can try this patch on my R7800.
0049-cpufreq-dt-support-l2-cache-and-rpm-clocks-scaling.patch
This is the only thing I need to add to my master clone, right?
Pedro
15
My master clone has this file instead:
0049-PM-OPP-Support-adjusting-OPP-voltages-at-runtime.patch
facboy
16
Um, I think you should apply it as a patchset, not cherry-pick individual files; the patchset deletes that file.
https://github.com/facboy/openwrt/compare/master...facboy:pr2280.patch
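Something like this should apply it against a clean master checkout (a sketch; the local filename is my choice):

# fetch the whole patchset from the GitHub compare view and apply it
wget -O pr2280.patch https://github.com/facboy/openwrt/compare/master...facboy:pr2280.patch
git apply --stat pr2280.patch   # preview which files it touches
git apply pr2280.patch          # apply to the working tree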
Pedro
17
I applied the patch to a clean master clone, built it, and flashed it, keeping my configuration.
- the configuration sets the CPU governor to performance
When using piece_of_cake with cake, I see no difference with or without the patch, meaning:
- one CPU gets to 95-100% (as shown by htop) without saturating my 200 Mbit downlink
- this shows in top as 45% utilization by the ksoftirqd process
- the maximum download speed I can get is about 13000 KB/s
If I use a less CPU-intensive shaper (simplest with fq_codel) I can get 22000 KB/s.
facboy
18
Do you have this in your config? It's from KONG's build: https://www.desipro.de/openwrt/sources/startup
For me at least, I found that this does actually distribute the load across both cores; without it, the sirq load all ends up on one core. I also have his additions to qcom-ipq8064.dtsi in my local build, but I'm not sure what, if anything, they do.
# utilize both CPU cores for network processing:
# a CPU bitmask of 3 (binary 11) enables cores 0 and 1 for RPS/XPS
for file in /sys/class/net/*
do
    echo 3 > "$file/queues/rx-0/rps_cpus"
    echo 3 > "$file/queues/tx-0/xps_cpus"
done
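You can sanity-check that the masks stuck with:

grep . /sys/class/net/*/queues/rx-0/rps_cpus /sys/class/net/*/queues/tx-0/xps_cpus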
That said, mine uses about 25-30% sirq on a 70 Mbit link, so if it scales linearly, I expect that using both cores we could probably saturate your 200 Mbit link; but we're not going to get anywhere near the claimed 450 Mbps (and now 650 Mbps) that KONG's build can do. I have no idea what changes he has made to make that possible.
Pedro
19
Wow!
That change made piece_of_cake go from around 13000 KB/s to 20500 KB/s, with about 80% CPU utilization on both cores!