MVEBU: Low performance on Armada 370, Level 2 cache disabled?

I noticed abnormally low performance on a Buffalo LinkStation LS421DE, which is equipped with a Marvell Armada 370 SoC. So I checked the CPU subsystem registers with devmem in a custom OpenWrt firmware, and it seems the Level 2 cache isn't enabled:

root@OpenWrt:/# devmem 0xd0008100
0x00000000

According to the SoC datasheet, this bit (L2Enable) should be 1 when the L2 cache is enabled.

To be sure, I also checked the bit while running the stock firmware, and there it is indeed enabled:

[root@LS421DE7CC ~]# devmem 0xd0008100
0x00000001
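
The comparison above comes down to bit 0 of the word read from 0xd0008100. A minimal sketch of that check follows; the helper name `l2_enabled` is hypothetical, and treating bit 0 as L2Enable is an assumption based on the two readbacks shown above.

```python
def l2_enabled(devmem_output: str) -> bool:
    """Return True if bit 0 of the L2 control register readback is set."""
    # devmem prints the register as a hex string such as "0x00000001"
    value = int(devmem_output.strip(), 16)
    return bool(value & 0x1)


print(l2_enabled("0x00000000"))  # readback under OpenWrt
print(l2_enabled("0x00000001"))  # readback under the stock firmware
```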

To compare performance, I ran a simple wget test on OpenWrt (compiled without the netfilter/ipt modules) and on the stock firmware:

  • OpenWrt:
Linux version 5.4.75 (dani@tool) (gcc version 8.4.0 (OpenWrt GCC 8.4.0 r0-c78e123)) #0 SMP Tue Nov 10 12:11:32 2020
root@OpenWrt:~# devmem 0xd0008100
0x00000000
root@OpenWrt:~# devmem 0xd0008104
0x12086300
root@OpenWrt:~# wget -O- http://192.168.1.7:8000/bigfile.bin > /dev/null
--2020-11-10 04:48:08--  http://192.168.1.7:8000/bigfile.bin
Connecting to 192.168.1.7:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2602501255 (2.4G) [application/octet-stream]
Saving to: 'STDOUT'
      100%[======================>]   2.42G  40.2MB/s    in 58s     

2020-11-10 04:49:06 (42.6 MB/s) - written to stdout [2602501255/2602501255]
root@OpenWrt:~#
  • Stock firmware
[root@LS421DE7CC ~]# cat /proc/version 
Linux version 3.3.4 (root@nasbuild) (gcc version 4.6.2 (Linaro GCC branch-4.6.2. Marvell GCC 201201-883.01c949de) ) #1 Tue Nov 19 11:22:20 JST 2019

[root@LS421DE7CC /]# devmem 0xd0008100
0x00000001
[root@LS421DE7CC /]# devmem 0xd0008104
0x12086302

[root@LS421DE7CC /]# wget -O- http://192.168.1.7:8000/bigfile.bin > /dev/null
--2020-11-10 04:15:03--  http://192.168.1.7:8000/bigfile.bin
Connecting to 192.168.1.7:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2602501255 (2.4G) [application/octet-stream]
Saving to: `STDOUT'

100%[=========================================>] 2,602,501,255 83.3M/s   in 30s     

2020-11-10 04:15:33 (82.8 MB/s) - written to stdout [2602501255/2602501255]

As you can see, throughput on OpenWrt is almost half that of the stock firmware: 42 MB/s vs 82 MB/s.
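
The two averages can be roughly reproduced from the byte count and the elapsed times in the transcripts. This is only a sketch: wget rounds the elapsed time it prints, so the results differ slightly from the 42.6 and 82.8 it reports, and the function name `mib_per_s` is my own. wget's "MB" is 1024-based, which is why the divisor is 1024 * 1024.

```python
MIB = 1024 * 1024  # wget's human-readable units are powers of 1024


def mib_per_s(total_bytes: int, seconds: float) -> float:
    """Average transfer rate in MiB/s."""
    return total_bytes / seconds / MIB


size = 2602501255               # "Length:" from both transcripts
openwrt = mib_per_s(size, 58)   # OpenWrt run, "in 58s"
stock = mib_per_s(size, 30)     # stock firmware run, "in 30s"

print(f"OpenWrt: {openwrt:.1f} MiB/s")
print(f"Stock:   {stock:.1f} MiB/s")
print(f"OpenWrt/stock ratio: {openwrt / stock:.2f}")
```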

I made more checks, and all signs point to a disabled cache:

  • Manually setting the L2Enable bit with devmem does nothing; the readback is always 0.
  • I compiled OpenWrt with the L2 cache completely disabled in the kernel, and the performance is exactly the same: same boot times and same wget speed. Again, I can't enable this bit.
  • I can enable the L2Enable bit in U-Boot with the command: mw.l 0xd0008100 0x1 1
    But again, once OpenWrt is loaded and running, the bit is disabled.
  • I can disable the bit in the stock firmware; as expected, this locks up the system.
  • I never got a system lockup while manipulating the cache registers in OpenWrt, which indicates the cache isn't enabled.

The OpenWrt boot log https://pastebin.com/YVBvEAXH says:

[ 0.000000] Aurora cache controller enabled, 4 ways, 256 kB
[ 0.000000] Aurora: CACHE_ID 0x00000100, AUX_CTRL 0x12086302

But the cache does not appear to be working: L2Enable reads 0 and performance is low.

It would be nice if someone could check this bit on their Marvell Armada based device:

  • E.g. a WRT1900AC or WRT3200ACM, where the registers are mapped at a different address, so the commands should be:
devmem 0xf1008100
devmem 0xf1008104
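
The two pairs of addresses differ only in the base at which the internal registers are mapped. As a small sketch, assuming bases of 0xd0000000 and 0xf1000000 (inferred here by subtracting the common offsets 0x8100 and 0x8104 from the addresses used in this post, not taken from a datasheet), the commands can be generated like this. The names are hypothetical.

```python
L2_CTRL_OFFSET = 0x8100      # register holding the L2Enable bit
L2_AUX_CTRL_OFFSET = 0x8104  # second register read in this thread


def devmem_commands(reg_base: int) -> list:
    """Build the two devmem read commands for a given register base."""
    offsets = (L2_CTRL_OFFSET, L2_AUX_CTRL_OFFSET)
    return [f"devmem 0x{reg_base + off:08x}" for off in offsets]


for base in (0xD0000000, 0xF1000000):  # assumed bases, see above
    print("\n".join(devmem_commands(base)))
```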

Any idea what's going on with the cache?

root@mamba:/# devmem 0xf1008100
0x00000001
root@mamba:/# devmem 0xf1008104
0x1A69EF12

Thanks @anomeome for checking the registers on your device.

I compared the mamba device tree file with the LS421DE one to see if anything was missing, and I found this node, which is not defined for my device:


&coherencyfab {
	broken-idle;
};

I added that node to my dts, compiled the firmware again, and these are the results:

root@OpenWrt:~# cat /proc/version
Linux version 5.4.75 (dani@tool) (gcc version 8.4.0 (OpenWrt GCC 8.4.0 r0-c78e123)) #0 SMP Tue Nov 10 12:11:32 2020
root@OpenWrt:~# devmem 0xd0008100
0x00000001
root@OpenWrt:~# devmem 0xd0008104
0x1A086302
root@OpenWrt:~# wget -O- http://192.168.1.7:8000/bigfile.bin > /dev/null
--2020-11-10 09:23:14--  http://192.168.1.7:8000/bigfile.bin
Connecting to 192.168.1.7:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2602501255 (2.4G) [application/octet-stream]
Saving to: 'STDOUT'

-                    100%[=====================>]   2.42G  76.9MB/s    in 31s     

2020-11-10 09:23:45 (80.4 MB/s) - written to stdout [2602501255/2602501255]

root@OpenWrt:~#

Now the cache is indeed enabled and I get the expected performance, almost identical to the stock firmware.

The apparently innocuous broken-idle property for the coherency fabric solved the issue.
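
Besides L2Enable, the readback of 0xd0008104 also changed between the first OpenWrt build (0x12086300) and this one (0x1A086302). A quick way to see which bits differ is an XOR; this sketch only lists the bit positions and makes no claim about what they mean. The function name is hypothetical.

```python
def changed_bits(before: int, after: int) -> list:
    """Return the positions of the bits that differ between two 32-bit values."""
    diff = before ^ after
    return [bit for bit in range(32) if (diff >> bit) & 1]


# 0xd0008104 before and after adding broken-idle to the dts
print(changed_bits(0x12086300, 0x1A086302))  # prints [1, 27]
```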

I'll send a patch in the next few hours to fix this problem.

Regards


Another performance issue:

Now this is the affected register:

I tried to re-enable I/O coherency with this patch:

--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -453,15 +453,9 @@
 		ecc_mask = 0;
 	}
 
-	if (is_smp()) {
-		if (cachepolicy != CPOLICY_WRITEALLOC) {
-			pr_warn("Forcing write-allocate cache policy for SMP\n");
-			cachepolicy = CPOLICY_WRITEALLOC;
-		}
-		if (!(initial_pmd_value & PMD_SECT_S)) {
-			pr_warn("Forcing shared mappings for SMP\n");
-			initial_pmd_value |= PMD_SECT_S;
-		}
+	if (cachepolicy != CPOLICY_WRITEALLOC) {
+		pr_warn("Forcing write-allocate cache policy for Armada 370\n");
+		cachepolicy = CPOLICY_WRITEALLOC;
 	}
 
 	/*
--- a/arch/arm/mach-mvebu/coherency.c
+++ b/arch/arm/mach-mvebu/coherency.c
@@ -226,8 +226,6 @@
 	 * where we don't know yet on which SoC we are running.
 
 	 */
-	if (!is_smp())
-		return COHERENCY_FABRIC_TYPE_NONE;
 
 	np = of_find_matching_node_and_match(NULL, of_coherency_table, &match);
 	if (!np)
--- a/arch/arm/mm/cache-l2x0.c
+++ b/arch/arm/mm/cache-l2x0.c
@@ -1497,7 +1497,7 @@
 	l2_wt_override = of_property_read_bool(np, "wt-override");
 
 	if (l2_wt_override) {
-		val |= AURORA_ACR_FORCE_WRITE_THRO_POLICY;
+		val |= AURORA_ACR_FORCE_WRITE_POLICY_DIS;
 		mask |= AURORA_ACR_FORCE_WRITE_POLICY_MASK;
 	}
 
--- a/arch/arm/include/asm/hardware/cache-aurora-l2.h
+++ b/arch/arm/include/asm/hardware/cache-aurora-l2.h
@@ -29,7 +29,7 @@
 #define AURORA_ACR_REPLACEMENT_TYPE_LFSR     \
 	(1 << AURORA_ACR_REPLACEMENT_OFFSET)
 #define AURORA_ACR_REPLACEMENT_TYPE_SEMIPLRU \
-	(3 << AURORA_ACR_REPLACEMENT_OFFSET)
+	(2 << AURORA_ACR_REPLACEMENT_OFFSET)
 
 #define AURORA_ACR_PARITY_EN	(1 << 21)
 #define AURORA_ACR_ECC_EN	(1 << 20)
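
The last two hunks touch two fields of the Aurora auxiliary control register: the forced write policy and the replacement type. The sketch below decodes those fields from the values read at 0xd0008104 in this thread. The bit positions (write policy in bits 1:0, replacement in bits 28:27) and the write-policy encodings are assumptions recalled from the mainline cache-aurora-l2.h header, so verify them against your kernel tree; the function name is hypothetical.

```python
REPLACEMENT_OFFSET = 27   # assumed value of AURORA_ACR_REPLACEMENT_OFFSET
WRITE_POLICY_OFFSET = 0   # assumed value of AURORA_ACR_FORCE_WRITE_POLICY_OFFSET

# Assumed encodings of the 2-bit forced write policy field
WRITE_POLICY_NAMES = {
    0: "not forced",
    1: "forced write-back",
    2: "forced write-through",
}


def decode_aux_ctrl(value: int) -> dict:
    """Extract the two 2-bit fields touched by the patch above."""
    return {
        "replacement": (value >> REPLACEMENT_OFFSET) & 0x3,
        "force_write_policy": (value >> WRITE_POLICY_OFFSET) & 0x3,
    }


for reg in (0x12086302, 0x1A086302, 0x12086300):
    fields = decode_aux_ctrl(reg)
    policy = WRITE_POLICY_NAMES.get(fields["force_write_policy"], "unknown")
    print(f"0x{reg:08X}: replacement={fields['replacement']}, write policy: {policy}")
```

Under these assumptions, 0x12086302 (stock firmware) and 0x1A086302 (OpenWrt with broken-idle) differ only in the replacement field, 2 versus 3, which is the value the last hunk changes.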

OpenWrt compiled again, wget test:

root@OpenWrt:~# cat /proc/version
Linux version 5.4.75 (dani@tool) (gcc version 8.4.0 (OpenWrt GCC 8.4.0 r0-c78e123)) #0 SMP Tue Nov 10 12:11:32 2020
root@OpenWrt:~# devmem 0xd0008100
0x00000001
root@OpenWrt:~# devmem 0xd0008104
0x12086300
root@OpenWrt:~# devmem 0xd0020200
0x01000002

root@OpenWrt:~# wget -O- http://192.168.1.7:8000/bigfile.bin > /dev/null
--2020-11-11 14:14:06--  http://192.168.1.7:8000/bigfile.bin
Connecting to 192.168.1.7:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2602501255 (2.4G) [application/octet-stream]
Saving to: 'STDOUT'

-                  100%[===============>]   2.42G  99.4MB/s    in 25s     

2020-11-11 14:14:31 (98.9 MB/s) - written to stdout [2602501255/2602501255]

Now, with CPU coherency enabled, I get an extra 17 MB/s of throughput.

Summarizing the wget tests:

  • Aurora cache disabled: 42 MB/s
  • Aurora cache enabled: 82 MB/s
  • Aurora cache enabled and CPU coherency enabled: 99 MB/s
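
To put the three results in relative terms, here is a short sketch that computes each configuration's speedup over the cache-disabled baseline. The rates are the rounded figures from the list above; the dictionary keys are my own labels.

```python
# Rounded wget rates from the list above, in MB/s
results = {
    "cache disabled": 42,
    "cache enabled": 82,
    "cache and coherency enabled": 99,
}

baseline = results["cache disabled"]
speedups = {name: rate / baseline for name, rate in results.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

With these numbers, enabling the cache is roughly a 1.95x improvement and adding coherency brings it to roughly 2.36x.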

Hardware coherency cannot be enabled in the upstream kernel. Fortunately, that won't be a problem in OpenWrt due to the nature of its targets.

I'll send another patch for the coherency fabric after more testing on the Buffalo LS421DE.

Regards


Another way to fix the L2 cache issue is at runtime, with these commands:

root@OpenWrt:~# echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state1/disable 
root@OpenWrt:~# devmem 0xd0008104 32 0x12086302
root@OpenWrt:~# devmem 0xd0008100 32 0x1

which has the same effect on the L2 cache as adding this to the dts file:

&coherencyfab {
	broken-idle;
};

In the first case we are only disabling the Deep Idle state, which is responsible for powering down the L2 cache, but we are still using the WFI (wait for interrupt) power-saving state.

These states can be reviewed at:
https://github.com/torvalds/linux/blob/master/drivers/cpuidle/cpuidle-mvebu-v7.c#L71

Removing or disabling states[1] in that file also does the job.
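
For boards with more than one CPU entry under sysfs, the first runtime command can be generalized. The following is only a sketch of that step: the function name and its parameters are hypothetical, it needs root, and it does not rewrite the two cache registers afterwards, which still has to be done with devmem as shown above.

```python
from pathlib import Path


def disable_idle_state(state_index: int = 1,
                       sysfs_root: str = "/sys/devices/system/cpu") -> list:
    """Write 1 to cpuidle/state<N>/disable for every cpuN directory found."""
    pattern = f"cpu[0-9]*/cpuidle/state{state_index}/disable"
    touched = []
    for disable_file in sorted(Path(sysfs_root).glob(pattern)):
        disable_file.write_text("1\n")  # same as: echo 1 > .../disable
        touched.append(str(disable_file))
    return touched
```

Calling `disable_idle_state(1)` as root would then be the counterpart of the `echo 1 > .../state1/disable` command, applied to every CPU that exposes that state.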


Another optimization: this time I compiled the kernel with CONFIG_THUMB2_KERNEL, which isn't enabled by default in the OpenWrt kernel config.

We may also want to compile the applications with the thumb optimization, see:

  • wget test:
root@OpenWrt:~# cat /proc/version 
Linux version 5.4.79 (dani@tool) (gcc version 8.4.0 (OpenWrt GCC 8.4.0 r0-c78e123)) #0 SMP Wed Nov 25 12:58:29 2020
root@OpenWrt:~# devmem 0xd0008100
0x00000001
root@OpenWrt:~# devmem 0xd0008104
0x12086300
root@OpenWrt:~# devmem 0xd0020200
0x01000002
root@OpenWrt:~# wget -O- http://192.168.1.7:8000/bigfile.bin > /dev/null
--2020-11-24 18:15:08--  http://192.168.1.7:8000/bigfile.bin
Connecting to 192.168.1.7:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2602501255 (2.4G) [application/octet-stream]
Saving to: 'STDOUT'

-                  100%[================>]   2.42G   111MB/s    in 22s     

2020-11-24 18:15:30 (112 MB/s) - written to stdout [2602501255/2602501255]

root@OpenWrt:~#

Now I get another 13 MB/s compared with the last test. But this time I reached the limit of Gigabit Ethernet.

  • Aurora cache disabled: 42 MB/s
  • Aurora cache enabled: 82 MB/s
  • Aurora cache enabled and CPU coherency enabled: 99 MB/s
  • Aurora cache enabled, CPU coherency enabled and Thumb-2 kernel: 112 MB/s
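
To see how close the 112 MB/s result is to the wire speed, the rate can be converted to Mbit/s. This is a rough sketch: it assumes wget's figure is 1024-based (MiB/s) and compares against the nominal 1000 Mbit/s of Gigabit Ethernet, ignoring framing and TCP/IP header overhead. The function name is my own.

```python
def mib_s_to_mbit_s(rate_mib_s: float) -> float:
    """Convert a rate in MiB/s to decimal megabits per second."""
    return rate_mib_s * 1024 * 1024 * 8 / 1_000_000


payload_mbit = mib_s_to_mbit_s(112)  # last wget result
print(f"{payload_mbit:.1f} Mbit/s of payload on a 1000 Mbit/s link")
```

That works out to about 939.5 Mbit/s of payload, which leaves little room once protocol overhead is taken into account.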

I noticed that if I compile the kernel in Thumb-2 mode without the Aurora L2 fix, the kernel hangs at some point during its initialization.

After reaching the limit of Gigabit Ethernet, I compiled OpenWrt again, this time with the netfilter modules, to make the final test. For this one I used a kernel without SMP support and with all the previous fixes/optimizations, which indeed brings more performance compared with the SMP build.

Does anyone know if there is a reason not to compile the kernel in Thumb-2 mode for the mvebu target?

Regards

IIRC, there was an issue with GCC 8.x and Thumb-2; you might want to try moving forward a version or two.

Size constraints: the kernel for the WRT1900AC v1 and WRT32X can only be 3 MB. This is the size with the latest master:
-rwxr-xr-x 1 tomek users 3057400 11-27 19:33 zImage
and this is after compiling with Thumb-2 enabled:
-rwxr-xr-x 1 tomek users 3350184 11-27 19:12 zImage
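
A quick check of the two zImage sizes against the limit, as a sketch. It assumes that "3MB" means 3 * 1024 * 1024 bytes and that the whole partition is available to the zImage, neither of which is stated above; the names are hypothetical.

```python
KERNEL_PARTITION = 3 * 1024 * 1024  # assumption: 3 MB = 3 MiB = 3145728 bytes

ZIMAGE_SIZES = {
    "master": 3057400,   # from the first ls output
    "thumb2": 3350184,   # from the second ls output
}


def headroom(size: int, limit: int = KERNEL_PARTITION) -> int:
    """Bytes left in the partition; negative means the image does not fit."""
    return limit - size


for name, size in ZIMAGE_SIZES.items():
    print(f"{name}: {size} bytes, headroom {headroom(size)} bytes")

growth = ZIMAGE_SIZES["thumb2"] - ZIMAGE_SIZES["master"]
print(f"Thumb-2 growth: {growth} bytes ({growth / ZIMAGE_SIZES['master']:.1%})")
```

Under that assumption the current master image fits with 88328 bytes to spare, while the Thumb-2 image is 204456 bytes over the limit.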

You can help with this PR https://github.com/openwrt/openwrt/pull/3205, which moves some drivers to kernel modules. I myself can't help with it since I don't have any mvebu cortexa9 hardware for testing.
Maybe then compiling with Thumb2 could be considered.

You probably also want to at least include the multi-core scheduler. I don't think the Cortex-A9 has SMT.

You can flash wrt3200acm builds onto the wrt32x with a serial cable. Identical hardware. The only difference is the flash rom partitioning.

https://community.linksys.com/t5/Wireless-Routers/WRT32X-amp-OpenWRT-LEDE-Reboot-17-01-4-success-story/td-p/1294968

It does not make any sense to do this, considering a working 32x image exists - which only differs where it needs to (partitioning, etc.), but runs the same kernel and rootfs otherwise.

The newer images based on the 5.x kernel are too big for the 3 MB kernel partition of the WRT32X. On the WRT3200ACM it is 6 MB, so you can have a bigger kernel. For this reason there are no snapshot builds of the latest git master.

Only the bot builds currently have an issue due to config settings required, but for individuals things are getting tighter with each kernel push on the mamba and venom.

In the context of this thread, there is an issue here, as the Thumb-2 compilation creates a much larger kernel than the current defaults generate. As for this finding its way into OpenWrt, it probably falls into the same camp as other architecture and optimisation decisions (e.g. vfpv3-d16 vs vfpv3 vs NEON, -Os vs -O2): cater to the lowest common denominator. If you want these things, DIY.
