MVEBU: idivt instruction only available in THUMB2 mode?

danitool · March 13, 2021, 7:59pm

When I compiled the kernel in THUMB2 mode for the Buffalo LS421DE, I noticed a new line in the early kernel boot log:

[    0.000000] CPU: div instructions available: patching division code

Then I checked the cpuinfo:

root@LinkStation:~# cat /proc/cpuinfo 
processor       : 0
model name      : ARMv7 Processor rev 1 (v7l)
BogoMIPS        : 37.50
Features        : half thumb fastmult vfp edsp vfpv3 vfpv3d16 tls idivt 
CPU implementer : 0x56
CPU architecture: 7
CPU variant     : 0x1
CPU part        : 0x581
CPU revision    : 1

Hardware        : Marvell Armada 370/XP (Device Tree)
Revision        : 0000
Serial          : 0000000000000000

And indeed the idivt is there.
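To repeat this check across several devices without reading the dump by eye, the Features line can be parsed with a few lines of Python. This is only a minimal sketch: the helper names `cpu_features` and `idiv_support` are made up for illustration, and the sample string is the Features line from the LS421DE dump above.

```python
def cpu_features(cpuinfo_text):
    """Return the flags on the first 'Features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def idiv_support(features):
    """Report whether hardware divide is flagged for ARM and Thumb state."""
    return {"arm": "idiva" in features, "thumb": "idivt" in features}


# Features line copied from the LS421DE cpuinfo dump above.
LS421DE = "Features        : half thumb fastmult vfp edsp vfpv3 vfpv3d16 tls idivt \n"

print(idiv_support(cpu_features(LS421DE)))  # {'arm': False, 'thumb': True}
```

On a running device the same helpers can be fed with `open("/proc/cpuinfo").read()` instead of the sample string.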

The message comes from the kernel option:

... which is configured in:
https://elixir.bootlin.com/linux/v5.4.105/source/arch/arm/kernel/setup.c#L424

One might expect idiva to be reported when THUMB2 mode isn't enabled, but that isn't the case: without THUMB2 mode there seems to be no idiv instruction available at all. This hardware capability is checked here:
https://elixir.bootlin.com/linux/v5.4.105/source/arch/arm/kernel/setup.c#L455
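As far as I can tell, that check reads the Divide_instrs field in bits [27:24] of the CPU's ID_ISAR0 register: a value of 1 means SDIV/UDIV exist in Thumb state only, and 2 means they exist in both ARM and Thumb state. The following is a rough Python model of that decoding, not the kernel code itself; the function name `divide_flags` and the example register values are hypothetical.

```python
def divide_flags(id_isar0):
    """Model of the hwcap decision based on ID_ISAR0[27:24] (Divide_instrs)."""
    divide_instrs = (id_isar0 >> 24) & 0xF
    if divide_instrs == 2:
        # SDIV/UDIV implemented in both ARM and Thumb instruction sets.
        return ["idiva", "idivt"]
    if divide_instrs == 1:
        # SDIV/UDIV implemented in the Thumb instruction set only.
        return ["idivt"]
    # 0 (or an unknown value): no hardware divide is advertised.
    return []


# Hypothetical register values; only bits [27:24] matter here.
for value in (0x00000000, 0x01000000, 0x02000000):
    print(hex(value), divide_flags(value))
```

Under this model a CPU reporting 1 in that field can only ever show idivt, which would fit the Armada 370 output above.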

I also checked the cpuinfo on a WRT3200ACM, which has a more modern and more capable CPU:

root@OpenWrt:~# cat /proc/cpuinfo 
processor       : 0
model name      : ARMv7 Processor rev 1 (v7l)
BogoMIPS        : 1866.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x4
CPU part        : 0xc09
CPU revision    : 1

Again no idiv instruction available.

As I posted elsewhere, when I compile the kernel in THUMB2 mode, the throughput is noticeably better (tested with the Armada 370 SoC). Is the extra idiv instruction responsible for this gain rather than THUMB2 mode itself, or maybe both? AFAIK THUMB2 mode performs about the same as ARM mode.

BTW, it seems the kernel option CONFIG_ARM_PATCH_IDIV is useless, at least on the Cortex-A9 subtarget: no idiv instruction is available with the current kernel config, and the hardware lacks the idiva instruction (or at least the setup code doesn't detect it).
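The reasoning can be written down as a tiny decision function. This is a sketch under one assumption: that the patching code only runs when the divide flag matching the kernel's own instruction set is present (idivt for a THUMB2 kernel, idiva for an ARM kernel). The name `would_patch_division` is hypothetical, and the feature sets are taken from the two cpuinfo dumps above.

```python
def would_patch_division(features, thumb2_kernel):
    """Return True if the division routines would be patched at boot."""
    # A Thumb-2 kernel needs Thumb divide; an ARM kernel needs ARM divide.
    needed = "idivt" if thumb2_kernel else "idiva"
    return needed in features


# Divide-related flags from the cpuinfo dumps above.
LS421DE = {"idivt"}   # Armada 370: Thumb divide only
WRT3200ACM = set()    # Cortex-A9: no divide flag reported

for name, feats in (("LS421DE", LS421DE), ("WRT3200ACM", WRT3200ACM)):
    for thumb2 in (False, True):
        print(name, "THUMB2" if thumb2 else "ARM", would_patch_division(feats, thumb2))
```

With these inputs only the LS421DE with a THUMB2 kernel gets patched, which matches the boot log line appearing only in that build.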

ByteEnable · March 29, 2021, 6:17pm

Thumb mode uses a different instruction set (ISA) from regular ARM mode. There is a paper discussing this here:

Also, the wrt3200 and other MVEBU routers are hamstrung by the openwrt kernel builds. The MVEBU build uses the 1900AC v1 SoC parameters as the default. That is a lower-end MVEBU SoC, which results in non-optimal performance on the higher-end MVEBU SoCs. Quite frankly, I do not understand this openwrt development decision, which has led to many custom builds for end users who own the higher-end Marvell-based routers.

I would assume specifying the correct kernel arm definitions for the Marvell 88F6820 SoC would yield a 20% gain or more in performance with Thumb[2] mode. In my own little tests I was able to see a 10% gain in performance by using the proper definitions for the 88F6820 using the default ARM ISA.

anomeome · March 29, 2021, 6:34pm

It was actually another device that drove things to a lower common denominator. The mamba runs fine with CPU_SUBTYPE:=vfpv3 and more:

root@mamba:/etc/config# cat /proc/cpuinfo 
processor	: 0
model name	: ARMv7 Processor rev 2 (v7l)
BogoMIPS	: 25.00
Features	: half thumb fastmult vfp edsp vfpv3 tls idiva idivt vfpd32 lpae

but ya...

ByteEnable · March 29, 2021, 6:52pm

ya... it's been a while and I can't remember which router is the lowest common denominator. Took a wild swing at the 1900.

m95d · March 29, 2021, 7:05pm

But for WRT1900AC v1 (mamba) there is "CPU_SUBTYPE:=vfpv3-d16" set in target/linux/mvebu/cortexa9/target.mk
Isn't that the same thing?

anomeome · March 29, 2021, 7:11pm

No, see thread.

Here is a patch that I use in my build:

build.patch

diff --git a/include/target.mk b/include/target.mk
index 7526224972..65b6755eb9 100644
--- a/include/target.mk
+++ b/include/target.mk
@@ -230,6 +230,9 @@ ifeq ($(DUMP),1)
     CPU_TYPE = sparc
     CPU_CFLAGS_ultrasparc = -mcpu=ultrasparc
   endif
+  ifeq ($(ARCH),arm)
+    CPU_CFLAGS_cortex-a9 = -mthumb
+  endif
   ifeq ($(ARCH),aarch64)
     CPU_TYPE ?= generic
     CPU_CFLAGS_generic = -mcpu=generic
diff --git a/target/linux/mvebu/cortexa9/target.mk b/target/linux/mvebu/cortexa9/target.mk
index 02697fa62d..dd70acf1aa 100644
--- a/target/linux/mvebu/cortexa9/target.mk
+++ b/target/linux/mvebu/cortexa9/target.mk
@@ -7,5 +7,5 @@ include $(TOPDIR)/rules.mk
 ARCH:=arm
 BOARDNAME:=Marvell Armada 37x/38x/XP
 CPU_TYPE:=cortex-a9
-CPU_SUBTYPE:=vfpv3-d16
+CPU_SUBTYPE:=vfpv3
 KERNELNAME:=zImage dtbs

for mvebu targets.

ByteEnable · March 29, 2021, 7:13pm

You could also use NEON instead of vfpv3. NEON will allow the compiler to emit both vfpv3 and simd instructions.

anomeome · March 29, 2021, 7:17pm

Ya, I have done that in the past when playing with some of the patented stuff. Now I just care about openssl and WG, which are all SIMD OOTB, so don't bother.

ByteEnable · March 29, 2021, 7:25pm

Are you using gcc10 or the default gcc8? Is there a noticeable gain in gcc10?

Just FYI, I'm on last year's 20 devel branch that I compiled with gcc10, but I have not done any testing of gcc8 vs gcc10.

anomeome · March 29, 2021, 7:47pm

Build master, gcc 10 using -O2. I have not noticed any performance change as the GCC versions have changed, but then I have not bothered trying to benchmark anything either.

m95d · March 30, 2021, 11:38am

I see the difference now: vfpv3-d16 uses only half the registers of "ordinary" vfpv3.

But I read the docs on the Marvell Armada XP (used in the WRT1900AC v1 mamba). They clearly specify the FPU as vfpv3-d16. See page 7. So the patch can't work on mamba.

anomeome · March 30, 2021, 5:01pm

Yes, I know what the docs specify for device capabilities. But until the change linked in the thread mentioned above occurred, that is how things were running. The actual CPU capabilities are to be seen in the above cpuinfo. And then of course there is the fact that I use that build.patch in the image that I run on a mamba. I'm going with pud'n proof.
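The cpuinfo flags can be turned into a quick sanity check before changing CPU_SUBTYPE. This is only a sketch: it assumes that the vfpd32 flag means 32 double-precision registers (matching "vfpv3") and that vfpv3d16 means 16 (matching "vfpv3-d16"). The helper name `fpu_subtype` is made up, and the flag sets come from the mamba and LS421DE dumps earlier in the thread.

```python
def fpu_subtype(features):
    """Suggest a CPU_SUBTYPE value from /proc/cpuinfo feature flags."""
    if "vfpv3" not in features:
        return None  # no VFPv3 reported at all
    if "vfpd32" in features:
        return "vfpv3"      # 32 double-precision registers reported
    if "vfpv3d16" in features:
        return "vfpv3-d16"  # only 16 double-precision registers reported
    return None  # VFPv3 present but register count not reported


# Feature flags copied from the cpuinfo dumps earlier in the thread.
MAMBA = set("half thumb fastmult vfp edsp vfpv3 tls idiva idivt vfpd32 lpae".split())
LS421DE = set("half thumb fastmult vfp edsp vfpv3 vfpv3d16 tls idivt".split())

print("mamba:", fpu_subtype(MAMBA))
print("LS421DE:", fpu_subtype(LS421DE))
```

The check only reflects what the kernel reports; it says nothing about what the vendor documentation guarantees.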

m95d · April 1, 2021, 7:11am

You are right. I did a test today with WRT1900AC v1, latest master branch and CPU_SUBTYPE vfpv3:

root@OpenWrt:/# cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 2 (v7l)
BogoMIPS        : 25.00
Features        : half thumb fastmult vfp edsp vfpv3 tls idiva idivt vfpd32 lpae
CPU implementer : 0x56
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0x584
CPU revision    : 2
[...]
root@OpenWrt:/# dmesg | grep -i div
[    0.000000] CPU: div instructions available: patching division code

I also did a test of "openssl speed" w/ devcrypto enabled. There is no significant improvement compared to original vfpv3-d16.
I can't test routing speed right now (I need more cables). But I don't think a division instruction will make a difference there.
I'll try thumb2 mode next.

anomeome · April 1, 2021, 4:27pm

Re: thumb, the above is just userspace; if you want the kernel built that way you will have to modify config-5.10. I found it actually made the kernel larger, exceeding the space reserved for the kernel. I have not tried since that issue was resolved.
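To confirm which mode a given kernel config selects, the config text can be scanned for options set to `y`. This is a minimal sketch; the function name `enabled_options` and the sample fragment are hypothetical and are not taken from an actual config-5.10.

```python
def enabled_options(config_text):
    """Return the set of kernel config options set to 'y'."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as '# CONFIG_FOO is not set' and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return enabled


# Hypothetical config fragment for illustration only.
SAMPLE = """\
CONFIG_ARM_THUMB=y
# CONFIG_THUMB2_KERNEL is not set
CONFIG_ARM_PATCH_IDIV=y
"""

print("CONFIG_THUMB2_KERNEL" in enabled_options(SAMPLE))  # False
```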

danitool · April 1, 2021, 4:42pm

Your device already has the idiva instruction. Using idivt instead probably won't bring any benefit. If you didn't compile with the kernel THUMB2 config, I wonder why idivt is reported in cpuinfo. The Armadas I have only report idivt when compiled with that option.

Is your Linksys the mamba version (armada XP) ?

So the Armada XP has the idiva instruction whereas the Armada 385 doesn't, despite being more modern?

anomeome · April 1, 2021, 4:45pm

Do you know if

CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11=y

is still required on a kernel built with thumb?

danitool · April 1, 2021, 7:18pm

I made several builds with and without that option and I didn't notice any difference; modules were always loaded OK (kernel 5.4).

m95d · April 2, 2021, 6:21am

I have a WRT1900AC v1 with Armada XP. Or at least that's what they say on the net; I didn't actually look under the cooler.
I tried to build a firmware with thumb2 kernel. It didn't boot so I went back to normal kernel with vfpv3 patch.

I don't see that option in kernel_menuconfig v5.10

m95d · April 2, 2021, 6:38am

How does the kernel know which instructions are available? Does it actually try to execute each one? Or is there a hardware or firmware "list" that it reads? Or is it simply hardcoded during compilation according to the CPU type in menuconfig?

anomeome · April 4, 2021, 1:00am

A simple thumb patch for kernel(5.10), including the jump issue:

kernelThumb.patch

diff --git a/target/linux/mvebu/cortexa9/config-5.10 b/target/linux/mvebu/cortexa9/config-5.10
new file mode 100644
index 0000000000..6aff77fda7
--- /dev/null
+++ b/target/linux/mvebu/cortexa9/config-5.10
@@ -0,0 +1,2 @@
+CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11=y
+CONFIG_THUMB2_KERNEL=y

It's not clear to me whether it is still needed, but given your result maybe it is; IIRC it was a WG issue.
I have not had a chance to try this. Do you have a serial console to see what is happening?