Support NEON on mvebu arch

rsalvaterra · June 8, 2017, 11:25am

Hi, everyone! I have two routers based on Armada 385 (Turris Omnia, I know they aren't supported on LEDE yet, but a lot of other popular systems are). After reading the documentation, I see the mvebu target is using the arm_cortex-a9_vfpv3 package architecture, which suggests these SoCs' capabilities are being underused. NEON (and VFPv3) is optional in Cortex-A9, but it's included in all Armada 375+ SoCs. Would it make sense to create a new package architecture (arm_cortex-a9_neon-vfpv3) for these newer systems? Thanks in advance!

jow · June 8, 2017, 1:08pm

At the moment we're trying to share architectures as much as possible to reduce the number of different build targets we need to support on the buildbot farm.

rsalvaterra · June 8, 2017, 1:33pm

Thanks for the reply, it was the answer I was afraid of, unfortunately. Does it mean it could happen in the future, or will it always be a "build it yourself" kind of thing?

rsalvaterra · June 8, 2017, 2:03pm

OK, so I dug around a bit. According to the GCC manual[1], "neon" is an alias for "neon-vfpv3", so it would be just a matter of reassigning these mvebu archs to the already existing arm_cortex-a9_neon, no new package arch needed at all.

[1] https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/ARM-Options.html#ARM-Options

anomeome · June 9, 2017, 2:09pm

Is it correct that all members of the wrt pack support NEON, except mamba(370/xp)? I modified the rango build to test with NEON, and just a cursory look would appear to suggest a noticeable performance benefit.

rsalvaterra · June 9, 2017, 2:49pm

Yes. Everything above (and including) Armada 375 implements NEON.

(Edit: to be honest, I thought mamba was also based on Armada 385, like the rest of the family. Oh, well… )
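
For anyone unsure whether a particular board reports NEON, one quick check is the "Features" line of /proc/cpuinfo, which on 32-bit ARM Linux kernels lists a "neon" flag when the unit is present. Below is a minimal sketch of that check. The function names are made up for illustration, and it assumes Python is available wherever you run it; on the router itself, grepping the same file does the job.

```python
def cpu_features(cpuinfo_text):
    """Return the set of flags from the first 'Features' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            return set(value.split())
    # No 'Features' line found (e.g. not an ARM kernel).
    return set()


def has_neon(cpuinfo_text):
    """True if the 32-bit ARM 'neon' flag is listed among the CPU features."""
    return "neon" in cpu_features(cpuinfo_text)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description; the path is the usual Linux default."""
    with open(path) as f:
        return f.read()
```

Calling has_neon(read_cpuinfo()) on the device should return True on NEON-capable SoCs. Note that this only tells you what the hardware and kernel expose, not whether the packages were built to use it.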

diizzy · July 9, 2017, 7:04pm

@anomeome @rsalvaterra
If you'd like to give it a try, I made a pull request to add this functionality.

rsalvaterra · July 10, 2017, 9:35am

Good stuff! I don't have a build environment ready, but I have a close friend doing his own compiles of LEDE for his WRT1200AC, I think I can give him a nudge…

starcms · July 13, 2017, 7:47pm

@diizzy, I really hope this gets merged. The performance improvements could be very significant. Thanks for your work in writing the PR.

anomeome · July 13, 2017, 7:54pm

Uptick the vote at FS#867.

hnyman · July 14, 2017, 5:32am

Why? Do you think that anybody actually monitors the vote count...
(I have seen no evidence of that, so far)

anomeome · July 14, 2017, 2:29pm

Well, if nothing else, at least a straw poll ...

anomeome · August 18, 2017, 11:47pm

If you build your own image for a member of the wrtpack, and:

use the device as an A/V server and transcode data streams
use a supported SSL package
perform Chebyshev approximation for solving nonlinear exponential decline analysis

before this disappears into the annals of history, you may want to grab the patch for your build, as the PR has been closed and apparently will not be finding its way into master in the foreseeable future.

@diizzy, thanks for all the fish
@rsalvaterra, changed heading in attempt to make more search friendly somewhere on down the road.

diizzy · August 20, 2017, 7:53pm

fwiw, I've closed the pull request and you can read the reason below.

tiagogaspar8 · December 10, 2019, 4:06pm

Hey guys!

This is an old, very old topic, but I'd like to reopen it.
Since the last post, LEDE has merged with OpenWrt again, and the concerns that made the developers question this request no longer apply (as there are other platforms that use NEON).
I'd like to ask if anyone can help me in making the pull request and separating the two targets.

rsalvaterra · December 10, 2019, 5:13pm

That's some world class necromancy right there.

I'm doing my own builds with NEON support, and they work perfectly, although I haven't done any performance measurements. Still, I do believe measurements are necessary, for a simple (though not obvious) reason: the Cortex-A9 NEON unit is rather gimped. I'm hoping for some nice improvements in some very specific areas (e.g. crypto), but not much else.
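
For what it's worth, a measurement doesn't have to be elaborate. The sketch below times the same command several times and compares medians between two builds. It's only an illustration: the helper names are invented, it assumes Python is available where the benchmark runs, and the command to time is whatever workload you actually care about.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output so terminal I/O does not skew the timing.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def compare(baseline, candidate):
    """Return candidate median / baseline median (below 1.0 means faster)."""
    return statistics.median(candidate) / statistics.median(baseline)
```

Collect one list of durations on the stock build and one on the NEON build, then pass both to compare(). Using the median rather than the mean keeps a single slow run from dominating the result.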

tiagogaspar8 · December 10, 2019, 5:37pm

Haahahah yes.

From my point of view it should be done: it doesn't have any negative impact on anything and, on top of that, it improves performance, no matter by how much.
Either way I'd need help doing it.

anomeome · December 10, 2019, 5:38pm

Ya, found face down DIW... I'm not sure anything has changed as regards the repo and another (duplicated) sub-target. Personally I just use -O3 on my compile, and as crypto with OpenSSL is my main concern I don't fuss it. I just go along with the OOTB, handcrafted NEON assembler (people still do that?) provided by that package.

rsalvaterra · December 11, 2019, 9:23am

Don't. Have you benchmarked? These are not x86 CPUs with fat caches. -O3 is quite probably hurting more than helping, since it increases text size significantly, which in turn results in a lower cache hit rate and reduced performance. The caches on the Armada 385 CPU are rather small (32 KiB L1I$/L1D$, 1 MiB unified L2$).
In my builds, for example, I remove the target/linux/generic/pending-4.14/201-extra_optimization.patch, which increases the text size by adding -fno-reorder-blocks -fno-tree-ch to the build flags (I have no idea why it was added; hysterical raisins, surely).
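
To put numbers on the text size argument, one rough approach is to run the binutils `size` tool on the same binary built with different flags and compare the "text" column against the 32 KiB L1I figure above. The sketch below parses the default (Berkeley) output format of `size`. The helper names are hypothetical, and total text size is only a crude proxy: what matters for the hit rate is the hot working set, which only profiling will show.

```python
L1I_BYTES = 32 * 1024  # L1 instruction cache size quoted above for Armada 385


def text_bytes(size_output):
    """Map filename -> text size in bytes from Berkeley-format `size` output.

    Expects the header line (text, data, bss, dec, hex, filename) followed by
    one row per file. Filenames containing spaces are not handled.
    """
    lines = [ln.split() for ln in size_output.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    text_col = header.index("text")
    name_col = header.index("filename")
    return {row[name_col]: int(row[text_col]) for row in rows}


def growth(before, after):
    """Relative text growth, e.g. 0.25 for a 25% increase."""
    return (after - before) / before


def l1i_multiples(text_size, l1i=L1I_BYTES):
    """How many times over the text section would fill the L1 I-cache."""
    return text_size / l1i
```

Feed it the output of `size` for the -Os and -O3 variants of a binary and compare the two entries with growth().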

For extremely performance sensitive inner loops? Yes, all the time (and not just SIMD).

anomeome · December 11, 2019, 7:11pm

I assume by text size you are referring to the compiler generated code size. Although the L1 (32KB x 2) and L2 (1MB), might not be considered large, they are certainly capable of handling the expected load types of a router running long lived processes, with a reasonable L1/L2 hit rate. I have seen no indication that the increased code size generated by -O3 optimisations, inlining, loop unrolling etc. is having any detrimental effect.

OpenWrt attempts to minimise code size by default (i.e. -Os), and I believe the referenced patch is a further attempt at that reduction, although the GNU C documentation gives conflicting information regarding the two flags and what happens with the -Os option. I guess someone decided to make sure of the outcome.

That was a rhetorical query, meant to have been taken humorously. Given the nature of the requirements around this particular area, I would guess GNU C would churn out something less than desirable, if left to its own devices. But I will not be looking into it, my compiler writing days are way behind me.