Hi, everyone! I have two routers based on Armada 385 (Turris Omnia, I know they aren't supported on LEDE yet, but a lot of other popular systems are). After reading the documentation, I see the mvebu target is using the arm_cortex-a9_vfpv3 package architecture, which suggests these SoCs' capabilities are being underused. NEON (and VFPv3) is optional in Cortex-A9, but it's included in all Armada 375+ SoCs. Would it make sense to create a new package architecture (arm_cortex-a9_neon-vfpv3) for these newer systems? Thanks in advance!
At the moment we're trying to share architectures as much as possible to reduce the number of different build targets we need to support on the buildbot farm.
Thanks for the reply, it was the answer I was afraid of, unfortunately. Does it mean it could happen in the future, or will it always be a "build it yourself" kind of thing?
OK, so I dug around a bit. According to the GCC manual, "neon" is an alias for "neon-vfpv3", so it would be just a matter of reassigning these mvebu archs to the already existing arm_cortex-a9_neon, no new package arch needed at all.
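For reference, a minimal sketch of what the change could look like, assuming the mvebu target still selects its FPU via `CPU_SUBTYPE` in `target/linux/mvebu/Makefile` (variable names from memory, so double-check against the current tree before sending anything):

```make
# target/linux/mvebu/Makefile — sketch only, not a tested diff
CPU_TYPE:=cortex-a9
# Before: VFPv3 only
#CPU_SUBTYPE:=vfpv3
# After: per the GCC manual, "neon" is an alias for "neon-vfpv3",
# so this enables NEON while keeping VFPv3 code generation
CPU_SUBTYPE:=neon
```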
Is it correct that all members of the wrt pack support NEON, except mamba (370/XP)? I modified the rango build to test with NEON, and even a cursory look suggests a noticeable performance benefit.
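For anyone who wants to check their own unit: on ARM kernels, NEON shows up as `neon` in the `Features` line of `/proc/cpuinfo`. A tiny sketch (the `has_neon` helper is mine, not anything from the tree):

```shell
# has_neon: print yes/no depending on whether a cpuinfo dump
# mentions the "neon" feature flag (word match, so "vfpv3" etc.
# won't trigger a false positive).
has_neon() {
    printf '%s\n' "$1" | grep -qw neon && echo yes || echo no
}

# On the router itself you would run:
#   has_neon "$(cat /proc/cpuinfo)"
```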
Yes. Everything above (and including) Armada 375 implements NEON.
(Edit: to be honest, I thought mamba was also based on Armada 385, like the rest of the family. Oh, well… )
Good stuff! I don't have a build environment ready, but I have a close friend doing his own compiles of LEDE for his WRT1200AC, I think I can give him a nudge…
@dizzy, I really hope this gets merged. The performance improvements could be very significant. Thanks for your work in writing the PR.
Uptick the vote at FS#867.
Why? Do you think that anybody actually monitors the vote count...
(I have seen no evidence of that, so far)
Well, if nothing else, at least a straw poll ...
If you build your own image for a member of the wrtpack, and:
- use the device as an A/V server and transcode data streams
- use a supported SSL package
- perform Chebyshev approximation for solving nonlinear exponential decline analysis
before this disappears into the annals of history, you may want to grab a patch for your build, as the PR has been closed and apparently will not find its way into master in the foreseeable future.
fwiw, I've closed the pull request and you can read the reason below.
This is an old, very old topic, but I'd like to reopen it.
Since the last post, LEDE has merged with OpenWrt again, and the issues that made the developers question this request no longer apply (there are other targets that already use NEON).
I'd like to ask if anyone can help me make the pull request and separate the two targets.
That's some world class necromancy right there.
I'm doing my own builds with NEON support, and they work perfectly, although I haven't done any performance measurements. Still, I do believe measurements are necessary, for a simple (though not obvious) reason: the Cortex-A9 NEON unit is rather gimped. I'm hoping for some nice improvements in a few very specific areas (e.g. crypto), but not much else.
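Since crypto is where I expect the gains, a simple before/after check is to run the same `openssl speed` invocation on the vfpv3-only image and the NEON-enabled one and compare throughput (the flags below are standard OpenSSL options; swap in whichever ciphers your setup actually uses):

```shell
# Short, repeatable throughput test for two primitives OpenSSL
# commonly accelerates with NEON assembler. Run on both images
# and compare the reported bytes/second.
openssl speed -seconds 1 -evp aes-128-cbc
openssl speed -seconds 1 -evp sha256
```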
From my point of view it should be done: it doesn't have any negative impact on anything, and on top of that it improves performance, no matter by how much.
Either way I'd need help doing it.
Ya, found face down DIW... I'm not sure anything has changed as regards the repo and another (duplicated) sub-target. Personally I just use -O3 on my compile, and as crypto with OpenSSL is my main concern I don't fuss it. I just go along with the OOTB, handcrafted NEON assembler (people still do that?) provided by that package.
Don't. Have you benchmarked? These are not x86 CPUs with fat caches. -O3 is quite probably hurting more than helping, since it increases text size significantly, which in turn will result in a lower cache hit rate, reducing the performance. The caches on the Armada 385 CPU are rather small (32 kiB L1I$/L1D$, 1 MiB unified L2$).
In my builds, for example, I remove the target/linux/generic/pending-4.14/201-extra_optimization.patch, which increases the text size by adding -fno-reorder-blocks -fno-tree-ch to the build flags (I have no idea why it was added; hysterical raisins, surely).
For extremely performance sensitive inner loops? Yes, all the time (and not just SIMD).
I assume by text size you are referring to the compiler-generated code size. Although the L1 (32KB x 2) and L2 (1MB) might not be considered large, they are certainly capable of handling the expected load types of a router running long-lived processes, with a reasonable L1/L2 hit rate. I have seen no indication that the increased code size generated by -O3 optimisations, inlining, loop unrolling etc. is having any detrimental effect.
OpenWrt attempts to minimise code size by default (i.e. -Os), and I believe the referenced patch is a further attempt at that reduction, although the GNU C documentation gives conflicting information regarding the two flags and what happens under -Os. I guess someone decided to make sure of what actually occurs.
That was a rhetorical query, meant to be taken humorously. Given the nature of the requirements in this particular area, I would guess GNU C would churn out something less than desirable if left to its own devices. But I will not be looking into it; my compiler-writing days are way behind me.