Poking at the crypto abilities of my GL.iNet Flint GL-AX1800 on 25.12.1, I got very promising results, but also ran into some issues I couldn't overcome with the firmware builder.
Target SoC: IPQ6018 / IPQ6000 (GL-AX1800 and similar)
The Problem: The current qualcommax target defaults to the pure software (C) implementation for crypto tasks. Even when users manually "unleash" the hardware layers via custom builds, the kernel prioritizes the ARMv8 Crypto Extensions (CE) over the Qualcomm Crypto Engine (QCE). The result is 100% CPU saturation ("Green Load") for tasks that should be handled by dedicated DMA-capable silicon.
The Three-Tier Discovery
Through exhaustive testing on the IPQ6018, I have mapped three distinct performance tiers. OpenWrt is currently stuck on Tier 1 by default:
Tier 1: Pure Software (Current Default): ~20-40 MB/s. High CPU, abysmal efficiency.
Tier 2: ARMv8 CE (Manual "Unleashed" Build): Using sha256-ce. Hits ~500 MB/s but results in 100% CPU utilization. This is a "power virus" that creates massive thermal stress and micro-jitter in the network stack.
Tier 3: QCE Silicon (The "Grey Load" Goal): Using sha256-qce via BAM DMA. This is true asynchronous offloading. CPU utilization remains <10%, preserving cycles for SQM, Wi-Fi 6 interrupts, and 1 Gbps routing.
The Priority Inversion
Even when both hardware layers are enabled, the Linux Crypto API selects the wrong "Hardware" provider:
sha256-ce (ARMv8 CE) Priority: 200
sha256-qce (QCE Silicon) Priority: 175
Because 200 > 175, the kernel defaults to pinning the A53 core (Tier 2) instead of using the DMA block (Tier 3).
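For anyone wanting to verify this on their own device, the selection is visible in /proc/crypto, which lists every registered implementation with its driver name and priority (a read-only check, no patches needed):

# Show every registered sha256 provider with its driver and priority;
# the kernel hands new transforms to the highest-priority entry.
grep -A 4 'name *: sha256$' /proc/crypto

On an affected build you would expect to see sha256-ce carrying a higher priority than sha256-qce, matching the numbers above.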
Proposed Changes
Default to Hardware: Move away from the C-implementation as the primary provider for IPQ60xx.
Correct Priorities: Patch drivers/crypto/qce/ to increase the priority to 300, ensuring that silicon offload outranks CPU instructions.
Package Unlinking: Update transmission-daemon and other crypto-heavy packages to link against a hardware-aware OpenSSL build by default.
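As a quick illustration of that last point, the actual linkage can be checked on-device; the commands below assume an installed and running transmission-daemon and only show what the binary pulls in, nothing more:

# Which libraries does the packaged transmission-daemon declare as dependencies?
opkg depends transmission-daemon
# Which crypto/TLS objects does the running process actually map?
# (assumes a single transmission-daemon process)
grep -E 'ssl|crypto|mbedtls' /proc/$(pidof transmission-daemon)/maps | awk '{print $6}' | sort -u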
Thought I had better raise the issue here for broader discussion before taking it to the GitHub issue tracker.
You will have to present your performance comparisons with real-world tests and the information needed to reproduce them. Past tests (ipq806x, mvebu, ...) in real-world scenarios suggested that making the codecs behave (e.g. for dm-crypt or IPsec/OpenVPN) killed the performance of the hardware engines, due to context switches and having to properly 'format' the data to be processed by the offloaded codecs. So far the CPU/software codecs have always come out as advantageous.
The trick would be:
these are my patches
these are the strongswan, openvpn or dm-crypt settings/configs
these are the performance differences (for common/modern codecs and reasonable encryption strengths)
make it easy for others to experience (confirm) the improvements themselves, on their hardware (ipq50xx, ipq60xx and ipq807x might differ here, due to the clock-rate differences)
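For illustration, a minimal, easily repeatable comparison along those lines could be as simple as the following (assuming an engine-enabled OpenSSL with the afalg engine present; the stock afalg engine historically only covers CBC modes, so IPsec/OpenVPN/dm-crypt numbers would still need the real configs mentioned above):

# Software path: OpenSSL's own AES implementation
openssl speed -elapsed -evp aes-256-cbc
# Kernel path via AF_ALG: the work goes to whichever kernel provider
# (generic C, CE or QCE) currently has the highest priority
openssl speed -elapsed -engine afalg -evp aes-256-cbc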
This aspect is considerably less likely; size still matters (OEM partitioning can even reduce 128/256 MB of NAND to less than 20 MB usable). The default TLS provider for OpenWrt is therefore mbedtls, which is not offloaded, but small. This might therefore only be possible as a build-time decision.
I appreciate the feedback regarding the context-switch overhead observed on ipq806x/mvebu. However, the current qualcommax implementation for the IPQ6018 (GL-AX1800) prevents the community from providing the comparative "real-world" benchmarks you are requesting due to the following structural barriers:
1. Software-only Baseline (Tier 1)
The current qualcommax target defaults to generic C-software implementations for SHA/AES. This results in ~40 MB/s throughput at 100% CPU load. Both available hardware layers (A53 CE and QCE Silicon) remain unutilized in the standard image.
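Reproducing that baseline figure on a stock image takes two commands (exact numbers will vary with clock rate and block size):

# Benchmark the hash provider the stock image actually selects
openssl speed -elapsed -evp sha256
# In a second shell, watch one core get pinned while the test runs
top -d 1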
2. Structural Inability to Benchmark QCE (Tier 3)
Even when a user attempts to enable hardware acceleration via custom builds, the current target configuration masks the results:
Kernel Priority Inversion: The kernel hardcodes sha256-ce (ARMv8 instructions) at Priority 200 and sha256-qce (QCE Silicon) at Priority 175. Because 200 > 175, the kernel will always default to the synchronous, CPU-bound CE provider. This makes it impossible to measure the asynchronous DMA benefits of the QCE in a standard environment.
Missing Userspace Bridge: The qualcommax default profile omits kmod-crypto-afalg and kmod-crypto-user. Furthermore, the libopenssl package in the official feeds is built without the engine support required to interface with the QCE via the AF_ALG or devcrypto APIs (a quick on-device check is sketched after this list).
Application Linkage: Binaries such as transmission-daemon are linked against libraries (e.g., mbedtls) that lack the plumbing to reach the QCE silicon, even if the drivers were present and prioritized.
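The userspace side of this can be verified directly on a device; this is only a capability check and says nothing about whether the QCE is actually reachable:

# Which engines does this libopenssl build know about, and do they initialise?
openssl engine -t -c
# Are the kernel-side userspace bridges installed at all?
opkg list-installed | grep -E 'kmod-crypto-(user|afalg)'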
3. Large-Block I/O vs. Small-Packet Latency
The argument that "CPU codecs are always advantageous" typically stems from small-packet overhead (IPsec/OpenVPN). It does not apply to large-block synchronous I/O (16 KB to 1 MB blocks) typical of torrent piece-hashing or dm-crypt.
A53 CE (Tier 2): Hits ~500 MB/s but pins the core at 100%, starving networking interrupts.
QCE Silicon (Tier 3): Capable of GB/s throughput with <10% CPU utilization via BAM DMA, provided the priority and userspace bridge are functional.
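To put numbers on the large-block case specifically, the block size can be forced (a sketch; cryptsetup is not part of the default image):

# Hash throughput at a single 16 KiB block size, closer to piece-hashing
# and dm-crypt I/O than openssl's default small buffers
openssl speed -elapsed -bytes 16384 -evp sha256
# dm-crypt-oriented numbers, if cryptsetup is installed (which backend it
# exercises depends on how cryptsetup was built)
cryptsetup benchmark -c aes-cbc -s 256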
4. Proposed Changes for a "Level Playing Field"
To allow for the rigorous testing you’ve requested, the following baseline changes are needed for the qualcommax target:
Shift Driver Priority: Patch drivers/crypto/qce/ to set QCE algorithm priorities to 300 so that silicon offloading outranks CPU-bound instructions.
Standardize Bridge: Include kmod-crypto-afalg and engine-enabled libopenssl in the qualcommax profile.
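As a build-time sketch of what that could look like in a .config (the exact symbol and package names here are assumptions on my side and should be verified in make menuconfig for the release being built):

# Hypothetical .config fragment; verify the symbol names before use
cat >> .config << 'EOF'
CONFIG_PACKAGE_kmod-crypto-user=y
CONFIG_PACKAGE_kmod-crypto-afalg=y
CONFIG_OPENSSL_ENGINE=y
CONFIG_PACKAGE_libopenssl-devcrypto=y
EOF
make defconfig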
Until these barriers are removed, we are forced to use the CPU for heavy crypto.
Disclaimer: I'm not on the decision-making path, just trying to help you.
The source is at your fingertips, you can patch it locally and post the patches alongside your configs and performance comparisons.
Keep in mind, doing nothing and keeping everything as-is is the easiest option for anyone involved. If you are lobbying for change, you should make it easy for anyone to reproduce your findings and quantify the performance improvements.
Sure, you can also hope that just raising the topic will motivate someone else to pick this up... Chances of success may vary; I wouldn't hold my breath.
Just a word of advice: AI usage might not be that convincing when trying to get others to spend their time on a topic; it risks being dismissed as unproven theory.
I know, and I appreciate it. And I'm sorry for the formal tone of the smart typewriter; it just makes life so much easier when it comes to putting a complex structure into writing quickly.
My intention is not to pin the work on somebody else's back, but I'm not confident that I could carry the job to completion by myself either. I'm merely pointing out my findings and the current limitations and constraints in the build.
My best results with the build I managed to put together were:
# openssl speed -evp aes-256-cbc
Interesting. I have a build setup, but I'm a complete novice, so... If there are instructions to test, I have a Linksys MR7350 (IPQ6018) and a Linksys MR5500 (IPQ5018) I can test out. Looking through this, I am not sure how to properly patch and test at this time.
I can look at this again this weekend and see if I can get the build system set up again on my sysvinit system (went non-systemd).
Update: Compiled the 25.12 snapshot today on my Devuan system; the build system is up and running.