Yes, I am recommending that you go for something with AES-NI or an equivalent on-core AES instruction set. That would be either x86_64 or ARMv8 (with the crypto extensions), because the penalty for shunting the data over the bus to separate crypto silicon makes it perform worse than AES-NI for small, synchronous operations.
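If you want to check whether a given box has on-core AES instructions, here is a minimal Python sketch. It assumes a Linux system, where `/proc/cpuinfo` lists `aes` on the `flags` line (x86_64) or the `Features` line (ARMv8). The function name `has_aes_instructions` is just something I made up for illustration.

```python
def has_aes_instructions(cpuinfo_text):
    """Return True if the given /proc/cpuinfo text advertises AES instructions.

    Assumption: Linux reports "aes" under "flags" on x86_64 and under
    "Features" on ARMv8. Other operating systems need a different check.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key : value" line
        if key.strip().lower() in ("flags", "features"):
            # Compare whole tokens so a flag that merely contains "aes"
            # (for example "vaes") does not produce a false positive.
            if "aes" in value.split():
                return True
    return False
```

On the target you would call it as `has_aes_instructions(open("/proc/cpuinfo").read())`.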
Correct. It's also a PITA to integrate: porting the code took a lot of work, including a large number of patches (most of them to kernel code), crypto drivers, contiguous memory drivers and others, as well as a port of a patched OpenSSL version designed to work with the hardware.
As @slh pointed out, actually using it is also non-trivial and the configuration of the hardware itself, while complicated, is the least of the issues.
Using Intel QuickAssist requires a patched asynchronous version of OpenSSL, which was also a gigantic PITA to compile and get working (for some reason Intel likes to write software designed for embedded systems that simply cannot be cross-compiled; pretty bizarre, and yes, I pointed this out to the Intel folk responsible for maintaining the software). Using QuickAssist in nginx requires significant patches to nginx as well. It's not an "out of the box" experience by any means.
If you're curious to look at what it takes to get hardware like this working, the code is here
For typical OpenWrt synchronous workloads on smallish buffers (something like OpenVPN), performance using the crypto hardware on AES-CBC was about 70% - 80% of the performance of AES-NI. For larger buffers, the performance started to approach parity. For multiple (36+) threads of asynchronous operations, the speed was about 10x as fast as you'd get using AES-NI.
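To see why buffer size and asynchronous operation matter so much, here is a rough Python sketch of a toy cost model. Each offloaded request pays a fixed per-request cost (the trip over the bus) plus streaming time. The function names and every number in it are made-up assumptions for illustration. They are not measurements, and they are not fitted to the figures above.

```python
def sync_throughput(buf_bytes, setup_s, bytes_per_s):
    """Bytes/second for one blocking request at a time.

    Each request costs a fixed setup_s plus the time to stream the buffer.
    """
    return buf_bytes / (setup_s + buf_bytes / bytes_per_s)


def async_throughput(buf_bytes, setup_s, bytes_per_s, in_flight):
    """Bytes/second with several requests in flight.

    The fixed costs overlap, so throughput scales with in_flight until
    the engine's own streaming rate becomes the limit.
    """
    per_request = sync_throughput(buf_bytes, setup_s, bytes_per_s)
    return min(bytes_per_s, in_flight * per_request)


# Hypothetical parameters, chosen only to show the shape of the curves.
ON_CORE = {"setup_s": 0.0, "bytes_per_s": 1e9}    # no bus trip
OFFLOAD = {"setup_s": 5e-6, "bytes_per_s": 4e9}   # fixed cost per request

if __name__ == "__main__":
    for size in (64, 1024, 16384):
        core = sync_throughput(size, **ON_CORE)
        sync = sync_throughput(size, **OFFLOAD)
        many = async_throughput(size, in_flight=36, **OFFLOAD)
        print(f"{size:6d} B  sync offload/on-core = {sync / core:.2f}  "
              f"async offload/on-core = {many / core:.2f}")
```

In this model the fixed cost dominates small buffers, fades as the buffers grow, and gets hidden when many requests are in flight, which is the same general pattern I described above.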
It would be a real hassle to give you benchmarks, as I compiled the AES acceleration out of the Intel QuickAssist drivers. I'd need to recompile a half dozen kernel modules to be able to get you a benchmark.
The performance on RSA is very good, particularly for signing operations, which perform much better than software regardless of whether they're synchronous or not.
On-core AES-NI, definitely. No doubt in my mind.