You're probably better off using AES-NI and making sure the interrupts on the network card are balanced properly across the different cores.
See the irqbalance comments above. Your use case is encryption of small buffers, and the penalty for offloading it over the lookaside bus will probably be too high to justify. That is, if you're using a multi-core x86_64.
If the device has a much less capable CPU then the encryption offload may be viable.
You can use a mask that allows an interrupt to be serviced on multiple cores. Generally, though, you'd want one interrupt processed on a single core rather than bouncing between cores; that's how irqbalance assigns them: one interrupt to one core.
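For reference, the affinity mask is just a hex bitmap of allowed CPUs, the value written to `/proc/irq/<n>/smp_affinity`. A minimal sketch of how such a mask is built (the helper name is mine, not a kernel interface):

```python
def cpu_mask(*cpus):
    """Build the hex bitmap written to /proc/irq/<n>/smp_affinity."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu       # one bit per allowed CPU
    return format(mask, "x")

# Single-core assignment, the way irqbalance does it:
cpu_mask(2)      # "4" -> only CPU 2 may service the interrupt
# A multi-core mask is also valid:
cpu_mask(0, 1)   # "3" -> CPUs 0 and 1 are both allowed
```

Echoing `cpu_mask(2)` into the `smp_affinity` file of the NIC's interrupt pins it to CPU 2.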
That was just a random example I wrote for the post. They're not actually distributed like that in practice.
I've been fiddling with irqbalance for the last few hours, so the /proc/interrupts output above is what irqbalance has been doing. In practice, my manual affinity assignment is just as you suggest.
On raw performance I'm not arguing: we'd all be better off with some little "NUC" kind of device based on Intel with at least the AES-NI instruction set.
This is about me developing a crypto driver for the EIP93, which is incorporated in the MediaTek MT7621 (ramips target). I know that the Ubiquiti ER-X and the MikroTik hEX RB750GR3 use their own proprietary OS with (IPsec-only) hardware offload, and those devices, with the same MT7621 SoC, are doing 300+ Mbps with IPsec. Yet according to some guy who did some benchmarking, even the ER-X maxed out at around 125 Mbps, and according to him that was because it was only using one core for "upload".
I would imagine they've made some proprietary modifications to the IPsec implementation to make it aware of the crypto accelerator and use it asynchronously; otherwise they're unlikely to get such improved performance.
That is next on my "wish" or "todo" list: see if (and how) I can get it to integrate with the ethernet driver and register it as XFRM offload for ESP. That way a lot of the overhead would be bypassed: getting the data from an SKB into a crypto scatterlist, the additional call for encryption/decryption, and converting back to an SKB.
If I can make some improvements to the cipher code in the process, that would be great: maybe it would be possible to see at least a 10-15% improvement using it for OpenVPN.
For OpenVPN I am envisioning some kind of OpenSSL engine running the EIP93 in Direct Host Mode, versus the Autonomous Ring Mode that I'm using now. I'm not sure if I'm allowed to do some kind of concurrency locking within that engine, but it needs to have something like that. And since OpenVPN is (still) single-threaded, it should not be a problem as long as it's the "only" user.
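A rough sketch of the locking I have in mind for such an engine. Everything here is illustrative (the class name, and the XOR stand-in for the real cipher), not actual EIP93 code:

```python
import threading

class DirectHostEngine:
    """Hypothetical wrapper serializing access to a single crypto unit."""

    def __init__(self):
        self._lock = threading.Lock()

    def cipher(self, data: bytes) -> bytes:
        # Direct Host Mode can only run one operation at a time, so
        # every caller must hold the lock for the whole round trip.
        with self._lock:
            return bytes(b ^ 0xFF for b in data)  # stand-in transform
```

With OpenVPN being single-threaded the lock is uncontended in practice; it only starts to matter if a second user ever opens the engine.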
It will be a separate project and it's long-term thinking for now. Let's see how far performance can be pushed with smarter (better) code.
One of the issues with WireGuard is that the crypto algorithms it uses are mostly not supported by hardware accelerators anyway. Certainly for the Intel adapters: of the primitives WireGuard uses, only HKDF and Curve25519 are supported; the symmetric ciphers are not.
As far as I know, WireGuard uses "padata", which spreads the encryption and decryption across as many cores as are available. All the data sits in a ring buffer and packets are processed in parallel, with each packet given a "node" number to restore the sequence once it's time to send.
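The idea can be sketched in a few lines. The scheduling here is simplified (a plain thread pool standing in for padata), but the serialize-by-sequence-number step is the same:

```python
import concurrent.futures as cf

def parallel_in_order(packets, work):
    """Process packets on any available worker, emit in wire order."""
    with cf.ThreadPoolExecutor() as pool:
        futures = {pool.submit(work, pkt): seq
                   for seq, pkt in enumerate(packets)}
        # Completion order is arbitrary; the sequence-number tags let
        # us restore the original packet order before sending.
        done = [(futures[f], f.result()) for f in cf.as_completed(futures)]
    done.sort(key=lambda pair: pair[0])
    return [result for _seq, result in done]
```

Here `work` would be the per-packet encrypt/decrypt; any callable taking one packet will do for the demonstration.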
I was already looking into this, but I don't think the problem is on the "transmit to hardware" side, rather on the "receive from hardware" side. This is why RPS is more useful than XPS for a NIC. The transmitting side is the various applications we are running on different CPUs; the receiving side is all handled by the one CPU assigned to the interrupt.
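As a toy illustration of what RPS buys you: the interrupt CPU only hashes the flow and hands the packet to another CPU from the configured mask, so receive processing no longer piles up on the one interrupt core. The function name is mine, not the kernel's:

```python
def rps_select_cpu(flow_hash: int, allowed_cpus: list) -> int:
    """Pick a target CPU for receive processing from the RPS CPU set."""
    # The same flow always hashes to the same CPU, which keeps
    # per-flow packet order intact while spreading flows out.
    return allowed_cpus[flow_hash % len(allowed_cpus)]

rps_select_cpu(7, [1, 2, 3])   # -> 2: this flow is steered to CPU 2
```

The real mechanism is configured by writing a CPU bitmap to the `rps_cpus` file of the receive queue; the sketch only shows why steering by flow hash preserves ordering.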
For now I will see if two queues would make a difference: an encryption queue and a decryption queue.