Is it possible to add block ciphers for Cavium octeon

I cannot test the performance with iperf3 because an RCU stall always happens. Ping is OK. Yesterday I said "it works" because I had only tested ping. With SW crypto, the iperf3 test is OK.

openvpn and rcu stall log:

root@OpenWrt:/tmp/etc# /tmp/openvpn openvpn-test.conf
2024-01-11 07:11:20 Note: --cipher is not set. OpenVPN versions before 2.5 defaulted to BF-CBC as fallback when cipher negotiation failed in this case. If you need this fallback please add '--da.
2024-01-11 07:11:20 OpenVPN 2.6.8 mips64-openwrt-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] [DCO]
2024-01-11 07:11:20 library versions: OpenSSL 1.1.1v  1 Aug 2023, LZO 2.10
2024-01-11 07:11:20 DCO version: N/A
2024-01-11 07:11:20 net_route_v4_best_gw query: dst 0.0.0.0
2024-01-11 07:11:20 net_route_v4_best_gw result: via 192.168.5.1 dev eth0
2024-01-11 07:11:20 Diffie-Hellman initialized with 1024 bit key
2024-01-11 07:11:20 net_iface_new: add tun0 type ovpn-dco
2024-01-11 07:11:20 DCO device tun0 opened
2024-01-11 07:11:20 net_iface_mtu_set: mtu 1500 for tun0
2024-01-11 07:11:20 net_iface_up: set tun0 up
2024-01-11 07:11:20 net_addr_v4_add: 10.9.0.1/24 dev tun0
2024-01-11 07:11:20 Could not determine IPv4/IPv6 protocol. Using AF_INET
2024-01-11 07:11:20 Socket Buffers: R=[294912->294912] S=[294912->294912]
2024-01-11 07:11:20 UDPv4 link local (bound): [AF_INET][undef]:1194
2024-01-11 07:11:20 UDPv4 link remote: [AF_UNSPEC]
2024-01-11 07:11:20 UID set to nobody
2024-01-11 07:11:20 Capabilities retained: CAP_NET_ADMIN
2024-01-11 07:11:20 MULTI: multi_init called, r=256 v=256
2024-01-11 07:11:20 IFCONFIG POOL IPv4: base=10.9.0.2 size=253
2024-01-11 07:11:20 ifconfig_pool_read(), in='client,10.9.0.2,'
2024-01-11 07:11:20 succeeded -> ifconfig_pool_set(hand=0)
2024-01-11 07:11:20 IFCONFIG POOL LIST
2024-01-11 07:11:20 client,10.9.0.2,
2024-01-11 07:11:20 Initialization Sequence Completed
2024-01-11 07:11:29 192.168.5.226:60179 VERIFY OK: depth=1, C=TW[  643.335556] tun0: ovpn_netlink_new_peer: adding peer with endpoint=192.168.5.226:60179/UDP id=0 VPN-IPv4=10.9.0.2 VPN-IPv6=::
, ST=TW, L=Taipei, O=netgear, OU=netgear, CN=netgear CA, name=EasyRSA, emailAddress=mail@netgear
2024-01-11 07:11:29 192.168.5.[  643.354046] tun0: ovpn_netlink_new_key: new key installed (id=0) for peer 0
226:60179 VERIFY OK: depth=0, C=TW, ST=TW, L=Taipei, O=netgear, OU=netgear, CN=client, name=EasyRSA, emailAddress=mail@netgear
2024-01-11 07:11:29 192.168.5.226:60179 peer info: IV_VER=2.6.8
2024-01-11 07:11:29 192.168.5.226:60179 peer info: IV_PLAT=win
2024-01-11 07:11:29 192.168.5.226:60179 peer info: IV_TCPNL=1
2024-01-11 07:11:29 192.168.5.226:60179 peer info: IV_MTU=1600
2024-01-11 07:11:29 192.168.5.226:60179 peer info: IV_NCP=2
2024-01-11 07:11:29 192.168.5.226:60179 peer info: IV_CIPHERS=AES-256-GCM:AES-128-GCM
2024-01-11 07:11:29 192.168.5.226:60179 peer info: IV_PROTO=990
2024-01-11 07:11:29 192.168.5.226:60179 peer info: IV_GUI_VER=OpenVPN_GUI_11.46.0.0
2024-01-11 07:11:29 192.168.5.226:60179 peer info: IV_SSO=openurl,webauth,crtext
2024-01-11 07:11:29 192.168.5.226:60179 TLS: move_session: dest=TM_ACTIVE src=TM_INITIAL reinit_src=1
2024-01-11 07:11:29 192.168.5.226:60179 TLS: tls_multi_process: initial untrusted session promoted to trusted
2024-01-11 07:11:29 192.168.5.226:60179 Control Channel: TLSv1.3, cipher TLSv1.3 TLS_CHACHA20_POLY1305_SHA256, peer certificate: 1024 bits RSA, signature: RSA-SHA256, peer temporary key: 253 bit9
2024-01-11 07:11:29 192.168.5.226:60179 [client] Peer Connection Initiated with [AF_INET]192.168.5.226:60179
2024-01-11 07:11:29 client/192.168.5.226:60179 MULTI_sva: pool returned IPv4=10.9.0.2, IPv6=(Not enabled)
2024-01-11 07:11:29 client/192.168.5.226:60179 MULTI: Learn: 10.9.0.2 -> client/192.168.5.226:60179
2024-01-11 07:11:29 client/192.168.5.226:60179 MULTI: primary virtual IP for client/192.168.5.226:60179: 10.9.0.2
2024-01-11 07:11:29 client/192.168.5.226:60179 SENT CONTROL [client]: 'PUSH_REPLY,route-gateway 10.9.0.1,topology subnet,ping 10,ping-restart 120,ifconfig 10.9.0.2 255.255.255.0,peer-id 0,cipher)
2024-01-11 07:11:30 client/192.168.5.226:60179 Data Channel: cipher 'AES-256-GCM', peer-id: 0
2024-01-11 07:11:30 client/192.168.5.226:60179 Timers: ping 10, ping-restart 240
2024-01-11 07:11:30 client/192.168.5.226:60179 Protocol options: protocol-flags cc-exit tls-ekm dyn-tls-crypt
[  882.172581] INFO: rcu_sched self-detected stall on CPU
[  882.177742]  0-...: (1 GPs behind) idle=33e/140000000000001/0 softirq=5864/5866 fqs=2608
[  882.185929]   (t=6000 jiffies g=2147 c=2146 q=8)
[  882.190556] NMI backtrace for cpu 0
[  882.194051] CPU: 0 PID: 253 Comm: kworker/0:2 Tainted: P                4.14.76 #0
[  882.201640] Workqueue: ovpn-crypto-wq-tun0 ovpn_decrypt_work [ovpn_dco_v2]
[  882.208525] Stack : ffffffff83778288 0000000014001ce0 22773746ac1f384c 22773746ac1f384c
[  882.216546]         0000000000000000 8000000000323a90 ffffffff83778f00 0000000000000000
[  882.224567]         0000000000000142 0000000000000007 0000000000000000 712d74756e30206f
[  882.232587]         0000000000000000 0000000000000000 0000000000000010 ffffffff81180000
[  882.240607]         0000000000000000 0000000000000000 ffffffff811f0000 0000000000000000
[  882.248628]         ffffffff81180000 ffffffff81186920 0000000000000000 0000000000000000
[  882.256648]         0000000000000000 ffffffff80bf1a90 0000000000000000 1e00000000ab3147
[  882.264668]         800000002a774000 8000000000323a90 ffffffff81170000 ffffffff80f4df10
[  882.272689]         0000000000000000 0000000000000007 0000000000000000 712d74756e30206f
[  882.280709]         0000000000000000 ffffffff808718fc ffffffffc0f13230 c00000fefffdb900
[  882.288729]         ...
[  882.291180] Call Trace:
[  882.293633] [<ffffffff808718fc>] show_stack+0x64/0x108
[  882.298786] [<ffffffff80f4df10>] dump_stack+0x90/0xd0
[  882.303848] [<ffffffff80f54d30>] nmi_cpu_backtrace+0xe0/0x108
[  882.309603] [<ffffffff80f54e28>] nmi_trigger_cpumask_backtrace+0xd0/0x178
[  882.316403] [<ffffffff80925a78>] rcu_dump_cpu_stacks+0xbc/0x128
[  882.322338] [<ffffffff808ec460>] rcu_check_callbacks+0x2e8/0x7e0
[  882.328355] [<ffffffff808efaac>] update_process_times+0x34/0x70
[  882.334287] [<ffffffff808feb00>] tick_sched_timer+0x170/0x1d8
[  882.340042] [<ffffffff808f06f8>] __hrtimer_run_queues+0xd8/0x1b0
[  882.346057] [<ffffffff808f094c>] hrtimer_interrupt+0xd4/0x288
[  882.351813] [<ffffffff80874dd8>] c0_compare_interrupt+0x80/0x98
[  882.357745] [<ffffffff808dcb60>] __handle_irq_event_percpu+0x78/0x188
[  882.364195] [<ffffffff808dcc90>] handle_irq_event_percpu+0x20/0x68
[  882.370383] [<ffffffff808e1760>] handle_percpu_irq+0x80/0xb0
[  882.376050] [<ffffffff808dbbbc>] generic_handle_irq+0x2c/0x48
[  882.381810] [<ffffffff80f6926c>] do_IRQ+0x1c/0x30
[  882.386525] [<ffffffff80807b68>] plat_irq_dispatch+0xe8/0x110
[  882.392283] [<ffffffff8086c52c>] handle_int+0x14c/0x158
[  882.397517] [<ffffffff8087674c>] cnmips_cu2_call+0x7c/0xa0
[  882.403013] [<ffffffff808b31c0>] notifier_call_chain+0x50/0xa8
[  882.408856] [<ffffffff8086d1f4>] handle_cpu_int+0x24/0x30

Here is the ping and iperf3 log from an SSH window.

root@OpenWrt:~# ping 10.9.0.2
PING 10.9.0.2 (10.9.0.2): 56 data bytes
64 bytes from 10.9.0.2: seq=0 ttl=128 time=0.779 ms
64 bytes from 10.9.0.2: seq=1 ttl=128 time=0.629 ms
64 bytes from 10.9.0.2: seq=2 ttl=128 time=0.742 ms
64 bytes from 10.9.0.2: seq=3 ttl=128 time=0.866 ms
64 bytes from 10.9.0.2: seq=4 ttl=128 time=0.744 ms
64 bytes from 10.9.0.2: seq=5 ttl=128 time=0.703 ms
64 bytes from 10.9.0.2: seq=6 ttl=128 time=0.614 ms
64 bytes from 10.9.0.2: seq=7 ttl=128 time=0.545 ms
64 bytes from 10.9.0.2: seq=8 ttl=128 time=0.699 ms
64 bytes from 10.9.0.2: seq=9 ttl=128 time=0.634 ms
64 bytes from 10.9.0.2: seq=10 ttl=128 time=0.986 ms
64 bytes from 10.9.0.2: seq=11 ttl=128 time=0.697 ms
64 bytes from 10.9.0.2: seq=12 ttl=128 time=0.585 ms
64 bytes from 10.9.0.2: seq=13 ttl=128 time=0.681 ms
64 bytes from 10.9.0.2: seq=14 ttl=128 time=0.934 ms
64 bytes from 10.9.0.2: seq=15 ttl=128 time=0.651 ms
64 bytes from 10.9.0.2: seq=16 ttl=128 time=0.968 ms
64 bytes from 10.9.0.2: seq=17 ttl=128 time=1.291 ms
64 bytes from 10.9.0.2: seq=18 ttl=128 time=0.602 ms
64 bytes from 10.9.0.2: seq=19 ttl=128 time=1.091 ms
64 bytes from 10.9.0.2: seq=20 ttl=128 time=0.579 ms
64 bytes from 10.9.0.2: seq=21 ttl=128 time=0.601 ms
64 bytes from 10.9.0.2: seq=22 ttl=128 time=0.568 ms
64 bytes from 10.9.0.2: seq=23 ttl=128 time=0.494 ms
64 bytes from 10.9.0.2: seq=24 ttl=128 time=0.504 ms
64 bytes from 10.9.0.2: seq=25 ttl=128 time=0.480 ms
64 bytes from 10.9.0.2: seq=26 ttl=128 time=0.559 ms
^C
--- 10.9.0.2 ping statistics ---
27 packets transmitted, 27 packets received, 0% packet loss
round-trip min/avg/max = 0.480/0.712/1.291 ms
root@OpenWrt:~# iperf3 -c 10.9.0.2
Connecting to host 10.9.0.2, port 5201
[  5] local 10.9.0.1 port 44924 connected to 10.9.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.05 MBytes  17.2 Mbits/sec    0    112 KBytes
client_loop: send disconnect: Broken pipe

In my experience, tcrypt cannot cover all test cases; it's a basic test tool. To test a HW crypto module, you also need to exercise it with real use cases such as IPsec and ovpn-dco, which may reveal bugs in the module. So have you tried aes-gcm with IPsec or ovpn-dco on your side?

I have an idea about the RCU stall, but the fix may decrease performance. Each block that is encrypted or decrypted is wrapped in local_irq_save and also disables preemption, so IRQs are off for the whole operation. If the block requested for encryption/decryption is very big, it may stall the CPU under the heavy load. I will make a small commit which may fix that problem. Give me a few minutes.

The fix, or let's say "guess", is committed.
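The commit itself is not shown in this thread, so the following is only a sketch of the general idea described above: split a large request into bounded pieces so that any section running with IRQs and preemption disabled stays short. It is a plain userspace model, not driver code. `MAX_CHUNK_BYTES`, `process_in_chunks` and `xor_chunk` are hypothetical names, and the XOR routine merely stands in for the real cipher.

```c
#include <stddef.h>

/* Hypothetical upper bound on the bytes handled per critical section. */
#define MAX_CHUNK_BYTES 4096u

typedef void (*chunk_fn)(const unsigned char *src, unsigned char *dst,
                         size_t len, void *ctx);

/*
 * Process `len` bytes in pieces of at most MAX_CHUNK_BYTES and return the
 * number of pieces. In a driver, the coprocessor enable/disable pair (and
 * any IRQ/preemption disabling) would wrap each call to `fn` instead of
 * the whole loop. `ctx` carries whatever running state the cipher needs
 * between pieces.
 */
size_t process_in_chunks(const unsigned char *src, unsigned char *dst,
                         size_t len, chunk_fn fn, void *ctx)
{
    size_t chunks = 0;

    while (len > 0) {
        size_t n = len < MAX_CHUNK_BYTES ? len : MAX_CHUNK_BYTES;

        /* critical section would start here */
        fn(src, dst, n, ctx);
        /* critical section would end here; IRQs can be serviced again */

        src += n;
        dst += n;
        len -= n;
        chunks++;
    }
    return chunks;
}

/* Stand-in for the real cipher: XOR every byte with a one-byte key. */
void xor_chunk(const unsigned char *src, unsigned char *dst,
               size_t len, void *ctx)
{
    unsigned char key = *(unsigned char *)ctx;

    for (size_t i = 0; i < len; i++)
        dst[i] = src[i] ^ key;
}
```

With these definitions, a 10000-byte buffer is handled in three pieces (4096 + 4096 + 1808 bytes). The bound is a multiple of 16 so that every piece except the last ends on an AES block boundary.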

I confirm the rcu stall issue is fixed.

  1. Test topology:
    Cavium CN7020 router (iperf3 client, OpenVPN server) ---WAN--- Windows PC (iperf3 server, OpenVPN client)

  2. Use SW crypto:

root@OpenWrt:~# iperf3 -c 10.9.0.2
Connecting to host 10.9.0.2, port 5201
[  5] local 10.9.0.1 port 44864 connected to 10.9.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  12.6 MBytes   106 Mbits/sec    0    167 KBytes
[  5]   1.00-2.00   sec  12.7 MBytes   107 Mbits/sec    0    194 KBytes
[  5]   2.00-3.00   sec  12.5 MBytes   105 Mbits/sec    0    205 KBytes
[  5]   3.00-4.00   sec  12.9 MBytes   108 Mbits/sec    0    205 KBytes
[  5]   4.00-5.00   sec  12.7 MBytes   107 Mbits/sec    0    205 KBytes
[  5]   5.00-6.00   sec  12.7 MBytes   106 Mbits/sec    0    205 KBytes
[  5]   6.00-7.00   sec  12.7 MBytes   106 Mbits/sec    0    205 KBytes
[  5]   7.00-8.00   sec  12.6 MBytes   106 Mbits/sec    0    205 KBytes
[  5]   8.00-9.00   sec  12.6 MBytes   106 Mbits/sec    0    205 KBytes
[  5]   9.00-10.00  sec  12.8 MBytes   107 Mbits/sec    0    205 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   127 MBytes   106 Mbits/sec    0             sender
[  5]   0.00-10.01  sec   126 MBytes   106 Mbits/sec                  receiver

iperf Done.
root@OpenWrt:~# iperf3 -c 10.9.0.2 -R
Connecting to host 10.9.0.2, port 5201
Reverse mode, remote host 10.9.0.2 is sending
[  5] local 10.9.0.1 port 44868 connected to 10.9.0.2 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  6.57 MBytes  55.1 Mbits/sec
[  5]   1.00-2.00   sec  7.82 MBytes  65.6 Mbits/sec
[  5]   2.00-3.00   sec  7.80 MBytes  65.4 Mbits/sec
[  5]   3.00-4.00   sec  7.80 MBytes  65.4 Mbits/sec
[  5]   4.00-5.00   sec  7.83 MBytes  65.6 Mbits/sec
[  5]   5.00-6.00   sec  7.79 MBytes  65.4 Mbits/sec
[  5]   6.00-7.00   sec  7.81 MBytes  65.5 Mbits/sec
[  5]   7.00-8.00   sec  7.80 MBytes  65.4 Mbits/sec
[  5]   8.00-9.00   sec  7.79 MBytes  65.3 Mbits/sec
[  5]   9.00-10.00  sec  7.81 MBytes  65.5 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  77.0 MBytes  64.6 Mbits/sec                  sender
[  5]   0.00-10.00  sec  76.8 MBytes  64.4 Mbits/sec                  receiver

iperf Done.

  3. Use HW crypto (aes-gcm):

root@OpenWrt:~# iperf3 -c 10.9.0.2
Connecting to host 10.9.0.2, port 5201
[  5] local 10.9.0.1 port 44856 connected to 10.9.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  32.2 MBytes   270 Mbits/sec    0    340 KBytes
[  5]   1.00-2.00   sec  40.5 MBytes   340 Mbits/sec    0    340 KBytes
[  5]   2.00-3.00   sec  40.8 MBytes   343 Mbits/sec    0    340 KBytes
[  5]   3.00-4.00   sec  40.5 MBytes   340 Mbits/sec    0    354 KBytes
[  5]   4.00-5.00   sec  40.5 MBytes   340 Mbits/sec    0    354 KBytes
[  5]   5.00-6.00   sec  40.5 MBytes   340 Mbits/sec    0    354 KBytes
[  5]   6.00-7.00   sec  40.8 MBytes   343 Mbits/sec    0    354 KBytes
[  5]   7.00-8.00   sec  40.5 MBytes   340 Mbits/sec    0    354 KBytes
[  5]   8.00-9.00   sec  41.0 MBytes   344 Mbits/sec    0    354 KBytes
[  5]   9.00-10.00  sec  39.9 MBytes   335 Mbits/sec    0    513 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   397 MBytes   333 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   396 MBytes   332 Mbits/sec                  receiver

iperf Done.
root@OpenWrt:~# iperf3 -c 10.9.0.2 -R
Connecting to host 10.9.0.2, port 5201
Reverse mode, remote host 10.9.0.2 is sending
[  5] local 10.9.0.1 port 44860 connected to 10.9.0.2 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  9.01 MBytes  75.6 Mbits/sec
[  5]   1.00-2.00   sec  8.96 MBytes  75.2 Mbits/sec
[  5]   2.00-3.00   sec  12.6 MBytes   106 Mbits/sec
[  5]   3.00-4.00   sec  13.1 MBytes   110 Mbits/sec
[  5]   4.00-5.00   sec  14.4 MBytes   121 Mbits/sec
[  5]   5.00-6.00   sec  18.9 MBytes   159 Mbits/sec
[  5]   6.00-7.00   sec  18.8 MBytes   158 Mbits/sec
[  5]   7.00-8.00   sec  18.7 MBytes   157 Mbits/sec
[  5]   8.00-9.00   sec  18.9 MBytes   158 Mbits/sec
[  5]   9.00-10.00  sec  18.8 MBytes   158 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec   152 MBytes   128 Mbits/sec                  sender
[  5]   0.00-10.00  sec   152 MBytes   128 Mbits/sec                  receiver

iperf Done.

We can see HW crypto really works and encryption is faster.

I will also test traffic between a LAN PC and the WAN PC (OpenVPN client), which needs some time because I have to set up a few firewall rules. I will share the result. BTW, another question that may not be related to this topic: do you know whether Cavium cn70xx/cn71xx chips support HW NAT? By HW NAT I mean that the application CPU (the MIPS CPU in cn70xx/cn71xx) only programs the NAT rule into a hardware block, and that hardware then forwards the traffic.

Unfortunately I have no experience with this on cn70xx. I have such a device, but I have never seen any HW NAT code for it.

As for the HW crypto: I will spend some more time making it better, especially avoiding disabling IRQs etc. I already started on that for some algorithms, but for GCM it's a little more complicated. At least we have a good start with 3 times better performance, and I expect we can reach 5 times.
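The "3 times" figure can be checked against the sender-side totals in the iperf3 runs above: 333 vs 106 Mbits/sec in the forward direction and 128 vs 64.6 Mbits/sec in reverse mode. A tiny helper (the name `speedup` is made up for this illustration) makes the arithmetic explicit:

```c
/* Ratio of two throughput figures given in the same unit. */
double speedup(double accelerated, double baseline)
{
    return accelerated / baseline;
}
```

`speedup(333.0, 106.0)` comes out at roughly 3.1, while `speedup(128.0, 64.6)` is roughly 2.0, so the threefold gain applies to the direction where the router encrypts.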

Are you sure about this? Are there any numbers I can refer to? I'm very curious:

  1. If I only use this HW crypto in kernel space, with no HW crypto operations from user space, can I remove the IRQ disable/restore to get better performance?
  2. Furthermore, if ovpn-dco is the only user of this HW crypto in kernel space, can I bypass something to get better performance?
  3. I'm also wondering whether calling the HW crypto ASM primitives directly in ovpn-dco would give better performance. I know this needs a lot of work and would make the code not fit the kernel crypto framework; I'm just curious.

From what I know, you cannot use the crypto functions in IRQ context. Of course you can remove the IRQ disable/restore if you can make sure that no crypto operation is done in IRQ context. If you can make sure that the coprocessor isn't in use from userspace, you can also remove the context save/restore. Just make sure that the coprocessor is enabled, which is also part of crypto_enable/disable; otherwise the kernel will hang.

And yes, if you use the crypto functions directly, without all that API overhead, it's even faster. I once implemented aes-cbc directly in OpenSSL in userspace:

openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 7453901 aes-128-cbc's in 2.98s
Doing aes-128-cbc for 3s on 64 size blocks: 6573681 aes-128-cbc's in 2.98s
Doing aes-128-cbc for 3s on 256 size blocks: 3226206 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 1024 size blocks: 1067392 aes-128-cbc's in 2.98s
Doing aes-128-cbc for 3s on 8192 size blocks: 147226 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 16384 size blocks: 74167 aes-128-cbc's in 3.00s
OpenSSL 1.1.1v 1 Aug 2023
built on: Sun Dec 24 07:45:06 2023 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: ccache mips64-linux-uclibc-gcc -I/home/seg/DEV/octeon/src/router/openssl/crypto -fPIC -fPIC -pthread -mabi=64 -Wa,--noexecstack -Os -fno-unwind-tables -fno-asynchronous-unwind-tables -DOCTEON_OPENSSL -pipe -march=octeon2 -mabi=64 -msoft-float -fno-caller-saves -mno-branch-likely -fno-plt -Os -fno-unwind-tables -fno-asynchronous-unwind-tables -pipe -march=octeon2 -mabi=64 -msoft-float -fno-caller-saves -mno-branch-likely -fno-plt -fno-unwind-tables -fno-asynchronous-unwind-tables -ffunction-sections -fdata-sections -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DPOLY1305_ASM -DNDEBUG -DOCTEON_OPENSSL -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -DOCTEON -DOCTEON_OPENSSL -DNDEBUG -D_GNU_SOURCE -DOPENSSL_SMALL_FOOTPRINT -I/home/seg/DEV/octeon/src/router/openssl/include/executive
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc     40020.94k   141179.73k   276223.66k   366781.68k   403369.70k   405050.71k

openssl speed aes
Doing aes-128 cbc for 3s on 16 size blocks: 12097404 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 64 size blocks: 9804799 aes-128 cbc's in 2.99s
Doing aes-128 cbc for 3s on 256 size blocks: 3877421 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 1127608 aes-128 cbc's in 2.99s
Doing aes-128 cbc for 3s on 8192 size blocks: 148588 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 16384 size blocks: 74360 aes-128 cbc's in 3.00s
Doing aes-192 cbc for 3s on 16 size blocks: 10971039 aes-192 cbc's in 2.97s
Doing aes-192 cbc for 3s on 64 size blocks: 8659391 aes-192 cbc's in 2.98s
Doing aes-192 cbc for 3s on 256 size blocks: 3365425 aes-192 cbc's in 2.99s
Doing aes-192 cbc for 3s on 1024 size blocks: 981688 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 8192 size blocks: 127715 aes-192 cbc's in 2.98s
Doing aes-192 cbc for 3s on 16384 size blocks: 64501 aes-192 cbc's in 2.98s
Doing aes-256 cbc for 3s on 16 size blocks: 10490636 aes-256 cbc's in 2.99s
Doing aes-256 cbc for 3s on 64 size blocks: 7929268 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 256 size blocks: 3018928 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 864945 aes-256 cbc's in 2.97s
Doing aes-256 cbc for 3s on 8192 size blocks: 113584 aes-256 cbc's in 2.98s
Doing aes-256 cbc for 3s on 16384 size blocks: 55436 aes-256 cbc's in 2.88s
OpenSSL 1.1.1v 1 Aug 2023
built on: Sun Dec 24 07:45:06 2023 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: ccache mips64-linux-uclibc-gcc -I/home/seg/DEV/octeon/src/router/openssl/crypto -fPIC -fPIC -pthread -mabi=64 -Wa,--noexecstack -Os -fno-unwind-tables -fno-asynchronous-unwind-tables -DOCTEON_OPENSSL -pipe -march=octeon2 -mabi=64 -msoft-float -fno-caller-saves -mno-branch-likely -fno-plt -Os -fno-unwind-tables -fno-asynchronous-unwind-tables -pipe -march=octeon2 -mabi=64 -msoft-float -fno-caller-saves -mno-branch-likely -fno-plt -fno-unwind-tables -fno-asynchronous-unwind-tables -ffunction-sections -fdata-sections -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DPOLY1305_ASM -DNDEBUG -DOCTEON_OPENSSL -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -DOCTEON -DOCTEON_OPENSSL -DNDEBUG -D_GNU_SOURCE -DOPENSSL_SMALL_FOOTPRINT -I/home/seg/DEV/octeon/src/router/openssl/include/executive
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128 cbc     64519.49k   209868.61k   330873.26k   386177.46k   405744.30k   406104.75k
aes-192 cbc     59103.24k   185973.50k   288143.41k   335082.84k   351087.68k   354625.63k
aes-256 cbc     56137.18k   169157.72k   257615.19k   298216.73k   312241.65k   315369.24k

Thanks! I will try compiling wolfSSL, which is supposed to support aes-gcm, and test aes-gcm performance in user space.

wolfSSL has the code in it, but you cannot compile it since a lot of Octeon SDK related files are missing. And it may only help you with benchmarking.

Yes, you are right. The files below are missing.

#include "cvmx.h"
#include "cvmx-asm.h"
#include "cvmx-key.h"
#include "cvmx-swap.h"

It seems I need to contact wolfSSL.

 % cat  wolfcrypt/src/port/cavium/README_Octeon.md
# Cavium Octeon III CN7300

Please contact wolfSSL at info@wolfssl.com to request an evaluation.

Gained 30+ Mbps after removing local_irq_save and local_irq_restore:

root@OpenWrt:~# iperf3 -c 10.9.0.2
Connecting to host 10.9.0.2, port 5201
[  5] local 10.9.0.1 port 60064 connected to 10.9.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  44.1 MBytes   370 Mbits/sec    0    258 KBytes
[  5]   1.00-2.00   sec  44.4 MBytes   372 Mbits/sec    0    282 KBytes
[  5]   2.00-3.00   sec  44.0 MBytes   369 Mbits/sec    0    306 KBytes
[  5]   3.00-4.00   sec  44.1 MBytes   370 Mbits/sec    0    306 KBytes
[  5]   4.00-5.00   sec  44.1 MBytes   370 Mbits/sec    0    319 KBytes
[  5]   5.00-6.00   sec  44.3 MBytes   372 Mbits/sec    0    332 KBytes
[  5]   6.00-7.00   sec  44.0 MBytes   369 Mbits/sec    0    347 KBytes
[  5]   7.00-8.00   sec  43.9 MBytes   368 Mbits/sec    0    347 KBytes
[  5]   8.00-9.00   sec  44.3 MBytes   371 Mbits/sec    0    347 KBytes
[  5]   9.00-10.00  sec  43.9 MBytes   368 Mbits/sec    0    347 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   441 MBytes   370 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   440 MBytes   369 Mbits/sec                  receiver

iperf Done.
root@OpenWrt:~# iperf3 -c 10.9.0.2 -R
Connecting to host 10.9.0.2, port 5201
Reverse mode, remote host 10.9.0.2 is sending
[  5] local 10.9.0.1 port 60068 connected to 10.9.0.2 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  16.8 MBytes   141 Mbits/sec
[  5]   1.00-2.00   sec  13.7 MBytes   115 Mbits/sec
[  5]   2.00-3.00   sec  19.7 MBytes   165 Mbits/sec
[  5]   3.00-4.00   sec  19.7 MBytes   165 Mbits/sec
[  5]   4.00-5.00   sec  19.5 MBytes   163 Mbits/sec
[  5]   5.00-6.00   sec  19.8 MBytes   166 Mbits/sec
[  5]   6.00-7.00   sec  19.8 MBytes   166 Mbits/sec
[  5]   7.00-8.00   sec  19.5 MBytes   164 Mbits/sec
[  5]   8.00-9.00   sec  19.6 MBytes   165 Mbits/sec
[  5]   9.00-10.00  sec  19.8 MBytes   166 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec   188 MBytes   158 Mbits/sec                  sender
[  5]   0.00-10.00  sec   188 MBytes   158 Mbits/sec                  receiver

iperf Done.

If you go the older way now, with far fewer crypto_enable/disable calls, the CPU should not stall anymore, and it may even give a further performance boost. Consider also my last patch, which replaces a few crypto_xor_cpy calls. As for the SDK: if you look for it, you can find it on GitHub. You can also find these files in my own sources.

I see it in commit 5220861fe631d8fd818c3b37d70b7c01b57dd9df. However, it seems that I cannot run traffic with ovpn-dco after adding this change. Can you reproduce this?

Do you mean "octeon-aes-gcm.c before the stall fix + no_irq_save"? If yes, I tried this and still saw the stall.

[398765.109027] INFO: rcu_sched self-detected stall on CPU
[398765.114274]         0-...: (5999 ticks this GP) idle=8e6/140000000000001/0 softirq=2455842/245
[398765.123591]          (t=6000 jiffies g=577440 c=577439 q=40)
[398765.128740] NMI backtrace for cpu 0
[398765.132324] CPU: 0 PID: 30402 Comm: kworker/0:2 Tainted: P                4.14.76 #0
[398765.140174] Workqueue: ovpn-crypto-wq-tun0 ovpn_decrypt_work [ovpn_dco_v2]
[398765.147145] Stack : ffffffff83798288 0000000014001ce0 22773746ac1f384c 22773746ac1f384c
[398765.155255]         0000000000000000 8000000000323a90 ffffffff83798f00 0000000000000000
[398765.163362]         000000000000054e 0000000000000007 0000000000000000 712d74756e30206f
[398765.171469]         0000000000000000 0000000000000000 0000000000000010 ffffffff81180000
[398765.179577]         0000000000000000 0000000000000000 ffffffff811f0000 0000000000000000
[398765.187684]         ffffffff81180000 ffffffff81186920 0000000000000000 0000000000000000
[398765.195791]         0000000000009ca0 ffffffff80bf1b10 0000000000000000 1e00000000ab7347
[398765.203898]         800000002a260000 8000000000323a90 ffffffff81170000 ffffffff80f4df90
[398765.212006]         0000000000000000 0000000000000007 0000000000000000 712d74756e30206f
[398765.220113]         0000000000000000 ffffffff8087197c ffffffffc0f13230 c00000fefffdb900
[398765.228220]         ...
[398765.230759] Call Trace:
[398765.233300] [<ffffffff8087197c>] show_stack+0x64/0x108
[398765.238541] [<ffffffff80f4df90>] dump_stack+0x90/0xd0
[398765.243689] [<ffffffff80f54db0>] nmi_cpu_backtrace+0xe0/0x108
[398765.249531] [<ffffffff80f54ea8>] nmi_trigger_cpumask_backtrace+0xd0/0x178
[398765.256418] [<ffffffff80925af8>] rcu_dump_cpu_stacks+0xbc/0x128
[398765.262439] [<ffffffff808ec4e0>] rcu_check_callbacks+0x2e8/0x7e0
[398765.268543] [<ffffffff808efb2c>] update_process_times+0x34/0x70
[398765.274562] [<ffffffff808feb80>] tick_sched_timer+0x170/0x1d8
[398765.280404] [<ffffffff808f0778>] __hrtimer_run_queues+0xd8/0x1b0
[398765.286507] [<ffffffff808f09cc>] hrtimer_interrupt+0xd4/0x288
[398765.292349] [<ffffffff80874e58>] c0_compare_interrupt+0x80/0x98
[398765.298368] [<ffffffff808dcbe0>] __handle_irq_event_percpu+0x78/0x188
[398765.304905] [<ffffffff808dcd10>] handle_irq_event_percpu+0x20/0x68
[398765.311181] [<ffffffff808e17e0>] handle_percpu_irq+0x80/0xb0
[398765.316937] [<ffffffff808dbc3c>] generic_handle_irq+0x2c/0x48
[398765.322783] [<ffffffff80f692ec>] do_IRQ+0x1c/0x30
[398765.327585] [<ffffffff80807b68>] plat_irq_dispatch+0xe8/0x110
[398765.333430] [<ffffffff8086c5ac>] handle_int+0x14c/0x158
[398765.338750] [<ffffffff808767cc>] cnmips_cu2_call+0x7c/0xa0
[398765.344334] [<ffffffff808b3240>] notifier_call_chain+0x50/0xa8
[398765.350263] [<ffffffff8086d274>] handle_cpu_int+0x24/0x30

Curious, because that means that IRQs are still blocked. Are you sure that you did it all correctly?

Yes, it should all be correct according to the output below:

 % grep octeon_crypto_ octeon-aes-gcm.c
                flags = octeon_crypto_enable_no_irq_save(&state);
                octeon_crypto_disable_no_irq_save(&state, flags);
                flags = octeon_crypto_enable_no_irq_save(&state);
                octeon_crypto_disable_no_irq_save(&state, flags);
        flags = octeon_crypto_enable_no_irq_save(&state);
        octeon_crypto_disable_no_irq_save(&state, flags);
        flags = octeon_crypto_enable_no_irq_save(&state);
                octeon_crypto_disable_no_irq_save(&state, flags);
        octeon_crypto_disable_no_irq_save(&state, flags);
        flags = octeon_crypto_enable_no_irq_save(&state);
                octeon_crypto_disable_no_irq_save(&state, flags);
        octeon_crypto_disable_no_irq_save(&state, flags);

Did you remove preempt_disable/enable?

The point is that this stall message is raised when the timer interrupt did not fire within the specified time. This can be a total overload of the CPU, or the IRQ is blocked. If you use HZ = 100, this is 10 ms. So the question would also be how big the encryption block is; it must be very big if this timing cannot be guaranteed. And no_irq_save etc. just saves the state; we only relocated the call position. It's curious.
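To put the numbers above into time units, here is a small conversion sketch. It assumes HZ = 100 as in the post; the HZ value of the kernel that produced the log is not stated in this thread. The helper names `tick_period_us` and `jiffies_to_ms` are made up for this userspace example.

```c
/* Length of one timer tick in microseconds for a given HZ (assumed value). */
unsigned long tick_period_us(unsigned long hz)
{
    return 1000000UL / hz;
}

/* Convert a jiffies count to milliseconds for a given HZ (assumed value). */
unsigned long long jiffies_to_ms(unsigned long long jiffies, unsigned long hz)
{
    return jiffies * 1000ULL / hz;
}
```

Under that assumption, `tick_period_us(100)` is 10000 (the 10 ms mentioned above), and the `t=6000 jiffies` reported in the stall message corresponds to `jiffies_to_ms(6000, 100)`, i.e. 60000 ms.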

Yes. Here is the code.

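/*
 * Enable COP2 without disabling IRQs or preemption.
 * Callers must make sure this never runs concurrently from IRQ context.
 */
unsigned long octeon_crypto_enable_no_irq_save(struct octeon_cop2_state *state)
{
        int status;

        status = read_c0_status();
        /* Turn on coprocessor 2 (the crypto unit) in the CP0 Status register. */
        write_c0_status(status | ST0_CU2);
        if (KSTK_STATUS(current) & ST0_CU2) {
                /* The current task owns COP2 state: save it into the task. */
                octeon_cop2_save(&(current->thread.cp2));
                KSTK_STATUS(current) &= ~ST0_CU2;
                status &= ~ST0_CU2;
        } else if (status & ST0_CU2) {
                /* COP2 was already enabled in the kernel: save into the caller's buffer. */
                octeon_cop2_save(state);
        }
        /* Non-zero tells the disable call to restore from `state`. */
        return status & ST0_CU2;
}
EXPORT_SYMBOL_GPL(octeon_crypto_enable_no_irq_save);

void octeon_crypto_disable_no_irq_save(struct octeon_cop2_state *state,
                                       unsigned long crypto_flags)
{
        if (crypto_flags & ST0_CU2)
                /* Put back the state saved by the enable call. */
                octeon_cop2_restore(state);
        else
                /* Nobody else was using COP2: switch it off again. */
                write_c0_status(read_c0_status() & ~ST0_CU2);
}
EXPORT_SYMBOL_GPL(octeon_crypto_disable_no_irq_save);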
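The three paths through the two functions above can be traced with a small userspace model. This is only an illustration of the branching logic, not kernel code: `struct cop2_model`, `model_enable` and `model_disable` are hypothetical, the register and the per-task status are plain integers, and the save/restore calls are replaced by counters. The `ST0_CU2` value is the one used in the Linux MIPS headers.

```c
#include <stdint.h>

#define ST0_CU2 0x40000000u   /* coprocessor 2 usable bit in CP0 Status */

/* Hypothetical stand-in for the hardware and task state involved. */
struct cop2_model {
    uint32_t cpu_status;        /* models the CP0 Status register */
    uint32_t task_status;       /* models KSTK_STATUS(current) */
    int saved_to_task;          /* models octeon_cop2_save(&current->thread.cp2) */
    int saved_to_caller;        /* models octeon_cop2_save(state) */
    int restored_from_caller;   /* models octeon_cop2_restore(state) */
};

/* Same branching as the enable function, acting on the model. */
uint32_t model_enable(struct cop2_model *m)
{
    uint32_t status = m->cpu_status;

    m->cpu_status = status | ST0_CU2;
    if (m->task_status & ST0_CU2) {
        m->saved_to_task++;
        m->task_status &= ~ST0_CU2;
        status &= ~ST0_CU2;
    } else if (status & ST0_CU2) {
        m->saved_to_caller++;
    }
    return status & ST0_CU2;
}

/* Same branching as the disable function, acting on the model. */
void model_disable(struct cop2_model *m, uint32_t crypto_flags)
{
    if (crypto_flags & ST0_CU2)
        m->restored_from_caller++;
    else
        m->cpu_status &= ~ST0_CU2;
}
```

Starting from an all-zero model, enable returns 0 and disable clears the bit again. With `task_status` holding `ST0_CU2`, the state is saved into the task and nothing is restored on disable. With only `cpu_status` holding `ST0_CU2`, the caller's buffer is used and the bit stays set after disable.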

Then we may have to live with the working variant, unless you still want to play with it to find the cause or a better solution. The best start is maybe to add some printks to find out the requested block size for encryption, and then optimize the code to do it in the right size of chunks. They should not be too small, for performance reasons of course, and not too big, to avoid the kernel warning. The warning itself is not really critical, but it will definitely create heavy jitter on the system clock.
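One rough way to pick a chunk size in that "not too small, not too big" range is to start from a time budget per chunk and a measured throughput. The helper below is a hypothetical sketch: `max_chunk_bytes` and `AES_BLOCK_BYTES` are made-up names, and both the throughput and the budget are inputs you would have to measure or choose yourself.

```c
#include <stdint.h>

#define AES_BLOCK_BYTES 16u

/*
 * Largest chunk, rounded down to a whole number of AES blocks, that can be
 * processed within `budget_us` microseconds at `bytes_per_sec`.
 * Note: the product must fit in 64 bits, which holds for realistic inputs.
 */
uint64_t max_chunk_bytes(uint64_t bytes_per_sec, uint64_t budget_us)
{
    uint64_t raw = bytes_per_sec * budget_us / 1000000u;

    return raw - (raw % AES_BLOCK_BYTES);
}
```

For example, with a placeholder throughput of 400000000 bytes per second and a 1000 microsecond budget, `max_chunk_bytes` returns 400000. The result is an upper bound only; the sweet spot for performance has to be found by testing.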

Will check the cause later when I have more free time.

How about this? I saw you updated some xor-related code after 5220861fe, and I am not sure if it's done.