R2S: openssl gcm performance

I am using the official 22.03 R2S image and installed openssl with opkg. Why is GCM performance so much worse than CBC? Does the R2S use the ARMv8 crypto extensions to speed up CBC in openssl? @SvenH and @jayanta525, any ideas? Thanks.
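
One way to approach the crypto-extension question is to look at which CPU features the kernel reports. The following is a minimal sketch, assuming a Linux system whose /proc/cpuinfo has an arm64-style "Features" line; `has_feature` is a hypothetical helper written for this example, not part of any tool. On arm64, `aes` is the AES instruction set and `pmull` is the polynomial multiply that an accelerated GCM (GHASH) implementation relies on.

```sh
#!/bin/sh
# has_feature FEATURES NAME
# Succeeds if NAME appears as a whole word in the space-separated FEATURES string.
# (Hypothetical helper for illustration.)
has_feature() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Take the first "Features" line, if any; empty on systems without one.
features=$(grep -m1 '^Features' /proc/cpuinfo 2>/dev/null | cut -d: -f2 || true)

for f in aes pmull sha1 sha2 crc32; do
    if has_feature "$features" "$f"; then
        echo "$f: yes"
    else
        echo "$f: no"
    fi
done
```

If both `aes` and `pmull` show up, the hardware side is in place, and a slow GCM result would point more towards how the library was built than towards the CPU.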

root@OpenWrt:/# openssl speed -evp aes-128-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.                                                                                                                                                  
Doing aes-128-cbc for 3s on 16 size blocks: 17670608 aes-128-cbc's in 3.00s                                                                                                                                        
Doing aes-128-cbc for 3s on 64 size blocks: 13704145 aes-128-cbc's in 3.00s                                                                                                                                        
Doing aes-128-cbc for 3s on 256 size blocks: 7113068 aes-128-cbc's in 3.00s                                                                                                                                        
Doing aes-128-cbc for 3s on 1024 size blocks: 2496547 aes-128-cbc's in 3.00s                                                                                                                                       
Doing aes-128-cbc for 3s on 8192 size blocks: 354119 aes-128-cbc's in 3.00s                                                                                                                                        
Doing aes-128-cbc for 3s on 16384 size blocks: 178708 aes-128-cbc's in 3.00s                                                                                                                                       
OpenSSL 1.1.1q  5 Jul 2022                                                                                                                                                                                         
built on: Thu Oct 13 13:10:56 2022 UTC                                                                                                                                                                             
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)                                                                                                                                                    
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=generic -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result T
The 'numbers' are in 1000s of bytes per second processed.                                                                                                                                                          
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes                                                                                                                         
aes-128-cbc      94243.24k   292355.09k   606981.80k   852154.71k   966980.95k   975983.96k                                                                                                                        

root@OpenWrt:/# openssl speed -evp aes-128-gcm -elapsed                                                                                                                                                            
You have chosen to measure elapsed time instead of user CPU time.                                                                                                                                                  
Doing aes-128-gcm for 3s on 16 size blocks: 6591430 aes-128-gcm's in 3.00s                                                                                                                                         
Doing aes-128-gcm for 3s on 64 size blocks: 2176086 aes-128-gcm's in 3.00s                                                                                                                                         
Doing aes-128-gcm for 3s on 256 size blocks: 602401 aes-128-gcm's in 3.00s                                                                                                                                         
Doing aes-128-gcm for 3s on 1024 size blocks: 155371 aes-128-gcm's in 3.00s                                                                                                                                        
Doing aes-128-gcm for 3s on 8192 size blocks: 19669 aes-128-gcm's in 3.00s                                                                                                                                         
Doing aes-128-gcm for 3s on 16384 size blocks: 9840 aes-128-gcm's in 3.00s                                                                                                                                         
OpenSSL 1.1.1q  5 Jul 2022
built on: Thu Oct 13 13:10:56 2022 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr) 
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=generic -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result T
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-gcm      35154.29k    46423.17k    51404.89k    53033.30k    53709.48k    53739.52k
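
The summary rows can be re-derived from the raw counts, which also makes the CBC/GCM gap on this build concrete. This is a small sketch using awk; `throughput_k` and `ratio` are hypothetical helper names, and the inputs are copied from the two runs above (3.00s per test).

```sh
#!/bin/sh
# throughput_k COUNT BLOCK_BYTES SECONDS
# Prints throughput in thousands of bytes per second, as openssl speed reports it.
throughput_k() {
    awk -v n="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", n * b / s / 1000 }'
}

# ratio A B
# Prints A divided by B with one decimal.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

throughput_k 178708 16384 3    # aes-128-cbc, 16384-byte blocks
throughput_k 9840 16384 3      # aes-128-gcm, 16384-byte blocks
ratio 975983.96 53739.52       # how many times faster CBC is at 16384 bytes
```

The first two calls reproduce the 16384-byte column of each table, and the last shows CBC running roughly 18 times faster than GCM at that block size on the official image.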

root@OpenWrt:~# openssl speed -evp aes-128-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 18254269 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 14234734 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 7252691 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2516185 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 354435 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 178608 aes-128-cbc's in 3.00s
OpenSSL 1.1.1q  5 Jul 2022
built on: Tue Oct 18 18:13:41 2022 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -pipe -mcpu=cortex-a53+crypto+crc -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -O3 -DPIC -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_PREFER_CHACHA_OVER_GCM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc      97356.10k   303674.33k   618896.30k   858857.81k   967843.84k   975437.82k

root@OpenWrt:~# openssl speed -evp aes-128-gcm -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-gcm for 3s on 16 size blocks: 12316276 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 64 size blocks: 8990693 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 256 size blocks: 4391512 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 1024 size blocks: 1453294 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 8192 size blocks: 199177 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 16384 size blocks: 99966 aes-128-gcm's in 3.00s
OpenSSL 1.1.1q  5 Jul 2022
built on: Tue Oct 18 18:13:41 2022 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc -fPIC -pthread -Wa,--noexecstack -Wall -O3 -pipe -mcpu=cortex-a53+crypto+crc -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -O3 -DPIC -fPIC -ffunction-sections -fdata-sections -znow -zrelro -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -DOPENSSL_PREFER_CHACHA_OVER_GCM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-gcm      65686.81k   191801.45k   374742.36k   496057.69k   543885.99k   545947.65k
root@OpenWrt:~#
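
To compare the two builds column by column, the summary rows can be divided against each other. A minimal sketch, with `speedups` as a hypothetical helper and the rows copied from the outputs above with the trailing "k" removed:

```sh
#!/bin/sh
# speedups OLD_ROW NEW_ROW
# Prints new/old for each column of two space-separated rows, one decimal each.
speedups() {
    awk -v old="$1" -v new="$2" 'BEGIN {
        n = split(old, o, " ")
        split(new, w, " ")
        for (i = 1; i <= n; i++)
            printf "%s%.1f", (i > 1 ? " " : ""), w[i] / o[i]
        printf "\n"
    }'
}

# Columns: 16, 64, 256, 1024, 8192, 16384 bytes
gcm_old="35154.29 46423.17 51404.89 53033.30 53709.48 53739.52"
gcm_new="65686.81 191801.45 374742.36 496057.69 543885.99 545947.65"
cbc_old="94243.24 292355.09 606981.80 852154.71 966980.95 975983.96"
cbc_new="97356.10 303674.33 618896.30 858857.81 967843.84 975437.82"

echo "gcm: $(speedups "$gcm_old" "$gcm_new")"
echo "cbc: $(speedups "$cbc_old" "$cbc_new")"
```

CBC is essentially unchanged between the two builds, while GCM goes from about 2x faster at 16 bytes to about 10x faster at 16384 bytes.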

Did you add any patches to optimize GCM? Thanks.

My builds use CPU-specific optimization for each device; for the R2S that is -mcpu=cortex-a53+crypto+crc.
That is most probably the reason, but I didn't check in detail. All the code to reproduce the builds is available in my repo.
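
For anyone who wants to check in more detail, a first step could be to list which compiler flags differ between the two builds. The sketch below compares only the part of the "compiler:" lines that is visible for both builds in this thread (the official image's line is cut off, so anything after that point is not compared); `flags_only_in` is a hypothetical helper.

```sh
#!/bin/sh
# flags_only_in A B
# Prints each whitespace-separated word of A that does not appear in B.
flags_only_in() {
    for flag in $1; do
        case " $2 " in
            *" $flag "*) ;;
            *) printf '%s\n' "$flag" ;;
        esac
    done
}

# Visible prefix of the official image's flags (the rest is truncated in the paste).
official='-fPIC -pthread -Wa,--noexecstack -Wall -O3 -Os -pipe -mcpu=generic -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result'
# Same span of flags from the custom build.
custom='-fPIC -pthread -Wa,--noexecstack -Wall -O3 -pipe -mcpu=cortex-a53+crypto+crc -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result'

echo "only in official:"; flags_only_in "$official" "$custom"
echo "only in custom:";   flags_only_in "$custom" "$official"
```

Within that visible span, the official build differs by -Os and -mcpu=generic, and the custom build by -mcpu=cortex-a53+crypto+crc. This narrows down the candidates but does not by itself show which flag accounts for the GCM difference.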