SMB & NFS speed

I have an Archer C7 v2 running the OpenWrt build from https://github.com/gwlim/openwrt-sfe-flowoffload/tree/master/MAR-2020 (because in the past I had very unreliable WiFi with stock OpenWrt).

I have now added a USB hard drive (ext4) to the router; read and write speed is likely limited by USB 2 (ca. 25 MiB/s for both).
When setting up the drive for network access I initially used NFS (server version 2.3.4-3) because I read that it is lighter on the CPU and recommended if no Windows system is involved.
Speed is 13 MiB/s read and 15 MiB/s write, and it seems to be limited by the CPU (about 30% usage by ksoftirqd, and most of the remainder up to 100% by the 8 nfsd processes). Options like version, async, wsize and rsize don't change this.
I then tried Samba (server version 3.6.25-14) and got 14 MiB/s read and 21 MiB/s write when mounting with vers=1.0. The CPU does not seem to be the limiting factor here (read speed is limited by WiFi speed; in iperf3 I get 150-170 Mbit/s from router to client).
Using SMB version 2 gives 11 MiB/s read and 16 MiB/s write (definitely CPU limited, but still about as fast as NFS).
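
To compare the iperf3 figure with the file transfer rates above, the units need converting. A minimal sketch, assuming iperf3 reports decimal megabits per second (1 Mbit = 1,000,000 bits) and ignoring all protocol overhead; the function name is only illustrative:

```python
def mbit_to_mib_per_s(mbit_per_s: float) -> float:
    """Convert a decimal Mbit/s rate to MiB/s (1 MiB = 1024 * 1024 bytes)."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8
    return bytes_per_s / (1024 * 1024)


# The two ends of the iperf3 range measured from router to client
for rate in (150, 170):
    print(f"{rate} Mbit/s = {mbit_to_mib_per_s(rate):.1f} MiB/s")
```

So the raw WiFi ceiling is roughly 18-20 MiB/s before any SMB or NFS overhead, which fits the 14 MiB/s read being network bound rather than CPU bound.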

So my question is: why is NFS slower and heavier on the CPU? Is this normal, or can I improve it somehow?
I am OK with the speeds of SMB v1, but I would like to avoid it, as I have read that it is rather insecure (although this might not be relevant for a private network).

Update:
vsftpd read speed is the same as SMB 1; write is 20 MiB/s, with vsftpd using somewhat more CPU than smbd when writing (62% smbd, 68% vsftpd, tested several times).
So I will not get more than that, but I still don't understand why NFS is considerably slower...

NFS speed increases to SMB levels when I reduce the number of NFS threads.
Why are more threads so much slower, and can it cause any problems if I set the number to 2 or 3 instead of 8?

perf is your friend (not easy, but it has all the answers you need)...

All hardware is different (and likely so are the software, filesystem, protocol, option and OS scheduler setups). Rather than trying to understand each setting, it is best to:

  • start with the most common setting
  • benchmark it, changing it a few steps each time
  • settle on something quasi-optimal
  • rinse and repeat for another config parameter
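
The benchmark step in the loop above could be scripted on the client roughly like this. It is only a sketch under assumptions: the mount point `/mnt/router/bench.tmp`, the 256 MiB test size and the function names are all made up for illustration, and the read pass may be served from the client's page cache unless the share is remounted (or the caches dropped) between the write and the read.

```python
import os
import time

CHUNK = 1024 * 1024  # 1 MiB per write/read call


def write_speed(path, size_mib):
    """Write size_mib MiB to path and return the throughput in MiB/s."""
    block = os.urandom(CHUNK)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mib):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # count the time needed to flush the data out
    return size_mib / (time.perf_counter() - start)


def read_speed(path):
    """Read path to the end and return the throughput in MiB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            total += len(data)
    return total / CHUNK / (time.perf_counter() - start)


if __name__ == "__main__":
    # Hypothetical path on the mounted share; adjust to your own mount point.
    target = "/mnt/router/bench.tmp"
    for run in range(3):
        w = write_speed(target, 256)
        r = read_speed(target)
        print(f"run {run + 1}: write {w:.1f} MiB/s, read {r:.1f} MiB/s")
    os.remove(target)
```

Run it once per setting (for example per NFS thread count or per SMB protocol version), change one parameter, and run it again, so each number can be compared against the previous one.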

The higher the software level, the easier it is to find simple answers...