Netgear R7800 exploration (IPQ8065, QCA9984)

Have you got the firmware loading onto the NSS cores? The firmware is at https://github.com/qca/nss-firmware, and if you look at the stock Netgear firmware (or the QCA-NSS repos) there's an init script that loads it into the CPU.

Looks like you're getting closer than I ever did, well done!

Here's a result from a ping test:

--- 8.8.8.8 ping statistics ---
272 packets transmitted, 272 packets received, 0% packet loss
round-trip min/avg/max = 10.267/12.586/88.540 ms

Most of the pings are around 10-12 ms, but now and then a single result is around 90 ms.
I see this in several ping tests.
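
To see how often the spikes occur, rather than only the max in the summary, the individual replies can be filtered by round-trip time. This is only a rough sketch: it assumes ping prints each reply with a `time=<value> ms` field, and the function name `ping_spikes` and the 50 ms default threshold are made up for illustration.

```shell
# Print ping replies whose round-trip time exceeds a threshold (in ms).
# Reads ping output on stdin; the threshold defaults to 50 (arbitrary choice).
ping_spikes() {
	awk -v limit="${1:-50}" '
		/time=/ {
			# Isolate the text after "time=" and convert it to a number.
			t = $0
			sub(/.*time=/, "", t)
			t += 0
			if (t > limit) { print; n++ }
		}
		END { printf "%d replies above %s ms\n", n, limit }
	'
}

# Usage: ping -c 300 8.8.8.8 | ping_spikes 50
```

Comparing the sequence numbers of the printed replies gives a rough idea of the interval between spikes.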

This is my interrupt-list:

Router1:~# cat /proc/interrupts
           CPU0       CPU1
 16:     419823     465146     GIC-0  18 Edge      gp_timer
 18:      15693          0     GIC-0  51 Edge      qcom_rpm_ack
 19:          0          0     GIC-0  53 Edge      qcom_rpm_err
 20:          0          0     GIC-0  54 Edge      qcom_rpm_wakeup
 28:          2          0   msmgpio   6 Edge      gpio-keys
 76:          2          0   msmgpio  54 Edge      gpio-keys
 87:          2          0   msmgpio  65 Edge      gpio-keys
 95:          0          0     GIC-0 241 Edge      ahci[29000000.sata]
 96:          0          0     GIC-0 210 Edge      tsens_interrupt
 97:     105522     274623     GIC-0  67 Edge      qcom-pcie-msi
 98:     250555     402858     GIC-0  89 Edge      qcom-pcie-msi
 99:     257587          0     GIC-0 202 Edge      adm_dma
100:     124952     333924     GIC-0 255 Level     eth0
101:      59543     361456     GIC-0 258 Level     eth1
102:          0          0     GIC-0 130 Level     bam_dma
103:          0          0     GIC-0 128 Level     bam_dma
104:          0          0   PCI-MSI   0 Edge      aerdrv
105:     105522     274623   PCI-MSI   1 Edge      ath10k_pci
137:          0          0   PCI-MSI   0 Edge      aerdrv
138:     250555     402858   PCI-MSI   1 Edge      ath10k_pci
170:         12          0     GIC-0 184 Level     msm_serial0
171:          0          0     GIC-0 142 Level     xhci-hcd:usb1
172:          0          0     GIC-0 237 Level     xhci-hcd:usb3
IPI0:          0          0  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:     302365       3904  Rescheduling interrupts
IPI3:          0     195006  Function call interrupts
IPI4:          0          0  CPU stop interrupts
IPI5:      60991      10080  IRQ work interrupts
IPI6:          0          0  completion interrupts

The interrupts "gp_timer" and 2x "ath10k_pci" are also active on CPU1.
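
For anyone who wants to compare the distribution without reading the raw table, here is a small sketch that summarises it. It assumes the two-CPU column layout shown in the listing above; the function name `irq_share` is hypothetical, and the optional file argument is only there so it can be tried on a saved copy of the output.

```shell
# Show, for each numbered IRQ, how its interrupts are split between the CPUs.
# Assumes a two-CPU layout: field 2 = CPU0 count, field 3 = CPU1 count.
irq_share() {
	awk '
		$1 ~ /^[0-9]+:$/ {
			total = $2 + $3
			if (total == 0) next    # skip interrupts that never fired
			printf "%s: CPU0=%d CPU1=%d (%d%% on CPU1)\n", $NF, $2, $3, ($3 * 100) / total
		}
	' "${1:-/proc/interrupts}"
}
```

The IPI lines are skipped on purpose, since their first field is not a plain IRQ number.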

I also took a quick look at the individual interrupts and noticed that some of them have an smp_affinity of "3". What does that mean? Are there 4 CPUs?

It was an adaptation of the @dissent1 script. Yeah, I forgot to change those lines; I have since fixed them but also forgot to post the result here. Sorry! Nice, @bouwew, that you got to the point :slight_smile:

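Affined with both cores. The value is a bitmask in which each bit is a flag for one core:

01 (value 1) = first core only
10 (value 2) = second core only
11 (value 3) = both cores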

It means it can be assigned to any core, 1 or 2. :slight_smile:
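
To make the bitmask reading concrete, here is a minimal sketch that turns an smp_affinity value into a list of CPUs. The function name `affinity_to_cpus` is made up for this example, and it only handles a single hexadecimal word (no comma-separated masks as found on systems with many CPUs).

```shell
# Decode a hexadecimal smp_affinity mask into the CPUs it allows.
# Bit 0 stands for CPU0, bit 1 for CPU1, and so on.
affinity_to_cpus() {
	mask=$((0x$1))
	cpu=0
	cpus=""
	while [ "$mask" -gt 0 ]; do
		if [ $((mask & 1)) -eq 1 ]; then
			# Append this CPU, adding a separating space after the first entry.
			cpus="$cpus${cpus:+ }CPU$cpu"
		fi
		mask=$((mask >> 1))
		cpu=$((cpu + 1))
	done
	echo "$cpus"
}
```

On the router this could be fed from a real entry, for example `affinity_to_cpus "$(cat /proc/irq/100/smp_affinity)"` for the eth0 interrupt in the listing above.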

If I have time, I'll try to create a service script for the irqbalance --oneshot option, so we can run it from the web front end and add it as a service like set_cpu_affinity.

Here is the script, corrected with the labels eth0 and eth1:

#!/bin/sh /etc/rc.common
# First start irqbalance with the --oneshot option
# Then manually pin both eth interrupts and wifi0 to core2 if they are not balanced correctly
# System -> startup -> Local Startup
# /usr/sbin/irqbalance --oneshot --debug > /var/log/irqbalance.log
#  /etc/init.d/set_cpu_affinity

START=99

set_irq_affinity() {
	local name="$1"
	local val="$2"
	local irq

	# Look up the IRQ number for the given label in /proc/interrupts.
	case "$name" in
	wifi0)
		# First wifi-related interrupt line
		irq=$(grep -E -m1 'ath10k_ahb|qcom-pcie-msi' /proc/interrupts | cut -d: -f1 | tail -n1 | tr -d ' ')
		;;
	wifi1)
		# Second wifi-related interrupt line
		irq=$(grep -E -m2 'ath10k_ahb|qcom-pcie-msi' /proc/interrupts | cut -d: -f1 | tail -n1 | tr -d ' ')
		;;
	eth0|eth1)
		irq=$(grep -m3 "$name" /proc/interrupts | cut -d: -f1 | tail -n1 | tr -d ' ')
		;;
	*)
		irq=$(grep -m1 "$name" /proc/interrupts | cut -d: -f1 | tr -d ' ')
		;;
	esac

	# Stop here if nothing matched, instead of writing to a bogus path.
	[ -n "$irq" ] || { echo "$name irq not found."; return 1; }
	echo "$val" > "/proc/irq/$irq/smp_affinity"
}

start() {
	. /lib/functions.sh

	set_irq_affinity eth0 2
	set_irq_affinity eth1 2
	set_irq_affinity wifi0 2
}

So, it did not help. Are you using master or 17.01? My tests are all using master and I used @hnyman's build.

Does cat /proc/cmdline show the parameter added?
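
A quick way to check this without reading the whole line by eye is sketched below. The helper name `cmdline_has` is made up, and the optional second argument (a file path) exists only so the function can be tried against a saved copy; `isolcpus`, the parameter mentioned later in this thread, is used as the example.

```shell
# Return success if the kernel command line contains the given parameter,
# either as a bare word (isolcpus) or with a value (isolcpus=1).
cmdline_has() {
	for arg in $(cat "${2:-/proc/cmdline}"); do
		case "$arg" in
			"$1"|"$1"=*) return 0 ;;
		esac
	done
	return 1
}

# Usage: cmdline_has isolcpus && echo "isolcpus is set"
```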

Yes, the firmware loaded. I extracted the firmware binaries from the latest Netgear firmware image. But the driver doesn't seem to work right, and I'm still trying to find out what's wrong. Also, I only managed to find the device tree config for the first core. The second core is still not active.

Sounds like you're making excellent progress - from memory it's a tough slog getting it all working, and it's going to be a pain keeping it all up to date, but you're doing impressive work - and I'll be tracking it closely!

I've committed my changes here:

If you're interested, do try it out. I didn't commit the firmware binaries because I thought I didn't have the rights to post them online, until I re-read the README at the link you posted earlier.

You could add a script as part of the build root setup to automatically clone them from that GitHub repo.
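
Something along these lines might do as a starting point. It is only a sketch: the repository URL is the one posted earlier in the thread, while the `NSS_FW_DIR` variable, its default path and the `DRY_RUN` switch are hypothetical and would need adapting to the actual build root layout. Paths containing spaces are not handled.

```shell
# Fetch (or update) the NSS firmware repository in a local directory.
# NSS_FW_DIR is a hypothetical location; set DRY_RUN=1 to only print the command.
fetch_nss_firmware() {
	repo="https://github.com/qca/nss-firmware"
	dest="${NSS_FW_DIR:-dl/nss-firmware}"
	if [ -d "$dest/.git" ]; then
		# Already cloned: just fast-forward to the latest commit.
		cmd="git -C $dest pull --ff-only"
	else
		cmd="git clone --depth 1 $repo $dest"
	fi
	if [ "${DRY_RUN:-0}" = 1 ]; then
		echo "$cmd"
	else
		$cmd
	fi
}
```

Running it once with `DRY_RUN=1` shows which git command would be executed before anything is downloaded.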

I'm still testing the driver at this stage, so I'm currently copying the files manually into my router's overlay partition. I'll automate it once it's ready for use, although I don't know how long that will take. :grimacing:

@fantom-x
I'm using master, it's a mix of hnyman's and escalade's builds. And, yes, isolcpus=1 is active.
Now that I come to think of it, escalade's build includes 2 sets of updates, made by dissent1, that are still present as pull-requests to master: https://github.com/openwrt/openwrt/pull/669 and https://github.com/openwrt/openwrt/pull/632
Maybe they are affecting my results somehow.
More experimenting to do during next weekend :slight_smile:

@hnyman, @lesandie, thanks for clarifying the smp_affinity numbers!

@hnyman

BTW, any chance (and time left also :-D) that you can update the irqbalance package from version 1.2 to 1.3? If not, I'll try to do it myself, as the new version includes some optimization of platform device IRQ detection.

No, I will not be doing that upgrade. irqbalance currently requires the external glib2 library, which is large, so the version upgrade would increase the installed size quite a lot.
See https://github.com/Irqbalance/irqbalance/issues/40

(They also broke the build without the GUI, see https://github.com/Irqbalance/irqbalance/issues/41, but I patched that away by reverting their commits for the current version.)

Having full glib2 as dependency would increase size (with dependencies) by a megabyte.
One option might be to statically link the needed glib2 library parts, but I have not tried that.

What use case can irqbalance solve?

Yeah, I read about the glib2 dummy on the irqbalance website. If I have time, I'll try to compile it statically, since that only adds a couple of KB.

But, as @fantom-x is asking

What use case can irqbalance solve?

I can agree with the probable answer to that question: none :smiley:

Better to do it manually with the affinity script; at least I get better results that way. The spikes went down but are still there, although they were never really annoying.

I'll try @quarky's approach with the NSS driver.

While I was testing the NSS drivers I thought I'd simulate the latency issue. There seems to be a correlation between the spikes and wireless network activity. Try disabling both wireless interfaces and see if you still see the spikes.

As for the NSS drivers, I've made a wee bit more progress, I think. I found the 2nd NSS core's details in Netgear's firmware source code, though I'm still trying to work out the exact values by trial and error. The NSS driver seems to be able to offload WiFi traffic as well, so if the latency spikes are linked to WiFi, the NSS driver may help reduce them.

I still experience the spikes with both the 5GHz and 2.4GHz networks disabled. My spikes are relatively infrequent: one ~80ms spike every 3-4 minutes maybe.

As soon as I get an mvebu unit (WRT1900ACS), I'll replace the R7800 with it and help with the NSS driver testing :smiley: