FriendlyARM NanoPi R2S crashes with new firmware

The NanoPi R2S had worked fine ever since July 2021. I compiled new firmware from time to time, flashed it, and it worked.

Until I tried v23.05.3 earlier this year. The router started to crash. It became completely unresponsive. No response to ping, no internet, nothing. Turns out it crashed after approx. 24 days. Only a power cycle gets it running again.

I decided to go back to the last working firmware: 22.03.3.

A few days ago I decided to give it another try, with 22.03.7 this time. I compiled new firmware, flashed it, and it crashed today, after only two days.

What could the problem be? How do I find out why it crashes if it is impossible to even log in?

Can you flash the standard firmware from firmware-selector.openwrt.org? 23.05.5 and maybe 24.10-rcX?
I used to have one; it ran extremely hot in the semi-closed locker it had to be in.

That is a bit difficult because it is running as the router / gateway for about 20 machines / devices.

I've had collectd running for years and the only long-term changes it shows correspond to changes in ambient temperature. The NanoPi is tied to the metal frame of a hallway wall unit that works as a huge heat sink :joy:.

Lucky you... Try to find a thermal indicator under /sys/....
The firmware version that runs coolest is the best.

DEFFO. It's running now at 42.2 C. Never more than 45 C. Not even in summer.

BTW, here is the temperature: /sys/class/thermal/thermal_zone0/temp
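
For anyone reading along: the value in that file is in millidegrees Celsius, so a reading of 42200 means 42.2 °C. A minimal sketch for printing it in degrees follows. The function name millideg_to_c is made up for illustration, and the code assumes a non-negative reading.

```shell
# Convert a millidegree reading (e.g. 42200) to degrees with one decimal.
# Hypothetical helper; assumes a non-negative integer argument.
millideg_to_c() {
    printf '%d.%d\n' $(( $1 / 1000 )) $(( $1 % 1000 / 100 ))
}

# Print the current reading if the thermal zone exists and is readable.
zone=/sys/class/thermal/thermal_zone0/temp
if [ -r "$zone" ] && t=$(cat "$zone" 2>/dev/null) && [ -n "$t" ]; then
    millideg_to_c "$t"
fi
```

Running it in a loop with a sleep in between is enough to watch the temperature during a load test.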

Anyway, if this 22.03.7 firmware made it run hot, the problem would still be the firmware, not the device or its heat sink. Right?

Today, again two days after reboot, I noticed that ksoftirqd/1 had started to rise to over 10% from time to time.

Searching for this particular process I found that it's recommended to enable Packet Steering for this device and reboot so that's what I did.

ksoftirqd is the kernel thread that runs the deferred work right after a hardware interrupt has been handled: firewall and qdisc on a router, disk schedulers and video sync on other devices.

First check cat /proc/interrupts. If the same device appears to have multiple IRQs, start with irqbalance (enable and start it after installing) to spread the interrupts across cores.

root@router:~# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       
 11:     521023     655772    1550027     982197     GICv2  30 Level     arch_timer
 14:          0          0          0          0     GICv2  32 Level     ff1f0000.dmac
 15:          0          0          0          0     GICv2  33 Level     ff1f0000.dmac
 20:          5          0          0          0     GICv2  89 Level     ttyS2
 21:     268263          0          0          0     GICv2  69 Level     ff160000.i2c
 23:          0          0          0          0     GICv2  90 Level     rockchip_thermal
 32:          0          0          0          0     GICv2  43 Level     ff350800.iommu
 33:     127720          0          0          0     GICv2  44 Level     dw-mci
 34:          0    6738171          0          0     GICv2  56 Level     eth0
 36:          0          0          0          0     GICv2  48 Level     ehci_hcd:usb1
 37:          0          0          0          0     GICv2  49 Level     ohci_hcd:usb2
 42:          0          0          0          0     GICv2  94 Level     rockchip_usb2phy
 43:        356          0   10829424          0     GICv2  99 Level     xhci-hcd:usb3
 44:          0          0          0          0  rockchip_gpio_irq  24 Level     rk805
 50:          0          0          0          0     rk805   5 Edge      RTC alarm
 53:          0          0          0          0  rockchip_gpio_irq   0 Edge      keys
IPI0:    266303     228815     181501     196813       Rescheduling interrupts
IPI1:   3135983     837076    1786315    1785391       Function call interrupts
IPI2:         0          0          0          0       CPU stop interrupts
IPI3:         0          0          0          0       CPU stop (for crash dump) interrupts
IPI4:    437124     334838     465230     665449       Timer broadcast interrupts
IPI5:     24183      29260      28841      27507       IRQ work interrupts
IPI6:         0          0          0          0       CPU wake-up interrupts
Err:          0

Where do I look for the answer to your question?

The interrupts are balanced across the CPUs.
CPU1 handles the eth0 card. If ethtool -g/-G is supported you may be able to get multiple interrupts; if not, tick Packet Steering (and keep irqbalance).

I installed irqbalance. How can I see the effect?

CPU3 started to handle eth0 interrupts according to /proc/interrupts. Is that OK?

Yep, it works as expected.
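
One way to watch the effect is to pull the per-CPU counters for a single device out of /proc/interrupts and compare them over time. The sketch below is only an illustration: the function name irq_per_cpu is hypothetical, and it assumes the device name (eth0 here) is the last column of its line, as in the output posted above.

```shell
# Print the per-CPU interrupt counts for one device.
# $1 = device name as shown in the last column; input is read from stdin
# in /proc/interrupts format. Hypothetical helper for illustration.
irq_per_cpu() {
    awk -v dev="$1" '
        NR == 1 { ncpu = NF; next }      # header line: CPU0 CPU1 ...
        $NF == dev {
            out = ""
            for (i = 1; i <= ncpu; i++) {
                sep = (i > 1) ? " " : ""
                out = out sep "CPU" (i - 1) "=" $(i + 1)
            }
            print out
        }'
}

# Usage on the router:
#   irq_per_cpu eth0 < /proc/interrupts
```

Run it a few times while traffic is flowing. The column that keeps growing shows which core is currently handling the interrupts.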

Run

iperf3 --bidir -c <server>

Servers: https://iperf.fr/iperf-servers.php
Run it from a client on the LAN side. It should not hit 100% on the CPU that handles that Ethernet interface. The layout may re-balance a few times and will sooner or later settle.
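
To check the per-core load during such a run without extra packages, two snapshots of /proc/stat can be compared. This is a rough sketch, not a replacement for top or htop. The function name cpu_busy and the snapshot paths are assumptions made for the example; busy time is taken as everything except idle and iowait.

```shell
# Print the busy percentage per core between two /proc/stat snapshots.
# $1 = older snapshot file, $2 = newer snapshot file.
# Hypothetical helper; fields 2..9 are user nice system idle iowait irq softirq steal.
cpu_busy() {
    awk '
        /^cpu[0-9]/ {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle = $5 + $6                      # idle + iowait
            if ($1 in ptotal) {                 # second snapshot: print the delta
                dt = total - ptotal[$1]
                di = idle - pidle[$1]
                if (dt > 0) printf "%s %d%%\n", $1, 100 * (dt - di) / dt
            }
            ptotal[$1] = total
            pidle[$1] = idle
        }' "$1" "$2"
}

# Usage on the router while iperf3 is running:
#   cat /proc/stat > /tmp/stat.a; sleep 5; cat /proc/stat > /tmp/stat.b
#   cpu_busy /tmp/stat.a /tmp/stat.b
```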


I've used iperf3 for years but I didn't know public iPerf3 servers. I hadn't even thought of searching for them :upside_down_face:

irqbalance seems to work. Now hope it doesn't crash again.

Many thanks @brada4 :clap:

While you are near it you can try another stress/heat test.
Run 1-2 of these per processor core (killall yes to stop):

nice yes > /dev/null &

Then watch temperature.

Will do later, at a more appropriate moment. If the router crashes and my family find out it was done deliberately, just to run a test, they will come for me :fearful:

BTW, if the core temperature were a problem, limiting the CPU frequency would be a solution. I've done that successfully with a Raspberry Pi 3B+ in the past. I managed to prevent throttling by lowering the CPU speed once I had set up a WebRTC session remotely.

Is there anything in OpenWrt for this? It seems there is no cpufrequtils package.

However, reading /sys/devices/system/cpu/cpufreq/policy0/cpuinfo_cur_freq gives encouraging results:

1008000 (most of the time)
816000  (less frequently)
408000  (rare)

Although the numbers are a bit weird, they suggest there is already a policy in place.

This is an overview of what's in the /sys/devices/system/cpu/cpufreq/policy0 folder:


affected_cpus:    0 1 2 3
cpuinfo_cur_freq: 816000
cpuinfo_max_freq: 1296000
cpuinfo_min_freq: 408000
cpuinfo_transition_latency: 68000
related_cpus:     0 1 2 3
scaling_available_frequencies: 408000 600000 816000 1008000 1200000 1296000 
scaling_available_governors: powersave performance schedutil 
scaling_cur_freq: 816000
scaling_driver:   cpufreq-dt
scaling_governor: schedutil
scaling_max_freq: 1296000
scaling_min_freq: 408000
scaling_setspeed: <unsupported>

What does cpufreq-dt driver mean?

Can I just write e.g. 1008000 to scaling_max_freq without breaking anything?

It means the dt (devicetree) frequency-switching driver is used on this system. It is not a PC, where you would have something like the acpi or amd drivers to choose from.
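
On the scaling_max_freq question: in the generic Linux cpufreq sysfs interface that file is writable, and writing a lower value is the usual way to cap the frequency. Nobody in this thread tested it on the R2S, so treat the following as a sketch. The function name cap_freq is hypothetical; it only writes a value that appears in scaling_available_frequencies.

```shell
# Cap the maximum CPU frequency of a cpufreq policy.
# $1 = policy directory, $2 = frequency in kHz.
# Hypothetical helper; refuses values the driver does not list.
cap_freq() {
    dir=$1
    want=$2
    for f in $(cat "$dir/scaling_available_frequencies"); do
        if [ "$f" = "$want" ]; then
            echo "$want" > "$dir/scaling_max_freq"
            return 0
        fi
    done
    echo "frequency $want is not listed in $dir" >&2
    return 1
}

# Usage on the router (as root):
#   cap_freq /sys/devices/system/cpu/cpufreq/policy0 1008000
```

The setting lives in sysfs, so it is gone after a reboot and would have to be applied again at boot time.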
