OpenWrt 19.07.7 kernel panic on Netgear R7800

Hi,

I replaced our main OpenWrt router two weeks ago with a Netgear R7800 running OpenWrt 19.07.7. Its main purpose is connecting five office branches via strongSwan and a few clients via OpenVPN. It also acts as a WLAN access point for around 20 mainly mobile devices.
Today the system rebooted and left the following kernel panic in the log file.

Tue Mar 30 12:41:54 2021 kern.alert kernel: [572799.396113] Unable to handle kernel paging request at virtual address 6c696164
Tue Mar 30 12:41:54 2021 kern.alert kernel: [572799.396153] pgd = c0204000
Tue Mar 30 12:41:54 2021 kern.alert kernel: [572799.402399] [6c696164] *pgd=00000000
Tue Mar 30 12:41:54 2021 kern.emerg kernel: [572799.405026] Internal error: Oops: 80000005 [#1] SMP ARM
Tue Mar 30 12:41:54 2021 kern.warn kernel: [572799.408828] Modules linked in: pppoe ppp_async ath10k_pci ath10k_core ath pppox ppp_generic nf_nat_pptp nf_conntrack_pptp nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 cdc_ether xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_policy xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_esp xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_FLOWOFFLOAD xt_DSCP xt_CT xt_CLASSIFY usbnet ts_fsm ts_bm slhc nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_redirect nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_conntrack_ipv4 nf_nat_ipv4 nf_nat_h323 nf_nat_amanda nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_tftp

I use the R7800 in other, smaller branches, and those have been running for weeks without issues.
I wonder if such kernel panics are known issues. If not, what would be a good starting point to track them down?

So far the router reboots around every five days. I added a check_mk client script, which does not show irregularities in context switches or memory usage. I also tried the ath10k-ct-smallbuffers module, but with no luck. A kern.alert does not always get logged. Today I found this one from a reboot on Sunday, so at a time when no one is at work and the router is idling.


Sat Apr 17 14:03:24 2021 kern.alert kernel: [562221.709322] Unable to handle kernel NULL pointer dereference at virtual address 00000000
Sat Apr 17 14:03:24 2021 kern.alert kernel: [562221.709363] pgd = c0204000
Sat Apr 17 14:03:24 2021 kern.alert kernel: [562221.716608] [00000000] *pgd=00000000

Ummmm...you'll kinda need the log before it reboots...

Can you send it via the logger or display it in an SSH session etc., so you can review it when the router reboots/disconnects?
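
Something along these lines might do it. This is only a rough sketch, assuming the stock logd with its `log_ip`, `log_port` and `log_proto` options in /etc/config/system; the function name `enable_remote_log` and the receiver address are made up for illustration.

```shell
# Sketch: forward logd output to a remote syslog receiver so the last
# messages before a crash are kept on another machine.
# The receiver address is passed as the first argument.
enable_remote_log() {
    log_host="$1"
    uci set "system.@system[0].log_ip=$log_host"
    uci set "system.@system[0].log_port=514"
    uci set "system.@system[0].log_proto=udp"
    uci commit system
    # Restart logd so the new settings take effect (OpenWrt init script).
    if [ -x /etc/init.d/log ]; then
        /etc/init.d/log restart
    fi
}
```

You would call it as `enable_remote_log 192.0.2.10` (placeholder address) and run a syslog receiver on UDP port 514 on that machine. The simpler alternative is to leave `logread -f` running in an SSH session and read whatever was printed last once the connection drops. Either way there is no guarantee the final lines of an oops make it out before the reboot.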


Did you consider:

It solved my identical problems.


19.07 w/ kernel 4.x had some reboot problems. Mine used to reboot every few days also. I haven't seen them on kernel 5.4, it may be worth trying a 21.02 or mainline build.


Thank you all for the feedback. I changed the min CPU frequency from 384 MHz to 800 MHz because this was the easiest thing to do. I will have to wait a few days and will report results here in a week.
I use the R7800 with version 19.07 in three other branches. These do not reboot and have uptimes of 20 and 40 days.

Really...this was easy?

I'm curious:

  • What do you believe it'll solve
  • How did you make that adjustment

(Feel free to respond when you have a moment.)

There are some caveats with ipq806x running at 384 MHz. On the one hand there is the ramp-up time (longer latency than necessary, but not buggy behaviour); on the other hand there are some intricacies with the silicon at this speed as well, and these can cause crashes. This pull request explains some of the reasoning (there are other fixes pending as part of the v5.10 kernel bump for ipq806x, touching L2 cache scaling and MDIO busy waits, which will improve the situation further).

As far as 19.07.x with kernel 4.14 is concerned, there may be another issue looming in the stmmac (ethernet) driver when it encounters jumbo frames (this is fixed in kernel >=4.19, but hasn't been backported because no one has identified the commit that fixes it).


I raised the min frequency manually via echo on the command line for both cores.

echo 800000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
echo 800000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq
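
To check that the values actually stuck, something like the following could be used. It is just a sketch; `show_cpufreq` is a name I made up, and the optional argument only exists so the function can be pointed at a different directory for testing.

```shell
# Print the configured minimum and the current frequency (in kHz)
# for every cpufreq policy found under the given sysfs directory.
show_cpufreq() {
    root="${1:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        # Skip when the glob matched nothing or the file is unreadable.
        [ -r "$policy/scaling_min_freq" ] || continue
        echo "${policy##*/}: min=$(cat "$policy/scaling_min_freq") cur=$(cat "$policy/scaling_cur_freq")"
    done
}
```

Running `show_cpufreq` without arguments prints one line per policy, and after the two echo commands above the min value should read 800000 for both.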

If it is an L2 cache error as mentioned above, chances are low that redirecting the logging to another machine will help. The same goes for errors caused by jumbo frames.
I configure all my router images via Ansible scripts that read their settings from NetBox. I have put adding an option for log redirection on my list, but plan to wait for 21.02 stable before I switch to that.


It improves latency (as the lowest frequencies are eliminated from the CPU frequency scaling), and might improve stability.

Background discussion leading to that PR can be found in a few messages around this:

Example script in


The router has not rebooted in the last 13 days. Before that it usually rebooted every five days. It did once reach an uptime of 10 days, but this looks promising. It seems the min frequency fix did the trick.



The router has now been running for 24 days, so I am marking this issue as solved.

In short, the fix (now added to /etc/rc.local so it is applied again on every boot):

echo 800000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
echo 800000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq
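
If you would rather not hard-code the policy numbers, a loop over whatever policies exist should work as well. This is an untested sketch of the same idea; `set_min_freq` is a hypothetical helper name, and the optional second argument is only there to try it against a scratch directory.

```shell
# Raise the minimum CPU frequency (in kHz) for every cpufreq policy.
# $1: minimum frequency in kHz, $2: optional sysfs directory override.
set_min_freq() {
    min_khz="$1"
    root="${2:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        # Skip when the glob matched nothing or the file is not writable.
        [ -w "$policy/scaling_min_freq" ] || continue
        echo "$min_khz" > "$policy/scaling_min_freq"
    done
}
```

In /etc/rc.local the function definition would be followed by a single `set_min_freq 800000` line, placed before the final `exit 0`.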
