Crashes nearly daily - finally have a kernel log - help with interpretation

Hi all,

Our router has been crashing about once per day through multiple versions of OpenWrt, and so far I haven't been able to work out why.

At first I blamed the ath10k wifi driver, so I replaced the board with a second ath9k board. Then I saw that the system logs implicated Bluetooth, so I removed the Bluetooth hardware that I had on the USB bus.

But it still crashes once per day or so.

So I decided to hook the serial cable up to see if I could capture kernel-level oops data, and I finally got what looks like a useful log:

Aug 31 04:20:47.054118 [34757.315320] BUG: unable to handle page fault for address: ffffffff804443d0
Aug 31 04:20:47.059237 [34757.322284] #PF: supervisor read access in kernel mode
Aug 31 04:20:47.064424 [34757.327436] #PF: error_code(0x0000) - not-present page
Aug 31 04:20:47.069584 [34757.332586] PGD 240e067 P4D 240e067 PUD 240f063 PMD 0
Aug 31 04:20:47.073204 [34757.337751] Oops: 0000 [#1] SMP NOPTI
Aug 31 04:20:47.079368 [34757.341432] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.162 #0
Aug 31 04:20:47.085538 [34757.347544] Hardware name: PC Engines APU/APU, BIOS 4.0 09/08/2014
Aug 31 04:20:47.089536 [34757.353731] RIP: 0010:0xffffffff8113f441
Aug 31 04:20:47.098298 [34757.357673] Code: e0 48 39 cf 73 ae 48 8b 40 10 48 85 c0 75 ea 31 d2 8b 05 52 76 2c 01 41 39 c1 75 b2 48 85 d2 74 33 48 8b 42 f8 48 85 c0 74 2a <48> 8b 90 90 01 00 00 48 39 d7 72 25 8b 88 98 01 00 00 48 01 ca 48
Aug 31 04:20:47.113522 [34757.376447] RSP: 0018:ffffc900000bc1d8 EFLAGS: 00010082
Aug 31 04:20:47.120661 [34757.381689] RAX: ffffffff80444240 RBX: ffffffffa043fedb RCX: ffffffffb0446000
Aug 31 04:20:47.127795 [34757.388830] RDX: ffffffffa04443a0 RSI: 00000000ac007000 RDI: ffffffffa043fedb
Aug 31 04:20:47.134958 [34757.395974] RBP: ffffc900000bc1e0 R08: 0000000000000000 R09: 00000000000003ee
Aug 31 04:20:47.142078 [34757.403123] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc900000bc308
Aug 31 04:20:47.149221 [34757.410264] R13: 000000000000000e R14: 0000000000000000 R15: ffff888100100000
Aug 31 04:20:47.157337 [34757.417406] FS:  0000000000000000(0000) GS:ffff88811ad00000(0000) knlGS:0000000000000000
Aug 31 04:20:47.163098 [34757.425511] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 31 04:20:47.170219 [34757.431271] CR2: ffffffff804443d0 CR3: 00000001101e8000 CR4: 00000000000006e0
Aug 31 04:20:47.172712 [34757.438415] Call Trace:
Aug 31 04:20:47.174745 [34757.440875]  <IRQ>
Aug 31 04:20:47.178042 [34757.442903]  ? 0xffffffff81a8dafb
Aug 31 04:20:47.181379 [34757.446229]  ? 0xffffffff81a8dbcc
Aug 31 04:20:47.184716 [34757.449557]  ? 0xffffffff8106028b
Aug 31 04:20:47.188056 [34757.452885]  ? 0xffffffff8115c041
Aug 31 04:20:47.191389 [34757.456213]  ? 0xffffffff8113f441
Aug 31 04:20:47.194704 [34757.459541]  ? 0xffffffff810dbaeb
Aug 31 04:20:47.198030 [34757.462869]  ? 0xffffffff8106047a
Aug 31 04:20:47.201352 [34757.466199]  ? 0xffffffff810605c5
Aug 31 04:20:47.204695 [34757.469527]  ? 0xffffffff8106066e
Aug 31 04:20:47.208022 [34757.472853]  ? 0xffffffff81ae1607
Aug 31 04:20:47.211354 [34757.476183]  ? 0xffffffff81c00c07

This repeats about a dozen times. As you can see I'm running a PC Engines APU2.
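One thing that can be checked directly from the register dump: the instruction bytes marked `<48> 8b 90 90 01 00 00` decode to `mov 0x190(%rax),%rdx`, so the address that faulted should be RAX + 0x190. The small Python sketch below confirms that this matches CR2, using values copied from the log above. It also shows that setting a single bit (bit 29) in RAX would give an address 0x160 bytes below RDX. That second part is only an observation. Whether it points at a flipped bit in memory is an assumption, not something the log proves.

```python
# Register values copied verbatim from the oops above.
RAX = 0xFFFFFFFF80444240
RDX = 0xFFFFFFFFA04443A0
CR2 = 0xFFFFFFFF804443D0

# Displacement encoded in the faulting instruction "48 8b 90 90 01 00 00":
# the last four bytes are a little-endian 32-bit displacement.
DISP = int.from_bytes(bytes.fromhex("90010000"), "little")


def faulting_address(base, disp):
    """Effective address of a base+displacement load, wrapped to 64 bits."""
    return (base + disp) & 0xFFFFFFFFFFFFFFFF


def differing_bits(a, b):
    """Bit positions at which a and b differ."""
    x = a ^ b
    return [i for i in range(64) if (x >> i) & 1]


if __name__ == "__main__":
    # Does RAX + displacement reproduce the faulting address in CR2?
    print(hex(faulting_address(RAX, DISP)), hex(CR2))

    # Assumption for illustration only: what if bit 29 of RAX had been set?
    candidate = RAX | (1 << 29)
    print(hex(candidate), differing_bits(RAX, candidate), hex(RDX - candidate))
```

If the same registers show the same pattern in the other repeats of the oops, that would be worth mentioning when asking for help.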

RAM is barely used at all:

root@OpenWrt:~# cat /proc/meminfo 
MemTotal:        4004228 kB
MemFree:         3091404 kB
MemAvailable:    3495536 kB
Buffers:           66244 kB
Cached:           367388 kB
SwapCached:            0 kB
Active:           615492 kB
Inactive:         173680 kB
Active(anon):     353360 kB
Inactive(anon):     2184 kB
Active(file):     262132 kB
Inactive(file):   171496 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        355548 kB
Mapped:           188444 kB
Shmem:              2624 kB
KReclaimable:      52968 kB
Slab:              76872 kB
SReclaimable:      52968 kB
SUnreclaim:        23904 kB
KernelStack:        3360 kB
PageTables:         3612 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2002112 kB
Committed_AS:     594996 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       12808 kB
VmallocChunk:          0 kB
Percpu:              456 kB
DirectMap4k:       50488 kB
DirectMap2M:     2015232 kB
DirectMap1G:     2097152 kB
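To put a number on "barely used", here is a minimal Python sketch that parses meminfo-style text and reports how much memory is in use, taken as MemTotal minus MemAvailable. The function names are made up for this example, and the sample text is just the first three lines of the output above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kilobytes}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def used_percent(info):
    """Share of RAM in use, treating MemAvailable as reclaimable."""
    total = info["MemTotal"]
    return 100.0 * (total - info["MemAvailable"]) / total


# First three lines of the meminfo output posted above.
SAMPLE = """MemTotal:        4004228 kB
MemFree:         3091404 kB
MemAvailable:    3495536 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(f"{used_percent(info):.1f}% in use")
```

With the figures above this comes out at roughly 13% in use.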

And the two CPUs rarely get over 0.17 load average.

This hardware should be the best available, so I'm really pulling out what's left of my hair. Can any of you maybe find a clue in that kernel log? Is there any way to work out which driver was referencing address ffffffff804443d0 at the time of the crash?
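On the question of mapping addresses back to code: the call trace prints raw addresses rather than function names, which suggests (this is an assumption) that the kernel was built without symbol names. If a System.map from exactly the same kernel build can be obtained, the addresses can be resolved offline. Below is a minimal Python sketch of that lookup. The symbol names and addresses in `EXAMPLE_MAP` are invented for illustration and do not come from a real OpenWrt build. The sketch also assumes the kernel was not relocated at boot.

```python
import bisect


def load_symbols(system_map_text):
    """Parse System.map-style lines of the form '<hex addr> <type> <name>'."""
    symbols = []
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+0xoffset' for the nearest symbol at or below address."""
    starts = [addr for addr, _ in symbols]
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return None  # address lies below every known symbol
    base, name = symbols[i]
    return f"{name}+{hex(address - base)}"


# Hypothetical symbol table: these names and addresses are made up.
EXAMPLE_MAP = """ffffffff8113f000 T example_lookup
ffffffff8113f400 t example_helper
ffffffff8115c000 T example_caller
"""

if __name__ == "__main__":
    table = load_symbols(EXAMPLE_MAP)
    # RIP from the oops above, resolved against the made-up table.
    print(resolve(table, 0xFFFFFFFF8113F441))
```

A System.map only covers the kernel image itself, so addresses belonging to loadable modules would need a different source of symbols.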

I failed to include our current OpenWrt version:

OpenWrt 23.05.4, r24012-d8dd03c46f

The dmesg output is very sparse.
Were there any special conditions around the crash?

That's it. For what it's worth that kernel log does not appear in either dmesg or in the system logs (which I ship). In order to get that I had to connect with a serial cable.

The previous line is 6 hours earlier, like so:

Aug 30 21:54:05.391503 [11558.255230] kmodloader: - ccp-crypto - 0
Aug 31 04:20:47.054118 [34757.315320] BUG: unable to handle page fault for address: ffffffff804443d0
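For the record, the exact gap between those two lines can be worked out from either the serial-capture timestamps or the kernel uptime stamps in brackets. A minimal Python sketch, using only the two lines quoted above (the helper names are made up for this example):

```python
from datetime import datetime


def parse_line(line):
    """Split a serial-log line into (wall-clock datetime, kernel uptime in s)."""
    stamp, _, rest = line.partition(" [")
    # The log carries no year; strptime defaults to 1900, which is fine
    # because only the difference between two stamps is used.
    wall = datetime.strptime(stamp, "%b %d %H:%M:%S.%f")
    uptime = float(rest.split("]", 1)[0])
    return wall, uptime


def gaps(earlier, later):
    """Return (wall-clock gap, kernel-uptime gap) in seconds."""
    wall_a, up_a = parse_line(earlier)
    wall_b, up_b = parse_line(later)
    return (wall_b - wall_a).total_seconds(), up_b - up_a


PREV = "Aug 30 21:54:05.391503 [11558.255230] kmodloader: - ccp-crypto - 0"
OOPS = ("Aug 31 04:20:47.054118 [34757.315320] BUG: unable to handle "
        "page fault for address: ffffffff804443d0")

if __name__ == "__main__":
    wall_gap, uptime_gap = gaps(PREV, OOPS)
    print(f"wall clock: {wall_gap / 3600:.2f} h, uptime: {uptime_gap / 3600:.2f} h")
```

Both clocks put the silence before the crash at a little under six and a half hours.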

So I'm fully stumped.

You might want to review the BIOS settings for anything that might be Windows-specific and turn it off.
Bit of a long shot, but you never know.

Have we ever reviewed the complete configuration just to make sure everything is valid?

Are there any non-default packages installed in this system?

Please connect to your OpenWrt device using ssh and copy the output of the following commands and post it here using the "Preformatted text </> " button:
Remember to redact passwords, MAC addresses and any public IP addresses you may have:

ubus call system board
cat /etc/config/network
cat /etc/config/wireless
cat /etc/config/dhcp
cat /etc/config/firewall

Well, I doubt that there would be anything Windows-specific since I think the hardware is mainly meant to run BSD, but there is potentially a firmware update:

https://pcengines.github.io/

So I'll look into that.

I think I've redacted anything remotely sensitive, but I'll appreciate hearing if I haven't.

{
	"kernel": "5.15.162",
	"hostname": "OpenWrt",
	"system": "AMD G-T40E Processor",
	"model": "PC Engines apu1",
	"board_name": "pc-engines-apu1",
	"rootfs_type": "ext4",
	"release": {
		"distribution": "OpenWrt",
		"version": "23.05.4",
		"revision": "r24012-d8dd03c46f",
		"target": "x86/64",
		"description": "OpenWrt 23.05.4 r24012-d8dd03c46f"
	}
}

config interface 'loopback'
	option device 'lo'
	option proto 'static'
	option ipaddr '127.0.0.1'
	option netmask '255.0.0.0'

config device
	option name 'br-lan'
	option type 'bridge'
	list ports 'eth1'
	list ports 'eth2'

config interface 'lan'
	option device 'br-lan'
	option proto 'static'
	option ipaddr '172.24.42.1'
	option netmask '255.255.255.0'
	option ip6assign '60'
	option ipv6 '0'
	list dns '172.24.42.1'
	list dns_search 'somedomain_that_I_own'

config interface 'wan'
	option device 'eth0'
	option proto 'static'
	option ipaddr '172.22.22.22'
	option netmask '255.255.255.0'
	option gateway '172.22.22.1'
	option ipv6 '0'
	list dns '172.22.22.1'

config interface 'openvpn'
	option proto 'static'
	option device 'tun0'
	option ipaddr '172.24.24.1'
	option netmask '255.255.255.0'

config interface 'docker'
	option device 'docker0'
	option proto 'none'
	option auto '0'

config device
	option type 'bridge'
	option name 'docker0'


config wifi-device 'radio0'
	option type 'mac80211'
	option hwmode '11a'
	option path 'pci0000:00/0000:00:15.0/0000:06:00.0'
	option country 'CH'
	option cell_density '2'
	option channel '116'

config wifi-device 'radio1'
	option type 'mac80211'
	option path 'pci0000:00/0000:00:07.0/0000:04:00.0'
	option htmode 'HT20'
	option cell_density '0'
	option country 'CH'
	option hwmode '11g'
	option channel 'auto'
	option txpower '20'

config wifi-iface 'default_radio1'
	option device 'radio1'
	option network 'lan'
	option mode 'ap'
	option ssid '*********'
	option encryption 'psk2'
	option key '*******************'

config wifi-iface 'wifinet1'
	option device 'radio0'
	option mode 'ap'
	option ssid '************'
	option encryption 'psk2'
	option key '********************************'
	option network 'lan'


config dnsmasq
	option domainneeded '1'
	option localise_queries '1'
	option rebind_protection '1'
	option rebind_localhost '1'
	option local '/lan/'
	option expandhosts '1'
	option authoritative '1'
	option readethers '1'
	option leasefile '/tmp/dhcp.leases'
	option localservice '1'
	option ednspacket_max '1232'
	option domain 'somedomain_that_I_own'
	option noresolv '1'
	option filter_aaaa '1'
	list rebind_domain 'somedomain_that_I_own'
	option port '1053'

config dhcp 'lan'
	list dhcp_option 'option:dns-server,172.24.42.1'
	option interface 'lan'
	option start '100'
	option limit '150'
	option leasetime '12h'
	option dhcpv4 'server'
	option port '1053'

config dhcp 'wan'
	option interface 'wan'
	option ignore '1'
	option start '100'
	option limit '150'
	option leasetime '12h'

config odhcpd 'odhcpd'
	option maindhcp '0'
	option leasefile '/tmp/hosts/odhcpd'
	option leasetrigger '/usr/sbin/odhcpd-update'
	option loglevel '4'

config domain
	option name 'spike.lan'
	option ip '172.24.42.5'

config domain
	option name 'sickchill.lan'
	option ip '172.24.42.4'

config domain
	option name 'sab.lan'
	option ip '172.24.42.4'

config domain
	option name 'openwrt.lan'
	option ip '172.24.42.1'

config domain
	option name 'router.lan'
	option ip '172.24.42.1'

config domain
	option ip '172.24.42.118'
	option name 'fp3'

config domain
	option name 'graylog.lan'
	option ip '172.24.42.4'

config domain
	option name 'kodi.lan'
	option ip '172.24.42.4'

config host
	option name 'spike'
	option duid '0004C295F315E177C737BBA0B1903EA9FBEC'
	option hostid 'd59'

config domain
	option name 'sickchill.somedomain_that_I_own'
	option ip '172.24.42.4'

config domain
	option name 'kodi.somedomain_that_I_own'
	option ip '172.24.42.4'

config domain
	option name 'sab.somedomain_that_I_own'
	option ip '172.24.42.4'

config domain
	option name 'router.somedomain_that_I_own'
	option ip '172.24.42.1'

config domain
	option name 'spike.somedomain_that_I_own'
	option ip '172.24.42.4'

config domain
	option name 'printer.somedomain_that_I_own'
	option ip '172.24.42.135'

config host
	option name 'printer.somedomain_that_I_own'
	option dns '1'
	option mac '50:57:9C:6A:AA:1A'
	option ip '172.24.42.135'

config domain
	option name 'i.somedomain_that_I_own'
	option ip '172.24.42.4'

config domain
	option name 'radarr.somedomain_that_I_own'
	option ip '172.24.42.4'

config domain
	option name 'sonarr.somedomain_that_I_own'
	option ip '172.24.42.4'

config domain
	option name 'k2so.lan'
	option ip '172.24.42.4'

config domain
	option name 'readarr'
	option ip '172.24.42.4'

config domain
	option name 'syncthing'
	option ip '172.24.42.4'

config domain
	option name 'syncthinga'
	option ip '172.24.42.4'

config domain
	option name 'calibre'
	option ip '172.24.42.4'

config domain
	option name 'home.somedomain_that_I_own'
	option ip '172.24.42.1'

config domain
	option name 'homeassistant.somedomain_that_I_own'
	option ip '172.24.42.1'

config domain
	option name 'perkeep'
	option ip '172.24.42.4'


config defaults
	option input 'ACCEPT'
	option output 'ACCEPT'
	option forward 'REJECT'
	option synflood_protect '1'

config zone 'lan'
	option name 'lan'
	option input 'ACCEPT'
	option output 'ACCEPT'
	option forward 'ACCEPT'
	list device 'tun+'
	list network 'lan'
	list network 'openvpn'

config zone 'wan'
	option name 'wan'
	option input 'REJECT'
	option output 'ACCEPT'
	option forward 'REJECT'
	option masq '1'
	option mtu_fix '1'
	list network 'wan'

config forwarding
	option src 'lan'
	option dest 'wan'

config rule
	option name 'Allow-DHCP-Renew'
	option src 'wan'
	option proto 'udp'
	option dest_port '68'
	option target 'ACCEPT'
	option family 'ipv4'

config rule
	option name 'Allow-Ping'
	option src 'wan'
	option proto 'icmp'
	option icmp_type 'echo-request'
	option family 'ipv4'
	option target 'ACCEPT'

config rule
	option name 'Allow-IGMP'
	option src 'wan'
	option proto 'igmp'
	option family 'ipv4'
	option target 'ACCEPT'

config rule
	option name 'Allow-DHCPv6'
	option src 'wan'
	option proto 'udp'
	option src_ip 'fc00::/6'
	option dest_ip 'fc00::/6'
	option dest_port '546'
	option family 'ipv6'
	option target 'ACCEPT'

config rule
	option name 'Allow-MLD'
	option src 'wan'
	option proto 'icmp'
	option src_ip 'fe80::/10'
	list icmp_type '130/0'
	list icmp_type '131/0'
	list icmp_type '132/0'
	list icmp_type '143/0'
	option family 'ipv6'
	option target 'ACCEPT'

config rule
	option name 'Allow-ICMPv6-Input'
	option src 'wan'
	option proto 'icmp'
	list icmp_type 'echo-request'
	list icmp_type 'echo-reply'
	list icmp_type 'destination-unreachable'
	list icmp_type 'packet-too-big'
	list icmp_type 'time-exceeded'
	list icmp_type 'bad-header'
	list icmp_type 'unknown-header-type'
	list icmp_type 'router-solicitation'
	list icmp_type 'neighbour-solicitation'
	list icmp_type 'router-advertisement'
	list icmp_type 'neighbour-advertisement'
	option limit '1000/sec'
	option family 'ipv6'
	option target 'ACCEPT'

config rule
	option name 'Allow-ICMPv6-Forward'
	option src 'wan'
	option dest '*'
	option proto 'icmp'
	list icmp_type 'echo-request'
	list icmp_type 'echo-reply'
	list icmp_type 'destination-unreachable'
	list icmp_type 'packet-too-big'
	list icmp_type 'time-exceeded'
	list icmp_type 'bad-header'
	list icmp_type 'unknown-header-type'
	option limit '1000/sec'
	option family 'ipv6'
	option target 'ACCEPT'

config rule
	option name 'Allow-IPSec-ESP'
	option src 'wan'
	option dest 'lan'
	option proto 'esp'
	option target 'ACCEPT'

config rule
	option name 'Allow-ISAKMP'
	option src 'wan'
	option dest 'lan'
	option dest_port '500'
	option proto 'udp'
	option target 'ACCEPT'

config rule
	option name 'Support-UDP-Traceroute'
	option src 'wan'
	option dest_port '33434:33689'
	option proto 'udp'
	option family 'ipv4'
	option target 'REJECT'
	option enabled '0'

config include
	option path '/etc/firewall.user'

config redirect
	option target 'DNAT'
	option name 'SSH to spike'
	list proto 'tcp'
	option src 'wan'
	option src_dport '4300'
	option dest 'lan'
	option dest_ip '172.24.42.4'
	option dest_port '22'

config rule 'ovpn'
	option name 'Allow-OpenVPN'
	option src 'wan'
	option proto 'udp'
	option target 'ACCEPT'
	option dest_port '9411'

config redirect
	option target 'DNAT'
	option name 'Plex'
	list proto 'tcp'
	option src 'wan'
	option src_dport '32400'
	option dest 'lan'
	option dest_ip '172.24.42.4'
	option dest_port '32400'

config zone
	option name 'cjdns'
	option input 'REJECT'
	option output 'ACCEPT'
	option forward 'REJECT'
	option conntrack '1'
	option family 'ipv6'

config rule
	option name 'Allow-ICMPv6-cjdns'
	option src 'cjdns'
	option proto 'icmp'
	list icmp_type 'echo-request'
	list icmp_type 'echo-reply'
	list icmp_type 'destination-unreachable'
	list icmp_type 'packet-too-big'
	list icmp_type 'time-exceeded'
	list icmp_type 'bad-header'
	list icmp_type 'unknown-header-type'
	option limit '1000/sec'
	option family 'ipv6'
	option target 'ACCEPT'

config rule
	option enabled '0'
	option name 'Allow-SSH-cjdns'
	option src 'cjdns'
	option proto 'tcp'
	option dest_port '22'
	option target 'ACCEPT'

config rule
	option enabled '0'
	option name 'Allow-HTTP-cjdns'
	option src 'cjdns'
	option proto 'tcp'
	option dest_port '80'
	option target 'ACCEPT'

config rule
	option name 'Allow-cjdns-wan'
	option src 'wan'
	option proto 'udp'
	option dest_port '33876'
	option target 'ACCEPT'

config zone 'docker'
	option input 'ACCEPT'
	option output 'ACCEPT'
	option forward 'ACCEPT'
	option name 'docker'
	list network 'docker'

config forwarding
	option src 'lan'
	option dest 'docker'

config forwarding
	option src 'docker'
	option dest 'lan'

config rule
	option name 'Disallow-direct-home-assistant'
	option src 'lan'
	option dest_port '8123'
	option target 'REJECT'

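This is suspect:

config interface 'openvpn'
	option proto 'static'
	option device 'tun0'
	option ipaddr '172.24.24.1'
	option netmask '255.255.255.0'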

Normally, an OpenVPN tunnel should be unmanaged (option proto 'none'). You may be creating a conflict here by having this address assigned here.

The DHCP server has an invalid dhcp_option line:

	list dhcp_option 'option:dns-server,172.24.42.1'

It should be something like this:

	list dhcp_option '6,172.24.42.1'

I'm not sure about the port line -- I've never seen that used like that, not sure if it is valid. Any particular reason you are trying to use a non-standard DNS port?

With the OpenVPN setup in the firewall zone:

	list device 'tun+'
	list network 'lan'
	list network 'openvpn'

You have a choice... either remove the network interface (discussed earlier) entirely from both the network config and this zone, and declare the device here in the firewall, or remove the device line from the firewall and leave the network in place. Don't specify both, though.
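As a rough way to spot this kind of overlap, the Python sketch below parses UCI-style firewall text and lists zones that set both a device and a network. It is a coarse check based only on the advice above: the parser is deliberately tiny and is not the real UCI implementation, the function names are made up, and the sample is trimmed from the firewall config posted earlier.

```python
import shlex


def parse_uci(text):
    """Tiny UCI-style parser: returns a list of (type, name, options)."""
    sections = []
    for raw in text.splitlines():
        words = shlex.split(raw, comments=True)
        if not words:
            continue
        if words[0] == "config" and len(words) >= 2:
            name = words[2] if len(words) > 2 else None
            sections.append((words[1], name, {}))
        elif words[0] in ("option", "list") and len(words) >= 3 and sections:
            # Store every value as a list so 'option' and 'list' look alike.
            sections[-1][2].setdefault(words[1], []).append(words[2])
    return sections


def zones_with_device_and_network(firewall_text):
    """Names of firewall zones that set both 'device' and 'network'."""
    hits = []
    for kind, _name, opts in parse_uci(firewall_text):
        if kind == "zone" and "device" in opts and "network" in opts:
            hits.append(opts.get("name", ["?"])[0])
    return hits


# Trimmed from the firewall config posted above.
SAMPLE = """
config zone 'lan'
	option name 'lan'
	list device 'tun+'
	list network 'lan'
	list network 'openvpn'

config zone 'wan'
	option name 'wan'
	list network 'wan'
"""

if __name__ == "__main__":
    print(zones_with_device_and_network(SAMPLE))
```

A zone being flagged does not by itself mean the config is broken. It only marks places to look at by hand.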


OK, so unless I hear otherwise I'm going to flash the firmware with version 4.17.0, which seems to be the most recent one supported on my hardware, which turns out to be an APU1.

OK! Thank you!

I'll do this ASAP. Do you think this misconfiguration could cause the crashes?

FYI: The port 1053 line basically takes the stock dnsmasq DNS server off port 53 in favour of unbound. I had to do that because my ISP's DNS servers sometimes provide v6 addresses, even though the ISP does not give me a v6 address, so they are unreachable.

It could... not sure what happens in those situations.

But it's also clear that you have other stuff going on. If the crashes continue, you should consider starting from a completely fresh configuration (omit docker and OpenVPN and anything else you've installed) so that you can run a near-default OpenWrt configuration. If it still crashes in that state, you may have a hardware issue (or an issue with the BIOS settings, etc.), since OpenWrt should be stable in a (near) default state. If things work for, say, 2x - 10x the typical crash interval (so if you see a crash every day, wait 2-10 days before moving on, to gain confidence that everything is stable), then install one thing (OpenVPN, Docker, etc.), get it configured, and let it soak for another several days or so. Then add the next, etc.

You can probably safely install the wifi drivers as part of the near-default config if you need them, but if you have other APs, you might want to just lean on those instead while you're working this stuff out.

It has been years; check for dust inside the case.

Thanks! That's good advice for a lot of folks. But in this case I clean it about once a month, including when I took out the ath10k card.

That's good advice. I'll do it.

I suspect OpenVPN more than the Docker config (Home Assistant), so I'll start with OpenVPN. The only reason is that the crashing started before I added Home Assistant to the rig.