Our router has been crashing about once per day through multiple versions of OpenWrt, and so far I haven't been able to work out why.
At first I blamed the ath10k wifi driver, so I replaced that board with a second ath9k board. Then I saw that the system logs implicated Bluetooth, so I removed the Bluetooth hardware that I had on the USB bus.
But it still crashes once per day or so.
So I decided to hook the serial cable up to see if I could capture kernel-level oops data, and I finally got what looks like a useful log:
Aug 31 04:20:47.054118 [34757.315320] BUG: unable to handle page fault for address: ffffffff804443d0
Aug 31 04:20:47.059237 [34757.322284] #PF: supervisor read access in kernel mode
Aug 31 04:20:47.064424 [34757.327436] #PF: error_code(0x0000) - not-present page
Aug 31 04:20:47.069584 [34757.332586] PGD 240e067 P4D 240e067 PUD 240f063 PMD 0
Aug 31 04:20:47.073204 [34757.337751] Oops: 0000 [#1] SMP NOPTI
Aug 31 04:20:47.079368 [34757.341432] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.162 #0
Aug 31 04:20:47.085538 [34757.347544] Hardware name: PC Engines APU/APU, BIOS 4.0 09/08/2014
Aug 31 04:20:47.089536 [34757.353731] RIP: 0010:0xffffffff8113f441
Aug 31 04:20:47.098298 [34757.357673] Code: e0 48 39 cf 73 ae 48 8b 40 10 48 85 c0 75 ea 31 d2 8b 05 52 76 2c 01 41 39 c1 75 b2 48 85 d2 74 33 48 8b 42 f8 48 85 c0 74 2a <48> 8b 90 90 01 00 00 48 39 d7 72 25 8b 88 98 01 00 00 48 01 ca 48
Aug 31 04:20:47.113522 [34757.376447] RSP: 0018:ffffc900000bc1d8 EFLAGS: 00010082
Aug 31 04:20:47.120661 [34757.381689] RAX: ffffffff80444240 RBX: ffffffffa043fedb RCX: ffffffffb0446000
Aug 31 04:20:47.127795 [34757.388830] RDX: ffffffffa04443a0 RSI: 00000000ac007000 RDI: ffffffffa043fedb
Aug 31 04:20:47.134958 [34757.395974] RBP: ffffc900000bc1e0 R08: 0000000000000000 R09: 00000000000003ee
Aug 31 04:20:47.142078 [34757.403123] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc900000bc308
Aug 31 04:20:47.149221 [34757.410264] R13: 000000000000000e R14: 0000000000000000 R15: ffff888100100000
Aug 31 04:20:47.157337 [34757.417406] FS: 0000000000000000(0000) GS:ffff88811ad00000(0000) knlGS:0000000000000000
Aug 31 04:20:47.163098 [34757.425511] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 31 04:20:47.170219 [34757.431271] CR2: ffffffff804443d0 CR3: 00000001101e8000 CR4: 00000000000006e0
Aug 31 04:20:47.172712 [34757.438415] Call Trace:
Aug 31 04:20:47.174745 [34757.440875] <IRQ>
Aug 31 04:20:47.178042 [34757.442903] ? 0xffffffff81a8dafb
Aug 31 04:20:47.181379 [34757.446229] ? 0xffffffff81a8dbcc
Aug 31 04:20:47.184716 [34757.449557] ? 0xffffffff8106028b
Aug 31 04:20:47.188056 [34757.452885] ? 0xffffffff8115c041
Aug 31 04:20:47.191389 [34757.456213] ? 0xffffffff8113f441
Aug 31 04:20:47.194704 [34757.459541] ? 0xffffffff810dbaeb
Aug 31 04:20:47.198030 [34757.462869] ? 0xffffffff8106047a
Aug 31 04:20:47.201352 [34757.466199] ? 0xffffffff810605c5
Aug 31 04:20:47.204695 [34757.469527] ? 0xffffffff8106066e
Aug 31 04:20:47.208022 [34757.472853] ? 0xffffffff81ae1607
Aug 31 04:20:47.211354 [34757.476183] ? 0xffffffff81c00c07
This repeats about a dozen times. As you can see I'm running a PC Engines APU2.
And the two CPUs rarely get over 0.17 load average.
This hardware should be the best available, so I'm really pulling out what's left of my hair. Can any of you find a clue in that kernel log? Is there any way to work out which driver was referencing address ffffffff804443d0 at the time of the crash?
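One possible way to approach the "which code was this" question, as a rough sketch only: the trace above prints raw addresses rather than function names, so the addresses have to be matched against a symbol table from outside the log. The Python below assumes you can obtain a System.map (or kallsyms-style listing) that matches this exact kernel build and that the kernel was not relocated at boot; both are assumptions, not something the log confirms. The function names (load_symbols, resolve, addresses_in) and the two demo symbols are made up for illustration. Addresses in the module area (ffffffffa0...) would not appear in System.map at all.

```python
import bisect
import re

def load_symbols(lines):
    """Parse System.map / kallsyms style lines: '<hex addr> <type> <name>'."""
    table = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            addr = int(parts[0], 16)
        except ValueError:
            continue  # skip anything that is not a symbol line
        table.append((addr, parts[2]))
    table.sort()
    return table

def resolve(table, addr):
    """Return 'name+0xoffset' for the nearest symbol at or below addr, else None."""
    addrs = [a for a, _ in table]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = table[i]
    return f"{name}+{addr - base:#x}"

# Kernel-space addresses as they appear in the oops (with or without a 0x prefix).
ADDR_RE = re.compile(r"\b(?:0x)?(ffffffff[0-9a-f]{8})\b")

def addresses_in(log_text):
    """Return the distinct kernel addresses in log_text, in order of appearance."""
    found = [int(m.group(1), 16) for m in ADDR_RE.finditer(log_text)]
    return list(dict.fromkeys(found))

if __name__ == "__main__":
    # Hypothetical symbol table; a real one would be read from the matching System.map.
    demo_map = [
        "ffffffff8113f000 T demo_func_a",
        "ffffffff8113f800 T demo_func_b",
    ]
    symbols = load_symbols(demo_map)
    oops = "RIP: 0010:0xffffffff8113f441\n ? 0xffffffff8115c041\n"
    for addr in addresses_in(oops):
        print(f"{addr:#x} -> {resolve(symbols, addr)}")
```

If I am decoding the marked bytes in the Code line correctly (48 8b 90 90 01 00 00 as mov 0x190(%rax),%rdx), the faulting address is simply RAX + 0x190 (ffffffff80444240 + 0x190 = ffffffff804443d0). That would make it a bad data pointer being followed by the code at RIP, not a code address, so a lookup like this is probably more useful on RIP and the Call Trace entries than on the address in CR2.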
That's it. For what it's worth, that kernel log does not appear in either dmesg or the system logs (which I ship). To get it I had to connect with a serial cable.
The previous line is 6 hours earlier, like so:
Aug 30 21:54:05.391503 [11558.255230] kmodloader: - ccp-crypto - 0
Aug 31 04:20:47.054118 [34757.315320] BUG: unable to handle page fault for address: ffffffff804443d0
Have we ever reviewed the complete configuration just to make sure everything is valid?
Are there any non-default packages installed in this system?
Please connect to your OpenWrt device using ssh and copy the output of the following commands and post it here using the "Preformatted text </> " button:
Remember to redact passwords, MAC addresses and any public IP addresses you may have:
Well, I doubt that there would be anything Windows-specific since I think the hardware is mainly meant to run BSD, but there is potentially a firmware update:
Normally, an OpenVPN tunnel should be unmanaged (option proto 'none'). You may be creating a conflict here by having this address assigned here.
The DHCP server has an invalid dhcp_option line:
It should be something like this:
list dhcp_option '6,172.24.42.1'
I'm not sure about the port line -- I've never seen it used like that, and I don't know if it is valid. Any particular reason you are trying to use a non-standard DNS port?
With the OpenVPN setup in the firewall zone:
You have a choice... either remove the network interface (discussed earlier) entirely from both the network config and this zone, and declare the device here in the firewall, or remove the device line from the firewall and leave the network in place. Don't specify both, though.
OK, so unless I hear otherwise I'm going to flash the firmware with version 4.17.0, which seems to be the most recent one supported on my hardware, which turns out to be an APU1.
FYI: The port 1053 line basically moves the stock DNS server (dnsmasq) out of the way in favour of unbound. I had to do that because my ISP's DNS servers sometimes return IPv6 addresses, even though the ISP does not give me an IPv6 address, so they are unreachable.
It could... not sure what happens in those situations.
But it's also clear that you have other stuff going on. If the crashes continue, you should consider starting from a completely fresh configuration (omit Docker, OpenVPN and anything else you've installed) so that you can run a near-default OpenWrt configuration. If the crashes still continue in that state, you may have a hardware issue (or an issue with the BIOS settings, etc.), since OpenWrt should be stable in a (near) default state. If things work for, say, 2x - 10x the typical crash interval (so if you see a crash every day, wait 2-10 days before moving on, to gain confidence that everything is stable), then install one thing (OpenVPN, Docker, etc.), get it configured, and let it soak for another several days or so. Then add the next, etc.
You can probably safely install the wifi drivers as part of the near-default config if you need them, but if you have other APs, you might want to just lean on those instead while you're working this stuff out.
I suspect OpenVPN more than the Docker config (Home Assistant), so I'll start with OpenVPN. The only reason is that the crashing started before I added Home Assistant to the rig.