Router Rebooting Daily

I have a TP-Link Archer A7 running OpenWrt 23.05.2 r23630-842932a63d.

The router has been rebooting pretty much daily (but not at the same time). I set up ssh and have been running ssh root@ip uptime >> reboot_log.txt to record the uptime and it reboots after about 24-30 hours.

How can I see why it is rebooting? Is there a way to get a persistent log or run a command to see why it rebooted?

  • syslog down the pipe to some machine
  • check if crashlog support available
2 Likes

Anything in your crontab (I understand it is not at a regular interval). Anything in your /etc/rc.local? Wondering about hardware issues (power supply, or SoC). 23.05.3 is out, might not hurt to update and see if your situation changes. I do not know about keeping logs in a persistent fashion. I think there's a way to have the log sent to an external box.

2 Likes
root@OpenWrt:/etc# crontab -l
*/5 * * * * scp /tmp/dhcp.leases root@10.0.2.2:/tmp/dhcp.leases

This is in there. IIRC, it is to make the status page on the access point show the correct names in the wireless status page.

No.

Maybe. I have a TP-Link Archer C7 as an access point. I could swap the PSUs to see if that matters. I could also try the update.

It sounds like you know what you're talking about. But I don't :smile:. Is this or the crashlog documented somewhere?

You need to collect crash logs via serial. They disappear at boot.
You could log `free | logger" via crontab to feed logs better suspecting OOM as legit cause of reboot in otherwise well functioning device.

1 Like

If crashlog available, it should persist at:

/sys/kernel/debug/crashlog

check your bootlog:

dmesg | grep -i crashlog

if built to support.

syslog down the pipe will be a bit hit and miss, probably more of the latter.

1 Like

I don't have anything in /sys/kernel/debug/crashlog. The dmesg grep didn't find anything. So I guess I don't have support for that.

I did set up a syslog-ng docker container and I found the spot in luci to set that to forward to the other machine. That may catch something. I'll let you know in a day or so :slight_smile:

I am using ssh to pull the uptime and (as of today) the load averages from both openwrt machines. I can grab the results from free or /proc/meminfo to record those too.

Ok, waiting on news in 24-30h :wink:

2 Likes

The bootlog example at A7V5 has:

[ 0.265726] Crashlog allocated RAM at address 0x3f00000

but maybe that was a custom build with things enabled?

1 Like

Sounds reasonable technical proposal for big RAM devices.

1 Like

I'm not sure. This is what I get from my command line:

root@OpenWrt:~# ls /sys/kernel/debug/
ag71xx              clear_warn_once     devices_deferred    gpio                lru_gen_full        mtd                 regulator
bdi                 clk                 extfrag             ieee80211           memblock            pinctrl             sleep_time
block               device_component    fault_around_bytes  lru_gen             mips                regmap              usb
root@OpenWrt:~# dmesg | grep -i crashlog
root@OpenWrt:~#

I'll save the effort of pasting the whole dmesg output, but it looks like it is complete from startup.

yes, @anomeome did custom build and has the crashlog at fixed physical memory location. would be cool to have during each platforms first year util dust settles at least....

1 Like

Well, I have the uptime, MemFree kB, load avg and syslog recorded (which is mostly just reports of the ssh in for the other metrics). Hopefully we can find something in the next crash or two. If not, I can try updating. If not that, then I will look into mounting a USB drive and pointing syslog there.

1 Like

Well, between 1:08 and 1:12PM, it rebooted. The last measurement before it rebooted was:

Load Avg: 0.66
Available memory: 24,322kB

There is a spike in load just before the reboot. The max was 6.19 at 1:05PM.

The free memory looks like it slowly climbs down until it cleans up after itself. It does that a few times:

There are several data points in the 17MB-20MB range before the crash. These samples aren't perfect. They involve sshing in, and running processes. So they may be affecting the memory while measuring it.

The syslog has the usual messages reporting sshes before the crash:

May  7 13:08:02 OpenWrt dropbear[12035]: Child connection from X,X,X,X:53524
May  7 13:08:03 OpenWrt dropbear[12035]: Pubkey auth succeeded for 'root' with ssh-ed25519 key SHA256:blah from X.X.X.X:53524
May  7 13:08:03 OpenWrt dropbear[12035]: Exit (root) from <X.X.X.X:53524>: Disconnect received

After the reboot, the messages look nominal. The timestamps are a little bonkers (probably until NTP is up?):

May  6 14:44:57 OpenWrt logread[1082]: Logread connected to X.X.X.X:514 via udp
May  6 14:44:58 OpenWrt kernel: [   45.376894] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
May  6 14:44:58 OpenWrt netifd: wan (2591): udhcpc: started, v1.36.1
May  6 14:44:58 OpenWrt dropbear[2594]: Child connection from X.X.X.X:57116
May  6 14:44:58 OpenWrt odhcp6c[2592]: Failed to send RS (Address not available)
May  6 14:44:59 OpenWrt netifd: radio0 (2523): WARNING: Variable 'data' does not exist or is not an array/object
May  6 14:44:59 OpenWrt netifd: radio1 (2524): WARNING: Variable 'data' does not exist or is not an array/object
May  6 14:45:00 OpenWrt crond[2087]: USER root pid 2688 cmd scp /tmp/dhcp.leases root@X.X.X.X:/tmp/dhcp.leases
May  6 14:45:00 OpenWrt netifd: wan (2591): udhcpc: broadcasting discover
May  6 14:45:00 OpenWrt netifd: wan (2591): udhcpc: broadcasting select for X.X.X.X, server 96.113.84.150
May  6 14:45:00 OpenWrt netifd: wan (2591): udhcpc: lease of X.X.X.X obtained from X.X.X.X, lease time 86400
May  6 14:45:00 OpenWrt dropbear[2594]: Pubkey auth succeeded for 'root' with ssh-ed25519 key SHA256:blah from X.X.X.X:57116
May  6 14:45:01 OpenWrt procd: - init complete -
May  6 14:45:01 OpenWrt dropbear[2594]: Exit (root) from <X.X.X.X:57116>: Disconnect received
May  6 14:45:01 OpenWrt firewall: Reloading firewall due to ifup of lan (br-lan)
May  6 14:45:01 OpenWrt hostapd: Set new config for phy phy0:
May  6 14:45:01 OpenWrt dropbear[2754]: Child connection from X.X.X.X:53920
May  6 14:45:01 OpenWrt wpa_supplicant[1848]: Set new config for phy phy0
May  6 14:45:02 OpenWrt netifd: Interface 'wan' is now up
May  6 14:45:02 OpenWrt dnsmasq[1]: reading /tmp/resolv.conf.d/resolv.conf.auto
...
May  6 14:45:02 OpenWrt wpa_supplicant[1848]: Set new config for phy phy1
May  6 14:45:03 OpenWrt odhcpd[2023]: No default route present, overriding ra_lifetime!
May  6 14:45:04 OpenWrt odhcpd[2023]: No default route present, overriding ra_lifetime!
May  7 13:12:42 OpenWrt odhcpd[2023]: No default route present, overriding ra_lifetime!
May  7 13:12:43 OpenWrt dnsmasq[1]: read /etc/hosts - 12 names
May  7 13:12:43 OpenWrt dnsmasq[1]: read /tmp/hosts/dhcp.cfg01411c - 30 names
May  7 13:12:43 OpenWrt dnsmasq[1]: read /tmp/hosts/dhcp.cfg01411c.2586 - 0 names
May  7 13:12:43 OpenWrt dnsmasq[1]: read /tmp/hosts/odhcpd - 2 names

Here is the syslog at 1:04/1:05 when the load spiked:

May  7 13:04:43 OpenWrt hostapd: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5
May  7 13:04:43 OpenWrt hostapd: phy1-ap3: AP-STA-DISCONNECTED XX:XX:XX:XX:XX
May  7 13:04:44 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: did not acknowledge authentication response
May  7 13:04:44 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: did not acknowledge authentication response
May  7 13:04:44 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: did not acknowledge authentication response
May  7 13:04:44 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: did not acknowledge authentication response
May  7 13:04:44 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: did not acknowledge authentication response
May  7 13:04:44 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: did not acknowledge authentication response
May  7 13:04:44 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: did not acknowledge authentication response
May  7 13:04:44 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: did not acknowledge authentication response
May  7 13:04:44 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: did not acknowledge authentication response
May  7 13:05:00 OpenWrt crond[2085]: user root: process already running: scp /tmp/dhcp.leases root@X.X.X.X:/tmp/dhcp.leases
May  7 13:05:10 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: authenticated
May  7 13:05:10 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: associated (aid 1)
May  7 13:05:10 OpenWrt hostapd: phy1-ap3: AP-STA-CONNECTED XX:XX:XX:XX:XX auth_alg=open
May  7 13:05:10 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX WPA: pairwise key handshake completed (RSN)
May  7 13:05:10 OpenWrt hostapd: phy1-ap3: EAPOL-4WAY-HS-COMPLETED XX:XX:XX:XX:XX
May  7 13:05:10 OpenWrt dnsmasq-dhcp[1]: DHCPDISCOVER(br-lan) XX:XX:XX:XX:XX
May  7 13:05:10 OpenWrt dnsmasq-dhcp[1]: DHCPOFFER(br-lan) Y.Y.Y.Y XX:XX:XX:XX:XX
May  7 13:05:13 OpenWrt dnsmasq-dhcp[1]: DHCPDISCOVER(br-lan) XX:XX:XX:XX:XX
May  7 13:05:13 OpenWrt dnsmasq-dhcp[1]: DHCPOFFER(br-lan) Y.Y.Y.Y XX:XX:XX:XX:XX
May  7 13:05:16 OpenWrt dnsmasq-dhcp[1]: DHCPDISCOVER(br-lan) XX:XX:XX:XX:XX
May  7 13:05:16 OpenWrt dnsmasq-dhcp[1]: DHCPOFFER(br-lan) Y.Y.Y.Y XX:XX:XX:XX:XX
May  7 13:05:21 OpenWrt dnsmasq-dhcp[1]: DHCPDISCOVER(br-lan) XX:XX:XX:XX:XX
May  7 13:05:21 OpenWrt dnsmasq-dhcp[1]: DHCPOFFER(br-lan) Y.Y.Y.Y XX:XX:XX:XX:XX
May  7 13:05:22 OpenWrt dnsmasq-dhcp[1]: DHCPREQUEST(br-lan) Y.Y.Y.Y XX:XX:XX:XX:XX
May  7 13:05:22 OpenWrt dnsmasq-dhcp[1]: DHCPACK(br-lan) Y.Y.Y.Y XX:XX:XX:XX:XX
May  7 13:05:33 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: authenticated
May  7 13:05:33 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX IEEE 802.11: associated (aid 2)
May  7 13:05:33 OpenWrt hostapd: phy1-ap3: AP-STA-CONNECTED XX:XX:XX:XX:XX auth_alg=open
May  7 13:05:33 OpenWrt hostapd: phy1-ap3: STA XX:XX:XX:XX:XX WPA: pairwise key handshake completed (RSN)
May  7 13:05:33 OpenWrt hostapd: phy1-ap3: EAPOL-4WAY-HS-COMPLETED XX:XX:XX:XX:XX
May  7 13:05:34 OpenWrt dnsmasq-dhcp[1]: DHCPDISCOVER(br-lan) XX:XX:XX:XX:XX
May  7 13:05:34 OpenWrt dnsmasq-dhcp[1]: DHCPOFFER(br-lan) Y.Y.Y.Y XX:XX:XX:XX:XX
May  7 13:05:34 OpenWrt dnsmasq-dhcp[1]: DHCPREQUEST(br-lan) Y.Y.Y.Y XX:XX:XX:XX:XX
May  7 13:05:34 OpenWrt dnsmasq-dhcp[1]: DHCPACK(br-lan) Y.Y.Y.Y XX:XX:XX:XX:XX host

I don't see any smoking guns there, unfortunately. Maybe y'all have a different perspective.

I can:

  • Get another data point. Maybe I'll see more in the next one.
  • Update
  • reboot in the middle of the night on a cron or something. This is a bad workaround. It would be interesting if it still didn't fix the issue.
  • Mount a USB drive or jffs2 to have the syslog written to persistent storage
  • Flash with a debug build that has the crashlog enabled?

Add zram-swap and monitor swap usage, seems oom

1 Like
sysctl kernel.panic
kernel.panic = 3
1 Like
root@OpenWrt:~# sysctl kernel.panic
kernel.panic = 3

Added. If it is OOM, I would expect it to at least fail after more hours of runtime. If not, then it won't change it much. So it is a valid test, if not a workaround.

Right, could try setting that tunable to 0 if one is willing to reboot manually on a panic.

1 Like