Spontaneous reboot LEDE 17.01.4 on WNDR3800

I just noticed that I seem to be getting spontaneous reboots on my Netgear WNDR3800 running LEDE 17.01.4.
The network went down briefly, and when I checked the uptime it was only a couple of minutes. I get a new external IP on every reboot, and since I log these I can see that this has been happening every few days.

A strange thing is that my LuCI system log doesn't show any of the normal kernel boot messages at the point of reboot:

Mon Mar  5 13:55:42 2018 daemon.info hostapd: wlan0: STA 78:e8:b6:f8:f2:50 IEEE 802.11: associated (aid 1)
Mon Mar  5 13:55:42 2018 daemon.notice hostapd: wlan0: AP-STA-CONNECTED 78:e8:b6:f8:f2:50
Mon Mar  5 13:55:42 2018 daemon.info hostapd: wlan0: STA 78:e8:b6:f8:f2:50 RADIUS: starting accounting session 05C086880954C381
Mon Mar  5 13:55:42 2018 daemon.info hostapd: wlan0: STA 78:e8:b6:f8:f2:50 WPA: pairwise key handshake completed (RSN)
Tue Mar  6 11:36:31 2018 daemon.info dnsmasq-dhcp[1984]: DHCPDISCOVER(br-lan) 78:e8:b6:f8:f2:50 
Tue Mar  6 11:36:31 2018 daemon.info dnsmasq-dhcp[1984]: DHCPOFFER(br-lan) 192.168.1.3 78:e8:b6:f8:f2:50 
Tue Mar  6 11:36:31 2018 daemon.info dnsmasq-dhcp[1984]: DHCPREQUEST(br-lan) 192.168.1.3 78:e8:b6:f8:f2:50 
Tue Mar  6 11:36:31 2018 daemon.info dnsmasq-dhcp[1984]: DHCPACK(br-lan) 192.168.1.3 78:e8:b6:f8:f2:50 mikezterouter
Tue Mar  6 11:36:31 2018 daemon.info dnsmasq-dhcp[1984]: DHCPDISCOVER(br-lan) 00:0e:08:c1:e1:1b 
Tue Mar  6 11:36:31 2018 daemon.info dnsmasq-dhcp[1984]: DHCPOFFER(br-lan) 192.168.1.12 00:0e:08:c1:e1:1b 
Tue Mar  6 11:36:31 2018 daemon.info dnsmasq-dhcp[1984]: DHCPREQUEST(br-lan) 192.168.1.12 00:0e:08:c1:e1:1b 
Tue Mar  6 11:36:31 2018 daemon.info dnsmasq-dhcp[1984]: DHCPACK(br-lan) 192.168.1.12 00:0e:08:c1:e1:1b ciscovoip
Tue Mar  6 11:37:02 2018 daemon.info dnsmasq[1984]: read /etc/hosts - 1 addresses
Tue Mar  6 11:37:02 2018 daemon.info dnsmasq[1984]: read /tmp/hosts/dhcp.cfg02411c - 9 addresses

In the log above the reboot happened just before the Tue Mar 6 11:36 messages.

How can I dig further into finding the cause of this?

Three comments:

  • The kernel and system logs do not survive a reboot, as they are kept on a ramdisk. So you will not see anything about a possible crash in the log after a reboot.

  • What you see in the log is the router first starting with the date/time of the latest file timestamp in the /etc directory. Then, once it has network connectivity, the router adjusts its time from the internet with NTP, so there is a jump in the time. (You could now run the "touch /etc/banner" command and reboot, and the router would show as starting from that moment's timestamp.)

  • WNDR3700v1/v2/3800 routers have been really stable for the past 7-8 years. I have them all. There is nothing specific to 17.01.4 that would cause a crash on the WNDR3800. But there may naturally be something in your packages etc. that reacts badly to something.
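
To see which timestamp the router would start from after a reboot, you can look for the most recently modified file under /etc. The following is only a minimal sketch, assuming a shell whose test builtin supports -nt (BusyBox ash does). The name newest_file is a helper made up for this example, not an OpenWrt command, and this is not necessarily how the boot scripts themselves do it.

```shell
#!/bin/sh
# newest_file [DIR]: print the most recently modified regular file under DIR.
# DIR defaults to /etc. Hypothetical helper for illustration only.
newest_file() {
    find "${1:-/etc}" -type f | {
        newest=""
        while IFS= read -r f; do
            # Keep f if it is the first file seen or newer than the current pick.
            if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
                newest="$f"
            fi
        done
        printf '%s\n' "$newest"
    }
}
```

On the router, date -r "$(newest_file /etc)" should then print the time the system clock falls back to until NTP corrects it.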

SSH into the router and run the following:

dmesg | grep 'warn\|err'

Ah, tricky. OK, so never mind about the system log then.

I recently added a number of packages for 3G connectivity so they are prime suspects. But how do I find the cause of crashes when logs are not persisted?

Comes out empty:

~# dmesg | grep 'warn\|err'
~#

You could make the logs persistent by logging to an external log server over the network, but that may not work well with crashes, as the network functionality also dies in the crash.

Or you could open the case and attach a serial cable so that the console output is shown on your PC.
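
For the network logging option, a minimal sketch of the router-side setup follows. It assumes the standard log_ip, log_port and log_proto options in the system section of the UCI configuration; the function name enable_remote_syslog and the address 192.168.1.2 used below are placeholders made up for this example.

```shell
#!/bin/sh
# enable_remote_syslog SERVER_IP [PORT]
# Point the system log at a remote syslog receiver over UDP.
# Hypothetical helper; SERVER_IP is whatever machine collects the logs.
enable_remote_syslog() {
    server_ip="$1"
    port="${2:-514}"
    if [ -z "$server_ip" ]; then
        echo "usage: enable_remote_syslog SERVER_IP [PORT]" >&2
        return 1
    fi
    # Quote the keys so the shell never treats [0] as a glob pattern.
    uci set "system.@system[0].log_ip=$server_ip" &&
    uci set "system.@system[0].log_port=$port" &&
    uci set "system.@system[0].log_proto=udp" &&
    uci commit system
}
```

After running enable_remote_syslog 192.168.1.2 on the router, restart the log service with /etc/init.d/log restart and have a syslog receiver listening on that UDP port on the PC. Messages sent in the final moments before a crash may still be lost.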

But in general, the 17.01 releases and ar71xx routers like the 3800 are well established. It is hard to believe that there would be a widespread crash bug that had not surfaced earlier.

One typical reason for crashes is memory exhaustion. Possibly some log or temp file fills /tmp, which is actually held in RAM. When there is no free memory, the router crashes. Commands like free or top may help in noticing that.
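
Since free and top only show the current state, it can help to record memory figures periodically so that the last sample before a reboot is available somewhere. A rough sketch, assuming /proc/meminfo has the usual MemFree and MemAvailable lines (a value prints as 0 if its line is missing); mem_snapshot and tmp_usage_kb are names made up for this example.

```shell
#!/bin/sh
# mem_snapshot [FILE]: print MemFree and MemAvailable in kB.
# FILE defaults to /proc/meminfo; another path is only useful for testing.
mem_snapshot() {
    awk '
        $1 == "MemFree:"      { free = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { printf "free_kb=%d avail_kb=%d\n", free, avail }
    ' "${1:-/proc/meminfo}"
}

# tmp_usage_kb [DIR]: print the space used under DIR in kB (default /tmp).
tmp_usage_kb() {
    du -sk "${1:-/tmp}" 2>/dev/null | awk '{ print $1 }'
}

# Example loop (commented out so that sourcing this file has no side effects):
# while true; do
#     logger -t memwatch "$(mem_snapshot) tmp_kb=$(tmp_usage_kb)"
#     sleep 60
# done &
```

The samples go to the normal system log, so they only survive a reboot if that log is also sent somewhere off the router.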

Assuming you installed the supported stable release, I think if it were my router, I would reset LEDE to factory settings and re-configure from scratch.

Looking into this again, and asking for advice: is it better to upgrade to the latest 19.07.3 release before trying to hunt down the problem? (I will create a new topic in that case.)

The short story is that the router's spontaneous reboots seem to be related to network traffic over the 3G/LTE modem connected over NCM. I recently ran a speed test from a PC on the network, and the router rebooted by itself in the middle of the test. Ncat syslog logging showed no warning signs, and both the router and the modem booted up nicely after the reboot, so it doesn't seem to be caused by a crashing modem CPU or similar. The router normally has 85 MB of free RAM, so that looks fine, but could the OpenWrt 3G/NCM stack maybe go into some loop and quickly consume memory?

Either way, I'm looking for advice on whether I should stay on my stable OpenWrt 17 installation while analyzing the problem, or whether I should take the leap and install the latest 19.07.3 + ath79 release on my WNDR3800 first. (How stable is OpenWrt in general, and the NCM stack in particular, for this device in the new release?)

Since your device is 16/128 (16 MB flash, 128 MB RAM) and does appear to support OpenWrt 19.07.x, it would be highly advisable to upgrade. When you do the upgrade, do not keep settings. Especially if you change to the ath79 target (which is recommended), the old settings may not be compatible with your new running configuration. Take a backup of your current config, but don't restore it directly -- just use it as a reference as you reconfigure the router.

Now, all of that said, the upgrade may or may not resolve your issue with the modem -- totally speculating here, but it is plausible that it is a power issue -- the router may have an internal brown-out on the power system due to the increased load from the USB modem. I'd recommend trying two different things to resolve it...

  1. Use a larger main power supply. Again, guessing here, but many routers use 12V power adapters. If you have one handy with higher current capacity, that would be preferable. So if you are using a 12VDC @ 1A supply, try swapping that out with a 12V @ 2A supply.
  2. Try connecting a powered USB hub between the router's USB port and the modem. This way the modem will draw power from the USB hub instead of from the internal power of the router itself.
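
To put rough numbers on the two suggestions above, power is simply volts times amps. The sketch below compares the example adapter ratings with the 500 mA at 5 V that a standard USB 2.0 port is specified to supply. The helper name watts is made up for this example, and the figures are the ones from the example above, not measurements of any particular router or modem.

```shell
#!/bin/sh
# watts VOLTS AMPS: print the power in watts with one decimal place.
# Hypothetical helper for illustration only.
watts() {
    awk -v v="$1" -v a="$2" 'BEGIN { printf "%.1f\n", v * a }'
}

# Example calls (commented out so that sourcing this file prints nothing):
# watts 12 1     # 12 V @ 1 A adapter
# watts 12 2     # 12 V @ 2 A adapter
# watts 5 0.5    # USB 2.0 port limit
```

How much of the adapter's budget actually reaches the USB port depends on the router's internal power design, so this only shows the headroom on paper.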

Big thanks for the insight that the router's own internal voltage level could drop due to the power load from the modem! I hadn't thought about that, but it makes total sense and would of course reboot the router.

I tried putting a powered USB hub between the modem and the router but ran into some other issues, so for now I'll try your suggestion and find a more powerful power adapter. I'll keep everything else constant meanwhile to be able to see if this by itself was the fix! (I will otherwise go the upgrade route.)

Thanks again :)

I agree that you should try one thing at a time. But it is a good idea to update your OpenWrt version anyway, so please do it soon (development on the 17.x branch ended a long time ago and there are potential security vulnerabilities that could be problematic, not to mention general feature and stability improvements in the subsequent versions).

As for the power issue -- a larger main power supply should hopefully help. It does still depend on the power supply design within the router itself as well as how much power it is supposed to be able to provide on the USB port, but without a doubt this is worth a try.

I'm not sure why you would have had problems when using a USB hub -- maybe a different unit would give you better results?

Yes, the USB hub problems were unexpected. For now I will try to keep things as simple as possible and not throw too many new things into the equation. Without a hub there could be two different power-related issues: either the modem losing power (on the live router this is seen as the modem USB device disconnecting in the log) or the router losing power (which would reboot it). My next step will be to examine the latter before going further with anything else.
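
Those two cases leave different traces, so a small check can help tell them apart. This is a sketch only, assuming the kernel writes its usual "USB disconnect" message when the modem drops and that /proc/uptime is available; count_usb_disconnects and uptime_seconds are names made up for this example.

```shell
#!/bin/sh
# count_usb_disconnects: read log text on stdin and print how many lines
# mention a USB disconnect. grep -c exits non-zero when the count is 0,
# so the result is forced to success after the count has been printed.
count_usb_disconnects() {
    grep -c 'USB disconnect' || :
}

# uptime_seconds [FILE]: print whole seconds since boot.
# FILE defaults to /proc/uptime; another path is only useful for testing.
uptime_seconds() {
    cut -d. -f1 "${1:-/proc/uptime}"
}
```

On the router, logread | count_usb_disconnects gives the number of modem drops since boot, while a small uptime_seconds value right after an outage points to the router itself having restarted.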