Belkin RT3200 intermittently crashing, how can I better troubleshoot?

Hi all,

I've been really enjoying this router with OpenWrt. I've been getting some crashes lately though - at least what I believe are crashes. This happens on both 22.03.6 and 23.05.2, which I recently upgraded to. The box itself does not appear to shut off, and the "Internet" light is still blinking. However, the ssh service becomes unavailable and none of my hard-wired devices can see the router. Based on some of the other forum posts for this device, this does not seem to be a common occurrence, as other users can still access their boxes over ssh when a "crash" occurs.

Unfortunately, I don't have a lot of time to debug when this happens as I'm often working and need to reboot it quickly.

I've set up remote logging to one of my servers in my LAN. The only weird thing I've managed to capture is this kernel log, which shows an out-of-order timestamp. The crash happened around 2024-03-14 17:29, at which point I promptly power cycled the router to reconnect to a meeting.

2024-03-14T16:09:02+00:00 cheese-router kernel: [464627.974246] br-lan: port 2(lan2) entered forwarding state
2024-03-10T22:11:21+00:00 cheese-router kernel: [   21.534772] IPv6: ADDRCONF(NETDEV_CHANGE): wl0-ap0: link becomes ready
2024-03-10T22:11:21+00:00 cheese-router kernel: [   21.541494] br-lan: port 6(wl0-ap0) entered blocking state
2024-03-10T22:11:21+00:00 cheese-router kernel: [   21.547056] br-lan: port 6(wl0-ap0) entered forwarding state
2024-03-10T22:11:22+00:00 cheese-router kernel: [   21.688710] IPv6: ADDRCONF(NETDEV_CHANGE): wl0-ap1: link becomes ready
2024-03-14T17:30:34+00:00 cheese-router kernel: [   96.034885] IPv6: ADDRCONF(NETDEV_CHANGE): wl1-ap0: link becomes ready
2024-03-14T17:30:34+00:00 cheese-router kernel: [   96.041733] br-lan: port 5(wl1-ap0) entered blocking state
2024-03-14T17:30:34+00:00 cheese-router kernel: [   96.047246] br-lan: port 5(wl1-ap0) entered forwarding state
2024-03-14T17:33:24+00:00 cheese-router kernel: [  265.464321] br-lan: port 5(wl1-ap0) entered disabled state

This out-of-order timestamp seems to line up with a collectd log entry suggesting that collectd thinks the system time has changed:

2024-03-14T17:29:43+00:00 cheese-router collectd[2972]: Sleeping only 2s because the next interval is 328648.290 seconds in the past!
2024-03-14T17:29:44+00:00 cheese-router collectd[2972]: rrdtool plugin: rrd_update_r failed: /tmp/rrd/cheese-router/cpu-0/percent-system.rrd: opening '/tmp/rrd/cheese-router/cpu-0/percent-system.rrd': No such file or directory
2024-03-14T17:29:44+00:00 cheese-router collectd[2972]: rrdtool plugin: rrd_update_r failed: /tmp/rrd/cheese-router/cpu-0/percent-softirq.rrd: expected 1 data source readings (got 0) from /tmp/rrd/cheese-router/cpu-0/percent-softirq.rrd:...
2024-03-14T17:29:44+00:00 cheese-router collectd[2972]: rrdtool plugin: rrd_update_r failed: /tmp/rrd/cheese-router/interface-br-lan/if_octets.rrd: expected 2 data source readings (got 0) from /tmp/rrd/cheese-router/interface-br-lan/if_octets.rrd:...

Of course, I do not know if this is even significant or related. If it is, then great! It is the only abnormality I found in the logs from around the time of the crash, though. There are probably other messages that would explain the crash that I simply haven't captured with this setup.

I've read I can open up the RT3200 and use the serial port, but with the randomness of the crashes, I need to be able to have a sort of "set it and check it later" setup. Would this be at all possible? Are there any other tips and tricks out there for capturing these things?
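For the "set it and check it later" part, here is a minimal sketch of an unattended console logger that could run on any always-on machine with a USB serial adapter attached to the router's header. It only copies console lines to a file and prefixes each with the host's own clock, so the last lines before a hang survive the power cycle. The function name `log_console`, the device path `/dev/ttyUSB0`, and the 115200 baud rate are assumptions, not something confirmed in this thread; check the device page for the actual serial settings.

```python
from datetime import datetime, timezone


def log_console(source, sink, clock=lambda: datetime.now(timezone.utc)):
    """Copy lines from a binary stream to a text sink.

    Each line is prefixed with a timestamp taken on the logging host,
    because the router's own clock cannot be trusted around a crash.
    """
    for raw in source:  # iterating a binary stream yields b"\n"-terminated lines
        # Serial consoles can emit garbage bytes; replace them instead of failing.
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        sink.write(f"{clock().isoformat(timespec='seconds')} {text}\n")
        sink.flush()  # flush every line so nothing is lost if the host is interrupted
```

On a Linux host, one way to use it (again, assumed paths and settings) would be to put the adapter into raw mode with `stty -F /dev/ttyUSB0 115200 raw -echo`, then call `log_console(open("/dev/ttyUSB0", "rb", buffering=0), open("rt3200-console.log", "a"))` and leave it running.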

For what it's worth, the router is only plugged into a surge protector - no UPS or anything. Maybe this is what is causing the instability, but I would imagine there's quite a bit of resiliency in these devices.

Thank you for the help!

Does that mean the wireless clients still work? If so, does SSH work through those wireless clients?

The out-of-order timestamp is expected for routers without a built-in RTC. When the router boots, its time will be completely wrong until the NTP client synchronizes the clock. Since you said the crash happened around 2024-03-14 17:29, this matches up perfectly with your collectd log. I see no abnormalities here and I don't think it is relevant to your problem.
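To make that reasoning concrete, here is a rough sketch that reads remote syslog lines in the format posted above and reports a reboot whenever the kernel uptime counter in the brackets goes backwards. It ignores wall-clock stamps that are older than the last line before the reboot (the clock is stale until NTP syncs) and estimates the boot time from the first trustworthy line. The function names and the returned dictionary layout are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches e.g.:
# 2024-03-14T16:09:02+00:00 cheese-router kernel: [464627.974246] message
LINE_RE = re.compile(r"^(\S+)\s+\S+\s+kernel:\s+\[\s*(\d+\.\d+)\]")


def parse(line):
    """Return (wall_clock, uptime_seconds) for a kernel log line, else None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return datetime.fromisoformat(m.group(1)), float(m.group(2))


def find_reboots(lines):
    """List reboots, detected as the kernel uptime counter going backwards."""
    events = [e for e in (parse(line) for line in lines) if e is not None]
    reboots = []
    for i in range(1, len(events)):
        last_wall, last_up = events[i - 1]
        if events[i][1] >= last_up:
            continue  # uptime still increasing: same boot
        booted_at = None
        prev_up = -1.0
        for wall, up in events[i:]:
            if up < prev_up:
                break  # ran into the next reboot without a synced clock
            prev_up = up
            if wall >= last_wall:
                # Clock has caught up, so wall time minus uptime is the boot time.
                booted_at = wall - timedelta(seconds=up)
                break
        reboots.append({"last_seen": last_wall, "booted_at": booted_at})
    return reboots
```

Fed the kernel lines from the first post, this should report a single reboot, with the stale 2024-03-10 stamps skipped and the boot placed shortly before 17:29 on 2024-03-14, which agrees with the power cycle described.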

Nope. Wireless routers are computers at the end of the day, and computers are famously susceptible to electrical power quality problems. That would fit the random nature of the issues you're seeing. It could be a failing wall wart, mains under-voltage, or even EMI from some other high-power device.

This is what I haven't been able to test so far. When it happens, I quickly check whether I can ssh in, and I can see that my wired Windows device shows it is not connected to a network. Like I mentioned, the crashes occur at times when I cannot effectively debug. I'm hoping to get another crash at a time when I can test with a wireless client.
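One way to get that answer without being at the keyboard would be to leave a small probe running on a wireless client (and another on a wired one). The sketch below tries a TCP connection to the router's SSH port every few seconds and writes a line only when the state changes, so the log afterwards shows when each client lost the router. The names `tcp_reachable`, `outages`, and `watch`, the log file name, and the example address 192.168.1.1 are all assumptions for illustration.

```python
import socket
import time
from datetime import datetime, timezone


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outages(samples):
    """Collapse (timestamp, reachable) samples into (went_down, came_back) pairs.

    came_back is None if the target was still down at the last sample.
    """
    result = []
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            result.append((down_since, ts))
            down_since = None
    if down_since is not None:
        result.append((down_since, None))
    return result


def watch(host, port=22, interval=10.0, log_path="router-probe.log"):
    """Probe forever, appending one line to log_path on every state change."""
    last = None
    with open(log_path, "a") as log:
        while True:
            ok = tcp_reachable(host, port)
            if ok != last:
                stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
                log.write(f"{stamp} {host}:{port} {'up' if ok else 'DOWN'}\n")
                log.flush()
                last = ok
            time.sleep(interval)
```

Running `watch("192.168.1.1")` on both a wired and a wireless machine and comparing the two logs after the next crash would show whether wireless clients lose the router at the same moment as wired ones.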

I am planning to buy a UPS to hopefully mitigate any chances of this. If that doesn't work, then I assume it could be a failing wall wart. I guess there's not much else to do here beyond that. Thank you for the help.

I wanted to provide an update to this since I still have not figured out what's going on.

I bought a UPS which seemed to help (uptime of about 31 days) but the device ended up crashing in the exact same way. I bought a new wall wart that matched the power requirements of the device, and that lasted about 20 days. I swapped out the new power supply for the old one, and it proceeded to only last about 8 days before another crash. There is no other device near this one that could cause EMI as far as I know unless my downstairs neighbors are doing something weird or the cable modem is causing issues.

I've also determined that the device fully crashes, since wireless clients lose access to the local network, too. What's odd is that before each of the past two crashes, a PC wired to the device's switch lost network connectivity no matter which port it was plugged into. I have to unplug and replug that ethernet cable for it to come back up on the network. Shortly after this happens (12-24 hours), the router crashes.

For grins and giggles, I set up a travel router that I had bought a little while ago (GL.iNet MT-3000) with the exact same network setup I use on the Belkin. I configured this router from scratch and didn't copy any files over. This device also hard crashed with an uptime of about a day. The PC that I mentioned above did not lose connectivity before the router crashed though.

As with all of these crashes, the logs sent to my external log server show nothing of interest and I have to physically power cycle the router.

Since this is happening with two devices, I wonder if it is not configuration related. I do run SQM, but I had disabled it while testing the stability of the new power supply, which I figured would rule it out as the cause. The only other configuration that deviates from the norm is that I have two VLANs set up. The extra VLAN just houses a home server with Jellyfin and the like.

While doing all of this waiting/testing, I've kept top open on my other monitor and I've never experienced any abnormalities in CPU usage or RAM usage before it crashes. I'm going to keep testing with the GL.iNet MT-3000, but I truly don't know what else to troubleshoot or try. I really like OpenWrt so I don't want to stop using it. Perhaps I'm a bit unlucky on the device lottery, who knows!