Watchdogs nightmare

My remote router loses connectivity for days and I don't know why.

Router: ZBT-WE826 with hardware watchdog.
OS: OpenWrt 23.05
WAN: Dual, via WiFi and 3G, with MWAN3 switching interfaces if one loses internet.
LuCI watchdog: Watchcat, which reboots if no pings succeed for 30 min.

I know for sure that at least one WAN interface is always working. So if the router hangs, I don't know why the hardware watchdog does not reboot it. If it is not hung, I don't understand why Watchcat does not reboot it. And if it is not hung and has internet on at least one interface, I don't know why MWAN3 does not switch interfaces.

The router is in a remote location, several hours away. It's full of watchdogs and redundant connections and I can't explain what's wrong with it.

nor can we ...


There must be some hole. I know that mwan3 sometimes switches interfaces when connectivity is lost on one. I also know that the software watchdogs (I haven't mentioned that, apart from Watchcat, I have my own watchdog script) should restart everything, and they don't. So I understand that the router may have hung or there has been a kernel crash (it happens sometimes, I think because of an external USB WiFi receiver, through some driver or power problem), but in that case I would expect the hardware watchdog to restart it. Is it possible that sometimes it doesn't? That is, could a partial kernel failure block the software watchdogs from running while the hardware watchdog still gets refreshed?

I should also mention that when I make the trip to the router, unplug it and plug it back in, it starts working.

I used the WE826 quite often in the past. I did not know there was a hardware watchdog. Please provide details about it.
On other systems, MWAN3 is sufficient to dynamically switch between wan/wifiwan/wwan on a RUT955 (Teltonika). However, I added a custom script to periodically check web connectivity and power-cycle the modem or, in the worst case, trigger a (software) reboot if there has been no web connection for minutes. Until now this has been sufficient in many installs, with very few exceptions. However, I was already considering an external watchdog (a special hardware device) to power-cycle the router externally as a last resort. You might consider this option, which MUST work, because your manual power cycle fixes the problem.
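
A safety net of that kind could be sketched roughly as follows. This is only a minimal sketch under assumptions, not the exact script described above: the ping targets, the failure threshold, the state file path and the function names are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical safety-net check; hosts, threshold and state file are assumptions.
HOSTS="${HOSTS:-8.8.8.8 1.1.1.1}"
STATE="${STATE:-/tmp/net_fail_count}"
MAX_FAILS="${MAX_FAILS:-6}"

# Return 0 if at least one host answers a single ping.
net_ok() {
    for h in $HOSTS; do
        ping -c 1 -W 5 "$h" >/dev/null 2>&1 && return 0
    done
    return 1
}

# Map the number of consecutive failures to an action.
# Prints "none", "restart-wan" (at half the threshold) or "reboot".
next_action() {
    fails=$1
    if [ "$fails" -ge "$MAX_FAILS" ]; then
        echo reboot
    elif [ "$fails" -gt 0 ] && [ "$fails" -eq $((MAX_FAILS / 2)) ]; then
        echo restart-wan
    else
        echo none
    fi
}

# Run one check, update the failure counter, print the action to take.
check_once() {
    fails=$(cat "$STATE" 2>/dev/null || echo 0)
    if net_ok; then
        fails=0
    else
        fails=$((fails + 1))
    fi
    echo "$fails" > "$STATE"
    next_action "$fails"
}

# Example dispatch, e.g. from cron every 5 minutes (interface name "wwan" is assumed):
# case "$(check_once)" in
#     restart-wan) ifdown wwan; ifup wwan ;;
#     reboot)      reboot ;;
# esac
```

Keeping the decision (next_action) separate from the side effects makes it possible to dry-run the logic on a desk before trusting it on a remote box.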


dmesg | grep watch
[ 1.085695] rt2880_wdt 10000120.watchdog: Initialized

Also this guy speaks about it:
https://forum.openwrt.org/t/support-for-zbt-we826-q/50997
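
Besides the dmesg line, it may help to ask procd whether it is actually feeding the watchdog. A minimal sketch, assuming a stock OpenWrt image where procd exposes the `system watchdog` method over ubus; the helper name watchdog_status is made up here.

```shell
# Print the watchdog state as reported by procd,
# or a short note if ubus is not available on this system.
watchdog_status() {
    if command -v ubus >/dev/null 2>&1; then
        ubus call system watchdog
    else
        echo "ubus not available"
    fi
}
```

As far as I understand, if the reported status is running, procd is the process that keeps refreshing the hardware watchdog, so the watchdog would only fire when procd itself stops, not when some other daemon hangs.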

I used to use one of these programmable sockets to power-cycle the router every night, but I thought that with the new router with watchdog hardware it would no longer be necessary.

For days or until you power cycle it?
You are not giving us much to work with.

What purpose is it used for? PTP? What environment is it in?

Is it on a UPS? Does your version use an SD card to expand the Flash? If so, how old is it?
What USB NIC are you using?

What do the kernel and system logs say?

What information leads you to think that?


Then it looks like I used an old version of the WE826, with the MT7620 SoC. Anyway, instead of guessing, first try to narrow down the real problem systematically. The first step would be checking the system log from just before the router became inaccessible. The simplest approach is remote logging, because it is easy to implement. Alternatively, with some more effort but without losing any messages, write the system log to a file on a USB stick. Or just buy the external watchdog to REALLY power-cycle the WE826. The built-in watchdog might not power-cycle the board but simply trigger a software reboot, which is sometimes insufficient for a hanging LTE modem.
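
Both logging options can be set through UCI. A minimal sketch, assuming the standard log_ip, log_port, log_proto and log_file options of /etc/config/system; the function name, the collector address and the file path are placeholders, not values from this setup.

```shell
# Point the system log at a remote syslog collector and at a local file.
# $1: IP of a (hypothetical) syslog collector, e.g. 192.0.2.10
# $2: log file on persistent storage, e.g. a mounted USB stick
setup_logging() {
    server_ip=$1
    log_file=$2
    uci set "system.@system[0].log_ip=$server_ip"
    uci set "system.@system[0].log_port=514"
    uci set "system.@system[0].log_proto=udp"
    uci set "system.@system[0].log_file=$log_file"
    uci commit system
}
```

After something like `setup_logging 192.0.2.10 /mnt/usb/system.log`, the log service has to be restarted (`/etc/init.d/log restart`) for the change to take effect.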


Sorry for the delay. I have traveled to where the router is.

The router saves system.log and my own log on an SD. I don't see anything strange in the logs except WiFi wan interface (wlan0 radio1) is down and shortly after the logs stop exactly on the day and time the router lost connectivity. That is, I understand that from that day and hour the router absolutely crashed. Therefore the connectivity watchdogs stopped working. It also did not allow access via Ethernet cable. It is also notable that the router crashed without error messages or tainted kernel or similar.

Then the problem is clearly that the hardware watchdog has not worked. My suspicion at this point is that there has been a failure in the router's motherboard. Maybe sometimes the external USB WiFi receiver exceeds the USB power and not even the hardware watchdog works.

Possible but not probable. What make and model is it? (Not the chipset, just who made it and what model they call it.)
Is the router plugged into a UPS?
Your router has a very narrow window of temperature and humidity. Is it in a climate-controlled environment?

My 'guess' is a brownout. Do you have an old digital clock you could leave there to check? A brownout is likely to reset a cheap clock. But this is all a moot point if the router is on a UPS.


No UPS. Router indoor.

lsusb:
WIFI: 148f:3070 Ralink Technology, Corp. RT2870/RT3070 Wireless Adapter
MODEM: 0bdb:1911 Ericsson Business Mobile Networks BV

Wifi adapter module is rt2x00usb

... I don't see anything strange in the logs except WiFi wan interface (wlan0 radio1) is down and shortly after the logs stop exactly on the day and time the router lost connectivity. ...
So MWAN3 should have tried to switch over to wwan when wifi-wan went down, and for some reason that did not work either. However, this should have left some traces in the system log. Feel free to post /etc/config/mwan3 and an excerpt from the system log from a minute before and after wifi-wan went down. Note: you mentioned your own watchdog script, and you are also using Watchcat. Are you sure they do not interfere with each other, and do not interfere with mwan3's activities? (mwan3 switches routes and modifies firewall rules.) You might post your own watchdog script, too. However, I think it is a better idea to just let mwan3 do its job and only have a safety net in case mwan3 does not find a working interface. You can detect this either with a simple cron job or, more elegantly, with a smart mwan3.user script that checks the connection events.
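
A starting point for such a mwan3.user script might look like this. It is only a sketch, assuming mwan3 runs /etc/mwan3.user with ACTION, INTERFACE and DEVICE set in the environment (check the comments in the stock file on your version); the function name and the EVENT_LOG path are made up for illustration.

```shell
# Append one line per mwan3 event to a log file, so there is a record of
# every interface state change even if nothing else reacts to it.
# ACTION, INTERFACE and DEVICE are assumed to be provided by mwan3.
log_mwan3_event() {
    line="$(date '+%F %T') mwan3 event: action=${ACTION:-?} iface=${INTERFACE:-?} dev=${DEVICE:-?}"
    echo "$line" >> "${EVENT_LOG:-/tmp/mwan3_events.log}"
}
```

Calling log_mwan3_event at the end of /etc/mwan3.user, with EVENT_LOG pointing at persistent storage, would at least show whether mwan3 noticed wifi-wan going down before the router went silent.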


rt2x00 (as used on mt7620) is not exactly a great wireless driver, it mostly works in client mode, but AP mode is another topic (some required features are missing altogether).
