OpenWrt is crashing once a day

Hello there,

I have a Ubiquiti EdgeRouter X running OpenWrt 18.06.4, and it has been rebooting suddenly for no apparent reason. logread doesn't show anything. Installed packages are up to date.

How can I diagnose the cause? I'm clueless and this is becoming increasingly worrisome for me.

Thanks!

The most definitive way would be with a serial cable: you’d see the last thing the kernel did before it restarted.

Next best would be to run a remote rsyslog service on your laptop/PC and configure OpenWrt to log to it. It might or might not catch the real reason.
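For the OpenWrt side of that, here is a minimal sketch. It assumes the laptop running rsyslog is reachable at an address such as 192.168.1.10 and listens on UDP port 514 (both are placeholders), and the helper name set_remote_log is made up for illustration. log_ip, log_port and log_proto are options of the system section in /etc/config/system.

```shell
# Hypothetical helper: point OpenWrt's logd at a remote syslog server.
# $1 = server IP (assumed, e.g. your laptop), $2 = UDP port (default 514)
set_remote_log() {
    server_ip="$1"
    server_port="${2:-514}"
    uci set "system.@system[0].log_ip=$server_ip"
    uci set "system.@system[0].log_port=$server_port"
    uci set "system.@system[0].log_proto=udp"
    uci commit system
}
```

On the router you would then run something like set_remote_log 192.168.1.10 followed by /etc/init.d/log restart. The laptop still has to be set up to accept remote UDP syslog, which is not shown here.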

Beyond that, I'm not sure there is a way to tell conclusively.

You could also try an elimination approach:

  • Unplug everything from the device and see if it still reboots (check periodically with uptime).
  • If it stays up, add one device back at a time, each for a day or two, until the issue returns. If it does, then you’ve found the cause. (I had an Apple TV that seemed to freak out my old Buffalo router when the Apple TV went to sleep. I ended up just replacing the Buffalo device as it was old, but I'm mentioning it here in case you’ve stumbled on something similar.)
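To make the periodic uptime check less manual, here is a rough sketch. It assumes you sample /proc/uptime now and then (for example over ssh from your PC); the function names are made up for illustration. If a later sample is smaller than an earlier one, the router rebooted in between.

```shell
# Print whole seconds since boot, read from /proc/uptime (Linux only).
# $1 = optional path to read instead of /proc/uptime
uptime_seconds() {
    cut -d. -f1 "${1:-/proc/uptime}"
}

# Succeeds (exit 0) if the current sample ($2) is lower than the previous
# one ($1), i.e. the device must have rebooted between the two samples.
rebooted_between() {
    [ "$2" -lt "$1" ]
}
```

For example, keep the last value in a file on your PC and compare it with the output of ssh root@192.168.1.1 cut -d. -f1 /proc/uptime (the address is an assumption) each time you check.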

Another "trick" is to ssh in from a "desktop" and run logread -f in the terminal window.
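A small sketch of that trick which also keeps a copy on the desktop, so the last lines before a reboot survive in a file. The router address and the helper name are assumptions; it relies only on ssh, logread -f and tee.

```shell
# Hypothetical helper: follow the router's log over ssh and append
# everything to a local file while still showing it in the terminal.
# $1 = router address (assumed), $2 = local file to append to
follow_router_log() {
    ssh "root@$1" logread -f | tee -a "$2"
}
```

Usage would be something like follow_router_log 192.168.1.1 router.log. When the router reboots, the ssh session drops and the tail of router.log shows what was logged last.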

The syslog that I'm seeing right now:

Tue Nov 26 16:10:50 2019 kern.err kernel: [   81.240998] UBIFS error (ubi0:1 pid 5779): 0x801acf04: failed to read inode 2842, error -2
Tue Nov 26 16:10:50 2019 kern.err kernel: [   81.257499] UBIFS error (ubi0:1 pid 5779): 0x801a8f0c: dead directory entry 'README.md', error -2
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.275194] UBIFS warning (ubi0:1 pid 5779): 0x801aff50: switched to read-only mode, error -2
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.292477] CPU: 3 PID: 5779 Comm: git Not tainted 4.14.131 #0
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.304076] Stack : 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.320706]         00000000 00000000 00000000 00000000 00000000 00000001 8ec31a58 53261662
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.337335]         8ec31af0 00000000 00000000 00003720 00000038 804835d8 00000008 00000000
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.353962]         00000000 80530000 0004767d 00000000 8ec31a38 80000000 80550000 8fbe6660
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.370589]         8f701000 8e831600 12ce67c9 00000000 00000001 8029b5a8 003a3faf 003de6af
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.387218]         ...
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.392072] Call Trace:
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.392127] [<804835d8>] 0x804835d8
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.403912] [<8029b5a8>] 0x8029b5a8
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.410847] [<80010090>] 0x80010090
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.417776] [<80010098>] 0x80010098
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.424700] [<8046c57c>] 0x8046c57c
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.431635] [<801a8f0c>] 0x801a8f0c
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.438564] [<801aff50>] 0x801aff50
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.445495] [<801a8f18>] 0x801a8f18
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.452458] [<8011f428>] 0x8011f428
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.459413] [<80120820>] 0x80120820
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.466357] [<801d9c6c>] 0x801d9c6c
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.473303] [<801d9fac>] 0x801d9fac
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.480241] [<8012fd68>] 0x8012fd68
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.487189] [<801da5e0>] 0x801da5e0
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.494122] [<80130494>] 0x80130494
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.501060] [<80053a4c>] 0x80053a4c
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.508010] [<8011f428>] 0x8011f428
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.514958] [<801224f8>] 0x801224f8
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.521888] [<801a6614>] 0x801a6614
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.528840] [<80123130>] 0x80123130
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.535785] [<80124f30>] 0x80124f30
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.542716] [<801dbabc>] 0x801dbabc
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.549666] [<80124bf4>] 0x80124bf4
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.556595] [<8012e9c8>] 0x8012e9c8
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.563530] [<8012506c>] 0x8012506c
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.570458] [<80119978>] 0x80119978
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.577388] [<80119cb0>] 0x80119cb0
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.584328] [<80019578>] 0x80019578
Tue Nov 26 16:10:50 2019 kern.warn kernel: [   81.591266]
Tue Nov 26 16:11:20 2019 kern.err kernel: [  110.906743] UBIFS error (ubi0:1 pid 5): 0x801a22bc: cannot reserve 160 bytes in jhead 1, error -30
Tue Nov 26 16:11:20 2019 kern.err kernel: [  110.924620] UBIFS error (ubi0:1 pid 5): 0x801ab804: can't write inode 16819, error -30
Tue Nov 26 16:11:25 2019 kern.err kernel: [  115.946689] UBIFS error (ubi0:1 pid 5): 0x801a22bc: cannot reserve 160 bytes in jhead 1, error -30
Tue Nov 26 16:11:25 2019 kern.err kernel: [  115.964553] UBIFS error (ubi0:1 pid 5): 0x801ab804: can't write inode 16820, error -30
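To cut a log like this down to the relevant lines, a minimal sketch follows, assuming the log was first saved to a file (for example with logread > /tmp/router.log). The function name is made up. On Linux, error -2 is ENOENT and error -30 is EROFS, which matches the sequence above: an inode lookup fails, UBIFS switches to read-only mode, and later writes are refused.

```shell
# Print only UBIFS errors/warnings from a saved log file and annotate the
# two error codes that appear in this thread.
# $1 = path to the saved log
ubifs_summary() {
    grep -E 'UBIFS (error|warning)' "$1" | sed \
        -e 's/error -2$/error -2 (ENOENT: no such file or directory)/' \
        -e 's/error -30$/error -30 (EROFS: read-only file system)/'
}
```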

That points to corrupted flash memory, which is probably what is causing the kernel to panic.

Yeah, two reboots in less than 4 hours. I ssh'ed into the router and now I'm seeing that the filesystem is mounted read-only. I tried to run fsck, but it doesn't seem to ship with OpenWrt.

I manually rebooted the router and I'm not seeing this log anymore. I think I'll start the "manual debugging" of unplugging devices from it and seeing if that's causing the issue.

If you're lucky, sysupgrading OpenWrt (either the same or a newer version) while NOT keeping settings (nor restoring a potentially defective backup) might fix the situation, as that rewrites the kernel and rootfs and creates a new overlay. If that doesn't help, you may have to dive deeper into UBI specifics (I'm not really a specialist with that).

The cause can be either hardware related (NAND is prone to individual cells going bad, which is why UBI and UBIFS reserve a certain number of spare blocks for bad-block handling and wear leveling) or just bad luck (power being cut at the wrong time, or something else breaking filesystem consistency).
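A rough sketch of such a clean reflash, assuming the sysupgrade image has already been copied to the router and you have its SHA-256 from the download page. The helper name is hypothetical; sysupgrade -n is the option that does not keep settings.

```shell
# Hypothetical helper: verify an image against its published SHA-256,
# then flash it WITHOUT keeping settings (sysupgrade -n).
# $1 = image file, $2 = expected checksum from the download page
flash_clean() {
    actual=$(sha256sum "$1" | cut -d' ' -f1)
    if [ "$actual" != "$2" ]; then
        echo "checksum mismatch, not flashing" >&2
        return 1
    fi
    sysupgrade -n "$1"
}
```

Checking the checksum first is only a precaution: with flash that is already suspect, you at least want to rule out a damaged download.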

If the problem persists after a fresh flash, nandtest, if it's installed,

jeff@office:~$ nandtest -h
usage: nandtest [OPTIONS] <device>

  -h, --help           Display this help output
  -V, --version        Display version information and exit
  -m, --markbad        Mark blocks bad if they appear so
  -s, --seed           Supply random seed
  -p, --passes         Number of passes
  -r <n>, --reads=<n>  Read & check <n> times per pass
  -o, --offset         Start offset on flash
  -l, --length         Length of flash to test
  -k, --keep           Restore existing contents after test

very carefully executed on the UBI partition using the -k option (which usually doesn't destroy data), would be a way to get a clue as to whether there is a problem with the NAND flash.

You should expect that your UBI file system gets broken by running nandtest (even if it isn't supposed to).

So, given the different answers here, it seems that I'll need to flash the router again and cross my fingers. I didn't want to pay full attention to this since I thought a corrupt package was causing it. I updated all packages last weekend and now I'm getting more reboots than before.

I'll set aside some time this weekend to do that and I'll report back here.

Thanks!

In general, bulk package updates are a bad idea for many reasons (if "incompatible ABI" is meaningful to you, that should give you a hint).

If you want to update, flashing a complete, self-consistent image is recommended. Any packages that you need that aren't present in the image should be added at that time (same day for snapshot images).

OK, I understand. I'm pretty accustomed to the rolling-release model of Arch Linux, and one of the reasons I found OpenWrt attractive (instead of EdgeOS) was the possibility of running up-to-date software, packages included.

There is some work afoot to at least be able to identify ABI breakage, but something as sophisticated as apt or your favorite OS's package manager is likely beyond what can be supported even ruling out 16 and 32 MB flash devices. Installing a new kernel generally can't be done with opkg.

Personally, I handle periodic updates by building from source, generally from HEAD of the master branch of OpenWrt and the package feeds. Another option is the Image Builder which will assemble pre-compiled packages from the current repos. The requirements and pre-req software are pretty much the same (Current Linux-based OS, somewhere in the range of 20-32 GB of disk, apt install build-essential git gitk libncurses5-dev gawk unzip wget curl ccache rsync zlib1g-dev or thereabouts).

If nothing else helps, booting a clean OpenWrt image from RAM, dumping all UBI volumes, and then running ubiformat might help. After that you would need to recreate the vendor UBI volumes, such as "factory", and restore their content from the dump (via the ubiupdatevol command), while the OpenWrt rootfs and kernel volumes can be recreated by running sysupgrade from the temporary RAM-only system.

If you go this way, make absolutely sure that you have backed up all the partitions to your computer and stored them in a safe place; use md5 or sha256 checksums to verify their integrity. Some of the UBI partitions might contain data unique to your particular device, such as the MAC address, serial number, and/or calibration data necessary for the wireless chip to work, so don't lose it.
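For the integrity check, a minimal sketch to run on your computer. It assumes all dumps sit in one directory as *.img files; the function names and the SHA256SUMS file name are just conventions chosen for this example.

```shell
# Write a SHA-256 manifest for every *.img dump in a directory.
# $1 = backup directory (assumed layout: one .img file per partition)
make_manifest() {
    ( cd "$1" && sha256sum ./*.img > SHA256SUMS )
}

# Re-check the dumps against the manifest; fails if any file is
# missing or its content has changed since the manifest was written.
verify_manifest() {
    ( cd "$1" && sha256sum -c SHA256SUMS >/dev/null 2>&1 )
}
```

Run make_manifest right after copying the dumps off the router, and verify_manifest again before you restore anything from them.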

To boot OpenWrt into RAM you first need access to the serial console, then use bootloader commands, which depend on the particular bootloader you have, together with an initramfs-kernel image, which you can upload into the device's RAM and start.

Do not touch the bootloader mtd partitions; however, make sure to back them up as well!

Just make sure not to use cat/dd for backing up or restoring, but NAND-aware tools (such as nanddump or nandwrite)!

In the case of UBI, UBI-aware tools such as ubiupdatevol should be used for restoring rather than NAND tools; however, cat /dev/ubiX_Y > /tmp/ubiX_Y.dump.img should be sufficient for saving the data.
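A sketch of that cat-based dump as a loop over all volumes of one UBI device. The helper name is made up, and it assumes the volumes appear as /dev/ubi0_0, /dev/ubi0_1 and so on. Keep in mind that /tmp lives in RAM on OpenWrt, so large volumes may not fit there; in that case dump one volume at a time and copy it off with scp before the next.

```shell
# Hypothetical helper: copy every UBI volume device matching a prefix
# into image files named like ubi0_0.dump.img.
# $1 = device prefix (e.g. /dev/ubi0), $2 = output directory
dump_ubi_volumes() {
    for vol in "$1"_*; do
        # Skip the unexpanded pattern when no volume matches.
        [ -e "$vol" ] || continue
        cat "$vol" > "$2/$(basename "$vol").dump.img"
    done
}
```

On the router this would be called as dump_ubi_volumes /dev/ubi0 /tmp. It only reads the volumes; restoring is a separate step with ubiupdatevol.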

Are there any problems with using cat /dev/mtdX > /tmp/mtdX.img with NAND? For NOR flash, at least, it works just fine. However, cat mtdX.img > /dev/mtdX won't work; mtd write or /dev/mtdblockX has to be used instead. For UBI there is the ubiblock tool, but I think it's easier to just use ubiupdatevol than to create a virtual block device that would work with cat/dd.

I have been able to rescue two ER-X-SFPs with this tool from Ubiquiti...

It successfully marked bad blocks in the flash. In the end, EdgeOS is installed, but it's quite easy afterwards to get OpenWrt back on the router.

I have to say that I was overwhelmed at first with all the responses... some of them contained things that I've never done.
I gave this a shot through the "reset button" method and had no luck. I tried again with a USB-to-TTL adapter, flashed the recovery image, and got EdgeOS back.

Now, I'll reinstall OpenWRT.

As some of you mentioned, when the router was flashing the recovery image it found some bad blocks on the storage that were fixed during the ubi formatting process.

Thanks everyone for your help with this. This thread taught me a lot about how I need to handle updates on this kind of device!
