Help with repeated FS failures on x86 hardware

Hello everyone,

I have a big problem with my latest x86 HW that I can't solve alone.

First a bit of history: I've been running OpenWrt on small Chinese x86 boxes since 2020:

  • first a J3160 (2x Realtek 1GbE NICs)
  • then an N5105 to get more NICs (4x Intel i225), which suffered a nasty short circuit from a faulty HDMI cable during an upgrade
  • then another N5105 (4x Intel i226) to replace the previous one

On the first two I never had any HW problems, but that's not the case with the last one: I regularly hit the same failure. The sda2 FS goes read-only, so no more DHCP leases for my clients.

I have a workaround (e2fsck -y /dev/sda2 && reboot) but it's getting really annoying.

I tried:

  • 2 different DIMM modules
  • 2 different NVMe SSDs
  • bare-metal OpenWrt
  • virtualized OpenWrt, to add a storage abstraction layer (Proxmox 7.2 and 7.4, legacy kernel and also edge kernel)

Sometimes it takes weeks between outages, sometimes it's twice a day.

I can't find any useful logs on either Proxmox or OpenWrt.
And I have another VM on this Proxmox box (Home Assistant) with zero failures.

I'm hesitant to buy an identical box (maybe it's just a HW issue with THAT box).

The only information I can find in OpenWrt is:

root@gw:~# e2fsck -y /dev/sda2
e2fsck 1.46.5 (30-Dec-2021)
rootfs contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -13884
Fix? yes

Free blocks count wrong (0, counted=15889).
Fix? yes


rootfs: ***** FILE SYSTEM WAS MODIFIED *****
rootfs: ***** REBOOT SYSTEM *****
rootfs: 2046/6656 files (0.0% non-contiguous), 10735/26624 blocks
root@gw:~#

and

Sun Jun  4 17:47:39 2023 kern.crit kernel: [   39.834805] EXT4-fs error (device sda2): mb_free_blocks:1506: group 0, inode 282: block 13885:freeing already freed block (bit 13885); block bitmap corrupt.
Sun Jun  4 17:47:39 2023 kern.crit kernel: [   39.952869] EXT4-fs (sda2): Remounting filesystem read-only
Sun Jun  4 17:47:39 2023 kern.crit kernel: [   39.954001] EXT4-fs error (device sda2): ext4_mb_generate_buddy:802: group 0, block bitmap and bg descriptor inconsistent: 15888 vs 15889 free clusters
Sun Jun  4 17:47:39 2023 daemon.err dnsmasq-dhcp[1]: failed to write /root/dhcp.leases: Read-only file system (retry in 60 s)
Sun Jun  4 17:47:41 2023 daemon.info dnsmasq[1]: exiting on receipt of SIGTERM
Sun Jun  4 17:47:51 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:47:51 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:47:56 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:47:56 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:48:01 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:48:01 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:48:06 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:48:06 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:48:11 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:48:11 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:48:12 2023 user.err adblock-4.1.5[5526]: dns backend restart with adblock blocklist failed
Sun Jun  4 17:48:22 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:48:22 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:48:27 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:48:27 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:48:32 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:48:32 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:48:37 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:48:37 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:48:42 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:48:42 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:48:47 2023 daemon.crit dnsmasq[1]: cannot open or create lease file /root/dhcp.leases: Read-only file system
Sun Jun  4 17:48:47 2023 daemon.crit dnsmasq[1]: FAILED to start up
Sun Jun  4 17:48:47 2023 daemon.info procd: Instance dnsmasq::cfg01411c s in a crash loop 6 crashes, 0 seconds since last crash

So:

  • What can I do to get more logs?
  • Do you think it could simply be a fault in this particular box?
    If it's bad hardware I won't spend more time on it ... I'll buy another identical one (fingers crossed I don't hit the same problem)

I can't find anyone else with the same problem on the net.

So, something's clearly wrong here, but the question is what exactly.

I would stress-test the little bugger with another distro to rule out a problem with the OpenWrt-provided kernel (check out xfstests and/or fio; both are great tools for this).
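As a starting point, here's a minimal fio job sketch for that kind of stress test. Everything here is an assumption on my part, not something from this thread: the file name `stress.fio`, the target directory, and the sizes/runtimes are placeholders to adjust for the actual disk.

```shell
# Hedged sketch: write a small fio job file that hammers the suspect
# filesystem with mixed random I/O plus data verification, which can
# surface silent corruption. All names and sizes below are placeholders.
cat > stress.fio <<'EOF'
[global]
directory=/tmp        ; point this at a directory on the suspect filesystem
size=256M             ; keep it small enough for the target disk
ioengine=psync
direct=1              ; bypass the page cache (drop if the fs doesn't support O_DIRECT)
verify=crc32c         ; re-read and verify written data
runtime=600
time_based

[randrw]
rw=randrw
rwmixread=50
bs=4k
iodepth=1
EOF
# Then run it (needs the fio package installed):
# fio stress.fio
```

If fio completes without verification errors over several long runs, the storage path is probably not the culprit.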

If you cannot reproduce the issue that way, maybe try switching to squashfs instead of ext4.

  • Use a UPS to protect against power outages.
  • Do not check the filesystem while it is mounted; it is not safe and can create even more errors.
  • Use the virtio storage type for better performance and compatibility.
  • Test the memory and storage on the host.
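On the "do not check online" point: e2fsck is only safe against a filesystem that is not mounted at all. A self-contained way to see the offline workflow, without touching any real disk, is to practice on a throwaway image file (this assumes e2fsprogs is installed; `demo.img` is a made-up name):

```shell
# Minimal sketch: build a tiny ext4 filesystem in a plain file and
# check it offline. Nothing here is ever mounted, so the check is safe.
truncate -s 8M demo.img      # sparse 8 MiB backing file
mkfs.ext4 -q -F demo.img     # -F: allow formatting a regular file
e2fsck -f -y demo.img        # forced full check; exit code 0 = clean
```

The same idea applies on the Proxmox host: stop the VM first, then run e2fsck against the VM's disk from the host (for qcow2 images this usually means exposing the disk first, e.g. via qemu-nbd), so the filesystem is never mounted while being checked.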

Thanks for your replies.

I'll perhaps create another Debian VM to test with xfstests.

  • I have a UPS
  • I use the virtio SCSI controller
  • I don't check the FS while it's online ... I only run e2fsck once the FS has already gone read-only

They say that even a RO-mounted filesystem is not safe to check.


OK, thanks vgaetera, I'll unmount it.

I found some SMART errors on the NVMe drive that I don't know how to decode. I've ordered a new one, even though I don't know if it's related (another brand-new NVMe drive had the same FS issues at the beginning of this story).

I also migrated to a new OpenWrt VM using the squashfs image, for some peace of mind while waiting for the new NVMe drive.

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        241     0  0x3009  0x0005      -            6     0     -
  1        240     0  0x0009  0x0005      -           12     0     -
  2        238     0  0x0001  0x0005      -           12     0     -
  3        237     0  0x3009  0x0005      -            6     0     -
  4        236     0  0x0009  0x0005      -           12     0     -
  5        235     0  0x001c  0x0005      -            6     0     -
  6        234     0  0x0005  0x0005      -           12     0     -
  7        232     0  0x001d  0x0005      -           12     0     -
  8        231     0  0x000c  0x0005      -            6     0     -
  9        230     0  0x0015  0x0005      -           12     0     -
 10        229     0  0x0018  0x0005      -            6     0     -
 11        228     0  0x0009  0x0005      -           12     0     -
 12        227     0  0x3005  0x0005      -            6     0     -
 13        226     0  0x0005  0x0005      -           12     0     -
 14        223     0  0x0004  0x0005      -            6     0     -
 15        217     0  0x000c  0x0005      -            6     0     -

and

root@pve:~# nvme error-log /dev/nvme0
Error Log Entries for device:nvme0 entries:16
.................
 Entry[ 0]   
.................
error_count     : 241
sqid            : 0
cmdid           : 0x3009
status_field    : 0x5(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0xffff
lba             : 0x6
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 1]   
.................
error_count     : 240
sqid            : 0
cmdid           : 0x9
status_field    : 0x5(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0xffff
lba             : 0xc
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0