[PATCH] SQUASHFS: data probably corrupt

I've encountered this issue too, on the GL.iNet GL-B1300.

While I expect that an issue in spi-qup or spi-nor is the underlying cause (possibly the SPI frequency being too high), I wonder whether the correct workaround, for the time being, wouldn't be this patch:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2346491.html

which was ignored without any remark back in October 2020.

That patch is, incidentally, equivalent to a very small subset of the original patch posted by @NoTengoBattery.

Note that @vjorlikowski may be facing an issue where the NAND underlying the UBI has become corrupted without the ubifs noticing; that would be enough to cause these errors:

SQUASHFS error: xz decompression failed, data probably corrupt

These errors will not be corrected by booting an initramfs and performing a sysupgrade, because the sysupgrade path very intentionally does not re-ubiformat if it finds an existing ubifs; it just reuses the existing UBI if it thinks it is intact.

It is very easy to conflate these issues, as malfunctioning storage is very difficult to debug and usually shows unhelpful messages. I had a Meraki MX60 which showed messages identical to the one I posted above and to those posted by others in this thread -- the solution was to re-ubiformat from within a RAM-booted initramfs (possibly not an option for the ER-X; I'm assuming its U-Boot checks the image format. If it doesn't, try that approach).
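
For anyone else in that position, a rough sketch of that recovery path (the mtd index and image name below are placeholders; check /proc/mtd on your own device and be absolutely sure of the partition before formatting anything):

# From a RAM-booted initramfs, with nothing mounted from flash:
cat /proc/mtd                                # identify the UBI/firmware partition first
ubidetach -p /dev/mtd1                       # detach it if the kernel auto-attached it
ubiformat -y /dev/mtd1                       # wipe and re-initialise the UBI layer
sysupgrade -n /tmp/openwrt-sysupgrade.bin    # then flash a fresh image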

@plunet: You will want to take a backup of /etc/config and re-install OpenWrt. I would suggest using scp to transfer the sysupgrade file, and making sure your target filesystem (e.g. /tmp) is actually a tmpfs and not somehow (e.g. via overlay) backed by your squashfs.
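
Roughly along these lines (the image name and router address are placeholders):

# On the router: back up the config and confirm /tmp really is a RAM-backed tmpfs
tar czf /tmp/config-backup.tar.gz /etc/config
mount | grep ' /tmp '

# From another machine: pull the backup off, push the image, flash a clean install
scp root@192.168.1.1:/tmp/config-backup.tar.gz .
scp openwrt-sysupgrade.bin root@192.168.1.1:/tmp/
ssh root@192.168.1.1 'sysupgrade -n /tmp/openwrt-sysupgrade.bin'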

So - I don't think I have NAND; it should be NOR:

[    0.617329] spi_qup 78b5000.spi: IN:block:16, fifo:64, OUT:block:16, fifo:64
[    0.619783] spi-nor spi0.0: mx25l25635e (32768 Kbytes)
[    0.625581] 9 fixed-partitions partitions found on MTD device spi0.0
[    0.630519] Creating 9 MTD partitions on "spi0.0":
[    0.636991] 0x000000000000-0x000000040000 : "SBL1"
[    0.642536] 0x000000040000-0x000000060000 : "MIBIB"
[    0.647156] 0x000000060000-0x0000000c0000 : "QSEE"
[    0.652034] 0x0000000c0000-0x0000000d0000 : "CDT"
[    0.656853] 0x0000000d0000-0x0000000e0000 : "DDRPARAMS"
[    0.661689] 0x0000000e0000-0x0000000f0000 : "APPSBLENV"
[    0.666593] 0x0000000f0000-0x000000170000 : "APPSBL"
[    0.671914] 0x000000170000-0x000000180000 : "ART"
[    0.677043] 0x000000180000-0x000002000000 : "firmware"
[    0.682140] 2 fit-fw partitions found on MTD device firmware
[    0.685856] Creating 2 MTD partitions on "firmware":
[    0.691743] 0x000000000000-0x000000390000 : "kernel"
[    0.697549] 0x0000003884d4-0x000001e80000 : "rootfs"
[    0.702415] mtd: device 10 (rootfs) set to be root filesystem
[    0.706772] 1 squashfs-split partitions found on MTD device rootfs
[    0.712213] 0x0000006f0000-0x000001e80000 : "rootfs_data"

Right now, with some sysfs options set, I'm getting about 6 weeks of runtime between squashfs errors appearing in the logs. Before setting those sysfs options, I'd run anywhere between 2 and 24 hours before they appeared.

The sysfs options I've set that got me some modicum of stability are:

echo 0 > /sys/block/mtdblock10/queue/read_ahead_kb
echo 8 > /sys/block/mtdblock10/queue/max_sectors_kb
echo 2 > /sys/block/mtdblock10/queue/nomerges
echo 2 > /sys/block/mtdblock10/queue/rq_affinity
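
For reference: read_ahead_kb=0 disables readahead, max_sectors_kb=8 caps each request at 8 KiB, nomerges=2 disables request merging, and rq_affinity=2 completes each request on the CPU that submitted it. To keep them across reboots, one option is /etc/rc.local (the stock OpenWrt boot hook); this is just a sketch, and the mtdblock number matches my partition layout above:

# /etc/rc.local -- re-apply the tuning at every boot ("exit 0" stays the last line)
echo 0 > /sys/block/mtdblock10/queue/read_ahead_kb
echo 8 > /sys/block/mtdblock10/queue/max_sectors_kb
echo 2 > /sys/block/mtdblock10/queue/nomerges
echo 2 > /sys/block/mtdblock10/queue/rq_affinity
exit 0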

My recent errors have mostly been corrupt SquashFS fragment cache entries, following a timeout reading from mtdblock10.

I have encountered the error again, on my hardware.
It starts with:

Sun May 15 14:37:12 2022 kern.err kernel: [533320.106127] blk_update_request: I/O error, dev mtdblock10, sector 2594 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:15 2022 kern.err kernel: [533323.226313] blk_update_request: I/O error, dev mtdblock10, sector 3488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:16 2022 kern.err kernel: [533324.269099] blk_update_request: I/O error, dev mtdblock10, sector 2596 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:19 2022 kern.err kernel: [533327.385955] blk_update_request: I/O error, dev mtdblock10, sector 3490 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:20 2022 kern.err kernel: [533328.426176] blk_update_request: I/O error, dev mtdblock10, sector 2598 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:23 2022 kern.err kernel: [533331.546073] blk_update_request: I/O error, dev mtdblock10, sector 3492 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:25 2022 kern.err kernel: [533333.627892] blk_update_request: I/O error, dev mtdblock10, sector 2600 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:27 2022 kern.err kernel: [533335.707119] blk_update_request: I/O error, dev mtdblock10, sector 3494 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:28 2022 kern.err kernel: [533336.745936] blk_update_request: I/O error, dev mtdblock10, sector 2602 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:30 2022 kern.err kernel: [533338.825761] blk_update_request: I/O error, dev mtdblock10, sector 3496 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:31 2022 kern.err kernel: [533339.865779] blk_update_request: I/O error, dev mtdblock10, sector 2604 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:33 2022 kern.err kernel: [533340.905699] blk_update_request: I/O error, dev mtdblock10, sector 3498 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

which then progresses to:

Sun May 15 14:37:33 2022 kern.err kernel: [533340.910849] SQUASHFS error: squashfs_read_data failed to read block 0x130f86
Sun May 15 14:37:33 2022 kern.err kernel: [533340.935308] SQUASHFS error: squashfs_read_data failed to read block 0x1b407e
Sun May 15 14:37:33 2022 kern.err kernel: [533340.935350] SQUASHFS error: Unable to read fragment cache entry [1b407e]
Sun May 15 14:37:33 2022 kern.err kernel: [533340.941430] SQUASHFS error: Unable to read page, block 1b407e, size c9c4
Sun May 15 14:37:33 2022 kern.err kernel: [533340.948634] SQUASHFS error: Unable to read fragment cache entry [1b407e]
Sun May 15 14:37:33 2022 kern.err kernel: [533340.955139] SQUASHFS error: Unable to read page, block 1b407e, size c9c4
Sun May 15 14:37:33 2022 kern.err kernel: [533340.975622] SQUASHFS error: xz decompression failed, data probably corrupt
Sun May 15 14:37:33 2022 kern.err kernel: [533340.975666] SQUASHFS error: squashfs_read_data failed to read block 0x130f86

This is exactly in line with the errors as originally described.
My underlying storage drivers are spi-qup and spi-nor; there is no ubifs (that I can tell) -- only the mtd layer sits between the SPI controller and squashfs.

The underlying issue is very clearly with the storage controller failing a read, which squashfs then persistently caches. A reboot sorts the issue out, and the underlying storage presents as "just fine" after a reboot as well (even when forcing a read from all blocks).

Yes! This is exactly the scenario that this patch solves.

In my custom build I also included a patch to spi-qup so that it can handle the DMA error correctly (or at least make the kernel aware of the error).

I am not sure if this will work without UBI, but I think it should, because MTD must have some sort of error correction, I would assume.

So, I suspected that the following might trigger the issue once it begins to happen... and, as it turns out, it also managed to trigger a reboot on my hardware:

cd /rom
find . -type f -exec cat {} >/dev/null \;
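
For comparison, a raw sequential read of the whole rootfs block device (mtdblock10 in my partition layout above) exercises the same SPI read path without going through squashfs; any read failure shows up in dmesg as the same blk_update_request I/O errors:

dd if=/dev/mtdblock10 of=/dev/null bs=64k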

I did this after noticing the following in dmesg this morning:

[2056856.012067] spi-nor spi0.0: SPI transfer failed: -110
[2056856.012132] spi_master spi0: failed to transfer one message from queue

Which resulted in:

[2056856.012132] spi_master spi0: failed to transfer one message from queue
[2056856.016230] blk_update_request: I/O error, dev mtdblock10, sector 518 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056857.052312] spi-nor spi0.0: SPI transfer failed: -110
[2056857.052378] spi_master spi0: failed to transfer one message from queue
[2056857.056385] blk_update_request: I/O error, dev mtdblock10, sector 520 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056858.092040] spi-nor spi0.0: SPI transfer failed: -110
[2056858.092100] spi_master spi0: failed to transfer one message from queue
[2056858.096208] blk_update_request: I/O error, dev mtdblock10, sector 522 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056859.132124] spi-nor spi0.0: SPI transfer failed: -110
[2056859.132251] spi_master spi0: failed to transfer one message from queue
[2056859.136243] blk_update_request: I/O error, dev mtdblock10, sector 524 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056860.172107] spi-nor spi0.0: SPI transfer failed: -110
[2056860.172184] spi_master spi0: failed to transfer one message from queue
[2056860.176180] blk_update_request: I/O error, dev mtdblock10, sector 526 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056861.212019] spi-nor spi0.0: SPI transfer failed: -110
[2056861.212093] spi_master spi0: failed to transfer one message from queue
[2056861.216095] blk_update_request: I/O error, dev mtdblock10, sector 528 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056861.223130] SQUASHFS error: squashfs_read_data failed to read block 0x409aa
[2056862.251948] spi-nor spi0.0: SPI transfer failed: -110
[2056862.252011] spi_master spi0: failed to transfer one message from queue
[2056862.256046] blk_update_request: I/O error, dev mtdblock10, sector 518 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056863.292705] spi-nor spi0.0: SPI transfer failed: -110
[2056863.293138] spi_master spi0: failed to transfer one message from queue
[2056863.297101] blk_update_request: I/O error, dev mtdblock10, sector 520 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056864.332057] spi-nor spi0.0: SPI transfer failed: -110
[2056864.332115] spi_master spi0: failed to transfer one message from queue
[2056864.336115] blk_update_request: I/O error, dev mtdblock10, sector 522 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056865.382584] spi-nor spi0.0: SPI transfer failed: -110
[2056865.383011] spi_master spi0: failed to transfer one message from queue
[2056865.386981] blk_update_request: I/O error, dev mtdblock10, sector 524 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056866.411867] spi-nor spi0.0: SPI transfer failed: -110
[2056866.411927] spi_master spi0: failed to transfer one message from queue
[2056866.416195] blk_update_request: I/O error, dev mtdblock10, sector 526 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056867.451870] spi-nor spi0.0: SPI transfer failed: -110
[2056867.451928] spi_master spi0: failed to transfer one message from queue
[2056867.455926] blk_update_request: I/O error, dev mtdblock10, sector 528 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056867.462942] SQUASHFS error: squashfs_read_data failed to read block 0x409aa
[2056868.492606] spi-nor spi0.0: SPI transfer failed: -110
[2056868.492900] spi_master spi0: failed to transfer one message from queue
[2056868.496771] blk_update_request: I/O error, dev mtdblock10, sector 548 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056869.532663] spi-nor spi0.0: SPI transfer failed: -110
[2056869.533096] spi_master spi0: failed to transfer one message from queue
[2056869.538334] blk_update_request: I/O error, dev mtdblock10, sector 550 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056870.571848] spi-nor spi0.0: SPI transfer failed: -110
[2056870.571919] spi_master spi0: failed to transfer one message from queue
[2056870.575906] blk_update_request: I/O error, dev mtdblock10, sector 552 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[2056870.753564] SQUASHFS error: squashfs_read_data failed to read block 0x448ee

Running the find/cat a couple more times caused the device to reboot - likely due to a lockup trying to access the storage underlying the squashfs /rom partition.

Another update...

Built from source using the 21.02.3 tag, but with spi-max-frequency modified from 24000000 to 20000000 on the GL.iNet B1300. I'm still encountering the underlying issue that causes squashfs to get into this pathological state.
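
For anyone wanting to try the same change, the property lives in the board's device tree source; a quick way to find it in the build tree (the exact files*/ subdirectory varies by release) is:

grep -rn 'spi-max-frequency' target/linux/ipq40xx/ | grep -i b1300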

I'm not far from forward-porting @NoTengoBattery's patch to the latest 21.02 for my hardware.

Woah! Nice! Thank you!
A lot better than just panicking (while still being maintainable)!

Has anyone tested the 22.03 branch?

I'm testing the latest OpenWrt snapshot (r20029-3c06a344e9), running the 5.15.50 kernel via testing mode on a WD N750, but I'm still seeing the same SquashFS errors.

I'm disappointed, as there were a number of upstream bug fixes in the 5.15 kernel compared to the 5.10 kernel (the current kernel for 22.03 and master).

Update: Possible improvement! After a reboot there were no errors, and there have been none for the past 7 days. No issues on a second reboot after the 7-day uptime either. In the past, on OpenWrt 21.02 and 22.03, I would see a few SquashFS errors immediately upon reboot.

I built the current firmware image from source using the July 7th snapshot, selected the testing kernel, and am running 5.15.50. I'll keep monitoring and will report back.

I've got an image for 21.02.3 built for my GL.iNet B1300, with a forward-port of @NoTengoBattery's SQUASHFS patch - and a patch to disable DMA transfers for spi-qup on my hardware. I plan to install it and test in the next week or so (work/personal issues preventing an immediate test).

If everything installs, works correctly, and "burns in" for a week or so, I'll post the forward-ported SQUASHFS patch, at least; the DMA disablement is a hack-job, just to see if PIO transfers via the SPI from my NOR-based storage are more reliable.

After a month of testing I’m sad to report the SquashFS errors continue with my WD N750, even when running the 5.15.50 kernel.

I’ve had no success with 21.02, 22.03, or Snapshot. I’ve not applied the patch to these builds, yet.

As of yesterday, my custom build of 21.02.3 (with @NoTengoBattery's SQUASHFS patch and my hack to disable DMA on spi-qup) is successfully running on my GL.iNet B1300.

I will be burning this in for a while, to see if any SPI and/or SQUASHFS errors present themselves. If none do after 1-2 weeks, I'll post the updated SQUASHFS patch.

I have a hunch I won't see any errors; my suspicion is that the SQUASHFS errors I had been seeing were the result of mishandling of the DMA path for the SPI in recent kernels.

Has anyone adapted the patch for 22.03? Can you share, please?

Two weeks have passed, during which I have let things burn in.
I have seen no errors on my hardware (GL.iNet B1300), but I won't be confident until two months have passed.

In any case - as promised - here's the link to the patches I'm running with, against 21.02.3:

There are two patches present in the above gist:

  1. @NoTengoBattery's SquashFS patch, forward-ported
  2. My own hack to disable DMA in spi-qup

Running without DMA in spi-qup has left my storage completely stable over the past two weeks, and I have not seen the error recovery paths that @NoTengoBattery added to SquashFS get engaged at all (but they're there, just in case).

In order to apply the patches, just drop them into:

target/linux/ipq40xx/patches-5.4/

in a 21.02.3 source tree (presuming you've got ipq40xx hardware), and follow the usual steps to build an image.
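
For completeness, a sketch of those steps on a fresh tree (the patch filenames are whatever the gist uses; the board is picked in menuconfig):

git clone https://git.openwrt.org/openwrt/openwrt.git
cd openwrt
git checkout v21.02.3
cp /path/to/the-two-patches/*.patch target/linux/ipq40xx/patches-5.4/
./scripts/feeds update -a && ./scripts/feeds install -a
make menuconfig      # Target: Qualcomm Atheros IPQ40XX, Profile: GL.iNet GL-B1300
make -j$(nproc)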

Are these patches platform-independent? Meaning, can I drop them into an ath79 target for my WD MyNet N750?

For 21.02.3?
Both ipq40xx and ath79 use the 5.4.188 kernel... so it ought to build if you just drop them into the equivalent ath79 patches directory (target/linux/ath79/patches-5.4/).

All caveats apply (I haven't built on ath79, and I'm not responsible if it doesn't boot / eats your pets / etc.).
