[PATCH] SQUASHFS: data probably corrupt

NoTengoBattery · July 29, 2020, 7:31pm

This is actually a solution

This is not an "issue" what I'm posting here, instead, it's a solution. A patch for the Linux kernel that might solve the SQUASHFS problem for some devices [don't expect this to be a magic solution if your hardware is actually faulty].

The background behind this research

I'm a developer, I started my journey by developing for AOSP when I was younger (like when I was 15 or 16). I may not be the smartest or the wiser, neither the more expert. However, I do have a strong willpower. My device, the EA6350v3, works fine. It's a nice device, some issues with the VLAN and the switch (which I've fixed myself) and with the fact that ath10k is far from perfect. Still, a nice experience.

It worked flawlessly and amazingly running Linux 4.14 and Linux 4.19, but that quickly changed when Linux 5.4 came to master. Since then, when using the USB port and the WiFi intensively at the same time (say, using it as a NAS), and particularly with Samba since the binary weigths like 30MB, the device will spam the log like this:

[ 6282.919915] spi-nand spi0.1: SPI transfer failed: -110
[ 6282.919983] spi_master spi0: failed to transfer one message from queue
[ 6282.924107] ubi0 warning: ubi_io_read: error -110 while reading 4096 bytes from PEB 261:73728, read only 0 bytes, retry
[ 6283.008726] SQUASHFS: error: xz decompression failed, data probably corrupt
[ 6283.008737] SQUASHFS: error: squashfs_read_data failed to read block 0x10d0386
[ 6283.009113] SQUASHFS error: Unable to read fragment cache entry [10d0386]

Spamming the whole kernel log is not the actual issue: the process needing the resources from the ROM will eventually crash, and it can be init, since it may need a shared library, some code swapped-out of memory, from the ROM, since init must live in the flashable SQUASHFS partition; or to run some busybox tool, and if init crashes, the kernel will immediately panic.

I debugged the issue in multiple ways:

I'm not the only owner of the device with this very same issue, and it's not the only device
Doesn't happen at all using previous kernels
The flash is not damaged or worn out
The image is not corrupted or bad in any form, correctly generated and correctly flashed
The SPI signal is perfectly fine in an oscilloscope, but when the error happens, the SPI bus is quiet (like if doesn't even try to read)
Changing the SPI, the flash and the UBI drivers [in the source code] in many ways didn't stopped the problem
Soldering some capacitors near the RAM, CPU, SPI and WiFi didn't helped that much, it arguably manages to reduce the spamming
Changing the SPI frequency to the maximum for SPI-QUP (far from the chip's maximum) or the minimum didn't helped
Balancing the interrupts across all CPUs in all possible ways didn't helped either
Changing the real-time properties of the SPI, UBI, USB, ksoftirqd and ath10k kernel threads in any possible ways didn't helped

So, the problem is not phyisical. The problem only happens when using Linux 5.4, it's a software problem, it's a issue related to how the CPU, the RAM, the DMA and the respective drivers [ath10k, usbcore, ubi/ubiblock/ubifs, spi-qup, mtd/mtdblock] interact, but I can't find out what's wrong.

Investigating about UBI, I found that UBI is resistant to errors, it can detect and correct many problems with actual bad flash chips [not the issue here], and indeed UBI detects a failure and tries again as the log clearly states, and most of the times it succeed. This means that programs or drivers that talk directly to the UBI layer succeed, like the UBIFS for example, and the ubiupdatevol and many more. SQUASHFS should do the same.

This means: SQUASHFS needs to be fixed, since UBI corrects the error, and it's transparent to the upper layer, why isn't SQUASHFS satisfied? Because when UBI fixes the error, the data is correct in the memory, it should be transparent to SQUASHFS, what's the deal?

The technical background

The Linux kernel will not allow a corrupted page in the memory. It will fight back with any driver attempting to corrupt the kernel's memory [corrupting the memory on purpose it's another story], including the SPI and the MTD and NAND drivers. UBI itself taints it's buffer, tries to read, and perform a checksum over the buffer. This means that the data, after UBI retries, is indeed present in memory, and it's correct, but SQUASHFS is unable to detect it as such. Instead, SQUASHFS will enter in a fail loop, even when the data is technically correct

This is because somehow the page cache (a section of memory used by Linux to optimize disk usage) keep the "corrupted" pages in memory. Almost all file systems in the kernel know how to deal with that situation, like F2FS in a bad flash or EXT4 with a faulty data cable, but SQUASHFS doesn't. From this point and onwars, "corrupted" in quotes means that it looks corrupted to SQUASHFS but no corruption is present as it's guaranteed by UBI, it's synonymous when I write "poison".

There are two main issues with SQUASHFS which need to be fixed to allow SQUASHFS to actually decompress the data from memory, both doesn't need to coexist but they can and I do recommend it.

1. Pages and fragments will not be invalidated from the "front" cache

This is the first step to fix SQUASHFS: invalidate the pages and fragments that are "known to be corrupted", making them never stay in the "front" cache. The squashfs_cache_get function in the cache.c file contains the "front" cache verification logic of SQUASHFS, it contains the call that, eventually, will read the data into memory and decompress it either to a buffer or directly to the page cache [not discussed here].

If you see the logs closely, SQUASHFS only attempts to decompress once. This is because the data goes along to SQUASHFS's "front" cache even if it does not work. This means that upon the decompression failure, the cache will remain poissoned until Linux needs to free some RAM and the page it's lucky enough to get evicted, and only then it's when Linux will attempt to read again.

Invalidating the "front" cache entries, by changing their block property to the SQUASHFS_INVALID_BLK will force SQUASHFS to try to read and decompress the data again, otherwise, the page marked as faulty will stay in RAM for an indefinite amount of time. However, this does not resolve any poisioning in the "back" cache, and if it's poisoned, it will fail over and onver again.

This alone may fix some problems, but the problem with doing only this is that the "corrupted" data will be used by all waiting processes at least once. It will be attempted again only when another process waits for the same data, but after all the original waiters have already used the corrupted data, or received the SIGBUS killing signal from the kernel, or as the page is not held corrupted in the page cahce, it will enter in contention. A better solution is try to avoid poisoning the "front" cache in the first place.

2. The double cache poisoning

SQUASHFS manages it's "front" cache (the data decompressed and actually used) in the function described above. The front cache can be invalidated and it will force SQUASHFS to try to read the backing device and decompress again. Here it's the second cache: the "back" cache which is the cache that belongs to the raw SQUASHFS image, the compressed data stored in the disk. SQUASHFS uses the ll_rw_block function (not anymore in the mainstream kernel) to read pages from the backing device. Linux optimizes these calls by putting the data, again, into it's page cache.

So, the very first time, the data will be actually read from disk, but the second and subsequent calls, even when it's requested to do so, will not; and the "corrupted" compressed data will remain in memory indefinitely. Two things need to be done: implement the retry logic for the squashfs_read_data function, so it retries without poisoning the "front" cache and evict the data from RAM, so the function ll_rw_block may attempt to re-fetch the data from disk.

The first one is trivial: change the name of the original function to __squashfs_read_data, make it static inline, wrap it in a fake squashfs_read_data with the same signature that will loop n times when the returning code of the real function is error. This allows retrying without poisoning the "front" cache.

However, this is not enough since the "back" cache still poisoned. Therefore, all attempts may fail since the data is held in memory indefinitely. Here it's the second part: ll_rw_block ask for a buffer head pointer, and once done, it will return in that pointer a chain of buffer heads that, in consequence, indicates the actual location of the data in memory, where the SQUASHFS decompression algorithm begins it's decompression.

When data is corrupted by any means, file systems must kill or zap these pages. SQUASHFS doesn't because it doesn't implement the retry logic that most file systems do. To solve this, when a failure is detected and before freeng the buffer head, we shall call our newly implemented __squashfs_kill_pages. This function will kill (or zap) all the pages from all the buffer heads, efectively saying to the Linux's VM that the pages need to be read again.

It's inspired in the AFS's afs_kill_pages, found in the write.c file. This function loops the buffer head's pointers array that SQUASHFS allocated, for every head, it loops all the chain of the buffer heads and kills the corresponding buffer head pages one by one.

The killing process is as follows:

Lock the page (which is a requirement)
Call the delete_from_page_cache function which most likely will delete the page from the page cache and from the LRU
Clear the page's "up to date" flag, which is a flag set by the kernel when the page is synced with disk and contains valid data
Sets the page's "error" flag, which tells the kernel that the file system found an error and the data in the page is not reliable
Unlock the page
Clear the buffer's "up to date" head flag, which is similar to the page's flag

This is not the solution that should be used

Even if it works and makes SQUASHFS robust, you should fix your underlaying problem, but if you can't (like me), just give it a try. I think this patch is worthful, anyway, because it provides SQUASHFS the robustness it lacks. The code is free to audit and test; and doesn't impact the performance (except for a couple of checks) when the data is successfully decompressed. It blocks all the requesting processes until all attempts are depleted, or the opperation succeeds.

The blocked processes will eventually progress using this patch, giving that the read problem is only transitory or that the underlaying driver can correct errors (like UBI does). When using the default implementation, the blocked processes will "progress" immediately, but they will be killed, from the SIGBUS sent when an I/O error cannot be recovered.

Patch

github.com

NoTengoBattery/openwrt/blob/linksys-ea6350v3-mastertrack/target/linux/ipq40xx/patches-5.4/9996-fs_squashfs_improve_squashfs_error_resistance.patch

From: Oever González <notengobattery@gmail.com>
Subject: [PATCH] fs: SQUASHFS: improve SQUASHFS error resistance
Date: Sat, 25 Jul 2020 18:49:31 -0600

This patch greatly improves the SQUASHFS's ability to deal with read errors. By
applying this patch, a corrupted page in the cache will not longer lead to a
persistent failure in decompression, at least most of the times.

Signed-off-by: Oever González <notengobattery@gmail.com>
---
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -65,6 +65,42 @@ static struct buffer_head *get_block_length(struct super_block *sb,
 	return bh;
 }
 
+/*
+ * Kill the pages if an error is found, this will try to evict them from the
+ * page cache forcing the kernel to read again the device, hopping that the read
+ * error is just transitory.

This file has been truncated. show original

netprince · February 7, 2021, 8:18pm

I was having this problem with ath79 on my wd n750. I tried two and they both had this issue.

I rebuilt with your patch and so far everything is working great. Thanks for posting this.

netprince · February 7, 2021, 8:36pm

I spoke too soon, I'm still having trouble with wd n750.

Currently after reboot I get a 'Bad Header Checksum', which could be another issue.

Sorry for the noise.

NoTengoBattery · February 7, 2021, 9:08pm

No problem, we oftentimes speak too soon. Your issue looks like a hardware problem, with my experience in electronics maybe is a noise or poor power quality issue.

I can bisect it directly to a hardware issue because your last message means that the kernel image was read with errors by the bootloader, way before Linux had a chance to boot.

The implication is that no patch to Linux will fix it. While this patch can help, it will not make it work. In the particular described case, the device always boot fine and runs fine until you put pressure on it, plus the bug is not present in older Linux kernels.

I recommend you checking your power supply. You can, then, attach a ferrite ring to the cable and solder some high quality aluminum and ceramic capacitors in the power supply rails near to the CPU and/or the flash chip.

Do not work with electricity if you have no experience with it.

netprince · February 7, 2021, 9:15pm

Thanks for the reply!

I thought it was a hardware issue at first too, and I have several of these around so I wired up a second one with a serial port. It has the same issue, boots up fine after installing openwrt via the recovery web interface, but after a reboot I get the 'Bad Header Checksum'.

Additionally if I go back to 19.07 (ar71xx) then everything works fine. Reboot is fine.

So something is wrong with the ath79 build.

vevere · June 27, 2021, 12:45pm

I also have a lovely WD My Net N750:

running perfectly with 19.07.7
unable to use it with 21 rc2 or rc3 (load average > 6 after few minutes) and with "openwrt SQUASHFS error: xz decompression failed, data probably corrupt" errors. I thought my router was faulty but in fact it seems to be a pb with hostapd that consommes too much ressources
unable to use it with today's snapshot with a lot of "openwrt SQUASHFS error: xz decompression failed, data probably corrupt" with wget saying "wget returned -1." when I try to do a "opkg update"

As I also discovered a lot of small things not working as expected on 21-rc2/3 with no reply to any of my posts/questions/bug report I start to think about trying another router distro

ecloud · November 28, 2021, 8:48pm

I just had the same problem with the GnuBee: it's rather crippling. Had to reboot and retry several times before I could even downgrade the firmware back to 19.07.8; fortunately succeeded eventually. The failing image for me is openwrt-21.02.1-ramips-mt7621-gnubee_gb-pc2-squashfs-sysupgrade.bin

walmartshopper · November 29, 2021, 11:57pm

I have run into this issue with two different Ubiquiti Edgerouter X routers. 19.07 is unstable on this device because of an ethernet driver bug. But OpenWRT 21.02 has this squashfs issue, which means I can't really trust it to be stable long term. I just got bit by this yesterday. I had set up an Edgerouter X running a 21.02 snapshot at a friend's house about 8 months ago. It worked flawlessly for that whole time, and then all the sudden they started getting big ping spikes and eventually lost internet connectivity. I went over to check on it, and the syslog was full of these errors. It left the router in a weird partially working state where no clients on the lan could access the internet. Rebooting did not help, it just had the same errors in syslog and no internet access. I assumed the flash was going bad, so I swapped it with a Mikrotik 750gr3 running 21.02.1 and brought the Edgerouter X home. I flashed it back to stock firmware using TFTP recovery and now it's working fine. I noticed the latest stock firmware is still using kernel 4.14. I also noticed this message in the syslog:

ubi0: good PEBs: 1981, bad PEBs: 1, corrupted PEBs: 0

I'm guessing the one block went bad while OpenWRT was running, and it didn't know how to handle it, leaving the system in a broken state that could only be fixed by re-flashing. It's a big problem because if one block going bad can take down the whole network, I can't really trust these devices in production, especially in remote locations. It's a bummer because I bought 2 of these for friends after seeing how well mine worked on 21.02. But in less than a year, 2 out of 3 have gone down because of this bug. One of them may have been power supply related because both the router and a UAP-AC-LITE were being powered by a single 12w PoE injector (although that setup worked fine for 8 months). But the other one that went down was being powered by the factory AC adapter.

I have started flashing the Edgerouters back to stock and selling them off to be replaced by Mikrotik 750gr3 and NanoPi / Orange Pi devices. The Mikrotik 750gr3 is basically the same hardware as the Edgerouter X but has 16MB of NOR flash vs the Edgertouer's 256MB NAND flash. In theory the NOR flash should be less likely to develop bad blocks over time, so it might be a better choice for OpenWRT at this point. I've had two of the Mikrotiks running 21.02 for 5 months without hitting this bug but will keep an eye on them.

potros · December 27, 2021, 12:35pm

Same here!

Half day to fix some strange behaviours about checksums and corrupt data and solution was to return to 19.07.7

Not experienced enought to discover what is wrong with version 21.

db260179 · January 12, 2022, 12:26pm

Sounds to me like the flash chip frequency is set to high, this will cause squashfs errors.

Worth experimenting lowering the flash frequency in the dts file and see if it makes any difference.

walmartshopper · January 12, 2022, 8:21pm

Unfortunately I can't test this theory... I flashed my edgerouters back to stock, sold them on ebay, and replaced them with Miktrotiks and/or NanoPis. I have not experienced this bug on any of the new devices.

belarios · January 13, 2022, 1:21am

Are you suggesting lowering this value:

spi-max-frequency = <25000000>;

in this file:

github.com

openwrt/openwrt/blob/b2aca5a263e73cb2def1c6cbb36f3be24d38089b/target/linux/ath79/dts/ar9344_wd_mynet-nxxx.dtsi

// SPDX-License-Identifier: GPL-2.0-or-later OR MIT

#include "ar9344.dtsi"

#include <dt-bindings/gpio/gpio.h>
#include <dt-bindings/input/input.h>

/ {
	chosen {
		bootargs = "console=ttyS0,115200n8";
	};
};

&ref {
	clock-frequency = <40000000>;
};

&spi {
	status = "okay";

This file has been truncated. show original

What value would be a good test?

This datasheet for the flash talks about 104Mhz

SPI clock frequencies of up to 104MHz are supported allowing equivalent clock rates of 208MHz (104MHz x 2) for Dual Output and 280MHz (70MHz x 4) for Quad SPI when using the Fast Read Quad SPI instructions.

slh · January 13, 2022, 2:38am

While that's a good first reference, it doesn't mean that the vendor's PCB is also up to the task…

There is no way around to confirm/ test if the theoretic values conform to physical reality.

db260179 · January 13, 2022, 9:28am

@slh is correct, although the flash manufacture can specify a higher frequency, the board manufacturer does the opposite most of the time (cost saving exercise).

If you want to go hardware analyse, you can use an oscilloscope on the data lines of the flash chip and see how noisy it is at that frequency.

For a test, lower it from 25mhz to 20mhz and see if it helps?

Its possible the flash chip is faulty, most likely if reducing frequency doesn't fix the issue.

yogo1212 · May 1, 2022, 3:37pm

Deer lord

I've been hunting this thing for years and only now it occured to me that patching squashfs is a possibility!
Found this thread instantly - this is great!
Thank you so much for putting the work in to confirm that this indeed wasn't present in older kernels and going so far to confirm that there's no electrical activity!

Here is sample log from my WPJ428 (IPQ40xx).

First, there are a few ETIMEDOUTs:

[44462.672294] spi-nor spi0.0: SPI transfer failed: -110
[44462.672440] spi_master spi0: failed to transfer one message from queue
[44462.676430] blk_update_request: I/O error, dev mtdblock10, sector 6496 op 0x0:(READ) flags 0x800 phys_seg 7 prio class 0
...
[44468.911877] spi-nor spi0.0: SPI transfer failed: -110
[44468.911957] spi_master spi0: failed to transfer one message from queue
[44468.915970] blk_update_request: I/O error, dev mtdblock10, sector 6544 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0

Then, squashfs complains unless something else clears out the cache:

[44468.923267] SQUASHFS error: Failed to read block 0x325e36: -5 (EIO)
[44468.933433] SQUASHFS error: Unable to read fragment cache entry [325e36]
[44468.939023] SQUASHFS error: Unable to read page, block 325e36, size c978
[44468.946209] SQUASHFS error: Unable to read fragment cache entry [325e36]
[44468.952507] SQUASHFS error: Unable to read page, block 325e36, size c978
[130394.427717] SQUASHFS error: Unable to read fragment cache entry [325e36]
[130394.427771] SQUASHFS error: Unable to read page, block 325e36, size c978
[217321.079498] SQUASHFS error: Unable to read fragment cache entry [325e36]
[217321.079696] SQUASHFS error: Unable to read page, block 325e36, size c978
[304457.592185] SQUASHFS error: Unable to read fragment cache entry [325e36]
[304457.592237] SQUASHFS error: Unable to read page, block 325e36, size c978
[389614.163585] SQUASHFS error: Unable to read fragment cache entry [325e36]
[389614.163676] SQUASHFS error: Unable to read page, block 325e36, size c978

On the WPJ428, I haven't seen the problem occur when it's powered through the DC jack - it only happens when PoE is used.
The issue can appear at various times and, thus, affect various blocks.
When I reboot a hundred devices, about a third will encounter the error.
Most of the time, the failing devices can be recovered with another reboot. Sometimes it takes two or three attempts.

Changing the frequency has no effect. I haven't been able to work out whether the distribution of failing block numbers changes with different frequencies.
The driver is spi_qup. The timeout occurs either in spi_qup_do_dma or spi_qup_do_pio - I didn't bother investigating which because I thought the cause would be purely electrical.

Maybe I'll look into it again now that I know that it worked with other kernels...

Greetings!

EDIT:
Now that I'm digging back into it, I found a thread on the NXP forum that has SPI timeouts appearing after an update from 4.14.62 and 4.14.98.
It stil could be anything - transition from u-boot to kernel, DMA perhaps (hmm. fallback_pio - no DMA issue)..

NoTengoBattery · May 2, 2022, 6:08am

I couldn't find the root cause, so, what I did was patching SQUASHFS directly. It is a dirty and nasty hack (tho, the code is clean) but it works and I use it in my custom build with no issues and complains so far. If you can get to the root, let us know and otherwise if you end up like me and apply those patches let us know if those patches make any sense!

yogo1212 · May 5, 2022, 12:51pm

@NoTengoBattery I don't think I'm ever going to find the underlying issue
And I don't think there'll ever be new revisions for this board so I think patching is right way.
Your patch is quite complex which makes it less attractive for long-term maintenance with upstream compatability but I looks to me like it's the best option for most cases (with the older kernel).

Your patch isn't applicable for newer kernels but there is a panic on squashfs error in linux upstream.

For the WPJ428 I decided to go with the kernel panic because the read errors only occur in ~1 out of 5 boots and only during boot. Just reboot until it works.
Here's a commit for Openwrt.

For any devices that have SPI read errors during normal operation, this likely isn't a good idea and can definitely lead to data corruption/loss!

NoTengoBattery · May 5, 2022, 5:20pm

I don't agree with the last statement. The SQUASHFS is read-only, and therefore there is no chance of corruption.

yogo1212 · May 5, 2022, 6:51pm

You're right about the squashfs itself.
But any other write process can be interrupted by a kernel panic as well (including the overlay jffs).

plunet · May 9, 2022, 3:28pm

So I'm after some clue here.... I'm running an Ubiquity ER-X with OpenWRT 21.02.1
I noticed recently that the web/luci interface was broken, so I went into take a look... noting that I now only have SSH access....

root@OpenWrt:/# ps | grep uhttp
 3869 root      1252 S    grep uhttp
root@OpenWrt:/# /etc/init.d/uhttpd start
root@OpenWrt:/# logread
Mon May  9 15:14:31 2022 kern.err kernel: [ 3778.551841] SQUASHFS error: xz decompression failed, data probably corrupt
Mon May  9 15:14:31 2022 kern.err kernel: [ 3778.565572] SQUASHFS error: squashfs_read_data failed to read block 0x161c32
Mon May  9 15:14:31 2022 kern.err kernel: [ 3778.623666] SQUASHFS error: xz decompression failed, data probably corrupt
Mon May  9 15:14:31 2022 kern.err kernel: [ 3778.637399] SQUASHFS error: squashfs_read_data failed to read block 0x161c32
Mon May  9 15:14:31 2022 daemon.info procd: Instance uhttpd::instance1 s in a crash loop 7 crashes, 0 seconds since last crash

So it seems I've got the same problem as described above...
I have tried to reboot the ER-X, it does reboot OK but some of the additional modules I have loaded crash on startup with similar SquashFS errors.
One of the ways to get around it seems to be to just reflash the firmware, it's a sticky plaster but should provide some temporary relief...

root@OpenWrt:/# cd /tmp
root@OpenWrt:/tmp# wget https://downloads.openwrt.org/releases/21.02.3/targets/ramips/mt7621/openwrt-21.02.3-ramips-mt7621-ubnt_edgerouter-x-sfp-squashfs-sysupgrad
e.bin
Bus error

and looking in the log

Mon May  9 15:18:17 2022 kern.err kernel: [ 4004.042655] SQUASHFS error: xz decompression failed, data probably corrupt
Mon May  9 15:18:17 2022 kern.err kernel: [ 4004.056382] SQUASHFS error: squashfs_read_data failed to read block 0x161c32
Mon May  9 15:18:17 2022 kern.err kernel: [ 4004.110919] SQUASHFS error: xz decompression failed, data probably corrupt
Mon May  9 15:18:17 2022 kern.err kernel: [ 4004.124652] SQUASHFS error: squashfs_read_data failed to read block 0x161c32

On this occasion nothing was seemingly written to /tmp although I can see some evidence that some previous attempts to download did write a coredump to /tmp

-rw-------    1 root     root       1462272 Feb  6 12:14 uclient-fetch.1644149684.3224.10.core
-rw-------    1 root     root       1462272 May  9 14:12 uclient-fetch.1652105525.3721.10.core
-rw-------    1 root     root       1462272 May  9 14:12 uclient-fetch.1652105531.3730.10.core

How could I escape from this situation, if the solution is to reflash the unit (or update some squashfs binaries) but I seemingly can't write anything to the flash of the ER-X, it seems I'm possibly a bit stuck? If I should open a new thread rather than clutter this one with my own personal issues, please say so...