This is actually a solution
What I'm posting here is not an "issue" but a solution: a patch for the Linux kernel that might solve the SQUASHFS problem on some devices [don't expect it to be a magic fix if your hardware is actually faulty].
The background behind this research
I'm a developer. I started my journey developing for AOSP when I was younger (around 15 or 16). I may not be the smartest, the wisest, or the most experienced, but I do have strong willpower. My device, the EA6350v3, works fine. It's a nice device, with some issues in the VLAN and the switch (which I've fixed myself) and the fact that ath10k is far from perfect. Still, a nice experience.
It worked flawlessly running Linux 4.14 and Linux 4.19, but that quickly changed when Linux 5.4 came to master. Since then, when using the USB port and the WiFi intensively at the same time (say, using it as a NAS), and particularly with Samba since the binary weighs about 30MB, the device spams the log like this:
[ 6282.919915] spi-nand spi0.1: SPI transfer failed: -110
[ 6282.919983] spi_master spi0: failed to transfer one message from queue
[ 6282.924107] ubi0 warning: ubi_io_read: error -110 while reading 4096 bytes from PEB 261:73728, read only 0 bytes, retry
[ 6283.008726] SQUASHFS: error: xz decompression failed, data probably corrupt
[ 6283.008737] SQUASHFS: error: squashfs_read_data failed to read block 0x10d0386
[ 6283.009113] SQUASHFS error: Unable to read fragment cache entry [10d0386]
Spamming the whole kernel log is not the actual issue: the process that needs data from the ROM will eventually crash. That process can be init, since init must live in the flashable SQUASHFS partition and may need to load a shared library, re-read code that was dropped from memory, or run some busybox tool. If init crashes, the kernel immediately panics.
I debugged the issue in multiple ways:
- I'm not the only owner of the device with this very same issue, and it's not the only device
- Doesn't happen at all using previous kernels
- The flash is not damaged or worn out
- The image is not corrupted or bad in any form, correctly generated and correctly flashed
- The SPI signal looks perfectly fine on an oscilloscope, but when the error happens the SPI bus is quiet (as if it doesn't even try to read)
- Changing the SPI, flash and UBI drivers [in the source code] in many ways didn't stop the problem
- Soldering some capacitors near the RAM, CPU, SPI and WiFi didn't help much; arguably it reduces the spamming
- Changing the SPI frequency to the maximum for SPI-QUP (far from the chip's maximum) or to the minimum didn't help
- Balancing the interrupts across all CPUs in every possible way didn't help either
- Changing the real-time properties of the SPI, UBI, USB, ksoftirqd and ath10k kernel threads in every possible way didn't help
So, the problem is not physical. It only happens with Linux 5.4, so it's a software problem, related to how the CPU, the RAM, the DMA and the respective drivers [ath10k, usbcore, ubi/ubiblock/ubifs, spi-qup, mtd/mtdblock] interact, but I can't find out what's wrong.
Investigating UBI, I found that it is resistant to errors: it can detect and correct many problems with actual bad flash chips [not the issue here]. Indeed, UBI detects a failure and tries again, as the log clearly states, and most of the time it succeeds. This means that programs or drivers that talk directly to the UBI layer succeed, like UBIFS, ubiupdatevol and many more. SQUASHFS should do the same.
This means SQUASHFS needs to be fixed. UBI corrects the error, and once it does, the data in memory is correct, so the error should be transparent to the upper layer. Why isn't SQUASHFS satisfied?
The technical background
The Linux kernel will not allow a corrupted page in memory. It will fight back against any driver attempting to corrupt the kernel's memory [corrupting the memory on purpose is another story], including the SPI, MTD and NAND drivers. UBI itself taints its buffer, tries to read, and performs a checksum over the buffer. This means that the data, after UBI retries, is indeed present in memory and correct, but SQUASHFS is unable to detect it as such. Instead, SQUASHFS enters a failure loop, even when the data is technically correct.
This is because the page cache (a section of memory used by Linux to optimize disk usage) somehow keeps the "corrupted" pages in memory. Almost all file systems in the kernel know how to deal with that situation, like F2FS on a bad flash or EXT4 with a faulty data cable, but SQUASHFS doesn't. From this point onwards, "corrupted" in quotes means that the data looks corrupted to SQUASHFS although no corruption is present, as guaranteed by UBI; I use "poisoned" as a synonym.
There are two main issues with SQUASHFS that need to be fixed to allow it to actually decompress the data from memory. The two fixes don't need to coexist, but they can, and I recommend applying both.
1. Pages and fragments will not be invalidated from the "front" cache
This is the first step to fix SQUASHFS: invalidate the pages and fragments that are "known to be corrupted", so they never stay in the "front" cache. The squashfs_cache_get function in the cache.c file contains the "front" cache verification logic of SQUASHFS. It contains the call that eventually reads the data into memory and decompresses it, either to a buffer or directly to the page cache [not discussed here].
If you look at the logs closely, SQUASHFS only attempts to decompress once. This is because the result goes into SQUASHFS's "front" cache even if the decompression failed. Upon a decompression failure, the cache remains poisoned until Linux needs to free some RAM and the page is lucky enough to get evicted; only then will Linux attempt to read again.
Invalidating the "front" cache entries, by setting their block property to SQUASHFS_INVALID_BLK, forces SQUASHFS to try to read and decompress the data again; otherwise, the page marked as faulty stays in RAM for an indefinite amount of time. However, this does not resolve any poisoning in the "back" cache, and if that cache is poisoned, the read will fail over and over again.
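The effect of invalidating a failed entry can be illustrated with a small userspace model. This is only a sketch of the idea, not the kernel code: the struct layout, the round-robin replacement, MODEL_INVALID_BLK (a stand-in for SQUASHFS_INVALID_BLK) and the flaky reader are all assumptions made up for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_INVALID_BLK UINT64_MAX  /* stand-in for SQUASHFS_INVALID_BLK */
#define MODEL_ENTRIES 4

struct model_entry {
    uint64_t block;  /* block number held by this slot */
    int error;       /* 0 on success, negative errno if the read failed */
};

struct model_cache {
    struct model_entry entry[MODEL_ENTRIES];
    int next_victim; /* simple round-robin replacement (assumption) */
    int reads;       /* how many times the backing read actually ran */
};

typedef int (*model_read_fn)(uint64_t block);

/* Hypothetical flaky reader: fails while model_failures_left > 0. */
static int model_failures_left;

static int model_flaky_read(uint64_t block)
{
    (void)block;
    if (model_failures_left > 0) {
        model_failures_left--;
        return -EIO;
    }
    return 0;
}

static void model_cache_init(struct model_cache *c)
{
    for (int i = 0; i < MODEL_ENTRIES; i++) {
        c->entry[i].block = MODEL_INVALID_BLK;
        c->entry[i].error = 0;
    }
    c->next_victim = 0;
    c->reads = 0;
}

static int model_cache_get(struct model_cache *c, uint64_t block,
                           model_read_fn read_fn, int invalidate_on_error)
{
    struct model_entry *e;

    /* Hit: return whatever result is stored, even a stale error. */
    for (int i = 0; i < MODEL_ENTRIES; i++)
        if (c->entry[i].block == block)
            return c->entry[i].error;

    /* Miss: claim a slot and read the block. */
    e = &c->entry[c->next_victim];
    c->next_victim = (c->next_victim + 1) % MODEL_ENTRIES;
    e->block = block;
    e->error = read_fn(block);
    c->reads++;

    /* The fix: a failed slot must not answer future lookups. */
    if (e->error < 0 && invalidate_on_error)
        e->block = MODEL_INVALID_BLK;

    return e->error;
}
```

With invalidate_on_error set to 0, a second lookup of the same block returns the stored error without reading again. With it set to 1, the second lookup misses and the read runs again.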
This alone may fix some problems, but the "corrupted" data will still be used by all waiting processes at least once. A new attempt is made only when another process waits for the same data, after all the original waiters have already used the corrupted data, received the killing SIGBUS signal from the kernel, or, as the corrupted page is no longer held in the page cache, entered contention over it. A better solution is to avoid poisoning the "front" cache in the first place.
2. The double cache poisoning
SQUASHFS manages its "front" cache (the decompressed data that is actually used) in the function described above. The front cache can be invalidated, which forces SQUASHFS to read the backing device and decompress again. Here comes the second cache: the "back" cache, which belongs to the raw SQUASHFS image, the compressed data stored on disk. SQUASHFS uses the ll_rw_block function (no longer the case in the mainline kernel) to read pages from the backing device. Linux optimizes these calls by putting the data, again, into its page cache.
So, the very first time, the data is actually read from disk, but the second and subsequent calls do not touch the disk, even when a read is explicitly requested, and the "corrupted" compressed data remains in memory indefinitely. Two things need to be done: implement retry logic for the squashfs_read_data function, so it retries without poisoning the "front" cache, and evict the data from RAM, so that ll_rw_block can re-fetch it from disk.
The first one is trivial: rename the original function to __squashfs_read_data, make it static inline, and wrap it in a new squashfs_read_data with the same signature that loops n times while the real function returns an error. This allows retrying without poisoning the "front" cache.
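The wrapper pattern can be sketched in plain userspace C. This is not the patch itself: the function names, the simplified signature, the retry count of 3 and the failure knob are assumptions for illustration, and the inner function only imitates a read that fails transiently.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_READ_RETRIES 3  /* hypothetical value; the text only says "n" */

static int model_transient_failures;  /* how many upcoming reads will fail */
static int model_read_calls;          /* how many times the inner read ran */

/* Stand-in for the renamed inner function: one read attempt. */
static inline int model_read_data_once(unsigned long long index, int length)
{
    (void)index;
    model_read_calls++;
    if (model_transient_failures > 0) {
        model_transient_failures--;
        return -EIO;
    }
    return length;  /* in this model, success returns the byte count */
}

/* Wrapper with the same signature: retry while the inner call fails. */
static int model_read_data(unsigned long long index, int length)
{
    int res = -EIO;

    for (int attempt = 0; attempt < MODEL_READ_RETRIES; attempt++) {
        res = model_read_data_once(index, length);
        if (res >= 0)
            break;  /* success: stop retrying */
    }
    return res;     /* last error if every attempt failed */
}
```

Callers keep calling the outer function as before. They only see an error once every attempt has failed, so a transient failure never reaches the "front" cache.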
However, this is not enough, since the "back" cache is still poisoned. All attempts may therefore fail, because the data is held in memory indefinitely. Here comes the second part: ll_rw_block asks for a buffer head pointer and, once done, returns through that pointer a chain of buffer heads which indicates the actual location of the data in memory, where the SQUASHFS decompression algorithm begins its work.
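Why retrying alone is not enough can be shown with a toy model of the "back" cache. Everything here is an assumption made for illustration (a single cached block, a disk that glitches a fixed number of times); it only mimics the behaviour described above, where a cached block is served from memory until it is evicted.

```c
#include <assert.h>
#include <stdbool.h>

struct model_disk {
    int glitches_left;   /* upcoming physical reads that return bad data */
    int physical_reads;  /* how many times the disk was really touched */
};

struct model_back_cache {
    bool present;        /* is the block cached in memory? */
    bool good;           /* did the cached copy come from a clean read? */
};

/* Read through the back cache; only a miss touches the disk. */
static bool model_cached_read(struct model_disk *d, struct model_back_cache *bc)
{
    if (!bc->present) {
        d->physical_reads++;
        bc->good = (d->glitches_left == 0);
        if (d->glitches_left > 0)
            d->glitches_left--;
        bc->present = true;
    }
    return bc->good;
}

/* Retry loop, optionally evicting the cached copy after a failure. */
static bool model_read_with_retries(struct model_disk *d,
                                    struct model_back_cache *bc,
                                    int tries, bool evict_on_error)
{
    for (int i = 0; i < tries; i++) {
        if (model_cached_read(d, bc))
            return true;
        if (evict_on_error)
            bc->present = false;  /* force the next try to hit the disk */
    }
    return false;
}
```

Without eviction, every retry is answered from the same bad copy and the disk is read only once. With eviction, the second try reaches the disk again and can succeed.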
When data is corrupted by any means, file systems must kill or zap these pages. SQUASHFS doesn't, because it doesn't implement the retry logic that most file systems do. To solve this, when a failure is detected and before freeing the buffer head, we call our newly implemented __squashfs_kill_pages. This function kills (or zaps) all the pages from all the buffer heads, effectively telling the Linux VM that the pages need to be read again.
It's inspired by AFS's afs_kill_pages, found in the write.c file. The function loops over the array of buffer head pointers that SQUASHFS allocated; for every head, it walks the whole chain of buffer heads and kills the corresponding pages one by one.
The killing process is as follows:
- Lock the page (which is a requirement)
- Call the delete_from_page_cache function, which most likely will delete the page from the page cache and from the LRU
- Clear the page's "up to date" flag, which is a flag set by the kernel when the page is synced with disk and contains valid data
- Set the page's "error" flag, which tells the kernel that the file system found an error and the data in the page is not reliable
- Unlock the page
- Clear the buffer head's "up to date" flag, which is similar to the page's flag
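The steps above, together with the loop over the buffer head array, can be modelled in userspace as follows. The structs and function names are mock-ups invented for this sketch, and the NULL-terminated chain is a simplification. The comments name the kernel helpers (as found in 5.4-era kernels) that each step appears to correspond to, which is my assumption rather than a quote from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock page: only the state the steps above touch. */
struct model_page {
    bool locked;
    bool in_page_cache;
    bool uptodate;
    bool error;
};

/* Mock buffer head: points at its page and at the next head in the chain. */
struct model_bh {
    struct model_page *page;
    bool uptodate;
    struct model_bh *next;  /* simplified, NULL-terminated chain */
};

/* Kill the page behind one buffer head, following the listed order. */
static void model_kill_page(struct model_bh *bh)
{
    struct model_page *p = bh->page;

    p->locked = true;          /* cf. lock_page() */
    p->in_page_cache = false;  /* cf. delete_from_page_cache() */
    p->uptodate = false;       /* cf. ClearPageUptodate() */
    p->error = true;           /* cf. SetPageError() */
    p->locked = false;         /* cf. unlock_page() */
    bh->uptodate = false;      /* cf. clear_buffer_uptodate() */
}

/* Walk every chain in the array and kill each page in turn. */
static void model_kill_pages(struct model_bh **bhs, int count)
{
    for (int i = 0; i < count; i++)
        for (struct model_bh *bh = bhs[i]; bh != NULL; bh = bh->next)
            model_kill_page(bh);
}
```

After the call, no page in the model is marked as cached or up to date, so a later read has no valid in-memory copy to reuse.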
This is not the solution that should be used
Even if it works and makes SQUASHFS robust, you should fix your underlying problem; but if you can't (like me), just give it a try. I think this patch is worthwhile anyway, because it gives SQUASHFS the robustness it lacks. The code is free to audit and test, and it doesn't impact performance (except for a couple of checks) when the data is successfully decompressed. It blocks all the requesting processes until all attempts are exhausted or the operation succeeds.
With this patch, the blocked processes will eventually progress, given that the read problem is only transitory or that the underlying driver can correct errors (like UBI does). With the default implementation, the blocked processes "progress" immediately, but they are killed by the SIGBUS sent when an I/O error cannot be recovered.