[solved] Strange mksquashfs4 issue

lexa2 · June 10, 2019, 11:15am

Hi dear all,

I'm facing somewhat strange behavior of mksquashfs4 depending on the OS it was built on.
If mksquashfs4 is built on cygwin64 from the sources shipped with OpenWrt, then compressing a rootfs with it results in a non-mountable FS on the target board (Banana Pi R2 in my case) unless the block size is set to 32 or less:

[ 6210.154192] SQUASHFS error: xz decompression failed, data probably corrupt
[ 6210.162319] SQUASHFS error: squashfs_read_data failed to read block 0x268ab5
[ 6210.170747] SQUASHFS error: Unable to read metadata cache entry [268ab5]
[ 6210.178719] SQUASHFS error: Unable to read inode 0x13721733

But it seems that there's nothing wrong with the produced fs image, as (a) it is possible to decompress it back using unsquashfs4 both on the cygwin host and on any other linux host using recent enough squashfs tools and (b) there are no problems mounting the produced image on linux through squashfuse.

On the other hand if I compile the same OpenWrt squashfs4 sources on linux and unsquashfs and recompress back using them then I'm able to mount produced image on the BPi R2 board no matter the block size.

I'm wondering what might be the cause of the issue? Apparently there are some differences in how the squashfstools and/or liblzma work under cygwin compared to linux that - for some strange reason - make a difference when trying to mount on an embedded board but are irrelevant for squashfstools/liblzma on a normal linux host. What might it be, any thoughts?

jeff · June 10, 2019, 11:44am

Past Cygwin (as well as Windows’ Linux-compatibility layer) being imperfect, to be polite, check case-insensitive file systems and byte/word order. Edit: Command-line utilities on a 64-bit "desktop" are likely much more tolerant of errors or configuration options than are the kernel-resident drivers on an SoC.

Or just install VirtualBox and Debian.

Cross-compilation and image assembly is a complex enough thing that I’d not trust it using a Linux-based tool chain under either FreeBSD or macOS myself, as much as I prefer those OSes.

lexa2 · June 11, 2019, 3:56am

Please try to actually read what I wrote in the first post. I'm not looking for workarounds - being an experienced unix system engineer I've got plenty of *nix based systems on hand. I'm curious about what's really going on under the hood.

Blaming "cygwin is bad" is an easy route for lazy people - it does not solve anything ;-). Nevertheless (a) I've got case sensitivity set up properly for cygwin (it is turned on for the Windows kernel and, for example, OpenWrt's internal checks for FS case sensitivity pass OK) and (b) an error in decompressing an xz data stream on the SoC has extremely low chances of being caused by case-insensitivity. It is an xz decompression error that's being reported, not some filesystem case sensitivity problem. As for byte/word ordering - the linux host I used for tests in this case is a CentOS7 x86_64 VM running in VirtualBox alongside the Windows/Cygwin. Thus it is the same CPU/arch/endianness. Plus you'd expect all block sizes to be screwed under cygwin if the problem was related to endianness.

The point here is that this is a developer's forum, isn't it? And developers tend to like working on solving complex problems, that's a kind of a game in itself. My entire life I've been working in the IT field doing complex things and solving complicated problems. Developers do things like this not because they are easy, but because they are hard and it is satisfying to solve something yourself knowing that others gave up on it. For example, if something that was designed to work under linux userspace fails under FreeBSD's emulation of it with the same set of userspace libraries used in both cases - it is not a matter of trust for me, it is more like a challenge and a call to curiosity. It means that either there are bugs or assumptions in the code that are needlessly linux-specific, and/or it is a valid case of linux-only code that, maybe, relies heavily on some linux-specific system call or interface that is not provided by the emulation layer. Either way I want to know what's the deal, as knowing the answer might reveal bugs and end up making the code more robust.

For this specific case we are not even talking about firmware image creation under cygwin (while I had proven for myself that it is possible - take a look here). It's a matter of using a single standalone tool, mksquashfs. I hadn't provided details on the actual workload as it is not that relevant for the question itself, but screw it, here it is:

We've got a custom board with some vendor-shipped firmware that we have no access to and no sources for. Looking at the bootlog over UART makes me believe that the firmware is OpenWrt or LEDE based.
We've got end-users using these boards for their needs and these users are ordinary people - they know nothing about linux, squashfs, etc.
The vendor's approach for providing data to the board is to insert an SD card into it. On the card there should be a single FAT32 partition with a file named datafs.squash stored on it. This file is expected to be a squashfs 4.x filesystem storing user-specific data in a vendor-defined folder structure. Don't ask me why the vendor thought that this is a good idea - my guess would be that weeds were smoky at the vendor's land by the time the decision was made.
The vendor provided a tool in the form of a bash script to be run under Ubuntu that automates the creation of datafs.squash. The rationale provided by the vendor was that it is "cross-platform as it works both under linux and under Windows using WSL". The catch is that most of our end-users use Windows 7 or 8.1, thus having no access to WSL, and they do not want to, do not know how to, or are not able to install and use an Ubuntu VM on their PCs.
The conclusion for us was that it would be worth trying to re-implement the vendor's script using cygwin and check if that would fly. But it failed, with the board being unable to mount the squashfs image that was generated under cygwin. So I started to debug the issue and was easily able to reproduce the "unable to mount" error on the BPi R2 board I had been toying with over the last several months. Then I went on and started to experiment with the way squashfs tools are compiled and with the various compression options available in mksquashfs, and found out all the info I provided in the first post of this thread.

That's basically it. I know what the next possible debugging steps are, but before taking a deep dive into it I posted the info here hoping that one of the experienced devs might have a better idea of what's going on that would save me from wasting tons of time on tedious debugging.

jeff · June 11, 2019, 4:33am

I did read what you wrote in your first post, as well as what you didn't -- namely that you failed to state that you had already confirmed that the file system was case sensitive.

Since you've already stated how surprised you are that there's any problem, doesn't that mean that you're looking for things that have "extremely low chances" to be the cause? Such as a build system or component within that system that might have unexpectedly corrupted an image or even placed the wrong file(s) in place.

But I guess you're beyond all that.

I guess your intuition is so good that the idea that the SoC on which you are trying to decompress the data potentially has a different word-size and endianness than your VM or your Cygwin can be dismissed outright.

Since you're such a seasoned professional, you should already know how to indicate what you've tried, be it a bug report or a request for help.

Nah, a seasoned professional like yourself already knows "the next possible debugging steps".

lexa2 · June 11, 2019, 5:40am

Once again: image generated by cygwin's build of mksquashfs passes consistency test on both cygwin and linux. What corruption are you talking about?

We're not reaching file level here. Mounting fails at an earlier phase. Decompressing the filesystem reveals that all expected files are in place. Recompressing them back on linux with the same compression options makes the problem go away, as was stated in the very first post.

Cygwin and linux VM on x86_64: LE and LP64 data model. Both target boards - the original where the problem was encountered and BPi R2 I used to reproduce the problem with - are ARMv7 running in LE mode. Word size is 32. That's all nice and neat but it does not explain why xz block size makes all the difference here when compressing under cygwin but does not matter when compressed under linux.
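
The host-versus-target comparison above can be sketched in C. This is only an illustration under stated assumptions: the helper names `host_is_little_endian` and `read_le32` are made up for this example and do not come from squashfs-tools or the kernel. The first helper probes the byte order of whatever machine runs it; the second decodes a little-endian on-disk field byte by byte, which yields the same value on any host regardless of its byte order or word size.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise.
 * It looks at the first byte in memory of the 32-bit value 1. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first_byte;

    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Hypothetical helper: decodes a 32-bit little-endian field from raw
 * bytes. Shifting byte by byte avoids any dependence on host byte order. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Run on the x86_64 hosts and on the ARMv7 boards described above, `host_is_little_endian()` would be expected to return 1 in every case, which is why a plain host/target endianness mismatch does not explain the symptom.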

Once again: what is happening here is not something obvious at first glance. Your answers, sarcastic or not, unfortunately do not help to reveal what's really going on. In case you've got real practical ideas on how to debug it properly in the easiest way - you're welcome to share.

lexa2 · June 11, 2019, 7:57am

So I nailed it.

First I tracked it down to be OpenWrt-specific: if I build squashfs tools without the OpenWrt patches, mksquashfs creates byte-for-byte identical FS archives on cygwin and linux when fed the same source set of files. Then I applied the OpenWrt-supplied patches and compared the images created under cygwin vs linux. The only differences were in the 12-byte header known as COMPRESSOR OPTIONS. The rest of the file was bit-for-bit identical. Looking at the header differences it was easy to spot that, for some yet unknown reason, on cygwin the compressor options header was generated in big endian while on linux it was in the expected little endian format.
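
A check like the one described above can be automated with a few lines of C. This is a minimal sketch, assuming the differing header is a sequence of 32-bit fields (the thread only says it is 12 bytes long); the function name `is_wordwise_byteswap` is hypothetical. It reports whether one blob is exactly the other with every 32-bit word byte-reversed, which is the signature of the same data written in the opposite byte order.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if b is a with every 32-bit word
 * byte-reversed, 0 otherwise. len must be a multiple of 4.
 * Note: a blob whose words are all byte palindromes (for example all
 * zeros) counts as a swap of itself. */
static int is_wordwise_byteswap(const unsigned char *a,
                                const unsigned char *b,
                                size_t len)
{
    if (len % 4 != 0)
        return 0;

    for (size_t i = 0; i < len; i += 4) {
        if (a[i]     != b[i + 3] ||
            a[i + 1] != b[i + 2] ||
            a[i + 2] != b[i + 1] ||
            a[i + 3] != b[i])
            return 0;
    }
    return 1;
}
```

Feeding it the two 12-byte headers extracted from the cygwin-built and linux-built images would be one way to confirm that they differ by byte order only and not in the option values themselves.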

From here it was a quick and easy ride to find the problem. It is a bug in OpenWrt's 160-expose_lzma_xz_options.patch patch. Here is a fix for it:

diff --git a/tools/squashfs4/patches/160-expose_lzma_xz_options.patch b/tools/squashfs4/patches/160-expose_lzma_xz_options.patch
index 9e1c1fbb..209a91c8 100644
--- a/tools/squashfs4/patches/160-expose_lzma_xz_options.patch
+++ b/tools/squashfs4/patches/160-expose_lzma_xz_options.patch
@@ -26,15 +26,15 @@
 +
 +#include <stdint.h>
 +
-+#ifndef linux
++#if defined(linux) || defined(__CYGWIN__)
++#include <endian.h>
++#else
 +#ifdef __FreeBSD__
 +#include <machine/endian.h>
 +#endif
 +#define __BYTE_ORDER BYTE_ORDER
 +#define __BIG_ENDIAN BIG_ENDIAN
 +#define __LITTLE_ENDIAN LITTLE_ENDIAN
-+#else
-+#include <endian.h>
 +#endif
 +
 +

As is typical for these cases, it wasn't cygwin's fault; it was the software's fault for treating cygwin differently compared to linux. About 90% of the cases I've seen with problems getting some linux software working under cygwin were due to the software's build configuration system trying to do something special for cygwin while it was appropriate just to leave it alone and behave the same way as under linux.
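
The thread does not spell out why the old `#ifndef linux` branch produced big-endian output, so the following is only one plausible reading. It assumes that the patched code picks its byte-swapping path with a test along the lines of `#if __BYTE_ORDER == __BIG_ENDIAN`, and that on cygwin the non-linux branch left `BYTE_ORDER` and `BIG_ENDIAN` undefined because no endian header was included. In a C `#if` expression, any identifier that is still not a macro after expansion is replaced by 0, so the comparison becomes `0 == 0` and the big-endian path is selected silently. The `DEMO_*` names below are hypothetical stand-ins so that the sketch does not collide with real system macros.

```c
/* Mimic the non-linux branch: map the "double underscore" style names
 * onto names that are never defined anywhere (no endian header here). */
#define DEMO_BYTE_ORDER DEMO_HOST_BYTE_ORDER
#define DEMO_BIG_ENDIAN DEMO_HOST_BIG_ENDIAN

/* After macro expansion both sides are unknown identifiers, which the
 * preprocessor treats as 0, so this condition is true. */
#if DEMO_BYTE_ORDER == DEMO_BIG_ENDIAN
#define DEMO_TREATED_AS_BIG_ENDIAN 1
#else
#define DEMO_TREATED_AS_BIG_ENDIAN 0
#endif

/* Hypothetical accessor so the result can be inspected at run time. */
static int demo_treated_as_big_endian(void)
{
    return DEMO_TREATED_AS_BIG_ENDIAN;
}
```

Compiling with `-Wundef` on gcc or clang flags this kind of comparison against undefined macros, which may help catch similar issues at build time.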

With the above patch for the patch in place, filesystem images produced under cygwin are bit-for-bit identical to images produced under linux, which is the result I expect to get from properly written software. Case closed.
