Fixed position partition splitting and bad blocks on NAND flash

I am in desperate need of some education. I assume the issue and solution are documented somewhere, but I haven't found an answer I like. I guess what I am looking for is the "Universally Undisputed Correct Solution" :slight_smile:

The basic problem is installation of OpenWrt on a device with NAND flash where the OEM firmware provides no writable rootfs, and the bootloader has no UBI support. So we want to split the OEM firmware partition into a kernel mtd the bootloader can read and a UBI mtd for the rootfs and other data the bootloader doesn't need to see.

This is simple in device tree: We just create an additional UBI partition starting where we want to put it.

The question is: How do we bootstrap the split properly, when installing from either OEM firmware or bootloader? They obviously only know about the OEM partition layout.

Being a simple mind, I've always thought that this is as simple as: Create a "factory" image which is the concatenation of the kernel and UBI partitions, where the kernel is padded to the split point. This image can then be flashed like the OEM firmware from e.g. the bootloader, and when booted it will know about the proper split from device tree and mount the rootfs and rootfs_data found in the UBI image.
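
For illustration, here is a minimal Python sketch of that padding scheme. The function name, the 0xFF pad byte, and the 4 MiB split point are assumptions made for the example, not taken from any actual OpenWrt image recipe.

```python
def build_factory_image(kernel: bytes, ubi: bytes, split_offset: int) -> bytes:
    """Concatenate kernel and UBI image, padding the kernel to split_offset.

    split_offset is the size of the kernel partition, i.e. the distance from
    the start of the OEM firmware partition to the UBI partition defined in
    the device tree (a hypothetical value is used in the example below).
    """
    if len(kernel) > split_offset:
        raise ValueError("kernel does not fit before the split point")
    # Erased NAND reads back as 0xFF, so that is used as padding (assumption).
    padding = b"\xff" * (split_offset - len(kernel))
    return kernel + padding + ubi


# Example with made-up sizes: a 3 MiB kernel and a 4 MiB split point.
image = build_factory_image(b"K" * (3 << 20), b"UBI#" + b"\x00" * 60, 4 << 20)
print(hex(image.index(b"UBI#")))  # offset of the UBI data inside the image
```

Note that the result is only correct as long as byte N of the image ends up at byte N of the partition, which is exactly the assumption that fails below.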

Now I've been "lucky" enough to get hold of a device with a bad block. As expressed by the bootloader:

Check image validation:
Image1 Header Magic Number --> OK
Image2 Header Magic Number --> OK
Image1 Header Checksum --> OK
Image2 Header Checksum --> OK
Image1 Data Checksum --> ................................................ranand_read: skip reading a fact bad block 440000 -> 460000
....................................................OK
Image2 Data Checksum --> ....................................................................................................OK
Image1 Stable Flag --> Not stable
Image1 Try Counter --> 0

Image1: OK Image2: OK

The bad block is in the part of the OEM firmware partition which I must use for the kernel. So I realize that the "factory image" I described is a recipe for disaster. The bad block is properly skipped both when reading and writing, which means that the kernel partition is one block smaller than my padding calculation assumed, and the UBI image with the rootfs ends up at the wrong position. It's no longer where the device tree says it should be, but offset by one block, resulting in

[ 3.411400] UBI error: no valid UBI magic found inside mtd4

since that magic now is actually in the second block of mtd4 instead of the first.
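
The effect can be reproduced with a toy model. The sketch below is only an illustration: the 128 KiB block size is inferred from the `440000 -> 460000` step in the bootloader log, while the 32-block kernel partition, the position of the bad block, and the helper name are all hypothetical.

```python
BLOCK = 0x20000      # 128 KiB erase block (inferred from the log above)
UBI_MAGIC = b"UBI#"  # magic at the start of each UBI erase block header


def write_skipping_bad(image: bytes, n_blocks: int, bad: set) -> list:
    """Write image block by block, skipping bad blocks as the bootloader does.

    Returns one entry per physical block: the data written there, or None
    for bad or untouched blocks.
    """
    flash = [None] * n_blocks
    phys = 0
    for off in range(0, len(image), BLOCK):
        while phys in bad:
            phys += 1  # skip the bad block, data moves to the next one
        if phys >= n_blocks:
            raise ValueError("image does not fit")
        flash[phys] = image[off:off + BLOCK]
        phys += 1
    return flash


KERNEL_BLOCKS = 32  # hypothetical 4 MiB kernel partition
kernel = b"K" * (KERNEL_BLOCKS * BLOCK)
ubi = (UBI_MAGIC + b"\0" * (BLOCK - 4)) * 4
flash = write_skipping_bad(kernel + ubi, n_blocks=40, bad={2})

# Index of the first physical block that starts with the UBI magic.
first_ubi = next(i for i, b in enumerate(flash) if b and b.startswith(UBI_MAGIC))
print(first_ubi - KERNEL_BLOCKS)  # blocks into the UBI partition
```

With one bad block inside the kernel area, the first UBI block lands one block past the start of the UBI partition, while the first block of that partition still holds the tail of the padded kernel.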

So I know the problem. But I still don't have a solution I like...

I could require a two step installation process, where the user has to boot an initramfs with the proper device tree first and then sysupgrade to the real installation from there. But I would prefer avoiding the additional step if possible.

Or I could replace the fixed partition split with some dynamic splitting method, allowing a combined kernel+rootfs image without any specific assumption on rootfs placement. But this is not possible with UBI AFAIK. Keeping PEB erase counters and doing proper wear levelling means that we can't move, add or remove underlying flash blocks. And I believe we do want UBI under any writable file system on NAND flash? So this method is pretty much ruled out.

Any comments are appreciated, whether you have a solution or not.

As I commented to you in the other discussion, the problem has been observed on the R7800 earlier. So far, the initramfs+sysupgrade approach is the easiest fix, as it allows writing kernel and rootfs to the correct places separately.

Likely the proper long-term solution would be bad block management built into the kernel, so that the rootfs loads correctly even with bad blocks present. Some OEMs have built it, and I think that it is also implemented for some targets in OpenWrt.

A recent OpenWrt implementation of bad block management is
https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;h=11425c9de29c8b9c5e4d7eec163a6afbb7fbdce2
I haven't really tried to analyse whether that one solves the rootfs location problem, but it is at least a step in that direction.

You might also read the discussion in:
R7800 -> Flashing openwrt causes bootloop (bad block in kernel area)

especially the few messages around here, where I quoted the BB logic from Netgear GPL sources:

I have not tried to port/implement it to OpenWrt, but it gives one answer how OEMs tackle the NAND bad block handling with their firmware.

Hi @bmork! Is your problem the same as mine?

Have you already found a solution?

Yes, seems so.

Not really. I have opted not to create "factory" images that try to pad the rootfs to a specific offset. It's better to recommend a two-step initial installation: write an initramfs first, then sysupgrade after booting into the initramfs with the proper split defined by device tree.

Could you give me an affected device model, please? I would like to add this info to the bug description.

Unfortunately, it doesn't work for me. Sysupgrade writes the kernel at 0x400000 (according to the layout in the DTS file), but U-Boot only tries 0x420000 (0x400000 + bad block shift).

I have a ZyXEL NR7101 with a bad block early in the first firmware partition. It's after the kernel offset, but before any reasonable rootfs location.

So this isn't as bad as a bad block moving the kernel offset and confusing the boot loader. Wouldn't that affect the OEM firmware too?

Hmm, thinking about it: shouldn't the boot loader ignore the bad blocks when looking for the kernel address? I can't understand how it's supposed to work if it doesn't.

FWIW, the NR7101 has no problem loading its secondary/recovery image, which will be "misplaced" if you count usable blocks instead of using the absolute address.

I think that it should do so and that this behaviour is correct. For the second slot of my device:

*************************************
Boot Flag : Sercomm1
*************************************
Fw header Magic check OK!
FW header checksum should: 0x47b7b0fe, crc32 result: 0x47b7b0fe, Fw header CRC check OK!
kernel : real offset: 0x00a20100, data length: 0x003dd3f8,checksum :0x67d1e2fd
kernel result: 0x67d1e2fd, kernel check sum ok!
rootfs:  real offset: 0x02820000, data length: 0x0093e000, checksum :0x81bdb3ef
rootfs result: 0x81bdb3ef, rootfs check sum ok!

So what we have:
Case 1. There are no bad blocks. The layouts of the stock firmware and OpenWrt are the same. Everyone is happy.

Case 2. Bad blocks. U-Boot and the stock firmware use an adaptive layout (shifted according to the bad blocks), while OpenWrt continues to use static offsets from the DTS. So we have different layouts. This can cause various weird issues:

  1. Device bricking;
  2. Boot loops;
  3. Problems with reverting to stock using mtd utils.
    That is the reason why real users in real life prefer to use Breed instead of the stock U-Boot.
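
To make the difference between the two layouts concrete, here is a small Python sketch of how a skip-bad-blocks loader would translate a static offset into a physical address. It is a model of the general idea only, not the actual U-Boot or stock firmware logic; the function name, the 128 KiB block size, and the bad block position are assumptions.

```python
def adaptive_offset(static_offset: int, bad_blocks, block: int = 0x20000,
                    base: int = 0) -> int:
    """Physical address reached when bad blocks between base and the target
    are skipped instead of counted.

    static_offset and base are assumed to be block aligned; bad_blocks holds
    the physical start addresses of the bad blocks.
    """
    bad = set(bad_blocks)
    good_needed = (static_offset - base) // block  # good blocks before target
    addr = base
    while True:
        if addr in bad:
            addr += block  # bad blocks do not count towards the offset
            continue
        if good_needed == 0:
            return addr
        good_needed -= 1
        addr += block


# Hypothetical example: static kernel offset 0x400000, one bad block before it.
shifted = adaptive_offset(0x400000, [0x100000])
print(hex(shifted))
```

In this model a single bad block anywhere before the static offset moves the adaptive address up by one block, so an image written at the static DTS offset is no longer where the adaptive reader looks.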

At the same time, if the OpenWrt behavior is changed in favor of an adaptive layout (harmonized with U-Boot and the stock firmware), current users who forced OpenWrt onto their devices using some hacks could end up with bricks and boot loops.

However, I find it more correct to use an adaptive layout. It's safer, more universal and less tricky. OpenWrt is used by end users who do not have UART adapters, NAND programmers or soldering irons. :wink: