What can/should be done with NAND bitflips?

I am flashing a BT HH5a with LEDE and I see quite a few correctable bitflips:

root@LEDE:/# nanddump --file /tmp/mounts/USB-A/home/hh5a.nanddump /dev/mtd4
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x08000000...
ECC: 1 corrected bitflip(s) at offset 0x01a7e000
ECC: 1 corrected bitflip(s) at offset 0x01dbf000
ECC: 1 corrected bitflip(s) at offset 0x01ddc800
...

I understand that, so far, these errors can be corrected at the hardware level (is that true?). But is there a more permanent solution to this problem? Is it possible to mark those blocks as bad blocks now, before I put LEDE on the device? I would rather lose some NAND space now than risk instability later, when the router is put into service.

Thanks for your advice.

tl;dr: this is normal for NAND flash and is already dealt with, so there is no need to worry about it.


More in-depth answer:

Note how it says "correctED bitflips", and not "correctABLE bitflips".

The messages you are seeing there are normal for NAND flash. They mean that the ECC (error-correcting code) logic detected bit errors and corrected the data on read using the parity bits.
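To see where the reported offsets actually land on the chip, here is a minimal Python sketch. It assumes only the geometry printed by nanddump in the question (block size 131072, page size 2048); the `locate` helper is a name made up for this illustration and is not part of any NAND tool.

```python
# Geometry as printed by nanddump in the question.
BLOCK_SIZE = 131072  # bytes per erase block
PAGE_SIZE = 2048     # bytes per page


def locate(offset):
    """Return (block index, page index inside that block) for a byte offset."""
    block, within_block = divmod(offset, BLOCK_SIZE)
    return block, within_block // PAGE_SIZE


# Offsets copied from the "corrected bitflip(s)" lines in the question.
reported = [0x01A7E000, 0x01DBF000, 0x01DDC800]

for off in reported:
    block, page = locate(off)
    print(f"offset 0x{off:08x}: block {block}, page {page}")
```

The three offsets fall in three different blocks, each with a single corrected bit.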

A "bitflip" is a bit that changes its state on its own: a 1 becomes a 0 or the reverse. It is not a bad block (permanent damage); it is something that happens at random in NAND flash, which is less reliable by design to keep costs down. The drawback is worked around with ECC logic that corrects bitflips, and this is still cheaper than making the flash itself more reliable at the hardware level.

So in a NAND flash device, some space stores the actual data and some space stores parity information, so the system can use ECC logic to correct bitflips. This is automatic.
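The "data plus parity" idea can be illustrated with a toy Hamming(7,4) code, which stores 3 parity bits next to 4 data bits and can repair any single flipped bit. This is only a sketch of the principle: real NAND ECC uses stronger codes over much larger chunks of data, and the exact scheme used on the HH5a is not stated in this thread. The `encode` and `decode` names are made up for this example.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data bits, corrected position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome = 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1         # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


stored = encode([1, 0, 1, 1])
stored[4] ^= 1                       # simulate a bitflip at position 5
data, fixed_at = decode(stored)
print(f"data read back: {data}, corrected bitflip at position {fixed_at}")
```

The reader gets the original four bits back even though one stored bit changed, which is the same thing the "corrected bitflip(s)" messages report on a much larger scale.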

The same thing happens inside SSDs, USB flash drives, SD cards and smartphones (all use NAND flash). You just don't see it happening because it is all handled by the storage controller. In an embedded device where read/write speed isn't important (like a router), the flash memory is accessed raw: there is no such controller, so the system itself applies ECC when reading and writing it.

Note that with the ECC logic implemented (by either a storage controller or the system), the NAND flash is as reliable as you would expect a storage device to be. It's not "unreliable" or anything.

EDIT: bonus fact: the reason you are using the "nanddump" tool instead of the more common "dd" tool to back up the flash partition is that nanddump is aware of all the above. It reads the NAND with ECC logic and stores only the actual data (with bitflips corrected) in the backup you are creating.
The dd tool is not aware of NAND ECC logic, so it will read the whole NAND partition, both data AND parity, and generate a backup that is exactly what is on the flash but is completely useless and unreadable, as it will contain both data and hardware-specific parity information that cannot be restored on a different device.
The dd tool (or something more advanced like pv) should only be used on block devices: hard drives, SSDs, SD cards and so on, where any ECC is done by the storage controller, not by the system, so whatever it reads is pure data, with no metadata or parity for ECC logic.
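For a sense of scale, the split between data area and out-of-band (OOB) area can be worked out from the numbers nanddump printed in the question. This is plain arithmetic on those figures, assuming the dumped range 0x00000000 to 0x08000000 covers the whole partition. Note that the OOB area holds the ECC parity along with other metadata such as bad block markers, so not all of it is parity.

```python
# Figures taken from the nanddump output in the question.
PAGE_SIZE = 2048             # data bytes per page
OOB_SIZE = 64                # out-of-band bytes per page
BLOCK_SIZE = 131072          # data bytes per erase block
PARTITION_SIZE = 0x08000000  # end offset of the dump

MIB = 1024 * 1024

pages_per_block = BLOCK_SIZE // PAGE_SIZE
blocks = PARTITION_SIZE // BLOCK_SIZE
pages = PARTITION_SIZE // PAGE_SIZE
oob_total = pages * OOB_SIZE

print(f"{blocks} blocks of {pages_per_block} pages each, {pages} pages in total")
print(f"data area: {PARTITION_SIZE // MIB} MiB, OOB area: {oob_total // MIB} MiB")
```

So roughly 3% extra space on top of the data area is set aside per page, which is the price paid for being able to correct bitflips.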


Many thanks for the perfect explanation!

Well written!
Would be a good addition to https://wiki.openwrt.org/doc/techref/flash.
Would you mind adding it? I could do so, but I'm not sure how to properly integrate it in the existing "Bad Blocks" section...

Ok, added. I split my text into an informative part about bitflips (which have nothing to do with bad blocks, so they go in their own paragraph) and another paragraph about the tools that MUST be used to read/write raw NAND because of all this ECC logic.

Feel free to rearrange them if needed.


Perfect, thank you!
