Does NAND flash go bad over time?

I have been using a WNDR3700v4 for several years, with its outsized RAM and NAND flash. I paid about $15 for a used one on eBay. It worked flawlessly until recently, when my WiFi became unstable and soon was completely inoperative. Reboots, power cycling, unplugging it, nothing brought back the WiFi. I figured some chip wore out, so I ordered another cheap used WNDR3700v4 on eBay. I also could not resist buying another unit, listed as bricked, for $0.99 plus shipping. The "bricked" unit came up immediately as DD-WRT, go figure. So I backed up my working configuration, used tftp to load OEM firmware onto the two newly arrived routers, flashed the latest OpenWrt, and restored my configuration. Both newly purchased units worked perfectly, including WiFi.

So I did the same thing on the failed unit - tftp flashed the OEM firmware, flashed OpenWrt, restored the configuration from backup. And it now works perfectly, too.

Does the flash in consumer routers flip a random bit from time to time? Is the unusual NAND flash on these units more susceptible to subtle corruption? They make a great OpenWrt router at the low used prices. Someday I may load the special build that makes all 128 MB of NAND flash available.

Yes.

NAND flash contains some faulty blocks pretty much by definition, straight from the manufacturer, and their number will inevitably increase over time. Devices using NAND flash, such as SSDs, SD cards, USB sticks, etc., hide this from the user: they retain a (considerably large) pool of hidden spare blocks to replace the faulty ones, managed by a smart on-device controller firmware that also employs rather sophisticated wear-leveling techniques to spread the wear equally over all blocks (including the spares). Embedded devices, like routers, usually use raw flash without these abstracted wear-levelling features and have to replicate them in software (bootloader, kernel, etc.), as part of the main device firmware. Depending on how seriously the vendor took this, crucial parts of the firmware might not be covered by wear-levelling and bad-block management (usually at least the tiny bootloader at the front, often not the kernel either, and in very bad cases not even the WiFi calibration data or, worst of all, the rootfs). A bad block 'in the wrong place' can then easily brick the device.
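For the curious, that bad-block bookkeeping is visible from Linux on raw NAND. Below is a minimal, untested sketch that counts the erase blocks already marked bad on one MTD partition via the kernel's MTD ioctls; the /dev/mtd2 node is only an example, real partition numbers and names differ per device and firmware build.

```c
/* Minimal sketch: count the erase blocks marked bad on a raw NAND
 * MTD partition via the Linux MTD ioctls. /dev/mtd2 is only an
 * example node; real partition numbers differ per device. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
    int fd = open("/dev/mtd2", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/mtd2");
        return 1;
    }

    struct mtd_info_user info;
    if (ioctl(fd, MEMGETINFO, &info) < 0) {
        perror("MEMGETINFO");
        return 1;
    }

    int bad = 0;
    /* NAND is organised in erase blocks; MEMGETBADBLOCK tells us whether
     * the block containing the given offset carries a bad-block marker. */
    for (loff_t off = 0; off < (loff_t)info.size; off += info.erasesize) {
        if (ioctl(fd, MEMGETBADBLOCK, &off) > 0)
            bad++;
    }

    printf("%d bad block(s) out of %u\n", bad, info.size / info.erasesize);
    close(fd);
    return 0;
}
```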

For NAND, aside from 'just' old age, any kind of writing or even reading can facilitate the creation of bad blocks. Hard power cuts are also not really liked by the flash (good vendors use capacitors to provide a semi-clean emergency shutdown before the power cuts out completely).

Considering the vintage of these devices, the power supply (both the wall-wart and the onboard voltage regulators) should also always be suspect, but the rather hot-running SoC and wireless chipset (including its attached RF chain and in-line amplifiers) also age over time.

So, yes - electronics and especially NAND flash don't have an unlimited shelf life, especially when operated under unknown or potentially bad conditions: hot climate, dust build-up in/around the device, improper ventilation, blocked air vents (e.g. sitting on a carpet, too close to other equipment, etc.) or incorrect mounting orientation, aside from some vendors/models just being designed in a 'sub-optimal' way in terms of heat dissipation. Bad blocks are a natural/common/'normal' occurrence, and they appear more often/more quickly the more reads/writes hit the NAND and the more often it's switched on/off hard.


As an addendum, in most consumer devices it's usually the power distribution components (or the WiFi chip itself) that fail, often because of overheating putting stress on them.

You have to keep in mind that they are built to be cheap, not to last indefinitely when run 24/7.

Which is why they use NAND instead of NOR flash. At the same storage size, NAND is cheaper, and while it is less reliable, it is "good enough" for the job.


Two small additions to complete the picture: USB sticks are probably the least protected against wear, and writing to flash memory has worse effects than reading. But recent hardware from the big-name brands is far better than the old, even unused, hardware.

There's not much we can do about the flash memory in the hardware we run OpenWrt on, but in this era of chip-supply problems, watch out for the crooks.

AFAIK all USB sticks have basic wear leveling; this was benchmarked a decade ago, and all USB drives tested lasted A LOT longer than they would have without wear leveling. https://www.zdnet.com/article/usb-drive-life-fact-or-fiction/
SD cards also have basic wear leveling in this day and age.

In general, the main difference is between flash that is addressed as a block device (i.e. like a normal hard drive) and the types of flash that are accessed raw.

Flash that is addressed as a block device has a controller presenting the storage as a virtual device (the Flash Translation Layer), where the sectors/blocks the host sees are NOT physical sectors; the controller can freely assign them to different physical flash sectors to do wear leveling and possibly caching to improve performance.
For example, if you write repeatedly to the first sector of this storage device, the first write will actually go to one flash sector, the next to another, and the next to yet another one. The controller remembers which physical flash sector actually contains the data you wrote to that first logical sector.
If a block-device flash has bad sectors, the controller plays the same game and remaps them so they "disappear", and the software running on the CPU does not need to care about bad blocks in flash. For the software running on the CPU there are no bad blocks and everything is contiguous and perfect.
This is how USB flash drives, SD cards, solid state drives and eMMC (which is like a "solid state drive on a chip") work. The only difference between them is the power and intelligence of the controller, which determines how good the performance is.
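To picture the remapping, here is a toy model of the idea only (real controller firmware also handles ECC, garbage collection, caching, etc., and is far more elaborate): the host keeps writing "sector 0", each write lands on a different physical block chosen round-robin, bad blocks simply drop out of the pool, and a small map tracks where the current copy lives.

```c
/* Toy illustration of the Flash Translation Layer idea: the logical
 * sectors the host sees are remapped to whatever physical block the
 * controller picks, so repeated writes to "sector 0" land on different
 * physical blocks and bad blocks drop out of the pool. Real controller
 * firmware is far more elaborate than this. */
#include <stdio.h>
#include <stdbool.h>

#define LOGICAL_SECTORS 4
#define PHYSICAL_BLOCKS 8

static int  map[LOGICAL_SECTORS];    /* logical -> physical mapping   */
static bool bad[PHYSICAL_BLOCKS];    /* blocks retired as bad         */
static bool used[PHYSICAL_BLOCKS];   /* blocks currently holding data */
static int  erases[PHYSICAL_BLOCKS]; /* crude wear counter            */
static int  next_free = 0;           /* round-robin allocator         */

/* pick the next good, free physical block; round-robin allocation is a
 * naive form of wear leveling (writes get spread over the whole pool) */
static int alloc_block(void)
{
    for (int i = 0; i < PHYSICAL_BLOCKS; i++) {
        int p = (next_free + i) % PHYSICAL_BLOCKS;
        if (!bad[p] && !used[p]) {
            next_free = (p + 1) % PHYSICAL_BLOCKS;
            return p;
        }
    }
    return -1; /* no good blocks left: the device is worn out */
}

static void write_sector(int logical)
{
    int old = map[logical];
    int p = alloc_block();
    if (p < 0) {
        printf("write to sector %d failed: no good blocks left\n", logical);
        return;
    }
    if (old >= 0) {       /* release the previous copy of this sector */
        used[old] = false;
        erases[old]++;    /* erasing it for reuse is what causes wear */
    }
    used[p] = true;
    map[logical] = p;
    printf("logical sector %d now lives in physical block %d\n", logical, p);
}

int main(void)
{
    for (int i = 0; i < LOGICAL_SECTORS; i++)
        map[i] = -1;
    bad[3] = true;        /* pretend block 3 left the factory marked bad */

    /* the host keeps rewriting "the same" sector... */
    for (int i = 0; i < 6; i++)
        write_sector(0);
    return 0;
}
```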

Flash that is addressed raw is when the CPU of the device is actually connected to the flash chip directly and writes each flash sector manually. This is the situation in many OpenWrt devices, and is the norm for the lowest tier of embedded devices, where "cheap" is the key word.
In this kind of situation the software running on the CPU has to write intelligently to do the wear leveling; if it writes many times to the same flash sector, that sector will wear out and die.
Also, since there is no controller that remaps the bad blocks, the software running on the CPU has to be able to deal with bad blocks itself, i.e. skip them. Not all software is smart enough to do that; many bootloaders still fail hard if the kernel partition has bad blocks (which means the device will never boot again).
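For raw NAND, that "skip the bad blocks" duty looks roughly like the sketch below; it is the same discipline that nandwrite from mtd-utils follows, while a real implementation also has to worry about ECC/OOB data, page alignment of the last chunk and write verification. The /dev/mtd1 node and the dummy image are assumptions for illustration only.

```c
/* Sketch of the "skip bad blocks" discipline software has to follow
 * when writing an image to raw NAND. /dev/mtd1 and the dummy image
 * are placeholders; real code also handles ECC/OOB and verification. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

/* write `len` bytes from `buf` starting at the first good erase block,
 * skipping any block the chip has marked bad */
static int nand_write_skipping_bad(int fd, const unsigned char *buf, size_t len)
{
    struct mtd_info_user info;
    if (ioctl(fd, MEMGETINFO, &info) < 0)
        return -1;

    loff_t off = 0;
    size_t done = 0;
    while (done < len && (unsigned long long)off < info.size) {
        /* is the erase block at `off` marked bad? if so, skip it */
        if (ioctl(fd, MEMGETBADBLOCK, &off) > 0) {
            off += info.erasesize;
            continue;
        }
        /* NAND must be erased block-wise before it can be programmed */
        struct erase_info_user ei = {
            .start  = (unsigned int)off,
            .length = info.erasesize,
        };
        if (ioctl(fd, MEMERASE, &ei) < 0)
            return -1;

        size_t chunk = len - done;
        if (chunk > info.erasesize)
            chunk = info.erasesize;
        if (pwrite(fd, buf + done, chunk, off) != (ssize_t)chunk)
            return -1;

        done += chunk;
        off  += info.erasesize;
    }
    return done == len ? 0 : -1;   /* -1: ran out of good blocks */
}

int main(void)
{
    unsigned char image[4096];
    memset(image, 0xAB, sizeof(image));   /* stand-in for a real image */

    int fd = open("/dev/mtd1", O_RDWR);
    if (fd < 0) {
        perror("open /dev/mtd1");
        return 1;
    }
    if (nand_write_skipping_bad(fd, image, sizeof(image)) < 0)
        perror("write");
    close(fd);
    return 0;
}
```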

Back in the day when NAND flash was less cheap, devices used NOR flash, which is A LOT more reliable and has much higher write endurance, so in most cases, even though it was still accessed raw, wear leveling and bad blocks were not a problem.

At the moment, in higher tiers of embedded hardware (for example the Turris Omnia router), in industrial embedded hardware and also in mobile, aka smartphones, raw flash is not used anymore; they exclusively use eMMC or better. This means they are working with a block device, and the controller in the device will deal with wear leveling and bad flash sectors.
This is still flash and it can still wear out, but it's not a huge mind-boggling pain in the back to manage wear leveling and bad blocks and whatnot.
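One nice side effect of eMMC is that the device can report its own wear estimate (eMMC 5.0 and newer), which Linux exposes in sysfs on reasonably recent kernels. A minimal sketch, assuming the eMMC shows up as mmcblk0 and the kernel provides the life_time attribute:

```c
/* Sketch: eMMC 5.0+ keeps a wear estimate in its EXT_CSD registers,
 * which recent Linux kernels expose in sysfs. The path below assumes
 * the eMMC is the first MMC block device (mmcblk0). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/block/mmcblk0/device/life_time", "r");
    if (!f) {
        perror("life_time not available here");
        return 1;
    }

    /* two hex values, each an estimate of wear in 10% steps
     * (0x01 = 0-10% of rated life used, 0x0B = estimate exceeded) */
    unsigned a = 0, b = 0;
    if (fscanf(f, "%x %x", &a, &b) == 2)
        printf("eMMC life_time estimates: type A=0x%02x, type B=0x%02x\n", a, b);
    fclose(f);
    return 0;
}
```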

