NAND is special in many ways, foremost in the fact that it wears down over time, by writing and even mere reading - many chips even have bad blocks straight from the factory (at best, the vendor guarantees that there are no bad blocks at the very beginning, where the bootloader and maybe the kernel tend to be stored). Unlike 'consumer' flash (SSDs, SDHC/eMMC, USB sticks, etc.), the parallel NAND destined for embedded uses usually lacks a wear levelling controller as a cost cutting measure. But even for a router, wear levelling for NAND flash is essential, meaning it needs to be implemented in software (in the kernel, for cheap). While this works quite well (and arguably, the kernel's wear levelling via ubi might even be better than some dedicated wear levelling controllers), it is not transparent to the kernel and the system at large; the bad block table and other effects of the wear levelling are visible (in other words, you need NAND aware tools for low level storage access; dd/cat are not).
But the internal boot ROM of most SOCs tends to be tiny and very limited; it usually doesn't know anything about wear levelling or ubi and merely loads $x blocks into RAM and executes them (remember, flash vendors guarantee the beginning of the flash to be error free). As a result, the bootloader (and sometimes more than that) doesn't have in-band wear levelling information, meaning that to the kernel and its NAND driver it appears to be defective (in bad cases there might even be more of a mix-up, with second stage bootloaders or wireless calibration data eventually using a different wear levelling algorithm than OpenWrt's kernel...). When encountering situations like this, the kernel complains, loudly - as it cannot know if those blocks are genuinely defective or are intentionally lacking in-band wear-levelling information. In order to avoid these false error messages, it is possible to mark these areas of the flash as don't-care to the kernel via the device tree - but that DTS moniker is rather new and might not be used for your device.
Sorry, I do not understand what you mean. I have two routers of the same model that are extended/overlayed by USB devices of the same model, but only one of the devices has such a log, and this device will restart from time to time, so I wonder if the problem lies here.
It would have really helped if you had mentioned this.
and even more this issue.
We're also still missing the information about which router model we're talking about, because that may very well play a role here.
I've provided some background on why those error messages don't necessarily imply an issue with the flash, especially at the very beginning of the flash chip - but they nevertheless can (and that's where the a/b comparison between two seemingly identical routers can be beneficial, just as well as the knowledge of what device we're talking about, as that means we could cross-check which partitions would be affected). As mentioned before, with NAND, block failures aren't black and white - it depends on the details (which we're lacking).
None of this should detract from the possibility that there might indeed be a failing chip; that can very well be the case.
Disclaimer: I never owned this particular device (nor other NAND using ath79 devices), so I can't speak specifically about it.
The wndr4300 does not define nand-is-boot-medium/boot_pages_size, so apparently it doesn't try to mask away the aforementioned in-band wear levelling information. Warnings about sectors 16, 32, 40, 88 and 120 are accordingly strongly suspected of being bogus in this regard. The important parts of your firmware (in terms of stability, not ability to boot up at all) only start 6 MB into the flash, while the highest reported block error (sector 462) would be 59'136 KB into the image, so way below kernel/rootfs/overlay.
Hardware damage might not necessarily be caused by the flash. Considering the age of your devices, capacitors (both on the mainboard and especially in the external PSU) might come into play, as well as general component aging and heat related issues.
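To make the offset arithmetic easier to follow, here is a minimal Python sketch. It rests on an assumption that is not confirmed anywhere in this thread: that the reported numbers are erase block indices and that one erase block is 128 KiB (that is the unit which yields the 59'136 figure for 462). If your log counts in a different unit, the offsets shift accordingly.

```python
# Assumption: the reported numbers are erase block indices and one erase
# block is 128 KiB; adjust ERASE_BLOCK_KIB if your chip or log differs.
ERASE_BLOCK_KIB = 128


def block_offset_kib(block: int, block_kib: int = ERASE_BLOCK_KIB) -> int:
    """Return the start offset of an erase block in KiB."""
    return block * block_kib


if __name__ == "__main__":
    # Block numbers taken from the warnings quoted in this thread.
    for block in (16, 32, 40, 88, 120, 462):
        kib = block_offset_kib(block)
        print(f"block {block:>3}: {kib:>6} KiB ({kib / 1024:.2f} MiB)")
```

Comparing the resulting offsets against the partition layout of your device (from the boot log or /proc/mtd) tells you which partition a reported block would fall into.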
Overall, you keep emphasizing that the information I gave is incomplete and that the device model is missing, but knowing the specific model cannot explain the cause. Besides, I am only guessing that the failure is caused by the NAND flash failing, and I don't need to know why the NAND flash fails.
Actually, depending on quality, NANDs can be expected to have bad blocks from the factory. So basically, yes - if you don't wish to provide more details. I read nowhere you were expected to guess anything; it seems you want us to do the guessing, rather.
Also, some devices have a method to format and mark those sections before flashing to them (e.g. MikroTik RouterBoot allows this)...but I'd have to ask you for the logs and model - and I'm not being funny...but you already seem to rip into people who ask for more details.
I was gonna mention the NAND thing before @slh made a more detailed (and better) response.
So I'll tell you what you're looking for:
In the beginning of the Kernel boot (i.e. the first ~5 seconds), it should list blocks, if they're marked etc. like I said. Hope this helps.
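If the log is long, a small filter can pull out the relevant lines from a saved copy (e.g. after running dmesg > boot.log). This is only a sketch: the keyword list is my assumption about what to look for, not an exhaustive list of kernel messages, and the function name is made up for illustration.

```python
import sys

# Assumption: these keywords are only a starting point for spotting
# NAND/UBI related kernel messages; adjust them to what your log contains.
KEYWORDS = ("nand", "ubi", "bad")


def nand_related(lines):
    """Return the lines that contain any of KEYWORDS (case-insensitive)."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(keyword in line.lower() for keyword in KEYWORDS)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 nandlines.py boot.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for line in nand_related(fh):
            print(line)
```

Running it against the logs of both routers makes the a/b comparison easier, since you only have to look at the lines that differ.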
Something else not mentioned:
OpenWrt version (this depends because I've seen threads where you proceed to mention EOL versions, and I know for a fact some devices had NAND issues in older versions)...also some devices are having unknown issues on 21
device model (easier to know OEM of chip and if there may be a bootloader you could use to format, but you have noted you believe this is irrelevant)
EDIT - here's a big one, have you simply tried sysupgrading over the current install since you observed the issue? (this is actually suggested often in the forum)
Now the main thing is to find out why the machine restarts from time to time.
According to my observations, it may be caused by a problem with the NAND flash hardware. Because there are many related error messages in the system log, what I want to know is whether these messages can confirm my analysis. Whether the NAND flash is physically damaged or the errors are caused by a system bug is not my concern.
Because my native language is not English, the communication between us requires the help of machine translation, so many meanings cannot be expressed correctly.
I haven't upgraded the system yet. I have considered upgrading to version 20.02, but there is also a remote machine. I am worried that after the upgrade the configuration of the other system will be lost and it will no longer be able to connect to the network, which would be troublesome, so I have to find an opportunity to do the upgrade.
At present, the problem is with the local device, and the remote device is working normally. If something goes wrong during the remote update and the device cannot connect to the network, the problem will be serious. But updating only the local device may also cause problems if the two devices end up paired with different versions of some software.