Does such a log indicate that the router's NAND FLASH is damaged?

The router extends /overlay through a USB device. Does the following system log indicate that the NAND FLASH of the router is damaged?

kern.err kernel: [   26.669221] blk_update_request: I/O error, dev mtdblock0, sector 16
kern.err kernel: [   26.721564] blk_update_request: I/O error, dev mtdblock0, sector 120
kern.err kernel: [   26.746073] blk_update_request: I/O error, dev mtdblock0, sector 120
kern.err kernel: [   26.752539] Buffer I/O error on dev mtdblock0, logical block 15, async page read
kern.err kernel: [   26.784012] blk_update_request: I/O error, dev mtdblock0, sector 120
kern.err kernel: [   26.790476] Buffer I/O error on dev mtdblock0, logical block 15, async page read
kern.err kernel: [   27.035361] blk_update_request: I/O error, dev mtdblock4, sector 32
kern.err kernel: [   27.053381] blk_update_request: I/O error, dev mtdblock4, sector 40
kern.err kernel: [   27.077240] blk_update_request: I/O error, dev mtdblock4, sector 128
kern.err kernel: [   27.092646] blk_update_request: I/O error, dev mtdblock4, sector 128
kern.err kernel: [   27.099109] Buffer I/O error on dev mtdblock4, logical block 16, async page read
kern.err kernel: [   27.171238] blk_update_request: I/O error, dev mtdblock4, sector 128
kern.err kernel: [   27.177702] Buffer I/O error on dev mtdblock4, logical block 16, async page read
kern.err kernel: [   27.365285] blk_update_request: I/O error, dev mtdblock7, sector 88
kern.err kernel: [   27.398742] Buffer I/O error on dev mtdblock7, logical block 462, async page read
kern.err kernel: [   27.542394] Buffer I/O error on dev mtdblock9, logical block 11, async page read
daemon.err block: /dev/ubiblock0_0 is already mounted on /rom
daemon.err block: /dev/sda1 is already mounted on /overlay
daemon.err block: /dev/sda2 is already mounted on ***

Please mention the device in question.

Not necessarily...
NAND is special in many ways, foremost in that it wears down over time, through writing and even mere reading. Many chips even have bad blocks straight from the factory (at best, the vendor guarantees that there are no bad blocks at the very beginning, where the bootloader and maybe the kernel tend to be stored). Unlike 'consumer' flash (SSDs, SDHC/eMMC, USB sticks, etc.), the parallel NAND destined for embedded use usually lacks a wear levelling controller as a cost-cutting measure. But even for a router, wear levelling for NAND flash is essential, meaning it needs to be implemented in software (in the kernel, on the cheap). While this works quite well (and arguably, the kernel's wear levelling via ubi might even be better than some dedicated wear levelling controllers), it is not transparent to the kernel and the system at large; the bad block table and other effects of the wear levelling are visible (in other words, you need NAND-aware tools for low-level storage access; dd and cat are not).

But the internal boot ROM of most SoCs tends to be tiny and very limited; it usually doesn't know anything about wear levelling or ubi and merely loads $x blocks into RAM and executes them (remember, flash vendors guarantee the beginning of the flash to be error free). As a result, the bootloader (and sometimes more than that) doesn't have in-band wear levelling information, meaning that to the kernel and its NAND driver it appears to be defective (in bad cases there might even be more of a mix-up, with second stage bootloaders or wireless calibration data using a different wear levelling algorithm than OpenWrt's kernel...). When encountering situations like this, the kernel complains, loudly - as it cannot know whether those blocks are genuinely defective or are intentionally lacking in-band wear levelling information. In order to avoid these false error messages, it is possible to mark these areas of the flash as don't-care to the kernel via the device tree - but that DTS property is rather new and might not be used for your device.

Sorry, I do not understand what you mean. I have two routers of the same model, both with /overlay extended onto USB devices of the same model, but only one of them shows such a log, and that one restarts from time to time, so I wonder whether the problem lies here.

It would have really helped if you had mentioned this.

and even more this issue.

We're also still lacking the information which router model we're talking about, because that may very well play a role here.

I've provided some background on why those error messages don't necessarily imply an issue with the flash, especially at the very beginning of the flash chip - but they nevertheless can (and that's where the a/b comparison between two seemingly identical routers can be beneficial, just like knowing what device we're talking about, as that means we could cross-check which partitions would be affected). As mentioned before, NAND block failures aren't black and white - it depends on the details (which we're lacking).

All of this shouldn't detract from the possibility that there might indeed be a failing chip; that can very well be the case.

Disclaimer: I never owned this particular device (nor other NAND using ath79 devices), so I can't speak specifically about it.

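The wndr4300 does not define nand-is-boot-medium/ boot_pages_size, so apparently it doesn't try to mask away the aforementioned in-band wear levelling information. Warnings about sectors 16, 32, 40, 88, 120 are accordingly very likely to be bogus in this regard. The important parts of your firmware (in terms of stability, not ability to boot up at all) only start 6 MB into the flash. Keep in mind that the offsets in the log are relative to each mtdblock partition, not to the whole flash: the highest reported error (logical block 462 on mtdblock7) is about 1'848 KB into that partition, given that sector 120 corresponding to logical block 15 implies 4 KB logical blocks. Whether that is below kernel/ rootfs/ overlay depends on which partition mtd7 is on your device (see /proc/mtd).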
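
The offsets behind such log lines can be worked out by hand. A minimal sketch, assuming 512-byte sectors in the blk_update_request lines and 4 KB logical blocks in the "Buffer I/O error" lines (the latter is inferred from the log itself, where sector 120 and logical block 15 are reported together). The offsets are relative to the start of each mtdblock device, and the function names are made up for illustration:

```python
SECTOR_SIZE = 512           # blk_update_request counts 512-byte sectors
LOGICAL_BLOCK_SIZE = 4096   # assumption: inferred from sector 120 <-> logical block 15


def sector_to_offset(sector):
    """Byte offset of a sector, relative to the start of its mtdblock device."""
    return sector * SECTOR_SIZE


def logical_block_to_offset(block):
    """Byte offset of a logical block, relative to the start of its mtdblock device."""
    return block * LOGICAL_BLOCK_SIZE


# (device, unit, number) tuples taken from the log in the first post
reports = [
    ("mtdblock0", "sector", 16),
    ("mtdblock0", "sector", 120),
    ("mtdblock4", "sector", 128),
    ("mtdblock7", "sector", 88),
    ("mtdblock7", "logical block", 462),
    ("mtdblock9", "logical block", 11),
]

for dev, unit, number in reports:
    if unit == "sector":
        offset = sector_to_offset(number)
    else:
        offset = logical_block_to_offset(number)
    print(f"{dev}: {unit} {number} -> {offset // 1024} KB into the partition")
```

Comparing these per-partition offsets with the partition sizes listed in /proc/mtd shows which part of the flash each error actually touches.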

Hardware damage is not necessarily caused by the flash. Considering the age of your devices, capacitors (both on the mainboard and especially in the external PSU) might come into play, as well as general component aging and heat-related issues.

You keep emphasizing that the information I gave is incomplete and that the device model is missing, but knowing the specific model does not explain the cause. Besides, I am only guessing that the fault is caused by a NAND FLASH failure; I don't need to know why the NAND FLASH fails.

mount -t ubifs /dev/ubi0_1 /mnt/

Previously, the command could mount successfully, but now the following prompt appears when using this command:

mount: mounting /dev/ubi0_1 on /mnt/ failed: Bad message

kern.err kernel: [298926.090888] UBIFS error (ubi0:1 pid 18939): 0x8019f39c: reading 11 bytes from LEB 7:32768 failed, error -77
kern.warn kernel: [298926.100981] CPU: 0 PID: 18939 Comm: mount Not tainted 4.9.182 #0
kern.warn kernel: [298926.107222] Stack : 804c75c2 00000034 00000000 00000001 00000000 00000000 00000000 00000000
kern.warn kernel: [298926.115928]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
kern.warn kernel: [298926.124595]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
kern.warn kernel: [298926.133352]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
kern.warn kernel: [298926.142010]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
kern.warn kernel: [298926.150679]         ...
kern.warn kernel: [298926.153257] Call Trace:
kern.warn kernel: [298926.155771] [<8006accc>] 0x8006accc
kern.warn kernel: [298926.159441] [<8006accc>] 0x8006accc
kern.warn kernel: [298926.163064] [<8019f3a4>] 0x8019f3a4
kern.warn kernel: [298926.166722] [<801b4124>] 0x801b4124
kern.warn kernel: [298926.170383] [<801a0f2c>] 0x801a0f2c
kern.warn kernel: [298926.174054] [<801b4150>] 0x801b4150
kern.warn kernel: [298926.177725] [<801a9c50>] 0x801a9c50
kern.warn kernel: [298926.181352] [<80097ff0>] 0x80097ff0
kern.warn kernel: [298926.185039] [<8019d6dc>] 0x8019d6dc
kern.warn kernel: [298926.188667] [<80138074>] 0x80138074
kern.warn kernel: [298926.192380] [<8019b01c>] 0x8019b01c
kern.warn kernel: [298926.196063] [<8011e718>] 0x8011e718
kern.warn kernel: [298926.199712] [<80138538>] 0x80138538
kern.warn kernel: [298926.203365] [<8013b030>] 0x8013b030
kern.warn kernel: [298926.207012] [<80099618>] 0x80099618
kern.warn kernel: [298926.210699] [<80099b54>] 0x80099b54
kern.warn kernel: [298926.214326] [<8013b578>] 0x8013b578
kern.warn kernel: [298926.218030] [<8006f40c>] 0x8006f40c
kern.warn kernel: [298926.221680]
kern.notice kernel: [298926.227596] UBIFS (ubi0:1): background thread "ubifs_bgt0_1" stops
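
The "error -77" in the UBIFS line and the "Bad message" printed by mount appear to be the same error code seen from two sides. A minimal sketch for decoding it, assuming the router runs a MIPS kernel (MIPS uses its own errno numbering, so Python's errno module on a PC would give a different answer). The tables below are a hand-copied excerpt and should be treated as assumptions to verify against the kernel headers; decode is a made-up helper name:

```python
# Hand-copied excerpt of Linux errno numbers; assumptions, verify against
# the kernel's errno headers for your architecture before relying on them.
ERRNO_MIPS = {5: "EIO", 77: "EBADMSG", 135: "EUCLEAN"}
ERRNO_GENERIC = {5: "EIO", 74: "EBADMSG", 117: "EUCLEAN"}  # e.g. x86, ARM

# How the MTD layer conventionally uses these codes on reads.
MTD_MEANING = {
    "EIO": "generic I/O error",
    "EBADMSG": "uncorrectable ECC error on read",
    "EUCLEAN": "bit flips were corrected, data is valid",
}


def decode(code, table=ERRNO_MIPS):
    """Map a (negative) kernel error code to its name and usual MTD meaning."""
    name = table.get(abs(code), "unknown")
    return name, MTD_MEANING.get(name, "no MTD-specific meaning listed here")


print(decode(-77))
```

EBADMSG is also the code whose standard message text is "Bad message", which matches what mount printed. An uncorrectable ECC error inside the UBI volume is a much stronger hint at a real flash problem than the mtdblock warnings near the start of the chip.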

Actually, depending on quality, NANDs can be expected to have bad blocks from the factory. So basically, yes - if you don't wish to provide more details. I read nowhere you were expected to guess anything; it seems you want us to do the guessing, rather.

Also, some devices have a method to format and mark those sections before flashing to them (e.g. MikroTik RouterBoot allows this)...but I'd have to ask you for the logs and model - and I'm not being funny...but you already seem to rip into people who ask for more details.

I was gonna mention the NAND thing before @slh made a more detailed (and better) response.

So I'll tell you what you're looking for:

In the beginning of the Kernel boot (i.e. the first ~5 seconds), it should list blocks, if they're marked etc. like I said. Hope this helps.
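
If the full log is long, the early lines can be filtered out of it. A rough sketch, assuming the kernel timestamp format seen in this thread ("[   26.669221]") and a plain case-insensitive search for the word "bad"; the exact wording of bad-block messages varies between kernels and drivers, so the keyword and the function name are only placeholders:

```python
import re

# Matches kernel timestamps such as "[   26.669221]"; call-trace entries
# like "[<8006accc>]" do not match because of the "<".
TIMESTAMP_RE = re.compile(r"\[\s*(\d+\.\d+)\]")


def early_bad_block_lines(log_text, window_seconds=5.0):
    """Return log lines from the first seconds of boot that mention 'bad'."""
    hits = []
    for line in log_text.splitlines():
        match = TIMESTAMP_RE.search(line)
        if match is None:
            continue
        if float(match.group(1)) <= window_seconds and "bad" in line.lower():
            hits.append(line)
    return hits


# A line from this thread: it is logged at ~26 s, so it is not an early line.
sample = "kern.err kernel: [   26.669221] blk_update_request: I/O error, dev mtdblock0, sector 16"
print(early_bad_block_lines(sample))
```

Feeding it the complete output of logread or dmesg (saved to a file) is the intended use; the log excerpt posted so far starts too late to contain those lines.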

Something else not mentioned:

  • OpenWrt version (this depends because I've seen threads where you proceed to mention EOL versions, and I know for a fact some devices had NAND issues in older versions)...also some devices are having unknown issues on 21
  • device model (easier to know OEM of chip and if there may be a bootloader you could use to format, but you have noted you believe this is irrelevant)

EDIT - here's a big one, have you simply tried sysupgrading over the current install since you observed the issue? (this is actually suggested often in the forum)

Now the main thing is to find out why the machine restarts from time to time.

According to my observations, it may be caused by a problem with the NAND FLASH hardware, because there are many related error messages in the system log. What I want to know is whether these messages confirm my analysis. Whether the NAND FLASH is physically damaged or the errors are caused by a system bug is not my concern.

Because my native language is not English, the communication between us requires the help of machine translation, so many meanings cannot be expressed correctly.

What language?

We could use https://translate.google.com/

Because you are missing information being given, and this is an OpenWrt-related forum.

That information was provided to you.

:confused:

This must be a language issue.

I just use google translate, but obviously a lot of meanings can't be translated accurately

I will use it.

:man_facepalming:

Please tell us your native language?
Veuillez nous indiquer votre langue maternelle ?
请告诉我们您的母语?

my native language is chinese

OK.

The log will tell you if the blocks have been tagged as bad. This is in the first 5 seconds. The log you showed provides no details.

日志会告诉您这些块是否已被标记为坏块。 这是在前 5 秒内。 您显示的日志没有提供详细信息。

Have you tried to reformat or system upgrade the bad router?

您是否尝试过重新格式化或系统升级坏路由器?

Some devices allow formatting from the loader.

某些设备允许从加载程序进行格式化。

I haven't upgraded the system yet, though I have considered upgrading to version 20.02. However, there is also a remote machine, and I am worried that after upgrading, the other system's configuration will be lost and it will no longer be able to connect to the network, which would be troublesome. So I have to find the right moment to do the upgrade.

目前没有升级过系统,不过考虑过升级到20.02这个版本,只是远程还有一台机器我担心升级系统后另外一个系统配置丢失无法连接网络,那就麻烦了,所以得找个时机才能做升级操作。

In this case, I mean the same version. I understand the remote concerns.

在这种情况下,我的意思是相同的版本。 我理解远程的担忧。

Update us when possible.

尽可能更新我们。

At present, the device with the problem is the local one; the remote device is working normally. If something goes wrong during a remote update and it can no longer connect to the network, that would be a serious problem. But updating only the local device may also cause trouble, because some of the software that pairs the two devices would then be at different versions.

目前出问题的是本地的这台设备,远程设备一切都还正常,如果远程更新过程中出问题了无法连接网络那么问题就严重了。但是仅仅更新本地设备这两台设备配对的一些软件版本不同可能会有问题。

:confused:

OK, again, I repeat 我重复 -

Correction: inform us when you have finished

更正:完成后通知我们

I have now upgraded to the 21.02.x version of the firmware, and I can use the full 128 MB of flash space.

At present, only the following error message remains in the system log:

kern.err kernel: [ 1201.649352] blk_update_request: I/O error, dev mtdblock0, sector 16 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 0
kern.err kernel: [ 1201.669734] blk_update_request: I/O error, dev mtdblock0, sector 120 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
kern.err kernel: [ 1201.681669] blk_update_request: I/O error, dev mtdblock0, sector 120 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
kern.err kernel: [ 1201.692484] Buffer I/O error on dev mtdblock0, logical block 15, async page read
kern.err kernel: [ 1201.708035] blk_update_request: I/O error, dev mtdblock0, sector 128 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
kern.err kernel: [ 1201.720830] blk_update_request: I/O error, dev mtdblock0, sector 144 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
kern.err kernel: [ 1201.748155] blk_update_request: I/O error, dev mtdblock0, sector 128 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
kern.err kernel: [ 1201.758931] Buffer I/O error on dev mtdblock0, logical block 16, async page read

So it's not that serious now.
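
To compare what is left against an earlier log, the failing sectors can be tallied per device. A small sketch, assuming log lines in the blk_update_request format shown in this thread (it handles both the older format and the newer one with the "op 0x0:(READ)" suffix); tally_io_errors is a made-up name:

```python
import re

# Matches "I/O error, dev mtdblock0, sector 16"; the "Buffer I/O error on dev"
# lines are skipped on purpose, since they repeat the same failures.
SECTOR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def tally_io_errors(log_text):
    """Return {device: sorted list of distinct failing sectors}."""
    sectors = {}
    for match in SECTOR_RE.finditer(log_text):
        sectors.setdefault(match.group(1), set()).add(int(match.group(2)))
    return {dev: sorted(found) for dev, found in sectors.items()}


# Shortened lines from the 21.02 log above
log_after_upgrade = """\
blk_update_request: I/O error, dev mtdblock0, sector 16 op 0x0:(READ)
blk_update_request: I/O error, dev mtdblock0, sector 120 op 0x0:(READ)
Buffer I/O error on dev mtdblock0, logical block 15, async page read
blk_update_request: I/O error, dev mtdblock0, sector 128 op 0x0:(READ)
blk_update_request: I/O error, dev mtdblock0, sector 144 op 0x0:(READ)
"""

print(tally_io_errors(log_after_upgrade))
```

Running the same function over the first log in this thread gives a per-device list to set side by side with this one, which makes it easy to see that only mtdblock0 still reports errors.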

I have a Pogoplug e02 and when I run block info or blkid new error logs are added

[49492.782810] __nand_correct_data: uncorrectable ECC error
[49492.788161] blk_update_request: I/O error, dev mtdblock0, sector 2040 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[49492.799362] __nand_correct_data: uncorrectable ECC error
[49492.804715] blk_update_request: I/O error, dev mtdblock0, sector 2040 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[49492.815384] Buffer I/O error on dev mtdblock0, logical block 255, async page read

My WNDR4300 does not have this problem.

Pogo E02: 21.02.3
WNDR4300: 21.02.1