Belkin RT3200/Linksys E8450 WiFi AX discussion

ghoffman · April 23, 2024, 10:43am

these are key questions.
maybe it's not OKD, but KD?

grauerfuchs · April 23, 2024, 11:50am

It may still be OKD based on the theory itself. There are multiple parts to this floating around at present. I personally believe the most likely culprit is the use of the reduced drive current for the chip. Now, here's where it gets complicated. We know that the current is certainly lower in the v2.9 version of the bootloader. That means updater 1.0.3 and newer certainly have the lower drive current. We also know that the factory firmware (and bootloader?) uses a higher drive current. What we do not know for certain is when the change was first implemented within the bootloader. It is certainly possible that the lower drive current is present all the way back through the first OpenWRT-compatible boot chain. We may simply be seeing a lot more of this now because the router has become cheaply available and because we've recently had both more usage and more firmware revisions supported.

The end result is that this could indeed go all the way back, or it could go partway back. Until someone dissects all of the patches available and used for each version up until 1.0.3, we won't know where the change was implemented. Either way, that knowledge is academic. We know an issue exists, and the current theory is that by raising the drive current to match that used by the factory, the writes from that point on will become more stable and therefore the data will be less prone to future misreads.

quarky · April 23, 2024, 12:22pm

One way to quickly find out is for someone who has a unit prone to OKD to upgrade to a version of the bootloader and OpenWrt that has the higher drive current configured. If OKD is still encountered, we can rule it out.

Personally I would think the ‘driving current’ theory is unlikely because, IMHO, it would result in random corruption during writes to the flash chip, making bricking incidents a lot more likely, especially during firmware upgrades or configuration changes.

From what I can tell from the various reports, it appears that OKD happens when power is cut off from the router without proper shutdown. This suggests filesystem corruption due to cached changes to the filesystem not getting flushed to the flash chip.

fda · April 23, 2024, 12:37pm

German Telekom also calls its crap DSL lines "high speed", so just ignore what it says °-°
breitbandmessung.de tells me 59 Mbit download for the 100 Mbit line, which is constantly capped to 80 Mbit, and I can tell you this company shits on every law they can! They refused a payment of only 59%.
So I switched to Vodafone 50 (15 EUR/month), but these Assi-A experiments did not finally stop. But I get this problem with the PPPoE login.

grauerfuchs · April 23, 2024, 12:51pm

The problem is, we can't yet predict when OKD is going to occur. Therefore, we can't trigger it on demand.

Couple that with not being able to modify the bootloader yet because it's waiting on an upstream patch, and we end up with the waiting game. Given that the device runs "ARM Trusted Firmware-A" as a bootloader, there may very well be some part of the payload or process we can't compile for testing. This is the ARM version of "Secure Boot" as I understand it, so there is probably a cryptographic signature necessary in order to sign the compiled binary so that it will run.

As for corruption during writes, that's not actually what happens. NAND memory works by imparting a static charge. If that charge is weak, it can approach the threshold for being read differently. Sadly, outside influences like heat or adjacent or repeated reads can lower that charge into the mud, so to speak, thereby temporarily making it even more ambiguous. The theory is that data itself was written properly, just weakly enough that these outside influences can conditionally cause a misread. This may or may not also involve timing issues with this particular memory chip, and those are also being addressed as part of the new bootloader patch set.
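To make the "written properly, just weakly" idea concrete, here is a deliberately simplified toy model. The threshold, the charge values, and the size of the disturbance are all made-up numbers in arbitrary units, and `sense` is a hypothetical helper; nothing here is measured from the RT3200's flash chip.

```python
THRESHOLD = 0.5  # arbitrary units; not a real device parameter


def sense(stored_charge, disturbance=0.0):
    """Report how a toy cell reads, given its stored charge and a transient disturbance."""
    effective = stored_charge - disturbance
    return "programmed" if effective >= THRESHOLD else "erased"


strong_write = 1.0  # assumed charge after a write with ample margin
weak_write = 0.6    # assumed charge after a marginal write

for label, charge in (("strong", strong_write), ("weak", weak_write)):
    # First value: undisturbed read. Second value: read with a transient disturbance.
    print(label, sense(charge), sense(charge, disturbance=0.2))
```

In this toy model the weakly written cell reads correctly when undisturbed and flips only while the disturbance is present, whereas the strongly written cell reads the same either way. That is the behaviour the theory describes, not proof that it is what happens in the real chip.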

If the filesystem were actually corrupt, it would entirely invalidate the recovery process on the wiki. As we have seen by every report so far, that process (read the data from flash into memory, re-write the data in memory back into flash) works perfectly in restoring units affected by OKD. Therefore, logic dictates that the filesystem is not corrupt. If that weren't enough to sway the balance of logic, the issue occurs when accessing the .fip. The .fip is only ever written during specific firmware update operations. Otherwise, it is neither mounted nor written during normal operation. Therefore, there is nothing that could be left to become corrupted. Then, extend that one more time. The production firmware itself is read-only during normal operations. You're not working with the written image, but with a superimposed overlay to it. If corruption were to occur, it would be exclusively limited to that overlay. The end result is that either the overlay would be discarded, or the router would otherwise revert to running in recovery/initramfs instead. It could not cause a failure to read the .fip/bl31 bootloader stage.

quarky · April 23, 2024, 1:17pm

Maybe I don't really understand how NAND works, but as I understand digital ICs, when a digital signal is sent to an IC, the chip will have buffers at its pins to determine whether the signal is a 1 or 0. Once that is determined, the signal is then sent further into its circuit for processing with its internal signal driver.

So once a signal is received at the NAND IC pin and a signal level is determined, how that signal is stored within it should not depend on whether the original signal is strong or weak; it is just either a 1 or 0. The NAND's internal circuitry should then store the received signal with sufficient power so as not to affect its stored data quality.

That is how I understand all digital ICs to work.

Right, but I suspect that UBIFS is an additive FS, i.e. changes are added on top of existing stored changes. Whatever gets stored never gets overwritten. So if the additions are somehow corrupted, you can still read from the originally stored data.

I have to admit that I know next to nothing about how UBIFS works.

Squashfs/JFFS/JFFS2 are more "mature" in this sense, but from what I read they do not handle wear levelling, whereas UBIFS does. Once data is written into MTD devices using squashfs/jffs/jffs2, it can never change, so it is a write-once, read-many FS. So it'll be interesting to see if there is a case out there where OKD is also seen when not using UBI. That would debunk the theory that it is caused by UBI.

If someone is using squashfs and never encounters OKD but occasionally loses all their router configuration after rebooting due to a power outage, then that points to corruption of the r/w filesystem.

grauerfuchs · April 23, 2024, 1:43pm

You're missing a number of important things, including the reports of people that are in fact using systems running on versions prior to moving the .fip into UBI space. The recovery in the wiki works for mtd-based devices that are prior to installer v. 1.1.x with the same technique, by reading and re-writing the data over the top of itself. UBI is indeed not involved with the issues seen on those devices.

Digital ICs still have to interpret the data that they handle. For instance, a CMOS digital logic gate (3.3v) may safely read 0-1.25VDC as a logic "0" and 2.25-3.3VDC as a logic "1" reliably. However, if whatever is connected to the input has a voltage of 1.75VDC, how will the gate interpret that input? Many data sheets will show a logic threshold to indicate approximately where the cutoff voltage will be. However, signals around or below that threshold become ambiguous. Now, let's expand that to two gates of the same chip design. Let's say the source chip is given a weak drive current and outputs a digital high of 3.3v at what is supposed to be 12mA, but the weak drive current limits it to 8mA. The receiving chip is expecting the 3.3V at 12mA. Because the weak drive on the source chip can't put out the 12mA, the voltage between the chips sags down to 2.21v and its value becomes ambiguous.
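The threshold argument above can be sketched in a few lines. The two limits below are simply the example figures from this post (0-1.25 V for a "0", 2.25-3.3 V for a "1"), not values from any particular data sheet, and `classify` is a hypothetical helper.

```python
# Example limits from the post above; real parts specify their own V_IL / V_IH.
V_IL_MAX = 1.25  # highest voltage still read reliably as logic 0
V_IH_MIN = 2.25  # lowest voltage still read reliably as logic 1


def classify(voltage):
    """Return '0', '1', or 'ambiguous' for an input voltage in volts."""
    if voltage <= V_IL_MAX:
        return "0"
    if voltage >= V_IH_MIN:
        return "1"
    return "ambiguous"


for v in (0.4, 1.75, 2.21, 3.3):
    print(f"{v:.2f} V -> {classify(v)}")
```

With these limits, both the 1.75 V input and the sagged 2.21 V level fall into the undefined band between the two thresholds, which is the ambiguity being described.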

quarky · April 23, 2024, 2:02pm

I'm not trying to fight you if I'm coming across as such, but as I understand it, the mt7622 OpenWrt firmware came in two versions up until v1.1.0, where the boot loaders get migrated to UBI because of the threat of bad blocks.

Prior to installer v1.1.0, mt7622 firmware comes in two flavours: squashfs and UBI. That means the firmware portion is either written using squashfs or UBIFS, while the boot loaders are still on MTD devices. So, I'm asking if anyone is using the squashfs variant of OpenWrt and also sees OKD. So far I think this is unclear. Happy to be corrected tho.

But it will still ultimately get decoded to either a 1 or 0, correct? And that ultimately gets stored? If so, isn't this a corruption during writes? It doesn't make sense to me that we have ambiguous writes and then we get perfect reads. I don't think a NAND cell can store a "flip a coin" state.

grauerfuchs · April 23, 2024, 2:15pm

I understand you're simply trying to learn and understand what's going on. Here's another piece of information you were missing:

Up until a patch added post ARM Trusted Firmware-A v. 2.9, the preloader was not capable of loading a .fip from a UBI volume. (corrected; the preloader does not actually read UBIFS.) Therefore, the .fip was kept in mtd space instead. Installer v. 1.0.3 was released with ATF-A 2.9, but it did not have the UBI support enabled in the preloader yet. It was installer v1.1.0 that first had that support. Therefore, people that had used installer v1.0.3 and before still had the .fip within mtd space. The .fip contains part of the bootloader (bl31, aka U-Boot), so that's where confusion can enter.

One of the reasons we moved away from the non-UBI is because of the limitations in the data layout the factory had set, along with problems with quality, wear leveling, and data scrubbing. (Thank you @daniel for your corrections here and above!) The factory layout used an A/B backup layout, limited support for error handling, and it had other limitations that made it undesirable for regular users. Therefore, when OpenWRT had sufficient cause to make a big change, the changes included moving the recovery, OS images, and user data into UBI. However, the bootloader and .fip still had to be stored in mtd space. Because this change happened quite a while back, many people will have long since moved to the UBI layout.

If what you are seeking is an answer for whether or not people are experiencing the issue under conditions where the involvement of UBI can be explicitly ruled out, then the answer is Yes. Anyone that has run any of the installers prior to 1.1.x fits within that category, because the fip was still in mtd space until that time and it's the fip that is having the read issues.

Correct, it will ultimately be decoded to 1 or 0 regardless of whether or not it is ambiguous. However, anything ambiguous may be misread. Now, if the data was written in a borderline ambiguous state, the outside influences could affect the decode upon read. However, the data as written is still what was written. Eliminate those outside influences and you "may" get a better decode. As for how NAND cells store the data, you may want to brush up on the memory technology and how it works. Sadly, it is indeed possible to have data that is "good but borderline quality" become poorly read due to external influences such as heat in such minuscule amounts as that which is imparted by a prior read to the same or a nearby block. Further up in this very thread were links to research articles explaining exactly how reads can go bad due to this very phenomenon.

ghoffman · April 23, 2024, 4:38pm

@grauerfuchs - as i recall, the bad block handling of the native mtd was troublesome, and that was stated as a major reason for moving to/preferring the UBI builds. yes/no?
is there any possible relationship between the underlying bad block fragility and the OKD?

grauerfuchs · April 23, 2024, 5:04pm

You're right in that the bad block handling is troublesome, but that's more from a management and wear leveling perspective. U-Boot and the preloader are smart enough to recognize a bad flash block. When U-Boot finds one during a write operation, it effectively marks the block as bad and continues on with the next block receiving the data that would have gone into the bad block. This means that over time, when referencing data by position in flash as one must do in mtd space, the data can become discontiguous and misaligned. This wreaks havoc with unaware applications because you need to recognize bad blocks and know the block size to skip ahead in order to read the remaining data. To handle this, you also need to preallocate extra trailing space in the mtd "partition" so that a bad block can be skipped. Eventually, this space may be exhausted and you could reach the end of the designated boundary for the data.
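The skip-ahead behaviour described above can be illustrated with a toy model. This is only a sketch: the flash is a Python list of tiny 4-byte "blocks", bad blocks are tracked in a plain set, and all function names are hypothetical. It is not how U-Boot or the kernel mtd layer is actually implemented.

```python
BLOCK_SIZE = 4  # bytes per block in this toy model; real NAND blocks are far larger


def write_skipping_bad(flash, bad, start, end, data):
    """Write data block by block into blocks [start, end), skipping bad blocks."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    block = start
    for chunk in chunks:
        while block < end and block in bad:
            block += 1  # data that would have gone here moves to the next block
        if block >= end:
            raise RuntimeError("partition exhausted: too many bad blocks")
        flash[block] = chunk
        block += 1


def read_skipping_bad(flash, bad, start, end, length):
    """Bad-block-aware read: skips the same blocks the writer skipped."""
    out = b""
    block = start
    while len(out) < length and block < end:
        if block not in bad:
            out += flash[block]
        block += 1
    return out[:length]


def read_naive(flash, start, length):
    """Unaware read: assumes the data is contiguous from the start offset."""
    n_blocks = -(-length // BLOCK_SIZE)  # ceiling division
    return b"".join(flash[b] for b in range(start, start + n_blocks))[:length]


flash = [b"\xff" * BLOCK_SIZE for _ in range(6)]
bad = {1}                    # pretend block 1 was marked bad
payload = b"AAAABBBBCCCC"    # three blocks of data
write_skipping_bad(flash, bad, 0, 5, payload)  # partition is blocks 0..4, with trailing spare
print(read_skipping_bad(flash, bad, 0, 5, len(payload)) == payload)
print(read_naive(flash, 0, len(payload)) == payload)
```

The aware reader gets the payload back, while the naive reader picks up the contents of the bad block and misses the tail of the data. If enough blocks go bad, the writer runs past the end of the reserved region and fails, which is the exhaustion case mentioned above.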

Since mtd is unpartitioned space, it relies on hard-coded offsets stored in the bootloader and in the OS image to know where data should start and where it should end. If those offsets become corrupted or are set improperly, woe be the one who writes anything into the flash memory from that point on.

UBI eliminates this problem by adding real partitioning information as well as setting aside and internally managing spare blocks for substitution when bad blocks are identified. Such operations are transparent to the data access layer. The data no longer needs to be contiguous in any form, since UBI can point at a substitute block anywhere within the available space rather than simply using the next sequential block. Therefore, you no longer have to worry about logical boundaries for your data. You only have to worry about the total space allocated to each partition.
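As a rough picture of why the data no longer has to be contiguous, here is a toy logical-to-physical block map. `ToyUbi` and its methods are invented for illustration; the real UBI on-flash format, its headers, and its wear-leveling logic are considerably more involved.

```python
class ToyUbi:
    """Toy logical-to-physical block map. Not the real UBI implementation."""

    def __init__(self, physical_blocks):
        self.free = list(range(physical_blocks))  # unused physical blocks
        self.bad = set()                          # retired physical blocks
        self.map = {}                             # (volume, logical block) -> physical block
        self.store = {}                           # physical block -> data

    def write(self, volume, lblock, data):
        old = self.map.get((volume, lblock))
        pblock = self.free.pop(0)  # raises IndexError when no space is left
        self.store[pblock] = data
        self.map[(volume, lblock)] = pblock
        if old is not None and old not in self.bad:
            self.store.pop(old, None)
            self.free.append(old)  # the old copy can be reused later

    def read(self, volume, lblock):
        return self.store[self.map[(volume, lblock)]]

    def mark_bad(self, volume, lblock):
        """Retire the physical block behind a logical block and relocate its data."""
        pblock = self.map[(volume, lblock)]
        data = self.store.pop(pblock)
        self.bad.add(pblock)
        self.write(volume, lblock, data)


ubi = ToyUbi(physical_blocks=4)
ubi.write("fip", 0, b"A")
ubi.write("fip", 1, b"B")
ubi.mark_bad("fip", 0)  # the caller still just asks for logical block 0
print(ubi.read("fip", 0), ubi.map[("fip", 0)])
```

The caller only ever refers to a volume name and a logical block number; which physical block holds the data is an internal detail that can change when a block is retired.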

daniel · April 23, 2024, 6:01pm

That's not true, in many ways.
First of all, MediaTek's ARM TrustedFirmware-A was never and is still not able to read UBIFS. The capability to read the FIP from a UBI volume (not a UBIFS filesystem) was added at some point after their rebase to v2.9. However, in OpenWrt we only started enabling and using that capability with version v2.10.

That's one reason, but certainly not the only one. The lack of wear-leveling and scrubbing (both well-known necessities when dealing with NAND flash!) was the much bigger reason. It can sometimes be argued that not having those is still better than having to replace the stock bootloader. However, in this particular case the stock bootloader also came with many flaws (such as dropping into a U-Boot shell when something goes wrong, despite the dual-boot design, which then requires a skilled user to connect the serial console), so it was really unacceptable to stay like that, as there were no recovery options, no use of the RESET button, no fall-back to TFTP, ...

This is also not true at all. What you are stating as a fact here must be the fruit of your mind and as such it would be good if you add words "I assume..." or "Maybe" to your statement. Please do that in future unless you are 100% sure you are stating a fact.

OpenWrt easily fits into the ~40 megs offered by the stock layout for each slot. The reason was something completely different: the SPI-NAND driver of the stock firmware was completely ignoring ECC (in this case BCH) coding, and the factory partition had been written with its OOB area all empty. That results in nasty errors when the Ethernet and WiFi drivers try to read from that partition.
However, there are work-arounds for that, such as re-writing the factory partition with ECC data, or patching the drivers to be fine with "recoverable" read errors.
To this day we also offer images for using the device with the vendor layout.

That doesn't even parse in my brain, even with a fair amount of error detection and correction. So I can't comment on that.

I agree on that part, as that is evident by now.

In general, yes.

There are several underlying bad block handling facilities; some are hardware (BCH coding in the ECC engine of the SoC), others are software (the CRC check of UBI, the hash check of TF-A). The fact that a simple re-write can fix the issue kinda means that a retry of the read could also succeed in the case of a CRC error, and in that sense you could consider it a bug that the software reading the flash in the first place is a bit too fragile.
Adding additional redundancy could obviously help as well, and that is also doable in software.
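The "retry the read on a CRC error" idea could look roughly like the sketch below. It uses Python's `zlib.crc32` purely as a stand-in checksum; `read_with_retry` and `make_flaky_reader` are hypothetical names, and this is not how TF-A or UBI implement their checks. Whether a retry would actually succeed on an OKD-affected unit is exactly the open question.

```python
import zlib


def read_with_retry(read_once, expected_crc, attempts=3):
    """Call read_once() until the data matches expected_crc or attempts run out."""
    for attempt in range(1, attempts + 1):
        data = read_once()
        if zlib.crc32(data) == expected_crc:
            return data, attempt
    raise IOError(f"CRC mismatch after {attempts} attempts")


def make_flaky_reader(good, bad_reads):
    """Hypothetical flash read that returns a single flipped bit for the first bad_reads calls."""
    state = {"calls": 0}

    def read_once():
        state["calls"] += 1
        if state["calls"] <= bad_reads:
            return bytes([good[0] ^ 0x01]) + good[1:]
        return good

    return read_once


payload = b"fip payload"
crc = zlib.crc32(payload)
data, attempt = read_with_retry(make_flaky_reader(payload, bad_reads=1), crc)
print(data == payload, attempt)
```

A reader that fails on the first mismatch gives up on data that a second attempt would have returned intact; bounding the number of attempts keeps a genuinely bad block from stalling the boot forever.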

MediaTek engineers by now have received a sample of an OKD'ed RT3200 and should help us to find the best solution.

Lynx · April 23, 2024, 6:12pm

Oh wow. That’s super cool.

grauerfuchs · April 23, 2024, 6:16pm

Thank you for your corrections. I have updated the post accordingly, so as to not confuse people. My apologies; my understanding of the entirety of the issues was inexact. I had been under the impression that the space in the layout was also one of the reasons for the change. It seems I've been working on too many devices and must have gotten this element confused with another. Again, the post has been corrected.

As for my strange way of thinking, you're right. I probably could have framed that response a little better. The point was intended to show that we have indeed received proof that the issue is not caused by something within the UBI code.

quarky · April 24, 2024, 3:02am

grauerfuchs:

Correct, it will ultimately be decoded to 1 or 0 regardless of whether or not it is ambiguous. However, anything ambiguous may be misread. Now, if the data was written in a borderline ambiguous state, the outside influences could affect the decode upon read. However, the data as written is still what was written. Eliminate those outside influences and you "may" get a better decode. As for how NAND cells store the data, you may want to brush up on the memory technology and how it works. Sadly, it is indeed possible to have data that is "good but borderline quality" become poorly read due to external influences such as heat in such minuscule amounts as that which is imparted by a prior read to the same or a nearby block. Further up in this very thread were links to research articles explaining exactly how reads can go bad due to this very phenomenon.

I'm still trying to wrap my brain around data being stored in an ambiguous state but still being read successfully later on. Wouldn't we be seeing widespread boot errors if this were the case? From what I can piece together, 100% of RT3200/E8450 units running OpenWrt have had their flash contents written with the 8mA current, so it follows that we have a pretty high chance of ambiguous data bits, with megabytes of data written every time a firmware is upgraded.

I don't believe we're seeing this reported, unless it is not reported.

Those research papers were probably done under extreme conditions. I also find it extremely unlikely for a production flash chip to produce inconsistent results when written data is read. This looks like a software error to me.

Why is boot failure reported only when the router encounters sudden power loss?

I don't remember folks reporting failure to boot during a warm boot. The only time is when they have a kernel oops recorded in the pstore FS and the router boots into recovery. Power cycling brings the router back with configurations intact.

grauerfuchs · April 24, 2024, 6:24am

It could very well be a software error, it could be a design issue or compatibility issue that the design engineers didn't see or consider, or it could all be down to a matter of technical imperfections and electrical tolerances along with external influences. We don't have a complete answer at this time, only various theories and suppositions. This is what the design engineers are trying to work out now that they have an example of the failure. Hopefully, they'll be able to find some strategy to mitigate or eliminate the oddities we're seeing here.

In general, all we can go on is what we know of the components, their tolerances, and the most likely causes based on our knowledge of the technology and our own experiences. When our knowledge proves insufficient, we have to turn to those with even more detailed knowledge of the concepts and situations involved. Daniel's knowledge far exceeds mine, and it seems that even his wealth of understanding may be strained by this one. Therefore, hopefully those design engineers will be able to shed more light on the situation once they have analyzed the failure. Those research papers are likely the end result of situations like this, bringing to light previously unknown interactions so that they can be taken into account in the future.

dalutou · April 24, 2024, 11:55am

Should be OpenWrt 23.05.2

Mhisani · April 24, 2024, 12:58pm

Right. So I ran the installer and then upgraded to snapshot.

I'm running the following:
OpenWrt SNAPSHOT r25978-ea609fe486 / LuCI Master 24.102.51352~4cffc9f
Kernel Version: 6.1.86

But if I try to use attended sysupgrade I get:
"user.info upgrade: The device is supported, but this image is incompatible for sysupgrade based on the image version (1.0->2.0).
user.info upgrade: SPI-NAND flash layout changes require bootloader update. Please run the UBI installer version 1.1.0+ (unsigned) first."

Is there a way I can check for certain whether or not things are updated as needed, and also how I can move forward from this error? I am sorry to drop this in here, but it's been difficult to find out.
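For readers puzzled by the "(1.0->2.0)" in that message, the sketch below shows one way such a compatibility gate could work. It assumes, for illustration only, that a differing major number between the device's recorded compat version and the image's compat version blocks the upgrade. The function names are hypothetical, and the real check lives in OpenWrt's sysupgrade scripts, not in this code.

```python
def parse_compat(version):
    """Split a compat version string such as '1.0' into (major, minor) integers."""
    major, _, minor = version.partition(".")
    return int(major), int(minor or 0)


def check_compat(device, image):
    """Return 'ok' or a refusal message. The major-number rule is an assumption for illustration."""
    dev_major, _ = parse_compat(device)
    img_major, _ = parse_compat(image)
    if dev_major != img_major:
        return f"image is incompatible for sysupgrade ({device}->{image})"
    return "ok"


print(check_compat("1.0", "2.0"))
print(check_compat("2.0", "2.0"))
```

Under that assumption, a device still recording 1.0 is refused a 2.0 image until its recorded version is brought up to date, which matches the shape of the log message above.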

Anteus · April 24, 2024, 1:38pm

I think you could create a luci backup, restore settings and try to flash again and see if the warning persists. If sysupgrade doesn't give any warning you could upgrade and restore your settings and afterwards update the compat_version. See: Belkin RT3200/Linksys E8450 WiFi AX discussion - #4317 by hnyman

grauerfuchs · April 24, 2024, 1:41pm

After you loaded the snapshot, did you restore a configuration from before you had run the snapshot? If so, then that's the cause of the error. Although in this case it "should" be ok to change the flag as suggested by @Anteus, it's overall bad practice since the error usually means it's not safe to use that configuration on the version you're installing.

If your configuration isn't too complex, you may want to do a factory reset, perform the update, and then rebuild the configuration from scratch just to be sure there isn't something that might cause problems hiding in there somewhere. Snapshot builds are, after all, builds in testing and there's no telling what might break or how.