19.07.xx: Strange kernel jffs2 bug

So after doing more tests, I think this is a multicausal problem. I reverted /etc/rc.local so that it contains nothing; the router flashes and comes up with the settings preserved, but when I look at dmesg (in this case tested on 19.07.2) I always find the message

jffs2: Newly-erased block contained word 0x19852003 at offset 0x00000000

I bet this is NOT hardware failure. This is also exactly the same message (0x19852003) as in:

To nail this down to the root cause and be able to report back to you, I need to know exactly which OpenWrt stable 19.07.xx release contains the fix mentioned in the other thread ( ref: https://github.com/gl-inet/openwrt/commit/4b1f073f843dc4be655a868f6a6e31f74baa727c ), if it has been released yet. @mk24 Can you please shed some light?
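For reference, the word in the message can be decoded with a few lines of Python. This is only a sketch: the constants are the ones defined in the Linux kernel's include/uapi/linux/jffs2.h, the helper name `split_word` is made up for illustration, and it assumes the word is printed with the magic in the upper 16 bits, as would be the case on a big-endian MIPS target.

```python
# Constants as defined in the Linux kernel header include/uapi/linux/jffs2.h.
JFFS2_MAGIC_BITMASK = 0x1985
JFFS2_NODE_ACCURATE = 0x2000
JFFS2_FEATURE_RWCOMPAT_DELETE = 0x0000
JFFS2_NODETYPE_CLEANMARKER = (
    JFFS2_FEATURE_RWCOMPAT_DELETE | JFFS2_NODE_ACCURATE | 3
)


def split_word(word):
    """Split a 32-bit word into its upper and lower 16-bit halves.

    Hypothetical helper; assumes the magic sits in the upper half.
    """
    return (word >> 16) & 0xFFFF, word & 0xFFFF


magic, nodetype = split_word(0x19852003)
looks_like_cleanmarker = (
    magic == JFFS2_MAGIC_BITMASK and nodetype == JFFS2_NODETYPE_CLEANMARKER
)
print(hex(magic), hex(nodetype), looks_like_cleanmarker)
# prints: 0x1985 0x2003 True
```

Under these assumptions, 0x19852003 is the start of a JFFS2 cleanmarker node rather than random data. A freshly erased block would be expected to read back as all 0xFF, so finding a cleanmarker there would be consistent with the read returning the block's pre-erase contents, although this alone does not prove where the stale data comes from.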


Did you notice that that code is not even from an OpenWrt repo?
It is from gl-inet's repo of their derivative firmware for their devices based on an old kernel version...
(Both the ar71xx target and also the kernel 4.9 have already been deprecated in OpenWrt)

@hnyman I'm just a user, so I just deploy and upgrade devices when needed. I'm not that familiar with what a commit or a repo means. Can you offer a solution? It seems to be exactly the problem I'm suffering from.

  • I've tested on 19.07.2 upgrading to 19.07.2 -> jffs error occurs.
  • I've tested on 19.07.2 upgrading to 19.07.3 -> jffs error occurs.
  • I've tested on 19.07.3 upgrading to 19.07.3 -> jffs error occurs.
  • I've tested on 19.07.3 upgrading to 19.07.4 -> jffs error occurs.
  • I've tested on 19.07.4 upgrading to 19.07.4 -> jffs error does NOT occur.

So it seems the SPI write flash issue is indeed a software issue. Testing around, I assume that during a sysupgrade the old kernel still boots one time, flashes the new image and then directly fires up everything without rebooting. So that might be why I'm experiencing the config loss on next reboot (no matter if my scripts do it or if I do it manually). If this is true, that explains why going from an unaffected firmware to another one works without the config loss, e.g. 19.07.4 to 19.07.4. That's why I asked for confirmation if OpenWrt did solve this issue in the meantime. Can you answer it? If OpenWrt didn't do anything about it, maybe it will return some day and all of us should be aware of this.

Just another thesis: It may be that no one experiences the issue if only a "small" amount of overlay data is persisted during sysupgrade. I've got 9 MB of overlay capacity and am using only 3 MB of it to persist, so that is well within limits and is expected to work. If not, OpenWrt (or the kernel) has a bug and it should be investigated and fixed.

This is one of the drawbacks of MIPS memory-mapping. If you use memory-mapping to read less than 32 bytes of data from flash, then erase the flash, and then use memory-mapping to read less than 32 bytes from the same location again, the data will be wrong.
As far as I know, the latest OpenWrt does not fix this problem.
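The read, erase, read sequence described above can be illustrated with a toy model. This is not the real SPI flash driver: it simply assumes that short memory-mapped reads are served from a 32-byte buffer that an erase fails to invalidate, which is one way the described symptom could arise. The class `ToyMappedFlash` and all of its methods are hypothetical.

```python
LINE = 32  # buffered line size in bytes; the 32-byte figure is from the post


class ToyMappedFlash:
    """Toy flash whose read buffer is not invalidated by erase()."""

    def __init__(self, size):
        self.cells = bytearray(b"\xff" * size)  # erased flash reads as 0xFF
        self.line_cache = {}                    # line index -> buffered bytes

    def program(self, offset, data):
        self.cells[offset:offset + len(data)] = data

    def erase(self, invalidate=False):
        self.cells[:] = b"\xff" * len(self.cells)
        if invalidate:
            self.line_cache.clear()

    def mapped_read(self, offset, length):
        """Serve short reads from the line buffer, long reads from the cells.

        Simplification: a short read must not cross a line boundary.
        """
        if length >= LINE:
            return bytes(self.cells[offset:offset + length])
        line = offset // LINE
        if line not in self.line_cache:
            start = line * LINE
            self.line_cache[line] = bytes(self.cells[start:start + LINE])
        rel = offset % LINE
        return self.line_cache[line][rel:rel + length]


flash = ToyMappedFlash(64)
flash.program(0, bytes.fromhex("19852003"))
before = flash.mapped_read(0, 4)   # short read fills the buffer
flash.erase()                      # buffer is left untouched
stale = flash.mapped_read(0, 4)    # served from the stale buffer
flash.erase(invalidate=True)
fresh = flash.mapped_read(0, 4)    # buffer dropped, real contents returned
print(before.hex(), stale.hex(), fresh.hex())
# prints: 19852003 19852003 ffffffff
```

In this model the second read still returns the old word even though the cells were erased, and the stale value disappears once the buffer is dropped as part of the erase.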


@luochongjun Can I avoid this issue somehow? I'm asking because occasionally logging important things to /root/ from my bash scripts, which run 24/7, didn't cause any "config loss on reboot" issues over the years. What's also unclear to me is why only /etc/config/* is lost while the rest, for example /etc/passwd, /root/* and so on, is still there. Before the second reboot after flash, everything, including /etc/config/, is correctly in place. What could cause OpenWrt to just drop a valid /etc/config? If I connect to 192.168.1.1 after the config loss, re-upload the config from my documentation and reboot, the config runs fine (so there are no errors in it) and is persisted across future reboots.

I think I'm having the same problem as described in [ SOLVED ] Getting error failed to sync jffs2 overlay, where jeff posted that it's "bad timing handling the erasure mark on the tplink archer c7v5". What is different in my case is that the only jffs2 error in dmesg is the above-mentioned "word at offset" message. I do not get the "failed to sync overlay" message.

It can be reproduced by uploading about 4 MB via scp to the TP-Link Archer C7 v5's /root. Set sysupgrade.conf to persist /root. Upload the openwrt-sysupgrade.bin image to /tmp and run sysupgrade -k via ssh. Checking after the first reboot following the flash (as @jeff described it in that topic), I had 28 failures and 3 successes repeating the same procedure over and over. (This can be reproduced without using the offline installer scripts from my first post.) Failure means the jffs2 erase error message was there.
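To make "failure" checkable in a repeatable way, a small helper could scan captured dmesg output for the message and tally the runs. The sketch below is meant to run on a workstation against saved dmesg text; the function names are hypothetical, the sample line is the one quoted in this thread, and the counts are the 28 failures and 3 successes reported above.

```python
import re

# Matches the message quoted in this thread and captures word and offset.
PATTERN = re.compile(
    r"jffs2: Newly-erased block contained word "
    r"(0x[0-9a-fA-F]+) at offset (0x[0-9a-fA-F]+)"
)


def find_erase_errors(dmesg_text):
    """Return (word, offset) integer pairs for every matching line."""
    return [(int(w, 16), int(o, 16)) for w, o in PATTERN.findall(dmesg_text)]


def failure_rate(failures, successes):
    """Fraction of runs that showed the error."""
    return failures / (failures + successes)


sample = (
    "jffs2: Newly-erased block contained word 0x19852003 "
    "at offset 0x00000000\n"
)
errors = find_erase_errors(sample)
rate = failure_rate(28, 3)
print([(hex(w), hex(o)) for w, o in errors], f"{rate:.1%}")
# prints: [('0x19852003', '0x0')] 90.3%
```

With the reported counts, about nine out of ten runs showed the message, which is the kind of figure that helps when comparing releases or kernels.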


Don't you think if someone took time to mention it, that it may be important?

  • Commit (or 4b1f073f843dc4be655a868f6a6e31f74baa727c) is the serial number the versioning system gave to the specific change/addition/removal of code to a repository (or code repo)
  • Repo (or code repository) is where you're downloading software from (it's kind of a security issue that you don't know where the code is from) - It means your script is not from OpenWrt, just look at the link. Have you asked GL-Inet?

Then I'd suggest not developing a script to upgrade it. Simply use the normal sysupgrade method.

Sure. I even manually loaded the files I need persisted to /root, and no matter what they are, when I go to the web UI and flash the upgrade, the jffs2 error sometimes comes up and sometimes not. It has nothing to do with the scripts. If the manual way worked fine in all cases, the script would also be okay.

Having kilobytes to persist -- always works.
Having megabytes to persist (partition still over 70% free) -- often fails.

@lleachii please help correct the bug

Is this not a sysupgrade data size thing?


I'm not sure if you're purposely ignoring me; or didn't understand again...

...or perhaps the information you offered prior to tagging me wasn't in response to my post. I asked why haven't you inquired with the developers of these scripts (it isn't OpenWrt):


Are you sure you're calculating this correctly...because the sysupgrade image itself is bigger than the 3 MB you just quoted:

I'm estimating you have less than a few hundred KBs...before the sysupgrade bundles all persistent data.

I'm quite confused about how your logic always arrives at a bug. Also, please be mindful: even if you use the approach of "screaming" that there's a bug to get assistance, you still have to clearly identify it. Lastly, why would OpenWrt investigate a problem in code written by another firmware manufacturer - for their firmware? :confused:

Without your cooperation (i.e. mentioning things should work manually, but not informing us if things do work manually now) and gleaning information from the OpenWrt Wiki and this post, I'm simply starting to agree with @mk24 that you may have overfilled your device.

Can you actually provide the real free and file totals; and other information previously requested?

I think you'll find it's over 9 MB.

@lleachii I'm NOT using this gl-inet stuff; I use official OpenWrt stable builds, 19.07.4 for the TP-Link Archer C7 v5. So why talk about a foreign image or foreign code? It's just that the problem looks the same as the one described for gl-inet.

If I upgrade via web ui - problem. Via ssh sysupgrade, too.

Over ssh, I ran df -h, find /overlay and du -sh /overlay to make sure it's not overfilled. About 2.2 MB is used before running sysupgrade; while it executes, usage goes up to 4.4 MB, I guess because it backs things up for later.

The sysupgrade image is bigger, but that is not relevant for the used space in the overlay. Both the web UI and I (when uploading via scp) place the image at /tmp/firmware.bin. There are over 50 MB free there.
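The overlay figures quoted in this thread can be sanity-checked with a little arithmetic. The sketch below uses the approximate numbers given by the poster (9 MB overlay capacity, about 2.2 MB used before sysupgrade, about 4.4 MB at the peak during sysupgrade); the helper name `percent_free` is hypothetical, and the image stored in /tmp is deliberately left out because /tmp is not part of the overlay.

```python
def percent_free(capacity_mb, used_mb):
    """Percentage of the overlay still free for a given usage."""
    return 100.0 * (capacity_mb - used_mb) / capacity_mb


# Approximate figures as quoted in the thread, not measured here.
CAPACITY_MB = 9.0
USED_BEFORE_MB = 2.2
PEAK_DURING_MB = 4.4

free_before = percent_free(CAPACITY_MB, USED_BEFORE_MB)
free_at_peak = percent_free(CAPACITY_MB, PEAK_DURING_MB)
print(f"{free_before:.1f}% free before, {free_at_peak:.1f}% free at peak")
# prints: 75.6% free before, 51.1% free at peak
```

If these figures are accurate, the overlay stays roughly half empty even at the peak, which matches the earlier "over 70% free" remark for the state before the upgrade.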

@lleachii do you need screenshots? Or is it enough if an experienced IT expert says he has tested for 2 days and is sure there is something wrong and that he has observed it correctly?

@Catfriend1, I'm lost at how you think someone has tested two days.

And if you're referring to yourself, please provide the solution in a code report, developer list, etc.. To reiterate, just saying "there's a bug" doesn't identify where it is; nor get it fixed. OK, [hypothetically], there's a bug...now where is it and what's the fix?

This is community-based support and you provide little and conflicting information. I've had the following draft post saved for those 2 days:


:man_facepalming:

The script is from GL-INET! Look at your link:

:arrow_down:

GL-Inet does-not-equal OpenWrt. Are you identifying that code as the issue?

If so, locate the code in OpenWrt and note the bug there.

:open_mouth: /tmp and overlay are not the same...I'm not even sure how you honestly presented a number greater than 3 MB; but greater than your installed flash chip...hummm.

Yea, I think you may not understand.

@lleachii yeah, your comments come across to me as sidetracking my facts by misinterpreting my posts and not being willing to help. I'm no coder, but an admin and user. I know a lot about OpenWrt from those perspectives, and I do know how to make a good bug report, investigate and clearly sum up what has proved to be wrong. I need help from a developer willing to take the report seriously and look at the code with his expertise. I don't need comments that are far from doing so and that accuse me of not communicating like a dev and not doing the bug hunting in code myself. I have done hours of flashing, resetting and starting afresh to prove there is a problem when OpenWrt handles the SPI reads concurrently while restoring a 2 MB jffs2 overlay after sysupgrade. Period. Don't troll me. Let the others read, understand and help. You jumped roughly over what I wrote in an offensive-sounding way. I did not use gl-inet firmware, so there's no need to look into whether it is their bug too. But what is written there is a good pointer to what could be the cause of this verified symptom in OpenWrt. If it's not useful, okay, then an expert needs to come up with theories, but the problem is real and will stay.

Please explain the Gl-INET link then (because I know I'm not the only one confused)...then you said:

So please explain what the link means???

Maybe this will help.

:laughing:

(A developer, or at least a Core Team Member, responded to you already! - See: Post No. 7 - You refused to explain why there's a GL-INET link.)

The link to glinet came up when googling the jffs2 erase error message, because I wanted to know (as an OpenWrt user) if anyone else had the same or a similar message in their dmesg. When I read it, I thought this could also be a pointer to the root cause of why it's happening reproducibly with OpenWrt on all 5 of my devices. Before finding that link, with the exact same offset and wrong word mentioned, I was also thinking the flash might be faulty. Now I know for sure that 5 devices cannot have the same "hardware fault" at the exact same offset by accident. It might be a kernel driver issue. I just wanted to give this ray of light to someone who might read this and conclude "aha, it might be the spi read driver stuff for us at openwrt as well". I'm not arguing that the two projects share a codebase that either side should care about. I'm only interested in seeing this problem fixed in OpenWrt at some point in the future.


(Moved to the For Developers section.)


For me on ubnt device (e.g. nanostation ac loco) the jffs2 is working on 4.19 but I have the same errors on 5.4... :confused:


@PolynomialDivision thanks for making a PR :slight_smile:


Does it fix your issue? :smiley: