I'm currently writing a script that upgrades OpenWrt offline through a remote Debian management server. The scenario is as follows:
MANUAL ONE-TIME PREPARATION:
- I've put a line in /etc/rc.local that triggers "/bin/sh /root/post-upgrade.sh &" on every boot. post-upgrade.sh looks for a marker (/etc/_OPKG_INSTALL_DONE) that gets removed by OpenWrt on sysupgrade. If the marker is missing, it runs /root/opkg_reinstall.sh once (which can reinstall, offline, the packages uploaded to /root/packages/ before the sysupgrade commenced). After a successful opkg reinstallation, it creates the marker (/etc/_OPKG_INSTALL_DONE).
- The management server downloads openwrt_release and board.json via ssh from the router we'd like to upgrade to a new (official stable) firmware image.
- It reaches out to downloads.openwrt.org to fetch the specified firmware version and the opkg lists consisting of multiple "packages.gz" files.
- It extracts the package lists and looks through them for mission-critical packages that are required for a mesh AP to come online again via wireless (batman-adv, wpad-mesh-openssl, kmod-ath10k, non-ct firmware) because the stations I need to upgrade don't have LAN access.
- It downloads the mission-critical IPK files
- It uploads the new firmware image to /tmp/sysupgrade.bin of the AP and then uploads the IPK package files to /root/packages/ of the AP.
- It remotely triggers "sysupgrade -k -v /tmp/sysupgrade.bin" to commence the firmware upgrade.
- The router itself then reboots after flashing.
- The router automatically fires /root/post-upgrade.sh (and subsequently /root/opkg_reinstall.sh) after sleeping for 30 seconds.
- I'm logging what happens during opkg_reinstall.sh to /root/opkg_reinstall.log so I can check later that everything went fine.
- The scripts trigger "reboot" after they complete, because batman-adv, kmod-ath10k, wpad-mesh-openssl and the ath10k-firmware need a reboot to bring up the wifi and get the device online via mesh again.
- If I initiate the sysupgrade via the script on the management server, sysupgrade is carried out successfully and the OpenWrt router comes back to life shortly afterwards with ALL settings and its IP address preserved. So after the first (post-flash) reboot everything is fine.
- The scripts complete and reboot a second time. According to the logs, every IPK was installed successfully. After the second reboot, my /etc/config has been reverted to stock defaults, and I don't know why that (reproducibly) happens. The device is then reachable at 192.168.1.1/24 instead of the IP I expect it to have. The contents of /etc/config/ are now defaults, while other files directly in /etc/ are still there (no revert occurred) and /root/ is fully intact (no revert occurred). I can also see my logs in /root/.
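For reference, the marker logic of post-upgrade.sh described in the preparation step boils down to roughly the following. This is a simplified sketch rather than the exact script: the paths are the ones named above, while the variable and function names (MARKER, REINSTALL, needs_reinstall, reinstall_once) are made up for illustration.

```shell
#!/bin/sh
# Simplified sketch of the marker logic in /root/post-upgrade.sh.
# Variable and function names are illustrative, not taken from the real script.
MARKER="${MARKER:-/etc/_OPKG_INSTALL_DONE}"
REINSTALL="${REINSTALL:-/root/opkg_reinstall.sh}"

needs_reinstall() {
    # True (exit 0) only while the marker is missing, i.e. after a sysupgrade.
    [ ! -e "$MARKER" ]
}

reinstall_once() {
    # Run the offline package reinstall; create the marker only on success.
    /bin/sh "$REINSTALL" && touch "$MARKER"
}

# Intended use from /etc/rc.local (kept as comments so that sourcing this
# file has no side effects):
#   sleep 30
#   if needs_reinstall; then reinstall_once && reboot; fi
```

The reboot is tied to the "marker missing" branch on purpose, so a normal boot with the marker in place does nothing at all.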
WHAT I TRIED TO REMEDIATE THE PROBLEM:
- I've tried removing the "reboot" command at the end of /root/post-upgrade.sh. When I log on via SSH, all IPKs have been installed successfully and there is only a single indication of an error in the dmesg:
[after 74s kernel boot time] jffs2: Newly-erased block contained word xxxxxxx at offset xxxxxxx
I think that's the reason why I get the "/etc/config revert phenomenon on next reboot". I've checked that it does not matter whether the script issues "reboot" or I do that manually from SSH. On the next reboot, my /etc/config is gone and the dmesg says "jffs2 whiteout" because of errors detected.
I've verified flash is fine, so I don't think this is a flash failure. I can manually put /etc/config back from my documentation, power cycle or reboot the router, and it will start operating correctly with all settings in place.
I've verified my overlay doesn't run out of space. I've got around 9 MByte of overlay free. With my 11 IPK files (temporarily) stored in /root/packages, I'm at around 47% space used. So no trouble is expected from this.
I've removed the "/bin/sh /root/post-upgrade.sh &" line from "/etc/rc.local", started all over, and let the sysupgrade flash go through (again triggered by the management server). Then I logged on via SSH and manually issued "/bin/sh /root/post-upgrade.sh" from the command line. SURPRISINGLY, this always works (tried multiple times). The jffs2 doesn't "corrupt", a second reboot follows and the router has all settings (preserved and effective). Normal operation.
This somehow led me here to report it to you. Please let me know if I should file the bug somewhere else where it fits better. I suspect it could be a kernel/jffs2 bug.
What especially points in that direction is what happens when I change my "opkg_reinstall.sh" file and remove the parts "opkg install|remove .... 2>&1 | tee -a /root/opkg_reinstall.log". If I remove them, I can also run from /etc/rc.local and the corruption does NOT occur.
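Since the corruption only shows up when the opkg output is tee'd to /root/ during the install, one variant that might be worth testing is to buffer the log in /tmp while the packages are installed and append it to /root/ in a single write afterwards. The following is an untested sketch under the assumption that /tmp is RAM-backed (tmpfs) on the device; log_run, persist_log, LOG_TMP and LOG_FINAL are hypothetical names, not part of my current scripts.

```shell
#!/bin/sh
# Untested sketch: keep the opkg log in RAM during the install and write it
# to the overlay once at the end. All names here are hypothetical.
LOG_TMP="${LOG_TMP:-/tmp/opkg_reinstall.log}"
LOG_FINAL="${LOG_FINAL:-/root/opkg_reinstall.log}"

log_run() {
    # Run the given command, append stdout and stderr to the RAM log, and
    # return the command's own exit status (a "| tee" pipeline returns tee's).
    "$@" >>"$LOG_TMP" 2>&1
}

persist_log() {
    # Append the buffered log to the file on the overlay, then flush to flash.
    cat "$LOG_TMP" >>"$LOG_FINAL" && sync
}

# Example use inside opkg_reinstall.sh (as comments only):
#   log_run opkg install /root/packages/*.ipk
#   persist_log
```

If the corruption disappears with this variant, that would at least narrow the trigger down to overlay writes happening while opkg and kmodloader are active.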
Observing the dmesg/logread -f while the opkg_reinstall.sh takes place, I see kmodloader lines loading the new kernel modules (e.g. BATMAN-ADV).
Could it be that the kernel doesn't allow writing to the overlay (for the log) for just a few microseconds? It's really reproducible: adding the "tee -a" log lines and running from "/etc/rc.local" kills my partition, even if I wait for minutes (sleep command) before the second reboot. When I remove the tee log to /root/... or run the script manually from SSH, no corruption occurs.
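To narrow this down, one thing that could be recorded is which filesystem backs /overlay at the moment the script starts from /etc/rc.local, compared with a manual run from SSH. The helper below is only a diagnostic sketch: overlay_fstype is a hypothetical name, and it simply reads the third field of the matching line in /proc/mounts. That the overlay may not yet be on jffs2 this early after a sysupgrade is an assumption on my side, not something I have confirmed.

```shell
#!/bin/sh
# Diagnostic sketch (hypothetical helper): print the filesystem type mounted
# at a given mount point, or nothing if that mount point is not mounted.
# $1 = mounts table to read (default /proc/mounts)
# $2 = mount point to look for (default /overlay)
overlay_fstype() {
    awk -v mp="${2:-/overlay}" '
        $2 == mp { t = $3 }            # remember the last matching entry
        END { if (t != "") print t }
    ' "${1:-/proc/mounts}"
}

# Example use at the top of post-upgrade.sh (as a comment only), writing to
# RAM so that the check itself does not touch the overlay:
#   echo "$(date) overlay=$(overlay_fstype)" >> /tmp/overlay_check.log
```

An empty or non-jffs2 result in the rc.local case but "jffs2" in the SSH case would suggest the script is writing before the overlay is fully set up.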
Thanks for your support on this.
Here is a list of files I'm using for reference, viewed on the management server: