[SOLVED] Meraki_mx60 Installing current SNAPSHOT router booted into recovery

lleachii · May 26, 2018, 5:42am

meraki_mx60-squashfs-sysupgrade.tar ba4dcaa2a013646da309c08e57026cbc3bd5851f715c7dd3807c66a2b3a179d0 4970.2 KB Sat May 26 02:04:51 2018

From console:

kernel volume not found
Volume recovery found at volume id 1

lleachii · June 5, 2018, 3:37pm

Still the same issue (no kernel volume found) when flashing the master snapshot:

meraki_mx60-squashfs-sysupgrade.tar cccd72dd83e8da7724830442def0826bd3ad6324d0660dda861f9ee9b4c0ed0a 4970.2 KB Tue Jun 5 08:04:41 2018

I installed 18.06 Snapshot (branch) instead. After installation I experienced the same issue I noted here: [SOLVED] Username/Password doesn't work on 6984-fa0275b

Flashing a second time and doing a:

firstboot
reboot

solved the issue this time.

chunkeey · June 6, 2018, 10:51pm

I don't have a MX60(W) so I can't really say if it's a bug or not. But if I had to guess, this message is generated by u-boot, right?

And since you say that this issue only happens when flashing the sysupgrade.tar image (via sysupgrade), I guess something bad must have happened right at this point:

github.com

openwrt/openwrt/blob/master/package/base-files/files/lib/upgrade/nand.sh#L166

    nand_detach_ubi() {
    	local ubipart="$1"

    	# Resolve the MTD index of the partition that carries UBI.
    	local mtdnum="$( find_mtd_index "$ubipart" )"
    	if [ ! "$mtdnum" ]; then
    		echo "cannot find ubi mtd partition $ubipart"
    		return 1
    	fi

    	# If a UBI device is attached, unmount its volumes, then detach it.
    	local ubidev="$( nand_find_ubi "$ubipart" )"
    	if [ "$ubidev" ]; then
    		for ubivol in $(find /dev -name "${ubidev}_*" -maxdepth 1 | sort); do
    			ubivol="${ubivol:5}"
    			nand_remove_ubiblock "$ubivol" || :
    			umount "/dev/$ubivol" && echo "unmounted /dev/$ubivol" || :
    		done
    		if ! ubidetach -m "$mtdnum"; then
    			echo "cannot detach ubi mtd partition $ubipart"
    			return 1
    		fi
    	fi
    }

As you can see, the nand.sh sysupgrade script deletes (aka "kill volumes") the kernel, rootfs and rootfs_data volumes before recreating them. So if something breaks here, the device will be bricked in the very way you described. However, without logs from the failing sysupgrade (run it with -v and attach a serial console),
I can't tell you why the kernel volume couldn't be created.
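
One way to see which volumes survived a flash, on any system that is still running Linux with the UBI device attached, is to read the volume names from sysfs. The sketch below is only an illustration: the helper names list_ubi_volumes and has_ubi_volume are made up for this example, and it assumes the usual /sys/class/ubi/ubiX_Y/name layout. The demo runs against a fake directory tree, so it does not touch real hardware.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenWrt): print "<volume node> <name>"
# for every UBI volume found under a sysfs-style directory.
# $1 = directory to scan (assumed default: /sys/class/ubi)
list_ubi_volumes() {
	local root="${1:-/sys/class/ubi}"
	local dir
	for dir in "$root"/ubi*_*; do
		[ -r "$dir/name" ] || continue
		echo "${dir##*/} $(cat "$dir/name")"
	done
}

# Hypothetical helper: succeed only if a volume with the given name exists.
# $1 = volume name, $2 = optional directory to scan
has_ubi_volume() {
	list_ubi_volumes "$2" | grep -q " $1\$"
}

# Demo against a fake tree that mimics the failed state from the console log:
# the recovery volume is present, the kernel volume is missing.
demo="$(mktemp -d)"
mkdir -p "$demo/ubi0_1"
echo recovery > "$demo/ubi0_1/name"
list_ubi_volumes "$demo"
has_ubi_volume kernel "$demo" || echo "kernel volume not found"
rm -rf "$demo"
```

On a real device the same functions would be called without a directory argument, so they fall back to /sys/class/ubi.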

As for the workaround:

Do you have a special "firstboot" script? (I know there are some variations out there). The firstboot that comes with OpenWrt just calls jffs2reset:

To my knowledge, jffs2reset just writes the deadc0de magic value to the start of the rootfs_data volume (if it's not mounted, of course). On the next reboot, the rootfs_data volume gets reformatted during preinit by the fstools code.
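
If the marker works as described, a read-only check for it could look like the sketch below. This is an assumption-laden illustration: has_deadc0de_marker is a made-up name, the byte order de ad c0 de is assumed, and it relies on dd, od and tr being available (a minimal busybox build may lack od). It only reads the first four bytes and never writes anything.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file or device in $1 starts with the
# bytes de ad c0 de (assumed on-flash byte order of the marker).
has_deadc0de_marker() {
	local magic
	magic="$(dd if="$1" bs=4 count=1 2>/dev/null | od -An -tx1 | tr -d ' \n')"
	[ "$magic" = "deadc0de" ]
}

# Demo on a scratch file instead of a real volume.
img="$(mktemp)"
printf '\336\255\300\336' > "$img"   # octal escapes for de ad c0 de
if has_deadc0de_marker "$img"; then
	echo "marker present: volume would be reformatted on next boot"
fi
rm -f "$img"
```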

This all sounds a bit weird though. Maybe just a hunch: Do you have older bootlogs around?
The u-boot prints some information about the state of the nand chip during boot. It looks like this (taken from MX60's wiki):

UBI: number of good PEBs: 8164
UBI: number of bad PEBs: 10
UBI: max. allowed volumes: 128
UBI: wear-leveling threshold: 4096
UBI: number of internal volumes: 1 
UBI: number of user volumes: 5
UBI: available PEBs: 69
UBI: total number of reserved PEBs: 8095
UBI: number of PEBs reserved for bad PEB handling: 81
UBI: max/mean erase counter: 29/26

Can you check if the number of bad PEBs increased?
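
For anyone comparing saved serial logs, a small sketch like the following pulls the bad-PEB count out of a boot log and compares two logs. The function names bad_peb_count and bad_pebs_increased are hypothetical, and the pattern assumes the line format shown in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print the bad-PEB count from a boot log.
# Reads the files given as arguments, or stdin if there are none.
# If the log contains several boots, the last reported value wins.
bad_peb_count() {
	sed -n 's/.*UBI: number of bad PEBs:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$@" | tail -n 1
}

# Hypothetical helper: succeed if the count in the newer log ($2) is higher
# than the count in the older log ($1).
bad_pebs_increased() {
	local old new
	old="$(bad_peb_count "$1")"
	new="$(bad_peb_count "$2")"
	[ -n "$old" ] && [ -n "$new" ] && [ "$new" -gt "$old" ]
}

# Demo with the lines quoted from the wiki.
bad_peb_count <<'EOF'
UBI: number of good PEBs: 8164
UBI: number of bad PEBs: 10
UBI: number of PEBs reserved for bad PEB handling: 81
EOF
```

Usage would be something like bad_pebs_increased old-boot.log new-boot.log, where both file names are placeholders for your own captures.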

lleachii · June 6, 2018, 11:35pm

Correct.

Default OpenWrt.

PEBs have not increased. This is occurring on multiple devices as well. It has to be some change (perhaps the one you describe), because 17.01 and earlier snapshots worked.

chunkeey · June 7, 2018, 10:24am

k, I don't have a MX60 and I haven't seen any logs. Maybe you could bisect the issue?
I know that Chris Blake's MX60 worked back when we were fixing this:

That was 425 commits ago (as of "kernel: bump 4.14 to 4.14.48") so that would be about 8-9 bisection steps.
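
The step estimate follows from bisection halving the candidate range each round, so the worst case is roughly the base-2 logarithm of the commit count, rounded up. A minimal sketch of that arithmetic (bisect_steps is a made-up name, and the real number of steps git reports can differ slightly depending on history shape):

```shell
#!/bin/sh
# Hypothetical helper: worst-case number of halving steps needed to narrow
# $1 candidate commits down to a single one.
bisect_steps() {
	local n steps
	n="$1"
	steps=0
	while [ "$n" -gt 1 ]; do
		n=$(( (n + 1) / 2 ))   # keep the larger half, rounding up
		steps=$(( steps + 1 ))
	done
	echo "$steps"
}

bisect_steps 425   # prints 9
```

The bisection itself would be driven by git bisect start, git bisect bad and git bisect good on a build tree, with the known-good commit taken from the fix referenced above.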

lleachii · June 7, 2018, 5:24pm

This issue has not been occurring for very long. It has only failed within the past few versions.

I distinctly recall making a thread regarding Wireguard Preshared Keys in 17.01. (May 15, 2018)

A few days later, I made a thread saying that I had upgraded to Snapshot in order to use this feature. This is when I observed the firmware issues (no kernel partition after flashing master, config and/or username/password issues after flashing 18.06-Snapshot, etc.). The working 18.06-Snapshot was version fa0275b; the corresponding master failed. (May 25, 2018)

Somewhere around this time, the master snapshot began to fail. I can only use 18.06-Snapshot at this time.

EDIT: I'm currently using OpenWrt 18.06-SNAPSHOT r6996-b295e3a (18.06-Snapshot), and just upgraded to OpenWrt 18.06-SNAPSHOT r7004-1199a91 (18.06-Snapshot)...but no kmod-wireguard for 4.14.48-1 is available at this time!!! (This issue was solved by using opkg with the --force-depends argument; see: Kernel version mismatch when install kmod.)

chunkeey · June 7, 2018, 7:11pm

k, I think it's the mtd-utils update to 2.0.2 that broke it.

https://github.com/openwrt/openwrt/commit/f37f63f38ccb706b196fe4934d0d9d92537eb832#diff-535725098f2bb8a78a2a196c9b6ad108

more specifically:

http://git.infradead.org/mtd-utils.git/commit/dede98ffb706676309488d7cc660f569548d5930

This changed the errno code from ENOENT to ENODEV. That would be innocent enough; however, it breaks the UBI volume enumeration code, since that code still checks for ENOENT.
http://git.infradead.org/mtd-utils.git/blob/bc63d36e39f389c8c17f6a8e9db47f2acc884659:/lib/libubi.c#l1331

TL;DR: the kernel volume gets removed all right, but this completely kills the volume name lookup, so ubimkvol fails and the sysupgrade is aborted. The device reboots but it can't find any kernel.
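
To make the failure mode concrete, here is a toy model in shell. It is not libubi code: the volume layout, the function names lookup_volume_name and find_volume_id, and the MISSING_ERRNO variable are all invented for illustration. The only point is the control flow: the search skips a missing volume ID when the error is ENOENT and gives up on any other error.

```shell
#!/bin/sh
# Toy model, not libubi: IDs 0..2, where ID 0 (kernel) was just removed.
# MISSING_ERRNO is what the pretend lookup reports for a missing ID.
MISSING_ERRNO="ENOENT"

# Prints the volume name for ID $1, or prints an errno name and fails.
lookup_volume_name() {
	case "$1" in
		1) echo "recovery" ;;
		2) echo "rootfs" ;;
		*) echo "$MISSING_ERRNO"; return 1 ;;
	esac
}

# Searches all IDs for the name $1 and prints the matching ID.
find_volume_id() {
	local want id out
	want="$1"
	for id in 0 1 2; do
		if out="$(lookup_volume_name "$id")"; then
			if [ "$out" = "$want" ]; then
				echo "$id"
				return 0
			fi
		elif [ "$out" = "ENOENT" ]; then
			continue   # a gap in the ID range is expected: skip it
		else
			return 1   # any other error aborts the whole search
		fi
	done
	return 1
}

MISSING_ERRNO="ENOENT"; find_volume_id recovery   # prints 1
MISSING_ERRNO="ENODEV"; find_volume_id recovery || echo "lookup aborted"
```

With the old error code the gap left by the deleted kernel volume is skipped and the search carries on; with the new one the first gap ends the search, which matches the behaviour described above.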

Yeah, this is probably going to need a fix. That said, other devices are likely affected by this as well.

lleachii · June 7, 2018, 7:13pm

Yes, I also noticed something very odd.

Basically, I have to flash a device twice to get a good flash.

chunkeey · June 7, 2018, 8:37pm

well, I reported it upstream:

http://lists.infradead.org/pipermail/linux-mtd/2018-June/081562.html

if you want to be mentioned I need your email.

EDIT: I also made a patch. If it compiles and works, I'll send it to the openwrt-ml.
https://github.com/chunkeey/apm82181-lede/commit/885fb68bdb6f448f560dcfa3cc750cdf896dd975
(And for the future, please just open a bug report on the bug tracker at bugs.openwrt.org, since then I can simply reference the bug# from there. Thanks.)

lleachii · June 7, 2018, 9:07pm

I'll private message you.

chunkeey · June 7, 2018, 9:31pm

Done.

http://lists.infradead.org/pipermail/openwrt-devel/2018-June/012774.html

lleachii · June 8, 2018, 2:13pm

OK, on both MX60W's:

I just upgraded to OpenWrt 18.06-SNAPSHOT, r7009-48c5d6a using the LuCI web GUI
After flashing the first time, I noticed that user-installed packages were still present, so I flashed the devices again
It appeared to flash normally
The real test will be to flash master once the buildbot updates it, and/or the next 18.06-SNAPSHOT to verify the bug fix

Thanks again @chunkeey!

lleachii · June 11, 2018, 2:42pm

Installing OpenWrt 18.06-SNAPSHOT r7013-f18f08d over OpenWrt 18.06-SNAPSHOT r7009-48c5d6a:

It appears to have flashed normally
Upon reboot, the configs were present and installed packages were gone (as normal)

It appears to be fixed!
