Multiple sysupgrade failures on TP-Link TL-WR710N

I think there may be a bug causing intermittent sysupgrade flash failures on the TP-Link TL-WR710N and similar models. I recently lost four devices to random sysupgrade failures and I'm reading a number of similar reports from others in recent months. I don't know if there is a real issue here or not, but I'm at least going to report it.

I have a small fleet of TP-Link TL-WR710N hardware version 1.0 devices. I build my own custom images with the packages I need and then send out the image to my fleet with some scripts that automatically manage the process.

Several weeks ago I upgraded six devices and two of them failed for unknown reasons. They simply didn't recover after the sysupgrade reboot. All six devices are configured almost identically and use the same image. Timestamp date of the build was 20171008020100 based on v17.01.3 / head df54a8f583a9afad356fb99a575d75b69c8c0dd4. Also note that this build was tested on a test unit where I flashed it at least twice without any problem.

I sent out two spares to the affected locations and the failed units were sent back to me. The spares were upgraded to my current build at this time without any trouble.

I thought that maybe the image didn't have an included hash in the file format and that maybe the image had been corrupted in transit, so I added a step in my script that does an sha256sum of the file before the sysupgrade happens. I thoroughly tested this verification system it worked perfectly. My test device was sysupgrade-flashed at least three times during this testing.

Two days ago I decided to push out another upgrade. This one had a build timestamp of 20171103195944, v17.01.4 / head 444add156f2a6d92fc15005c5ade2208a978966c. Another two out of six systems failed their sysupgrade. I don't have these units in my possession yet, but I assume they failed in the same way. One of the failed units was a replacement which I had sent out, and the other was one of the four which had successfully sysupgraded the first time.

So that's a total of eight devices, flashed in two events, and four failures total.

Just to be clear, my build doesn't have any patches or modifications to the system which I think could be causing this. I've added a few files and I customize the packages on the system, but that's about it -- nothing crazy going on there. The same image is being used on all of the devices.

Also, all of these devices have been upgraded many many times in previous years; at least 20 times each. No issues before this recent trouble.

I have attached a boot log from one of the failed devices below.

Similar reports I found:
https://bugs.lede-project.org/index.php?do=details&task_id=478

This one is a different model but very similar behavior (intermittent failure) and it has a very similar/same SoC, but he has a hardware modified flash chip:
https://bugs.lede-project.org/index.php?do=details&task_id=48

This looks similar. Has an ar9xxx series SoC too. Mentions gargoyle so I'm not sure this is actually related:
https://eko.one.pl/forum/viewtopic.php?id=15628

In this thread someone mentions an unexpected softbrick on a WR1043ND and his log shows similar errors:

Finally, WTF? No text file or log uploads? "Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif)."

Log below since I can't attach it.

U-Boot 1.1.4 (Apr 19 2013 - 15:56:20)

AP121 (ar9330) U-boot

DRAM:  32 MB
led turning on for 1s...
id read 0x100000ff
flash size 8388608, sector count = 128
Flash:  8 MB
Using default environment

In:    serial
Out:   serial
Err:   serial
Net:   ag7240_enet_initialize...
No valid address in Flash. Using fixed address
No valid address in Flash. Using fixed address
: cfg1 0x5 cfg2 0x7114
eth0: 00:03:7f:02:f4:bb
ag7240_phy_setup 
eth0 up
: cfg1 0xf cfg2 0x7214
eth1: 00:03:7f:02:f4:bb
athrs26_reg_init_lan
ATHRS26: resetting s26
ATHRS26: s26 reset done
ag7240_phy_setup 
eth1 up
eth0, eth1
Autobooting in 1 seconds
## Booting image at 9f020000 ...
   Uncompressing Kernel Image ... OK

Starting kernel ...

[    0.000000] Linux version 4.4.89 (shara@debstable) (gcc version 5.4.0 (LEDE GCC 5.4.0 r3533-d0bf257c46) ) #0 Tue Oct 3 11:03:27 2017
[    0.000000] bootconsole [early0] enabled
[    0.000000] CPU0 revision is: 00019374 (MIPS 24Kc)
[    0.000000] SoC: Atheros AR9330 rev 1
[    0.000000] Determined physical RAM map:
[    0.000000]  memory: 02000000 @ 00000000 (usable)
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] No valid device tree found, continuing without
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x0000000001ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x0000000001ffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000001ffffff]
[    0.000000] Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
[    0.000000] Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 8128
[    0.000000] Kernel command line:  board=TL-WR710N  console=ttyATH0,115200 rootfstype=squashfs,jffs2 noinitrd
[    0.000000] PID hash table entries: 128 (order: -3, 512 bytes)
[    0.000000] Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
[    0.000000] Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)
[    0.000000] Writing ErrCtl register=00000000
[    0.000000] Readback ErrCtl register=00000000
[    0.000000] Memory: 27548K/32768K available (3057K kernel code, 159K rwdata, 764K rodata, 300K init, 200K bss, 5220K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS:51
[    0.000000] Clocks: CPU:400.000MHz, DDR:400.000MHz, AHB:200.000MHz, Ref:25.000MHz
[    0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 9556302233 ns
[    0.000013] sched_clock: 32 bits at 200MHz, resolution 5ns, wraps every 10737418237ns
[    0.007537] Calibrating delay loop... 265.42 BogoMIPS (lpj=1327104)
[    0.088738] pid_max: default: 32768 minimum: 301
[    0.093335] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.099491] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
[    0.109942] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.117999] futex hash table entries: 256 (order: -1, 3072 bytes)
[    0.124954] NET: Registered protocol family 16
[    0.130076] MIPS: machine is TP-LINK TL-WR710N v1
[    0.619928] clocksource: Switched to clocksource MIPS
[    0.625279] NET: Registered protocol family 2
[    0.629366] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
[    0.634810] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
[    0.640856] TCP: Hash tables configured (established 1024 bind 1024)
[    0.647023] UDP hash table entries: 256 (order: 0, 4096 bytes)
[    0.652509] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
[    0.658844] NET: Registered protocol family 1
[    0.668697] Crashlog allocated RAM at address 0x1f00000
[    0.694022] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.698239] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[    0.711451] io scheduler noop registered
[    0.713834] io scheduler deadline registered (default)
[    0.719044] Serial: 8250/16550 driver, 1 ports, IRQ sharing disabled
[    0.725582] ar933x-uart: ttyATH0 at MMIO 0x18020000 (irq = 11, base_baud = 1562500) is a AR933X UART
[    0.734038] console [ttyATH0] enabled
[    0.734038] console [ttyATH0] enabled
[    0.740826] bootconsole [early0] disabled
[    0.740826] bootconsole [early0] disabled
[    0.752782] m25p80 spi0.0: found s25fl064k, expected m25p80
[    0.756921] m25p80 spi0.0: s25fl064k (8192 Kbytes)
[    0.763263] 5 tp-link partitions found on MTD device spi0.0
[    0.767386] Creating 5 MTD partitions on "spi0.0":
[    0.772227] 0x000000000000-0x000000020000 : "u-boot"
[    0.779157] 0x000000020000-0x0000001684e4 : "kernel"
[    0.785042] 0x0000001684e4-0x0000007f0000 : "rootfs"
[    0.790823] mtd: device 2 (rootfs) set to be root filesystem
[    0.795070] 1 squashfs-split partitions found on MTD device rootfs
[    0.801275] 0x0000006a0000-0x0000007f0000 : "rootfs_data"
[    0.808825] 0x0000007f0000-0x000000800000 : "art"
[    0.814429] 0x000000020000-0x0000007f0000 : "firmware"
[    0.838358] libphy: ag71xx_mdio: probed
[    1.431638] ag71xx ag71xx.0: connected to PHY at ag71xx-mdio.1:04 [uid=004dd041, driver=Generic PHY]
[    1.440402] eth0: Atheros AG71xx at 0xb9000000, irq 4, mode:MII
[    2.031406] ag71xx-mdio.1: Found an AR7240/AR9330 built-in switch
[    2.063516] eth1: Atheros AG71xx at 0xba000000, irq 5, mode:GMII
[    2.070875] NET: Registered protocol family 10
[    2.078697] NET: Registered protocol family 17
[    2.081888] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
[    2.094514] 8021q: 802.1Q VLAN Support v1.8
[    2.104328] squashfs: SQUASHFS error: unable to read id index table
[    2.109600] jffs2: Flash size not aligned to erasesize, reducing to 6656KiB
[    2.116418] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000000: 0x6873 instead
[    2.125582] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000004: 0x0505 instead
[    2.135036] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000008: 0x7f6e instead
[    2.144498] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000010: 0x1f00 instead
[    2.153960] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000014: 0x0400 instead
[    2.163422] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000018: 0xc006 instead
[    2.172883] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0000001c: 0x0400 instead
[    2.182344] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000020: 0x9407 instead
[    2.191806] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000028: 0xae8a instead
[    2.201268] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00000030: 0xa68a instead
[    2.210722] jffs2: Further such events for this erase block will not be printed
[    2.219616] jffs2: Empty flash at 0x00000038 ends at 0x00000040
[    2.256243] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00010000: 0xeba4 instead
[    2.264275] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00010004: 0xce01 instead
[    2.273734] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00010008: 0xea2c instead
[    2.283196] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0001000c: 0x15b2 instead
[    2.292659] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00010010: 0x1ce0 instead
[    2.302119] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00010014: 0x6948 instead
[    2.311582] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00010018: 0x7a8d instead
[    2.321044] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0001001c: 0x3dc4 instead
[    2.330505] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00010020: 0xc8a2 instead
[    2.339968] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00010024: 0xfef5 instead
[    2.349409] jffs2: Further such events for this erase block will not be printed
[    2.390520] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00020000: 0x9b1c instead
[    2.398541] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00020004: 0xe245 instead
[    2.408016] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00020008: 0xf031 instead
[    2.417475] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0002000c: 0x464a instead
[    2.426937] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00020010: 0x5b4e instead
[    2.436488] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00020014: 0xddd8 instead
[    2.445868] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00020018: 0x01f8 instead
[    2.455323] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0002001c: 0x9305 instead
[    2.464785] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00020020: 0x3420 instead
[    2.474246] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00020024: 0xf66f instead
[    2.483702] jffs2: Further such events for this erase block will not be printed
[    2.524794] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00030000: 0xaf29 instead
[    2.532827] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00030004: 0x5a4e instead
[    2.542294] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00030008: 0x8148 instead
[    2.551748] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0003000c: 0xba37 instead
[    2.561209] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00030010: 0xea2a instead
[    2.570671] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00030014: 0x4641 instead
[    2.580133] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00030018: 0x2932 instead
[    2.589584] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0003001c: 0x35ac instead
[    2.599057] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00030020: 0x1a23 instead
[    2.608518] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00030024: 0x3880 instead
[    2.617972] jffs2: Further such events for this erase block will not be printed
[    2.659060] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00040000: 0x0ac3 instead
[    2.667088] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00040004: 0x2d75 instead
[    2.676548] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00040008: 0x8e34 instead
[    2.686010] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0004000c: 0xb5a8 instead
[    2.695471] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00040010: 0xb752 instead
[    2.704932] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00040014: 0xf548 instead
[    2.714398] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00040018: 0x1ec7 instead
[    2.723859] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0004001c: 0x3626 instead
[    2.733320] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00040020: 0x6870 instead
[    2.742782] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00040024: 0x46dd instead
[    2.752236] jffs2: Further such events for this erase block will not be printed
[    2.793325] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00050000: 0xbe96 instead
[    2.801360] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00050004: 0xba41 instead
[    2.810819] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00050008: 0xadd4 instead
[    2.820282] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0005000c: 0xb9b4 instead
[    2.829732] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00050010: 0x9830 instead
[    2.839206] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00050014: 0x331d instead
[    2.848666] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00050018: 0x1a77 instead
[    2.858128] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0005001c: 0x8e2d instead
[    2.867590] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00050020: 0x6b51 instead
[    2.877052] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00050024: 0xf8eb instead
[    2.886506] jffs2: Further such events for this erase block will not be printed
[    2.927593] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00060000: 0xb4e7 instead
[    2.935621] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00060004: 0xdfa2 instead
[    2.945083] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00060008: 0xd569 instead
[    2.954545] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0006000c: 0xcb4a instead
[    2.964006] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00060010: 0x7174 instead
[    2.973468] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00060014: 0x76c3 instead
[    2.982930] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00060018: 0x2100 instead
[    2.992392] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0006001c: 0xa194 instead
[    3.001854] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00060020: 0x5d45 instead
[    3.011319] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00060024: 0xbab5 instead
[    3.020769] jffs2: Further such events for this erase block will not be printed
[    3.061858] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00070000: 0x793b instead
[    3.069871] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00070004: 0xc4c5 instead
[    3.079347] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00070008: 0x8619 instead
[    3.088808] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0007000c: 0xc211 instead
[    3.098269] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00070010: 0x7347 instead
[    3.107731] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00070014: 0x34d9 instead
[    3.117192] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00070018: 0xd183 instead
[    3.126654] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0007001c: 0x0bfd instead
[    3.136117] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00070020: 0x0599 instead
[    3.145577] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00070024: 0x992d instead
[    3.155032] jffs2: Further such events for this erase block will not be printed
[    3.196128] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00080000: 0x14b2 instead
[    3.204158] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00080004: 0x6271 instead
[    3.213617] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00080008: 0x0d18 instead
[    3.223078] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0008000c: 0xc48c instead
[    3.232541] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00080010: 0x6f6f instead
[    3.242003] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00080014: 0x2605 instead
[    3.251464] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00080018: 0xbbb0 instead
[    3.260926] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0008001c: 0xafb7 instead
[    3.270387] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00080020: 0x46ee instead
[    3.279837] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00080024: 0xdac4 instead
[    3.289303] jffs2: Further such events for this erase block will not be printed
[    3.300395] jffs2: Old JFFS2 bitmask found at 0x000812d0
[    3.304240] jffs2: You cannot use older JFFS2 filesystems with newer kernels
...
[   13.419111] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00560024: 0xbd98 instead
[   13.428569] jffs2: Further such events for this erase block will not be printed
[   13.448947] jffs2: Empty flash at 0x00560b70 ends at 0x00567b1c
[   13.453441] jffs2: CLEANMARKER node found at 0x00567b1c, not first node in block (0x00560000)
[   13.477703] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00670000: 0xfbdb instead
[   13.485736] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00670004: 0xc699 instead
[   13.495196] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00670008: 0x2b6e instead
[   13.504657] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0067000c: 0xc784 instead
[   13.514119] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00670010: 0xcf54 instead
[   13.523581] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00670014: 0xfbe3 instead
[   13.533043] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00670018: 0xbd1f instead
[   13.542505] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x0067001c: 0x71f1 instead
[   13.551967] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00670020: 0x07f6 instead
[   13.561428] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x00670024: 0xfd8d instead
[   13.570881] jffs2: Further such events for this erase block will not be printed
[   13.588589] jffs2: CLEANMARKER node found at 0x00677b1c, not first node in block (0x00670000)
[   13.609213] jffs2: notice: (1) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
[   13.623127] VFS: Mounted root (jffs2 filesystem) readonly on device 31:2.
[   13.632046] Freeing unused kernel memory: 300K
[   13.635131] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/init.txt for guidance.
[   13.659442] Rebooting in 1 seconds..

I am up to three failed upgraded events now, where each time two devices have failed the upgrade, for a total of six failed devices.

All failed devices have since been recovered by soldering in a console cable.

There are anywhere from 6-9 units which get upgraded each time. There doesn't seem to be any pattern as to which devices fail.

All devices are at different locations with four different ISPs and the system does a sha256sum on the sysupgrade file before flashing.

The only pattern I have noticed is that it's exactly two devices which fail each time, but that's just telling me that there seems to be a 25-35% chance of devices failing.

The key error message to look for when these sysupgrade failures happens seems to be this:

"squashfs: SQUASHFS error: unable to read id index table"

Whereas normally we would see this:

"VFS: Mounted root (squashfs filesystem) readonly on device 31:2."

All of the jffs2 errors are a loud distraction from the main issue, which is that the squashfs rootfs filesystem is corrupt.

Because the squashfs rootfs is corrupt, the system tries to mount the rootfs as a jffs2 filesystem, which it isn't and never was. This obviously fails. So, ignore all of the jff2s filessytem errors, as they are just noise.

Why the squashfs is corrupt, I don't know yet. I just went through a hundred or so upgrade attempts on three of my devices here at home and I've been unable to reproduce the issue.

I wonder if there isn't something environmental that goes on to be causing this.

The chance for RAM errors or flash failure rises under thermal stress.

This week I had some spare time to mess around with my device fleet while passing out some new configurations.

I ran a couple of hundred sysupgrade cycles on three devices here at home and had no problems at all. I tried everything I could think of to reproduce the environmental conditions of the devices out in the field in regards to USB devices and network configurations, but still nothing.

So, I decided to push an upgrade to five devices yesterday. Three ended up not coming back online and I assume their squashfs got corrupted just like the rest. These were being upgraded from LEDE 17.01.05 to OpenWRT 18.06.01.

So my total for failures is now 9 out of 26.

I have no idea what might be causing these problems.

The devices are spread across the southwest USA in mostly warm dry-air environments, inside medical office buildings.

It's notable that all of these locations have PET chemotherapy machines and some other radiological devices or medicines, though I do not have any reason to suspect that's related to my problems. Still I couldn't help read the recent story about a helium leak killing Apple devices and think about the possibilities:

When these three new failed devices come in, I'll see if I can dump out the partitions and examine exactly what is wrong with the squashfs filesystems. I have not been able to do that in the past because the default uboot which comes on these devices doesn't support any method of uploading the partition images.

Did you reboot the devices before the sysupgrade?
The only other thing I can think of is the device 8M/32M.
It might have OOM

Yes, on the third batch of failures I rebooted all devices about five minutes before the sysupgrades. System uptime doesn't seem to have anything to do with it.

Sysupgrade kills almost all processes to clear up memory anyway.

Finally, I am tracking memory usage via SNMP graphs. There's no evidence of any leaks or anything.

Any chance of sharing the full bootlog?

Thanks

A bootlog is the second post.

Everything goes bad at this line:
[ 2.104328] squashfs: SQUASHFS error: unable to read id index table

That's really the only clue that the squashfs is corrupt. Everything after that is just a symptom.

Since I can't reproduce the problem here, it's hard to give more info.

I have three recently failed units being mailed to me in the next week. When they get here I'll be able to dump out the mtd partitions after I install pepe2k's uboot.

I just found this old bug which sounds similar:

https://dev.archive.openwrt.org/ticket/13914

The author blames kmod-usb-storage, but I'm not sure that has anything to do with it. I do include kmod-usb-storage built into my images though, so that's similar.

One of the commentators mentions "i have 1000+ devices ,only have 6 devices find this err", which sounds familiar, though my failure rate is more like 35% rather than 0.6%.

I got one of my failed devices back in the mail today. I hooked up the serial console and verified that it's stuck with a failed squashfs, just like all the previous devices.

The bootlog on this one looks slightly different, but I think that's because of the new 4.9 kernel -- all the previous failures were on the old 4.4 kernel.

The kernel itself looks fine and I'm seeing the classic "squashfs: SQUASHFS error: unable to read id index table" error.

I successfully installed the pepe2k uboot and TFTP dumped the NOR flash image to my local system so that I can look at it.

A casual hexdump review of the rootfs squash+overlay looks valid. I see the header and the file tool correctly IDs this as a squashfs v4.0 filesystem. The end of the file also looks reasonable based on a casual guess of what I should see.

I need to figure out what tools I can use to look at this. I've got no idea what I'm doing or how to debug a filesystem like this. I'll diff it from what I think it should be and use some tools to try and extract them image or otherwise figure out what is wrong with it.

The squashfs on this failed device is definitely incomplete, and it's quite strange.

My squashfs image is about 4.2MB in size. On the corrupt image, the data stream of the squashfs is interrupted about 800KB short of where it should have ended. Everything before that is identical.

After a break on the eraseblock, there is a bunch of unfamiliar data which looks like squashfs. Not sure yet what it belongs to, but given that the NOR is erased prior to the new image being installed, I can assume that it was written there as a result of the sysupgrade process or later on an attempted restore.

Later on I see what looks like the sysupgrade.tgz file and eventually some jffs2 stuff.

Once I can isolate the overlay data and prune it off, I'll see about making the image files available if anyone wants to look at it. I have private keys and other data on this thing I don't want to disclose accidentally.

I've been kind of busy but every once in awhile I pick these broken devices up and play around with them.

I now know much more about how jffs2 and eraseblocks works on-flash than I ever wanted to know. So at least I learned something I guess.

I have two failed units here that I've been looking at.

One of them looks like it got 80% of the way through writing it's new squashfs rootfs image. The second one only got 18% of the way in before it died. I am certain that nothing else after the squashfs was touched.

So, the image writing process is being interrupted for some reason.

I still have no idea why that is happening, and I can't reproduce it here at home at all! I've used the same sysupgrade process on five devices here hundreds of times each.

I've also rebooted devices immediately before failed sysupgrades, so I don't think this is an out-of-memory issue.

The one major environmental difference is that I live in a high-humidity location while these are out in a low-humidity area.

Next up I'm going to seal these things in a box full of dry rice or those little desiccant packets or otherwise see if I can lower the humidity for them.

And if I still can't figure it out all of them are going into the trashcan.

If you have a location that is particular prone to failures, I'd consider trying to shield a device in metal (as you mentioned radiology, RAM in these low-end devices often isn't top rated) - as long as that won't induce overheating….

you did come up with possible issue cause, yet you wish to trash the devices. haven't considered putting new devices there and trying failing ones in clean environment?

Seems like the router is rebooted before the upgrade process is completed.

Sounds like a reasonable explanation. Which would mean, a timing/buffering bug in sysupgrade.

I'm on the same situation @jmomo reported. I'm managing a fleet of 15K TP-Link routers on a small brazillian ISP and using some basic bash scripts to roll out remote firmware upgrades using sysupgrade. I'm always rebooting the router and shutdown network services before issuing an upgrade. I'm also doing image checksum to check image integrity. The routers are fairly new (around 1 year of usage) and i'm always seeing about 2% of bricked routers after a sysupgrade... When evaluating the flash of a bricked router using a EEPROM reader it's very similar to what @jmomo reported... The flash seems consistent until a point where I can see an erased block and from that point further I can verify some contents of the old image. It seems that particularly after mtd erases a block it reboots after trying to write a block of the new image.

I've done similar procedures as @jmomo reported but still no luck. On the beginning I thought users might be turning off devices during a sysupgrade but I have already reproduced the same situation on my lab (sadly I can't reproduce it consistently to debug this case).

@jmomo have you reached to a conclusion?

No idea. I was never able to reproduce the issue locally. For this particular fleet we plan to retire them all here in the next few months and replace with RPis.