Sysupgrade bricking devices

My team sells two Qualcomm SoC-based OpenWrt hardware platforms, and we compile our own images for them.

(e.g. DSI0177 is one of the white-label hardware products)

The issue we're experiencing is that sysupgrade sometimes fails to write either the kernel or the rootfs, and some of our devices get bricked while updating as a result. The failure rate is suspiciously high, up to a few percent of devices, but it doesn't seem to affect all devices. For example, a test device in house can run sysupgrade in a loop hundreds of times without a failure, so it's hard for us to reproduce the problem as it happens; we can only get the hardware back after the fact.

Here is an example of the U-Boot output from one of these bricked devices, where it looks like the kernel write was corrupted during the sysupgrade. We also get some failures where the kernel was written properly and boots, but the rootfs appears to have been corrupted by the sysupgrade.

[04020D09][04020D08]
DDR Calibration DQS reg = 00008989

U-Boot 1.1.3 (Nov 22 2022 - 14:10:02)

Board: Ralink APSoC DRAM:  64 MB
relocate_code Pointer at: 83f9c000
flash manufacture id: ef, device id 40 20
find flash: W25Q512JV
*** Warning - bad CRC, using default environment

============================================
Ralink UBoot Version: 5.0.0.0
--------------------------------------------
ASIC 7628_MP (Port5<->None)
DRAM component: 512 Mbits DDR, width 16
DRAM bus: 16 bit
Total memory: 64 MBytes
Flash component: SPI Flash
Date:Nov 22 2022  Time:14:10:02
============================================
icache: sets:512, ways:4, linesz:32 ,total:65536
dcache: sets:256, ways:4, linesz:32 ,total:32768

 ##### The CPU freq = 580 MHZ ####
 estimate memory size =64 Mbytes
RESET MT7628 PHY!!!!!!wps val:4
Detect WPS Button!
wps val:4
default: 3

Please choose the operation:
   1: Load system code to SDRAM via TFTP.
   2: Load system code then write to Flash via TFTP.
   3: Boot system code via Flash (default).
   4: Entr boot command line interface.
   7: Load Boot Loader code then write to Flash via Serial.
   9: Load Boot Loader code then write to Flash via TFTP.                     0

3: System Boot system code via Flash.
## Booting image at bc050000 ...
   Image Name:   MIPS OpenWrt Linux-4.14.267
   Image Type:   MIPS Linux Kernel Image (lzma compressed)
   Data Size:    1707600 Bytes =  1.6 MB
   Load Address: 80000000
   Entry Point:  80000000
   Verifying Checksum ... Bad Data CRC
bootm kernel failed! start net httpd server


 NetTxPacket = 0x83FE8A00

 KSEG1ADDR(NetTxPacket) = 0xA3FE8A00

 NetLoopHttpd,call eth_halt !
Trying Eth0 (10/100-M)

Waitting for RX_DMA_BUSY status Start... done
ETH_STATE_ACTIVE!!
HTTP server is starting at IP: 192.168.100.250

For additional context, here's what our update process looks like at a high level, along with the checks we do before sysupgrading.

We have update software that checks for an available update (images we compile and version), downloads the sysupgrade.bin file we made together with a corresponding .md5sum file, and reboots the device. We want to sysupgrade after a fresh boot because one of our hardware platforms historically had a much lower failure rate when sysupgrade was performed soon after booting rather than after a long uptime.

At startup, one of the earliest init.d scripts to run tries to apply an update if we've downloaded one. It does the following:

Runs sync
Checks that we have the .bin and the .md5sum
Compares the md5sum of the .bin against the .md5sum file
Stops any of our other processes if they're running
Drops caches, e.g. "echo 3 > /proc/sys/vm/drop_caches"
Verifies the .bin isn't bigger than our rootfs partition
Verifies that free memory is at least 5/4 the .bin size
Moves our .bin to /tmp (RAM)
Verifies some files are correct (passwd / shadow)
Saves a sysupgrade backup file (e.g. sysupgrade -b /data/etc/updates/sysupgrade_backup.tar.gz)
Logs a bunch of state info into a log file, e.g. the output of a few commands like mount, df -h, ubinfo -a, ps -ef, free (so we can see as much as possible about the state just before updating)
Runs the sysupgrade (sysupgrade -v $SYSUPGRADE_BIN >> /data/etc/updates/sysupgrade.log 2>&1)
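
A minimal sketch of how the md5 comparison and the 5/4 free-memory check from the list above could be written for a POSIX shell such as BusyBox ash. The function names (verify_md5, enough_memory, mem_free_kb) and the choice of MemFree from /proc/meminfo as the "free memory" figure are assumptions for illustration, not the script described in this post:

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; names are illustrative only.

# verify_md5 IMAGE SUMFILE
# Succeeds only if both files exist and the first field of SUMFILE
# equals the md5 digest of IMAGE.
verify_md5() {
    image=$1
    sumfile=$2
    [ -f "$image" ] && [ -f "$sumfile" ] || return 1
    expected=$(awk '{print $1; exit}' "$sumfile")
    actual=$(md5sum "$image" | awk '{print $1}')
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# enough_memory IMAGE_BYTES FREE_KB
# Succeeds if FREE_KB is at least 5/4 of the image size (rounded up to KiB).
enough_memory() {
    image_bytes=$1
    free_kb=$2
    needed_kb=$(( (image_bytes * 5 / 4 + 1023) / 1024 ))
    [ "$free_kb" -ge "$needed_kb" ]
}

# mem_free_kb
# Prints MemFree in KiB as reported by the kernel (assumed metric).
mem_free_kb() {
    awk '/^MemFree:/ {print $2}' /proc/meminfo
}
```

Such helpers could gate the final step, for example by running sysupgrade only if verify_md5 "$SYSUPGRADE_BIN" "$SYSUPGRADE_BIN.md5sum" and enough_memory "$(wc -c < "$SYSUPGRADE_BIN")" "$(mem_free_kb)" both succeed; the .md5sum file name used here is hypothetical.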

My question, then: are there any obvious steps or checks we're missing before performing the sysupgrade? Or any suspicions as to why it fails to write properly in some cases?

Hi, what version of OpenWrt are you using?

Also, that's a MediaTek bootlog, not Qualcomm?

That's Linux 4.14.267, used in OpenWrt 19.07.9 and EOL for over a year now.

To clarify, the older platform is a Qualcomm device running a 15.05 "Chaos Calmer" QSDK; that one has a Qualcomm SoC.

This is pretty old hardware that we've discontinued production of, but we would still like to update it.

The newer platform runs 19.07 (OpenWrt 19.07-SNAPSHOT, r11405-2a3558b) and, as you pointed out, I believe it has a MediaTek SoC.

The issue exists on both systems.

Then you should ask Qualcomm about the QSDK specific issues.

Then it's very likely that the issue is caused by some modification that either Qualcomm, MediaTek or your company made.

The sysupgrade flow is pretty generic in general, so I'm looking to see whether anyone else has had similar problems with sysupgrade, and whether our update flow addresses all the common sources of failure or there are some we're missing.

Additionally, if some modification I've made is causing this issue, that's what I'm trying to find out.

Help you with your black-box code? Awesome...

Yes, it consists of running sysupgrade. All the other stuff you mentioned was added by your company. We can't help you with code that we don't know/have/see.
And even then I'm asking myself why I should help a company fix their fork of OpenWrt while they did not contribute back - for free?
