In my setup (using OpenWrt as the OS for a non-router embedded device) I used to call sysupgrade from a preparation shell script.
This script relied on sysupgrade returning a non-zero exit code on error (invalid image file path, etc.), and on success either not returning at all (the normal case) or returning zero (not expected, but the script can handle that as well).
This has worked on hundreds of sysupgrade runs with 17.01. Now, since the update to 18.06, sysupgrade sometimes seems to return non-zero, yet the upgrade still occurs: rootfs is flashed, the system reboots, and the end result is fully OK.
Now I wonder if my assumptions above are false, or something else is wrong.
I noticed that while sysupgrade in 17.01 ended by calling do_upgrade, it now passes the task on via ubus call system upgrade. The docs for that call are currently in *TODO* state, so it's not clear what the possible return values are, or when it returns (does it just trigger the update, or wait for it?). Still, even if it returns, I would expect the exit code to be zero. Wrong?
As this happens on devices in the field out of my direct control, I could not yet observe it live. But my updater script is reporting non-zero return from sysupgrade back to my upgrade server, so I see this is happening. Here's the relevant part from the script (maintutil is a tool which can report problems via https to the upgrade server):
# ... at this point, /tmp/fwimg.bin is a valid openwrt image
cd /tmp
sysupgrade fwimg.bin
if [[ $? != 0 ]]; then
    # failed, report
    /usr/bin/maintutil --reporterror "sysupgrade failed"
    exit 1
fi
# in case we get here (normally not, sysupgrade does not terminate when successful)
reboot -f
exit 0
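Since `$?` is overwritten by every subsequent command, a variant of the snippet above could save the exit status first and include it in the report. This is only a sketch: `run_upgrade`, `format_failure` and `report_error` are hypothetical helper names, and the fallback to stderr when maintutil is missing is an assumption so the snippet can be tried on a development machine.

```shell
#!/bin/sh

# Hypothetical helper: build the report text including the numeric exit code.
format_failure() {
    printf 'sysupgrade failed (exit code %s)' "$1"
}

# Hypothetical helper: report via maintutil if present, otherwise print to stderr.
report_error() {
    if [ -x /usr/bin/maintutil ]; then
        /usr/bin/maintutil --reporterror "$1"
    else
        echo "$1" >&2
    fi
}

# Run the given command, save its exit status right away, and report it if non-zero.
run_upgrade() {
    "$@"
    rc=$?    # save immediately; any later command would overwrite $?
    if [ "$rc" -ne 0 ]; then
        report_error "$(format_failure "$rc")"
    fi
    return "$rc"
}

# Intended use on the device (commented out so the sketch is safe to run elsewhere):
# cd /tmp && run_upgrade sysupgrade fwimg.bin
```

With this, a report reads "sysupgrade failed (exit code 10)" rather than just "sysupgrade failed", which makes field reports easier to compare.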
Yes, that's how I changed my code for the next upgrade, too. I'm curious to see what exit code a successful sysupgrade now returns... Still, I think it should return 0 / SUCCESS unless something really went wrong.
Not really; my code is not exactly the same, and it would have reported sysupgrade exiting with 0.
And while I have not been able to observe sysupgrade returning on any of my test devices so far, I now have one occurrence in the field: sysupgrade returned 10 (but again, the actual update was apparently fine).
When sysupgrade is not running in failsafe mode, the last command run seems to be ubus:
Last lines of package/base-files/files/sbin/sysupgrade:
if [ -n "$FAILSAFE" ]; then
    printf '%s\x00%s\x00%s' "$RAM_ROOT" "$IMAGE" "$COMMAND" >/tmp/sysupgrade
    lock -u /tmp/.failsafe
else
    ubus call system sysupgrade "{
        \"prefix\": $(json_string "$RAM_ROOT"),
        \"path\": $(json_string "$IMAGE"),
        \"command\": $(json_string "$COMMAND")
    }"
fi
Looking a bit into the implementation of the sysupgrade subcommand in ubus, I found that this 10 is probably UBUS_STATUS_CONNECTION_FAILED. There are two places where this can be generated, one of them is in ubus_complete_request() in libubus-req.c.
As the sysupgrade operation as such cannot return, it is plausible that the ubus request cannot be properly completed. But then, processes are still running a while into the actual reboot, and maybe, depending on exact timing, the ubus tool might still have enough time to detect and report a ubus "failure" which in fact is just a side effect of the system being on its way down already.
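For reading such reports, the numeric ubus status can be mapped back to its name. The table below is a sketch transcribed from `enum ubus_msg_status` in ubus's ubusmsg.h as I understand it, covering only codes 0 to 10; newer ubus versions may define more, so treat it as a reading aid rather than an authoritative list. `ubus_status_name` is a hypothetical helper name.

```shell
#!/bin/sh

# Hypothetical helper: translate a ubus exit status into its symbolic name.
ubus_status_name() {
    case "$1" in
        0)  echo UBUS_STATUS_OK ;;
        1)  echo UBUS_STATUS_INVALID_COMMAND ;;
        2)  echo UBUS_STATUS_INVALID_ARGUMENT ;;
        3)  echo UBUS_STATUS_METHOD_NOT_FOUND ;;
        4)  echo UBUS_STATUS_NOT_FOUND ;;
        5)  echo UBUS_STATUS_NO_DATA ;;
        6)  echo UBUS_STATUS_PERMISSION_DENIED ;;
        7)  echo UBUS_STATUS_TIMEOUT ;;
        8)  echo UBUS_STATUS_NOT_SUPPORTED ;;
        9)  echo UBUS_STATUS_UNKNOWN_ERROR ;;
        10) echo UBUS_STATUS_CONNECTION_FAILED ;;
        *)  echo "UNKNOWN($1)" ;;   # codes outside this sketch's range
    esac
}

# Example: ubus_status_name 10
```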
Maybe sysupgrade should just end with a sleep 0, or a sleep 3600; reboot, because calling sysupgrade over ubus is most likely a point of no return anyway. All recoverable errors such as invalid image etc. are checked further up in sysupgrade and properly cause exit 1.
That would convert the return code 10 into 1. But that does not solve the problem - sysupgrade should not return or return 0 when it is successful. But as it is, it sometimes returns 10 (1 with your patch) for a perfectly working sysupgrade.
The delicate problem is how to silence that false error, without suppressing real errors that could happen.
The following code would simulate "not returning" for both 10 and 0 exit codes, so essentially behaving like sysupgrade did in 17.01:
else
    ubus call system sysupgrade "{
        \"prefix\": $(json_string "$RAM_ROOT"),
        \"path\": $(json_string "$IMAGE"),
        \"command\": $(json_string "$COMMAND")
    }"
    rc=$?
    # 0 or 10: the upgrade is under way, so never return to the caller
    if [ "$rc" -eq 0 ] || [ "$rc" -eq 10 ]; then
        sleep 3600
        reboot
    fi
fi
But it is still ugly to suppress an arbitrary error code; I'd rather have the reason this happens fully understood and fixed at the source...
Maybe my following mail to the developers about sysupgrade issues fits into this thread:
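Until the cause is understood, the same tolerance could also live in the calling script, which avoids patching sysupgrade in the firmware. A minimal sketch, assuming that exit codes 0 and 10 both mean the upgrade was handed over and that anything else is a real failure; `upgrade_handed_over` is a hypothetical name, and the 10 is the value observed in the field, not a documented contract.

```shell
#!/bin/sh

# Returns success (0) if the exit code suggests the upgrade is under way.
upgrade_handed_over() {
    case "$1" in
        0|10) return 0 ;;   # 0: returned normally; 10: ubus connection lost (assumed harmless)
        *)    return 1 ;;   # everything else is treated as a real failure
    esac
}

# Intended use in the updater script (commented out so the sketch is safe to run):
# sysupgrade fwimg.bin
# rc=$?
# if upgrade_handed_over "$rc"; then
#     sleep 3600      # give the upgrade time to reboot the device
#     reboot -f
# else
#     /usr/bin/maintutil --reporterror "sysupgrade failed (exit code $rc)"
#     exit 1
# fi
```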
I am an unhappy user of sysupgrade for remote installed devices.
After several unpleasant encounters with sysupgrade, and after more or less successfully implementing workarounds for incomplete sysupgrades that resulted in inconsistent systems, I had a quick glance at the code.
My questions are:
Is it safe simply to kill running processes during sysupgrade? There might be services that are restarted automatically (by procd?).
What about a killed process that simply takes some time to shut down? (Example: squid closing a lot of open files, with an internal shutdown timer of 30s by default.)
What about an open swap file on a block device?
What about a block device mounted for mass storage?
What about a (slow) wwan connection managed by pppd? When it is killed by sysupgrade, will netifd restart pppd?
As a workaround, before sysupgrade I:
explicitly use /etc/init.d/service stop
explicitly kill squid and wait for termination
explicitly disable swap
explicitly unmount the mounted block device
ifdown wwan
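The workaround steps above could be scripted roughly like this. It is a sketch under assumptions, not a tested procedure: the service name `myservice`, the mount point `/mnt/storage`, the 60-second timeout and the helper names are placeholders; `wwan` is the interface name from the list.

```shell
#!/bin/sh

# Wait until the process with the given PID is gone, polling once per second.
# Returns 0 once it has exited, 1 if it is still alive after the timeout (seconds).
wait_pid_gone() {
    pid=$1
    timeout=$2
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical preparation sequence; names and paths are placeholders.
prepare_for_sysupgrade() {
    /etc/init.d/myservice stop          # stop services explicitly
    for p in $(pidof squid); do         # ask squid to exit, then wait for it
        kill "$p"
        wait_pid_gone "$p" 60 || return 1
    done
    swapoff -a                          # disable swap
    umount /mnt/storage                 # unmount mass storage
    ifdown wwan                         # take the wwan interface down
}

# Intended use (commented out so the sketch is safe to run elsewhere):
# prepare_for_sysupgrade && sysupgrade /tmp/new_fw.bin
```

Waiting on the PID rather than sleeping a fixed time is meant to cover the case of squid needing up to its shutdown timer to exit.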
Before that, I had several cases where
sysupgrade -n -v -f /tmp/newfiles.tar.gz /tmp/new_fw.bin
updated all files from /tmp/newfiles.tar.gz but did not flash new_fw.bin,
resulting in an inconsistent system.