Sysupgrade return code in 18.06 vs 17.01?

In my setup (using OpenWrt as the OS for a non-router embedded device), I have been calling sysupgrade from a preparation shell script.

This script relied on sysupgrade returning a non-zero exit code on error (such as an invalid image file path), and otherwise either not returning at all (the normal case) or returning zero (not expected, but the script can handle that as well).

This has worked on hundreds of sysupgrade runs with 17.01. Now, since the update to 18.06, sometimes sysupgrade seems to return non-zero but then the upgrade still occurs, i.e. rootfs is flashed, system reboots, end result is fully ok.

Now I wonder if my assumptions above are false, or something else is wrong.

I noticed that while the 17.01 sysupgrade ended by calling do_upgrade, the 18.06 version passes the task on via ubus call system sysupgrade. The docs for that call are currently in *TODO* state, so it's not clear what it can return, and when (does it just trigger the upgrade or wait for it?). Still, even if it returns, I would expect the exit code to be zero. Am I wrong?

As this happens on devices in the field that are out of my direct control, I have not been able to observe it live yet. But my updater script reports a non-zero return from sysupgrade back to my upgrade server, so I can see it happening. Here's the relevant part of the script (maintutil is a tool that can report problems via HTTPS to the upgrade server):

    # ... at this point, /tmp/fwimg.bin is a valid OpenWrt image
    cd /tmp
    sysupgrade fwimg.bin
    if [ $? -ne 0 ]; then
      # failed, report
      /usr/bin/maintutil --reporterror "sysupgrade failed"
      exit 1
    fi
    # in case we get here (normally not, sysupgrade does not terminate when successful)
    reboot -f
    exit 0

Any hint welcome!
luz

Yeah, I noticed it too: sysupgrade now exits, whereas before it would never reach the end.

This way you at least get the return code from sysupgrade:

    sysupgrade fwimg.bin
    result=$?
    [ $result -eq 0 ] || /usr/bin/maintutil --reporterror "sysupgrade failed with error code: $result"
    exit $result

Yes, that's how I changed my code for the next upgrade, too :wink: I'm curious to see what exit code a successful sysupgrade now returns... Still, I think it should return 0 / SUCCESS unless something really went wrong.

Oooh, you made the same mistake I did :wink: In that case you might want to do this instead (the other one only reports on errors, not on success):

    sysupgrade fwimg.bin
    result=$?
    # report every return, including 0, so a successful return becomes visible too
    /usr/bin/maintutil --reporterror "sysupgrade returned with exit code: $result"
    exit $result

:wink: Not really. My code is not 1:1 the same, and it would have reported back sysupgrade exiting with 0.

And while I have not seen sysupgrade return on any of my test devices so far, I now have one occurrence in the field: sysupgrade returned 10 (but again, the actual update apparently went fine).

Any idea what exit code 10 means in that context?

Sorry mate, it's been a while. It's the return code of the last process that ran, which doesn't help you much. But it is what it is =/

When sysupgrade is not running in failsafe mode, the last command it runs appears to be ubus:

Last lines of package/base-files/files/sbin/sysupgrade:

    if [ -n "$FAILSAFE" ]; then
        printf '%s\x00%s\x00%s' "$RAM_ROOT" "$IMAGE" "$COMMAND" >/tmp/sysupgrade
        lock -u /tmp/.failsafe
    else
        ubus call system sysupgrade "{
            \"prefix\": $(json_string "$RAM_ROOT"),
            \"path\": $(json_string "$IMAGE"),
            \"command\": $(json_string "$COMMAND")
        }"
    fi

Looking a bit into the implementation of the sysupgrade subcommand in ubus, I found that this 10 is probably UBUS_STATUS_CONNECTION_FAILED. There are two places where this can be generated, one of them is in ubus_complete_request() in libubus-req.c.
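
To make such field reports easier to read, the updater could translate the code into the libubus status name. This is a minimal sketch, assuming the ubus CLI hands the status straight through as its exit code (which the 10 above suggests) and using the order of enum ubus_msg_status from ubusmsg.h in the 18.06-era ubus; newer ubus versions append further codes, so it is worth double-checking against the tree in use. The function name ubus_status_name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map a ubus CLI exit code to the libubus status name.
# Order assumed from enum ubus_msg_status (ubusmsg.h, 18.06-era ubus).
ubus_status_name() {
    case "$1" in
        0)  echo UBUS_STATUS_OK ;;
        1)  echo UBUS_STATUS_INVALID_COMMAND ;;
        2)  echo UBUS_STATUS_INVALID_ARGUMENT ;;
        3)  echo UBUS_STATUS_METHOD_NOT_FOUND ;;
        4)  echo UBUS_STATUS_NOT_FOUND ;;
        5)  echo UBUS_STATUS_NO_DATA ;;
        6)  echo UBUS_STATUS_PERMISSION_DENIED ;;
        7)  echo UBUS_STATUS_TIMEOUT ;;
        8)  echo UBUS_STATUS_NOT_SUPPORTED ;;
        9)  echo UBUS_STATUS_UNKNOWN_ERROR ;;
        10) echo UBUS_STATUS_CONNECTION_FAILED ;;
        *)  echo "unknown ($1)" ;;
    esac
}
```

Keep in mind that a 1 can also come from sysupgrade's own image checks (which exit 1), so the name is only meaningful when the code came from the final ubus call.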

As the sysupgrade operation as such cannot return, it is plausible that the ubus request cannot be properly completed. But processes keep running for a while into the actual reboot, so depending on exact timing, the ubus tool may still have enough time to detect and report a ubus "failure" that is in fact just a side effect of the system already being on its way down.

Maybe sysupgrade should just end with a sleep 0, or a sleep 3600; reboot, because calling sysupgrade over ubus is most likely a point of no return anyway. All recoverable errors such as invalid image etc. are checked further up in sysupgrade and properly cause exit 1.
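
Until sysupgrade itself changes, the same idea can be applied on the caller side. This is a minimal sketch for the updater script, assuming that exit codes 0 and 10 mean "the reboot is probably already under way": it waits for a grace period and only reports an error if the device is still alive afterwards. The names upgrade_or_report, report_error and GRACE_PERIOD are hypothetical; maintutil is the reporting tool from the original script.

```shell
#!/bin/sh
# Hypothetical grace period (seconds) for the reboot that a successful
# sysupgrade should trigger; adjust to the hardware.
GRACE_PERIOD=600

# Hypothetical wrapper around the reporting tool from the original script.
report_error() {
    /usr/bin/maintutil --reporterror "$1"
}

# Run sysupgrade on the given image and classify its exit code.
# If this function returns at all, the upgrade did not happen: it returns 1.
upgrade_or_report() {
    sysupgrade "$1"
    result=$?
    case "$result" in
        0|10)
            # 0: unexpected but harmless; 10: assumed to be
            # UBUS_STATUS_CONNECTION_FAILED while the system goes down.
            # Give the pending reboot time to happen first.
            sleep "$GRACE_PERIOD"
            report_error "sysupgrade returned $result but no reboot within ${GRACE_PERIOD}s"
            ;;
        *)
            # anything else is treated as a real failure right away
            report_error "sysupgrade failed with error code: $result"
            ;;
    esac
    return 1
}
```

In the updater, cd /tmp; upgrade_or_report fwimg.bin; exit 1 would replace the sysupgrade call and the if block. The trade-off is that a genuine failure that happens to return 10 is only reported after the grace period, but it is still reported.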

I get your point; maybe you could give this a try (it's untested):

    else
        ubus call system sysupgrade "{
            \"prefix\": $(json_string "$RAM_ROOT"),
            \"path\": $(json_string "$IMAGE"),
            \"command\": $(json_string "$COMMAND")
        }" && exit 0 || exit 1
    fi

That would convert the return code 10 into 1, but it does not solve the problem: sysupgrade should either not return or return 0 when it is successful. As it is, it sometimes returns 10 (1 with your patch) for a perfectly working sysupgrade.

The delicate problem is how to silence that false error without suppressing real errors that could happen.
The following code would simulate "not returning" for both the 10 and 0 exit codes, essentially behaving like sysupgrade did in 17.01:

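    else
        ubus call system sysupgrade "{
            \"prefix\": $(json_string "$RAM_ROOT"),
            \"path\": $(json_string "$IMAGE"),
            \"command\": $(json_string "$COMMAND")
        }"
        ret=$?
        # 0 = OK, 10 = UBUS_STATUS_CONNECTION_FAILED (likely the system going down):
        # behave like 17.01 and never return
        if [ $ret -eq 0 ] || [ $ret -eq 10 ]; then
            sleep 3600
            reboot
        fi
        # any other code is a real error; without this exit, the script
        # would end with the if's status (0) and hide the error
        exit $ret
    fi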

But it is still ugly to suppress an arbitrary error code. I'd rather fully understand why this happens and have it fixed at the source...

Maybe my following mail to the developers regarding sysupgrade issues fits into this thread:

I am an unhappy user of sysupgrade on remotely installed devices.
After several unpleasant encounters with sysupgrade, in which incomplete upgrades left systems in an inconsistent state, I implemented more or less successful workarounds and then had a quick glance at the code.
My questions are:

  • Is it safe simply to kill running processes during sysupgrade, given that some services might be restarted automatically (by procd?)?
  • What about a killed process that simply takes some time to shut down? (Example: squid closing lots of open files, with an internal shutdown timer of 30 s by default.)
  • What about an open swap file on a block device?
  • What about a mounted block device used for mass storage?
  • What about a (slow) wwan connection managed by pppd? When pppd is killed by sysupgrade, will netifd restart it?

As a workaround, before running sysupgrade I

  • explicitly use /etc/init.d/service stop
  • explicitly kill squid and wait for it to terminate
  • explicitly disable swap
  • explicitly unmount the mounted block device
  • ifdown wwan
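
The workaround steps above could be bundled into one function that runs before sysupgrade. This is a minimal sketch: the service list, mount point, interface name and the 35 s squid wait are hypothetical placeholders (35 s just exceeds squid's 30 s default shutdown timer mentioned above), and the function returns non-zero instead of carrying on if squid does not exit or the unmount fails. The function names are hypothetical as well.

```shell
#!/bin/sh
# Hypothetical settings: adjust to the services, storage and interface in use.
SERVICES="squid"            # init scripts to stop before the upgrade
MOUNTPOINT="/mnt/storage"   # mass-storage mount point
WWAN_IF="wwan"              # netifd interface of the pppd-managed link

# Wait until process PID has exited; give up after TIMEOUT seconds.
# Returns 0 once the process is gone, 1 on timeout.
wait_for_pid_exit() {
    wpid=$1
    wtimeout=$2
    while kill -0 "$wpid" 2>/dev/null; do
        [ "$wtimeout" -le 0 ] && return 1
        sleep 1
        wtimeout=$((wtimeout - 1))
    done
    return 0
}

# Quiesce the system before sysupgrade; non-zero means "do not upgrade now".
prepare_for_sysupgrade() {
    for svc in $SERVICES; do
        if [ -x "/etc/init.d/$svc" ]; then
            "/etc/init.d/$svc" stop
        fi
    done

    # squid may still be closing files: ask it to exit and wait for it
    for p in $(pidof squid); do
        kill "$p" 2>/dev/null
        wait_for_pid_exit "$p" 35 || return 1
    done

    swapoff -a

    # only unmount if something is actually mounted there
    if grep -q " $MOUNTPOINT " /proc/mounts; then
        umount "$MOUNTPOINT" || return 1
    fi

    ifdown "$WWAN_IF"
    return 0
}
```

Used as prepare_for_sysupgrade && sysupgrade -v /tmp/new_fw.bin, the flash is only attempted once the system is actually quiet.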

Before that, I had several cases where

    sysupgrade -n -v -f /tmp/newfiles.tar.gz /tmp/new_fw.bin

updated all files from /tmp/newfiles.tar.gz but did not flash new_fw.bin, resulting in an inconsistent system.