Sysupgrade aborted with return code 256?

cvocvo · May 13, 2019, 10:02pm

Some system information first:

base firmware version: 18.06.2
architecture: ramips and mips (this happens on both)

So sometimes when I run sysupgrade -n s_upgrade.bin, my router just shuts off. So I opened up a terminal via serial and found this:

Sending KILL to remaining processes ... hostapd hostapd hostapd hostapd
hostapd hostapd hostapd hostapd hostapd hostapd
Failed to kill a[ 125.891451] reboot: Restarting system
ll processes.
sysupgrade aborted with return code: 256

Everything I've read about other people having this issue says to upgrade the firmware (tried that) or to kill the process that is causing sysupgrade to give this error, but it's a different process almost every time. Is there anything else I could try to fix this? It's more annoying than anything, since it doesn't actually break anything: you can just turn the router back on and try again, and it usually works.

Any insight will be helpful, or just let me know if more information would be helpful to you.

Thanks!

slh · May 13, 2019, 11:50pm

Try wifi down and killall -9 hostapd before invoking sysupgrade; there has been an issue with sysupgrade killing off hostapd in the past.
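
The workaround above can be scripted so that the upgrade only starts once hostapd has really exited. This is a minimal sketch, not a tested fix: the helper names is_running and wait_gone are hypothetical, and it assumes a Linux /proc layout and process names without spaces.

```shell
#!/bin/sh
# Sketch only: is_running and wait_gone are hypothetical helper names.

# Return 0 if any process whose comm name equals $1 is running.
is_running() {
        for stat in /proc/[0-9]*/stat; do
                [ -f "$stat" ] || continue
                read -r pid name rest < "$stat" || continue
                name="${name#(}"; name="${name%)}"
                [ "$name" = "$1" ] && return 0
        done
        return 1
}

# Poll up to $2 times (default 5), one second apart, until process $1 is gone.
# Returns 1 if it is still running after the last check.
wait_gone() {
        tries="${2:-5}"
        while is_running "$1"; do
                tries=$((tries - 1))
                [ "$tries" -le 0 ] && return 1
                sleep 1
        done
        return 0
}

# Intended use on the router (not executed here):
#   wifi down
#   wait_gone hostapd 5 || killall -9 hostapd
#   sysupgrade -n s_upgrade.bin
```

Polling with a bound keeps the delay short when hostapd exits quickly, and still falls back to killall -9 if it does not.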

cvocvo · May 14, 2019, 1:03am

Thanks for the reply! At the moment I’m not running a build of OpenWrt with any wireless drivers, so hostapd isn’t running on the system. While digging into this I did come across reports, on the issue tracker or a mailing list, that WiFi might cause this issue. To isolate the problem, I’m running a build without any WiFi. I will look more in the morning to see if I can jog my memory on which process caused the issue more than once.

reinerotto · May 14, 2019, 4:32am

Will there be an official patch?
I'm asking because I also had issues with sysupgrade not doing anything. I blamed an open swapfile for it, but maybe I was wrong.

cvocvo · May 14, 2019, 11:11pm

Some more testing today and the two processes I have seen cause sysupgrade to fail more than once are ntpd and odhcpd. Can I just run killall -9 ntpd and killall -9 odhcpd before sysupgrades to prevent that?

EDIT: mount_root is another process that I have seen cause this problem.

slh · May 14, 2019, 11:14pm

Yes, that shouldn't cause any issues.

cvocvo · May 15, 2019, 7:40pm

So when I run killall -9 ntpd and killall -9 odhcpd the processes just restart themselves. I suspect there's some underlying mechanism that restarts them that I also need to stop?
I'm also trying to do this programmatically in C to run sysupgrade, but when I run this, ntpd and odhcpd restart and then the C app hangs so sysupgrade doesn't execute.

char *kill_ntpd[] = { "/usr/bin/killall", "ntpd", NULL };
char *kill_odhcpd[] = { "/usr/bin/killall", "odhcpd", NULL };
execv(kill_ntpd[0], kill_ntpd);
execv(kill_odhcpd[0], kill_odhcpd);

Any ideas?

reinerotto · May 17, 2019, 4:49am

I suspect procd does the auto restart.

To verify this, you might try
/etc/init.d/ntpd disable
killall -9 ntpd

cvocvo · May 21, 2019, 3:33pm

Hmm I couldn't get that to work for ntpd; the process still restarted. I checked and ntpd isn't in the init.d directory, so maybe something else is starting it?

However, the disable for odhcpd seemed to work.

Edited my post above, but I also see this happening with mount_root. Would running /etc/init.d/umount disable stop mount_root from restarting (without causing any detrimental effects)?

Thanks!
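
On the ntpd question: on stock OpenWrt images the BusyBox ntpd is normally started by /etc/init.d/sysntpd, which would explain why there is no /etc/init.d/ntpd. Also, disable only removes the boot-time symlink; stop is what asks procd to shut down a running service without respawning it. A minimal sketch along those lines, assuming the services in question have init scripts (the function name stop_services is hypothetical, and INIT_DIR exists only so the sketch can be tried off-device):

```shell
#!/bin/sh
# Sketch only: stop_services is a hypothetical helper name.
# INIT_DIR can be overridden for testing; on a router it is /etc/init.d.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

# Stop each named service through its init script so that procd
# does not respawn it. Returns non-zero if any service could not be stopped.
stop_services() {
        rc=0
        for svc in "$@"; do
                if [ -x "$INIT_DIR/$svc" ]; then
                        "$INIT_DIR/$svc" stop || rc=1
                else
                        echo "no init script for $svc in $INIT_DIR" >&2
                        rc=1
                fi
        done
        return $rc
}

# Intended use on the router (not executed here):
#   stop_services sysntpd odhcpd && sysupgrade -n s_upgrade.bin
```

This does not cover mount_root, which is not a long-running service with its own init script in the same sense.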

jeff · May 27, 2019, 6:05pm

I'm seeing this with a build off master (from the 25th?) for an EA8300 as well

root@test:/etc# sysupgrade /tmp/OpenWrt-2019-05-27_1029-0700-ipq40xx-linksys_ea8300-squashfs-sysupgrade.bin
Saving config files...
Commencing upgrade. Closing all shell sessions.
Watchdog handover: fd=3
- watchdog -
killall: telnetd: no process killed
killall: dropbear: no process killed
Sending TERM to remaining processes ... syslog-ng syslog-ng ntpd ubusd wpa_supplicant wpa_supplicant [ 1570.713652] batman_adv: bat0: Interface deactivated: mesh0
[ 1570.714442] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 0
[ 1570.718103] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 0
logd netifd sshd 
[ 1570.827718] ath10k_ahb a800000.wifi: peer-unmap-event: unknown peer id 0
[ 1570.827763] ath10k_ahb a800000.wifi: peer-unmap-event: unknown peer id 0
[ 1572.970913] ath10k_pci 0000:01:00.0: 10.4 wmi init: vdevs: 16  peers: 48  tid: 96
[ 1572.970957] ath10k_pci 0000:01:00.0: msdu-desc: 2500  skid: 32
[ 1573.025588] ath10k_pci 0000:01:00.0: wmi print 'P 48/48 V 16 K 144 PH 176 T 186  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[ 1573.027158] ath10k_pci 0000:01:00.0: wmi print 'free: 117936 iram: 22500 sram: 29708'
[ 1573.341881] ath10k_pci 0000:01:00.0: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[ 1573.343208] IPv6: ADDRCONF(NETDEV_UP): mesh0: link is not ready
[ 1573.352427] batman_adv: bat0: Interface activated: mesh0
Sending KILL to remaining processes ... wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant wpa_supplicant 
Failed to kill all processes.
sysupgrade aborted with return code: 256

Merge base is

commit ace241014c (master)
Author: Petr Štetiar <redacted>
Date:   Fri May 24 15:36:44 2019 +0200

Killing wifi is a work-around

root@test:/# wifi down
'radio2' is disabled
root@test:/# [  276.044774] batman_adv: bat0: Interface deactivated: mesh0
[  276.045465] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 0
[  276.049234] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 0
[  276.202968] ath10k_ahb a800000.wifi: peer-unmap-event: unknown peer id 0
[  276.203014] ath10k_ahb a800000.wifi: peer-unmap-event: unknown peer id 0
[  276.233717] batman_adv: bat0: Removing interface: mesh0
[  278.355547] ath10k_pci 0000:01:00.0: 10.4 wmi init: vdevs: 16  peers: 48  tid: 96
[  278.355592] ath10k_pci 0000:01:00.0: msdu-desc: 2500  skid: 32
[  278.410340] ath10k_pci 0000:01:00.0: wmi print 'P 48/48 V 16 K 144 PH 176 T 186  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[  278.412524] ath10k_pci 0000:01:00.0: wmi print 'free: 117936 iram: 22500 sram: 29708'
[  278.726857] ath10k_pci 0000:01:00.0: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[  278.728195] IPv6: ADDRCONF(NETDEV_UP): mesh0: link is not ready
[  280.213360] ath10k_ahb a800000.wifi: 10.4 wmi init: vdevs: 16  peers: 48  tid: 96
[  280.213406] ath10k_ahb a800000.wifi: msdu-desc: 2500  skid: 32
[  280.260600] ath10k_ahb a800000.wifi: wmi print 'P 48/48 V 16 K 144 PH 176 T 186  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[  280.262178] ath10k_ahb a800000.wifi: wmi print 'free: 56576 iram: 23368 sram: 35968'
[  280.564819] ath10k_ahb a800000.wifi: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[  280.565166] IPv6: ADDRCONF(NETDEV_UP): mesh8: link is not ready

root@test:/# sysupgrade /tmp/OpenWrt-2019-05-27_1029-0700-ipq40xx-linksys_ea8300-squashfs-sysupgrade.bin
Saving config files...
Commencing upgrade. Closing all shell sessions.
Watchdog handover: fd=3
- watchdog -
killall: telnetd: no process killed
killall: dropbear: no process killed
Sending TERM to remaining processes ... syslog-ng syslog-ng ntpd ubusd logd netifd sshd 
Sending KILL to remaining processes ... 
Switching to ramdisk...
[  321.122581] UBIFS (ubi0:1): background thread "ubifs_bgt0_1" stops
[  321.248580] UBIFS (ubi0:1): un-mount UBI device 0
Performing system upgrade...

reinerotto · May 28, 2019, 12:50pm

There is an open bug report from me. Maybe now you can at least confirm serious issues with sysupgrade. I suspect it needs a careful overhaul, as sysupgrade is very dangerous when it isn't working properly.

jeff · May 28, 2019, 2:23pm

Feel free to copy the above and add to the bug report. I don't see a link for it here.

If it were only wireless causing the issue, a "hack" of wifi down would likely cover it. However, I see an issue with mount_root described by @cvocvo as well.

reinerotto · May 28, 2019, 2:41pm

My bug report: FS#2024
As yours is a different issue with sysupgrade, you might do better to file your own and reference mine, please.
Various other "effects" of using sysupgrade are also described here on the forum.

cvocvo · May 30, 2019, 1:28pm

Should I also file a bug? If so, where do I do that?

cvocvo · November 15, 2019, 8:39pm

Bumping this back up; it appears @reinerotto logged the bug here: https://bugs.openwrt.org/index.php?do=details&task_id=2024

I'm going to link this post to that bug report.

jeff · November 15, 2019, 8:47pm

Which still hasn't been fixed as, like far too many things, the OpenWrt devs "don't like" the fix.

https://patchwork.ozlabs.org/patch/1116473/

the sleep 1 is really not good. could you try to figure out what
actually causes the 256 and try to fix that instead please ?

John

Note that procd is effectively undocumented and if a one-second delay to let hostapd gracefully shut down is "not good", what negative adjective then properly describes this mess?

kill_remaining() { # [ <signal> [ <loop> ] ]
        local loop_limit=10

        local sig="${1:-TERM}"
        local loop="${2:-0}"
        local run=true
        local stat
        local proc_ppid=$(cut -d' ' -f4  /proc/$$/stat)

        echo -n "Sending $sig to remaining processes ... "

        while $run; do
                run=false
                for stat in /proc/[0-9]*/stat; do
                        [ -f "$stat" ] || continue

                        local pid name state ppid rest
                        read pid name state ppid rest < $stat
                        name="${name#(}"; name="${name%)}"

                        # Skip PID1, our parent, ourself and our children
                        [ $pid -ne 1 -a $pid -ne $proc_ppid -a $pid -ne $$ -a $ppid -ne $$ ] || continue

                        local cmdline
                        read cmdline < /proc/$pid/cmdline

                        # Skip kernel threads
                        [ -n "$cmdline" ] || continue

                        echo -n "$name "
                        kill -$sig $pid 2>/dev/null

                        [ $loop -eq 1 ] && run=true
                done

                let loop_limit--
                [ $loop_limit -eq 0 ] && {
                        echo
                        echo "Failed to kill all processes."
                        exit 1
                }
        done
        echo
}

indicate_upgrade

killall -9 telnetd
killall -9 dropbear
killall -9 ash

kill_remaining TERM
sleep 3
kill_remaining KILL 1

sleep 1
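
A note on the number itself: the exit 1 in kill_remaining and the reported 256 are consistent if the caller prints the raw wait status rather than the exit code. That is an assumption about how the message is produced, not something confirmed in this thread. On Linux, a raw wait status carries the exit code in bits 8 to 15 and a terminating signal in the low 7 bits, so exit code 1 shows up as 256. The helper name decode_wait_status below is hypothetical, and it ignores stopped or core-dumped states:

```shell
#!/bin/sh
# Sketch only: decode a raw Linux wait status into a readable form.
# decode_wait_status is a hypothetical helper name.
decode_wait_status() {
        status=$1
        sig=$((status & 127))            # low 7 bits: terminating signal, 0 if none
        if [ "$sig" -eq 0 ]; then
                echo "exited with code $(( (status >> 8) & 255 ))"
        else
                echo "killed by signal $sig"
        fi
}

decode_wait_status 256   # prints: exited with code 1
```

Read this way, "return code: 256" simply means the upgrade script hit its "Failed to kill all processes." branch and exited with 1.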

cvocvo · November 15, 2019, 9:02pm

Seems like a nominal delay would be preferable over a system crashing. In our case we have some devices at remote locations, so crashing and requiring a manual reboot + retry to upgrade loop until it works isn't a great option.

I commented a link to your patch and post on the submitted bug.