Reboot 3 times = alternative partition to boot from?

mrgenie · September 25, 2017, 1:31pm

I have written a script that checks whether my remote servers (reached over an OpenVPN connection) are still alive and reachable.

If I can't see any of the remote servers, reboot!

This works fine on a TP-Link TL-WR1043N/ND v2 running LEDE Reboot SNAPSHOT r3979-226e194 / LuCI Master (git-17.109.24760-2eda14b), self-compiled firmware.

On my WRT1200AC, 3200ACM and 1900ACS, with self-compiled firmware from the same snapshot and exactly the same settings, the script also works, but after the 3rd reboot these Linksys WRT routers boot from the alternative partition!

Why would rebooting 3 times using this script cause the alternative partition to be booted?
When I manually reboot 3 times it doesn't happen!

here's the script:

#!/bin/sh /etc/rc.common
#Sometimes routers for whatever reason need a reboot. This script is being used in combination  with an openVPN and thus we implement the logic: "the moment routers/servers on
#other physical locations which are only accessible over the oVPN connection are not there anymore, then something must be wrong. Since we don't know what, we reboot and let the
#local router reinitialize"

START=92  #after oVPN(90) and ebtables (91)

#Domain Servers, all must be accessible at all times
ANDHOSTS='172.22.56.2 172.22.64.2 172.22.67.2'

BOOTDELAY=120  		# Time to give all services to boot before we start probing
LOOPDELAY=180  		# Time to delay between probes
PROBECOUNT=2   		# Number of probes per PING
NRLOOPSTOREBOOT=3 	# How many loops we have to go through with a broken Hosts before we actually reboot, NRLOOPSTOREBOOT * LOOPDELAY = Offline time for any server before reboot.
LOOPMAX=3      		# Amount of retries each round of probes


FILE="/etc/monitorNetworkLog"

boot() 
{
	echo "$(date +"%d-%m-%Y-%T") - System Up" >> $FILE
	echo "ProcessID: $$" >> $FILE
	sleep $BOOTDELAY
	LOOPCOUNT=0
	while :
		do
			DOREBOOT=0			
			for myHost in $ANDHOSTS
				do
					count=$(ping -c $PROBECOUNT $myHost | grep 'received' | awk -F',' '{ print $2 }' | awk '{ print $1 }')
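					# ping prints no summary line when it fails outright (e.g. no route), so treat an empty count as 0
					if [ "${count:-0}" -eq 0 ]; 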
						then
							# at least one HOST is down and declared as AND, so reboot
							echo "$(date +"%d-%m-%Y-%T") - $myHost is down" >> $FILE
							DOREBOOT=1
					fi
			done

			if [ $DOREBOOT -eq 0 ];
				then
					echo "$(date +"%d-%m-%Y-%T") - All VPN connections OK!" >> $FILE
					LOOPCOUNT=0
					sleep $LOOPDELAY
				else
					LOOPCOUNT=`expr $LOOPCOUNT + 1`
					echo "$(date +"%d-%m-%Y-%T") - System Down: $LOOPCOUNT" >> $FILE
					# use the configured threshold instead of a hard-coded 3
					if [ $LOOPCOUNT -gt $NRLOOPSTOREBOOT ];
						then
							echo "$(date +"%d-%m-%Y-%T") - System Reboot" >> $FILE
							reboot -d 30
							kill -9 $$
							break
					fi
			fi
	done
}

mrgenie · September 25, 2017, 6:11pm

Thus far I have found out that rebooting from a script causes the router to boot from the other boot partition.

There are 2 of them, and thus rebooting from the script keeps ping-ponging between the boot partitions.

How do I prevent that?

mrgenie · September 25, 2017, 9:12pm

Ok, so when I run that script as a cronjob, problem solved!
I have no idea why initiating a reboot from a script causes a boot from the other partition after exactly 3 reboots.

anomeome · September 25, 2017, 9:18pm

You should be able to disable it in the U-Boot environment:

fw_setenv auto_recovery yes/no

but you of course lose your fail-safe mechanism.
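
Before changing anything, it may help to look at what is currently set. A minimal sketch, assuming the uboot-envtools package (fw_printenv/fw_setenv) is installed and that the bootloader exposes the variables auto_recovery and boot_part, as it commonly does on this router family. The helper name uboot_var is made up for illustration:

```shell
#!/bin/sh
# Sketch: read a single U-Boot environment variable.
# fw_printenv NAME prints "NAME=value"; keep only the value part.
# uboot_var is a hypothetical helper name, not part of any package.
uboot_var() {
    fw_printenv "$1" 2>/dev/null | sed -n "s/^$1=//p"
}
```

For example, uboot_var auto_recovery would show the current setting, and uboot_var boot_part would show which partition the bootloader is set to use, if that variable exists on your device.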

mrgenie · September 25, 2017, 10:28pm

Thanks anomeome! Didn't know this was possible.

During tests with new drivers the fail-safe is great of course. Once you have a stable working environment,
the automatic fail-safe isn't really needed. Can always manually trigger it if really needed.

hnyman · September 26, 2017, 6:28am

That is a well-known and much-discussed failsafe feature of the WRT1900AC series/family. It is mentioned in the wiki and discussed here and in the OpenWrt forum.

But the failsafe switching of the boot partition after three reboots is supposed to be triggered only after three failed reboots. Each successful boot process resets the failure counter.

It sounds like your script caused a new reboot before the previous reboot had fully completed. And this led to several consecutive reboots, ultimately switching the boot partition.

Note that the boot process is partially asynchronous. You should not expect that something started at priority 90 has fully completed by the time the item at priority 92 gets launched.
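
One way to avoid relying on start priorities is to poll for readiness with a timeout instead of sleeping for a fixed time. This is only a sketch; the helper name wait_for is made up for illustration:

```shell
#!/bin/sh
# Sketch: retry a command once per second until it succeeds or the
# timeout (in seconds) runs out. wait_for is a hypothetical helper.
# Usage: wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
wait_for() {
    remaining=$1
    shift
    while [ "$remaining" -gt 0 ]; do
        # Succeed as soon as the command succeeds once.
        "$@" >/dev/null 2>&1 && return 0
        sleep 1
        remaining=$((remaining - 1))
    done
    # Timed out without a single success.
    return 1
}
```

For example, wait_for 120 ping -c 1 -W 1 172.22.56.2 would wait up to roughly two minutes for the first VPN host to answer before the regular probing starts.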

mrgenie · September 26, 2017, 7:56am

Thank you for that explanation hnyman.

But all processes are up by the time my script does the reboot (180 + 3*120 seconds = 540 seconds, plus the time to do
the pings, is well over 10 minutes).

But when I click reboot from the GUI at a faster rate than my script doing the reboot I don't have this problem. Can even use the GUI to reboot as soon as the web interface is accessible, which is after 15-20 seconds.

I found out that when the script itself issues a reboot, the router doesn't reboot. The process hangs. That's why I put the
kill -9 $$ in the code: to kill the script, because after "reboot" is called from within the script, the script keeps running despite the break command.

Not sure if this could be any reason for it to behave like that. I mean, it shouldn't and on the TP router this script just works fine without the fail-over being activated.

So the real question is: why does this script behave so differently on the WRT routers compared to the TP-Link router?

Anyway, using a cron job solves the issue. Instead of an endless loop in a script started at boot, I let a cron job run every 6 minutes, and that works as well. When the reboot is issued from inside the cron job, the job itself has no while loop, so it ends on its own and doesn't hang.
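
For anyone wanting to try the cron approach, here is a rough sketch of what a one-shot version could look like. It is not my exact script: the state file path, the function names and the "run" argument are assumptions made for illustration. The failure counter lives in /tmp, so it starts from zero again after every reboot.

```shell
#!/bin/sh
# Sketch of a one-shot VPN check meant to be started by cron.
# STATE, MAX_FAILURES and the function names are illustrative assumptions.

HOSTS="${HOSTS:-172.22.56.2 172.22.64.2 172.22.67.2}"
STATE="${STATE:-/tmp/vpn-check.failures}"
MAX_FAILURES="${MAX_FAILURES:-3}"

# Return 0 only if every host answers at least one ping.
all_hosts_up() {
    for host in $HOSTS; do
        ping -c 2 -W 2 "$host" >/dev/null 2>&1 || return 1
    done
    return 0
}

# Run one check. Prints "ok", "down N" or "reboot"; returns 2 when a
# reboot is due.
check_once() {
    if all_hosts_up; then
        rm -f "$STATE"
        echo "ok"
        return 0
    fi
    failures=$(cat "$STATE" 2>/dev/null || echo 0)
    failures=$((failures + 1))
    echo "$failures" > "$STATE"
    if [ "$failures" -ge "$MAX_FAILURES" ]; then
        echo "reboot"
        return 2
    fi
    echo "down $failures"
    return 1
}

# Only act when called with the argument "run", so the functions can
# also be sourced and tried out without rebooting anything.
if [ "${1:-}" = "run" ]; then
    check_once >/dev/null
    status=$?
    if [ "$status" -eq 2 ]; then
        reboot
    fi
fi
```

Saved as, say, /usr/bin/vpn-check.sh (a hypothetical path), the matching line in /etc/crontabs/root would be:

    */6 * * * * /usr/bin/vpn-check.sh run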

Ah.. going to test this today..

pull the reboot command out of the while loop!
Maybe because the reboot command is given BEFORE the BREAK that causes some issues..

So I'm going to try today to just set a variable to true and then break,
and after the while loop: if true, then reboot.
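
Roughly what I have in mind, as an untested sketch. check_hosts is a placeholder for the ping logic in the script above, and monitor is just a name made up for this example:

```shell
#!/bin/sh
# Sketch: the loop only decides; the reboot is issued after the loop ends.
# check_hosts is a hypothetical function that returns 0 when all hosts answer.
# Usage: monitor MAX_FAILURES DELAY_SECONDS
monitor() {
    max_failures=$1
    delay=$2
    failures=0
    need_reboot=0
    while :; do
        if check_hosts; then
            failures=0
        else
            failures=$((failures + 1))
            if [ "$failures" -gt "$max_failures" ]; then
                # Remember the decision and leave the loop first.
                need_reboot=1
                break
            fi
        fi
        sleep "$delay"
    done
    # Return 0 when the caller should reboot.
    [ "$need_reboot" -eq 1 ]
}
```

The caller would then do something like: monitor 3 180 && reboot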

Just an assumption I want to rule out: that the reboot inside a while loop, with the break out of the while loop
after the reboot command, might cause this strange behavior. Although I got it working now with a cron job, out of pure
interest I want to know the reason for this behavior.

So, I'll post later today.

hnyman · September 26, 2017, 9:03am

You might also test detaching the reboot command with &

reboot &

or even a subshell

(reboot )&

see e.g. https://forum.openwrt.org/viewtopic.php?pid=345088#p345088