Buttonless Failsafe Mode

For devices that are in hard to access locations like outdoor mounted devices or devices that do not have a physical reset button, wouldn't it be a good idea to facilitate a buttonless failsafe mode?

For example, rather than press a button after 5 seconds from power on, how about a device connected to LAN sends a particular packet sequence or similar? Or an SSID is generated with a name and key.

I'm sure there are flaws in my ideas above, but I hope there might be a viable way to facilitate triggering a buttonless return to failsafe mode.

Having just had to reach out a window with a screwdriver and then purchase a ladder following a bad config that blocked access, I think something along these lines might be very useful.

I initially thought you could just touch /tmp/.failsafe, but that’s going to be deleted on reboot right?

Failing any cleverer ideas, what about modifying this file to look for a trigger file in persistent storage:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/base-files/files/lib/preinit/30_failsafe_wait;hb=3cfa465387ee75451d59a37a3a198bba2deed3ed#l88
It would have to wipe this file once in failsafe.

I'm likely being dense here, but once access to router has been lost, how would you write out that file to persistent storage to trigger the failsafe?

To give some more context, I realise that LuCI has the fantastic revert feature whereby changes are rolled back if you fail to reconnect after applying them. But no such feature exists when modifying files such as /etc/config/network over ssh. In fact, there is no error checking whatsoever upon 'service network restart', meaning a bad config file setting (perhaps even just a dot in an interface name?) can cause a loss of access that persists across reboots. Perhaps a separate feature request ought to be to introduce some basic error checking before accepting a modified network configuration on restart, with a warning about issues likely to cause loss of access.
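To make the "basic error checking" idea concrete, here is a minimal sketch of a pre-restart check. It assumes that UCI section names may only contain letters, digits, and underscores, and it only looks at `config interface` lines, so it is nowhere near a full validator. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print interface section names from a UCI-style file
# that contain anything other than letters, digits, and underscores.
find_bad_interface_names() {
    # Matches lines such as: config interface 'lan'
    sed -n "s/^[[:space:]]*config[[:space:]][[:space:]]*interface[[:space:]][[:space:]]*['\"]\{0,1\}\([^'\"[:space:]]*\)['\"]\{0,1\}.*/\1/p" "$1" |
        grep -v '^[A-Za-z0-9_]*$' || true
}

# Returns 0 when no suspicious names are found, 1 otherwise.
check_network_config() {
    bad="$(find_bad_interface_names "$1")"
    if [ -n "$bad" ]; then
        echo "refusing to restart, suspicious interface names:" >&2
        echo "$bad" >&2
        return 1
    fi
    return 0
}
```

It could be used as a guard, for example `check_network_config /etc/config/network && /etc/init.d/network restart`, so that an obviously malformed name is caught before the restart rather than after access is lost.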

Another option for buttonless failsafe might be to automatically revert to failsafe if no LAN devices connect post reboot or network change, or something along those lines.
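As a rough sketch of the "no LAN devices connect" test, the functions below count neighbour-table entries that look alive. They assume the input has the format printed by `ip neigh show`, with the entry state as the last field. The function names and the choice of states are illustrative only.

```shell
#!/bin/sh
# Reads neighbour-table lines on stdin (format of `ip neigh show`) and
# prints how many entries are in a state that suggests a live client.
count_live_neighbours() {
    awk '$NF == "REACHABLE" || $NF == "STALE" || $NF == "DELAY" { n++ }
         END { print n + 0 }'
}

# Hypothetical policy: exit 0 when no live client has been seen,
# i.e. when some recovery action might be considered.
lan_looks_dead() {
    [ "$(count_live_neighbours)" -eq 0 ]
}
```

Something like `ip neigh show dev br-lan | lan_looks_dead` could then feed a recovery decision, assuming the LAN bridge is called `br-lan`. Note that a network that is simply quiet looks exactly the same as a broken one to this check, so on its own it would be prone to false triggers.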

Sorry, when you said failsafe without physical access, I was thinking of access to a fully functional and ok device. Which then raises the question of why you would want to reboot such a device into failsafe.
Maybe I’m the dense one :smile:

LuCI makes all its config changes using uci commands without committing.
It then restarts the affected service, e.g. the network.
Only when confirmed does it commit the changes.

You can do this yourself using uci command line.
Then if all goes pear shaped you can revert by powering off and on.

Once you are sure it is working do the uci commit network or whatever.
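A related approach for hand-edited files is a timed rollback: keep a known-good copy, apply the edit, and restore the copy unless you confirm within a timeout. The sketch below is one possible shape for that, not how LuCI does it. The function name, the confirm-file convention, and all paths are assumptions, and the restart command is passed in by the caller.

```shell
#!/bin/sh
# apply_with_rollback CONFIG BACKUP CONFIRM_FILE TIMEOUT_SECONDS RESTART_CMD...
# Hypothetical helper: applies CONFIG, then restores BACKUP unless
# CONFIRM_FILE appears within TIMEOUT_SECONDS.
apply_with_rollback() {
    cfg="$1"; backup="$2"; confirm="$3"; timeout="$4"
    shift 4

    # The caller copies the working config to BACKUP before editing CONFIG.
    [ -f "$backup" ] || { echo "no backup at $backup" >&2; return 2; }

    rm -f "$confirm"
    "$@"                                # apply the edited config

    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        [ -e "$confirm" ] && return 0   # user got back in and confirmed
        sleep 1
        waited=$((waited + 1))
    done

    # No confirmation in time: restore the known-good file and re-apply.
    cp "$backup" "$cfg"
    "$@"
    return 1
}
```

For example: copy /etc/config/network to /root/network.known-good, edit the original, then run `apply_with_rollback /etc/config/network /root/network.known-good /tmp/net-ok 120 /etc/init.d/network restart` in a way that survives the ssh session dropping, and `touch /tmp/net-ok` once you have reconnected. Unlike relying on a power cycle, this restores access without anyone touching the device.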

I believe that idea, in one form or another, comes to anyone who had to deal with a nonresponsive device after misconfiguration. Personally, I was thinking about it when I funked up the config on one of my My Book Live NASes (despite having a physical button they have no failsafe).

Not quite an option for outdoor use, but the most sensible approach might be an esp32 for remote power and serial console access.

So how does this trigger file idea work? Is it that one makes changes to the config, and then on reboot the device reverts to failsafe unless the trigger file has been deleted?

Because the whole reason to want to get to failsafe is because you're locked out and would need to hit the reset button to get back in via failsafe.

That's my thinking behind the buttonless failsafe.

Exactly! My wife gave me a telling off for hanging outside my office window with a screwdriver and spanner. Then I went and bought a ladder. Either way, it was a pain to get access to press that reset button.

And I thought to myself that it would have been much nicer to have been able to initiate a buttonless reset somehow.

For example, send a specific packet sequence from my PC connected to LAN port after having power cycled the router.

Or perhaps just that, absent a connection or some minuscule volume of traffic, it resets.

Or how about a power off and on sequence trigger. So e.g. power off, power on, wait 2 mins, power off, power on, wait 2 mins, power off and power on. And then we're in failsafe.
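The power-cycle sequence could be approximated with a counter of consecutive "short" boots: each boot increments a counter in persistent storage, and the counter is cleared once the device has stayed up long enough. This is only a sketch with made-up function names. Because it needs persistent storage, it would run after the root filesystem is mounted, so it could at best do something like restore a known-good config rather than enter the real failsafe mode.

```shell
#!/bin/sh
# record_boot COUNTER_FILE THRESHOLD
# Increments the count of consecutive short boots.
# Exits 0 when the count has reached THRESHOLD.
record_boot() {
    file="$1"; threshold="$2"
    count=0
    [ -r "$file" ] && count="$(cat "$file")"
    # Treat an empty or corrupted counter as zero.
    case "$count" in
        ''|*[!0-9]*) count=0 ;;
    esac
    count=$((count + 1))
    echo "$count" > "$file"
    [ "$count" -ge "$threshold" ]
}

# mark_boot_stable COUNTER_FILE
# Called once the device has stayed up long enough; clears the count.
mark_boot_stable() {
    rm -f "$1"
}
```

A startup script might call `record_boot /etc/boot-count 3` early on and schedule `mark_boot_stable /etc/boot-count` for, say, ten minutes later (the path and both numbers are arbitrary). Flapping mains power or a brownout could produce the same pattern by accident, which is the false-positive risk to weigh.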

Is there a knockd package that could be used for this?

Knockd manpage

Looks promising. Ideally the buttonless failsafe would become a standard feature in OpenWrt and a reliable way to trigger failsafe without a button press. I'm not sure what the best way to achieve that would be, given the various options such as a power cycling sequence or a packet sequence.

IMHO the hardware button-based failsafe trigger exists because no software-based failsafe trigger can be made 100% reliable against all potential ways to soft-brick your device.

If you're concerned that the changes you make can potentially soft-brick your device, how about you get two of them and experiment with one on your desk before you roll out changes to the one which is hard to access?

Hmm, OK but this logic sounds to me a bit like: don't do it because it's not done presently.

I don't think this presents a good reason for giving up on exploring possible ways to establish a reliable buttonless failsafe trigger.

And this software-based classification is somewhat false anyway isn't it because everything here is software based. Acting on the press of the reset button is software-based. OpenWrt is software and we are dealing in software.

We are just trying to think about an alternative trigger to the button press for a hard to access or non-existent button.

Sure, don't get into a broken state in the first place. But there's a reset button (and hopefully a buttonless failsafe trigger) because 'shit happens'.

The point is that this would be super useful. Let's give this idea a chance, eh?

No, it sounds like it can't be done.

On the other hand, why solve a problem for which a solution is known to exist? Some people find it more fun to solve a problem for which a solution is proven not to exist...

Well again this is just an assertion. I sincerely doubt there is not a good way to setup a reliable trigger of one form or another.

If you don't like the idea, fine, but you've not provided any solid reasons against exploring it further. Just because a buttonless failsafe trigger isn't physical doesn't mean it wouldn't be incredibly useful when needed. And again, just because it hasn't been done yet doesn't mean a good option does not exist for doing it.

I'm still holding out hope despite your pessimism. We have a bright and creative community here and mostly an enthusiastic and 'can do' attitude when it comes to solving problems.

Keep searching and learning and making us learn along the way. Keep in mind power outages, brownouts, faulty UPSes and timings.

Gazillion IoT devices and growing. You've got your task on table.

I agree with @stangri that it is unlikely that anyone could design a button-less failsafe trigger that would be sufficiently reliable to prevent/rollback bad configs that would result in soft bricking. Worse, though, might be the situation where it triggers on a false positive and puts the router into failsafe mode or resets the router.

In the case of a false positive trigger into failsafe mode, the router would appear to be seriously broken... think about it - no DHCP, default address at 192.168.1.1 (which may not be the user's normal address/subnet), no VLANs, default LAN port assignments, no LuCI web interface, no routing, no firewall, etc. The entire network would basically just break and it wouldn't be clear why until you manually assigned a computer an address in the 192.168.1.0/24 subnet and attempted to connect to the router by ssh (you might even need to alter the configuration of a managed switch to expect an untagged network).

This kind of reminds me of the Seinfeld episode where George was peeing in the gym shower.... "it's all pipes!"

But back to the point here... the button press is an intentional physical process that must also occur within a specific window of time during the boot cycle... it is a very reliable (even unequivocal) indicator that failsafe mode is desired/needed. Aside from faulty hardware or someone just messing with the device, it's not going to have a false positive event.

Obviously that's not ideal, but imagine a situation where a false trigger occurred and you needed to connect a wire to a different port, or really any situation where you now need physical access as part of your troubleshooting because the network was working and mysteriously stopped. Failsafe really needs to be very reliable both in avoiding false positives and in triggering properly when needed (no false negatives).

All of this said, given that this is a software feature that you're trying to define, you can write your own scripts to trigger on whatever conditions you want (and also to cancel on whatever counter-triggers you define) and put it in /etc/init.d/ and have it run early in the startup sequence.

If you manage to make a script that is very reliable with no false triggers and successfully engages failsafe mode when there is a real problem (i.e. the logical test conditions can conclusively determine that there is an issue and then trigger failsafe), you can share that with the community so that others can benefit.

Nice post @psherman. So for sure we don't want false positive triggers. But what makes you so sure it wouldn't be possible to design a reliable trigger that is not going to lead to false positives?

For the sake of argument, how about the power on/off sequence? Don't you think it would be possible to set a sequence that is sufficiently complicated that it wouldn't happen by accident but sufficiently simple that it's not unduly burdensome to trigger?

Or how about a special LAN cable sequence? Now I realise that we can't rely on the LAN device getting an IP. But does that preclude being able to send some form of signal over the LAN cable that's reliable? Can't some form of broadcast sequence or alternative be adopted? Such a sequence just like the button press could be an intentional process that must also occur within a specific window of time during the boot cycle.

Those are just two ideas. But there may well be one or more better and more viable options.

Aren't there even devices with no button? Is the best option for those to have to break them apart and attach a serial console?

(Deleted: This approach won't work, see below.)

Okay, that doesn't work, and in hindsight I should have known. For failsafe to work, the root must not be mounted yet, which means a trigger file set in the root filesystem will not be available for checking (and neither will a preinit script just written to the root filesystem; it would need baking into the ROM).

As I mentioned earlier, OpenWrt already has this built in, in the form of UCI uncommitted changes. This is what LuCI uses for its rollback process.

An alternative is the approach taken with data centre grade servers, where special hardware is used. This is either "boot from LAN" or a small, self-contained secondary "computer" that is hard wired into the server and can provide remote monitoring, reset, configuration and similar functions, even if the server refuses to boot. This alternative does require additional dedicated hardware of one form or another. Some ISP grade routers do indeed support it.

For the purposes of this thread though, some sort of enhancement to the standard uci method could be developed to make the uncommitted uci process more foolproof from the CLI. Here the buttonless failsafe is activated by a power cycle.
