Buttonless Failsafe Mode

stangri · February 1, 2023, 1:23pm

Elaborating on @takimata's idea, there you go: https://docs.openwrt.melmac.net/brickproof/, code at https://github.com/stangri/source.openwrt.melmac.net/tree/master/brickproof

A shell script which forces Failsafe Mode on next router boot up.

Run brickproof on before you make any changes to the router and if all goes well, run brickproof off to turn the Failsafe Mode off for next boot up.

If things don't go well, reboot the router and it will go into Failsafe Mode.

There's your software trigger.
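
A minimal sketch of that arm/disarm workflow, in Python purely for illustration. It is not the brickproof code (that is a shell script, see the links above); the `FailsafeFlag` class and the `force_failsafe` file name are made up for this example, and a temporary directory stands in for wherever a real flag would live.

```python
import tempfile
from pathlib import Path


class FailsafeFlag:
    """Hypothetical marker file: if it exists, the next boot goes to failsafe."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def arm(self) -> None:
        # Corresponds to "brickproof on": request failsafe on the next boot.
        self.path.touch()

    def disarm(self) -> None:
        # Corresponds to "brickproof off": next boot is a normal boot again.
        self.path.unlink(missing_ok=True)

    def next_boot_mode(self) -> str:
        return "failsafe" if self.path.exists() else "normal"


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        flag = FailsafeFlag(Path(tmp) / "force_failsafe")
        modes = [flag.next_boot_mode()]       # nothing armed yet
        flag.arm()                            # before making risky changes
        modes.append(flag.next_boot_mode())   # a reboot now would enter failsafe
        modes.append(flag.next_boot_mode())   # ...and still would until disarmed
        flag.disarm()                         # all went well
        modes.append(flag.next_boot_mode())
        return modes
```

The flag only helps if it was armed before the risky change, and it stays armed until someone explicitly disarms it.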

takimata · February 1, 2023, 1:40pm

Interesting. Well I'm glad my idea works, but now I'm struggling as to how it works in your PoC. Seeing as the root not only is mounted later in 80_mount_root, but even after all preinit functions have been hooked by the preinit scripts, how is your 39_failsafe_brickproof -- presumably residing in the rootfs -- picked up by the preinit mechanism in the first place? What am I missing?

stangri · February 1, 2023, 1:59pm

Ah, I'm on ext4. It should still work on squashfs if it's baked into the firmware with the image builder tho, right?

takimata · February 1, 2023, 2:09pm

IMHO it wouldn't work with squashfs. At preinit time -- during the collection of the preinit functions and hooks -- the rootfs overlay is not mounted, so no changes on the rootfs overlay would be visible to the preinit process, including the 39_failsafe_brickproof file.

Ricrdsson · February 1, 2023, 5:55pm

It would be enough if the system starts were counted. For example, 5 starts within 5 minutes would trigger failsafe mode. Then it would be enough to disconnect the power supply a couple of times.
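
A rough sketch of such a start counter, in Python for illustration only. It assumes a small value that survives reboots and can be updated early in boot, which is a big assumption on flash-based devices. Since routers typically have no real-time clock, "within 5 minutes" is approximated here as "consecutive boots that never stayed up long enough". `BootCounter`, `simulate` and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BootCounter:
    threshold: int = 5         # short boots in a row that trigger failsafe
    stable_after_s: int = 60   # uptime after which a boot counts as healthy
    count: int = 0             # stands in for a value kept across reboots

    def on_boot(self) -> str:
        """Called early in boot: bump the counter and pick the boot mode."""
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0
            return "failsafe"
        return "normal"

    def on_uptime(self, uptime_s: int) -> None:
        """Called later: a boot that stays up long enough clears the counter."""
        if uptime_s >= self.stable_after_s:
            self.count = 0


def simulate(uptimes_s: List[int], counter: Optional[BootCounter] = None) -> List[str]:
    """Replay a list of per-boot uptimes and return the mode chosen for each boot."""
    if counter is None:
        counter = BootCounter()
    modes = []
    for uptime in uptimes_s:
        modes.append(counter.on_boot())
        counter.on_uptime(uptime)
    return modes
```

With these numbers, `simulate([5, 5, 5, 5, 5])` ends in failsafe on the fifth start, while one long-running boot in between, as in `simulate([5, 5, 300, 5, 5, 5, 5])`, resets the count. The counter cannot tell who or what cut the power.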

psherman · February 1, 2023, 7:20pm

A power on/off sequence could work... but two considerations:

Is it a safe assumption that power connectivity is easy to control when the device is in a generally inaccessible/difficult to access location? Let's say it's a device in an attic or under the eaves of your roof... if the power outlet or PoE source isn't easy to access from a 'safe' location, you need to either get close to the device to be able to do the power cycling (with the physical power connection at the device) or you need to consider flipping a breaker in this sequence that may impact other devices/appliances. So I'm not sure that this is necessarily a significant improvement over the normal button-based approach.
What happens if there is a sequence of power glitches that causes a false positive? Sure, it's not necessarily common... but if you've ever experienced a situation where a storm or other disruption caused power to go on and off several times in a short period, that type of event would likely cause a false trigger. Then, once it goes into failsafe mode, as I described earlier, it may be difficult to get it back out and booting normally (possibly requiring a bunch of other work such as reconfiguring switches and IP addresses and such).

I'm not sure what signal you're thinking of here... but maybe one could be engineered. There is the WOL "magic packet" which could be used as an example... however the device would need to be listening on the interface... depending on the situation with a critical need for failsafe, that is not a guarantee... in the case of WOL, it is done by MAC address... what happens if the MAC address was overridden by the user (not entirely uncommon)?
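
For reference, the standard WOL magic packet payload is six 0xFF bytes followed by the target MAC address repeated sixteen times. The Python sketch below builds that payload and checks whether a received payload is addressed to a given MAC. It is only an illustration of the MAC dependency; the function names are made up, and nothing here implies that OpenWrt listens for such a packet during boot.

```python
def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN payload for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 6 octets, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # 6 x 0xFF synchronisation bytes, then the MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def is_addressed_to(payload: bytes, own_mac: str) -> bool:
    """True if the payload contains the magic pattern for own_mac."""
    return build_magic_packet(own_mac) in payload
```

A packet built for the factory MAC no longer matches once the user has overridden the address: `is_addressed_to(build_magic_packet("00:11:22:33:44:55"), "02:11:22:33:44:55")` is false.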

These are, thankfully, very few and far between. Yes, they exist, but I can't think of many of them. And yes, at that point, a serial console cable is often the required method.

This seems like a good way to achieve the OP's goals.

hnyman · February 1, 2023, 8:07pm

Just remember that nothing is written to the flash during the preinit phase, and the router has no real-time clock, and the flashed firmware image is static...

x86 etc. devices with ext4 drives are different, but the vast majority of devices have traditional flash, where OpenWrt minimises the wear.

Like takimata said, nothing from overlay is read before the failsafe entrance, as the goal is really to avoid mounting the possibly broken things there. The possible trigger needs to be in the flashed firmware... But it is there always, and how would it decide when to select failsafe... Wait, maybe the user could indicate it with something not involving flash...

Lynx · February 1, 2023, 10:19pm

Is there some form of a power on sequence or ethernet signal sequence, or otherwise, that you can conceive of that might offer a viable generalised option notwithstanding this flash/preinit conundrum? Might not this buttonless trigger warrant an exception to managing flash given its importance in restoring a device without having to break it apart in the context of a buttonless device or without having to climb a ladder in the case of an outdoor mounted device having a present, but highly awkward to access button?

It certainly doesn't seem like this is a solution in search of a problem; the demand and technical problem clearly exist, albeit the challenge is clearly to identify a good technical solution. This is very much the sort of thing that might be the subject of an Intel/HP patent or ten, although I'm sure such a buttonless reset trick even in the context of a networking apparatus is as old as the hills now.

In any case, there are many great minds here, and I'm curious to see if a couple of promising candidates might emerge, even if they end up getting ruled out for one reason or another.

And surely an appropriate trade-off can be made between sufficient complexity to prevent accidental triggers and sufficient simplicity to prevent it from being unduly burdensome.

Reference was made to the American Seinfeld series. As a Brit, I'm reminded of: 'computer says no'.

slh · February 1, 2023, 10:56pm

No, not at all generic - you would have to have vendor buy-in at the bootloader level (akin to Linksys' 3-early-powercuts-in-a-row to switch to the alternative partition).

Ethernet isn't even up (can't be) before the decision between failsafe or normal boot has to be made.

bluewavenet is spot on: without vendor support, the only approach would be out-of-band access using custom (add-on) hardware. I mentioned the esp32 before; add in two relays (one to control primary power, the other, optionally, to bridge the reset button) and, while you're at it, remote serial console access (with that available, you wouldn't even need the second relay for the reset button, as you do have console and U-Boot console access). This is as generic as it gets, but requires opening the device (at least for serial and/or reset access). The esp32 can provide its own little low-range AP for access to the management features - bonus points for coming up with a way to switch that off when not needed (there are already means to cut the AP interface if nothing connects within $x seconds after power-on).

This is obviously easier to accomplish for indoor devices, than outdoor ones, which need to remain water proof.

Lynx · February 1, 2023, 10:58pm

So in that case I'm clearly not understanding something. Are you asserting that it's simply not possible in any shape or form in OpenWrt to trigger a reboot into failsafe?

I can see how that would be a stumbling block.

But it also seems surprising to me. Is that really impossible? OpenWrt is all about making things possible, right?

slh · February 1, 2023, 11:07pm

The keyword here would be "in a generic way", no, you can't.

Obviously I can imagine quite a few approaches to do it nevertheless, but those would be device specific (e.g. exploiting a feature like Linksys' bootloader based partition switching) or require making the decision before rebooting the device (setting a u-boot environment variable - and adding code to the early boot stuff to read out its state again, before mount_root, not very generic either, but at least as a concept possible with quite a few devices). The big drawback of this latter approach would be that it's only acting as a safety net if you set the failsafe variable before doing anything risky (and then it will enter failsafe on the next reboot, regardless of an actual error condition or you just having forgotten to clear it, you wouldn't be able to test the new (potentially faulty) configuration with a reboot), making this approach not very sensible (you may gain extra failures by having forgotten to clear the variable, you can't really test your new configuration (as you may never reboot as part of the testing), in case of unexpected failures there wouldn't be any way to trigger the failsafe either).

slh · February 1, 2023, 11:27pm

Lynx · February 2, 2023, 8:13am

Ah - WiFi serial bridge - that sounds promising. Would it potentially avoid having to recreate a scene from Back to the Future? Is that a viable option?

And your careful explanation above was very helpful for me. Thank you for that.

So I see the challenge associated with recreating in a generic way what the button does given vendor specific configurations.

So if that's not possible in a generic way, then perhaps an appropriate fallback would be to have OpenWrt reboot or otherwise revert into not the failsafe but a safe mode in which the user settings are not applied or are overridden to apply a first boot like configuration from which the bad settings can be resolved in the same way that they can from the failsafe mode.

Isn't there a way forward along those lines?

psherman · February 2, 2023, 8:02pm

A first boot configuration is a reset. If you did that for real, it would erase all of the settings the user had applied. And, even if it was temporary, a default state could potentially be problematic for the network as a whole if it triggers automatically (regardless if it is a false positive or a true positive) -- imagine if the device was a dumb AP, but due to a defaulted state (even if temporary and didn't actually erase the other user settings), enabled the DHCP server and collided with another device like the main router at 192.168.1.1. Although failsafe mode doesn't have a DHCP server enabled, it does use that IP so that could cause a conflict, but at least the button-based approach means that the user is able to physically unplug the device from the upstream network when performing the operations. To be clear, an IP collision would render it unreachable until the conflict is resolved.

The other factor here is determining the triggering criteria... how does the system know that you did something wrong? I mean, sure if there is an unresolvable syntax error, that is a clue. But what if the syntax is correct but you've locked yourself out... how can it know if that was an accident or intentional? Then, if you say that the system should look for a connection from a specific host to the admin functions (ssh or LuCI web interface) after it boots up, that could fall apart if the device is power-cycled outside of a configuration context (for example: a power outage).
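
To make that last ambiguity concrete, here is a tiny Python sketch of a "no admin check-in after boot" rule. The function name, the 300-second deadline and the idea of recording check-in times as seconds of uptime are all assumptions for illustration, not an existing OpenWrt mechanism.

```python
from typing import Iterable


def should_fall_back(checkin_uptimes_s: Iterable[float],
                     deadline_s: float = 300.0) -> bool:
    """True if no admin check-in (ssh/LuCI) arrived within deadline_s of boot."""
    return not any(0 <= t <= deadline_s for t in checkin_uptimes_s)


# Two very different situations produce the same observation:
locked_out_after_bad_config = []   # the admin tries but cannot reach the device
unattended_power_outage = []       # nobody tries; the configuration is fine
```

Both lists are empty, so the rule fires in both cases, while a check-in 42 seconds after boot (`should_fall_back([42.0])`) suppresses it. The rule sees only the absence of a connection, not the reason for it.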

Lynx · February 2, 2023, 10:29pm

There is clearly benefit in facilitating some form of functionality to regain access to a device that has become inaccessible for one reason or another and which either has no reset button or is otherwise hard to physically access.

I can see that you have taken pains to come up with issues relating to a potential solution to the above, but to me it still feels a little 'computer says no'.

Provided the network signal or power cycle signal or other signal is sufficiently complicated, an accidental trigger is not an issue. Can you say with certainty it is simply impossible to manage an appropriate tradeoff between complexity to avoid accidental trigger and simplicity to avoid undue burden? I find that hard to imagine.

And if it's intentional then the user will have to accept issues like IP collision just like with failsafe mode. And sure first boot resets so let's not do that. But again are you so certain we can't come up with an appropriate safe mode?

OpenWrt has generally struck me as 'can do'. Where there is a will there is a way.

psherman · February 2, 2023, 10:53pm

I think you're missing the point... thus far, there has not been a sufficiently robust method proposed that would have the following features:

no false positives
no false negatives
no potential 'failsafe' related collisions/conflicts with the network infrastructure/topology/hosts.

It's not that I'm saying it cannot be done (although I do think it's a rather hard nut to crack)... but rather that nothing presented so far provides a path that would be considered safe for implementation insofar as a generic/universal method to be embedded into the core OpenWrt firmware.

If someone comes up with a robust and safe method, I will change my mind, of course. That may be you and/or others -- I'll give credit where due.

This is not necessarily true. There are threads where people with dual boot systems have had their devices mysteriously boot into a previous OpenWrt version (or stock firmware) with no apparent explanation of why. It turns out that they had power issues and it just so happened that it triggered the Linksys failover method (which is, btw, done at the boot loader level, so well before OpenWrt could affect such a thing). Although it is statistically very unlikely, it is still far more likely that an external power event could trigger the buttonless-failsafe than it is that the hardware button itself would be triggered unintentionally.

Regarding the special "failsafe-magic-packet" idea... maybe this is a possible path. But until someone designs a detailed proposal that could actually work, I still don't know that it is realistic. Fundamentally, if OpenWrt can't bring up the LAN or has some other major fault due to misconfiguration, there's no way to guarantee that it could listen on a given interface for said magic packet.

Yes, that's true. But, remember that we need to guarantee false positive triggers won't happen (it could disable the entire network). And, as I said previously, the button-based approach does mean there is physical proximity which means it is possible to manipulate the connections. If a buttonless method was triggered, even if intentionally, there is still a risk that the physical connections might need to be changed but without an easy way to do this without breaking out the ladder... in which case, what is the purpose of the buttonless method?

IMO, for your scenario, you'd probably benefit from a dual-partition device that switches partitions based on code in the boot loader, actuated by a specific power cycling sequence (for example, a Linksys device). There is a more significant benefit for this -- you could have a known good configuration (valid for your unique network topology) on the alternate partition, so although it's not easy to fix the other partition, at least you're back to where you were.

slh · February 2, 2023, 11:56pm

…and this can indeed be triggered very easily; many power events aren't clean enough to cut out once and come back at full power without hiccups.

Think about the self-induced scenario (e.g. your lawn mower or some other heavy machinery tripping the fuse): a clean power cut from the fuse box - but switching it back on might cause the fuse to blow immediately again, be it because the lawn mower is still connected (think partially severed power cable), or because the combined power-on load is too much for the fuse (electric motors especially have huge short-term power draws until they start spinning, often exceeding the stated wattage by a lot). Bang, you're at your second power cut - only one chance left.

With external causes for the power cut, it's even more likely to trigger the three-interrupted boots, if you think about overhead wires touching in heavy wind, snow/ ice or rain. There are no fuses involved in front of your house, so if things short out intermittently, chances are that the power will 'flap' around for a bit, until total failure.

There really is no way to implement your desired feature safely and in a generic way, without introducing much more severe points of failure that wouldn't exist without the feature that was supposed to make your life easier.

--
Heck, I can even imagine potential issues with my suggested remote serial console access (apart from IP ratings going down the drain by opening the outdoor case), but at least those are more manageable.

Bill · February 3, 2023, 12:19am

NASA and the ESA and all the other SA's would not argue the benefit.

When in doubt... and faced with too much disco, think outside the box.

slh · February 3, 2023, 12:40am

The equivalent to that approach would be adding a remote-controlled electro-mechanical actuator (aka 'finger') to press the reset button for you when needed. I guess no one would question that being a safe, generic and working solution - just not a pretty one.

Things get a lot easier, if you consider things like https://www.washingtonpost.com/business/capitalbusiness/the-air-forces-10000-toilet-cover/2018/07/14/c33d325a-85df-11e8-8f6c-46cb43e3f306_story.html a valid solution to your problems.

Bill · February 3, 2023, 12:44am

Post is behind a Paywall, thankfully. The jest will have to be assumed hilarious.

Judd Hirsch
Julius Levinson
Independence Day