Buttonless Failsafe Mode

Hi Peter! Thanks for taking the time to provide feedback.

The idea would be that the writable partition (not the overlay) is temporarily mounted read-only just long enough to peek in there and grab the key, then immediately unmounted. Much later on, the overlay is mounted normally.
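
For concreteness, here's a minimal sketch of that peek step, assuming a jffs2 overlay on /dev/mtdblock5 and a hypothetical key location (the device node, filesystem type, and key path are all placeholders that would differ per device):

mkdir -p /tmp/peek
mount -o ro -t jffs2 /dev/mtdblock5 /tmp/peek
cp /tmp/peek/upper/etc/failsafe.key /tmp/ 2>/dev/null   # hypothetical key path; overlay trees usually live under /upper
umount /tmp/peek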

Come to think of it, this same mechanism could be used to implement a reboot-to-failsafe, which would make this whole thing a lot simpler: the "go to failsafe" signal could be received later in the boot process, perhaps even when the system is fully booted. Hmm...!

Yeah, fair point. I was wondering about that in the back of my mind the whole time I was mentioning WiFi, but figured I'd include it in case any devices have WiFi interfaces (certain hardmac devices with on-chip firmware) that could be available that early in the boot process. (The real motivation for WiFi support at all is that WiFi interfaces don't have those pesky switchdevs in the way.)

The quote this is under says that the failsafe process brings up the Ethernet interfaces. They can be taken back down again before proceeding in the boot process.

As an aside: I ask that you keep the criticism -- though it is appreciated -- constructive. "X won't work" doesn't give me a path forward. "Y may work better than X" is most preferred as it offers an alternative. "For X to work, we would have to do Z. I am concerned the costs of Z outweigh the benefits of X" allows community ownership of the difficulty of problem Z: perhaps someone will come forward who is willing to do Z, or someone can offer an innovative way of achieving X without Z. Alas, much of the feedback so far in this thread seems to be "this requested feature can't work in the current boot process," which is... pretty circular reasoning given that the request is to modify the boot process! :slight_smile:


While I still believe this mechanism is workable, and it does solve a real-world problem ("How does one get a device into failsafe mode when reaching the reset button entails a personal safety risk?"), I do wonder whether it's practical or would even see much use. There's a certain amount of complexity that would have to be introduced into the boot process that indoor-only users would not need (or, perhaps, tolerate), just for the few users with outdoor devices. It's an important problem, because it addresses a safety concern (we do not want to make our users go up on ladders), but I'm wondering whether specifically targeting failsafe mode is quite the right call here: it's a lot of complexity for one very specific situation.

After sleeping on it, I realized that this type of protocol would be a lot more useful to me personally (in the sense that it would save me having to get up and walk over to the device if I bungle the config) if it could work on a fully-booted device and get me a root shell. The same L2-protocol-only trick is used to bypass as much of the network stack as possible, getting around ip/netfilter misconfigurations -- indeed, the only requirement for access via WiFi Action frames would be that the PHYs be kept awake, which isn't hard at all to do. Such a daemon can be an entirely separate project independent of OpenWrt, available as an optional package for those that want it. (And who knows, it may prove to be so popular that it works its way into the core distribution.) I'd certainly enjoy writing such a thing, too, once I have the free time for it!

Regardless, I believe we should address some of the shortcomings in UCI that led to this problem happening in the first place:

This is a pretty clear sign that UCI is a far cry from LuCI in terms of reliability. @Lynx essentially states that if the same change had been made through the latter, this particular soft-bricking would have been avoided. I think there are three pretty low-hanging fruit we should consider:

  1. Invent a uci apply command, which mirrors LuCI's apply mechanism: it applies (but does not commit) the changes, tells the user they need to run uci apply --confirm to prevent the rollback, and sleeps for 5 seconds before exiting (so the user won't confirm too soon). A uci commit would also cancel the rollback, but uci apply --confirm is the recommended way to go because it won't work after the rollback has already happened. (A rough sketch of this workflow follows the list.)
  2. @Lynx was editing /etc/config/network directly, which (although I do the same myself) I think is pretty bad practice: you are saving changes before testing them. We should probably learn from tools like visudo that when a file is so critical that you could be locked out if you screw it up, you shouldn't be editing it directly. So, we should invent a uci edit command (so I can run uci edit network instead of vi /etc/config/network) which copies the pending config to a temporary location, runs the editor on it, checks whether any changes were made, and if so, does a quick check for syntax errors and for conflicting config changes that happened during the editing, before staging the new file for commit. (And, since it's meant to be an interactive command, it can even suggest that the user run uci apply next.)
  3. Discourage writing to the saved configuration without testing it first! UCI could stick a notice at the top of the /etc/config files that says something like "editing this directly is discouraged, use uci edit instead." The documentation also needs to do a much better job communicating that uci commit should not be used until the changes have been tested. The first example on the UCI page of the user guide currently recommends the opposite.
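
For illustration, here is roughly how that workflow might look at the shell. Note that uci edit, uci apply, and uci apply --confirm are hypothetical commands proposed above (they don't exist today), and the messages and rollback window are placeholders:

root@OpenWrt:~# uci edit network
(editor opens a working copy; on exit it is syntax-checked and staged)
root@OpenWrt:~# uci apply
Changes applied but not committed. Run 'uci apply --confirm' before the rollback
window expires, or the previous configuration will be restored automatically.
root@OpenWrt:~# uci apply --confirm
Rollback cancelled; changes committed.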

@Lynx, would these be a satisfactory resolution if we ditched the "failsafe mode" idea specifically?


The zero false-positive requirement would apply here... what would the condition be for triggering "go to failsafe" later in (or after) the boot cycle?

The failsafe process can be viewed as a super-minimal bring-up of the system (not unlike 'safe mode' in many desktop operating systems). Bringing up the interfaces means you are a bit further along in the boot cycle than when the failsafe boot decision is made. Some devices actually bring up the interfaces for a very short time for tftp or a similar operation as a function of the boot loader, but that is before anything from OpenWrt begins to boot... once the bootloader hands over to OpenWrt, the decision about failsafe is made before the interfaces come up... bringing them up would require that drivers be loaded and an address be assigned, among other things. Once you're that far, you're past the decision point. And if the network is enabled before the decision point, since it needs an address, it would run the risk of causing a conflict on the main network (which could happen any time the device reboots).

This is what I aim to do. My apologies if this has come across as rude or anything other than a technical reason that this approach wouldn't work. If someone asks to build a perpetual motion machine, describing the 2nd law of thermodynamics explains why it's not possible -- even delivering this message kindly could result in a negative reception, because the person asking may feel criticized. My goal is simply to explain, from the facts, why the approach won't work, but I'm honestly trying to deliver that message in the most constructive way possible.

I agree with this philosophy, and I try to employ it whenever I can. In this particular case, I still haven't figured out any way that it could work, so I don't have another approach to suggest. Maybe this is a failure of imagination on my part (or maybe I'm not smart enough). Here, the fact that I'm not providing an alternative approach is not about 'being critical just to be critical', but rather pointing out the things that make it impractical based on the way the system works (if I had another approach, I'd share it and brainstorm with you).

Modifying the boot process would be the only mechanism I can think of that would provide the keys to this kingdom. The devs (especially those familiar with the very early boot process) can probably provide more context, but IIRC, part of the issue is that there isn't a way to run, for example, an initramfs and then cycle into the regular boot. I may be wrong here, but I seem to recall a discussion to that effect.

EDIT: I was just thinking of how another company addressed the issue of physical access to the reset button... Ubiquiti, on some of their older hardware, had a reset button on the PoE injector. They called it "remote reset." The use case for their APs and other hardware means that, by design, many of the devices are really hard to get to (sometimes on top of very large towers or on very high ceilings/roofs), so they had the incentive to develop a solution that was easier and safer. The thinking was that you could reset the device from wherever the PoE injector was located (ideally at the 'easily accessible' end of the cable). It was still hardware based, and it worked for some products, but was not reliable for others. AFAIK, this is no longer offered in any of their current products. In their situation, they control the whole stack (hardware/firmware/software for the individual devices as well as intermediate devices like switches and other infrastructure), so they could have theoretically implemented a 'buttonless' option and/or expanded the offering of this 'remote reset' function. But even this hybrid 'remote reset' thing is mostly past tense. There are many users who wish it were still available, for obvious reasons. But it suggests that Ubiquiti (who has a commercial interest in such a reset option and control of the stack) was not satisfied with the 'half measure' of the PoE-adapter-mounted button when it came to reliability and security, and that they likewise felt a buttonless reset would be extremely difficult to achieve while meeting security and reliability requirements.

Okay, my apologies as well: I had misunderstood your tone. I think we might also both be getting tripped up on the distinction between what's possible and what's sensible. It's clear to me now that when you say "won't work," you mean that it would require far more reworking of a delicate process than you would consider reasonable. I had read it as an assertion that this type of approach is impossible, which I think we both agree is not the case (no perpetual motion machines here!).

I have two or three ideas, but I'm only offering them as-is. I haven't checked if they're easily implemented (again, I'm focused on what's possible and working backwards from there):

  1. Write a file (or magic block) to the writable partition, then reboot. To comply with the zero-false-positive requirement, the partition is then remounted read-write, the signal file/block erased, and the erase confirmed to have hit flash, before the failsafe procedure triggers. (Trading a false positive for a false negative in case the flash can't be written for some reason -- though the admin has bigger problems if the hardware has failed.)
  2. Many embedded devices use a trick where they put a prearranged 32-bit or 64-bit magic value at a prearranged physical address or hardware register (one not reset by the bootloader) before rebooting otherwise "normally," and the early boot sees the magic value, (erases it,) then enters failsafe mode. There's a small (2^-32 or 2^-64) chance of a false positive with this approach, but this is usually considered acceptably low, especially if the user can just power-cycle again. The fly in the ointment with this one is figuring out a suitable physical address for every supported device. (The physical address can be reserved with no difficulty using Linux's memmap= parameter, however.) A rough sketch of this idea follows the list.
  3. I once saw a situation where Linux kexec'd itself: as in, back to the presently-running kernel, not a kernel image loaded from userspace. I didn't look too deeply into how this worked; for all I know, it may have even been a (very warm) reboot directly to the bootloader. But it allowed starting the boot over with a different kernel cmdline. (Again, this option is merely possible, not necessarily doable.)
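
To make idea 2 a bit more concrete, here is a rough, untested sketch. It assumes the page at 0x02bad000 has already been reserved with memmap= and that BusyBox's devmem applet is available; the address and the magic value are arbitrary placeholders:

# requesting failsafe from the running system:
devmem 0x02bad000 32 0xB007FA11
reboot

# ...and in an early-boot check on the next boot:
if [ "$(devmem 0x02bad000 32)" = "0xB007FA11" ]; then
    devmem 0x02bad000 32 0    # erase the flag so the boot after this one is normal
    FAILSAFE=true
fi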

Ah, I wasn't clear: by "failsafe process" I meant /bin/failsafe_server, the process that listens for the failsafe command. (It isn't exactly a "daemon" or I would have called it that. Do you have a preferred term for this thing?)

By "address" do you mean an IP address or a hardware address? I presume you mean the latter since the former is only needed if you want to speak IP, not if you just want the interface up. But the latter should generally be available that early, shouldn't it? With my devices, at least, the information needed to fetch that address from EEPROM resides in the DTB, so the address is correct as soon as the driver finishes binding. (Even if the hardware address isn't available yet, it's safe to set a random one with the "locally administered" bit set: the speak-only-when-spoken-to rule means the address isn't actually used until the admin wants to trigger failsafe mode.)

There's a good point to be made about needing the drivers loaded that early, though: not just Ethernet but also switchdev (as it may be necessary to put the switch in an all-ports-up-but-isolated mode). But at least I don't think it's a major change to load one or two more drivers in pre-failsafe early-boot? We already need drivers enough to access the GPIO hosting the reset button, after all.

EDIT: I also want to be clear that everything I'm suggesting comes before the failsafe boot decision point (unless a suitable reboot-to-failsafe mechanism can be devised), and that any necessary loading of drivers, looking at flash, configuration of devices, etc. also (temporarily!) occurs before /bin/failsafe_server runs.


Let's just step back for a moment: what are the common reasons for needing a recovery of some sort?

  • flashed a new firmware that doesn't even boot (or omits/misconfigures crucial components, such as the network)
    to fix this, you need the bootloader to save your bacon - but OpenWrt doesn't provide the bootloader, so there's nothing you really can do within OpenWrt, all the more so because you don't know in advance that you're going to need recovery
  • firmware is fine, but a wrong configuration has somehow been applied, or the overlay has been compromised (as in an overlay-full situation)
    this means you can't trust the overlay to be consistent, nor can you assume you'll even be able to mount it in the first place (it may no longer be readable)

If you can find a fix for the former, you no longer need to care about the latter.

But these problems remain (regardless of which of the two you're going to tackle):

  • you need to find some way to reliably trigger the recovery, avoiding false positives, but also not relying on the potentially broken installed firmware and its potentially misconfigured overlay
  • you need some way to connect to the recovery environment, ideally to write a new firmware (akin to push-button tftp) or at least to reach OpenWrt's failsafe environment (this would only guard against very simple configuration issues)
  • you may need to find a way to preconfigure your recovery environment to some extent (e.g. wifi credentials), so something like the Image Builder and custom uci firstboot scripts (not really point-and-click)

I have trouble finding any reliable approach to the former, but I fundamentally don't see any chance of doing the latter. The only conceivable approach would be over wired ethernet, as wireless is way too heavy for bootloader-based approaches - even for an OpenWrt-based failsafe-like environment.

EDIT: and making all of this kind of generic, to work on multiple (~all) different devices…


Such a device is not soft-bricked. Recovering fully bricked devices can get pretty involved and isn't what anybody in this thread is trying to address.

This, on the other hand, is "soft-bricked." It's the specific thing being discussed here.

Avoiding false positives yes, but as I said previously 100.0% reliability is a non-goal. This is meant to be something to try before getting out the ladder, not a replacement for the reset button.

If the writable partition cannot be mounted, even read-only, then there is no need to stop and ask whether the admin wants a normal boot to proceed: we already know that a normal boot can't proceed because the overlay is FUBAR.

This is not a new problem introduced by this approach: even today, you need some way to connect to the failsafe environment if the failsafe is triggered by button press. It's a problem worth solving, sure, but not actually a problem with "buttonless" failsafe specifically.

The "boot to failsafe" command could certainly be made to carry some TLVs or extra kopts to provide this configuration; that's useful as the admin may not be able to predict what the network setup of the device will be at the time the soft-bricking happens.


How do we know this? We, as humans, can figure this out. But the system is just doing what it is told... there are several different scenarios to consider, but fundamentally the system would need to have logic to detect the problem.

  • I could set a valid, but incorrect configuration on a built-in switch (such as excluding a port from all networks/VLANs), causing me to lose access, but the system is running properly.
  • Similarly, I could set a /32 network which would be valid for an interface, but not for the rest of the network.
  • I could set input=reject on the lan/management network, and again lose access. How does the system know this is a problem (vs legit... maybe the intent is to only provide access via serial?)
  • There are a bunch of actually invalid situations where syntax is wrong or other problems, but you'd have to have code to parse the resulting errors to decide if it is a 'failsafe-required' situation.

Once detected, how would you then inform the failsafe boot decision later? (You'd have to write the result somewhere, but you can't read it back because no rw partitions are mounted yet.)


None of the bulleted examples render the overlay non-mountable.

Correct, but they render the device unreachable and soft-bricked -- it is this issue that is most likely to cause a user to need to use failsafe. This happens far more often than a failed overlay partition. If the overlay partition has failed, it is highly likely that there is a larger issue with the flash memory.

The quote at the top of your previous post was me talking about the overlay being FUBAR, but your bullet points then described situations where the overlay is intact but contains a configuration that makes the device unmanageable.

Yes... that's true. I missed the original context of it being a physical issue with the overlay and thought of it as an issue with the contents in the overlay.

But still, how often have you encountered a failed overlay (an actual read issue) vs bad data/configs on there?

Literally never. I'm just as surprised as you that @slh brought it up as a serious problem that had to be addressed.

How about a different approach? If you must keep the router outside, what's stopping you from removing the reset switch, soldering two wires in its place, and routing the wires all the way into your home?

You can even replace the switch with something fancier and use it like a failsafe button - if you mess something up, just hit the button.

I'd argue it's even more secure this way... With the current setup, anyone can walk by, enter failsafe mode, hook up their laptop, and mess with your router (however unlikely that may be :cowboy_hat_face:).

I personally do not want to see any rushed software failsafe implemented into OpenWrt... Just think of all the possible attack points against that...

  • Trusting a device on a certain LAN port spells disaster...

Malicious software on your PC could reboot the router into failsafe mode and, from there, get full access.

Any scenario where someone can unhook a LAN cable is gg.

(For this to work, you can't rely on simply trusting a LAN port; you need a password/key of some sort, with the whole process protected against password guessing...)

this is something that will never happen upstream because the applications of it seem to differ for everyone; you may be able to introduce some kind of extension to a package

if you really want to track network interfaces coming up, just use the watchcat package, which pings a host and, if there's no ping for a specific period of time, can run a script. That script can either restore a backed-up or copied config or just do a factory reset and reboot

So, it turns out that this assumption -- that the only way to communicate information to the next boot without relying on flash is with the cooperation of the bootloader -- is false. I grabbed a spare travel router of mine and hacked up a quick proof of concept to illustrate how:

The idea is to reserve a single 4K page of RAM to stash some volatile information that can survive reboots. The page should be somewhere near the middle of the physical memory range, where the bootloader and early kernel do not use it. My travel router's system RAM region is 00000000-03ffffff, so I figured that 02BADxxx would be a fitting magic number for something meant for "failsafe" use. :slight_smile:

First I added memmap=4K$0x2bad000 to my kernel command line, which causes the kernel to reserve the page early in the boot process. Then I defined a RAM-backed MTD device at that location:

root@OpenWrt:~# cat /sys/firmware/devicetree/base/bootcfg@2bad000/compatible 
mtd-ram

As you can see, it preserves its contents across warm reboots:

root@OpenWrt:~# sha256sum /dev/mtd0
1a6f70682c46ced47ddb08071cdd49ac8623082f0a8fe90cc164d2e9b6de33ef  /dev/mtd0
root@OpenWrt:~# dd if=/dev/urandom of=/dev/mtd0
dd: error writing '/dev/mtd0': No space left on device
9+0 records in
8+0 records out
root@OpenWrt:~# sha256sum /dev/mtd0
89c85779ab017a6f11e977fc6c9275493c820a4c32eca66da9b698018c79f04f  /dev/mtd0
root@OpenWrt:~# reboot

...
...
...

BusyBox v1.36.0 (2023-03-18 11:47:48 UTC) built-in shell (ash)

  _______                     ________        __
 |       |.-----.-----.-----.|  |  |  |.----.|  |_
 |   -   ||  _  |  -__|     ||  |  |  ||   _||   _|
 |_______||   __|_____|__|__||________||__|  |____|
          |__| W I R E L E S S   F R E E D O M
 -----------------------------------------------------
 OpenWrt SNAPSHOT, r22307-4bfbecbd9a
 -----------------------------------------------------
root@OpenWrt:~# sha256sum /dev/mtd0
89c85779ab017a6f11e977fc6c9275493c820a4c32eca66da9b698018c79f04f  /dev/mtd0
root@OpenWrt:~#

At this point it became trivial to write a reboot_failsafe script, which simply fills that MTD with ASCII "FAILSAFE" repeated 512 times, and a /lib/preinit/35_check_failsafe_flag which looks for this bit pattern, overwrites it with /dev/zero, then sets FAILSAFE=true. Et voilà: a robust, specific, flash-free, vendor-neutral reboot-to-failsafe mechanism. From there, a device admin can create whatever software trigger they deem suitable (such as being unable to ping a host for X minutes, or "SOS" tapped out in Morse code on link-up/link-down events, ...).
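
For the curious, both pieces are tiny. The sketch below is a reconstruction rather than the exact files from my device: the preinit hook name, the script ordering, and the assumption that the reserved page registers as /dev/mtd0 may all need adjusting per device.

#!/bin/sh
# /sbin/reboot_failsafe (sketch) -- fill the reserved mtd-ram page with the marker, then reboot
i=0
while [ $i -lt 512 ]; do printf 'FAILSAFE'; i=$((i + 1)); done > /dev/mtd0
reboot

# /lib/preinit/35_check_failsafe_flag (sketch) -- sourced by preinit, so no shebang
check_failsafe_flag() {
    # Checking only the first 8 bytes is a simplification of "look for the full pattern".
    if [ "$(dd if=/dev/mtd0 bs=8 count=1 2>/dev/null)" = "FAILSAFE" ]; then
        dd if=/dev/zero of=/dev/mtd0 bs=4096 count=1 2>/dev/null   # clear the flag first
        FAILSAFE=true
        export FAILSAFE
    fi
}
boot_hook_add preinit_main check_failsafe_flag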

This type of mechanism can have several other uses as well, such as tracking whether the last reboot was expected or not, stashing a partial kernel panic traceback, counting failed boots, carrying command-line args from reboot_failsafe that override /lib/preinit/00_preinit.conf, and so on. The physics of DRAM is even such that the contents can survive the chip being powered down for more than a few seconds, so it can also count power cuts. Remember: this is RAM, not flash, so it doesn't have the same write/erase limitations.


nice, however...

failsafe mode is useless if a script can still be executed...

let's not forget the original poster's actual problem

what makes more sense to me would be an automatic backup of the current config, erasing the config files for device access (network, firewall), running config_generate, and rebooting
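
A rough sketch of what such a script could look like, assuming a squashfs+overlay image (so the stock defaults are still readable under /rom); the backup location and the choice of files to reset are placeholders:

#!/bin/sh
# sketch only -- back up the current config, reset the access-critical files, reboot
cp -a /etc/config "/etc/config.broken.$(date +%s)"    # keep the bad config for post-mortem
rm -f /etc/config/network /etc/config/firewall
config_generate                                       # regenerates a default /etc/config/network
cp /rom/etc/config/firewall /etc/config/firewall      # restore the stock firewall defaults
reboot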

What's the first thing you do in a panic, if the device doesn't come up?
…cut power and restart, RAM contents gone.

(you don't know in advance if you're going to need recovery either).

Please elaborate?

a) In my testing (on my specific device only), the RAM contents stay intact for power cuts of up to about 90 seconds.
b) The reboot_failsafe command triggers a reboot immediately. The time window where the "FAILSAFE"*512 string is in there is something like 5-10 seconds.


Yes, and I've had that mounted outdoors for a couple of years now, giving me a permanent console connection to the NR7101. Using a Bluetooth dongle in a Unifi AC Pro with conserver on the other side of the wall. Works perfectly in combination with the ubus power control provided by the ZyXEL GS1900-10HP powering it.

That's what I call a nice OpenWrt-based solution :-)

The Bluetooth serial has a couple of downsides though. No security worth mentioning. And the range is limited so you need another device close by. But if you can live with those then it's really nice. I'm using it on a number of other devices too. Just love permanent console connections without the cable mess and case holes.

EDIT: Here's another one - a Unifi 6 Lite mounted rather inaccessibly in the attic:

The plastic bag insulation is the result of infinite laziness. It was there.

The case snapped right back on with absolutely no signs of the modification. Cables next to the WiFi antenna and a Bluetooth antenna close to the RF shield aren't perfect, but I haven't noticed any issues. I don't have any device close enough for a permanent console, but being able to connect when necessary from my laptop on the floor below was crucial to solving the dtb offset bug.

The Bluetooth modules I use are called JDY-31 on AliExpress. I prefer modules that come without additional circuitry like a 5V power supply, since they are connected directly to 3.3V.

  1. break config
  2. script automatically launches failsafe mode
  3. use failsafe mode to fix config

why do we need 3 steps?

  1. break config
  2. script automatically fixes config and reboots

I do think it's a great idea to update UCI to make it harder to break the config in the first place (see my numbered list here), and using an automatic rollback to a previously-working config when a new config isn't confirmed in time is an excellent way to achieve that. Sometimes, though, despite everything, the device ends up on a bad config, even if it thinks the bad config is the "emergency fallback." If this never happened, OpenWrt would not need a failsafe mode at all!

The point of a reboot_failsafe command is to have something in the core firmware that an optional package/daemon can invoke if it receives some (as yet unspecified) out-of-band signal from the admin that they're locked out and failsafe mode is desired. This signal may not be able to carry any more information than simply "I am locked out, help!"; otherwise, it would indeed be a good idea to let the admin upload/specify a config backup through that out-of-band channel.