Buttonless Failsafe Mode

That probably isn't safe enough. A magic packet should not be repeatable. We don't want a compromised system on our LAN to listen for these and then simply be able to replay the packet.
I think a fully encrypted communication channel providing security equal to that of ssh or https would be needed.

Ah yes. I wondered whether that would be feasible within a magic packet context and it seems like something along these lines is the subject of this Intel patent:

Currently, this wake mechanism is insecure. In other words, computing devices or platforms do not sufficiently protect against spurious or malicious wake events. A so-called “sniffer” can monitor the packet sent over the communications network used by the two systems. A malicious person can detect such packets and replay them at a later time. A variation of the wake mechanism is referred to as “magic packet+password”. The “magic packet+password” is similar to a packet of the “magic packet” but includes an additional six-byte password appended to end. While the “magic packet+password” mechanism does have a password, the password is nonetheless sent unencrypted and susceptible to a replay attack in the same manner as the “magic packet”.
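To make the replay problem in the quoted passage concrete, here is a minimal Python sketch of the conventional magic packet payload (six 0xFF bytes followed by sixteen repetitions of the target MAC), with the optional six-byte password appended. The function name, the example MAC, and the example password are made up for illustration. Since nothing in the payload changes between sends, a sniffed copy is byte-for-byte identical to a legitimate one.

```python
def build_magic_packet(mac: str, password: bytes = b"") -> bytes:
    """Return a Wake-on-LAN style payload for the given MAC address."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    if len(password) not in (0, 6):
        raise ValueError("password must be empty or exactly 6 bytes")
    # Sync stream, then the MAC repeated 16 times, then the cleartext password.
    return b"\xff" * 6 + mac_bytes * 16 + password


# Hypothetical target and password, for illustration only.
sent = build_magic_packet("02:00:00:12:34:56", b"secret")
replayed = bytes(sent)  # what an eavesdropper would capture and resend

print(len(sent))         # 6 + 96 + 6 bytes
print(replayed == sent)  # the receiver cannot tell the two apart
```

The password travels in the clear at the end of the payload, so it adds nothing against an attacker who can already see the packet.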

Upon waking the necessary hardware, network controller 170 may allow management controller 180 to establish a secure or encrypted communication session between network controller 170 and the remote console.

Albeit our context is trickier because setting up that communication channel is rendered harder (or impossible?) given the lack of connectivity we are attempting to address.

I admit I'm beginning to lose heart a little now.

idea is great, but highly insecure

whatever approach is chosen:

  1. router SEND magic packets to some dst mac/ip/port knock and then bring up ssh daemon for short time on positive reply from other side

  2. router LISTEN for some mac/ip/port knock combo and then bring up ssh daemon on positive match

problem is that there must be some kind of password to access the device

if the password is the same for all OpenWrt devices, then it is no password anymore :slight_smile:
so the only unique password that could be baked in at boot time is something based on the device MAC address
and that is far from a strong password

I don't think I have ever had a need for such a feature (I have never had devices in hard-to-reach places) but the problem has nonetheless successfully nerd-sniped me. From the thread, I am under the impression that the go/no-go for failsafe mode must happen before the overlayfs is established, and there's no mechanism for rebooting into failsafe mode (as indeed a good option might be to fork a long sleep followed by a reboot-to-failsafe into the background before touching the network config, killing it when done).

I agree fully that the design must permit NO false positives (accidental or malicious) and must not interfere with the rest of the network. But I do not think it is a reasonable or feasible goal to make the software failsafe 100% reliable (as indeed that's what the button is for). The goal here is just to have something else to try before one has to resort to climbing out of one's window. To that end, anything that works in 90-95% of cases should be "good enough" (and is obviously a massive improvement over the 0% we have today).

This is a sketch of a solution, not an actual proposal, but:

  1. On first boot, a failsafe key is randomized and saved to /etc/config/system. The idea is that this gets included in configuration backups, so that the device admin is likely to have it when they end up needing it.
  2. In preinit, before the normal/failsafe decision but after /overlay is mounted(?), the key is parsed out of /etc/config/system and a lightweight "failsafe service" is launched. The boot process waits for this service to run to completion before deciding how to proceed.
  3. The "failsafe service" scans for all Ethernet interfaces and (if available this early) 802.11 interfaces. They are brought up (the latter in monitor mode, on channel 1 or 36 depending on band) in a strict speak-only-when-spoken-to fashion, to avoid interfering with the network.
  4. The service listens for 500ms(?) for magic packets on the wired (with a magic EtherType) and wireless (802.11 Action frames) interfaces. After the timeout, proceed with normal boot.
  5. If one arrives, check that it has a valid HMAC (using the configured key) and see if it is a challenge packet or a failsafe packet.
  6. For a challenge packet, randomize a nonce (IFF not done yet this boot) and transmit it back to the sender. This protects against replay attacks. For a failsafe packet, ensure that it contains a nonce and matches the one that was randomized, then proceed to failsafe boot.
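Steps 5 and 6 can be sketched in Python as follows. This is only an illustration of the challenge/nonce idea, not a wire format anyone has agreed on: the packet layout (one type byte, payload, HMAC-SHA256 tag), the type numbers, the 16-byte nonce, the class name, and the choice to authenticate the nonce reply as well are all assumptions of mine.

```python
import hashlib
import hmac
import secrets

TAG_LEN = 32                               # HMAC-SHA256 digest size
CHALLENGE, NONCE_REPLY, FAILSAFE = 1, 2, 3  # hypothetical packet types


def seal(key: bytes, kind: int, payload: bytes = b"") -> bytes:
    """Build type byte + payload + HMAC tag."""
    body = bytes([kind]) + payload
    return body + hmac.new(key, body, hashlib.sha256).digest()


def unseal(key: bytes, packet: bytes):
    """Return (kind, payload) if the tag verifies, else None."""
    if len(packet) < 1 + TAG_LEN:
        return None
    body, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    return body[0], body[1:]


class FailsafeListener:
    """State kept by the device for one boot."""

    def __init__(self, key: bytes):
        self.key = key
        self.nonce = None            # randomized at most once per boot
        self.enter_failsafe = False

    def handle(self, packet: bytes):
        parsed = unseal(self.key, packet)
        if parsed is None:
            return None              # bad HMAC: stay silent
        kind, payload = parsed
        if kind == CHALLENGE:
            if self.nonce is None:
                self.nonce = secrets.token_bytes(16)
            return seal(self.key, NONCE_REPLY, self.nonce)
        if (kind == FAILSAFE and self.nonce is not None
                and hmac.compare_digest(payload, self.nonce)):
            self.enter_failsafe = True
        return None


key = secrets.token_bytes(32)        # stands in for the key saved on first boot
device = FailsafeListener(key)
nonce_reply = device.handle(seal(key, CHALLENGE))
_, nonce = unseal(key, nonce_reply)
failsafe_packet = seal(key, FAILSAFE, nonce)
device.handle(failsafe_packet)
print(device.enter_failsafe)         # admin completed the exchange

rebooted = FailsafeListener(key)     # next boot picks a fresh nonce
rebooted.handle(seal(key, CHALLENGE))
rebooted.handle(failsafe_packet)     # replay of the captured packet
print(rebooted.enter_failsafe)
```

A captured failsafe packet is only useful during the boot whose nonce it carries, which is what gives the replay protection described in step 6.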

I'm reasonably confident that this (or something like it) satisfies the "do not interfere" and "no false positives" requirements. There is no guarantee that there are no false negatives, as a few things can still go wrong:

  • Admin may not be able to send the necessary magic packets: either they are remote and not on the same L2 segment, or their computer does not permit sending raw Ethernet frames and/or injecting wireless Action frames
  • Admin may not actually have any backup of the device's config handy
  • Device might be unable to initialize EITHER 802.11 or Ethernet interfaces that early if the kernel modules don't load or they both require firmware
  • Device's 802.11 WNIC might not support frame injection to send the challenge nonce, forcing the admin to use Ethernet
  • Device might have been soft-bricked in some way that kills this mechanism (the failsafe key was deleted, or the network drivers corrupted, or ...)

I don't see any of these cases being extremely likely, though. I still think the reliability percentage is somewhere in the 90s. And, again, the point is just to have something for folks like @anon10117369 to try before resorting to riskier things like leaning out the window and/or climbing a ladder. We aren't replacing the hardware button, just augmenting it.

I like the thinking here, but I don't think that it will work...

This key would need to be written to the overlay. No issue writing it there on first boot, but...

The key was written to overlay, so this can't happen until overlay is mounted. But overlay doesn't get mounted until later in the boot cycle -- after the normal/failsafe decision has been made.

WiFi doesn't come up until way later in the boot cycle... you need to load a lot of things before you can load the drivers and start a network stack on those interfaces. Ethernet isn't even up at the time of the failsafe decision.

Can't listen if the interfaces aren't up.

Hi Peter! Thanks for taking the time to provide feedback.

The idea would be that the writable partition (not the overlay) is temporarily mounted read-only just long enough to peek in there and grab the key, then immediately unmounted. Much later on, the overlay is mounted normally.

Come to think of it, this same mechanism could be used to implement a reboot-to-failsafe, which would make this whole thing a lot simpler: the "go to failsafe" signal could be received later in the boot process, perhaps even when the system is fully booted. Hmm...!

Yeah, fair point. I was wondering about that in the back of my mind the whole time I was mentioning WiFi, but figured I'd include it in case any devices had WiFi interfaces (certain hardmac devices with on-chip firmware) that could be available that early in the boot process. (The real motivation for WiFi support at all is because WiFi interfaces don't have those pesky switchdevs in the way.)

The quote this is under says that the failsafe process brings up the Ethernet interfaces. They can be taken back down again before proceeding in the boot process.

As an aside: I ask that you keep the criticism -- though it is appreciated -- constructive. "X won't work" doesn't give me a path forward. "Y may work better than X" is most preferred as it offers an alternative. "For X to work, we would have to do Z. I am concerned the costs of Z outweigh the benefits of X" allows community ownership of the difficulty of problem Z: perhaps someone will come forward who is willing to do Z, or someone can offer an innovative way of achieving X without Z. Alas much of the various feedback so far in this thread seems to be "this requested feature can't work in the current boot process" which is... pretty circular reasoning given that the request is to modify the boot process! :slight_smile:

While I still believe this mechanism is workable, and it does solve a real-world problem ("How does one get a device into failsafe mode when reaching the reset button entails a personal safety risk?"), I do wonder if it's practical or would even see a lot of use. There's a certain amount of complexity that would have to be introduced into the boot process that the indoor-only users would not need (or, perhaps, tolerate) just for the few users with outdoor devices. It's an important problem, because it's addressing a safety concern (we do not want to make our users go up on ladders), but I'm wondering if specifically targeting failsafe mode isn't quite the right call here: it's a lot of complexity for one very specific situation.

After sleeping on it, I realized that this type of protocol would be a lot more useful to me personally (in the sense that it would save me having to get up and walk over to the device if I bungle the config) if it could work on a fully-booted device and get me a root shell. The same L2-protocol-only trick is used to bypass as much of the network stack as possible, getting around ip/netfilter misconfigurations -- indeed, the only requirement for access via WiFi Action frames would be that the PHYs be kept awake, which isn't hard at all to do. Such a daemon can be an entirely separate project independent of OpenWrt, available as an optional package for those that want it. (And who knows, it may prove to be so popular that it works its way into the core distribution.) I'd certainly enjoy writing such a thing, too, once I have the free time for it!

Regardless, I believe we should address some of the shortcomings in UCI that led to this problem happening in the first place:

This is a pretty clear sign that UCI is a far cry from LuCI's reliability. @anon10117369 essentially states that if the same change were done through the latter, this particular soft-bricking would have been avoided. I think there are 3 pretty low hanging fruits we should consider:

  1. Invent a uci apply command, which mirrors LuCI's analogue: It applies (but does not commit) the changes, tells the user they need to run uci apply --confirm to prevent the rollback, and sleeps for 5 seconds before exiting (so the user won't confirm too soon). A uci commit would also cancel the rollback, but uci apply --confirm is the recommended way to go because it won't work after the rollback has already happened.
  2. @anon10117369 was editing /etc/config/network directly, which (although I do the same myself) I think is pretty bad practice: you are saving changes before testing them. We should probably learn from tools like visudo that when a file is so critical that you could be locked out if you screw it up, you shouldn't be editing it directly. So, we should invent a uci edit command (so I can run uci edit network instead of vi /etc/config/network) which copies the pending config to a temporary location, runs the editor on it, checks if any changes were made, and if so, does a quick check for syntax and conflicting config changes that happened during the editing, before staging that new file for commit. (And, since it's meant to be an interactive command, it can even suggest that the user run uci apply next.)
  3. Discourage writing to the saved configuration without testing it first! UCI could stick a notice at the top of the /etc/config files that says something like "editing this directly is discouraged, use uci edit instead." The documentation also needs to do a much better job communicating that uci commit should not be used until any changes are tested. The first example in the UCI page of the user guide is currently recommending the opposite.
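The apply/confirm/rollback behavior proposed in item 1 can be modeled in a few lines of Python. This is a toy state machine rather than anything UCI or LuCI implements: the class name, the 30-second timeout, and the decision that confirming also saves the running config are assumptions for illustration. The clock is injectable so the timing can be exercised without waiting.

```python
import time


class ApplyWithRollback:
    """Toy model of a hypothetical `uci apply` / `uci apply --confirm`."""

    def __init__(self, saved_config, timeout=30.0, clock=time.monotonic):
        self.saved = saved_config    # what survives a reboot
        self.running = saved_config  # what is currently in effect
        self.timeout = timeout
        self.clock = clock
        self.deadline = None         # None means no rollback is pending

    def apply(self, candidate):
        """Activate the candidate without saving it; start the rollback timer."""
        self.running = candidate
        self.deadline = self.clock() + self.timeout

    def poll(self):
        """Revert to the saved config if the timer expired; True if it did."""
        if self.deadline is not None and self.clock() >= self.deadline:
            self.running = self.saved
            self.deadline = None
            return True
        return False

    def confirm(self):
        """Keep the running config; fails once the rollback already happened."""
        self.poll()
        if self.deadline is None:
            return False
        self.saved = self.running
        self.deadline = None
        return True


now = [0.0]                                   # fake clock for the demo
guard = ApplyWithRollback({"lan_ip": "192.168.1.1"}, clock=lambda: now[0])
guard.apply({"lan_ip": "10.0.0.1"})
now[0] = 31.0                                 # admin was locked out, never confirmed
late = guard.confirm()
print(late, guard.running)
```

The point of refusing a late confirm is the one made in item 1: an admin who confirms after the rollback should be told that the change is no longer in effect.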

@anon10117369 would these be a satisfactory resolution if we ditched the "failsafe mode" idea specifically?

The zero false-positive requirement would apply here... what would the condition be for triggering "go to failsafe" later in (or after) the boot cycle?

The failsafe process can be viewed as a super minimal bring-up of the system (not unlike 'safe mode' in many desktop operating systems). Bringing up the interface means you are a bit further along in the boot cycle than when the failsafe boot decision is made. Some devices actually bring up the interfaces for a very short time for tftp or other similar operation as a function of the boot loader, but that is before anything from OpenWrt begins to boot... once the bootloader hands over to OpenWrt, the decision about failsafe is made before the interfaces come up... bringing them up would require that drivers are loaded and an address is needed, among other things. Once you're this far, you're past the decision point. And if the network is enabled before the decision point, since it needs an address, it would run the risk of causing a conflict on the main network (which could happen anytime the device reboots).

This is what I aim to do. My apologies if this has come across as rude or anything other than a technical reason that this approach wouldn't work. If someone asks to build a perpetual motion machine, describing the 2nd law of thermodynamics explains why it's not possible -- even delivering this message kindly could result in a negative reception because the person asking may feel criticized. My goal is simply to provide an explanation of why the approach won't work, based on the facts, but I'm honestly trying to deliver that message in the most constructive way possible.

I agree with this philosophy, and I try to employ this whenever I can. In this particular case, I still haven't figured out any way that it could work, so I don't have another approach to suggest. Maybe this is a failure of imagination on my part (or maybe I'm not smart enough). Here, the fact that I'm not providing an alternative approach is not about 'being critical just to be critical', but rather pointing out the things that make it impractical based on the way the system works (if I had another approach, I'd share it and brainstorm with you).

Modifying the boot process would be the only mechanism I can think of that would provide the keys to this kingdom. The devs (especially those familiar with the very early boot process) can probably provide more context, but IIRC, part of the issue is that there isn't a way to run for example an initramfs and then cycle into the regular boot. I may be wrong here, but I seem to recall a discussion to that effect.

EDIT: I was just thinking of how another company addressed the issue of physical access to the reset button... Ubiquiti, on some of the older hardware, had a reset button that was on the PoE injector. They called it "remote reset." The use case for their APs and other hardware means that, by design, many of the devices are really hard to get to (sometimes on the top of very large towers or on very high ceilings/roofs), so they had the incentive to develop a solution that was easier and safer. The thinking was that you could reset the device from wherever the PoE injector was located (ideally at the 'easily accessible' end of the cable). It was still hardware based, and it worked for some products, but was not reliable for others. AFAIK, this is no longer offered in any of their current products. In their situation, they control the whole stack (hardware/firmware/software for the individual devices as well as the intermediate devices like switches and other infrastructure), so they could have theoretically implemented a 'buttonless' option and/or they could have expanded the offering of this 'remote reset' function. But, even this hybrid 'remote reset' thing is mostly past tense. There are many users who wish this was available for obvious reasons. But, it suggests that Ubiquiti (who has a commercial interest in such a reset option and control of the stack) was not satisfied with the 'half measure' of the PoE adapter mounted button when it came to reliability and security, and that they likewise felt a buttonless reset would be extremely difficult to achieve while meeting security and reliability requirements.

Okay, my apologies as well: I had misunderstood your tone. I think we might also both be getting tripped up on the distinction between what's possible and what's sensible. It's clear to me now that when you say "won't work," you mean that it would require far more reworking of a delicate process than you would consider reasonable. I had read it as an assertion that this type of approach is impossible which I think we both agree is not the case (no perpetual motion machines here!)

I have two or three ideas, but I'm only offering them as-is. I haven't checked if they're easily implemented (again, I'm focused on what's possible and working backwards from there):

  1. Write a file (or magic block) to the writable partition, then reboot. To comply with the zero-false-positive requirement, the partition is then remounted as rewritable, the signal file/block erased, and the change to flash confirmed, before the failsafe procedure triggers. (Trading a false positive for a false negative in case the flash can't be written for some reason -- though the admin has bigger problems if the hardware has failed.)
  2. Many embedded devices use a trick where they put a prearranged 32-bit or 64-bit magic value at a prearranged physical address or hardware register (one not reset by the bootloader) before rebooting otherwise "normally," and the early boot sees the magic value, (erases it,) then enters failsafe mode. There's a small (2^-32 or 2^-64) chance of a false positive with this approach, but this is usually considered acceptably low, especially if the user can just power-cycle again. The fly in the ointment with this one is figuring out a suitable physical address for every supported device. (The physical address can be reserved with no difficulty using Linux's memmap= parameter, however.)
  3. I have once seen a situation where Linux kexec'd itself: as in, back to the presently-running kernel, not a kernel image loaded from usermode. I didn't look too deeply into how this worked; for all I know, it may have even been a (very warm) reboot directly to the bootloader. But it allowed starting the boot over with a different kernel cmdline. (Again, this option is merely possible, not necessarily doable.)

Ah, I wasn't clear: By "failsafe process" I meant /bin/failsafe_server; the process that listens for the failsafe command. (It isn't exactly a "daemon" or I would have called it that. Do you have a preferred term for this thing?)

By "address" do you mean an IP address or a hardware address? I presume you mean the latter since the former is only needed if you want to speak IP, not if you just want the interface up. But the latter should generally be available that early, shouldn't it? With my devices, at least, the information needed to fetch that address from EEPROM resides in the DTB, so the address is correct as soon as the driver finishes binding. (Even if the hardware address isn't available yet, it's safe to set a random one with the "locally administered" bit set: the speak-only-when-spoken-to rule means the address isn't actually used until the admin wants to trigger failsafe mode.)
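For the fallback mentioned in the parenthetical, a random address with the "locally administered" bit set is easy to generate. A minimal sketch in Python (the function name is hypothetical):

```python
import secrets


def random_local_mac() -> str:
    """Return a random unicast, locally administered MAC address."""
    octets = bytearray(secrets.token_bytes(6))
    # Bit 1 of the first octet marks the address as locally administered;
    # bit 0 must be clear so the address is unicast rather than multicast.
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{b:02x}" for b in octets)


mac = random_local_mac()
print(mac)
```

With the speak-only-when-spoken-to rule, such an address would only ever appear on the wire in replies to an authenticated challenge.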

There's a good point to be made about needing the drivers loaded that early, though: not just Ethernet but also switchdev (as it may be necessary to put the switch in an all-ports-up-but-isolated mode). But at least I don't think it's a major change to load one or two more drivers in pre-failsafe early-boot? We already need drivers enough to access the GPIO hosting the reset button, after all.

EDIT: I also want to be clear that everything I'm suggesting comes before the failsafe boot decision point (unless a suitable reboot-to-failsafe mechanism can be devised), and that any necessary loading of drivers, looking at flash, configuration of devices, etc. also (temporarily!) occurs before /bin/failsafe_server runs.

Let's just step back for a moment, what are the common reasons for needing a recovery of some sorts?

  • flashed a new firmware that doesn't even boot (or misses/misconfigures crucial components, as in the network)
    to fix this, you need to get the bootloader to save your bacon - but OpenWrt doesn't provide the bootloader, so there's nothing you really can do within OpenWrt, even less as you don't know in advance that you're going to need recovery
  • firmware is fine, but a somehow wrong configuration has been applied or the overlay has been compromised (as in overlay full)
    this means you can't trust the overlay to be consistent, nor that you can even mount it in the first place (it may no longer be readable)

If you can find a fix for the former, you no longer need to care about the latter.

But these problems remain (regardless of which of the two you're going to tackle):

  • you need to find some way to reliably trigger the recovery, avoiding false positives, but also not relying on the potentially broken installed firmware and its potentially misconfigured overlay
  • you need some way to connect to recovery environment, ideally to write a new firmware (akin to push-button tftp) or at least to OpenWrt's failsafe environment (this would only guard against very simple configuration issues)
  • you may need to find a way to preconfigure your recovery environment to some extent (e.g. wifi credentials), so something like imagebuilder and custom uci firstboot scripts (not really point and clicky)

I have a problem finding any reliable approach to the former, but I fundamentally miss any chance of doing the latter. The only conceivable approach would be over wired ethernet, as wireless is way too heavy for bootloader based approaches - even for an OpenWrt based failsafe like environment.

EDIT: and making all of this kind of generic, to work on multiple (~all) different devices…

Such a device is not soft-bricked. Recovering fully bricked devices can get pretty involved and isn't what anybody in this thread is trying to address.

This, on the other hand, is "soft-bricked." It's the specific thing being discussed here.

Avoiding false positives yes, but as I said previously 100.0% reliability is a non-goal. This is meant to be something to try before getting out the ladder, not a replacement for the reset button.

If the writable partition cannot be mounted, even read-only, then there is no need to stop and see if the admin doesn't want a normal boot to proceed: We already know that a normal boot can't proceed because the overlay is FUBAR.

This is not a new problem introduced by this approach: even today, you need some way to connect to the failsafe environment if the failsafe is triggered by button press. It's a problem worth solving, sure, but not actually a problem with "buttonless" failsafe specifically.

The "boot to failsafe" command could certainly be made to carry some TLVs or extra kopts to provide this configuration; that's useful as the admin may not be able to predict what the network setup of the device will be at the time the soft-bricking happens.
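As a rough idea of what carrying TLVs in the command could look like, here is a small encoder/decoder in Python. The one-byte type and one-byte length layout, the type numbers, and the example values are all invented for illustration; nothing here is an existing format.

```python
import struct

# Hypothetical TLV types for a "boot to failsafe" command.
TLV_IPV4_ADDR = 1    # 4 address bytes + 1 prefix-length byte
TLV_KERNEL_OPTS = 2  # extra kernel options as ASCII text


def encode_tlvs(items) -> bytes:
    """Pack (type, value) pairs as type byte, length byte, value bytes."""
    out = bytearray()
    for kind, value in items:
        if not 0 <= kind <= 255 or len(value) > 255:
            raise ValueError("type and length must each fit in one byte")
        out += struct.pack("BB", kind, len(value)) + value
    return bytes(out)


def decode_tlvs(data: bytes):
    """Unpack the format above, rejecting truncated input."""
    items = []
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated TLV header")
        kind, length = struct.unpack_from("BB", data, pos)
        pos += 2
        if pos + length > len(data):
            raise ValueError("truncated TLV value")
        items.append((kind, data[pos:pos + length]))
        pos += length
    return items


options = [
    (TLV_IPV4_ADDR, bytes([192, 168, 1, 1, 24])),
    (TLV_KERNEL_OPTS, b"loglevel=7"),
]
wire = encode_tlvs(options)
print(decode_tlvs(wire) == options)
```

In a real design these bytes would sit inside the authenticated part of the packet, so that the options cannot be altered in transit.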

How do we know this? We, as humans can figure this out. But the system is just doing what it is told... there are several different scenarios to consider, but fundamentally the system would need to have logic to detect the problem.

  • I could set a valid, but incorrect configuration on a built-in switch (such as excluding a port from all networks/VLANs), causing me to lose access, but the system is running properly.
  • Similarly, I could set a /32 network which would be valid for an interface, but not for the rest of the network.
  • I could set input=reject on the lan/management network, and again lose access. How does the system know this is a problem (vs legit... maybe the intent is to only provide access via serial?)
  • There are a bunch of actually invalid situations where syntax is wrong or other problems, but you'd have to have code to parse the resulting errors to decide if it is a 'failsafe-required' situation.

Once detected, how would you then inform the failsafe boot decision later? (you'd have to write the result somewhere, but you can't read that because no rw partitions are loaded yet)

None of the bulleted examples render the overlay non-mountable.

Correct, but they render the device unreachable and soft-bricked. <<<--- it is this issue that is most likely to cause a user to need to use failsafe. This happens far more often than a failed overlay partition. If the overlay partition has failed, it is highly likely that there is a larger issue with the flash memory.

The quote at the top of your previous post was me talking about the overlay being FUBAR, but then your bullet points described situations where the overlay would be intact but containing a configuration that makes the device unmanageable.

Yes... that's true. I missed the original context of it being a physical issue with the overlay and thought of it as an issue with the contents in the overlay.

But still, how often have you encountered a failed overlay (an actual read issue) vs bad data/configs on there?

Literally never. I'm just as surprised as you that @slh brought it up as a serious problem that had to be addressed.

How about a different approach? If you must keep the router outside, what's stopping you from removing the reset switch, soldering two wires in its place and routing the wires all the way into your home?

You can even replace switch with something more fancy and use it like a failsafe button - if you mess something up just hit the button

I'd argue it's even more secure this way... With the current setup anyone can walk by, enter failsafe mode, hook up their laptop and mess with your router (however unlikely that may be :cowboy_hat_face:)

I personally do not want to see any rushed software failsafe implemented into OpenWrt... Just think of all the possible attack points against that...

  • Trusting device on certain LAN port spells disaster....

Malicious software on your PC could reboot router into failsafe mode and from there get full access

Any scenario where someone can unhook LAN cable is gg

( For this to work you can't rely on simply trusting LAN port , you need password/key of some sorts, with whole process protected against password guessing ... )

this is something that will never happen upstream because the applications of it seem to differ for everyone, you may be able to introduce some kind of extension to a package

if you really want to track network interfaces coming up just use the watchcat package which pings a host and if no ping for a specific period of time, it can run a script, that script can either restore a backed up or copied config or just do a factory reset and reboot
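The "no ping for a period of time, then run a script" logic described here boils down to a small timer. The Python sketch below is not how watchcat itself is implemented; the class and function names, the 300-second period, and the injectable clock are assumptions for illustration. `ping_once` shells out to the system `ping` with a count of one and a two-second timeout, and is defined but not called in the demo.

```python
import subprocess
import time


def ping_once(host: str) -> bool:
    """Return True if a single ping to host succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class ReachabilityWatchdog:
    """Fires once the host has been unreachable for a whole period."""

    def __init__(self, period: float, clock=time.monotonic):
        self.period = period
        self.clock = clock
        self.last_ok = clock()

    def report(self, reachable: bool) -> bool:
        """Record one probe result; return True when recovery should run."""
        now = self.clock()
        if reachable:
            self.last_ok = now
            return False
        return now - self.last_ok >= self.period


now = [0.0]                       # fake clock so the demo runs instantly
dog = ReachabilityWatchdog(period=300.0, clock=lambda: now[0])
results = []
for t, reachable in [(100.0, False), (200.0, True), (450.0, False), (500.0, False)]:
    now[0] = t
    results.append(dog.report(reachable))
print(results)
```

On a real device the loop would call `dog.report(ping_once(host))` at intervals and, when it returns True, run whatever recovery action the admin chose (restoring a saved config, or a factory reset and reboot as suggested above).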

So, it turns out that this assumption -- that the only way to communicate information to the next boot without relying on flash is with the cooperation of the bootloader -- proves to be false. I grabbed a spare travel router of mine and hacked up a quick proof-of-concept to illustrate how:

The idea is to reserve a single 4K page of RAM to stash some volatile information that can survive reboots. The page should be somewhere near the middle of the physical memory range, where the bootloader and early kernel do not use it. My travel router's system RAM region is 00000000-03ffffff, so I figured that 02BADxxx would be a fitting magic number for something meant for "failsafe" use. :slight_smile:

First I added memmap=4K$0x2bad000 to my kernel command line, which causes the kernel to reserve the page early in the boot process. Then I defined a RAM-backed MTD device at that location:

root@OpenWrt:~# cat /sys/firmware/devicetree/base/bootcfg@2bad000/compatible 

As you can see, it preserves its contents across warm reboots:

root@OpenWrt:~# sha256sum /dev/mtd0
1a6f70682c46ced47ddb08071cdd49ac8623082f0a8fe90cc164d2e9b6de33ef  /dev/mtd0
root@OpenWrt:~# dd if=/dev/urandom of=/dev/mtd0
dd: error writing '/dev/mtd0': No space left on device
9+0 records in
8+0 records out
root@OpenWrt:~# sha256sum /dev/mtd0
89c85779ab017a6f11e977fc6c9275493c820a4c32eca66da9b698018c79f04f  /dev/mtd0
root@OpenWrt:~# reboot


BusyBox v1.36.0 (2023-03-18 11:47:48 UTC) built-in shell (ash)

  _______                     ________        __
 |       |.-----.-----.-----.|  |  |  |.----.|  |_
 |   -   ||  _  |  -__|     ||  |  |  ||   _||   _|
 |_______||   __|_____|__|__||________||__|  |____|
          |__| W I R E L E S S   F R E E D O M
 OpenWrt SNAPSHOT, r22307-4bfbecbd9a
root@OpenWrt:~# sha256sum /dev/mtd0
89c85779ab017a6f11e977fc6c9275493c820a4c32eca66da9b698018c79f04f  /dev/mtd0

At this point it became trivial to write a reboot_failsafe script, which simply fills that MTD with ASCII "FAILSAFE" repeated 512 times, and a /lib/preinit/35_check_failsafe_flag which looks for this bit pattern, overwrites it with /dev/zero, then sets FAILSAFE=true. Et voilà: a robust, specific, flash-free, vendor-neutral reboot-to-failsafe mechanism. From there, a device admin can create whatever software trigger they deem suitable (such as being unable to ping a host for X minutes, or "SOS" tapped out in Morse code on link-up/link-down events, ...).
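The two halves of that mechanism are small enough to sketch. The version below is in Python purely for illustration (the real pieces would be shell scripts in the image), and the function names are hypothetical. It writes the "FAILSAFE" pattern to the reserved page and, on the checking side, clears the page before reporting a match so the flag cannot trigger twice. A temporary file stands in for /dev/mtd0 in the demo.

```python
import os
import tempfile

PAGE_SIZE = 4096
PATTERN = b"FAILSAFE" * 512  # 8 bytes x 512 = one 4K page


def request_failsafe(path: str) -> None:
    """What a reboot_failsafe helper would do just before rebooting."""
    with open(path, "r+b") as page:
        page.write(PATTERN)


def consume_failsafe_flag(path: str) -> bool:
    """What the early-boot check would do: match, erase, then report."""
    with open(path, "r+b") as page:
        if page.read(PAGE_SIZE) != PATTERN:
            return False
        page.seek(0)
        page.write(bytes(PAGE_SIZE))  # overwrite with zeros
        return True


with tempfile.TemporaryDirectory() as tmp:
    page_path = os.path.join(tmp, "bootcfg.bin")   # stand-in for /dev/mtd0
    with open(page_path, "wb") as f:
        f.write(b"\xff" * PAGE_SIZE)               # arbitrary stale contents
    before = consume_failsafe_flag(page_path)      # no flag set yet
    request_failsafe(page_path)
    first = consume_failsafe_flag(page_path)       # flag seen and erased
    second = consume_failsafe_flag(page_path)      # next boot is normal again
print(before, first, second)
```

Requiring the full 4096-byte pattern rather than a single magic word makes an accidental match from leftover RAM contents extremely unlikely.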

This type of mechanism can have several other uses as well, such as tracking whether the last reboot was expected or not, stashing a partial kernel panic traceback, counting failed boots, carrying command-line args from reboot_failsafe that override /lib/preinit/00_preinit.conf, and so on. The physics of DRAM is even such that the contents can survive the chip being powered down for more than a few seconds, so it can also count power cuts. Remember: this is RAM, not flash, so it doesn't have the same write/erase limitations.


nice, however...

failsafe mode is useless if a script can still be executed...

let's not forget the original poster's actual problem

what makes more sense to me would be an automatic backup of current config, erasing the config files for device access (network, firewall) and running config_generate and reboot