Buttonless Failsafe Mode

Such a device is not soft-bricked. Recovering fully bricked devices can get pretty involved and isn't what anybody in this thread is trying to address.

This, on the other hand, is "soft-bricked." It's the specific thing being discussed here.

Avoiding false positives matters, yes, but as I said previously, 100.0% reliability is a non-goal. This is meant to be something to try before getting out the ladder, not a replacement for the reset button.

If the writable partition cannot be mounted, even read-only, then there is no need to stop and ask whether the admin wants to prevent a normal boot: we already know that a normal boot can't proceed, because the overlay is FUBAR.

This is not a new problem introduced by this approach: even today, you need some way to connect to the failsafe environment if the failsafe is triggered by button press. It's a problem worth solving, sure, but not actually a problem with "buttonless" failsafe specifically.

The "boot to failsafe" command could certainly be made to carry some TLVs or extra kopts to provide this configuration; that's useful as the admin may not be able to predict what the network setup of the device will be at the time the soft-bricking happens.


How do we know this? We, as humans, can figure this out. But the system is just doing what it is told... there are several different scenarios to consider, but fundamentally the system would need to have logic to detect the problem.

  • I could set a valid but incorrect configuration on a built-in switch (such as excluding a port from all networks/VLANs), causing me to lose access even though the system is running properly.
  • Similarly, I could set a /32 network which would be valid for an interface, but not for the rest of the network.
  • I could set input=reject on the lan/management network and again lose access (see the sketch below this list). How does the system know this is a problem vs. legit? Maybe the intent is to only provide access via serial.
  • There are a bunch of actually invalid situations where the syntax is wrong or there are other problems, but you'd need code to parse the resulting errors and decide whether it's a 'failsafe-required' situation.
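
To make the third bullet concrete, a perfectly valid change like this is all it takes to lock yourself out of LAN management (a hedged example; it assumes the default config, where firewall.@zone[0] is the lan zone):

uci set firewall.@zone[0].input='REJECT'   # router no longer accepts ssh/LuCI from lan
uci commit firewall
/etc/init.d/firewall restart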

Once detected, how would you then inform the failsafe boot decision later? (You'd have to write the result somewhere, but you can't read it back at boot because no rw partitions are mounted yet.)


None of the bulleted examples render the overlay non-mountable.

Correct, but they render the device unreachable and soft-bricked. <<<--- it is this issue that is most likely to cause a user to need to use failsafe. This happens far more often than a failed overlay partition. If the overlay partition has failed, it is highly likely that there is a larger issue with the flash memory.

The quote at the top of your previous post was me talking about the overlay being FUBAR, but then your bullet points described situations where the overlay would be intact but containing a configuration that makes the device unmanageable.

Yes... that's true. I missed the original context of it being a physical issue with the overlay and thought of it as an issue with the contents in the overlay.

But still, how often have you encountered a failed overlay (an actual read issue) vs bad data/configs on there?

Literally never. I'm just as surprised as you that @slh brought it up as a serious problem that had to be addressed.

How about a different approach? If you must keep the router outside, what's stopping you from removing the reset switch, soldering two wires in its place, and routing the wires all the way into your home?

You can even replace the switch with something more fancy and use it as a failsafe button - if you mess something up, just hit the button.

I'd argue it's even more secure this way... With the current setup, anyone can walk by, enter failsafe mode, hook up their laptop and mess with your router (however unlikely that may be :cowboy_hat_face:)

I personally do not want to see any rushed software failsafe implemented into OpenWrt... Just think of all the possible attack points against that...

  • Trusting a device on a certain LAN port spells disaster...

Malicious software on your PC could reboot the router into failsafe mode and from there get full access.

Any scenario where someone can unhook a LAN cable is gg.

(For this to work you can't rely on simply trusting a LAN port; you need a password/key of some sort, with the whole process protected against password guessing...)

this is something that will never happen upstream because the applications of it seem to differ for everyone; you may be able to introduce some kind of extension to a package

if you really want to track network interfaces coming up, just use the watchcat package, which pings a host and, if there's no response for a specified period of time, can run a script; that script can either restore a backed-up or copied config, or just do a factory reset and reboot
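
for example, the script handed to such a watchdog could be as simple as this sketch (untested; the backup path is made up for illustration):

#!/bin/sh
# recovery script for a connectivity watchdog (watchcat, cron, ...):
# restore a known-good config backup if one exists, otherwise factory reset
BACKUP=/etc/known-good-config.tar.gz
if [ -f "$BACKUP" ]; then
	sysupgrade --restore-backup "$BACKUP"   # unpack the saved .tar.gz over /etc
else
	firstboot -y                            # no backup: wipe the overlay (factory reset)
fi
reboot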

So, it turns out that this assumption -- that the only way to communicate information to the next boot without relying on flash is with the cooperation of the bootloader -- proves to be false. I grabbed a spare travel router of mine and hacked up a quick proof-of-concept to illustrate how:

The idea is to reserve a single 4K page of RAM to stash some volatile information that can survive reboots. The page should be somewhere near the middle of the physical memory range, where the bootloader and early kernel do not use it. My travel router's system RAM region is 00000000-03ffffff, so I figured that 02BADxxx would be a fitting magic number for something meant for "failsafe" use. :slight_smile:

First I added memmap=4K$0x2bad000 to my kernel command line, which causes the kernel to reserve the page early in the boot process. Then I defined a RAM-backed MTD device at that location:

root@OpenWrt:~# cat /sys/firmware/devicetree/base/bootcfg@2bad000/compatible 
mtd-ram
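
(For reference, the device-tree node behind that output might look something like the fragment below; a sketch only, since the exact parent node and bank-width depend on the SoC:)

bootcfg@2bad000 {
	compatible = "mtd-ram";
	reg = <0x02bad000 0x1000>;	/* the reserved 4K page */
	bank-width = <4>;
};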

As you can see, it preserves its contents across warm reboots:

root@OpenWrt:~# sha256sum /dev/mtd0
1a6f70682c46ced47ddb08071cdd49ac8623082f0a8fe90cc164d2e9b6de33ef  /dev/mtd0
root@OpenWrt:~# dd if=/dev/urandom of=/dev/mtd0
dd: error writing '/dev/mtd0': No space left on device
9+0 records in
8+0 records out
root@OpenWrt:~# sha256sum /dev/mtd0
89c85779ab017a6f11e977fc6c9275493c820a4c32eca66da9b698018c79f04f  /dev/mtd0
root@OpenWrt:~# reboot

...
...
...

BusyBox v1.36.0 (2023-03-18 11:47:48 UTC) built-in shell (ash)

  _______                     ________        __
 |       |.-----.-----.-----.|  |  |  |.----.|  |_
 |   -   ||  _  |  -__|     ||  |  |  ||   _||   _|
 |_______||   __|_____|__|__||________||__|  |____|
          |__| W I R E L E S S   F R E E D O M
 -----------------------------------------------------
 OpenWrt SNAPSHOT, r22307-4bfbecbd9a
 -----------------------------------------------------
root@OpenWrt:~# sha256sum /dev/mtd0
89c85779ab017a6f11e977fc6c9275493c820a4c32eca66da9b698018c79f04f  /dev/mtd0
root@OpenWrt:~#

At this point it became trivial to write a reboot_failsafe script, which simply fills that MTD with ASCII "FAILSAFE" repeated 512 times, and a /lib/preinit/35_check_failsafe_flag which looks for this bit pattern, overwrites it with /dev/zero, then sets FAILSAFE=true. Et voilà: a robust, specific, flash-free, vendor-neutral reboot-to-failsafe mechanism. From there, a device admin can create whatever software trigger they deem suitable (such as being unable to ping a host for X minutes, or "SOS" tapped out in Morse code on link-up/link-down events, ...).
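
In rough sketch form, the two pieces could look something like this (simplified; it assumes /dev/mtd0 is the RAM-backed MTD from above, and registers with boot_hook_add preinit_main the same way the stock preinit scripts do):

#!/bin/sh
# /sbin/reboot_failsafe (sketch): stamp the reserved page, then warm-reboot.
awk 'BEGIN { while (n++ < 512) printf "FAILSAFE" }' > /dev/mtd0   # 8 * 512 = 4096 bytes
reboot

# /lib/preinit/35_check_failsafe_flag (sketch): sourced during preinit, no shebang.
check_failsafe_flag() {
	# only the first 8 bytes are checked here, for brevity
	[ "$(dd if=/dev/mtd0 bs=8 count=1 2>/dev/null)" = "FAILSAFE" ] || return 0
	dd if=/dev/zero of=/dev/mtd0 bs=4096 count=1 2>/dev/null   # clear the flag
	FAILSAFE=true
	export FAILSAFE
}
boot_hook_add preinit_main check_failsafe_flag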

This type of mechanism can have several other uses as well, such as tracking whether the last reboot was expected or not, stashing a partial kernel panic traceback, counting failed boots, carrying command-line args from reboot_failsafe that override /lib/preinit/00_preinit.conf, and so on. The physics of DRAM is even such that the contents can survive the chip being powered down for more than a few seconds, so it can also count power cuts. Remember: this is RAM, not flash, so it doesn't have the same write/erase limitations.


nice, however...

failsafe mode is useless if a script can still be executed...

let's not forget the original poster's actual problem

what makes more sense to me would be: automatically back up the current config, erase the config files for device access (network, firewall), then run config_generate and reboot
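
something along these lines, perhaps (a rough sketch; it assumes a squashfs+overlay image where the pristine defaults are still available under /rom):

#!/bin/sh
# sketch: keep the broken config around for inspection, restore default
# access settings, and reboot
mkdir -p /root/config-before-recovery
cp -a /etc/config/. /root/config-before-recovery/
rm -f /etc/config/network
cp /rom/etc/config/firewall /etc/config/firewall
config_generate      # regenerates a default /etc/config/network from /etc/board.json
reboot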

What's the first thing you do in a panic, if the device doesn't come up?
…cut power and restart, RAM contents gone.

(you don't know in advance if you're going to need recovery either).

Please elaborate?

a) In my testing (on my specific device only), the RAM contents stay intact for power cuts of up to about 90 seconds.
b) The reboot_failsafe command triggers a reboot immediately. The time window where the "FAILSAFE"*512 string is in there is something like 5-10 seconds.


Yes, and I've had that mounted outdoors for a couple of years now, giving me a permanent console connection to the NR7101. Using a Bluetooth dongle in a Unifi AC Pro with conserver on the other side of the wall. Works perfectly in combination with the ubus power control provided by the ZyXEL GS1900-10HP powering it.

That's what I call a nice OpenWrt-based solution :-)

The Bluetooth serial has a couple of downsides though. No security worth mentioning. And the range is limited so you need another device close by. But if you can live with those then it's really nice. I'm using it on a number of other devices too. Just love permanent console connections without the cable mess and case holes.

EDIT: Here's another one - a Unifi 6 lite mounted rather inaccessibly in the attic:

The plastic bag insulation is the result of infinite laziness. It was there.

The case snapped right back on with absolutely no signs of the modification. Cables next to the WiFi antenna and the Bluetooth antenna close to the RF shield aren't perfect, but I haven't noticed any issues. I don't have any device close enough for a permanent console. But being able to connect when necessary from my laptop on the floor below was crucial to solving the dtb offset bug.

The Bluetooth modules I use are called jdy-31 on AliExpress. I prefer modules which come without additional circuitry like a 5 V power supply, since they are connected directly to 3.3 V.

  1. break config
  2. script automatically launches failsafe mode
  3. use failsafe mode to fix config

why do we need 3 steps?

  1. break config
  2. script automatically fixes config and reboots

I do think it's a great idea to update UCI to make it harder to break the config in the first place (see my numbered list here), and using an automatic rollback to a previously-working config when a new config isn't confirmed in time is an excellent way to achieve that. Sometimes, though, despite everything, the device ends up on a bad config, even if it thinks the bad config is the "emergency fallback." If this never happened, OpenWrt would not need a failsafe mode at all!
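
As a rough illustration of that confirm-or-rollback idea (hypothetical paths and timeout, nothing more):

#!/bin/sh
# sketch: apply a risky network change, then roll it back automatically
# unless the admin confirms (by creating /tmp/confirmed) within 2 minutes
cp /etc/config/network /tmp/network.last-good
uci set network.lan.ipaddr='192.168.2.1'   # example change
uci commit network
/etc/init.d/network restart
(
	sleep 120
	[ -f /tmp/confirmed ] && exit 0
	cp /tmp/network.last-good /etc/config/network
	/etc/init.d/network restart
) &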

The point of a reboot_failsafe command is to have something in the core firmware that an optional package/daemon can invoke if it receives some (hitherto unspecified) out-of-band signal from the admin that they're locked out and failsafe mode is desired. This signal may not be able to carry any more information than simply "I am locked out, help!"; otherwise it would indeed be a good idea to let the admin upload/specify a config backup through that out-of-band channel.

IMO, relying on RAM is precarious, at best.

  • The amount of time that data may persist in RAM after a power cut is very likely device-dependent (the power supply architecture, the RAM chips used and the amount of local bypassing, etc.), not to mention it may differ depending on whether the power is cut at the router or at the outlet (as a function of the capacitors in the power brick). It may even depend on chip-to-chip variability of the RAM chips themselves (even within the same brand and part).

  • The contents of RAM are not deterministic at power-on. Data may persist, but may also become corrupted during the power-down/power-up sequence, so such a mechanism may not be reliable. There is a highly unlikely, but non-zero, chance that the random processes that dictate the power-on state of the RAM could even mean that the magic number shows up during the power-on sequence (these chances are indeed really, really low, but still statistically not zero).

  • Further, depending on the chips and/or the system design, there may be a power-on reset that clears the contents of RAM, which would mean that not all devices can utilize such a method.

  • A reliable trigger mechanism still needs to be designed to detect that the router is in a problematic state. It's not trivial, but possible, to detect a state when it comes to actually corrupt or invalid configs. But a config can be completely valid (from a technical standpoint) and still produce undesired effects that cause the router to be unreachable (simple example: input = reject/drop for the lan firewall zone). The system has to be able to determine that this was a mistake vs. intentional (maybe there is an explicit rule that accepts input for a designated host, maybe there's another management network, etc.).


This is why I'm only using it for the reboot_failsafe command, which only tries to keep it in memory across a warm reboot. Much more testing will be needed to determine what's a typical power-cut persistence time. But remember that DRAM is just a capacitor bank itself, and the "cold boot attack" (the malicious version of this same thing) is very difficult to defend against intentionally.

You're... you're going to split hairs over a 1 in 2^32768 chance? Really? Why? What possible benefit could that have besides contrarianism for its own sake? What on earth counts as "negligible" in your book, if not something that's millions of times (EDIT: I previously said "orders of magnitude" by mistake) more unlikely than winning the lottery every week for the rest of your life?

The address is chosen carefully so as to be outside of the ranges initialized by the bootloader and early kernel.

With the existence of a reboot_failsafe command, the admin has the ability to develop and/or install a daemon that triggers a reboot into failsafe mode in response to whatever stimulus the admin deems appropriate. (Personally, I would go for some kind of out-of-band signal, not "try to divine that the configuration is a mistake.")

Yes, it is effectively a capacitor bank, but one that requires frequent refreshing (and we're talking tens of milliseconds). Leaving it unrefreshed/unpowered for any amount of time on a human scale may result in lost/corrupted data in the RAM. All you need for this to fail is one flipped bit in the designated area. And in this case, I'm less worried about the cold boot attack, and more about the fact that this is not likely to be reliable as a mechanism to trigger failsafe.

This would require a lot of testing... and it would have to be done at different temperatures (since the semiconductor device intrinsics are temperature-dependent). Also, who's going to test this across all of the supported devices? Or even a subset that is considered 'representative' (although I'd argue that a subset isn't sufficient to be called representative here)?

And what happens if it is found that PORs are happening on some devices?

I fully admitted that it was a vanishingly small risk... but it is not zero.

This may be true for some devices, but others may perform a POR of the complete memory as part of the power-up sequence.

I'd argue that typically the admin will know some types of failure modes that they might include in the daemon/trigger definitions. However, they also know to avoid setting the problematic configurations that cause them. It's the million other misconfigurations that they won't think of that will cause them to get locked out. Case in point -- I have a Linksys E3000 that I use purely for experimentation. It doesn't have a functional failsafe (the button doesn't actually trigger failsafe for some reason). Two days ago it seemed like it had bricked, even though I was really careful to only work on a separate VLAN for an intentionally invalid configuration I was playing with. It turns out that dnsmasq had crashed... the router was running, but DHCP was dead. During that time, I thought I needed failsafe and tried the buttons (but like I said, they don't work). I never would have thought to set this condition in a 'buttonless failsafe daemon', so it wouldn't have triggered.

I'll repeat once again that I am using this technique only for the reboot_failsafe command, which performs a warm reboot and does not cut power.

In our current understanding of the universe, even events that we consider completely impossible (including, in some theories, certain violations of the conservation laws) carry a nonzero probability of actually occurring. A nonzero probability of anything is not special. It is typical. We don't use those probabilities in civil debates because it's intellectually dishonest (I have heard this fallacy called "appeal to improbability") to suggest that there is any merit to considering them seriously. If we must bring them up, the proper term to use here is "negligible" -- as in, "safe to ignore" -- not "the chances are really really low." The latter is extremely misleading to a layperson browsing this thread, who might otherwise think that you are saying there's a, oh I don't know, "one in ten thousand boots" possibility of inadvertent triggering of failsafe mode. The term is "negligible."

Please: Cut it out. This discussion can't continue like this. Your feedback is good without having to resort to such sophistry. You're above this.