Buttonless Failsafe Mode

From a theoretical perspective, is IP compatibility a fundamental requirement for communicating a signal?

I'm just trying to think of how node A could communicate a detectable pattern to node B without a good IP configuration therebetween.

What about some form of broadcast?

Is NO-CARRIER reflective of power status, such that a sequence can't be generated that way whilst maintaining power? Because otherwise a pattern could be communicated by cycling the carrier in a particular way, right?

Or can transmission rate be set between 1000 and 100 and flip flop therebetween in a specific sequence?

Or how about six unique magic packets:

Or combination of the above?

I don't know enough about ethernet/networking to come up with a strategy but surely there is a way here.

No, an IP is not absolutely required. This can theoretically be done at L2 like WOL.
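
For anyone unfamiliar with what "like WOL" means in practice, here is a minimal sketch of the standard Wake-on-LAN payload: six 0xFF bytes followed by the target MAC repeated 16 times. It needs no IP configuration on the receiver, which simply looks for that byte pattern in an incoming frame (commonly carried in a UDP broadcast or a raw frame with EtherType 0x0842). The function names are my own, and the example MAC is made up.

```python
import re


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN payload for `mac`."""
    hex_digits = re.sub(r"[:\-.]", "", mac)
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Synchronization stream of six 0xFF bytes, then the target MAC 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def is_magic_packet_for(payload: bytes, mac_bytes: bytes) -> bool:
    """Check whether `payload` contains the magic pattern for `mac_bytes`."""
    return (b"\xff" * 6 + mac_bytes * 16) in payload


packet = build_magic_packet("00:11:22:33:44:55")  # example MAC, not a real device
```

Note that nothing in this payload identifies or authenticates the sender, which is where the logistical issues below come from.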

But there are logistical issues here... you wouldn't want this to be possible to engage from the WAN. But depending on the device, the MAC addresses may not be unique between the LAN and WAN... or maybe the WAN is being used as 'just another port' on the switch (which can be done at a higher level within OpenWrt by remapping ports on the switch or bridging ports). Or worse, for multi-WAN configurations where LAN ports are reassigned to serve as WANs, now the MAC that might have nominally been assigned to the LAN is now actually exposed to the WAN. Also, what do you do with a device that has only a single ethernet port (such as an AP or travel router)? So it could be hard to guarantee this method remain secure in all cases and also that it would always work.

Again, this could be triggered accidentally. Physical power issues as described earlier could easily falsely trigger this. So could plugging/unplugging devices (or power cycling/rebooting those devices, or even just changing configurations such that the ethernet port bounces). I even had a physical wiring issue that caused a whole lot of port flapping until I identified and fixed the issue (bad termination).... all of these things could very easily cause a false positive.

Go on, give us a proposal.

It's your idea... I don't want to steal your thunder.

I don't care about thunder. Why not give it a shot.

Give it your very best proposal and then see if it withstands your own scrutiny. Wouldn't that be a nice idea rather than just finding faults with or shooting down all of my suggestions?

I'm sure there is a way here.

How about a magic packet sequence at a specific time after switching on?

But it's not my idea... and I don't have any interest in putting effort into concocting a method for what I kind of see as a solution looking for a problem.

I'm telling you why I don't think these things would work based on my knowledge of networking. If you told me that you wanted to use a plane with a propeller or jet engine to fly to the moon, I'd tell you that there's no air in space, so you need to come up with another method of propulsion. I might not be able to tell you how to build the rocket engine, but I could tell you that it needs to carry its own fuel and oxidizer to be able to burn since there's no oxygen up there to support combustion.

I don't have a solution, but I can tell you that these (so far) are not robust enough for integration into core OpenWrt. There was the "unbrickable" script that was shared earlier.. that seems pretty perfect for your needs, but even that is probably not something that should be included in OpenWrt by default.

If this is something you really want to explore, you should do some R&D around it yourself. If you can find a way that is robust enough for the developers to sign off on the idea and implement it, that would be quite cool.

Making one magic packet or a sequence of them doesn't really matter, as I explained earlier:

RFC 1918: the 10.0.0.0/8 block alone holds 16,777,216 addresses.
HooToo and RouterBOARD are just two examples of devices from separate manufacturers that run in Class A.

MikroTik's first load is an initramfs image to get to the jelly of OpenWrt.

How is this a solution looking for a problem? The device has become inaccessible and there is no reset button, or the button itself is inaccessible. That's a real problem.

Recreating a scene from Back to the Future with a serial console is hardly convenient, nor is having to climb a ladder.

So I strongly dispute that assertion.

But your device does have a reset button.. it's just difficult to access. I get that -- it's hard to access which makes this a problem you want to solve.

Why is the brickproof option not sufficient for your needs?

It might be. But I didn't start this thread to find a way that might work for my device. I could write a script for that.

I started this thread because, had there been an implementation in OpenWrt for a buttonless return to safe mode or whatever, I would have really benefited from it. And I'll bet there are many others who would have really wanted this too at one time or another, and who have had to break open their device and use a serial console, or climb ladders, or whatever.

It's not a solution looking for a problem. It's a real and existing problem. Maybe as you described earlier a tough nut to crack. But categorically not a solution looking for a problem.

Pooh-poohing is a court-martial offence and destroys morale:

Suggests to me that it's indeed practical, as it apparently has been done that way already - and yes, that is from the https://openwrt.org/toh/zyxel/nr7101 device page.

Thankfully I did not need that and it certainly doesn't look so much like a scene from Back to the Future should I ever need it later on. In my case I could get away with just pressing the reset button having climbed a high ladder.

Whilst the latter may have motivated my feature request, as I tried to convey in the very post you replied to, I did not intend managing my particular device to be the subject of my feature request.

I'm hoping future posts on this thread can be more positive and constructive as some of the posts seem to reflect a mind desirous of misunderstanding or tearing down.

The subject of this thread does not reflect a solution looking for a problem, as @psherman suggested in one of his posts above. Facilitating a user-friendly mechanism to regain access when locked out of a buttonless or hard-to-access device is a very real problem. And it can be a time of despair for inexperienced or new OpenWrt users. Preempting the situation with a custom script is one thing, but that's not helpful when the user is already locked out and hoping for a simple way to reset and regain entry without undue stress and difficulty.

Thanks to you, @psherman and others on this thread I see that there are difficulties in implementing a good technical solution to this problem, but I like to think that a good feature request in this helpful new forum category does not need to be presented together with an optimal solution.

I have tried to give viable options to the best of my limited understanding in this technical field like adopting:

  • a power cycling sequence; and/or
  • a magic packet sequence,

and I see there are outstanding challenges, but I don't think these are necessarily insurmountable.

I don't have a potential solution to offer but I do offer a requirement to consider.

Any solution based on network, IP packet or Ethernet frames (magic packet/WOL type) etc., should only be usable by authorized devices. It should not be physically available from any unauthorized system like a compromised device on the wired LAN or wifi.

We don't want malware on another device on our LAN, like malware on a PC, phone or iot device, to DOS our network, or worse, make use of this published feature to compromise our OpenWrt devices.

I wondered about that. Perhaps the magic packet could incorporate a hashed router password? And having a specific window after power-up during which it can be accepted might add further protection?

That probably isn't safe enough. A magic packet should not be repeatable. We don't want a compromised system on our LAN to listen for these and then simply be able to replay the packet.
I think a fully encrypted communication channel providing equal security as ssh or https would be needed.

Ah yes. I wondered whether that would be feasible within a magic packet context and it seems like something along these lines is the subject of this Intel patent:

Currently, this wake mechanism is insecure. In other words, computing devices or platforms do not sufficiently protect against spurious or malicious wake events. A so-called “sniffer” can monitor the packet sent over the communications network used by the two systems. A malicious person can detect such packets and replay them at a later time. A variation of the wake mechanism is referred to as “magic packet+password”. The “magic packet+password” is similar to a packet of the “magic packet” but includes an additional six-byte password appended to end. While the “magic packet+password” mechanism does have a password, the password is nonetheless sent unencrypted and susceptible to a replay attack in the same manner as the “magic packet”.

Upon waking the necessary hardware, network controller 170 may allow management controller 180 to establish a secure or encrypted communication session between network controller 170 and the remote console.

Albeit our context is trickier because setting up that communication channel is rendered harder (or impossible?) given the lack of connectivity we are attempting to address.

I admit I'm beginning to lose heart a little now.

idea is great, but highly insecure

whatever approach is chosen:

  1. router SEND magic packets to some dst mac/ip/port knock and then bring up ssh daemon for short time on positive reply from other side

  2. router LISTEN for some mac/ip/port knock combo and then bring up ssh daemon on positive match

problem is that there must be some kind of password to access the device

if the password is the same for all OpenWrt devices, then it is no password anymore :slight_smile:
so, the only unique password which could be baked at boot time is something based on the device MAC address
and that is far from strong password
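
To illustrate why a MAC-derived password is weak, here is a small sketch under a made-up derivation scheme (`password_from_mac` and the "failsafe:" prefix are hypothetical, not anything OpenWrt does). The point is only that the MAC is not a secret: it appears in the clear in frame headers, so anyone on the same segment who knows the derivation can recompute the password.

```python
import hashlib


def password_from_mac(mac: bytes) -> str:
    """Hypothetical scheme: derive a device password from its MAC address."""
    return hashlib.sha256(b"failsafe:" + mac).hexdigest()[:16]


# The device bakes its password at boot from its own MAC (example value).
device_mac = bytes.fromhex("001122334455")
device_password = password_from_mac(device_mac)

# An attacker on the same L2 segment reads the source MAC from any frame the
# device sends, then runs the same public derivation.
sniffed_source_mac = device_mac
attacker_guess = password_from_mac(sniffed_source_mac)

print(attacker_guess == device_password)
```

Hashing does not help here, because the input is public. Even if the exact interface MAC were not directly observable, the vendor prefix usually is, which leaves very few unknown bits to search.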

I don't think I have ever had a need for such a feature (I have never had devices in hard-to-reach places) but the problem has nonetheless successfully nerd-sniped me. From the thread, I am under the impression that the go/no-go for failsafe mode must happen before the overlayfs is established, and there's no mechanism for rebooting into failsafe mode (as indeed a good option might be to fork a long sleep followed by a reboot-to-failsafe into the background before touching the network config, killing it when done).

I agree fully that the design must permit NO false positives (accidental or malicious) and must not interfere with the rest of the network. But I do not think it is a reasonable or feasible goal to make the software failsafe 100% reliable (as indeed that's what the button is for). The goal here is just to have something else to try before one has to resort to climbing out of one's window. To that end, anything that works in 90-95% of cases should be "good enough" (and is obviously a massive improvement over the 0% we have today).

This is a sketch of a solution, not an actual proposal, but:

  1. On first boot, a failsafe key is randomized and saved to /etc/config/system. The idea is that this gets included in configuration backups, so that the device admin is likely to have it when they end up needing it.
  2. In preinit, before the normal/failsafe decision but after /overlay is mounted(?), the key is parsed out of /etc/config/system and a lightweight "failsafe service" is launched. The boot process waits for this service to run to completion before deciding how to proceed.
  3. The "failsafe service" scans for all Ethernet interfaces and (if available this early) 802.11 interfaces. They are brought up (the latter in monitor mode, on channel 1 or 36 depending on band) in a strict speak-only-when-spoken-to fashion, to avoid interfering with the network.
  4. The service listens for 500ms(?) for magic packets on the wired (with a magic EtherType) and wireless (802.11 Action frames) interfaces. After the timeout, proceed with normal boot.
  5. If one arrives, check that it has a valid HMAC (using the configured key) and see if it is a challenge packet or a failsafe packet.
  6. For a challenge packet, randomize a nonce (IFF not done yet this boot) and transmit it back to the sender. This protects against replay attacks. For a failsafe packet, ensure that it contains a nonce and matches the one that was randomized, then proceed to failsafe boot.
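
Steps 5 and 6 above can be sketched roughly as follows. This is only an illustration of the HMAC plus once-per-boot nonce idea, not a wire format: the one-byte packet kinds, the 16-byte nonce, the SHA-256 tag and the class and function names are all my own assumptions, and the transport (magic EtherType or Action frames) is left out entirely.

```python
import hashlib
import hmac
import secrets

TAG_LEN = 32          # SHA-256 digest size
CHALLENGE = b"C"      # hypothetical packet kinds
FAILSAFE = b"F"


def seal(key: bytes, kind: bytes, body: bytes = b"") -> bytes:
    """Build a packet: 1-byte kind, body, then an HMAC over both."""
    msg = kind + body
    return msg + hmac.new(key, msg, hashlib.sha256).digest()


def unseal(key: bytes, packet: bytes):
    """Return (kind, body) if the HMAC is valid, else None."""
    if len(packet) < 1 + TAG_LEN:
        return None
    msg, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(key, msg, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    return msg[:1], msg[1:]


class FailsafeListener:
    """Device side: speaks only when spoken to with a valid HMAC."""

    def __init__(self, key: bytes):
        self.key = key
        self.nonce = None  # randomized at most once per boot

    def handle(self, packet: bytes):
        """Return ("nonce", bytes), ("failsafe", None), or None to ignore."""
        parsed = unseal(self.key, packet)
        if parsed is None:
            return None
        kind, body = parsed
        if kind == CHALLENGE:
            if self.nonce is None:
                self.nonce = secrets.token_bytes(16)
            return ("nonce", self.nonce)
        if (kind == FAILSAFE and self.nonce is not None
                and hmac.compare_digest(body, self.nonce)):
            return ("failsafe", None)
        return None
```

The admin side would send `seal(key, CHALLENGE)`, read back the nonce, then send `seal(key, FAILSAFE, nonce)`. A failsafe packet captured during one boot is useless on the next, because a fresh listener has a different nonce (or none yet). A captured challenge packet can be replayed, but all it gets the attacker is a nonce they cannot use without the key.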

I'm reasonably confident that this (or something like it) satisfies the "do not interfere" and "no false positives" requirements. There is no guarantee that there are no false negatives, as a few things can still go wrong:

  • Admin may not be able to send the necessary magic packets: either they are remote and not on the same L2 segment, or their computer does not permit sending raw Ethernet frames and/or injecting wireless Action frames
  • Admin may not actually have any backup of the device's config handy
  • Device might be unable to initialize EITHER 802.11 or Ethernet interfaces that early if the kernel modules don't load or they both require firmware
  • Device's 802.11 WNIC might not support frame injection to send the challenge nonce, forcing the admin to use Ethernet
  • Device might have been soft-bricked in some way that kills this mechanism (the failsafe key was deleted, or the network drivers corrupted, or ...)

I don't see any of these cases being extremely likely, though. I still think the reliability percentage is somewhere in the 90s. And, again, the point is just to have something for folks like @Lynx to try before resorting to riskier things like leaning out the window and/or climbing a ladder. We aren't replacing the hardware button, just augmenting it.

I like the thinking here, but I don't think that it will work...

This key would need to be written to the overlay. No issue writing it there on first boot, but...

The key was written to overlay, so this can't happen until overlay is mounted. But overlay doesn't get mounted until later in the boot cycle -- after the normal/failsafe decision has been made.

Wifi doesn't come up until way later in the boot cycle... you need to load a lot of things before you can load the drivers and start a network stack on those interfaces. Ethernet isn't even up at the time of the failsafe decision.

Can't listen if the interfaces aren't up.

Hi Peter! Thanks for taking the time to provide feedback.

The idea would be that the writable partition (not the overlay) is temporarily mounted read-only just long enough to peek in there and grab the key, then immediately unmounted. Much later on, the overlay is mounted normally.

Come to think of it, this same mechanism could be used to implement a reboot-to-failsafe, which would make this whole thing a lot simpler: the "go to failsafe" signal could be received later in the boot process, perhaps even when the system is fully booted. Hmm...!

Yeah, fair point. I was wondering about that in the back of my mind the whole time I was mentioning WiFi, but figured I'd include it in case any devices had WiFi interfaces (certain hardmac devices with on-chip firmware) that could be available that early in the boot process. (The real motivation for WiFi support at all is because WiFi interfaces don't have those pesky switchdevs in the way.)

The quote this is under says that the failsafe process brings up the Ethernet interfaces. They can be taken back down again before proceeding in the boot process.

As an aside: I ask that you keep the criticism -- though it is appreciated -- constructive. "X won't work" doesn't give me a path forward. "Y may work better than X" is most preferred as it offers an alternative. "For X to work, we would have to do Z. I am concerned the costs of Z outweigh the benefits of X" allows community ownership of the difficulty of problem Z: perhaps someone will come forward who is willing to do Z, or someone can offer an innovative way of achieving X without Z. Alas much of the various feedback so far in this thread seems to be "this requested feature can't work in the current boot process" which is... pretty circular reasoning given that the request is to modify the boot process! :slight_smile:


While I still believe this mechanism is workable, and it does solve a real-world problem ("How does one get a device into failsafe mode when reaching the reset button entails a personal safety risk?"), I do wonder if it's practical or would even see a lot of use. There's a certain amount of complexity that would have to be introduced into the boot process that the indoor-only users would not need (or, perhaps, tolerate) just for the few users with outdoor devices. It's an important problem, because it's addressing a safety concern (we do not want to make our users go up on ladders), but I'm wondering if specifically targeting failsafe mode isn't quite the right call here: it's a lot of complexity for one very specific situation.

After sleeping on it, I realized that this type of protocol would be a lot more useful to me personally (in the sense that it would save me having to get up and walk over to the device if I bungle the config) if it could work on a fully-booted device and get me a root shell. The same L2-protocol-only trick is used to bypass as much of the network stack as possible, getting around ip/netfilter misconfigurations -- indeed, the only requirement for access via WiFi Action frames would be that the PHYs be kept awake, which isn't hard at all to do. Such a daemon can be an entirely separate project independent of OpenWrt, available as an optional package for those that want it. (And who knows, it may prove to be so popular that it works its way into the core distribution.) I'd certainly enjoy writing such a thing, too, once I have the free time for it!

Regardless, I believe we should address some of the shortcomings in UCI that led to this problem happening in the first place:

This is a pretty clear sign that UCI is a far cry from LuCI's reliability. @Lynx essentially states that if the same change were done through the latter, this particular soft-bricking would have been avoided. I think there are 3 pretty low hanging fruits we should consider:

  1. Invent a uci apply command, which mirrors LuCI's analogue: It applies (but does not commit) the changes, tells the user they need to run uci apply --confirm to prevent the rollback, and sleeps for 5 seconds before exiting (so the user won't confirm too soon). A uci commit would also cancel the rollback, but uci apply --confirm is the recommended way to go because it won't work after the rollback has already happened.
  2. @Lynx was editing /etc/config/network directly, which (although I do the same myself) I think is pretty bad practice: you are saving changes before testing them. We should probably learn from tools like visudo that when a file is so critical that you could be locked out if you screw it up, you shouldn't be editing it directly. So, we should invent a uci edit command (so I can run uci edit network instead of vi /etc/config/network) which copies the pending config to a temporary location, runs the editor on it, checks if any changes were made, and if so, does a quick check for syntax and conflicting config changes that happened during the editing, before staging that new file for commit. (And, since it's meant to be an interactive command, it can even suggest that the user run uci apply next.)
  3. Discourage writing to the saved configuration without testing it first! UCI could stick a notice at the top of the /etc/config files that says something like "editing this directly is discouraged, use uci edit instead." The documentation also needs to do a much better job communicating that uci commit should not be used until any changes are tested. The first example in the UCI page of the user guide is currently recommending the opposite.
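
To make the apply/confirm/rollback semantics from item 1 concrete, here is a toy model. It is not how UCI or LuCI is implemented: `StagedConfig`, its method names and the 30-second timeout are all hypothetical, and time is passed in explicitly so the behaviour is easy to reason about.

```python
class StagedConfig:
    """Toy model of apply-with-rollback; hypothetical, not real UCI behaviour."""

    def __init__(self, saved: dict, timeout: float = 30.0):
        self.saved = dict(saved)      # what is stored on flash
        self.running = dict(saved)    # what the system is currently using
        self.timeout = timeout
        self.deadline = None          # set while a rollback is pending

    def apply(self, changes: dict, now: float) -> None:
        """Activate changes without saving them, and arm the rollback."""
        self.running = {**self.saved, **changes}
        self.deadline = now + self.timeout

    def tick(self, now: float) -> None:
        """Called periodically; reverts if the deadline passed unconfirmed."""
        if self.deadline is not None and now >= self.deadline:
            self.running = dict(self.saved)
            self.deadline = None

    def confirm(self, now: float) -> bool:
        """Save the running config, unless the rollback already happened."""
        self.tick(now)
        if self.deadline is None:
            return False
        self.saved = dict(self.running)
        self.deadline = None
        return True


cfg = StagedConfig({"lan_ip": "192.168.1.1"})
cfg.apply({"lan_ip": "10.0.0.1"}, now=0)
locked_out_result = cfg.confirm(now=45)   # admin could not reach the device in time
```

In this run the confirmation arrives after the deadline, so it is refused and the running config is back to the saved one. That refusal is the property argued for above: confirming should not work once the rollback has already happened.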

@Lynx would these be a satisfactory resolution if we ditched the "failsafe mode" idea specifically?
