Buttonless Failsafe Mode

IMO, relying on RAM is precarious, at best.

  • The amount of time that data may persist in RAM after a power cut is very likely device dependent (the power supply architecture, the RAM chips used and the amount of local bypassing, etc.), and it may also differ depending on whether the power is cut at the router or at the outlet (as a function of the capacitors in the power brick). It may even depend on chip-to-chip variability of the RAM chips themselves (even when the same brand and part).

  • The contents in RAM are not deterministic at power-on. Data may persist, but also may become corrupted during the power-down/power-up sequence. Such a mechanism may not be reliable. There is a highly unlikely, but non-zero chance that the random processes that dictate the power-on state of the RAM could even mean that the magic number shows up during the power-on sequence (these chances are indeed really really low, but still statistically not zero).

  • Further, depending on the chips and/or the system design, there may be a power-on-reset that clears the contents of RAM, so that would mean that not all devices can utilize such a method.

  • A reliable trigger mechanism still needs to be designed to detect that the router is in a problematic state. It's not trivial, but it is possible to detect actual corrupt or invalid configs. But a config can be completely valid (from a technical standpoint) yet produce undesired effects that cause the router to be unreachable (simple example: input = reject/drop for the lan firewall zone). The system has to be able to determine that this was a mistake vs intentional (maybe there is an explicit rule that accepts input for a designated host, maybe there's another management network, etc.).

This is why I'm only using it for the reboot_failsafe command, which only tries to keep it in memory across a warm reboot. Much more testing will be needed to determine what's a typical power-cut persistence time. But remember that DRAM is just a capacitor bank itself, and the "cold boot attack" (the malicious version of this same thing) is very difficult to defend against intentionally.

You're... you're going to split hairs over a 1 in 2^32768 chance? Really? Why? What possible benefit could that have besides contrarianism for its own sake? What on earth counts as "negligible" in your book, if not something that's millions of times (EDIT: I previously said "orders of magnitude" by mistake) more unlikely than winning the lottery every week for the rest of your life?
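For a sense of scale, here is a rough back-of-the-envelope calculation in Python. It assumes the flag region powers up as uniformly random, independent bits, which is an idealization (real DRAM power-on state is biased). The function name is made up for illustration.

```python
import math

FLAG_BITS = 32768  # flag length quoted in this thread


def log10_false_match(flag_bits: int = FLAG_BITS) -> float:
    """log10 of the chance that `flag_bits` random bits equal one
    specific flag value (assumption: fair, independent bits)."""
    return -flag_bits * math.log10(2)


exponent = log10_false_match()
print(f"P(false match) is about 10^{exponent:.0f}")
```

Under that assumption the chance of an accidental match is on the order of 1 in 10^9864.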

The address is chosen carefully so as to be outside of the ranges initialized by the bootloader and early kernel.

With the existence of a reboot_failsafe command, the admin has the ability to develop and/or install a daemon that triggers a reboot into failsafe mode in response to whatever stimulus the admin deems appropriate. (Personally, I would go for some kind of out-of-band signal, not "try to divine that the configuration is a mistake.")

Yes, it is effectively a capacitor bank, but one that requires frequent refreshing (and we're talking in the tens of ms). Leaving it unrefreshed/unpowered for any amount of time on a human scale may end up resulting in lost/corrupted data in the RAM. All you need for this to fail to work is one flipped bit in the designated area. And in this case, I'm less worried about the cold boot attack, and more about the fact that this is not likely to be reliable as a mechanism to trigger failsafe.
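To illustrate the "one flipped bit" concern, here is a minimal sketch of how an exact-match check degrades. It assumes each bit flips independently with the same probability, which is a simplification (real decay is correlated and temperature dependent). The flip rates below are illustrative placeholders, not measurements.

```python
import math

FLAG_BITS = 32768  # flag length quoted in this thread


def exact_match_survival(p_flip: float, bits: int = FLAG_BITS) -> float:
    """Chance that none of `bits` bits flips, given an independent
    per-bit flip probability `p_flip` (a modelling assumption)."""
    if not 0.0 <= p_flip < 1.0:
        raise ValueError("p_flip must be in [0, 1)")
    # (1 - p)^bits, computed via log1p to stay accurate for tiny p
    return math.exp(bits * math.log1p(-p_flip))


for p in (1e-9, 1e-6, 1e-4):  # hypothetical per-bit flip rates
    print(f"p_flip={p:g}: survival {exact_match_survival(p):.4f}")
```

In this model, a per-bit flip rate of 1e-4 would let the exact flag survive less than 4% of the time, while 1e-6 would let it survive roughly 97% of the time. Which regime a given device falls into is exactly what would need testing.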

This would require a lot of testing... and it would have to be done at different temperatures (since the semiconductor device intrinsics are temperature dependent). Also, who's going to test this across all of the supported devices? Or even a subset that is considered 'representative' (although I'd argue that a 'subset' isn't sufficient to call 'representative' here).

And what happens if it is found that there are PORs happening for some devices?

I fully admitted that it was a vanishingly small risk... but it is not zero.

This may be true for some devices, but others may perform a POR of the complete memory as part of the power-up sequence.

I'd argue that typically the admin will know some types of failure modes that they might include in the daemon/trigger definitions. However, they also know to avoid setting the problematic configurations that cause them. It's the million other misconfigurations that they won't think of that are the ones that will cause them to get locked out. Case in point -- I have a Linksys E3000 that I use purely for experimentation. It doesn't have a functional failsafe (the button doesn't actually trigger failsafe for some reason). Two days ago it seemed like it bricked, even though I was really careful to only work on a separate VLAN for an intentionally invalid configuration I was playing with. It turns out that dnsmasq crashed... the router was running, but DHCP was dead. During that time, I thought I needed failsafe and I tried the buttons (but like I said, they don't work)... I never would have thought to set this condition into a 'buttonless failsafe daemon' and so it wouldn't have triggered.

I repeat myself once again that I am using this technique for the reboot_failsafe command, which performs a warm reboot and does not cut power.

In our current understanding of the universe, even events that we consider completely impossible (including, in some theories, certain violations of the conservation laws) carry a nonzero probability of actually occurring. A nonzero probability of anything is not special. It is typical. We don't use those probabilities in civil debates because it's intellectually dishonest (I have heard this fallacy called "appeal to improbability") to suggest that there is any merit to considering them seriously. If we must bring them up, the proper term to use here is "negligible" -- as in, "safe to ignore" -- not "the chances are really really low." The latter is extremely misleading to a layperson browsing this thread, who might otherwise think that you are saying there's a, oh I don't know, "one in ten thousand boots" possibility of inadvertent triggering of failsafe mode. The term is "negligible."

Please: Cut it out. This discussion can't continue like this. Your feedback is good without having to resort to such sophistry. You're above this.

@CFSworks isn't this rendered moot given:

I mean it's not just about one bit then, right?

Depending on the hardware, a warm reboot may operate very much like a cold boot. Think of any machine that does a RAM test at boot - doesn’t matter if it is warm or cold, if it writes anything to RAM in your magic area, the failsafe reboot fails to perform the task.

It depends on how the flag is checked. Right now, my /lib/preinit/35_check_failsafe_flag as designed checks that the exact 32,768-bit flag is in there exactly. But it could instead check for some maximum tolerable binary Hamming distance if it was desirable for some reason.
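As a rough illustration of the Hamming-distance idea, here is a minimal Python sketch. The real check lives in a preinit shell script, so this is not that implementation. The `FLAG` value is a placeholder (the actual flag contents are not given in this thread), and the function names are hypothetical.

```python
# Placeholder 32,768-bit (4096-byte) flag; the real value is not shown here.
FLAG = bytes(range(256)) * 16


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions at which two equal-length buffers differ."""
    if len(a) != len(b):
        raise ValueError("buffers must be the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def flag_matches(region: bytes, flag: bytes = FLAG, max_distance: int = 0) -> bool:
    """True if `region` is within `max_distance` flipped bits of `flag`.
    max_distance=0 reproduces an exact-match check."""
    return hamming_distance(region, flag) <= max_distance


# Simulate one flipped bit in the stored flag.
damaged = bytearray(FLAG)
damaged[100] ^= 0x08
print(flag_matches(bytes(damaged)))                  # exact match fails
print(flag_matches(bytes(damaged), max_distance=4))  # tolerant match passes
```

Any nonzero tolerance trades a little false-match margin for robustness against flipped bits, so the threshold would have to be chosen with the flag length in mind.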

Since this is for warm boots only, I don't really see much of a benefit. (Though the flag will have to be shortened if the idea of stashing preinit config overrides in there comes to fruition.)

Since my claims are now backed by a test on (admittedly, only) one piece of real hardware (a GL.iNet 6416), I think it's fair to say that the ante for playing at this table is no longer the hypothetical "any machine that ..." but is now the specific OpenWrt-supported machine you have in mind that does do this exact RAM test on every boot. (If you can't name one, the intellectually honest thing to do here is to retract your claim. I am specifically disputing that such things are common.) (EDIT: I should also add that it has to be one that uses the squashfs+jffs2 rootfs combo, making the RAM signaling necessary in the first place, so "well some x86 machines are equipped with memory scramblers ..." does not count.)

Twofold answer here:

  1. Many x86 systems still do a real RAM test. OpenWrt is supported on x86.

  2. For a feature of this nature (which fundamentally alters the boot behaviors) to be seriously considered for implementation as an addition to the core OpenWrt functionality, I believe that it is incumbent on the requesters to test on a broad selection of hardware to prove that it actually works reliably enough to be universally useful. The burden of proving it doesn’t work shouldn’t be on the people who are skeptical - instead those who believe this is a viable option must put the effort in to show conclusively that it does work.

Think of this like the shift in signing off on the launch of a crewed spacecraft after the Challenger disaster. Prior to the accident, management asked the engineers to prove that it wasn’t safe to fly. After, they had to prove that it was safe (as safe as can be, given space flight is inherently dangerous) by showing that they have considered everything possible regarding the safety systems and procedures.

Please don't ignore the edit that I made before you replied where I stipulated the reasons that x86 does not count.

Please, again: Stop being like this. I really want to have a civil and honest, mutually-beneficial, collaborative conversation with you about this. Your lack of careful reading of my points, reliance on logical fallacies, ignoring the matter entirely (rather than correcting it) when I identify the fallacies, and gatekeeping of the information I need (e.g. specific model numbers of machines) to assess whether there are any merits to the whataboutisms and how the idea can be adjusted to support those machines is just not conducive to improvements to the project. :confused:

Hey all, I have been reading this whole thread just now, having missed it when it started.

I just wanted to mention a relevant issue: if there is a magic network packet, I strongly recommend that it use IPv6 link-local addressing, thereby avoiding any network config issues at all.

The router could generate a nonce and send it via IPv6 multicast. Whoever cared to trigger the safe mode could generate a reply hash of the nonce+password and send it via IPv6 link-local to the saddr associated with the announced nonce.
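The hashing step of that exchange could look roughly like the Python sketch below. It swaps in HMAC-SHA256 in place of a plain hash of nonce+password, which is my substitution and not part of the proposal above. The function names are hypothetical and the network transport is left out entirely. It also assumes the router can produce an unpredictable nonce, which depends on the quality of its entropy source.

```python
import hashlib
import hmac
import secrets


def make_nonce(n_bytes: int = 16) -> bytes:
    """Router side: a fresh random nonce to announce via multicast."""
    return secrets.token_bytes(n_bytes)


def make_reply(nonce: bytes, password: bytes) -> bytes:
    """Admin side: keyed hash binding the password to this one nonce."""
    return hmac.new(password, nonce, hashlib.sha256).digest()


def verify_reply(nonce: bytes, password: bytes, reply: bytes) -> bool:
    """Router side: recompute the expected reply, compare in constant time."""
    return hmac.compare_digest(make_reply(nonce, password), reply)


nonce = make_nonce()
reply = make_reply(nonce, b"example-password")
print(verify_reply(nonce, b"example-password", reply))         # accepted
print(verify_reply(make_nonce(), b"example-password", reply))  # replay rejected
```

Because each announced nonce is used once, a captured reply is useless against a later nonce, which is what gives the scheme its replay resistance.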

Thanks @dlakelan - interesting point. That could perhaps be a virtual press for our virtual button.

I recommend sending the nonce to ff02::fa17:5afe :sunglasses:

Why does it “not count”? X86 is a significant platform. That is like saying that pickup trucks don’t count as personal vehicles. And for the record, I didn’t see your edit when I was writing.

I think you may be misinterpreting my comments. I’m saying that the ideas, while clever, are not currently proven robust enough to pass muster for a critical functional change to OpenWrt. My arguments are not logical fallacies just because you don’t agree with them.

I’m not the gatekeeper here. I cannot possibly know which devices POR the memory and which don’t. But I expect that a solution presented in this discussion would have testing by its authors to provide insight into the universality of the method. Right now you’ve said it works on your singular device. That is a long way from a universal solution.

I swear by IPv6 LLA. I have a network I'm designing for somebody right now and I do the initial stand-up with the fe80:: addresses since they're all unique, and it lets me ignore the fact that they're all using 192.168.1.1 and running DHCP servers.

A script early in /lib/preinit/ can easily turn on the Ethernet interfaces and listen (either on IPv6 LLA addresses or for raw layer 2 traffic) for a magic packet, before turning them all back off and proceeding with boot. The only hangup with this approach is the magic packet has to be cryptographically strong against replay attacks, which introduces a few moving parts to the boot sequence (namely it has to do a quick read-only peek into the rootfs_data partition to read a key, which seems to make others in this thread uneasy).

Answers already provided by my other posts in this thread:

(+ your continued refusal to call it "negligible," and insistence that "really really low" is not a misleading understatement.)

Can you prove this assertion?

I currently believe it to be true anecdotally: of the various devices I've installed OpenWrt on and had to use the bootloader (so about 4-7?), none of them wasted valuable boot time on anything long like a memory test. Even the datacenter switches I've installed NOSes on do not do a memory test in the bootloader/firmware unless you specifically request it -- and those are much more serious boxen than a Belkin in the windowsill.

It is possible to do a more rigorous study (e.g. a random sampling) of the devices in the OpenWrt TOH and get a pretty good idea of the percentage here, but I'd really rather have one singular device named (by model number, so I can look into it) so that if the RAM idea really is nonworkable, I can brainstorm based on what facilities are in that machine, then work backwards to the other devices. It puts me in an unfair position to be asked "what will you do about all of the various machines out there which do a RAM test on every warm(!) reboot" if I can't even have a single example named. Not only do I take this difficulty to find a single concrete example (model numbers!) as support for my current hypothesis, it also comes across as an unwillingness to put in the bare minimum amount of effort to back up the assertion that such a thing even exists.


I would like to take a moment to reiterate that I do think highly of you. You know your stuff. The pushback is part of the process and is, in general, good for tempering rough ideas into solid solutions. I don't want this discussion to get hostile; I have been worried for the past few replies that it might erupt into a flamewar at any given moment and I would really like to avoid that. It may be good to get away from the negativity for a bit. Note that there are several ideas in this thread about avoiding a failsafe reboot that haven't really been touched on; if these are workable, it might be good to focus on those for a while? Just to bring up the "average positivity per post" percentage in this thread a bit? :slight_smile:

Many (most?) OpenWrt devices have neither a hardware RNG nor a good randomness source. Making your system provably cryptographically strong will require a lot more than just "peeking" around.

Heh, we don't even have a proof that P!=NP. We can't prove any NP-hard cryptosystem strong without that; we have to resort to models that replace primitives with idealized functions before we can assess the strength of protocols, and that still makes the assumption that primitives are good... :pensive:

But, yes, the system should read from /dev/random not /dev/urandom to guarantee that these are good bits used in a nonce, and some of those "moving parts" may end up having to include a userspace entropy daemon.

/dev/random doesn't have enough entropy on a typical wi-fi router, to consider generated bits "cryptographically strong". Even more so in the early boot stages.