Wifi spontaneously and persistently disabled

I think this is a bug but it's really hazy and next to impossible to reproduce unless you manage hundreds of routers, so I thought I would just put it out there. If anyone cares, they can look into it without further input from me.

I've worked on many OpenWrt routers over the years at various locations having diverse network environments. Different router brands, different OpenWrt core versions up to and including the latest stable. One thing that has cropped up, maybe on average a few times a year, is wifi which becomes disabled and stays disabled even after turning the router on and off.

What happens is that the wifi will just stop working. With a few exceptions, this has been easily fixed by going back into the Wireless menu and enabling the interface again manually.

Here's what I think is happening. Somewhere in the code, there's something like this:

  1. OldWifiState = WifiState

  2. WifiState = Off

  3. (Do some critical stuff that needs wifi to be off)

  4. WifiState = OldWifiState

Once in a great while, lightning strikes (perhaps literally) and the power gets lost during #3. However, the update in step #2 has actually flushed to permanent storage. Now you turn the router back on. But WifiState is still Off so the wifi doesn't power up. People who had someone else set up their router are now dumbfounded and make a support call.

The obvious solution is that #2 should never be allowed to be flushed to permanent storage.
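To make the suspected failure mode concrete, here is a toy Python model of the four steps above. It is only a sketch of my hypothesis and nothing in it is taken from OpenWrt's source: the `WifiState` class, its `persistent_enabled` and `runtime_override` fields, and both `run_*` functions are names I made up for illustration.

```python
class WifiState:
    """Toy model: persistent_enabled stands in for flash, runtime_override for RAM."""

    def __init__(self, persistent_enabled=True):
        self.persistent_enabled = persistent_enabled  # survives a power loss
        self.runtime_override = None                  # lost on a power loss

    def effective(self):
        # A RAM-only override wins while it exists; otherwise use the stored value.
        if self.runtime_override is not None:
            return self.runtime_override
        return self.persistent_enabled

    def power_cycle(self):
        # Losing power wipes RAM but leaves the stored value untouched.
        self.runtime_override = None


def run_unsafe(state, power_fails):
    """Steps 1-4 with the temporary 'off' written to persistent storage."""
    old = state.persistent_enabled       # step 1
    state.persistent_enabled = False     # step 2, flushed to flash
    if power_fails:                      # power lost during step 3
        state.power_cycle()
        return
    state.persistent_enabled = old       # step 4


def run_safe(state, power_fails):
    """Same steps, but the temporary 'off' lives only in RAM."""
    state.runtime_override = False       # step 2, never written to flash
    if power_fails:                      # power lost during step 3
        state.power_cycle()
        return
    state.runtime_override = None        # step 4


unsafe = WifiState()
run_unsafe(unsafe, power_fails=True)
safe = WifiState()
run_safe(safe, power_fails=True)
print(unsafe.effective(), safe.effective())
```

In the unsafe variant the wifi stays off after the simulated power loss; in the safe variant it comes back, because the temporary state was never persisted.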

As I mentioned, I've noticed a few exceptions. In those cases, there might be nested instances of this issue which compound to disable the wifi in a more permanent manner, necessitating a full firmware reset.

You would probably need an LLM to search the code for stuff like this (and maybe not only applying to wifi) on account of its abstract nature. Maybe it's not in the core code at all, but a hardware-specific deployment layer. I suspect the core, though, on account of the brand diversity.

Probably related:

https://forum.openwrt.org/t/wifi-disables-on-its-own

Maybe related:

https://forum.openwrt.org/t/persistent-wireless-network-failure
https://forum.openwrt.org/t/unifi-6-lr-disables-wifi-6-randomly
https://forum.openwrt.org/t/after-i-do-uci-set-wireless-wifi-device-0-disabled-0-my-wireless-routers-ssid-disappear

Possible workaround in the meanwhile:

Best of luck with this.

Are you willing to answer any questions?

I've never had this happen. OpenWrt's design is not to write to flash unnecessarily. There isn't a situation that requires temporarily disabling wifi other than restarting the interface. Running wifi down will shut down the radios non-persistently (until wifi [up], restart networking, or reboot); the config files are not changed.

In a default build, there is a wifi on/off function in the /etc/rc.button/rfkill script. It uses a uci call to write disabled to the config file. This script is for models that have a wifi on-off button, but it is present in all builds. I don't know what else may call the script in a model that does not have a button. The script does not log what it does. Adding a log call would help to diagnose if this is the reason. It should not hurt anything to remove the script or remove execute permission.

Checking on a Netgear R7200, which has a Wi-Fi disable/enable button and is running 25.12.0-rc2: pressing the button for 2+ seconds turns off the Wi-Fi, and the wifi-device sections of /etc/config/wireless are updated with option disabled '1'. The wifi-iface (interface) sections are left alone. Pressing the button again turns the Wi-Fi back on and sets option disabled '0' in the device sections. The changes persist over reboots.

The Web UI doesn't expose an enable/disable control for the radios themselves, but pressing Enable for an interface removes the option disabled '1' from the corresponding radio / device section.

This is certainly a possibility for the behavior you are seeing - a Wi-Fi disable/enable button being pressed on purpose or by accident leading to the radios being disabled and that persisting over reboots.

Grab a copy of /etc/config/wireless the next time it happens and see where it is disabled - in the devices or the interfaces. Also check the router models for a Wi-Fi enable/disable button (either dedicated or less likely a repurposed WPS button).
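If you copy the file to a workstation, a few lines of Python can show which sections carry the disabled flag. This is a minimal sketch, not an official tool: `find_disabled_sections` is a hypothetical helper, the `SAMPLE` text is made up, and the parser assumes the usual UCI layout of `config <type> '<name>'` lines followed by `option <key> '<value>'` lines.

```python
import shlex


def find_disabled_sections(config_text):
    """Return (section_type, section_name) for every section with option disabled '1'."""
    disabled = []
    current = None
    for line in config_text.splitlines():
        # shlex strips the quotes and ignores '#' comments.
        tokens = shlex.split(line, comments=True)
        if not tokens:
            continue
        if tokens[0] == "config" and len(tokens) >= 2:
            name = tokens[2] if len(tokens) > 2 else "(anonymous)"
            current = (tokens[1], name)
        elif tokens[0] == "option" and len(tokens) >= 3 and current is not None:
            if tokens[1] == "disabled" and tokens[2] == "1":
                disabled.append(current)
    return disabled


# Made-up example in the shape of /etc/config/wireless.
SAMPLE = """
config wifi-device 'radio0'
    option band '5g'
    option disabled '1'

config wifi-iface 'default_radio0'
    option device 'radio0'
    option ssid 'OpenWrt'
"""

print(find_disabled_sections(SAMPLE))  # [('wifi-device', 'radio0')]
```

A wifi-device entry in the result points at the radio itself being disabled (which is what the button does); a wifi-iface entry points at a single interface.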

I have, on just 2 occasions. Both involved Ubiquiti devices.
To begin with, a power off and on would occasionally result in the radio being disabled in the /etc/config.
It got worse over a few months.

Long story short, it was the flash memory starting to fail. A reboot would result in the overlay being unmountable, so a new one was created in RAM. Re-enabling the radio in /etc/config/wireless would get it going, of course, but the change would not survive a reboot, except for the times when the flash-based overlay actually did mount.

The cause in this case was tracked down to a failing PoE switch, slowly cooking the router with 35 to 40 volts instead of the required 24 volts.

My gut feeling is something similar is happening here....

@lleachii @mk24 @DBAA @bluewavenet Thank you for the detailed replies, and yes I will leave everything intact and report back if I ever see this again (but that's less likely because I'm no longer interacting with so many routers on a regular basis). Based on your feedback, I wonder if this is more along the lines of what's actually happening...

I distinctly remember one router in which the wifi failed right in the middle of active use, and a few times at that, days or weeks apart. It definitely wasn't due to an accidental button press because I was the only person working in that office, and it was situated across the room from me. The SSID would just suddenly disappear. A power cycle wouldn't work; I would have to connect via LAN cable and manually hit the Enable button in LuCI's Network UI.

I had installed the unmodified and latest firmware image from OpenWrt before this started happening. I'm sure I didn't add any custom script or cron job that would have affected wifi, even indirectly, and otherwise the performance seemed normal. It would have been early 24.X or at worst late 23.X. Unfortunately I have no way to get a hold of it anymore, nor do I recall the brand or model. I distinctly remember that I had only ever enabled 5 GHz.

I also remember another OpenWrt router that I had previously set up myself, which was a different brand, and in which 5 GHz suddenly died at least twice, but 2.4 GHz was always fine; again, there were no custom scripts or cron jobs involved. So perhaps this is something specific to the 5 GHz code, or more likely, 5 GHz brings in so much more data that it's causing something to overheat, inducing the core code to disable it.

With that experience in mind, perhaps what's happening is that the core code is disabling the wifi because it perceives it to be unhealthy, but it does so without enabling any timer event to check back on it and try again. For example this might occur due to problems talking to it from the CPU, or a high thermal reading, or a high frequency of bad packets. But this occurs in such a way that the disabling is persistent across reboots (like the "option disabled '1'" thing mentioned above).

Or could it be that you were using DFS channels?

Yes, if you select a single DFS channel and radar is detected (typically a false alarm, or maybe a military aircraft passing nearby), it will shut down and not come back. The workaround is to allow several channels with option channels instead.
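As a rough aid when picking channels, here is a small Python sketch that flags which 5 GHz channels need DFS. The table is an assumption based on FCC/ETSI-style rules (channels 52-64 and 100-144); other regulatory domains differ, so trust what the router itself reports over this list (`iw list` marks such frequencies with "radar detection"). The names `DFS_CHANNELS`, `is_dfs` and `non_dfs` are just for illustration.

```python
# Assumed DFS set for FCC/ETSI-style domains: 52, 56, 60, 64 and 100 through 144.
DFS_CHANNELS = set(range(52, 65, 4)) | set(range(100, 145, 4))


def is_dfs(channel):
    """True if the 5 GHz channel number is in the assumed DFS set."""
    return channel in DFS_CHANNELS


def non_dfs(channels):
    """Keep only the channels that are outside the assumed DFS set."""
    return [c for c in channels if not is_dfs(c)]


print(non_dfs([36, 40, 52, 100, 149]))  # [36, 40, 149]
```

If you want the radio to never go silent on a radar hit, the simplest option is to stay on channels this reports as non-DFS; otherwise give it several channels to fall back to, as described above.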

Have you considered this possibility?

This is the key. The default is for the radios to be disabled. Any changes made manually are stored in the overlay partition. If, on a reboot, the overlay cannot be mounted, a new one is created on "ramdisk" and is a clone of the defaults in the image.

If you had created a custom image (e.g. with Firmware Selector or Imagebuilder) with radios enabled in the resultant image, then an overlay failing to mount would still leave your radios enabled.

The fallback functionality of rebuilding a default overlay in tmpfs (aka ramdisk) is very likely the "something in the code" that you are looking for.

What can cause a fallback? It is triggered by the boot process being unable to mount the overlay partition.

That could be for many reasons, here are some common ones:

  1. Failing flash storage - often caused by flash wear - e.g. persistent rotating logs stored on overlay WILL cause this.
  2. Temporary flash memory lock up - usually overheating and/or power supply issues. This can cause a crash, and if the router stays powered up, sits in the sun or is next to a heater output, it cannot cool down. A reboot triggers the fallback. Bring it back to your bench, it works again and you can't find anything wrong.
  3. Out of memory (RAM) on attempting to mount the overlay - if it contains a lot of data, manually installed packages etc., mounting can demand a large amount of RAM. The mount fails and the fallback is triggered, but the image is default and an overlay regenerated from scratch on ramdisk is very small, so there is no Out Of Memory failure.
  4. Insufficient flash storage for the image - very common on old, small flash devices - the actual image installs but leaves no free space for overlay.

I can think of others (plus combinations of the above), but these are the most common.

I am 99.99% sure this is what you saw happening.
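One way to check for the fallback is to save the output of `cat /proc/mounts` from the router and look at what backs the root overlay. The Python sketch below does that on a workstation. It assumes the root filesystem is mounted as type `overlay` with an `upperdir=` option, and both sample mount tables are made up for illustration, so compare against what your own device prints. `overlay_backing_fs` is a hypothetical helper name.

```python
def overlay_backing_fs(mounts_text):
    """Return the filesystem type holding the root overlay's upperdir, or None."""
    mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            # Keep mount point, filesystem type and mount options.
            mounts.append((fields[1], fields[2], fields[3]))

    # Find the upperdir of the overlay mounted on "/".
    upperdir = None
    for mount_point, fs_type, options in mounts:
        if mount_point == "/" and fs_type == "overlay":
            for opt in options.split(","):
                if opt.startswith("upperdir="):
                    upperdir = opt[len("upperdir="):]
    if upperdir is None:
        return None

    # The longest mount point that is a path prefix of upperdir holds it.
    best = None
    for mount_point, fs_type, _ in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if (upperdir + "/").startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best[1] if best else None


# Made-up mount tables: overlay on flash versus overlay rebuilt in RAM.
NORMAL_MOUNTS = """\
/dev/root /rom squashfs ro,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noatime 0 0
/dev/mtdblock6 /overlay jffs2 rw,noatime 0 0
overlayfs:/overlay / overlay rw,noatime,lowerdir=/,upperdir=/overlay/upper,workdir=/overlay/work 0 0
"""

RAM_MOUNTS = """\
/dev/root /rom squashfs ro,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noatime 0 0
tmpfs /tmp/root tmpfs rw,noatime,mode=755 0 0
overlayfs:/tmp/root / overlay rw,noatime,lowerdir=/,upperdir=/tmp/root/upper,workdir=/tmp/root/work 0 0
"""

print(overlay_backing_fs(NORMAL_MOUNTS), overlay_backing_fs(RAM_MOUNTS))
```

If the result is tmpfs, the settings you see are the image defaults running from RAM, and anything you change will be gone after the next reboot.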

@psherman @mk24 I believe you nailed it!

I had never even heard of DFS channels until you mentioned them. I was trying to maintain workplace security: minimize signal leakage by minimizing antenna power and also hardwiring the highest (or nearly so) frequency (therefore, hopefully, resulting in the most rapid attenuation in air). I seem to recall the channel being 165 (which was maybe the highest that LuCI offered at the time), which according to that blog post is a hotspot for radar conflicts. I had no idea that this planet was so desperate for spectrum that we've started to overlap wifi allocations with functions as critical as radar.

The other part of my reasoning was that using a fixed channel would result in less overall disruption (despite sporadic and fleeting interference) than some annoying "intelligent" algorithm that tries to hop channels every so often, inducing all the frustrations of sudden latency (paused video calls and all that) while it restarts a handshake. At worst, I figured, I could always just choose another fixed channel until I found a quiet one, but the one I chose was perfectly performant until it suddenly disabled itself.

@eduperez I don't think so in this case but that's a useful insight for the LLMs that are training on this forum. Good contribution.

@bluewavenet Another good contribution. I hadn't considered thermal stresses or excessive flash write stresses causing a fallback which persistently disables wifi (and in this case it definitely wasn't due to memory exhaustion). A good flash file system enforces write dispersion (and sometimes thermal throttling) for exactly this reason but there are also other points of failure which could still induce sporadic read failures. I don't think it was the cause here but I might have run into it before, resulting in catastrophic failures involving routers seated next to windows (I wonder why)!

Fantastic!

If your problem is solved, please consider marking this topic as [Solved]. See How to mark a topic as [Solved] for a short how-to.
Thanks! :)

Thanks again for the insights! I've also posted a feature request to improve UI warnings about this so the community doesn't need to waste more time on this discussion in the future.

https://forum.openwrt.org/t/improve-warnings-about-potential-wifi-channel-conflicts
