Linksys WRT3200ACM goes completely down after a few hours requiring a full power cycle

Possibly not the same issue then, if you don't get CPU stalls. There's a chance the log can't capture them if the router hangs completely, but it seems to be all around VLAN-related configuration/functionality.

Do you think I should make a new thread? Should I take any more action to make sure that the CPU is not stalled before giving up? As I am still under warranty I do NOT want to open up the device to get serial, but I do have a USB-to-serial converter (CP2102), which I think I saw somewhere can be used to get a serial console on these.

This sounds suspiciously like the mwlwifi bug when attempting to get WPA3 working; ignore if you do not have that in play.

Nope, no WPA3 in play, so it is not that, but I have captured at least one CPU stall that referenced WiFi.

Best theory at the moment is that certain processes are eating CPU cycles and causing the lockup, because the CPU is literally too busy to handle any other requests. This shows in the indicator lights on the WRT3200ACM itself freezing up, as well as the router going unresponsive to ping etc. I think only a single switch light was still blinking; then after a while it unfreezes. All the stalls seem to occur on CPU1.

What's weird is that the timing is always around 3-4 hours, but with some manual intervention, such as stopping services, it can then recover on its own once the system load calms down. Right now the uptime has been several days with no problem; it is just that initial 3-4 hour period after a reboot, sysupgrade or anything else that puts the uptime back at zero.
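To pin down whether the lockups really land 3-4 hours after boot, one option is to log reachability from another machine on the LAN. The following is only a rough sketch, not something from this thread: the function names are made up for illustration, `192.168.1.1` is assumed to be the router's LAN address (the OpenWrt default), and the `ping` flags assume a Linux client.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Return True if one ICMP echo to host gets a reply (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (first_fail, last_fail) outage windows."""
    windows = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts  # first failed sample of a new outage
            last_fail = ts
        elif start is not None:
            windows.append((start, last_fail))
            start = None
    if start is not None:
        # The samples ended while the host was still unreachable.
        windows.append((start, last_fail))
    return windows


def monitor(host="192.168.1.1", interval_s=5):
    """Print a timestamped line whenever reachability changes. Runs until interrupted."""
    previous = None
    while True:
        ok = ping_once(host)
        if ok != previous:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host} {'up' if ok else 'DOWN'}")
            previous = ok
        time.sleep(interval_s)
```

Calling `monitor()` right after a reboot and leaving it running gives timestamps to compare against the router's uptime. `outage_windows` can be fed recorded samples afterwards to see how long each freeze lasted.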

I think there's potentially some edge case that's being triggered by a combination of packages and my configuration. The true test would be to reset back to defaults and leave it running, but given I work from home I can't really do that right now. I need the VLANs and mwan3 enabled.

This still has all of the hallmarks of a bad power supply, and Linksys in particular has a reputation for sourcing bad capacitors. Age and ambient heat are the two major causes of capacitor failure, and electrical loads beyond what they are able to buffer cause ripple and therefore instability. Current draw is directly related to CPU utilization, so anything that taxes the router after a few hours of getting warm will cause it to become anemic.

An easy way to test is to borrow an external hard drive power supply. They are usually rated about the same: 12 V at 2-3 A.

Thankfully the filter cap on the router board itself is now a solid electrolyte type, so you can probably rule that one out.

@jamesmacwhite did you ever get to the bottom of this? I think I am getting exactly the same issue.

  • I started seeing it after upgrading to 22.03.2, but I kept seeing it after reverting to 19.07.10.
  • From what I recall, 19.07 was stable for months at a time.
  • I know 22.03 uses the new firmware (0x903020c instead of 0x9030206), but it's not that, because I was manually installing that on 19.07 (it works well for me).
  • I don't use VLANs (apart from WAN/LAN).
  • I install some extra packages (strace vim-fuller mtr tcpdump diffutils iputils-arping curl bind-dig htop iftop screen mailsend-nossl coreutils-nohup ipset wireless-tools), but I don't think these do much on their own, and I definitely wasn't using them yesterday when the issue happened.
  • For me, it definitely seems wifi-related, and it seems to happen when a known device is re-introduced to the network (e.g. someone returns home with their mobile phone, a computer wakes up from sleep, etc.)
  • I did tweak the station inactivity timeout to 3600 seconds... I wonder if it's related to that... I didn't use to do that in the past... I used to use the default. Did you tweak that by any chance?
  • Other than that, my config is really simple and vanilla. I like the power supply theory, not sure I have another to test it out though.
  • I also saw the frozen LEDs yesterday (well, one of them anyway, the 5G wifi one, which is our main source of wifi).

No I didn't, but there's likely a combination of factors which did trigger it.

On 21.02.6 now, I no longer have this issue, but I did change a couple of things along the way. At one point I used an L2TP WAN, which did seem to have some influence: if I had it disabled for a while, the lockup after around 4 hours didn't seem to happen.

More than once I managed to capture CPU stalls and a kernel panic referencing WireGuard, but I don't think WireGuard itself is the issue.

I never replaced the power supply, so I don't think it was that.