No, as there hadn't really been any signs of sudden power loss as such. It hadn't happened again after reverting to OpenWrt 19.07.3, which I had been running for a week or so, until I tried booting OpenWrt 19.07.4 again earlier, and then the same issue started happening within a few hours, like before. In syslog I managed to capture what looks to be the reason why it suddenly stops responding, but I have no idea what is causing it. The device remained fully on and powered throughout and just went unresponsive; now I know that's because of the stall.
I'll have to stay on 19.07.3 again and see if it happens again. I thought it was a coincidence that it looked like it was the upgrade; now I'm not sure, given it happened again after switching back to the 19.07.4 partition.
Power supplies can cause weird issues when they go bad, like constant rebooting if they are no longer providing a consistent voltage, plus a whole list of other weird things. But I'm not sure it is the power supply in my case, given I've now seen crash output. It looks to me like something kernel related is being triggered by something I have running.
I left it and it seemed to "recover", and uptime was still counting up so it hadn't rebooted, but the router kept going completely unresponsive because of the various CPU stalls.
I experienced the same problems: erratic behavior from the router, and with a different power supply it went back to normal operation. The problem is that you cannot easily measure the power supply under stress. If you measure it with a multimeter it will show a healthy value on the output, but when power is actually being drawn it might not keep the voltage steady.
I'll see if I can source a legit replacement power supply to test it and see if it helps. Won't hurt either way to rule it out, and it should be quite cheap. Worst case scenario, I've got a spare power supply!
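To put rough numbers on why a multimeter reading can look fine, here is a minimal sketch that models the supply as an ideal source with a series resistance. The 12 V figure is just the usual rating; the 0.1 and 0.8 ohm resistances and the 2 A load are made-up illustration values, not measurements of any real adapter.

```python
def loaded_voltage(v_open, current_a, r_internal_ohm):
    """Terminal voltage of a supply modelled as an ideal source plus series resistance."""
    return v_open - current_a * r_internal_ohm

# Hypothetical values: a healthy and a degraded 12 V adapter.
for label, r in (("healthy", 0.1), ("degraded", 0.8)):
    idle = loaded_voltage(12.0, 0.0, r)   # what a multimeter sees with no load attached
    busy = loaded_voltage(12.0, 2.0, r)   # same supply while the load draws 2 A
    print(f"{label}: {idle:.1f} V unloaded, {busy:.1f} V at 2 A")
```

Both supplies read the same with nothing attached; only the loaded figure tells them apart, which is why swapping in a known-good supply is the practical test.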
I've noticed that my desktop PC coming out of sleep seems to have directly triggered the erratic behaviour a few times now, but it's not something that can just be reproduced, as for the past 24 hours, it has been fine. So I'm wondering if it's a certain event that may trigger a power draw and then it all kicks off.
I would like to know if you managed to solve the problem. I got a shiny new router of the same model, flashed it with 19.07.5, and I am experiencing the same issue. I think it has something to do with my VLAN configuration (in addition to my TL-SG108E switch), but I could easily be wrong.
If you were successful in solving the problem, any advice would be very welcome.
I don't know the exact cause, but I believe the crashes and unresponsiveness are due to CPU stalls. I managed to capture logs of a few:
I've caught the system load going insanely high after about 3-4 hours of uptime, which is when the unresponsive state starts, and I believe it to be the CPU stalling out. When this starts there is a small window where the router is responsive and you can get in over SSH. When I do this and stop mwan3, it seems to stop the issue after a while, and when the system load calms down I can start it again and then everything is normal. If you can get past this weird 3-4 hour mark, the uptime is fine after that:
In terms of my configuration, I am using VLANs as well, with mwan3. I have various network interfaces: L2TP, WireGuard, DHCP etc.
I am unsure why the CPU stalls out. It didn't happen on 19.07 initially; it started around 19.07.4. But after reverting to 19.07.3 I did see the same behaviour, so I wonder if it is a combination of packages and configuration. The latest network/VLAN change is L2TP with a new VLAN, but I can't be sure that's it.
However, it is interesting you have experienced the same behaviour. VLAN configuration could be a clue. Please let me know your specific setup and maybe it will provide further clues.
I don't believe it was a power-related issue either. For me, I am almost (!) sure it is related to VLANs, but I am not sure if our problems are the same. Though if I had to consider power supply issues, I would first have to look at temperature issues for mine.
Nonetheless, the reason I think it is VLAN related is that in my case, as it is right now, I can reproduce the problem simply by enabling STP. I know this can be an intensive process, but I think I left it enough time to complete if that was the issue.
I had STP on earlier while tinkering with some other things and had no problems, but I did not have all my VLANs set up back then, so this could be a blend of settings.
This is my VLAN config, ports 3 and 4 are for the additional network interfaces I have with mwan3, maybe it isn't the same issue exactly, but the common factor seems to be VLAN related. That would make sense as the only major change between 19.07 builds is VLAN related for me.
I haven't enabled STP, so possibly not quite the same, though it is likely you are encountering CPU stalls like I am. Ideally, if you can get the syslog info when the issue happens (i.e. log to a file or a remote destination), you'll be able to capture a couple to confirm.
All CPU stalls appear to be related to CPU 1, which is where my main WAN port is tagged on by default. All my other VLANs are on CPU 0 which is interesting.
Thanks for sharing, so similar to me you have tagged all other VLANs on CPU 0 and left CPU 1 with the default WAN. All my CPU stalls are being triggered on CPU 1 by the looks of it from the stacktrace output.
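If the syslog is saved to a file, a few lines of Python on another machine can tally which CPU the traces name. This is only a sketch: it assumes the traces include the kernel's usual `CPU: <n> PID: <pid> Comm: <name>` header line, and the sample lines are made up for illustration, not copied from anyone's logs.

```python
import re
from collections import Counter

# Header line the kernel prints at the top of a stack dump (assumed format).
CPU_HEADER = re.compile(r"\bCPU: (\d+) PID: \d+")

def count_traces_per_cpu(lines):
    """Return a Counter mapping CPU number -> number of stack-dump headers seen."""
    counts = Counter()
    for line in lines:
        match = CPU_HEADER.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Made-up sample lines, for illustration only.
sample = [
    "kern.err kernel: INFO: rcu_sched self-detected stall on CPU",
    "kern.warn kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted",
    "kern.warn kernel: CPU: 1 PID: 7 Comm: ksoftirqd/1 Not tainted",
    "kern.warn kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted",
]
print(count_traces_per_cpu(sample))
```

To run it against a real capture, pass an open file instead of `sample`, e.g. `count_traces_per_cpu(open("system.log"))` with whatever file name the log was saved under.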
I have logs going to an external log aggregator service and also write the system log to a persistent destination on a USB stick, so they aren't lost on a reboot. Doing either of the two is the easiest way to capture persistent logs.
Yeah, actually that was my thought too, so I created a flash disk. I had to do some troubleshooting until I remembered that an embedded system probably does not support vfat ( XD ) and so was not mounting it, but now I think it will log there. Let's see. I think I could try to force it, but I am not sure I can do with the downtime at the moment.
I also had the thought of buying a second TL-SG108E just to go in front of the router, but buying a 200€ router and then needing a new switch just to use one port on the router seems counterproductive to me. If I have to do all that, I may as well return the router.
Possibly not the same issue then, if you don't get CPU stalls. There's a chance the log might not be able to capture them if the router hangs completely, but it seems it's all around VLAN-related configuration/functionality.
Do you think I should make a new thread? Should I take any more action to make sure that the CPU is not stalled before giving up? As I am still under warranty I do NOT want to open up the device to get serial, but I do have a USB to serial converter (CP2102), which I think I saw somewhere can be used to get a serial console on these.
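For anyone without a log aggregator, the remote option can be as simple as a small UDP listener on another machine with the router's remote logging pointed at it. A minimal sketch, assuming the router sends plain syslog datagrams over UDP; the file name and port 5514 are placeholders I picked (the standard port 514 usually needs root to bind).

```python
import socketserver
from datetime import datetime

LOG_PATH = "router-syslog.log"   # hypothetical output file
LISTEN = ("0.0.0.0", 5514)       # hypothetical port; 514 usually needs root

def format_entry(sender_ip, payload, now=None):
    """Turn one received datagram into a timestamped log line."""
    now = now or datetime.now()
    text = payload.decode("utf-8", errors="replace").rstrip()
    return f"{now.isoformat(timespec='seconds')} {sender_ip} {text}\n"

class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        data, _sock = self.request   # for UDP servers, request is (data, socket)
        with open(LOG_PATH, "a", encoding="utf-8") as fh:
            fh.write(format_entry(self.client_address[0], data))

def run_receiver():
    """Listen until interrupted, appending every datagram to LOG_PATH."""
    with socketserver.UDPServer(LISTEN, SyslogHandler) as server:
        server.serve_forever()

# Call run_receiver() to start listening; it blocks until interrupted.
```

Because every message is written out as it arrives, whatever the router managed to send before hanging is already on disk on the other machine.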
Nope no WPA3 in play, so it is not that, but I have captured at least one CPU stall that referenced WiFi.
Best theory at the moment is that CPU cycles are being taken up by certain processes, causing the lockup because the CPU is literally too busy to handle any other requests. This shows in the indicator lights on the WRT3200ACM itself freezing up too, as well as it going unresponsive to ping etc. I think only a single switch light was still blinking, then after a while it unfreezes. All the stalls seem to occur on CPU1.
What's weird is that the timing is always around 3-4 hours, but with some manual intervention, stopping services and such, it can then recover on its own once the system load calms down. Right now the uptime has been several days with no problem; it is just that initial 3-4 hour period after a reboot, sysupgrade or whatever else puts the uptime back at zero.
I think there's potentially some edge case that's being triggered by a combination of packages and my configuration. The true test would be to reset back to defaults and leave it running, but given I'm working from home I can't really do that right now. I need the VLANs and mwan3 enabled.
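The manual stop-and-restart routine could in principle be automated with two load thresholds. Below is a sketch of just the decision logic; the 4.0 and 1.0 thresholds and the sample readings are made-up values, and since Python is not on the router by default, a real version would more likely be a small shell script reading /proc/loadavg.

```python
def parse_loadavg(text):
    """Return the 1-minute load average from /proc/loadavg-style text."""
    return float(text.split()[0])

def next_action(load1, paused, high=4.0, low=1.0):
    """Decide what to do with the service given the current load.

    Returns "stop" when the load reaches the high mark while running,
    "start" when it has dropped to the low mark while paused, else "none".
    Two thresholds (hysteresis) avoid flapping around a single limit.
    """
    if not paused and load1 >= high:
        return "stop"
    if paused and load1 <= low:
        return "start"
    return "none"

# Made-up readings in /proc/loadavg format.
readings = (
    "0.40 0.35 0.30 1/80 1234",
    "6.10 3.20 1.50 3/85 1300",
    "0.70 2.10 1.40 1/80 1350",
)
paused = False
for reading in readings:
    action = next_action(parse_loadavg(reading), paused)
    if action == "stop":
        paused = True
    elif action == "start":
        paused = False
    print(reading.split()[0], action)
```

This only papers over the symptom, of course; it would not explain why the load spikes at the 3-4 hour mark in the first place.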
This still has all of the hallmarks of a bad power supply, and Linksys in particular has a reputation for sourcing bad capacitors. Age and ambient heat are the two major causes of them failing, and electrical loads beyond what they are able to buffer cause ripple and therefore instability. Current draw is directly related to CPU utilization, so anything that taxes the router after a few hours of getting warm will cause it to become anemic.
An easy way to test is to borrow an external hard drive power supply. They are usually rated about the same: 12 V at 2-3 A.
Thankfully the filter cap on the router board itself is now a solid electrolyte so you can probably rule that one out.