I believe the degradation over time might be explained by filesystem corruption that slowly manifests itself. I'm not sure what is causing it, but unless you're running fsck all the time, something is messing with the filesystem. I notice that my root partition (sda2), which is ext4, gets remounted read-only, and then the rebooting issues start. Without usage the system seems to be OK, but with usage it reboots. I tried different drives and USB thumb drives, with the same result. I ran Ubuntu Desktop and Debian Jessie on the same hardware with no issues.
I got curious about the load idea, so I created a small shell script with a while loop to exercise just the CPU (and the CESA unit, via openssl speed). So there is no activity on the mwlwifi, LAN/WAN, or USB/eSATA fronts. I am running more than one instance of the script to keep the device pegged. So far it just keeps ticking, though I am getting tired of listening to the fan running at full tilt.
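For anyone who wants to repeat the test, here is a minimal sketch of that kind of loop. It is not the actual script from the post: the function name `burn_cpu` and its arguments are made up for illustration, and whether `openssl speed -evp` really ends up on the CESA engine depends on how OpenSSL is set up on your build, which I am only assuming.

```shell
#!/bin/sh
# Hypothetical helper, not the original poster's script.
# Usage: burn_cpu ROUNDS [COMMAND...]
# Runs COMMAND repeatedly to keep one core busy; the default workload
# is an openssl speed run (assumed, not confirmed, to exercise CESA).
burn_cpu() {
    rounds="$1"
    shift
    # Fall back to the openssl benchmark when no command is given.
    [ "$#" -gt 0 ] || set -- openssl speed -evp aes-128-cbc
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Discard benchmark output; stop early if the workload fails.
        "$@" >/dev/null 2>&1 || {
            echo "workload failed on round $i" >&2
            return 1
        }
        i=$((i + 1))
    done
    echo "completed $i rounds"
}
```

To peg a dual-core device, start one instance per core in the background, for example `burn_cpu 1000 & burn_cpu 1000 &`.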
@2devnull, I doubt that the issue you are seeing is related to the 4.9 reboot being experienced on the mamba. On the plus side, you have access to functionality not available on the mvebu target. IIRC you should be able to catch an oops, as the kexec / crashdump facility is in place and functioning on x86.
Can you explain how to catch the oops? I'll be happy to do it and provide any info that may help here or otherwise.
Has anyone tried disabling the cpuidle driver? I disabled two options relating to cpuidle in the kernel config after spending the night reading through kernel mailing lists trying to pinpoint possible leads.
There are a couple of bugs that reference the cpuidle driver causing the CPU to lock up under heavy IO loads. My novice interpretation is that the whole cpuidle implementation for mvebu is "hacky", and it seems to fit the description of our problem. It was disabled in mainline before 4.3 (I think).
So far it's been up for 7ish hours without reboots under a mixture of light and heavy workloads. At the moment I have it set up as a wireless client, and it has transferred 200+ GB at a consistent rate of 50-55MBps over wifi. I'll report back if I can achieve an uptime of 48+ hours; otherwise this is hopefully still useful for ruling cpuidle out as the culprit.
Yes, both cpuidle and cpu speed scaling, to no avail. Crashed 48 hours later with that patch.
Edit 1: Just noticed that I forgot to remove it from my patch directory, so I'm still running mine without cpuidle and frequency scaling... oh well...
I doubt it, but...
and putting load on CPU did not yield reboot.
Mine is fairly stable too for the last couple of days.
root@net002:~# uptime
 11:25:03 up 2 days, 21:54, load average: 1.63, 1.27, 0.62
root@net002:~# cat /etc/openwrt_version
r5436+2-18cc8d520c
root@net002:~# uname -a
Linux net002.ncnet.local 4.9.65 #0 SMP Wed Nov 29 16:01:13 2017 armv7l GNU/Linux
Been running the latest mwlwifi (20171129), the latest kernel, and some CPU-related kernel hacks; idle during the day and under heavy load at night. Rock solid so far. Hope I nailed something.
Edit 1: up 4 days now.
Care to share the kernel hacks or are you going to generate a patch/pull request?
I will as soon as I do it properly. I need to make a "real" 4.9 patch file instead of writing over the kernel config with kernel_menuconfig. I also need to pinpoint and separate each disabled function so we can re-enable them one after the other to find the culprit. I was waiting so that I didn't go all in, betting on an image that would crash several hours later...
I know I have increased verbosity and enabled some lockup/IRQ/workqueue/timer debug options to try to catch something. The rest is just disabled kernel functions such as cpuidle, CPU frequency scaling, and the real-time clock (my prime suspects).
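To keep track of which suspects are on or off in a given build while re-enabling them one at a time, something like the sketch below might help. The helper name `report_kconfig` is made up, and mapping the three suspects to the Kconfig symbols `CPU_IDLE`, `CPU_FREQ` and `RTC_CLASS` is my assumption, not something taken from the actual patch.

```shell
#!/bin/sh
# Hypothetical helper for comparing builds during the bisection.
# Usage: report_kconfig CONFIG_FILE SYMBOL...
# Prints "SYMBOL=<value>" when the symbol is set in the kernel config,
# and "SYMBOL=off" when it is commented out or missing.
report_kconfig() {
    cfg="$1"
    shift
    for sym in "$@"; do
        # Take the first "CONFIG_<sym>=<value>" line, if there is one.
        val=$(sed -n "s/^CONFIG_${sym}=//p" "$cfg" | head -n 1)
        echo "${sym}=${val:-off}"
    done
}
```

Run against the `.config` of the kernel build directory, for example `report_kconfig .config CPU_IDLE CPU_FREQ RTC_CLASS`, and keep the output next to the uptime you got from that image.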
Maybe it's not one of these functions; maybe the slight overhead of the numerous debug flags adds enough latency to keep the chip from going crazy. Maybe this image was stable because of cosmic rays or whatever.
Edit 1: Reworked the kernel patch so it can be merged on a clean tree, shortened the URLs, and added some markdown so it doesn't hurt the eyes too much.
Update on my last build, up 5 days, 3 hours, 22 minutes
Yeah, I've been up for 3 days now, and 2 days before that, with a manual reboot in between. I just disabled a few debugging interfaces (not like they help for this issue anyway) and disabled cpuidle; cpufreq is enabled but switched to the schedutil governor (I've had random reboot issues with different governor/kernel combinations on Android phones, so I'm applying that here).
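If cpufreq is built in, the governor can also be checked and switched at runtime through the standard cpufreq sysfs files, without rebuilding. A rough sketch, assuming the governor you want is compiled into the kernel; the function names and the optional directory argument (there only so the functions can be pointed at a test directory) are made up:

```shell
#!/bin/sh
# Hypothetical helpers around the cpufreq sysfs interface.
# The optional last argument overrides the sysfs base directory.

# Usage: show_governors [SYSFS_CPU_DIR]
show_governors() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue          # cpufreq absent: nothing to show
        cpu=${f#"$base"/}                # strip the base directory
        cpu=${cpu%%/*}                   # keep only "cpuN"
        echo "$cpu: $(cat "$f")"
    done
}

# Usage: set_governor NAME [SYSFS_CPU_DIR]
set_governor() {
    gov="$1"
    base="${2:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue
        echo "$gov" > "$f"
    done
}
```

For example, `set_governor schedutil` followed by `show_governors`. The setting does not survive a reboot, so it would have to be reapplied from a startup script.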
I also enabled the fastpath patch, in the hope that it might bypass whatever is causing the issue, and "compiled for performance".
Edit: Nearly 4 days for me. I suspect disabling cpuidle has been the main contributor to stability.
btw, using default mwlwifi driver (kmod-mwlwifi_4.9.65+10.3.4.0-20171011-1_arm_cortex-a9_vfpv3.ipk)
I am using a WRT1200AC (v2 I think) with David's build on it. I have had 3 reboots over the last two days, since I updated to r5422. Before that I was running r5297, and it was rock stable: no reboots since I bought the router a month ago and put r5297 on it straight away. r5297 had wifi version kmod-mwlwifi - 4.9.58+10.3.4.0-20170810-1, and right now I am using kmod-mwlwifi_4.9.65+10.3.4.0-20171129-1. One of the three crashes happened the very moment I opened LuCI in the web browser over wifi ( https://10.0.0.199/cgi-bin/luci/admin/system/packages ), so maybe it is CPU load related? The others happened while I was just browsing the web or at some other random point.
I wasn't even aware that the WRT1200AC had this problem. I can do a build for you with the above config if you would like to test. I'm nearing 6 days of uptime, so it would be interesting to see if it helps the 1200 too.
What are your changes, @JTRealms? Do I understand correctly that you disabled cpuidle? I actually read before that it is bugged on this CPU, but that it was already disabled by default.
My dmesg also mentions this during boot:
[ 0.029886] cpuidle: using governor ladder
[ 0.030062] mvebu-pmsu: CPU hotplug support is currently broken on Armada 38x: disabling
[ 0.030070] mvebu-pmsu: CPU idle is currently broken on Armada 38x: disabling
So isn't it already disabled?
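One way to check what the running kernel actually has, rather than going by the boot messages alone, is to read the cpuidle sysfs node. A small sketch follows; the function name and its optional directory argument are hypothetical, while `current_driver` is the standard cpuidle sysfs file, which reads "none" when no driver is registered.

```shell
#!/bin/sh
# Hypothetical helper: report whether a cpuidle driver is registered.
# Usage: cpuidle_status [CPUIDLE_SYSFS_DIR]
cpuidle_status() {
    base="${1:-/sys/devices/system/cpu/cpuidle}"
    if [ -r "$base/current_driver" ]; then
        drv=$(cat "$base/current_driver")
    else
        # Directory missing, e.g. kernel built without cpuidle support.
        drv="none"
    fi
    case "$drv" in
        none|"") echo "cpuidle: no driver registered" ;;
        *)       echo "cpuidle: driver $drv active" ;;
    esac
}
```

Comparing that with `dmesg | grep -i -e cpuidle -e mvebu-pmsu` should show whether the "disabling" messages and the runtime state agree on a given build.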
Do you get any crash logs generated? I assumed the WRT1200 was using the Armada XP, but it actually uses the same SoC found in the WRT1900ACv2, which doesn't have this issue. The reboot issue on the 1900ACv1 specifically doesn't print any debugging or crash info; if yours does, it's probably a different issue.
@JTRealms, how would I see or save crash dumps/logs if the router reboots? Is there a way of saving them somewhere before the reboot?
Mine is logging to a USB drive and it's never recorded anything of use. I've just kept it that way on the off chance it ever does.
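Even if nothing useful gets logged at crash time, a heartbeat file on the same drive at least narrows down when the reboot happened and what the load looked like just before. A minimal sketch of that idea; the function name, arguments and log path are made up, and it only records timestamps and load, not a crash dump.

```shell
#!/bin/sh
# Hypothetical heartbeat logger: appends a timestamp and the load
# averages to a file at a fixed interval, syncing after each write so
# the last line survives a sudden reboot.
# Usage: heartbeat LOGFILE [COUNT] [INTERVAL_SECONDS]
#   COUNT 0 (default) means run until the device goes down.
heartbeat() {
    logfile="$1"
    count="${2:-0}"
    interval="${3:-10}"
    n=0
    while :; do
        load=$(cut -d' ' -f1-3 /proc/loadavg 2>/dev/null)
        printf '%s load=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$load" >> "$logfile"
        sync    # flush to the drive before anything can go wrong
        n=$((n + 1))
        if [ "$count" -gt 0 ] && [ "$n" -ge "$count" ]; then
            break
        fi
        sleep "$interval"
    done
}
```

For example, `heartbeat /mnt/usb/heartbeat.log 0 10 &` (the mount point is just an example); after an unexpected reboot, the last line in the file gives the time of the last sign of life to within the interval.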
@ListerWRT That is what I thought too. If a kernel panic happens, how would it still be able to log anything to a USB disk, the network, or wherever? How does Windows do it, though, when it writes a kernel dump to disk for a bluescreen? I guess it has its own little space in RAM with a BSOD kernel plus a separate disk driver that can do it? Does Linux have something like this too? But not on a small device like a router, I guess?
debugfs dump, but not on this target.
Edit: @mfka8, No, see discussion in posts back around the time-frame of the post I linked.
What do you mean by "but not on this target", @anomeome? So is there a way to get the logs or not? Is there no how-to somewhere on setting this up for LEDE? There must be a way to debug crash reboots, no?
Not implemented or not functional.
Your patch is for the 3.18 kernel. A lot has changed since then in the ARM ecosystem. I doubt it is still disabled, as I don't have this mvebu-pmsu debug message appearing. I also don't have a file with that name in my patchwork directory.
However, I am now almost certain our grief is caused either by CPU frequency scaling or by the CPU idle driver. With both of these functions disabled, my router is now stable and has not experienced a spontaneous reboot since. It just runs a little hotter.