[Solved] WRT1900ACV1 reboots: kernel 4.9

mfka8 · December 15, 2017, 6:16am

@NainKult what about the crash dumps I posted? Why is there a line including the string 'arch_cpu_idle' if it should be disabled? Could it be that this patch isn't working properly for the WRT1200 either, so this might be a crash cause like on the WRT1900?

JTRealms · December 15, 2017, 12:11pm

12 days uptime, I'll call that stable

anomeome · December 15, 2017, 3:36pm

It seems we are rapidly approaching a place that you cannot get to from here:

changing some kernel config parameters is felt to change issue for the better
now, more than one report on OpenWrt forum from users simply configuring wan6 and issue be gone
some cannot get away from issue no matter what is tried
I can no longer generate an image that fails, 2 different OS's on 2 different platforms, w 0 changes

Is that Heisenberg wandering around over there in the dark?

Up for ~8 days on 4.9.67k image, time to change build and try to fail again.

NainKult · December 15, 2017, 5:43pm

I don't follow you, could you elaborate ?

Never had anything to do with IPv6 at my home, my ISP being a big ol' dinosaur about it.
I removed my wan6 (and my wan for that matter) interface long ago and I started crashing waaaaaay long after that (going from 4.4 to 4.9). I stopped crashing one week ago, without changing anything whatsoever in my network configuration. Only kernel tweaks. The only thing that would validate this wan6 theory is a relevant commit in either the IPv6 kernel stack or some weird package included in my builds which is yet to be identified.

That's why I'm trying to get people who are still crashing to test this CPUIdle / CPUGovernor theory, as it seems to be load bound from my perspective.

You poor thing

No, that's the FCC chairman. (don't hit me plz)

InkblotAdmirer · December 16, 2017, 5:55am

I have built an image with just a few of these options disabled (no extra features from the patch) -- just disabling CPU IDLE and RTC. This unit doesn't run any hotter than my 2nd device (still on 4.4).

Interestingly, if you go back to post 1 I disabled frequency governor and RTC as a first guess. If this works I'm going to kick myself for being so close to the mark and having this drag on for months.

In any case, up for 27 hours so far. If it survives another day or so I'll flash a second device. My threshold for "stable" has come down to around 5 days or so -- seems that's the recent cadence for new kernel releases.

mfka8 · December 16, 2017, 10:30am

Thanks for ignoring me @NainKult really polite (:

anomeome · December 16, 2017, 5:37pm

My statement simply meant to imply that by changing things we are impacting the outcome; if long uptime is experienced, then we attribute success. But of course without an actual capture of the issue, we have no certainty of issue resolution; and throw into that mix my current situation.
The comment regarding wan6 changes by others was meant to add credence to the notion that changing things, anything, well... changes things; given the way I run my environment, I lent no credence to the idea itself.
I would be more than willing to give your patch a shot on an image, as soon as I can break things again, not much point otherwise. I did recently push my mamba with a load, 2x100% CPU for about ~8 hours with no reboot.
The FCC chairman, and your ISP, have nothing but your best interests at heart. Why, I saw the FCC chair himself state that I would still be able to take a selfie and post to the internet, be still my beating heart. Please stand by for a slower pipe coming soon to a place near you.

I am currently reworking my image contents on a C7V2, gashing dnsmasq with other changes. Once things are working correctly on the tp-link, I will take it to my mvebu targets and see if changes to image contents don't maybe break my mamba again.

@mfka8, question asked and answered, must be expecting a different answer second time around.

sludgepump · December 16, 2017, 8:29pm

Hi and thanks to everyone for trying to sort this out.
Here are some additional data points from my side:

Environment: WRT-1900AC V1 (mamba)

Have been attempting to run various kernel 4.9 releases for many months. All releases ultimately fail with a reboot within 48 hours. No error logging occurs at the time of the reboot. The router is not heavily loaded.
Kernel version 4.4 has run rock solid on same WRT1900ACv1 for many months without errors.
I compiled a new 4.9.65 kernel last week with "CPU IDLE" and "CPUFREQ SCALING" both disabled in the kernel config. The router has now been running for over 8 days without reboots or errors.
Today I have recompiled a new kernel with "CPUFREQ SCALING" re-enabled and left CPU_IDLE still disabled in kernel config. Will test for one week to see the results. I will report back.

Based on some anecdotal evidence, I'm speculating that the problem may arise when the CPU is coming out of an IDLE state. On a couple of occasions my router was pretty much idle when I logged into LuCI. As soon as LuCI started, the router rebooted.
The Marvell doc shows the Armada XP CPU has 3 possible states (IDLE, DEEP IDLE, and SLEEP). Each progressive state is more power efficient but takes longer to wake up from. I'm not sure which state we are actually using in this 4.9 kernel. Maybe someone could give us some insight into how this all works.
Thanks much.
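
On the question of which idle states the running kernel actually uses: the standard Linux cpuidle sysfs tree should list them, along with how often each one was entered. Below is a minimal sketch, assuming the usual layout under /sys/devices/system/cpu/cpuN/cpuidle. The function name and the CPUIDLE_ROOT override are made up here for illustration, and the state names you see depend on the driver.

```shell
#!/bin/sh
# Sketch: list the cpuidle states the kernel exposes for one CPU.
# CPUIDLE_ROOT is a hypothetical override (handy for testing);
# by default the standard sysfs location is used.
list_idle_states() {
    cpu="${1:-cpu0}"
    root="${CPUIDLE_ROOT:-/sys/devices/system/cpu}"
    dir="$root/$cpu/cpuidle"

    # Without a cpuidle directory there is nothing to report, which is
    # what one would expect on a kernel built with CPU_IDLE disabled.
    if [ ! -d "$dir" ]; then
        echo "no cpuidle directory for $cpu" >&2
        return 1
    fi

    for state in "$dir"/state*; do
        [ -d "$state" ] || continue
        name=$(cat "$state/name")      # state name as given by the driver
        usage=$(cat "$state/usage")    # number of times the state was entered
        time_us=$(cat "$state/time")   # total time spent in it, microseconds
        echo "${state##*/}: $name usage=$usage time_us=$time_us"
    done
}
```

Calling `list_idle_states cpu0` on the router before and after a quiet period shows which states the counters are increasing for.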


mfka8 · December 17, 2017, 12:13pm

@anomeome Total nonsense. I asked something different a few posts above. He also offered to build an image for the WRT1200AC, which he no longer responds about. The kernel log still contains this line: "[178472.710791] [] (arch_cpu_idle) from [] (cpu_startup_entry+0xf0/0x19c)". Why, if cpu_idle is disabled? It seems to me the issue is related to it, and it is obvious that all Armada CPUs have this issue, not just the WRT1900 v1 with the XP CPU. My reboots also point that way: they happened under heavy CPU load or when the CPU went from idle to load. For example, one happened the second I opened LuCI in a web browser => CPU load.

NainKult · December 17, 2017, 2:37pm

@mfka8 I really suggest you start a new thread. You obviously have the knowledge and came to us with the right bits of information (backtrace, information about your env, steps to reproduce) but hijacking an existing thread is a bad way to do it. Most of the "forums guys" don't like that. Frankly, I don't care, your issue is somewhat related but you'll have little to no feedback from this particular thread, as most of us don't even own a WRT1200AC. I'm not deliberately ignoring you and I hold no grudge against you. I don't know what you expect of me so I can't answer.

I am pretty sure I never said that and I'm too lazy to dig up my old posts. Feel free to quote me if I did. Will make my next build for all platforms but you should try to ask nicely, it works most of the time (heck it even works when you're not).

I have good news for you

root@net002:~# uptime
 15:14:56 up 5 days, 17:05,  load average: 0.00, 0.00, 0.00
root@net002:~# cat /etc/openwrt_version
r5493+3-b8220883fd

Edit 2: @mfka8 Previous link has been updated with a build for all mvebu targets. This build has not been tested at all, use at your own risk.

mfka8 · December 18, 2017, 1:07pm

It was JTRealms who offered that, so I apologize; I thought it was you because of your patch file. If you find the time, @NainKult, please post a patch file for the WRT1200 too. Or is your "kernel roulette" patch file not limited to Armada XP CPUs, i.e. are the settings you changed global, so that they would work out of the box for the other models too (Armada 385)?

NainKult · December 18, 2017, 1:38pm

Patch is common to all mvebu targets

InkblotAdmirer · December 18, 2017, 2:18pm

I am actually starting to believe, here. I removed just the 3 CPU_IDLE configs and the 5 RTC configs and one mamba has been up 3 1/2 days.

I just now flashed a 2nd mamba using a config with just the CPU_IDLE config flags unset.

I can't tell the difference in CPU temps with and without the patches. I flashed a Shelby and Caiman device as well, and the same -- these CPUs run hot whether CPU_IDLE is enabled or not.

This will be nice -- the wireless seems to play with 4.9 better than 4.4 -- transferring large files I see peaks of ~60MB/s and iperf approaches 600 Mb/s. Not to mention being able to build with just one config.

anomeome · December 18, 2017, 5:00pm

This, assuming that community build got it right.

@mfka8, you continue to conflate what are, imo, two very different issues that manifest in very different ways. As far as I can tell, you are the only one seeing whatever the issue is that you are reporting. I have not seen anything untoward occurring from an image running on a rango.

mfka8 · December 19, 2017, 11:55am

Is there a way to get the CPU and, more importantly, the RAM clock values? I am wondering whether the RAM clock and/or voltage values in particular are, for whatever reason, incorrect (for the model), maybe since a specific kernel version. This may also be the reason why disabling cpuidle and cpufreq seems to help some people.
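
A possible starting point for reading clock values, offered as a sketch rather than a tested recipe for these boards: the current CPU frequency is normally exposed by cpufreq in sysfs (in kHz), and the kernel's clock tree can be dumped from debugfs if the kernel was built with debugfs support and it is mounted at /sys/kernel/debug. Whether a DRAM clock appears there, and under which name, depends on the platform, so the search pattern is left to the caller. The function names and the CPU_ROOT / CLK_SUMMARY overrides are hypothetical. Note that cpufreq will not be present on a build where CPU frequency scaling is disabled, and this does not cover voltages.

```shell
#!/bin/sh
# Sketch: read the current CPU frequency (kHz) from cpufreq sysfs.
# CPU_ROOT is a hypothetical override used for testing.
cpu_freq_khz() {
    cpu="${1:-cpu0}"
    root="${CPU_ROOT:-/sys/devices/system/cpu}"
    f="$root/$cpu/cpufreq/scaling_cur_freq"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "cpufreq not available for $cpu" >&2
        return 1
    fi
}

# Sketch: search the clock summary from debugfs for a name pattern.
# CLK_SUMMARY is a hypothetical override; the default path assumes
# debugfs is mounted at /sys/kernel/debug.
find_clocks() {
    pattern="$1"
    summary="${CLK_SUMMARY:-/sys/kernel/debug/clk/clk_summary}"
    if [ ! -r "$summary" ]; then
        echo "cannot read $summary (is debugfs mounted?)" >&2
        return 1
    fi
    # Case-insensitive match; prints the matching lines unchanged.
    grep -i -- "$pattern" "$summary"
}
```

For example, `cpu_freq_khz cpu0` prints the frequency of the first core, and `find_clocks ddr` or `find_clocks dram` can be tried to look for a memory clock entry.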

davidc502 · December 20, 2017, 1:47am

I'm testing the patch NainKult provided for a v1 build.

Are these the correct parameters or answers to the questions? Thanks ----

CPU Frequency scaling

CPU Frequency scaling (CPU_FREQ) [Y/n/?] y
CPU frequency transition statistics (CPU_FREQ_STAT) [Y/n/?] y
CPU frequency transition statistics details (CPU_FREQ_STAT_DETAILS) [N/y/?] n
Default CPUFreq governor
1. performance (CPU_FREQ_DEFAULT_GOV_PERFORMANCE)
2. powersave (CPU_FREQ_DEFAULT_GOV_POWERSAVE)
3. userspace (CPU_FREQ_DEFAULT_GOV_USERSPACE)
4. ondemand (CPU_FREQ_DEFAULT_GOV_ONDEMAND)
5. conservative (CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
6. schedutil (CPU_FREQ_DEFAULT_GOV_SCHEDUTIL)
choice[1-6?]: 4
'performance' governor (CPU_FREQ_GOV_PERFORMANCE) [Y/?] y
'powersave' governor (CPU_FREQ_GOV_POWERSAVE) [N/m/y/?] n
'userspace' governor for userspace frequency scaling (CPU_FREQ_GOV_USERSPACE) [N/m/y/?] n
'ondemand' cpufreq policy governor (CPU_FREQ_GOV_ONDEMAND) [Y/?] y
'conservative' cpufreq governor (CPU_FREQ_GOV_CONSERVATIVE) [N/m/y/?] n
'schedutil' cpufreq policy governor (CPU_FREQ_GOV_SCHEDUTIL) [N/y/?] n

CPU frequency scaling drivers

Generic DT based cpufreq driver (CPUFREQ_DT) [N/m/y/?] (NEW) y
Generic ARM big LITTLE CPUfreq driver (ARM_BIG_LITTLE_CPUFREQ) [N/m/y/?] n
CPU frequency scaling driver for Freescale QorIQ SoCs (QORIQ_CPUFREQ) [N/m/y/?] n
*

ARM CPU Idle Drivers

Generic ARM/ARM64 CPU idle Driver (ARM_CPUIDLE) [N/y/?] n
CPU Idle Driver for mvebu v7 family processors (ARM_MVEBU_V7_CPUIDLE) [N/y/?] (NEW) n
*

sludgepump · December 20, 2017, 6:33pm

@davidc502
Hi David:
Just wanted to relate that I've been running now for almost 4 days without reboots on my mamba.
In my particular configuration only CPU IDLE is disabled. I left CPU FREQ (the 4.9 default) enabled. So far so good.
Here are the lines that were "deleted" by the "kernel menuconfig" program in the "config-4.9" file:
CONFIG_ARM_MVEBU_V7_CPUIDLE=y
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_PM=y

I believe you can either just delete those lines or change to:
# CONFIG_ARM_MVEBU_V7_CPUIDLE is not set
# CONFIG_CPU_IDLE is not set
# CONFIG_CPU_IDLE_GOV_LADDER is not set
# CONFIG_CPU_PM is not set
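
For anyone who prefers to script that change instead of editing by hand, here is a small sketch. It rewrites each `CONFIG_<symbol>=...` line into the `# CONFIG_<symbol> is not set` form shown above. The function name is made up for illustration, and the config file path in the usage note is an assumption about where the mvebu kernel config lives in the tree.

```shell
#!/bin/sh
# Sketch: mark the given Kconfig symbols as "not set" in a config file.
# Usage: disable_kconfig <config-file> <SYMBOL> [<SYMBOL> ...]
# Symbols are given without the CONFIG_ prefix.
disable_kconfig() {
    file="$1"
    shift
    for sym in "$@"; do
        # The "=" after the symbol name keeps CPU_IDLE from also
        # matching longer names such as CPU_IDLE_GOV_LADDER.
        sed "s/^CONFIG_${sym}=.*/# CONFIG_${sym} is not set/" "$file" > "$file.tmp" \
            && mv "$file.tmp" "$file"
    done
}
```

Usage would look like `disable_kconfig target/linux/mvebu/config-4.9 ARM_MVEBU_V7_CPUIDLE CPU_IDLE CPU_IDLE_GOV_LADDER CPU_PM`. Lines for symbols that are not present in the file are left alone; nothing is appended.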

Sorry I can't help with the CPU FREQ changes, but I'm sure @NainKult or someone else can help with this.
In addition, my temperatures do not appear to be excessive: (however, this is a lightly loaded router)

root@linksys-router:~# sensors
armada_thermal-virtual-0
Adapter: Virtual device
temp1: +63.9°C

tmp421-i2c-0-4c
Adapter: mv64xxx_i2c adapter
temp1: +50.8°C
temp2: +52.9°C
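
If the `sensors` tool is not installed, the same kind of reading can usually be taken straight from the kernel's thermal sysfs interface, which reports millidegrees Celsius. A minimal sketch, assuming the SoC sensor shows up as thermal_zone0 (that zone name is an assumption, check /sys/class/thermal on your device); the function name and the THERMAL_ROOT override are hypothetical.

```shell
#!/bin/sh
# Sketch: print a thermal zone temperature in degrees Celsius with one
# decimal place. Assumes a non-negative reading in millidegrees.
# THERMAL_ROOT is a hypothetical override used for testing.
read_temp_c() {
    zone="${1:-thermal_zone0}"
    root="${THERMAL_ROOT:-/sys/class/thermal}"
    f="$root/$zone/temp"
    if [ ! -r "$f" ]; then
        echo "no readable $f" >&2
        return 1
    fi
    milli=$(cat "$f")
    # Integer degrees, then the first decimal digit of the remainder.
    echo "$(( $milli / 1000 )).$(( ($milli % 1000) / 100 ))"
}
```

Logging `read_temp_c` every few minutes on two builds, one with CPU IDLE enabled and one without, would give a more direct temperature comparison than a single snapshot.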

Hope this helps and thanks for all you do for the community.

NainKult · December 20, 2017, 9:27pm

@davidc502 On which LEDE tag are you trying to merge the patch ? I remember I tried to apply it to 17.01 earlier and I had similar issues. I could rework the patch if you tell me more about your tree.

Btw, sorry for the link instability today, I had to rework my whole frontend architecture and one of the downsides was that short URL redirects were down with a 503.

Edit 1: I really need to ditch the useless debug overhead anyway...

davidc502 · December 20, 2017, 10:23pm

I'm building from lede trunk or daily "snapshot", so I wouldn't expect the results of the patch to be any different from 17.01. In this case it is r5572.

Appreciate the effort to rework this patch.

NainKult · December 21, 2017, 4:45pm

@davidc502 I can confirm my patch is applying correctly on a clean tree.
You may have forgotten to remove your tmp/ directory and/or do a make clean prior to make world.

However, I reworked the patch anyway because of the -now useless- debug flags
[LEDE-DEV] kernel: mvebu: remove CPU power management features

Be careful, it applies to all mvebu targets.
And if it doesn't work, nuke your tree with make distclean and start again (backup your files first !).
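
To find out whether a patch applies before touching the tree, `git apply --check` does a dry run. A minimal sketch, assuming the tree is a git checkout and the patch is a plain diff file; the function name is made up for illustration and the patch path must be absolute because the function changes directory.

```shell
#!/bin/sh
# Sketch: apply a patch to a source tree only if it applies cleanly.
# Usage: apply_patch_checked <tree-dir> <absolute-path-to-patch>
apply_patch_checked() {
    tree="$1"
    patch_file="$2"
    (
        # Run in a subshell so the caller's working directory is untouched.
        cd "$tree" || exit 1
        if git apply --check "$patch_file"; then
            git apply "$patch_file"
        else
            echo "patch does not apply cleanly to $tree" >&2
            exit 1
        fi
    )
}
```

If the check passes and the patch is applied, the cleanup steps mentioned above still apply before rebuilding: remove tmp/, run make clean, then make world, and fall back to make distclean only if that fails.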

Edit 1: Gonna bump my kernel with this new patch so I want to share some sweet, sweet uptime before it's gone.

root@net002:~# uptime
 18:20:03 up 9 days, 20:10,  load average: 0.12, 0.20, 0.09
root@net002:~# cat /etc/openwrt_version 
r5493+3-b8220883fd