[Solved] WRT1900ACV1 reboots: kernel 4.9

InkblotAdmirer · March 2, 2017, 12:19pm

Update: with the patch below, both of my mamba devices have been rock stable for weeks on kernel 4.9. The only open issue (if 4.14 doesn't fix mamba) is how to roll this into trunk, since people may not want to disable CPU idle on non-mamba devices.

mamba cpu idle patch

--- a/target/linux/mvebu/config-4.9
+++ b/target/linux/mvebu/config-4.9
@@ -44,7 +44,7 @@ CONFIG_ARM_HEAVY_MB=y
 CONFIG_ARM_L1_CACHE_SHIFT=6
 CONFIG_ARM_L1_CACHE_SHIFT_6=y
 # CONFIG_ARM_LPAE is not set
-CONFIG_ARM_MVEBU_V7_CPUIDLE=y
+# CONFIG_ARM_MVEBU_V7_CPUIDLE is not set
 CONFIG_ARM_PATCH_IDIV=y
 CONFIG_ARM_PATCH_PHYS_VIRT=y
 CONFIG_ARM_THUMB=y
@@ -94,8 +94,8 @@ CONFIG_CPU_FREQ_STAT=y
 CONFIG_CPU_HAS_ASID=y
 # CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
 # CONFIG_CPU_ICACHE_DISABLE is not set
-CONFIG_CPU_IDLE=y
-CONFIG_CPU_IDLE_GOV_LADDER=y
+# CONFIG_CPU_IDLE is not set
+# CONFIG_CPU_IDLE_GOV_LADDER is not set
 CONFIG_CPU_PABRT_V7=y
 CONFIG_CPU_PJ4B=y
 CONFIG_CPU_PM=y
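
If you would rather script the change than apply the diff, here is a minimal Python sketch for the build host (not the router). It rewrites the three symbols from the patch into the "is not set" form. The function name `disable_symbols` and the sample text are made up for illustration; only the CONFIG names come from the patch above.

```python
import re

# Symbols taken from the patch above.
MAMBA_IDLE_SYMBOLS = [
    "CONFIG_ARM_MVEBU_V7_CPUIDLE",
    "CONFIG_CPU_IDLE",
    "CONFIG_CPU_IDLE_GOV_LADDER",
]


def disable_symbols(config_text, symbols):
    """Rewrite 'CONFIG_X=...' as '# CONFIG_X is not set' for each listed symbol.

    Hypothetical helper: lines for other symbols are passed through untouched.
    """
    wanted = set(symbols)
    out = []
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_[A-Za-z0-9_]+)=", line)
        if match and match.group(1) in wanted:
            out.append(f"# {match.group(1)} is not set")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Small hand-made excerpt, not the full config-4.9 file.
sample = (
    "CONFIG_ARM_MVEBU_V7_CPUIDLE=y\n"
    "CONFIG_ARM_PATCH_IDIV=y\n"
    "CONFIG_CPU_IDLE=y\n"
    "CONFIG_CPU_IDLE_GOV_LADDER=y\n"
    "CONFIG_CPU_PM=y\n"
)
patched = disable_symbols(sample, MAMBA_IDLE_SYMBOLS)
print(patched, end="")
```

The match is on the full symbol name, so CONFIG_CPU_IDLE does not accidentally catch CONFIG_CPU_IDLE_GOV_LADDER or anything else sharing the prefix.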

Original post below...

Since the mvebu was moved to 4.9 the WRT1900ACV1 (Mamba) device has suffered continual reboots -- typically after an hour, sometimes much sooner.

I found a few suspicious CONFIG choices. If memory serves, the Mamba device doesn't populate an external component required for the RTC to operate, and the CPU governor doesn't work (or just wasn't officially configured to work when the device was added).

In any case, I have a device that's been running LEDE trunk on 4.9 for almost 10 hours now with the following patch:

<snip! (it didn't work...)>

Is anyone familiar with why these were set and if they can be unconditionally "not set" for the mvebu platform in general? Based on boot logs the Shelby unit at least initializes its RTC and we wouldn't want to disable an available resource.

I also haven't narrowed down whether the governor flag is actually an issue or if it's the RTC (and if so, whether it is necessary to disable both CONFIGs), but I think I'm at least on the right track to fix this extremely annoying issue. Comments/testing welcome.

anomeome · March 2, 2017, 8:56pm

A few data points on this:

My timeouts vary from a minute to quite a few hours.
I get no crash log file in /sys/kernel/debug/
Given the diff comparison of the 4.4 and 4.9 configs, I was wondering what led to zeroing in on those mentioned. Just wondering if you looked at CONFIG_ARMADA_370_XP_IRQ=y for instance.

Opened FS#585

Edit: Apologies, missed that FS, will request mine be closed. Should also mention that I have found nothing of use put to the serial console during an episode.

northbound · March 2, 2017, 9:22pm

I also opened the same issue: https://bugs.lede-project.org/index.php?do=details&task_id=564
But in my case it can last days before a crash/reboot. Pounding it with iperf3 causes no problems.
It tends to happen when there is no real load. I have seen it happen in Luci: first there is no response when trying to change tabs, and about 10 seconds later it reboots.

Edit: No need to apologize.
A question: do you see the same thing in your log that I posted on this task?

anomeome · March 2, 2017, 11:21pm

Sorry, national trait. Yes, I have the following on a reboot:

root@bsaedgy:/# logread | grep kern.err
Tue Feb 28 12:34:34 2017 kern.err kernel: [    1.253386] of: dev_pm_opp_of_cpumask_add_table: couldn't find opp table for cpu:0, -19
Tue Feb 28 12:34:34 2017 kern.err kernel: [    1.261423] cpu cpu1: opp_list_debug_create_link: Failed to create link
Tue Feb 28 12:34:34 2017 kern.err kernel: [    1.268075] cpu cpu1: _add_opp_dev: Failed to register opp debugfs (-12)
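
For comparing these errors across builds, a saved copy of the logread output can be filtered offline. This is only a sketch in Python for a PC; it assumes the line layout shown above (timestamp, facility.level, "kernel:", bracketed uptime, message), and the names `LOG_LINE` and `kernel_errors` are invented for the example.

```python
import re

# Assumed layout: "<date> <facility>.<level> kernel: [<uptime>] <message>"
LOG_LINE = re.compile(
    r"^(?P<stamp>\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}) "
    r"(?P<facility>\w+)\.(?P<level>\w+) kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] (?P<message>.*)$"
)


def kernel_errors(log_text):
    """Return (uptime_seconds, message) for every kern.err line in log_text."""
    errors = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("facility") == "kern" and match.group("level") == "err":
            errors.append((float(match.group("uptime")), match.group("message")))
    return errors


# The three lines quoted above, copied verbatim.
sample_log = """\
Tue Feb 28 12:34:34 2017 kern.err kernel: [    1.253386] of: dev_pm_opp_of_cpumask_add_table: couldn't find opp table for cpu:0, -19
Tue Feb 28 12:34:34 2017 kern.err kernel: [    1.261423] cpu cpu1: opp_list_debug_create_link: Failed to create link
Tue Feb 28 12:34:34 2017 kern.err kernel: [    1.268075] cpu cpu1: _add_opp_dev: Failed to register opp debugfs (-12)
"""

for uptime, message in kernel_errors(sample_log):
    print(f"{uptime:10.6f}  {message}")
```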

InkblotAdmirer · March 3, 2017, 12:30am

So based on northbound's comment ("I have seen it happen in Luci") I fired up luci and got a reboot within a few clicks. The router had an uptime of around 22 hours at that point. I have another that's been up about 12 hours; I'm just going to leave it alone doing its thing and see how long it hangs in.

As to why I focused on those particular options (and obviously they aren't the issue here) this is the complete list of additions to mvebu-specific 4.9 options:

> CONFIG_ARCH_CLOCKSOURCE_DATA=y
> CONFIG_ARMADA_370_XP_IRQ=y
> CONFIG_ARM_PATCH_IDIV=y
> CONFIG_BLK_MQ_PCI=y
> CONFIG_CPUFREQ_DT_PLATDEV=y
> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
> CONFIG_CRYPTO_CRC32=y
> CONFIG_GENERIC_EARLY_IOREMAP=y
> CONFIG_GENERIC_MSI_IRQ=y
> CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
> CONFIG_HAVE_ARM_SMCCC=y
> CONFIG_HAVE_CBPF_JIT=y
> CONFIG_MVNETA_BM_ENABLE=y
> CONFIG_PADATA=y
> CONFIG_PCI_DOMAINS=y
> CONFIG_PCI_DOMAINS_GENERIC=y
> CONFIG_PCI_MSI=y
> CONFIG_PCI_MSI_IRQ_DOMAIN=y
> CONFIG_REGMAP_SPI=y
> CONFIG_RTC_I2C_AND_SPI=y
> CONFIG_RTC_MC146818_LIB=y
> CONFIG_SERIAL_MVEBU_CONSOLE=y
> CONFIG_SERIAL_MVEBU_UART=y

anomeome · March 3, 2017, 12:41am

Yes, I have had it reboot just sitting there, as well as doing something in the gui. So it seemed the gui could initiate an event, but was not the cause. My reasoning behind the side by each diff was that it could be something omitted as well as something added.

InkblotAdmirer · March 3, 2017, 12:56am

This is the mvebu-specific set of CONFIGs that was in 4.4 but not 4.9:

< CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE=y
< CONFIG_ARCH_REQUIRE_GPIOLIB=y
< CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
< CONFIG_GPIO_DEVRES=y
< CONFIG_HAVE_BPF_JIT=y
< CONFIG_HAVE_DMA_ATTRS=y
< CONFIG_OF_MTD=y
< CONFIG_ZONE_DMA_FLAG=0

I have the generic CONFIGs separated as well, but I figured the cause was more likely to be in the mvebu-specific CONFIGs, so that's where I started.
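
For anyone who wants to reproduce this kind of added/removed list, here is a rough Python sketch of the comparison. The helper names are made up, and the two sample strings are tiny hand-made excerpts for illustration, not the real config-4.4 and config-4.9 files.

```python
def config_symbols(config_text):
    """Collect the 'CONFIG_X=value' lines; '# CONFIG_X is not set' lines are ignored."""
    return {
        line.strip()
        for line in config_text.splitlines()
        if line.startswith("CONFIG_")
    }


def compare_configs(old_text, new_text):
    """Return (removed, added): lines only in the old config, lines only in the new one."""
    old = config_symbols(old_text)
    new = config_symbols(new_text)
    return sorted(old - new), sorted(new - old)


# Illustrative excerpts only; substitute the contents of the real files.
old_44 = "CONFIG_HAVE_BPF_JIT=y\nCONFIG_OF_MTD=y\nCONFIG_CPU_PM=y\n"
new_49 = "CONFIG_HAVE_CBPF_JIT=y\nCONFIG_ARM_PATCH_IDIV=y\nCONFIG_CPU_PM=y\n"

removed, added = compare_configs(old_44, new_49)
for line in removed:
    print("<", line)
for line in added:
    print(">", line)
```

Because whole lines are compared, a symbol whose value changed between kernels shows up in both lists, which is probably what you want when hunting for a regression.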

northbound · March 3, 2017, 3:51am

I think if the debugfs gets properly loaded it would be an improvement. It would at least provide the needed clue.
Just an observation from my point of view.

anomeome · March 3, 2017, 11:44pm

I believe it is:

root@bsaedgy:/# mount | grep debugfs
debugfs on /sys/kernel/debug type debugfs (rw,noatime)

and if I force a crash:

echo c >/proc/sysrq-trigger

I do get a crash log file generated, but not in this case, obviously.

InkblotAdmirer · March 4, 2017, 2:21pm

On boot, the kernel error message (4.4) is as follows:

Thu Mar  2 18:44:24 2017 kern.err kernel: [    1.203072] cpu: dev_pm_opp_of_cpumask_add_table: couldn't find opp table for cpu:0, -19
Thu Mar  2 18:44:24 2017 kern.err kernel: [    1.212529] cpu: dev_pm_opp_of_cpumask_add_table: couldn't find opp table for cpu:1, -19

Is this a red herring or does it look like something isn't set up correctly for cpu1 on 4.9?

northbound · March 4, 2017, 6:31pm

I may have found it, but at this point it is a guess. I will test when I get home from work.
The mamba.dts says to use CONFIG_DEBUG_MVEBU_UART0_ALTERNATE.
In "make kernel_menuconfig" it was set to the old and not the new alternate, so I have made the change.
But I am not going to risk flashing remotely. Will post results here when I get home.

Edit: I think I am barking up the wrong tree. I tried old, new and mvebu uart1: no change.
Now to find the config that may be overwriting my changes in make kernel_menuconfig.
@anomeome If I do what you did to create a crashlog, it works and shows data if I cat it.
But when the crash actually occurs, the crashlog is not created. Or is this what you said above?
My problem is that lately I am up for 2 or 3 days without a crash. Then something interesting changes, so I build again and run that version for 2 or 3 days. I guess I will just let it run and see what happens.

anomeome · March 5, 2017, 12:28am

Yes, that is what I meant with my rather poorly worded statement. Should add that no crash log file generated during an event is, in and of itself, another data point.

@InkblotAdmirer, I had briefly looked at those errors yesterday and arrived at the conclusion, perhaps in haste, that they were a red herring.

Edit: Are you guys using adblock? If so, is it continually resetting and, by extension, restarting dnsmasq? At any rate I have stopped it; not sure when that started happening.

northbound · March 5, 2017, 2:27am

I will keep an eye on that, but I have not seen it. At 6am it refreshes, as per the cronjob.
Just out of curiosity, how often are crashes happening for you?
Below is my diffconfig:
pastebin.com/y4nd3vQ6

anomeome · March 5, 2017, 4:53pm

My latest build has now been running for ~17 hours, which is the longest run time for me since this started happening. If you want to compare image contents, there are the standard config.seed files in the target directories of the builds at my drop; my dog's bark is worse than her bite.

InkblotAdmirer · March 9, 2017, 1:25am

@anomeome
Any word on the stability with your changes? I haven't reviewed them since I don't know where your "drop" is.

anomeome · March 9, 2017, 2:44am

@InkblotAdmirer, I'm currently running r3640 with an uptime of > 3 days. I think the last image that exhibited the random reboot was r3624. I took a look at the git commits a couple of days back to see if there was anything that might explain a change, but nothing leapt out. Just to be clear, there is nothing in my build expressly to address this issue.

I guess my attempt at a humorous clue was too cryptic. At any rate, click on my avatar and follow the link. Build r3636-9e740fa.sdio was to play with PR893 on a rango (which I have yet to do), and was never flashed to the mamba.

northbound · March 9, 2017, 3:44am

LEDE Reboot SNAPSHOT r3644+2-7f0c95a / LuCI Master (git-17.063.72654-40b7b68)
Kernel Version 4.9.13
Local Time Wed Mar 8 22:38:57 2017
Uptime 4d 2h 18m 14s
Load Average 0.00, 0.00, 0.00
Mamba looking good using trunk and blogic's blockd patches.
I still want to know the fix for getting no crashlog when the crash is not self-inflicted.

Phil-BKK · March 9, 2017, 7:27am

I'm running a 4.9 swrt build which exhibits the same issue. I have prebuilt an r3636, ready for loading to my mamba and rango routers. Is r3636 free from reboots?

InkblotAdmirer · March 9, 2017, 11:27pm

I built from trunk r3674 last night and the mamba I installed it on rebooted within the hour. I left it running and it's now been up over 17 hours. I still think there's an issue; if it reboots again I'm sticking with 4.4 until there's a definitive fix.

northbound · March 10, 2017, 12:17pm

I made it to Uptime 5d 10h 2m 6s before a reboot on r3644.
Once again, no crashlog.

Edit: I was up just over 24 hrs and it rebooted again.
Built r3681-e58ea0a and will see if there is a difference.
My guess is no.
I find it hard to believe that others are not experiencing the random reboots.
It really would be nice to get a crashlog when this happens.