[Solved] WRT1900ACV1 reboots: kernel 4.9

anomeome · November 25, 2017, 6:19pm

Getting DSA here would not be a trivial pursuit, even starting with the work already available from the @sera OpenWrt tree. When talking about mvneta, what is really meant is the set of LEDE-specific patches in support of the offload units (BM, BQL) around this area.

@NainKult, I have always been running with serial connected, nothing presents.

NainKult · November 25, 2017, 6:57pm

Yep, again, big file transfer, crashed right after starting it. Nothing on serial until it goes to u-boot.

-- 0:ttyU0 -- time-stamp -- Nov/25/17 17:35:45 --

** crash **

-- 0:ttyU0 -- time-stamp -- Nov/25/17 19:50:19 --

^MBootROM 1.20
^MBooting from NAND flash

JTRealms · November 26, 2017, 3:26am

Is there a timeframe for when LEDE will switch to the next kernel version?

anomeome · November 26, 2017, 4:47pm

There is some discussion to be had on the ML, but I suppose PR1546 will provide reason enough to reexamine all the target specific patches in the tree.

NainKult · November 27, 2017, 1:17pm

@nbd Would you be so kind as to reopen issue #888? Considering how many users still experience this issue, and based on the feedback of the community, it seems that the current mwlwifi bug is unrelated to the continuous mamba crashes.

NainKult · November 30, 2017, 3:24pm

[PRODUCTION][netadm@net003.ncnet.local]/tmp: cat serial-net002.log.0 | grep "Booting from NAND" | wc -l
3

Rebooted 3 times in 4 hours (log rotate every day at 12PM GMT+1). It seems to get worse the longer it stays powered on, and by powered on I mean plugged into the wall outlet. Do you power cycle when flashing? I know I don't, I upgrade remotely*

*don't judge me plz

Edit 1: Make that 5 reboots in 4h30mins ohgodwhy.jpg
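
For anyone who wants a bit more context than a bare count, here is a minimal sketch that prints the last time-stamp line seen before each boot banner. It assumes the log format shown earlier in the thread (screen-style `time-stamp` lines and a `Booting from NAND` banner); the function name `list_reboots` is made up for illustration.

```shell
#!/bin/sh
# Sketch only: list each reboot found in a serial log together with the
# most recent "time-stamp" line that preceded it, then print a total.
# Assumes the log layout quoted earlier in this thread.
list_reboots() {
    awk '
        /time-stamp/        { stamp = $0 }
        /Booting from NAND/ {
            n++
            print n ": " (stamp != "" ? stamp : "(no time-stamp seen)")
        }
        END { print "total: " (n + 0) }
    ' "$1"
}
```

Usage would be along the lines of `list_reboots serial-net002.log.0`. If only the count is needed, `grep -c "Booting from NAND" serial-net002.log.0` gives the same number as the `cat | grep | wc -l` pipeline.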

anomeome · November 30, 2017, 3:49pm

I don't power cycle when I flash a new image. When I was experiencing reboots, I could generate an image that would behave that way, then the next one would last for days/weeks. Now I don't seem to be able to make one that fails; sorry to be so negative.

I am using procd fan_control rather than the OOTB cron job. Temperature thresholds are lower, and the fan is at 50% minimum speed rather than off. It maintains lower temps, but I don't think this is related.

NainKult · November 30, 2017, 3:58pm

Could you send me your config seed if that's ok with you ?

I'm using the cronjob, but with 128->196->255 instead of 0->128->255. My router is right next to the lateral fan intake of my DIY rack.
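
To make the stepped mapping concrete, here is a rough sketch of a 128->196->255 curve as a shell function. The PWM steps are the ones quoted above; the temperature thresholds `LOW_C` and `HIGH_C` and the function name `pwm_for_temp` are placeholders, not values taken from the stock script or from anyone's build.

```shell
#!/bin/sh
# Sketch only: map a CPU temperature (whole degrees C) to a PWM value.
# PWM steps 128/196/255 come from the post above.
# LOW_C and HIGH_C are hypothetical thresholds; tune them for your unit.
LOW_C=${LOW_C:-65}
HIGH_C=${HIGH_C:-80}

pwm_for_temp() {
    temp_c=$1
    if [ "$temp_c" -ge "$HIGH_C" ]; then
        echo 255    # full speed
    elif [ "$temp_c" -ge "$LOW_C" ]; then
        echo 196    # middle step
    else
        echo 128    # minimum step, fan never fully off
    fi
}
```

The result would then be written to whatever PWM node the device exposes; that path is device-specific and deliberately left out here.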

anomeome · November 30, 2017, 4:14pm

Here is a configdiff. I build for a mamba and rango using the per device rootfs functionality which is the reason for the defines at the top.

2devnull · November 30, 2017, 5:11pm

I cannot seem to make one that doesn't constantly reboot. I'm using x86/64 though.

I have tried snapshot, 17.01.4 and 17.01.2, same thing.

Frustrating as it doesn't give the reason why.

Sorry to hijack, hoping resolution here would be the fix for my device type also.

shm0 · November 30, 2017, 9:26pm

did you try a different power supply?

NainKult · December 1, 2017, 9:03am

I really think it is CPU / load bound and not network related, as stated before. A btrfs filesystem check generating IOs and quite some load (checksumming) makes the router crash in less than an hour.

Edit 1: I failed to mention, network interfaces were brought down, i was doing my business on the serial console.

NainKult · December 1, 2017, 12:07pm

I was hoping to catch something fancy that would explain your router's stability. Do you run some kernel patches? Specific kernel options added / removed?

anomeome · December 1, 2017, 4:47pm

Pretty vanilla except maybe turning on seccomp/namespaces and using jail. Currently git status is just master with the addition of the fan_monitor stuff, my file/ directory is just some localized setup configuration, and nothing done with kernel_menuconfig. So no, nothing special about my tree.

Regarding the thought about it being load, I have seen many reboots when the device was just sitting idle, but perhaps related to an IRQ issue as per the last post by @nbd, or memory(over/under).

NainKult · December 1, 2017, 5:20pm

It's the exact opposite for me. The longest periods of uptime come from not putting the thing under load. But I do note some strange behavior: after flashing an image built on a clean toolchain (that is, after make distclean), the router is fairly stable for a short period of time, even under extreme load (see #90). And then it slowly starts to lose its shit until it convinces me to do some kernel config shenanigans and build another image.

For instance, today I flashed an image and immediately afterwards ran a complete btrfs filesystem check (4.5TB of data, around 10k files). I couldn't complete it with the previous image, which crashed after 20-30 mins (hence the 7 reboots in 5 hours), but today it ran like a charm for two hours. I now have 6h of uptime.

Could be my imagination tho. I ran out of ideas two months ago anyway... At this point, I would even blame the russians. Or Belkin. Or both.

2devnull · December 1, 2017, 6:15pm

I believe the explanation for the degradation over time might be a filesystem corruption issue that slowly manifests itself. I'm not sure what is causing it, but unless you're running fsck all the time, something is messing with it. I notice my root partition (sda2), which is ext4, gets remounted read-only and then the rebooting issues start. Without usage the system seems to be OK, but with usage it reboots. I tried different drives and USB thumb drives, same issue. I ran Ubuntu Desktop and Debian Jessie on the same hardware with no issues.
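
One way to catch the read-only remount as it happens, rather than after the reboot, is to poll the mount table. The following is a minimal sketch that lists mount points carrying the `ro` option; the function name `ro_mounts` is made up, and the optional argument exists only so the parsing can be tried against a saved copy of `/proc/mounts`.

```shell
#!/bin/sh
# Sketch only: print every mount point whose option list contains "ro".
# Reads /proc/mounts by default; pass another file to test the parsing.
# Fields in /proc/mounts: device, mount point, fs type, options, dump, pass.
ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") {
                print $2 " (" $1 ", " $3 ")"
                break
            }
    }' "${1:-/proc/mounts}"
}
```

Some mounts are read-only by design (a squashfs root, for example), so the output needs filtering for the partition of interest, here sda2. The kernel log from around the time of the remount is the place to look for the reason.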

anomeome · December 1, 2017, 9:23pm

I got curious regarding the load idea. I created a small shell script with a while loop to exercise just the CPU (and the CESA unit, via openssl speed). So no action on the mwlwifi, LAN/WAN, USB/eSATA... fronts. I am running >1 instance of the script to keep the device pegged. So far it just keeps ticking; getting tired of listening to the fan running at full tilt though.
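
A load loop of this kind could look roughly like the sketch below. It is not the actual script used here: the function name `burn_cpu`, the bounded round count and the default cipher are all assumptions, and whether `openssl speed` reaches the CESA unit at all depends on how OpenSSL is built and configured on the image.

```shell
#!/bin/sh
# Sketch only: run a CPU-bound command repeatedly for a fixed number of rounds.
# With no command given it falls back to "openssl speed aes-256-cbc"
# (the cipher is an arbitrary choice for illustration).
burn_cpu() {
    rounds=$1
    shift
    [ "$#" -gt 0 ] || set -- openssl speed aes-256-cbc
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Stop early if the command fails (e.g. openssl is not installed).
        "$@" >/dev/null 2>&1 || return 1
        i=$((i + 1))
    done
    echo "completed $i rounds"
}
```

To keep both cores busy, start one instance per core in the background, e.g. `burn_cpu 1000 & burn_cpu 1000 & wait`.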

@2devnull, I doubt whether the issue you are seeing is related to the 4.9 reboot being experienced on the mamba. On the plus side, you have access to functionality not available on the mvebu target. IIRC you should be able to catch an oops, as the kexec / crashdump facility is in place and functioning on x86.

2devnull · December 1, 2017, 9:41pm

Can you explain how to catch the oops? I'll be happy to do it and provide any info that may help here or otherwise.

Thanks.

JTRealms · December 2, 2017, 4:42am

Has anyone tried disabling the cpuidle driver? i disabled two options relating to cpuidle in the kernel config after spending the night reading through kernel mailing lists trying to pinpoint possible leads.

There are a couple of bugs that reference the cpuidle driver causing the CPU to lock up under heavy IO loads, and my novice interpretation is that the whole cpuidle implementation for mvebu is "hacky", and it seems to fit the description of our problem. It was disabled in mainline before 4.3 (I think).

So far it's been up for 7ish hours without reboots under a mixture of light and heavy workloads. ATM I have it set up as a wireless client; it has transferred 200+ GB at a consistent rate of 50-55MBps over wifi. I'll report back if I can achieve an uptime of 48+ hours; otherwise this is hopefully useful just to rule cpuidle out as the culprit.
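
For anyone repeating this test, it may be worth confirming on the running image that cpuidle is really gone, since a leftover or forgotten patch is easy to miss. Below is a small sketch that reads the standard cpuidle sysfs node; the function name `cpuidle_driver` and the `SYSFS_CPU` override (there only for dry runs) are made up for illustration.

```shell
#!/bin/sh
# Sketch only: report the active cpuidle driver on a running system.
# Prints "none" when the sysfs node is missing (cpuidle not built in);
# the kernel itself also reports "none" when no driver is registered.
cpuidle_driver() {
    f="${SYSFS_CPU:-/sys/devices/system/cpu}/cpuidle/current_driver"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "none"
    fi
}
```

Anything other than `none` means a cpuidle driver is still registered on that build.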

NainKult · December 2, 2017, 10:24am

Yes, both cpuidle and cpu speed scaling, to no avail. Crashed 48 hours later with that patch.

Edit 1: Just noticed that i forgot to remove it from my patch directory so i'm still running mine without cpuidle and frequency scaling... ...oh well...