OpenWrt Forum Archive

Topic: Update on Linksys WRT1900AC support

The content of this topic has been archived between 16 Sep 2014 and 7 May 2018. Unfortunately there are posts – most likely complete pages – missing.

DavidMcWRT wrote:
gufus wrote:
Chadster766 wrote:

When the "rcu_sched self-detected stall on CPU { 0}" occurs isn't it supposed to clear up the stall instead of allowing CPU 0 to be stalled forever?

Fouund a error in the code in fan_ctrl.sh

if [ -d /sys/devices/pwm_fan ];then
    FAN_CTRL=/sys/devices/pwm_fan/hwmon/hwmon0/pwm1
elif [ -d /sys/devices/platform/pwm_fan ];then
    FAN_CTRL=/sys/devices/platform/pwm_fan/hwmon/hwmon0/pwm1

There is NO /sys/devices/platform/pwm_fan/hwmon/hwmon0/pwm1

I think the code's fine - from memory the if is checking for 3.18 paths and the elseif 4.x paths.

The code says "elif" not "elseif" ?

Explanation :
1.Expression allows  only constant expression
2.#elif directive means “else if”
3.#elif Establishes an if-else-if chain for multiple compilation options.
4.Result of the Expression is TRUE , then Block of Statement between #if and fist #elif is compiled , then it jumps to #endif.
5.Result of the Expression is FALSE , then Corresponding #elif condition  is tested , if true the the block followed by that elif is Compiled otherwise it checks for Next condition followed by next elif statement
6.#endif is the end of #if statement

Chadster766 wrote:

INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=7147915 c=7147914 q=97)
Wed May 27 11:55:06 2015 kern.info kernel: [1218508.937871] Task dump for CPU 0:
Wed May 27 11:55:06 2015 kern.info kernel: [1218508.941297] kworker/u4:0    R running      0 26554      2 0x00000002
Wed May 27 11:55:06 2015 kern.info kernel: [1218508.947946] Workqueue: phy1 ieee80211_iface_work [mac80211]

This time 5ghz did it

IMO It's the wifi driver sad

(Last edited by gufus on 27 May 2015, 23:07)

gufus wrote:
Chadster766 wrote:

INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=7147915 c=7147914 q=97)
Wed May 27 11:55:06 2015 kern.info kernel: [1218508.937871] Task dump for CPU 0:
Wed May 27 11:55:06 2015 kern.info kernel: [1218508.941297] kworker/u4:0    R running      0 26554      2 0x00000002
Wed May 27 11:55:06 2015 kern.info kernel: [1218508.947946] Workqueue: phy1 ieee80211_iface_work [mac80211]

This time 5ghz did it

IMO It's the wifi driver sad

I hope you are right and this provides the clues necessary to solve the issue.

Chadster766 wrote:

Wed May 27 11:47:29 2015 daemon.info hostapd: wlan0: STA 8c:3a:e3:6d:03:3b WPA: group key handshake completed (RSN)
Wed May 27 11:47:29 2015 daemon.info hostapd: wlan0: STA 4c:8d:79:b0:66:cd WPA: group
key handshake completed (RSN)
Wed May 27 11:47:30 2015 daemon.info hostapd: wlan0: STA 90:72:40:7b:1d:44 WPA: group key handshake completed (RSN)
Wed May 27 11:47:51 2015 daemon.info hostapd: wlan1:

See 2.4ghz was online then 5ghz crashed it

r45683 made it to just about 13 days for me... 

jumping to latest trunk and will leave it alone as long as possible...

cheers

Chadster766 wrote:
gufus wrote:
Chadster766 wrote:

INFO: rcu_sched self-detected stall on CPU { 0}  (t=6000 jiffies g=7147915 c=7147914 q=97)
Wed May 27 11:55:06 2015 kern.info kernel: [1218508.937871] Task dump for CPU 0:
Wed May 27 11:55:06 2015 kern.info kernel: [1218508.941297] kworker/u4:0    R running      0 26554      2 0x00000002
Wed May 27 11:55:06 2015 kern.info kernel: [1218508.947946] Workqueue: phy1 ieee80211_iface_work [mac80211]

This time 5ghz did it

IMO It's the wifi driver sad

I hope you are right and this provides the clues necessary to solve the issue.

Well, the same thing comes up in the logs over-and-over sad

Chadster766 wrote:
gufus wrote:
Chadster766 wrote:

It also looks more like this is relate to the "/sbin/fan_ctrl.sh" more than wireless at this point.

I'm going to change the schedule to once every 15 minutes for /sbin/fan_ctrl.sh crontab and see what happens.

Maybe eh...

Hope thats the prob

When the "rcu_sched self-detected stall on CPU { 0}" occurs isn't it supposed to clear up the stall instead of allowing CPU 0 to be stalled forever?

There isn't really that much to do, is there?  The CPU is running buggy code.  If you knew how to automatically fix that code, then you would probably have run that automated fixup on the source _before_ building the kernel/driver that is failing, wouldn't you?

But the kernel will try its best, using the available means. Which is basically limited to forcing the scheduler to run whatever is waiting in line. This code follows the warning and stacktrace dump:

        /*
         * Attempt to revive the RCU machinery by forcing a context switch.
         *
         * A context switch would normally allow the RCU state machine to make
         * progress and it could be we're stuck in kernel space without context
         * switches for an entirely unreasonable amount of time.
         */
        resched_cpu(smp_processor_id());

But that won't work very well if the only task waiting for CPU time is the buggy one, which likely is the case.  It will just lock up again and again.


There is actually great docs available on how to interpret this warning.  See
https://www.kernel.org/doc/Documentatio … llwarn.txt

The bug is probably a locking error in one of the functions listed in the stack trace (or another place using the same locks).  I believe i've seen a few mwlwifi functions there, which definitely is high on the list of suspects.I assume the Marvell developers are looking into it.  You'd have to get a good grasp on the driver locking design before you can try to figure out the problem.

I want to mention a scenario I had since its similar.  I was having a rcu_sched self-detected stall on CPU { 0} stall pretty often, either once a day or sometimes multiple times a day.  (This is back when I was asking how to get syslog going).  I ended up building a job to reboot the router when it sees the line ...
https://forum.openwrt.org/viewtopic.php … 30#p276130

The point though, is that when I analyzed the syslog I noticed that there was lots of dhcp release/renews going on.  I mean like 98% of the log was that!  And then I realized almost 100% of that was from just one device that just wasn't able to connect and just kept trying, (several times a second).  When I removed this device the router hasn't crashed since then (16 days ago).  I had wondered if there is maybe some type of leak occurring and the excessive repeating of this failed process caused the router to have some resource expire and locked it up?  I guess the point is, do you guys see anything else fishy in the log?

fyi: I'm still on build 45222! (its been a trooper!)

(Last edited by IvanRaide on 28 May 2015, 12:46)

IvanRaide wrote:

I want to mention a scenario I had since its similar.  I was having a rcu_sched self-detected stall on CPU { 0} stall pretty often, either once a day or sometimes multiple times a day.  (This is back when I was asking how to get syslog going).  I ended up building a job to reboot the router when it sees the line ...
https://forum.openwrt.org/viewtopic.php … 30#p276130

The point though, is that when I analyzed the syslog I noticed that there was lots of dhcp release/renews going on.  I mean like 98% of the log was that!  And then I realized almost 100% of that was from just one device that just wasn't able to connect and just kept trying, (several times a second).  When I removed this device the router hasn't crashed since then (16 days ago).  I had wondered if there is maybe some type of leak occurring and the excessive repeating of this failed process caused the router to have some resource expire and locked it up?  I guess the point is, do you guys see anything else fishy in the log?

fyi: I'm still on build 45222! (its been a trooper!)

Yup i had a device (apple tv) that was suffering from OCD or ADHD or something, kept making dhcp requests. While i never had a lockup. I ended up putting static leases for 90% of my devices at home.

P.S 45222 was a champ!!

I've been running my own custom builds for two months now. No lockups yet.


root@OpenWrt:# uptime
09:55:29 up 12 days,  4:55,  load average: 0.01, 0.02, 0.04
root@OpenWrt:# uname -a
Linux OpenWrt 3.18.11 #1 SMP Wed Apr 22 19:13:09 EDT 2015 armv7l GNU/Linux
root@OpenWrt:# cat /etc/openwrt_version
r45566

3.18.11 kernel

IvanRaide wrote:

I want to mention a scenario I had since its similar.  I was having a rcu_sched self-detected stall on CPU { 0} stall pretty often, either once a day or sometimes multiple times a day.  (This is back when I was asking how to get syslog going).  I ended up building a job to reboot the router when it sees the line ...
https://forum.openwrt.org/viewtopic.php … 30#p276130

The point though, is that when I analyzed the syslog I noticed that there was lots of dhcp release/renews going on.  I mean like 98% of the log was that!  And then I realized almost 100% of that was from just one device that just wasn't able to connect and just kept trying, (several times a second).  When I removed this device the router hasn't crashed since then (16 days ago).  I had wondered if there is maybe some type of leak occurring and the excessive repeating of this failed process caused the router to have some resource expire and locked it up?  I guess the point is, do you guys see anything else fishy in the log?

fyi: I'm still on build 45222! (its been a trooper!)

I checked for a memory leaks just before one of the lockups and the memory usage was fine so at least its not that.

That is really interesting stuff. I used to be badly affected by the lock but now its a 5-8 day type issue and my record is 18 days uptime. I didn't static IP's but I did set static reservations for DHCP so I could do some iptables tracing.

I also logged kernel logs to USB for a couple of months now and see similar. But in terms of fishy things, I noted that before each crash system load goes up... In other words, there is a leading indicator that shows a crash will happen.

I didn't see much in the logs to suggest the cause, but clearly that's an avenue to go down. Why not add lucu-statistics in and keep an eye on it? I'm sure this is not just me seeing that...

Check out the image captured at this post in this thread...

https://forum.openwrt.org/viewtopic.php … 26#p272126

regards

Paul

IvanRaide wrote:

I want to mention a scenario I had since its similar.  I was having a rcu_sched self-detected stall on CPU { 0} stall pretty often, either once a day or sometimes multiple times a day.  (This is back when I was asking how to get syslog going).  I ended up building a job to reboot the router when it sees the line ...
https://forum.openwrt.org/viewtopic.php … 30#p276130

The point though, is that when I analyzed the syslog I noticed that there was lots of dhcp release/renews going on.  I mean like 98% of the log was that!  And then I realized almost 100% of that was from just one device that just wasn't able to connect and just kept trying, (several times a second).  When I removed this device the router hasn't crashed since then (16 days ago).  I had wondered if there is maybe some type of leak occurring and the excessive repeating of this failed process caused the router to have some resource expire and locked it up?  I guess the point is, do you guys see anything else fishy in the log?

fyi: I'm still on build 45222! (its been a trooper!)

paul.culmsee wrote:

That is really interesting stuff. I used to be badly affected by the lock but now its a 5-8 day type issue and my record is 18 days uptime. I didn't static IP's but I did set static reservations for DHCP so I could do some iptables tracing.

I also logged kernel logs to USB for a couple of months now and see similar. But in terms of fishy things, I noted that before each crash system load goes up... In other words, there is a leading indicator that shows a crash will happen.

I didn't see much in the logs to suggest the cause, but clearly that's an avenue to go down. Why not add lucu-statistics in and keep an eye on it? I'm sure this is not just me seeing that...

Check out the image captured at this post in this thread...

https://forum.openwrt.org/viewtopic.php … 26#p272126

regards

Paul

That's very interesting and good troubleshooting. I recommend that testers create a cron job to export "ps" or "top" to a file every 10 minutes (not append). This should capture the process overload before crash.

IvanRaide wrote:

I want to mention a scenario I had since its similar.  I was having a rcu_sched self-detected stall on CPU { 0} stall pretty often, either once a day or sometimes multiple times a day.  (This is back when I was asking how to get syslog going).  I ended up building a job to reboot the router when it sees the line ...

Care to share your script that does this?

Sounds like a good workaround for now. smile

gufus wrote:
IvanRaide wrote:

I want to mention a scenario I had since its similar.  I was having a rcu_sched self-detected stall on CPU { 0} stall pretty often, either once a day or sometimes multiple times a day.  (This is back when I was asking how to get syslog going).  I ended up building a job to reboot the router when it sees the line ...

Care to share your script that does this?

Sounds like a good workaround for now. smile

Sure, but it requires some setup to work (getting logd to write to a syslog file), setting up as a service if you want that; I have it running as a service but you could just cron job on a higher interval (notice I have an aggressive 30 second check).  Therefore it might be simpler to rip it apart to conform it to your needs.   Why 190 lines on tail?  The reboot is over 190 lines in the log so you will not hit a previous stall that way.  I am making some assumptions, if anything isn't clear, tell me and I can go over it more.

#!/bin/sh /etc/rc.common

FileFullPath="/mine/scripts/WatchDogRebooter/log.txt"
AppName="WatchDogRebooter v1.0"
SysLogFile="/syslog"

#cleanup
echo `date` "-- Moving to previous"  >> $FileFullPath
cp -f $FileFullPath $FileFullPath.previous.txt
rm $FileFullPath
echo `date` "-- *** Starting " $AppName "*** -- "  >> $FileFullPath


while test 1
do
    TailResult=`tail -n 190 /syslog | grep 'rcu_sched self-detected stall on CPU { 0}'`
    if [ "$TailResult" != "" ]
    then
        echo `date` " -- *** found a stalled CPU, I'll reboot *** -- " $TailResult "  " >> $FileFullPath    
        echo `date` " -- *** found a stalled CPU, I'll reboot *** -- " $TailResult "  " >> $SysLogFile    
        /sbin/reboot -f
    fi
sleep 30
done
IvanRaide wrote:

Sure, but it requires some setup to work (getting logd to write to a syslog file),


logread -f  >> /mnt/sdb1/syslog

This would work instead of logd, at startup and run your script via cron.

gufus wrote:
IvanRaide wrote:

Sure, but it requires some setup to work (getting logd to write to a syslog file),


logread -f  >> /mnt/sdb1/syslog

This would work instead of logd, at startup and run your script via cron.

But we have lots of log recordings already. We really need process cpu % now recorded just before lockup.

I actually do the logging from config, but there are lots of different ways.  I also have a cron job that rolls the syslog (keeping 2) based on size.  I initially had the job as for rebooting as cron only, but wanted to make it a service so I could stop and start easily, but there are lots of different ways to do this.

Chadster766: As soon as I get a lockup I'll give a log :-) 

config system
    option hostname 'OpenWrt'
    option zonename 'America/Toronto'
    option timezone 'EST5EDT,M3.2.0,M11.1.0'
    option cronloglevel '8'
    option log_file '/syslog'
    option conloglevel '7'
IvanRaide wrote:

I actually do the logging from config, but there are lots of different ways.  I also have a cron job that rolls the syslog (keeping 2) based on size.  I initially had the job as for rebooting as cron only, but wanted to make it a service so I could stop and start easily, but there are lots of different ways to do this.

Chadster766: As soon as I get a lockup I'll give a log :-) 

config system
    option hostname 'OpenWrt'
    option zonename 'America/Toronto'
    option timezone 'EST5EDT,M3.2.0,M11.1.0'
    option cronloglevel '8'
    option log_file '/syslog'
    option conloglevel '7'

Will this log the processes cpu usage percentages? I had a lockup today but didn't have a script in place to capture processes cpu loads.

Can I just put this in the system config and it work for this purpose?

Hi,

Sorry but the connection throught 5GHz is too slow & buggy. I have to reboot everyday. The wife's level of complain is too hight wink
I am reverting back to stock firmware. Shame on Belkin/Linksys who have said that this device will be openwrt compliant.

regards,

IvanRaide wrote:

Sure, but it requires some setup to work

Thanks for your script smile

I have everything writing to external storage and your scrip running via cron. Should work.

bdherouville wrote:

Hi,

Sorry but the connection throught 5GHz is too slow & buggy. I have to reboot everyday. The wife's level of complain is too hight wink
I am reverting back to stock firmware. Shame on Belkin/Linksys who have said that this device will be openwrt compliant.

regards,

Like any other company, this is work in progress. I would wait another month or so, and I'm sure these issues will be fixed wink

Chadster766 wrote:
gufus wrote:
IvanRaide wrote:

Sure, but it requires some setup to work (getting logd to write to a syslog file),


logread -f  >> /mnt/sdb1/syslog

This would work instead of logd, at startup and run your script via cron.

But we have lots of log recordings already. We really need process cpu % now recorded just before lockup.

Yes, I'ed bet it's a high load.

I just setup WatchDogRebooter v1.0

Chadster766 wrote:
IvanRaide wrote:

I actually do the logging from config, but there are lots of different ways.  I also have a cron job that rolls the syslog (keeping 2) based on size.  I initially had the job as for rebooting as cron only, but wanted to make it a service so I could stop and start easily, but there are lots of different ways to do this.

Chadster766: As soon as I get a lockup I'll give a log :-) 

config system
    option hostname 'OpenWrt'
    option zonename 'America/Toronto'
    option timezone 'EST5EDT,M3.2.0,M11.1.0'
    option cronloglevel '8'
    option log_file '/syslog'
    option conloglevel '7'

Will this log the processes cpu usage percentages? I had a lockup today but didn't have a script in place to capture processes cpu loads.

Can I just put this in the system config and it work for this purpose?

No, the config will only make the logging (that is going to memory) go to a file.
And syslog doesn't by default log cpu process.  I believe you would need collectd installed to do that stuff.  If you see paul.culmsee previous note about luci-app-statistics it will stall components like this but I'm not that familiar with it to suggest the proper setup.  In a real brute-force kind of way you could do something like...

echo `date` "-- TOP output -- "  >> /OtherLogFile
top -n 1 >> /OtherLogFile

so that it outputs CPU load (to a different log) and do that on a cron job at the interval you want?  (but this is pretty crude)

(Last edited by IvanRaide on 29 May 2015, 00:36)

Sorry, posts 5251 to 5250 are missing from our archive.