Climbing Load Average when Overview is open on Luci

btsimonh · March 8, 2017, 10:17am

May be similar to LUCI slow down to death (WR1043N V1) but that thread went off in a different direction.

I have a number of routers:
BT Homehub 2a - OpenWrt Chaos Calmer 15.05 / LuCI (git-15.248.30277-3836b45)
BT Homehub 2b - OpenWrt Chaos Calmer 15.05 / LuCI (git-15.248.30277-3836b45)
BT Homehub 5a LEDE Reboot 17.01.0 r3205-59508e3 / LuCI lede-17.01 branch (git-17.051.53299-a100738)

None have DSL connected.
2A: (bcm63xx/HOMEHUB2A (0x6358/0xA1)) - 64M Ram
If the Overview screen is open with auto-refresh on, 'Load Average' climbs rapidly to >>2.
Turn off auto-refresh, and load average falls fairly quickly.
top does not indicate excessive CPU usage.
BUT; it only does this when it gets into a certain state... maybe after Overview has been open for a long time?
(i.e. whilst writing this, I turned off auto-refresh to test, messed with XHR.js, put the changes back, and now can't reproduce...)

2B: (Danube rev 1.5) - 64M Ram
If the Overview screen is open, aggressively refreshing (F5) can cause 'Load Average' to climb, but with just auto-refresh, it falls again.
top shows /sbin/dsl_cpe_control using cpu cycles as expected.
BUT; as above, I'm sure this router exhibited the same issue yesterday...

5A: (xRX200 rev 1.2) - 128M Ram
If the Overview screen is open with auto-refresh on, 'Load Average' climbs rapidly to >>2
If dsl_control is disabled in startup and the unit rebooted, then it manages to cope; but I'm sure I've had a situation where it did not. It's almost as though 'something goes wrong' to trigger a state where multiple processes end up waiting on the same resource.

At first I blamed the DSL firmware/code, but having now seen it on both lantiq and broadcom, it feels more like a bug in the code called by the luci polling; looking through the script, though, I don't have enough experience to identify an issue. However, if it is purely that luci polling ends up starting multiple 'polls' because one takes longer than a second to service, then maybe a change in the polling style would solve it. The xhr.js, when examined closely, is fairly complex.

If anyone else is experiencing unexplained high 'Load Average' figures, the simple workaround is don't leave the Overview page open with auto-refresh turned on. But it would be good to get to the bottom of the issue...

Update:
I left to make coffee and came back to the 2A after 15 minutes to find 'Load Average' at >2.
Turning off auto-refresh caused it to fall; turning it back on caused it to rise again.
Shift-F5 in the browser, and it's now falling (with auto-refresh on).
It soon started to rise again; this time it took two shift-F5s to get it to calm down.
I do note in chrome developer tools that there are two luci polls in progress every 5 seconds (status and dyndns); on this router they take ~2.3s to service (I have a large hosts dataset it reads every time...). But there is no apparent difference in response or poll style between 'rising' mode and 'not rising' mode. I also note that on the 2B, there is only one luci request in flight at a time (no dyndns). On the 5A (currently behaving), there is one poll at a time, and LEDE seems to alternate between status and hosts data.
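
A rough back-of-envelope sketch of what those timings imply. It assumes each poll keeps roughly one process runnable or waiting on I/O for its whole service time, which is an assumption rather than something measured here; the function names and the 1-second interval in the last line are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Rough estimate of how much load a set of periodic polls could account for.
# All times are in milliseconds so plain integer shell arithmetic is enough.

# poll_duty_percent POLLS SERVICE_MS INTERVAL_MS
# Prints the average number of requests being serviced at any instant, times 100.
poll_duty_percent() {
    echo $(( $1 * $2 * 100 / $3 ))
}

# overlapping_requests SERVICE_MS INTERVAL_MS
# Prints how many copies of ONE poll can be in flight at once if a new request
# is fired every interval without waiting for the previous one (ceiling division).
overlapping_requests() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

poll_duty_percent 2 2300 5000     # two polls, ~2.3 s each, every 5 s -> 92
overlapping_requests 2300 5000    # -> 1
overlapping_requests 2300 1000    # hypothetical 1 s interval -> 3
```

On these assumptions the two polls alone would account for an average of about 0.92 tasks in flight, and neither poll overlaps with its own next run at a 5-second interval. That is consistent with a load average near 1, so on this arithmetic alone the climb past 2 is not accounted for.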

Snotmann · March 8, 2017, 10:47am

Got the same here:

Which packages do you have installed? I have vnstat, ddns with no-ip, and sqm.

btsimonh · March 8, 2017, 11:00am

packages: nothing special.
2A: unused dyndns? nano, usb. It's my main router, configured to talk to a BT openreach over ethernet for FTTC
2B: nano, usb stuff. Secondary router acts as an AP.
5A: pretty fresh LEDE, going to be for VLAN, but nothing except maybe nano.
@snotmann: check http://openwrt.ebilan.co.uk/viewtopic.php?f=7&t=200
You have crashing - my routers don't die or even become slow; they just display high load average (which can't be good anyway..), but not on luci screens which don't refresh. The above thread discusses a crash in BT HH V5; and highlights a possible commit which may cause it. I've not checked exactly where, or what routers it would affect, because for my V5, I can avoid it by turning off DSL, so I skipped that issue :), but it may be of use to you.

psyborg · March 9, 2017, 5:23pm

It has been a known issue for a long time. It happens on the wifi overview page and especially on the main overview page. Load quickly rises over 4, sometimes even 5. By refreshing the page several times it is possible to hit a case where the load drops to 0, provided there is no other device activity. This is rare, something like one in ten page loads.

motocrossmann · March 9, 2017, 5:50pm

Do you use HTTPS? I've noticed that SSL really consumes resources. I've since chosen to limit access to LUCI to my LAN only. If I need to get to it remotely, I do an SSH tunnel.

btsimonh · March 9, 2017, 6:02pm

no, http only.

hnyman · March 9, 2017, 8:44pm

Nothing new. I opened an issue about it in the LuCI bug tracker in 2015: https://github.com/openwrt/luci/issues/459

There is just too much refreshed stuff on the front page. Some of those take a long time to fetch and process the refreshed status data.

The load could be mitigated by slowing the refresh cycle, but that is maybe just a workaround for the core issue of too complex status data being fetched too often.

@Snotmann
The memory exhaustion issues with 32 MB RAM devices are mainly a different issue.

Snotmann · March 9, 2017, 9:07pm

Damn. I understand. Maybe time will bring a solution.

btsimonh · March 10, 2017, 7:36am

@snotmann - that's why I liked the BT homehubs; they did not skimp on ram; just left us with some nasty nand/usb interrupt conflicts on some models :).
@hnyman - nice writeup in your bugtracker entry. For me, its unpredictability smacks of something getting 'stuck' - it behaves ok sometimes, but once 'something' has triggered, the load avg suddenly goes into a mode where it rises until you leave the page - like an unmatched mutex; a process stalling with a lock on something which the poll uses. But then leaving the page and returning later, it's cleared and behaves normally again for a while. When in this state, CPU usage does not appear high, only load avg; but then my assessment is lacking expert linux experience.

hnyman · March 10, 2017, 7:54am

My guess is one of the periodic auto-refresh things does not finish its check before the next one kicks in. Might be ddns, upnp or something like that. Some of those packages have rather complex status derivation schemes.

You should test with browser profiling tools like I did. At least Firefox and IE have rather easily accessible developer/debug modes where you can see how long each page component takes to download. It would be great to find out which plugin causes it for you.

I will likely first test slowing down the auto-refresh pace for the front page components on my own system, and then push the changes to Luci master to mitigate things a bit. As you can see from my bug report, this is nothing new; I tested some changes two years ago but then lost interest and never pushed them to Luci master. One reason for that was the suspicion that the XHR poll rate is not the root problem, but just makes it more visible.

dlang · March 10, 2017, 8:05am

btsimonh:

@hnyman - nice writeup in your bugtracker entry. For me, its unpredictability smacks of something getting 'stuck' - it behaves ok sometimes, but once 'something' has triggered, the load avg suddenly goes into a mode where it rises until you leave the page - like an unmatched mutex; a process stalling with a lock on something which the poll uses. But then leaving the page and returning later, it's cleared and behaves normally again for a while. When in this state, CPU usage does not appear high, only load avg; but then my assessment is lacking expert linux experience.

Linux loadavg is processes wanting CPU + processes waiting on I/O

if you can ssh into the box and look at top for processes in D or Z state
(pretty much anything but S), that may spot a common culprit.
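
One way to take that snapshot without depending on the column layout of top is to read the State: line of each /proc/<pid>/status file. This is a minimal sketch using only POSIX shell builtins; the function name list_non_sleeping and its optional directory argument are made up for this example.

```shell
#!/bin/sh
# list_non_sleeping [PROC_ROOT]
# Prints "PID STATE NAME" for every process whose state is not S (sleeping)
# or I (idle kernel thread, reported by newer kernels).
# PROC_ROOT defaults to /proc; it is a parameter only so the function can be
# pointed at a test directory.
list_non_sleeping() {
    root=${1:-/proc}
    for f in "$root"/[0-9]*/status; do
        [ -r "$f" ] || continue            # process may already have exited
        pid=${f%/status}; pid=${pid##*/}   # strip path, keep the PID
        name=; state=
        # Each line looks like "Key:<tab>value ..."; Name: precedes State:.
        while read -r key val rest; do
            case $key in
                Name:)  name=$val ;;
                State:) state=$val; break ;;
            esac
        done 2>/dev/null < "$f"
        case $state in
            ''|S|I) ;;                     # unreadable, sleeping or idle: skip
            *) echo "$pid $state $name" ;;
        esac
    done
}
```

Called with no argument (list_non_sleeping) it reads the live /proc. The shell running it will normally list itself in state R, so that entry can be ignored; anything that keeps appearing in D or Z across repeated runs is the interesting part.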

David Lang

jow · March 10, 2017, 8:20am

Please check if the load normalizes after issuing the following command:
echo /bin/true > /proc/sys/kernel/modprobe
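
For context: /proc/sys/kernel/modprobe holds the path of the program the kernel runs when it wants a module loaded on demand, so pointing it at /bin/true turns those requests into no-ops. Presumably the aim is to rule out repeated module auto-load attempts as the source of the load. The setting does not survive a reboot, but it can also be put back by hand. A small sketch, with the hypothetical helper name swap_helper and a file argument that exists only so it can be tried on an ordinary file:

```shell
#!/bin/sh
# swap_helper NEW_VALUE [FILE]
# Prints the value currently stored in FILE, then writes NEW_VALUE to it.
# FILE defaults to /proc/sys/kernel/modprobe (writing there needs root).
swap_helper() {
    file=${2:-/proc/sys/kernel/modprobe}
    old=$(cat "$file") || return 1     # remember the current helper path
    echo "$1" > "$file" || return 1    # install the replacement
    echo "$old"                        # hand the old value back to the caller
}
```

Typical use would be old=$(swap_helper /bin/true) before the test and swap_helper "$old" >/dev/null afterwards to restore whatever was there before.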

psyborg · March 10, 2017, 9:30am

Maybe, but I have no ddns or upnp installed. Remember, it still happens on the wifi page, so it is most likely related to wireless.

Don't do it; it is not a solution. I've decreased this interval to 1 sec in order to have more realtime stats updates, and let me tell you: when I hit that one-in-ten page load, even with 3 physical radios and about 10 virtual interfaces, refreshing data every second does not create load over 1.5.

true that

btsimonh · March 10, 2017, 9:37am

@jow: effect of echo /bin/true > /proc/sys/kernel/modprobe:
V2A (where load avg rises always and immediately), no apparent effect. load avg levels at ~3.
V2B (where load avg rises only after a little time), no apparent effect. load avg levels at ~2.
The load avg curve for the v2b is interesting:

The overview page was open for a short time before the start of the graph, and there is a step change in behavior about 1 minute in.

btsimonh · March 10, 2017, 9:37am

Another interesting observation: my laptop went to sleep for 15 minutes, and now that it has woken, the overview page is still updating but the load avg is close to zero. However, whilst typing this, the load avg did this:

btsimonh · March 10, 2017, 9:38am

then a few minutes later this:

and then promptly crashed! That router (acting only as an AP and ethernet hub) has NEVER crashed on me before... (I hope this is as a result of echo /bin/true > /proc/sys/kernel/modprobe ?).
(sorry for the multiple posts; apparently I can only attach one image per post because I'm a 'new' user ).

psyborg · March 10, 2017, 9:41am

luci process kept switching between R, S and D states

jow · March 10, 2017, 9:44am

So the only seemingly active process in "ps" is luci?

dlang · March 10, 2017, 9:54am

dlang:

if you can ssh into the box and look at top for processes in D or Z state (pretty much anything but S), that may spot a common culprit.

luci process kept switching between R, S and D states

When you are in the mode of a runaway load average, there should be some program
stuck in D or Z state

David Lang

btsimonh · March 10, 2017, 10:12am

@dlang: I did look for this previously, but did not identify anything specific. But I found top -d1 -b and ctrl-c to be useful, as the processes are not long-running. Today's quick look found nothing in D or Z. Maybe there are literally 100 threads for 100ms, all asking for the same resource, kicking the load avg up, but difficult to catch in the act? I'm coming to suspect something like uhttpd, as the thing serving luci.
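
Short-lived processes are easier to catch by sampling repeatedly than by watching a single top screen. The following is a minimal sketch that counts processes in state R or D on each sample using only shell builtins, so the sampler itself forks as little as possible. count_rd and sample_rd are hypothetical names, and whether sleep accepts fractional intervals depends on the local build.

```shell
#!/bin/sh
# count_rd [PROC_ROOT]
# Prints the number of processes currently in state R (runnable) or
# D (uninterruptible sleep), read from PROC_ROOT/<pid>/status.
count_rd() {
    root=${1:-/proc}
    n=0
    for f in "$root"/[0-9]*/status; do
        [ -r "$f" ] || continue            # process may already have exited
        while read -r key val rest; do
            if [ "$key" = "State:" ]; then
                case $val in R|D) n=$((n + 1)) ;; esac
                break
            fi
        done 2>/dev/null < "$f"
    done
    echo "$n"
}

# sample_rd SAMPLES INTERVAL [PROC_ROOT]
# Prints one R+D count per line, SAMPLES times, sleeping INTERVAL in between.
sample_rd() {
    i=0
    while [ "$i" -lt "$1" ]; do
        count_rd "${3:-/proc}"
        i=$((i + 1))
        sleep "$2"
    done
}
```

For example, sample_rd 60 1 prints sixty counts roughly one second apart. On a live system the count normally includes the sampling shell itself (state R), and it only sees each process's main thread, so it approximates the kernel's load figure rather than reproducing it.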