Proposal (and solution!) for High Load fix on OpenWrt (LuCI)


#1

Hi,
Browsing the forum, I've seen a lot of users talking about high load on their routers with no apparent cause, specifically when AUTOREFRESH is ON.

I just answered this in a specific topic here, and I'm creating this new topic so we can discuss it technically (without focusing on the other user's problem that I replied to),

and so we can discuss the pros and cons of my proposal and analysis with the developers.

.
.
Problem description: the OpenWrt system gets slow, showing a very high load while the LuCI interface is open with AUTOREFRESH enabled.

Introduction:
The LuCI interface makes heavy use of CGI Lua scripts. They are needed for almost everything in the interface. Some of them exist to update things in real time, like indicators and dynamic text (just to mention a few). While a page is open in your browser, scripts are called periodically [via AUTOREFRESH] to retrieve updated information and display it in the browser.
[Nothing new; everybody here knows how it works.]

.
But why does this cause high load? (Sometimes almost freezing the router out of resources?)

OpenWrt uses an embedded HTTP server, uhttpd, which is responsible for processing the Lua scripts that make up the LuCI interface, including the ones AUTOREFRESH calls.

The root of the problem:
uhttpd comes preconfigured to run a maximum of 3 scripts at the same time, in parallel. LuCI pages call various different scripts, which keep running all the time to update indicators and graphs on the LuCI interface [while it's open in a browser], and those scripts get stuck while uhttpd tries to run 3 of them at the same time!

This is the origin of the HUGE LOAD that occurs while autorefresh is ON. It's the uhttpd CGI subprocess that generates the load while running several scripts at the same time.

The proposed fix:
To fix it, we change the way uhttpd handles script processing, by configuring uhttpd to run scripts sequentially [only 1 script at a time]. This drops the load below 0.10 [and you can have as many LuCI windows open as you want, with autorefresh ON] :grinning:.
With this change, all open LuCI windows keep updating in real time, and the load is no longer a problem for the router. The CPU also stays 95%+ idle, even with autorefresh ON and many browser windows open and updating.

The logic:
Running 1 script at a time makes them run sequentially, and sequentially each one completes in milliseconds. This eliminates the high load and all the lag in the interface, and there are no side effects (I tested and debugged this behavioral change extensively and have found no problem so far).

This change results in benefits and optimization.

.
.
Procedure:
Edit /etc/config/uhttpd

Change the value of option max_requests to 1:

option max_requests '1'

Save the file, then reboot the router.
No more high load average, and you free up CPU cycles and CPU interrupts to run other processes and programs faster.
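For anyone who prefers the command line, the same change can be made over SSH with UCI (a sketch, assuming the default 'main' section name from the stock /etc/config/uhttpd; run these on the router itself):

```
# limit uhttpd to one concurrent CGI script
uci set uhttpd.main.max_requests='1'
uci commit uhttpd
# restarting the service is enough; a full reboot is not strictly required
/etc/init.d/uhttpd restart
```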

.
.
Technical details:

That parameter refers to SCRIPT REQUESTS, not HTTP connections: it determines how many CGI Lua scripts can be executed simultaneously. (The maximum number of HTTP connections is a separate parameter and does not need to be changed.)
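For reference, both parameters live side by side in the same /etc/config/uhttpd section. A sketch of the relevant lines only (the values shown are, as far as I know, the usual shipped defaults; check your own file):

```
config uhttpd 'main'
	# how many CGI/Lua scripts may execute concurrently (this thread's topic)
	option max_requests '3'
	# how many HTTP connections may be open at once; independent setting
	option max_connections '100'
```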

Analysis from the Application point of view [and usage]

  1. There's no technical reason [and no necessity] to run more than 1 Lua script at a time in the web interface.
  2. By running them sequentially, they run in milliseconds.
  3. Other LuCI script calls on the page get automatically queued to be processed by uhttpd in sequence [so no lost information, no lost time, no lost attempts].
  4. Being queued makes each one run immediately after the previous one finishes [and we are talking about milliseconds here]. This means the new information returned by a script is shown in the LuCI interface practically instantly.
  5. For us humans, milliseconds means real time, so AUTOREFRESH stays effectively real time in our practical human experience.
  6. The LuCI interface is informative, I love it, but it must not [it should be prohibited to] cause high load on the router. The router has many more important jobs to do with its resources and CPU cycles.
  7. We love to see things in real time in the LuCI interface, we need to see it, and sometimes we need to keep the interface open for hours. That is common, and it should not hurt the router itself.
  8. We cannot afford the luxury of wasting the router's precious [and limited] resources on a stuck script in the interface.
  9. And again, as far as I investigated, there is no need for those scripts to run in parallel. (Please correct me if I missed something or if I'm wrong here.)

.
Analysis from the Router point of view [and health]

  1. This modification completely eliminates the high load and high CPU/sys usage that the LuCI scripts cause
  2. This modification frees more resources for other processes and jobs
  3. It stops the LuCI interface from slowing down the router

.
Results:

  1. a faster router, faster applications, faster throughput, because the LuCI interface no longer causes high load or slows down the router while open.
  2. faster LuCI interface rendering, with pleasant navigation and comfortable usage; hundreds of times faster, yes, hundreds (as I measured).

.
Final personal observations:
The LuCI interface seems to be the origin of high load on various OpenWrt routers. I looked around this forum and found similar topics from people suffering from the same problem, some quite old, with no solution. My intention and focus in this analysis was to understand the origin of the problem and fix it. Once I found the cause, I asked myself: "Why? Why do we need to run the interface scripts in parallel?"

There may well be something wrong [technically] in a layer below the CGI script layer [of uhttpd] that causes the load.

  • But again, do we need to run 3 scripts in parallel for the interface? Why 3? Why the number 3?
  • Probably just because 3 is the default value in uhttpd's stock configuration?
  • Is there any reason the LuCI interface needs parallel script execution [that I don't know about]?
  • The solution above (assuming there is no need for parallel execution) fixes 100% of the problem: no more high load. Isn't it better to solve it this way and have a huge performance/load problem fixed?

I have worked with data networks for decades, and this is similar to a simple data congestion problem, so let's solve it the same way QoS does for data networks, and the same way disks handle concurrent writes physically: let's queue the work so it is processed super fast and everything keeps running smoothly! :grinning:

LuCI is an amazing interface: really good, beautiful, and well done. With this modification, it becomes fast and very light for the router to execute.

Since this solves a big performance problem that affects a high percentage of OpenWrt users, my suggestion is to change this default in the trunk config file, so it ships adjusted like this and users get a much more solid OpenWrt product.

Do you see any problem with my proposed fix?

I'm looking forward to your replies,

Regards,
Rafael Prado


#2

Seems like a very thorough writeup.
I'm not clear, though, on just how bad the problem here is / what it is that we're trying to solve.

Can you create a script that demonstrates the problem? I'm imagining something that runs on a PC (curl, wget, or maybe JavaScript in the browser) that calls these CGI scripts manually; whatever gets the job done.

With this CGI parameter changed to 1, what does the behaviour then look like?
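Something along these lines is what I have in mind (only a sketch; the address 192.168.1.1 and the status-overview endpoint are assumptions, so substitute whatever CGI URLs your LuCI pages actually call):

```shell
#!/bin/sh
# Sketch of a load generator: fire N concurrent requests at a LuCI
# CGI endpoint, the way several auto-refreshing pages would.
# The router address and endpoint path below are assumptions.
spam_cgi() {
  router="$1"; count="$2"; mode="${3:-dry}"
  i=0
  while [ "$i" -lt "$count" ]; do
    cmd="curl -s -o /dev/null http://$router/cgi-bin/luci/admin/status/overview"
    if [ "$mode" = "run" ]; then
      # fire the request in the background so all N run in parallel
      $cmd &
    else
      # dry run: just print what would be executed
      echo "$cmd"
    fi
    i=$((i + 1))
  done
  [ "$mode" = "run" ] && wait
  return 0
}

# Dry run (prints the commands): spam_cgi 192.168.1.1 5
# Real run against a router:     spam_cgi 192.168.1.1 5 run
```

Running it with max_requests at the default 3 and then at 1, while watching load average on the router, should make the difference visible.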


#3

Gee, you just saved me mucho-iofloppage :slight_smile:

EDIT: DO NOT DO THIS IN A PRODUCTION ENVIRONMENT.
THANKS JOW!
_Try this on for size: option no_ubusauth '1'_
And if you're feeling lucky: option script_timeout '7'


#4

@wulfy23 Can you elaborate a little on what you want to achieve with your options? In addition to, or as an alternative to, Rafael's suggestion?


#5

Well... seeing as the UI is not a day-to-day service.... the issue in general isn't that critical.

Having said that, after trying out those parameters..... power usage dropped to below 25% (spikes are default)

[load graph]

So yeah, I think OP has a point re: tweaking the defaults toward the passive side..... but that's purely from a resource limited hardware / world energy consumption perspective....

For me, if it were left as is..... it creates no issues... and I'd rather have it how it is than purely static. But it's one more tweak I can add to my defaults that will give me more VROOOOOOOMMMMM!!! :wink:

I'm anti-parallel most of the time. But I don't think serializing.... in a specific sense is gonna achieve the mustard.

EDIT: So, it looks like it's the global refresh frequency / dataset scope. Graph pages.... and maybe a few fields here and there.... are the only things that would need that kind of feedback..... an AJAX push (or pull) per dynamic element would be the fix, plus a global refresh reduction of 30 to 70% (or configurable)..... That would almost halve it.

(P.S.) Noticed potential app integration issues (collectd JSON pouring in on non-related pages; dunno if that's how the system is supposed to work tho'.... i.e. two tabs open at once.....)


#6

Hmmm... I am using a Netgear R7800 with 2x 1700 MHz, and the cycling effect of generating a load above 2, peaking at 4, was irritating, and I wouldn't categorize the router I am using as low end and "resource limited", ignoring for one minute that everything is limited, even the lifetime of the sun ;- )

But I am not sure how these load peaks influenced the rest of the router's routing performance.

Here is a fine speed test script monitoring latency and CPU load, so running this script with and without an open LuCI overview page would be interesting.

PS: OK, the universe seems to be endless, and so does the stupidity of some people, including myself (sometimes ;- )


#7

@jow
any thoughts on the discussion about possible load due to CGI script concurrency?


#8

This didn't fix it on my 500 MHz single core router, but it did reduce the load to about 1.


#9

Please do not ever do this. It exposes full ubus access via HTTP without any authentication. This is similar to running SSH without a root password set.


#10

It's an interesting observation, and in hindsight it makes a lot of sense to serialize the script calls. I'll do some tests and will probably adjust the default if everything works out.


#11

Ooh nice, this reduces the likelihood of watchdog-triggered reboots when using LuCI on Lantiq xrx200.


#12

Hi @wulfy23,

I did some tests on your proposed changes.

About no_ubusauth, @jow already answered, and his answer is definitive.

About reducing option script_timeout to '7':

  • (in short) reducing it can sometimes return ERROR 500 on very slow (low-end) systems, and it can break firmware updates via LuCI on all systems if it's set too low.

  • (at length) pros and cons detailed below:

.
.
IMPORTANT, about the script_timeout option:
On all OpenWrt systems (low end and high end alike), there is ONLY ONE script that takes a long time to complete, and it is important that it completes fully: the LuCI firmware UPLOAD and update script.

option script_timeout MUST be greater than the time it takes to upload your firmware.
.
.
About your proposed change: I suggest adjusting option script_timeout to a value long enough to upload an OpenWrt image to your router. (You don't need to flash it for testing; just upload the image via LuCI, then don't press the flash button. If you get Error 500, increase script_timeout, restart the uhttpd service, and repeat the upload until you find the optimal value for option script_timeout.)

Personally, I'm using option script_timeout '20' (just because of the firmware upload case).

Technically, this option interrupts script execution (kills the script) if the script runs longer than the option's value.

It's a resource saver if you have a system with excessive IO, but if a script is interrupted you won't get its result; you will get an HTTP Error 500 instead.
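Putting both tweaks together, the relevant part of my /etc/config/uhttpd looks roughly like this (a sketch of the two options only; '20' is just the value that covers my firmware upload time, so tune it for your hardware and connection):

```
config uhttpd 'main'
	# serialize CGI script execution (this thread's main fix)
	option max_requests '1'
	# kill scripts running longer than 20 s; must stay above the
	# time a LuCI firmware upload takes on your system
	option script_timeout '20'
```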

.
.
Your suggestion is valid for the purpose, and it complements my serialization suggestion: it acts as a second layer of protection (if some script gets stuck, it will be killed) to preserve and protect system resources and IO. The only exception is that firmware upload page.

If you don't use the LuCI interface to update your firmware (if you update via SSH, no problem), then you can set a low value for option script_timeout!

Thanks!


#13

Hey, thanks for putting the effort into that!!!

If you haven't seen this "perspective" yet.... take a look at;

ubus monitor ( while on various luci pages )

If you can find a way to "scale back" the frequency or verbosity of the JSON stream.... (I think it might be at a js/xhr level).... you will truly have your finger on the pulse :wink:

And at the other end of the equation.... is whether or not the Lua (nixio) to-storage path is the bigger bottleneck.

Maybe something like /etc/config/uhttpd;

option auto_refresh 20 # excluding graph elements

Is a simpler way of looking at it....


#14

I had been wondering if anything changed between the 17 stable series and 18 that might have something to do with this?

When v18 first showed up, I and others noticed high load numbers where we hadn't before. And I seemed to notice lower performance in routing and SQM on my stressed-to-its-limits C7 that I'd been testing heavily. I tried to bring this to attention, but it seemed inconsistent; not everyone saw it, and nobody was that interested. I eventually went back to v17.

Now I'm trying out v17, v18, and recent snapshots, with the C7s acting as dumb APs and an x86 doing the routing and SQM duties. I was seeing occasional high numbers and performance issues despite what should be a much lighter load. But sometimes no problems at all.

Clues: seeing 3 (!) lua tasks momentarily appearing at 20-30% CPU each in top, and only when a LuCI page with some kind of updating is open. I've learned to get off those pages when I run some kind of benchmark.
Edit: I have only looked at a recent snapshot, not stable v18.

Suddenly, this makes a lot more sense.

My question: did something change from v17 to v18 that started this, or made it worse? Does this sound familiar to anyone?


#15

After changing max_requests to 1 and testing for a while, I noticed that on a device with two radios, the table for the second radio under "Wireless Overview" seems to load its information slightly more slowly. When set to 2, or the current default of 3, all tables seem to resolve from the initial "Collecting data..." state at roughly the same time. This was on a single-core MIPS device, so the effect is probably less noticeable on a newer multi-core ARM device.

A new default of 2 may provide the best of both worlds with more serialized requests and a less jarring visual experience.


#16

Is there some sort of wizardry you gents can use to "renice" / jail these:

"Clues: seeing 3 (!) lua tasks momentarily appearing at 20-30% CPU each in top, and only when you have a LuCI page open that has some kind of updating going on."

max 10% CPU, maybe....


#17

+1, LuCI causes the system to be overloaded!


#18

Is this also seen with nginx?


#19

The default in 18.06.2 is in fact 1, so someone clearly thought it a good idea.


#20

I have run into an issue with this. The Freifunk wizard prompts the user to set a password before continuing to the next step of the install wizard. To know whether the password has been set, a CGI script is called. A CGI script runs as root and therefore has permission to look into /etc/shadow to see if there is a password hash (using the luci.sys.user.* calls as user nobody doesn't work).

With max_requests set to 1, the CGI script which checks /etc/shadow never gets executed (it times out after 30 seconds), presumably because the wizard's own request still occupies the single request slot while it waits for the nested CGI call. This causes the wizard to no longer run correctly. Changing max_requests to a number greater than 1 fixes the issue.

It may be that some other interfaces in LuCI have the same issue.

(Finding this thread is the result of countless hours of bisecting the freifunk-berlin firmware with all associated feeds.)

Related freifunk-berlin commits and issues: