Possible cause of R7800 latency issues

What does "low latency kernel" mean?
Full preempt + 1000 Hz?

I also use voluntary preempt, but with 250 Hz, on my wrt1200.
1000 Hz will indeed give you lower latency, but it also reduces throughput.
Bufferbloat.net also recommends 1000 Hz, but I chose something of a middle way here.
And I can also confirm that everything feels snappier.
I don't know how voluntary preempt can make that much of a difference, because it should only affect user space programs?
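
For a rough sense of scale on the 250 Hz vs 1000 Hz trade-off: the timer tick period is simply 1/HZ. A minimal sketch follows; the function name is made up for illustration, and the numbers are only the tick granularity, not measured latency.

```python
def tick_period_ms(hz: int) -> float:
    """Return the length of one kernel timer tick in milliseconds."""
    return 1000.0 / hz


# A few commonly offered CONFIG_HZ values.
for hz in (100, 250, 300, 1000):
    print(f"HZ={hz:4d} -> one tick every {tick_period_ms(hz):.2f} ms")
```

So going from 250 Hz to 1000 Hz shortens the tick from 4 ms to 1 ms, at the cost of four times as many timer interrupts.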

Is it true that the kernel HZ should match your power grid frequency?
For example, if your power grid frequency is 50 Hz you should use 250 Hz, 500 Hz or 1000 Hz?
And for 60 Hz use 300 Hz, 600 Hz or 1200 Hz?
But I guess that's a myth, because the power grid voltage gets converted anyway.
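
As far as I know, the kernel's own help text justifies the 300 Hz option by it dividing evenly into PAL and NTSC video frame rates, not by the mains frequency. A small sketch of that divisibility check, assuming 50 and 60 as the two rates of interest (the helper name is hypothetical):

```python
def divides_evenly(hz: int, rate: int) -> bool:
    """True if `rate` events per second fit a whole number of ticks each."""
    return hz % rate == 0


for hz in (100, 250, 300, 1000):
    print(f"HZ={hz:4d}  50: {divides_evenly(hz, 50)}  60: {divides_evenly(hz, 60)}")
```

Of these four values, only 300 divides evenly by both 50 and 60.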

An RT kernel is only useful if you have applications that can make use of it.
However, I also tried the RT kernel thing x)
I could make the patches apply and OpenWrt booted just fine, but it was buggy all over the place. For example, viewing the graphs generated an insane amount of CPU usage, and then the system crashed :confused:

Yes, that is what I meant. It is called “Low Latency Desktop” in kernel_menuconfig I think. I was surprised that it did not boot.

I do not know, but the 100 ms spikes are gone. There was no independent confirmation though. Don't dnsmasq, hostapd, etc. run as user space apps?

Related to the WIFI-latency, some searching gave the following results:

A reason for the latency spikes to occur: http://blog.cerowrt.org/post/disabling_channel_scans/

A solution is described here:

https://answers.microsoft.com/en-us/windows/forum/windows_10-networking/is-there-any-way-to-stop-windows-10-from-scanning/3870b3d1-0f07-4875-8779-bb5c11fce0a8

Also, at the end of this help-thread there is a program mentioned that can be found on this page:

It is old but reported to work in Windows 10.

Please note: this is the result of my searches on the subject of latency, I have not tested this yet. Will try to do this during the weekend.

Yes, they do.
But as I understand that entire preempt thing...
What it does is allow user space programs to interrupt the kernel.
Or am I wrong here?
And the difference between voluntary and full preempt is that voluntary adds some "interrupt points" to the kernel and full adds even more.
So I would rather expect the opposite from enabling preempt.

I haven't tested full preempt, so I can't tell whether it boots or not.
I think 250 Hz + voluntary is also the default on Ubuntu servers (and Debian?), so I will stick with that.
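
To check which combination a given build actually uses, one option is to look at the kernel .config from the build tree. Below is a minimal sketch that assumes the 4.x-era symbol names CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT and CONFIG_HZ; the function name and the sample text are made up for illustration and are not taken from any real device.

```python
import re


def preempt_summary(config_text: str) -> dict:
    """Pick the preemption model and timer frequency out of a kernel .config."""
    # Collect every symbol that is set to "y"; "# ... is not set" lines are skipped.
    enabled = set(re.findall(r"^(CONFIG_\w+)=y$", config_text, flags=re.MULTILINE))
    if "CONFIG_PREEMPT" in enabled:
        model = "full"
    elif "CONFIG_PREEMPT_VOLUNTARY" in enabled:
        model = "voluntary"
    elif "CONFIG_PREEMPT_NONE" in enabled:
        model = "none"
    else:
        model = "unknown"
    hz = re.search(r"^CONFIG_HZ=(\d+)$", config_text, flags=re.MULTILINE)
    return {"preempt": model, "hz": int(hz.group(1)) if hz else None}


# Made-up sample config fragment, for illustration only.
SAMPLE = """\
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_HZ_250=y
CONFIG_HZ=250
"""

print(preempt_summary(SAMPLE))
```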

@bouwew
Thanks for the links.
But the majority of clients here are Android based.
The wifi got better since I switched to voluntary preempt.
But a couple of commits in the latest trunk tree also improved it quite a bit.
Maybe the lag also comes from the power saving feature of Android.
Or from the adblock I use here.
I'm not quite sure. But it did improve :wink:
And the mwlwifi driver is still a bit buggy.

I think it allows any thread to interrupt the kernel, in particular it allows kernel tasks to interrupt other kernel tasks, which might allow packet processing to interrupt say garbage collection type maintenance operations.

Just wanted to briefly say: to all attempting to solve this issue, it is appreciated. I'm unfortunately not in a position to play around with or test my R7800 (in fact, I'm on stock due to the ping issues, and other issues with netlink bandwidth monitor).

For what it's worth, my unit has 'Antenna #' on the antennae; and I also experience ping spikes. Switching over to stock, as expected, the latency issues go away.

Not as detailed and thorough as some of your posts here, but have a look at my own evidence located in this post here.

As far as I can remember, the kernel log is the same as hnyman's when booting, so same flash (Micron), RAM (Nanya), CPU stepping/revision, etc.

It truly is very strange that some users don't experience this issue... So maybe it has something to do with the connection between a certain modem and its WAN connection to the R7800. Maybe there is something odd with the switching between the QCA8337 and the respective user's modem (does that even control the WAN port?). I'm using an SB8200 as the modem, so it's a Broadcom BCM3390Z.

Here is a proper answer: http://devarea.com/understanding-linux-kernel-preemption/. Preemption is about a low-priority thread not blocking a higher-priority one. So it makes sense that voluntary preemption has improved latency, but a fully preemptive kernel is the best. It is not real time, though.

On embedded systems with soft real-time requirements it is best practice to use this option, but on a server system, where we usually work asynchronously, the first option is better: fewer context switches, more CPU time.

The default kernel configuration is definitely wrong. It is unfortunate that the best option does not boot.

We need to find why it doesn't boot... This should be stock for a stock image

I have just realized that I used @hnyman's build with his patches. I will try building plain vanilla firmware now: there is a chance of a conflict...

No such luck: the firmware built from master does not boot either. Anyone here with serial access willing to lend a hand? I can share pre-built images...

This is what I am getting when pinging 8.8.8.8 with a voluntary preemption kernel and performance governor.
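
To compare runs like this without eyeballing graphs, the round-trip times can be pulled out of saved ping output and checked for spikes. This is a minimal sketch: the 100 ms threshold is taken from the spikes discussed earlier in the thread, the function names are hypothetical, and the sample lines are made up for illustration, not real measurements.

```python
import re


def rtts_ms(ping_output: str) -> list:
    """Extract every 'time=<n> ms' round-trip time from ping output."""
    return [float(m) for m in re.findall(r"time=([\d.]+) ms", ping_output)]


def spikes(rtts: list, threshold_ms: float = 100.0) -> list:
    """Return the round-trip times at or above the threshold."""
    return [r for r in rtts if r >= threshold_ms]


# Made-up sample output, for illustration only.
SAMPLE = """\
64 bytes from 8.8.8.8: icmp_seq=1 ttl=57 time=11.8 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=57 time=12.1 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=57 time=12.0 ms
"""

times = rtts_ms(SAMPLE)
print(f"{len(spikes(times))} of {len(times)} replies were at or above 100 ms")
```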


How does it look with default cpu governor?

Hey guys, some data points here:

huaracheguarache suspected that the root cause of the latency spikes is a kworker thread hogging 40%+ CPU every 2-3 seconds, which correlates in time with the observed spikes. Looking into that, I found the following:

On an r7500v2, which is based on the same ipq806x platform, one kworker thread exhibits exactly the same issue: it has consumed 10h18:14 of total CPU time on a router that has been running for 10 days and 15 hours.
The kernel version is

root@X4:~# uname -a
Linux X4 4.9.87 #0 SMP Tue Mar 20 20:45:27 2018 armv7l GNU/Linux

On an f9k1115v2 with a recent kernel (ar71xx) that has run for 7 days and 13 hours, a similar kworker thread consumed 1h23:00 of CPU time.

root@f9k1115v2:~# uname -a
Linux f9k1115v2 4.9.87 #0 Sat Mar 17 23:59:18 2018 mips GNU/Linux

However, on an n150r running a 4.4.X kernel that has been up for 130+ days, a kworker thread consumed only 58:49.64 of CPU time.

Similarly, on a wrt1900v1 running a 4.4.X kernel that has also been up for 130+ days, there are three separate kworker threads, which consumed the following CPU time:
0:20.62,
0:10.57,
0:47.13

TLDR:
This seems to indicate that, in general, the 4.4.X -> 4.9.X switch introduced something that hogs the CPU more in 4.9.X than in 4.4.X. I also observe CPU spikes in one of the kworker threads on the 4.4.X kernel, but they are less frequent and the maximum CPU% I've seen is only ~8%.

Evidence: On the 4.4.X kernel on the ar71xx and mvebu platforms, the total kworker thread CPU time is less than 2 hours for two devices that have been up for 130+ days.

On the 4.9.X kernel on the ar71xx and ipq806x platforms, the total kworker thread CPU time is 1h23min over 7 days 13 hours and 10h18min over 10 days 15 hours, which is a substantially larger share of uptime. In addition, on both of these devices I observe the kworker thread hogging 10-30% (ar71xx) and 40-80% (ipq806x) of the CPU every 2-3 seconds.
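
To put the figures above on one scale, the kworker CPU time can be expressed as a percentage of uptime. This is a rough sketch under stated assumptions: "10h18:14" is read as 10 h 18 min 14 s, "58:49.64" as 58 min 49.64 s, "0:20.62" as 20.62 s, and "130+ days" is taken as exactly 130 days, so the 4.4.X shares are upper bounds. The helper names are made up.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0.0):
    """Convert a days/hours/minutes/seconds duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_share_percent(cpu_s, uptime_s):
    """CPU time as a percentage of uptime."""
    return 100.0 * cpu_s / uptime_s


# (kworker CPU time, uptime), both in seconds, using the figures from the post.
devices = {
    "r7500v2 (4.9.87)": (to_seconds(hours=10, minutes=18, seconds=14),
                         to_seconds(days=10, hours=15)),
    "f9k1115v2 (4.9.87)": (to_seconds(hours=1, minutes=23),
                           to_seconds(days=7, hours=13)),
    "n150r (4.4.X)": (to_seconds(minutes=58, seconds=49.64),
                      to_seconds(days=130)),
    "wrt1900v1 (4.4.X, 3 kworkers)": (20.62 + 10.57 + 47.13,
                                      to_seconds(days=130)),
}

for name, (cpu_s, uptime_s) in devices.items():
    print(f"{name:32s} {cpu_share_percent(cpu_s, uptime_s):.4f} % of uptime")
```

Under those assumptions the two 4.9 devices come out at roughly 4.0 % and 0.76 % of uptime, while both 4.4 devices stay well under 0.1 %.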


Not sure if anyone in this thread has already tried this:

Kernel 4.4 for ipq806x was removed not long ago by this commit:
https://github.com/openwrt/openwrt/commit/3a3564ead5e4cf2f6ff73302c1e680b5575079ec

Perhaps you guys can try building with a 4.4.X kernel and see if there's any difference for the r7800.

A fully preemptive kernel boots fine here on the wrt1200.
The first thing I noticed is that bootup time has noticeably decreased.
Btw, when you compile your own firmware, don't use O3 optimization.
I noticed that on my device O3 noticeably increases bootup time.
And I guess it also slows down other things.
I always stick to O2.

Where do I change that?

In menuconfig,
under Advanced configuration options -> Target Options -> Target Optimizations.
On my device the default is:

-Os -pipe -mcpu=cortex-a9 -mfpu=vfpv3-d16

So I changed that to:

-O2 -pipe -mcpu=cortex-a9 -mfpu=vfp3

Os optimizes for size.
Change that to O2 to optimize for performance.

There was also a discussion on IRC about whether it would be better to copy the target options over to the additional compiler options, so that those optimizations get used more widely in the build process.
You can find "Additional compiler options" directly in the advanced configuration options.
The default here on my device is: -fno-caller-saves -fno-plt
So you would combine both, so that it looks like this:
-O2 -pipe -mcpu=cortex-a9 -mfpu=vfp3 -fno-caller-saves -fno-plt
Enter that under Advanced configuration options -> Additional compiler options,
and clear the target optimizations under Advanced configuration options -> Target Options -> Target Optimizations.
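
The combining step is just string handling: swap the optimization level and append the additional options. A minimal sketch follows; the function name is hypothetical and this is not part of the OpenWrt build system. It keeps every other flag exactly as given, so check the -mcpu/-mfpu values for your own target yourself.

```python
def merge_flags(target_opts: str, extra_opts: str, level: str = "-O2") -> str:
    """Swap the optimization level in target_opts and append extra_opts."""
    merged = []
    for flag in target_opts.split() + extra_opts.split():
        if flag.startswith("-O"):
            flag = level          # replace -Os, -O3, ... with the chosen level
        if flag not in merged:    # drop exact duplicates, keep the original order
            merged.append(flag)
    return " ".join(merged)


# Default target optimizations and additional compiler options quoted in the post.
print(merge_flags("-Os -pipe -mcpu=cortex-a9 -mfpu=vfpv3-d16",
                  "-fno-caller-saves -fno-plt"))
```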


Now the image is too big at 2231183 Bytes = 2178.89 kB = 2.13 MB. What could I safely remove from it? I have not added any USB modules, etc.

2.13 MB actually looks far too small?

And the R7800 has 128 MB of flash? With something like 24 MB or so that can actually be used,
like on the wrt* devices.
My image is roughly 8 MB.

I think 2MB is the size of the kernel partition. I have found steps to extend it to 4MB, but not sure if I could revert back to stock if I do so.
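
The arithmetic behind that size message can be checked quickly. This sketch assumes the limit really is 2 MiB (2 * 1024 * 1024 bytes), which is only the guess made above, and that the "kB" and "MB" in the message are 1024-based; the constant and function names are made up.

```python
IMAGE_BYTES = 2231183                  # size reported in the thread
KERNEL_PART_BYTES = 2 * 1024 * 1024    # ASSUMED 2 MiB limit, not confirmed


def kib(n: int) -> float:
    """Bytes to KiB (1024 bytes)."""
    return n / 1024


def mib(n: int) -> float:
    """Bytes to MiB (1024 * 1024 bytes)."""
    return n / 1024 ** 2


overshoot = IMAGE_BYTES - KERNEL_PART_BYTES
print(f"image: {kib(IMAGE_BYTES):.2f} KiB = {mib(IMAGE_BYTES):.2f} MiB")
print(f"over the assumed limit by {overshoot} bytes ({kib(overshoot):.1f} KiB)")
```

If the 2 MiB assumption holds, the image is only about 131 KiB too large, which is why a size-optimized (-Os) build may fit where an -O2 build does not.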

So the r7800 has only a 2 MB kernel partition?
wrt* has a 40 MB kernel partition.
//edit
It is 6 MB.

Hmm, that's odd.
Before you mess around with removing things,
I would recommend undoing the changes and leaving it at Os.