Possible cause of R7800 latency issues

I can confirm that, but also with no hard evidence. I think the responsiveness has improved over WiFi as well.

Also @huaracheguarache

You can see a lot more going on in kworker thread if you do

echo *:* > /sys/kernel/debug/tracing/set_event

The cpu usage spike has nothing to do with ledtrig as well. I disabled the function and the cpu usage spike still happens. The correlation in time likely has to do with that the led trig function call is triggered by some sort of a clock which also trigger other processes listening to the same clock.

Here's a sample dump:

There's a torrent of logs every 2-3 seconds which correspond to the high cpu spike.

1 Like

Then you should like it even more if you compile the stable release with O2 optimization (as described above) and a preemptive kernel (the stable one boots fine). This chart is pinging 8.8.8.8 while the router was in use. I suspect those minor spikes are SQM kicking in during a normal use:they are nicely clustered as well.

2 Likes

Hey guys,

I recompiled a firmware for r7500 to include perf. Here's perf report output from perf data collected over 30 seconds.

pid 25 is the kworker thread that eats up ~5% on average, and burst to ~40-80% every couple seconds.

How to enable profiling and perf:

Gobal build settings > Kernel build options > compile the kernel with profiling enabled
Development > perf

Then I recorded with

perf record -p 25 sleep 30

and view report with

perf report --header -I

note, perf is gonna generate a perf.data file in $PWD, so better cd to /tmp before you do it.

Thats it. I didn't bother building another one for r7800 but I suspect the perf report should look similar.
On the r7800 the kworker thread that hogs cpu is on average ~4% ish, and also burst to 40%-80% every couple seconds. As note before, this kworker thread consumes a rather large portion of cpu time compared to uptime. (~11hours cpu time out of 10 days uptime, compared to ~2hrs of total cpu time of all kworker threads out of 130days+ uptime on a wrt1900acv1 and a n150r.)

I hope the perf report could help dev figure out this issue.

1 Like

Has anyone looked into cpu frequency scaling? Maybe ipq806x scales its frequency much lower than ar71xx when idle?

Forcing the CPU to run at max frequency doesn't do much, we tried that.

@kyva1929 I did it on the R7800 and got the same results: https://pastebin.com/Hy3h7hL4

If anyone else wants to do this please take note that the PID of the spiking kworker changes between reboots. To find the PID install the htop package, run htop -d1 and press Shift + K. In order to make it easier to spot the offending kworker press F4, and type in "kworker" without the quotation marks. This will only show the kworker threads making it easy to spot the one spiking.

I would say it does help, but marginally. I noticed a positive impact on pings over wireless: they become more stable.

Is the thesis still valid that there are two different versions of R7800 with different behavior or did I get it wrong?

No, it's not a valid theory any more:

2 Likes

Hey Guys,

@huaracheguarache generated a stack trace profile graph here:
https://www.dropbox.com/s/pr3ztzi9724bnf3/kworker.svg?dl=0

It looks like most of the burst work really come from ar8xxx which is the switch driver.

@dissent1 I saw some of your PR on openwrt, any comments on this?

Summary of the issue:

the kworker thread hogging 10-30% (ar71xx), and 40-80% cpu% (ipq806x) every 2~3 seconds
this kworker thread consumes a rather large portion of cpu time compared to uptime. (~11hours cpu time out of 10 days uptime, compared to ~2hrs of total cpu time of all kworker threads out of 130days+ uptime on a wrt1900acv1 and a n150r.)

qca8k would be a DSA based alternative for the QCA8337 switch - however it will crash with PPPoE traffic (and there hasn't been much development on it recently, although it's supposed to be used in the future).

Guys look what I found...

https://github.com/openwrt/openwrt/commit/eff3549c5883a9abc5dbff00c084cabbcfdf4437

This seems to be exactly what was causing the high cpu load. I just found it, still need to see what the current master head look like. Based on this commit it should have been removed but perhaps it got somehow introduced again?

3 Likes

I think this is what the get port stats function does:
On a n150r:

swconfig dev switch0 show

Global attributes:
        enable_vlan: 1
Port 0:
        pvid: 1
        link: port:0 link:up speed:1000baseT full-duplex txflow rxflow
Port 1:
        pvid: 1
        link: port:1 link:up speed:100baseT full-duplex auto
Port 2:
        pvid: 1
        link: port:2 link:up speed:100baseT full-duplex auto
Port 3:
        pvid: 1
        link: port:3 link:down
Port 4:
        pvid: 1
        link: port:4 link:down
VLAN 1:
        vid: 1
        ports: 0 1 2 3 4

On r7500/r7800, it gives a ~300lines detailed statistics for each port.

The ar8xxx_mib_fetch_port_stat should have been removed from the drivers. Need to check where openwrt is compiling the driver from.

This is great stuff! Keep up the good work!

I disabled MIB counter and hardcode all capture statistics action to return -ETIMEDOUT. Compiling a new firmware for r7500 will report soon if the cpu usage is gone.

1 Like

As far as I can tell the huge 40%-80% cpu spike is gone. Now I only see ~7% spike once in a while across different kworker thread and only observable with htop -d1. With default htop (I think -d20) refresh rate I can only see 0.7% bump once in a while.

Not sure what this means to the latency spike though - my firmware still has the timer frequency and voluntary preemption options enabled.

My change is a quick dirty hack for testing and I would be ashamed to share the patch out - here's what I did:

In

/target/linux/generic/files/drivers/net/phy/ar8216.c

hardcode

ar8xxx_mib_capture

ar8xxx_mib_flush

to return

-ETIMEDOUT

disable

ar8xxx_mib_fetch_port_stat

and comment out

/* Enable MIB counters */
ar8xxx_rmw(priv, AR8216_REG_MIB_FUNC, AR8216_MIB_FUNC | AR8236_MIB_EN,
	   (AR8216_MIB_FUNC_NO_OP << AR8216_MIB_FUNC_S) |
AR8236_MIB_EN);

swconfig dump:
On the r7800 without the patch:

On the r7500 with my hack:

The mib counter stats are gone as expected.

2 Likes

I just applied the patch and can confirm that the CPU spikes appear to be gone.

1 Like

So you guys actually found the problem... As I said in some private message... The problem was in the switch driver actually... Anyway have timer and voluntary should improve performance anyway...

1 Like

I'm running a firmware without any modifications apart from the switch driver patch, and I ran a ping session with two laptops and the R7800 between them. One of the laptops is connected to the WAN port and the other is connected to the LAN port. Here's the result:

switch_patch_laptops

The spikes appear to be solved for me :slightly_smiling_face:

1 Like