Possible cause of R7800 latency issues

Forcing the CPU to run at max frequency doesn't do much, we tried that.

@kyva1929 I did it on the R7800 and got the same results: https://pastebin.com/Hy3h7hL4

If anyone else wants to do this please take note that the PID of the spiking kworker changes between reboots. To find the PID install the htop package, run htop -d1 and press Shift + K. In order to make it easier to spot the offending kworker press F4, and type in "kworker" without the quotation marks. This will only show the kworker threads making it easy to spot the one spiking.

I would say it does help, but marginally. I noticed a positive impact on pings over wireless: they become more stable.

Is the thesis still valid that there are two different versions of R7800 with different behavior or did I get it wrong?

No, it's not a valid theory any more:

2 Likes

Hey Guys,

@huaracheguarache generated a stack trace profile graph here:
https://www.dropbox.com/s/pr3ztzi9724bnf3/kworker.svg?dl=0

It looks like most of the burst work really come from ar8xxx which is the switch driver.

@dissent1 I saw some of your PR on openwrt, any comments on this?

Summary of the issue:

the kworker thread hogging 10-30% (ar71xx), and 40-80% cpu% (ipq806x) every 2~3 seconds
this kworker thread consumes a rather large portion of cpu time compared to uptime. (~11hours cpu time out of 10 days uptime, compared to ~2hrs of total cpu time of all kworker threads out of 130days+ uptime on a wrt1900acv1 and a n150r.)

qca8k would be a DSA based alternative for the QCA8337 switch - however it will crash with PPPoE traffic (and there hasn't been much development on it recently, although it's supposed to be used in the future).

Guys look what I found...

https://github.com/openwrt/openwrt/commit/eff3549c5883a9abc5dbff00c084cabbcfdf4437

This seems to be exactly what was causing the high cpu load. I just found it, still need to see what the current master head look like. Based on this commit it should have been removed but perhaps it got somehow introduced again?

3 Likes

I think this is what the get port stats function does:
On a n150r:

swconfig dev switch0 show

Global attributes:
        enable_vlan: 1
Port 0:
        pvid: 1
        link: port:0 link:up speed:1000baseT full-duplex txflow rxflow
Port 1:
        pvid: 1
        link: port:1 link:up speed:100baseT full-duplex auto
Port 2:
        pvid: 1
        link: port:2 link:up speed:100baseT full-duplex auto
Port 3:
        pvid: 1
        link: port:3 link:down
Port 4:
        pvid: 1
        link: port:4 link:down
VLAN 1:
        vid: 1
        ports: 0 1 2 3 4

On r7500/r7800, it gives a ~300lines detailed statistics for each port.

The ar8xxx_mib_fetch_port_stat should have been removed from the drivers. Need to check where openwrt is compiling the driver from.

This is great stuff! Keep up the good work!

I disabled MIB counter and hardcode all capture statistics action to return -ETIMEDOUT. Compiling a new firmware for r7500 will report soon if the cpu usage is gone.

1 Like

As far as I can tell the huge 40%-80% cpu spike is gone. Now I only see ~7% spike once in a while across different kworker thread and only observable with htop -d1. With default htop (I think -d20) refresh rate I can only see 0.7% bump once in a while.

Not sure what this means to the latency spike though - my firmware still has the timer frequency and voluntary preemption options enabled.

My change is a quick dirty hack for testing and I would be ashamed to share the patch out - here's what I did:

In

/target/linux/generic/files/drivers/net/phy/ar8216.c

hardcode

ar8xxx_mib_capture

ar8xxx_mib_flush

to return

-ETIMEDOUT

disable

ar8xxx_mib_fetch_port_stat

and comment out

/* Enable MIB counters */
ar8xxx_rmw(priv, AR8216_REG_MIB_FUNC, AR8216_MIB_FUNC | AR8236_MIB_EN,
	   (AR8216_MIB_FUNC_NO_OP << AR8216_MIB_FUNC_S) |
AR8236_MIB_EN);

swconfig dump:
On the r7800 without the patch:

On the r7500 with my hack:

The mib counter stats are gone as expected.

2 Likes

I just applied the patch and can confirm that the CPU spikes appear to be gone.

1 Like

So you guys actually found the problem... As I said in some private message... The problem was in the switch driver actually... Anyway have timer and voluntary should improve performance anyway...

1 Like

I'm running a firmware without any modifications apart from the switch driver patch, and I ran a ping session with two laptops and the R7800 between them. One of the laptops is connected to the WAN port and the other is connected to the LAN port. Here's the result:

switch_patch_laptops

The spikes appear to be solved for me :slightly_smiling_face:

1 Like

Amazing work, very well done! :slight_smile: Would you mind creating a pull request for this fix?

1 Like

@kyva1929 or someone else should do that. I don't know enough about the internals to make pull requests, I just followed the guide kyva1929 wrote. The patch needs to be cleaned up and done properly.

Could you share this build for others to test please?

1 Like

If anyone here wants to try a preemptive (non voluntary) kernel, here is the patch from @dissent1 :

UPDATE: theoreemtive kernel still needs to be enabled, but this patch fixes a boot issue.

diff --git a/target/linux/ipq806x/patches-4.9/0038-clk-qcom-Add-support-for-High-Frequency-PLLs-HFPLLs.patch b/target/linux/ipq806x/patches-4.9/0038-clk-qcom-Add-support-for-High-Frequency-PLLs-HFPLLs.patch
index 7092614..a206757 100644
--- a/target/linux/ipq806x/patches-4.9/0038-clk-qcom-Add-support-for-High-Frequency-PLLs-HFPLLs.patch
+++ b/target/linux/ipq806x/patches-4.9/0038-clk-qcom-Add-support-for-High-Frequency-PLLs-HFPLLs.patch
@@ -117,7 +117,7 @@ Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
 +       * H/W requires a 5us delay between disabling the bypass and
 +       * de-asserting the reset. Delay 10us just to be safe.
 +       */
-+      usleep_range(10, 100);
++      udelay(10);
 +
 +      /* De-assert active-low PLL reset. */
 +      regmap_update_bits(regmap, hd->mode_reg, PLL_RESET_N, PLL_RESET_N);
@@ -128,7 +128,7 @@ Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
 +                      regmap_read(regmap, hd->status_reg, &val);
 +              } while (!(val & BIT(hd->lock_bit)));
 +      } else {
-+              usleep_range(60, 100);
++              udelay(60);
 +      }
 +
 +      /* Enable PLL output. */
1 Like

Hmm, I am not seeing any impact; might have to recompile the firmware. However, I did notice the AR8XXX_CAP_MIB_COUNTERS constant that is used to indicate the mib counters capability. If it is not present and num_mibs, mib_desc, and mib_func are set to NULL then it should disable all mib counters.