Possible cause of R7800 latency issues

#1

I've been looking into the R7800 latency issue detailed in this previous forum thread:

First, I was able to reproduce the latency spikes. Here I'm pinging 8.8.8.8 directly from the R7800 once per second:

figure_baseline_router


I also tried the same from my computer, connected to the router by Ethernet. At one ping per second I didn't see any large spikes:

figure_baseline_pc


However, increasing the ping frequency to every 0.2 seconds gave a different picture:

figure_baseline_pc_interval_0_2

Pinging 8.8.8.8 with the computer connected to the modem directly didn't produce any spikes.
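To put numbers on runs like these instead of eyeballing the plots, the ping output can be piped through a small filter. This is only a sketch: it assumes each reply line carries the usual `time=<ms>` field, and the function name `ping_spikes` and the 50 ms threshold are my own illustrative choices, not something used in the tests above.

```shell
# Summarise ping replies read from stdin: number of replies, maximum RTT,
# and how many replies exceeded a threshold in ms (default 50, illustrative).
ping_spikes() {
    awk -v limit="${1:-50}" '
        {
            # Find the "time=<ms>" field that ping prints on each reply line.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^time=/) {
                    rtt = substr($i, 6) + 0
                    n++
                    if (rtt > max) max = rtt
                    if (rtt > limit) spikes++
                }
            }
        }
        END { printf "replies=%d max=%.3f spikes=%d\n", n, max, spikes }
    '
}

# Example use from the PC (interval and count are just examples):
#   ping -i 0.2 -c 300 8.8.8.8 | ping_spikes 50
```

Running it once against the router path and once against the direct modem path would give directly comparable spike counts.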


To figure out what might cause this, I had a look at htop -d1, which refreshes every 1/10 of a second. I consistently spotted a kworker thread hogging 30-85 percent of core 2 approximately every 2 seconds. Each burst of high CPU load lasts only a fraction of a second.

I used ftrace to figure out what kworker was doing by following this guide:

What I found was this:

kworker/1:2-83 [001] dns. 278.480327: workqueue_queue_work: work struct=bf1f6004 function=gc_worker [nf_conntrack] workqueue=dd480400
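For anyone who wants to capture the same kind of event, here is a rough sketch of one way to do it. It is not necessarily what the guide does. It assumes the kernel was built with ftrace and workqueue trace events, and that tracefs is reachable under /sys/kernel/debug/tracing; the function name `trace_workqueue` and the `TRACEFS` override are hypothetical helpers for illustration.

```shell
# Record workqueue_queue_work events for a few seconds and print the trace.
# TRACEFS defaults to the usual debugfs location (an assumption; adjust it
# if tracefs is mounted elsewhere on your build).
trace_workqueue() {
    tracefs="${TRACEFS:-/sys/kernel/debug/tracing}"
    seconds="${1:-5}"
    ev="$tracefs/events/workqueue/workqueue_queue_work/enable"

    if [ ! -w "$ev" ]; then
        echo "workqueue trace events not available under $tracefs" >&2
        return 1
    fi

    : > "$tracefs/trace"             # clear the ring buffer
    echo 1 > "$ev"                   # enable the event
    echo 1 > "$tracefs/tracing_on"   # make sure tracing is running
    sleep "$seconds"
    echo 0 > "$ev"                   # disable the event again
    cat "$tracefs/trace"
}

# Example use on the router:
#   trace_workqueue 10 | grep kworker
```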


Running a ping session on the router with timestamps, I was able to correlate the high latency with a preceding kworker spike. Here are some examples:

kworker/1:2-83    [001] dns.   278.480327: workqueue_queue_work: work struct=bf1f6004 function=gc_worker [nf_conntrack] workqueue=dd480400

278.50 64 bytes from 8.8.8.8: seq=15 ttl=59 time=53.656 ms

kworker/1:2-83    [001] dns.   347.600138: workqueue_queue_work: work struct=bf1f6004 function=gc_worker [nf_conntrack] workqueue=dd480400 req_cpu=4 cpu=1
          <idle>-0     [000] dnh.   347.629976: workqueue_queue_work: work struct=dccf33c0 function=dbs_work_handler workqueue=dd480200 req_cpu=0 cpu=0

347.63 64 bytes from 8.8.8.8: seq=84 ttl=59 time=85.801 ms

kworker/1:2-83    [001] d.s.   416.730239: workqueue_queue_work: work struct=bf1f6004 function=gc_worker [nf_conntrack] workqueue=dd480400 req_cpu=4 cpu=1

416.75 64 bytes from 8.8.8.8: seq=153 ttl=59 time=66.578 ms



It seems that something related to nf_conntrack is misbehaving.
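The matching above was done by eye. A sketch of how it could be automated follows, assuming a saved trace in the format shown and a ping log whose lines start with a timestamp on the same clock as the trace (as in the examples). The function name `correlate_gc`, the 0.1 s window and the 50 ms threshold are illustrative assumptions.

```shell
# List slow ping replies that arrive shortly after a gc_worker queue event.
# Usage: correlate_gc TRACE_FILE PING_LOG [WINDOW_SECONDS] [LIMIT_MS]
correlate_gc() {
    trace_file="$1"
    ping_log="$2"
    awk -v window="${3:-0.1}" -v limit="${4:-50}" '
        # Trace file: remember the timestamp of every gc_worker event.
        FILENAME == ARGV[1] {
            if ($0 ~ /function=gc_worker/) {
                for (i = 1; i <= NF; i++) {
                    if ($i ~ /^[0-9]+\.[0-9]+:$/) { gc[++n] = $i + 0; break }
                }
            }
            next
        }
        # Ping log: for each slow reply, look for an event just before it.
        {
            rtt = -1
            for (i = 1; i <= NF; i++) if ($i ~ /^time=/) rtt = substr($i, 6) + 0
            if (rtt <= limit) next
            t = $1 + 0
            for (k = 1; k <= n; k++) {
                d = t - gc[k]
                if (d >= 0 && d <= window) {
                    printf "ping at %.2f (%.3f ms) follows gc_worker at %.6f by %.3f s\n", t, rtt, gc[k], d
                    break
                }
            }
        }
    ' "$trace_file" "$ping_log"
}

# Example use (file names are placeholders):
#   correlate_gc trace.txt ping.txt 0.1 50
```

Slow replies with no gc_worker event inside the window are simply not printed, so comparing the number of lines printed with the total number of spikes gives a rough idea of how much of the latency gc_worker could account for.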

0 Likes

Build for Netgear R7800
Packet loss and Latency R7800
MIB Counters for QCA8337 (Netgear R7800)
#2

Disable the firewall and try again? Any difference?

0 Likes

#3

If the issue is with netfilter, shouldn't it affect all routers instead of just the R7800?

0 Likes

#4

Well, other routers don't have frequency scaling, so they handle this spike in a better way... Also check the CPU frequency when the spike happens...

0 Likes

#5
/etc/init.d/firewall disable
/etc/init.d/firewall stop

A reboot later and no difference (pinging 8.8.8.8 directly from the R7800):

ping_firewall_off_router

0 Likes

#6

To keep the CPU running at max frequency at all times I set:

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor

I ran a ping session again directly from the R7800 to 8.8.8.8 (with firewall enabled):

ping_governor_performance

The spikes improved, but they are still there.
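For repeating this test, the governor change can be wrapped in a small helper that covers every core and reads the value back, so you can confirm it actually took effect. This is just a sketch: the function name `set_governor` and the `CPU_ROOT` override are hypothetical, and it assumes the standard cpufreq sysfs layout.

```shell
# Write a governor to every CPU that exposes cpufreq and print what the
# kernel reports afterwards. CPU_ROOT can be overridden for a dry run on
# a scratch directory.
set_governor() {
    gov="${1:-performance}"
    root="${CPU_ROOT:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue      # glob matched nothing
        echo "$gov" > "$f"
        echo "$f: $(cat "$f")"       # read back to verify
    done
}

# Example use on the router:
#   set_governor performance
#   set_governor ondemand      # to switch back afterwards
```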

0 Likes

#7

I have run these tests with scaling governor set to performance instead of ondemand (both cores) with the same results. The spikes to 100ms did not go away.

0 Likes

#8

Even at its lowest frequency, 384 MHz, the CPU should have more than enough grunt to handle just ICMP traffic. It shouldn't be due to CPU frequency scaling.

The netfilter issue is an interesting observation but should not be the cause, else every router running the same kernel version would see the same latency.

Perhaps the interrupt handlers are not handling interrupts efficiently?

0 Likes

#9

Is there a way to verify/prove this theory?

0 Likes

#10

Another strange fact worth noting is that @slh doesn't experience these spikes on his ZyXEL NBG6817, even though it also has the IPQ8065 and QCA9984.

0 Likes

#11

Well, I can't think of how to test for interrupt issues effectively. We'd probably have to review the kernel code. Maybe we can turn off all peripheral components like WiFi, UART, LEDs etc. and just test the LAN ports?

0 Likes

#12

I know how to turn off WiFi, but not the others. Can you provide some pointers?

0 Likes

#13

You'd have to compile a custom firmware with all those device drivers removed. Then we'll know for sure that they are not interfering, since they won't register their interrupts.

0 Likes

#14

Did you see this?

0 Likes

#16

Yes, but a ping test run locally on my computer also shows substantial spikes as seen in the plot with interval = 0.2 seconds.

0 Likes

#17

Yes, I know. You also stated this:

So...what do you think happens to a CPU, when you multiply the given load by a factor of 5?

0 Likes

#18

I'm not running htop -d1 while doing the ping tests - I just used it to have a look if something is hogging the CPU periodically. I assume you implied that I ran it during the pings. Sorry if I misunderstood you.

1 Like

#19

But you are increasing your pings to a public DNS server though?

Have you tested something more suited to responding in a timely fashion?

0 Likes

#20

Since the kworker hogging the CPU appeared in short bursts I thought it was a good idea to increase the frequency of the pings to make it more likely for a ping to coincide with a CPU utilisation spike. That way any initially hard-to-spot latency issues might become more apparent.

No, but I'm open to suggestions :)

0 Likes

#21

A Speedtest...perhaps?

DNS servers are configured to optimize for UDP/53... not ICMP Echo Request.

0 Likes