Possible cause of R7800 latency issues

huaracheguarache · March 22, 2018, 9:39pm

I've been looking into the R7800 latency issue detailed in this previous forum thread:

First of all I tried and succeeded in reproducing the latency spikes. Here I'm pinging 8.8.8.8 directly from the R7800 every second:

I also tried doing the same from my computer connected to the router by ethernet. And a ping frequency of 1 per second didn't give me any large spikes:

However, increasing the ping frequency to every 0.2 seconds gave a different picture:

Pinging 8.8.8.8 with the computer connected to the modem directly didn't produce any spikes.
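
For anyone wanting to repeat the comparison without plotting, here is a minimal sketch that keeps only the slow replies from a ping run. The function name `filter_spikes` and the 50 ms default threshold are illustrative assumptions, not something used in the tests above.

```shell
#!/bin/sh
# Print only ping replies whose round-trip time exceeds a threshold (ms).
# Reads ping output on stdin; the 50 ms default is an arbitrary example value.
filter_spikes() {
    threshold_ms="${1:-50}"
    awk -v limit="$threshold_ms" '
        /time=/ {
            # Pull the number out of the "time=53.656" field.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^time=/) {
                    rtt = substr($i, 6) + 0
                    if (rtt > limit) print
                }
            }
        }'
}

# Example (not run automatically): five pings per second, show replies over 50 ms.
# ping -i 0.2 8.8.8.8 | filter_spikes 50
```

Running the same filter at 1 s and 0.2 s intervals should make it easier to see whether the shorter interval simply catches more of the short bursts.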

So, in trying to figure out what might cause this I had a look at what htop -d1 showed. htop -d1 updates every 1/10 of a second and I consistently spotted a kworker thread hogging 30-85 percent of core 2 approximately every 2 seconds. This high CPU load lasts for a fraction of a second.

I used ftrace to figure out what kworker was doing by following this guide:

What I found was this:

kworker/1:2-83 [001] dns. 278.480327: workqueue_queue_work: work struct=bf1f6004 function=gc_worker [nf_conntrack] workqueue=dd480400
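
A rough sketch of how a trace line like this can be captured, assuming the kernel was built with ftrace event support and debugfs is mounted at /sys/kernel/debug. The function name `enable_workqueue_trace` is made up for illustration and this is not necessarily the exact procedure from the guide.

```shell
#!/bin/sh
# Enable the workqueue_queue_work tracepoint under a tracing directory.
# The default path assumes debugfs is mounted at /sys/kernel/debug.
enable_workqueue_trace() {
    tracing_dir="${1:-/sys/kernel/debug/tracing}"
    echo 1 > "$tracing_dir/events/workqueue/workqueue_queue_work/enable" || return 1
    echo 1 > "$tracing_dir/tracing_on"
}

# Example (run as root on the router, not executed here):
# enable_workqueue_trace
# grep gc_worker /sys/kernel/debug/tracing/trace_pipe
```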

Running a ping session on the router with timestamps, I was able to correlate the high latency with a preceding kworker spike. Here are some examples:

kworker/1:2-83    [001] dns.   278.480327: workqueue_queue_work: work struct=bf1f6004 function=gc_worker [nf_conntrack] workqueue=dd480400

278.50 64 bytes from 8.8.8.8: seq=15 ttl=59 time=53.656 ms

kworker/1:2-83    [001] dns.   347.600138: workqueue_queue_work: work struct=bf1f6004 function=gc_worker [nf_conntrack] workqueue=dd480400 req_cpu=4 cpu=1
          <idle>-0     [000] dnh.   347.629976: workqueue_queue_work: work struct=dccf33c0 function=dbs_work_handler workqueue=dd480200 req_cpu=0 cpu=0

347.63 64 bytes from 8.8.8.8: seq=84 ttl=59 time=85.801 ms

kworker/1:2-83    [001] d.s.   416.730239: workqueue_queue_work: work struct=bf1f6004 function=gc_worker [nf_conntrack] workqueue=dd480400 req_cpu=4 cpu=1

416.75 64 bytes from 8.8.8.8: seq=153 ttl=59 time=66.578 ms

It seems that something related to nf_conntrack is misbehaving.
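
The matching above was done by eye. A minimal sketch of doing it mechanically is shown below, assuming the trace output and the timestamped ping output were saved to two files and share the same clock, as in the excerpts. The function name `correlate_spikes` and the 0.1 s default window are assumptions for illustration.

```shell
#!/bin/sh
# List ping replies that arrive within WINDOW seconds after a gc_worker
# queueing event. Both logs are assumed to share the same clock.
# Usage: correlate_spikes TRACE_FILE PING_FILE [WINDOW_SECONDS]
correlate_spikes() {
    trace_file="$1"
    ping_file="$2"
    window="${3:-0.1}"
    awk -v window="$window" '
        # First file: remember the timestamp of every gc_worker event.
        NR == FNR {
            if ($0 ~ /function=gc_worker/) {
                for (i = 1; i <= NF; i++) {
                    if ($i ~ /^[0-9]+\.[0-9]+:$/) {
                        events[++n] = substr($i, 1, length($i) - 1) + 0
                        break
                    }
                }
            }
            next
        }
        # Second file: the first field of each ping line is its timestamp.
        /time=/ {
            t = $1 + 0
            for (j = 1; j <= n; j++) {
                delta = t - events[j]
                if (delta >= 0 && delta <= window) {
                    printf "%s (%.3f s after gc_worker)\n", $0, delta
                    break
                }
            }
        }' "$trace_file" "$ping_file"
}
```

Replies that do not follow a gc_worker event within the window are not printed, so comparing this output against the full list of slow replies gives a rough idea of how many spikes the garbage collector could account for.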

Ansuel · March 22, 2018, 10:01pm

disable firewall and try again? any difference?

quarky · March 22, 2018, 10:44pm

If the issue is with netfilter, shouldn't it affect all routers instead of just the R7800?

Ansuel · March 23, 2018, 9:40am

Well, other routers don't have frequency scaling, so they handle this spike better... Also check the CPU frequency when the spike happens...

huaracheguarache · March 23, 2018, 10:10am

/etc/init.d/firewall disable
/etc/init.d/firewall stop

A reboot later and no difference (pinging 8.8.8.8 directly from the R7800):

huaracheguarache · March 23, 2018, 11:08am

To keep the CPU running at max frequency at all times I set:

echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo "performance" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor

I ran a ping session again directly from the R7800 to 8.8.8.8 (with firewall enabled):

The spikes improved, but they are still there.
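
To double-check that the governor actually took effect on both cores, and to look at the frequency as suggested earlier, something like the following sketch could be used. The function name `set_governor` is hypothetical; the sysfs files are the standard cpufreq ones, and `scaling_cur_freq` reports kHz.

```shell
#!/bin/sh
# Set the same cpufreq governor on every core and print the result.
# The second argument defaults to the standard sysfs location.
set_governor() {
    governor="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"
    for policy in "$cpu_root"/cpu[0-9]*/cpufreq; do
        [ -d "$policy" ] || continue
        echo "$governor" > "$policy/scaling_governor" || return 1
        # scaling_cur_freq is reported in kHz.
        if [ -r "$policy/scaling_cur_freq" ]; then
            echo "$policy: $(cat "$policy/scaling_governor") at $(cat "$policy/scaling_cur_freq") kHz"
        fi
    done
}

# Example (run as root on the router, not executed here):
# set_governor performance
```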

fantom-x · March 23, 2018, 11:08am

I have run these tests with scaling governor set to performance instead of ondemand (both cores) with the same results. The spikes to 100ms did not go away.

quarky · March 23, 2018, 12:04pm

Even at its lowest freq, @ 384MHz it should be more than enough grunt to handle just ICMP traffic. Shouldn’t be due to CPU freq scaling.

The netfilter issue is an interesting observation but should not be the cause, else every router running the same kernel version would see the same latency.

Probably the interrupt handlers are not handling interrupts efficiently?

fantom-x · March 23, 2018, 12:08pm

Is there a way to verify/prove this theory?

huaracheguarache · March 23, 2018, 12:23pm

Another strange fact worth noting is that @slh doesn't experience these spikes on his ZyXEL NBG6817, even though it also has the IPQ8065 and QCA9984.

quarky · March 23, 2018, 12:44pm

Well, I can’t think of how to test for interrupt issues effectively. Probably have to review the kernel code. Maybe we can turn off all peripheral components like WiFi, UART, LEDs etc and just test the LAN ports?
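
As a rough first look before going that far, two snapshots of /proc/interrupts taken around a ping run could be compared to see which interrupt sources fired in between. This is only a sketch: the function name `irq_delta` is made up, and it shows interrupt counts, not how long the handlers took, so it cannot prove or rule out the theory on its own.

```shell
#!/bin/sh
# Compare two snapshots of /proc/interrupts and print, per IRQ line,
# how many interrupts were counted between them (summed over all CPUs).
# Usage: irq_delta BEFORE_FILE AFTER_FILE
irq_delta() {
    awk '
        # Header line lists the CPUs; use it to learn how many count columns exist.
        FNR == 1 { ncpu = NF; next }
        {
            total = 0
            for (i = 2; i <= ncpu + 1 && i <= NF; i++) {
                if ($i ~ /^[0-9]+$/) total += $i
            }
            if (NR == FNR) {
                before[$1] = total
            } else if (total - before[$1] > 0) {
                printf "%s %d %s\n", $1, total - before[$1], $NF
            }
        }' "$1" "$2"
}

# Example (not executed here):
# cat /proc/interrupts > /tmp/irq_before
# ping -c 100 8.8.8.8
# cat /proc/interrupts > /tmp/irq_after
# irq_delta /tmp/irq_before /tmp/irq_after
```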

fantom-x · March 23, 2018, 12:52pm

I know how to turn off WiFi, but not the others. Can you provide some pointers?

quarky · March 23, 2018, 1:13pm

Have to compile a custom firmware with all those device drivers removed. Then we’ll know for sure those are not interfering since it’ll not register the interrupts.

lleachii · March 23, 2018, 2:05pm

Did you see this?

huaracheguarache · March 23, 2018, 2:56pm

Yes, but a ping test run locally on my computer also shows substantial spikes as seen in the plot with interval = 0.2 seconds.

lleachii · March 23, 2018, 3:00pm

Yes, I know. You also stated this:

So...what do you think happens to a CPU, when you multiply the given load by a factor of 5?

huaracheguarache · March 23, 2018, 3:06pm

I'm not running htop -d1 while doing the ping tests - I just used it to have a look if something is hogging the CPU periodically. I assume you implied that I ran it during the pings. Sorry if I misunderstood you.

lleachii · March 23, 2018, 3:09pm

But you are increasing your pings to a public DNS server though?

Have you tested something more suited to responding in a timely fashion?

huaracheguarache · March 23, 2018, 3:15pm

Since the kworker hogging the CPU appeared in short bursts I thought it was a good idea to increase the frequency of the pings to make it more likely for a ping to coincide with a CPU utilisation spike. That way any initially hard-to-spot latency issues might become more apparent.

No, but I'm open to suggestions

lleachii · March 23, 2018, 3:38pm

A Speedtest...perhaps?

DNS servers are configured to optimize UDP/53...not ICMP Echo-Request.