[SOLVED] Router (Netgear R7800) introduced latency spikes >100ms

Is that a digit 1 ? I do not get anything in dmesg ....

UPDATE: Oh, that is a lower case l, but when I run that command it takes a few seconds, but nothing shows up in dmesg.

I needed double quotes there. Here is what I am getting now, but I do not think that is what I was supposed to get:

[ 1806.587534] sysrq: SysRq : Show backtrace of all active CPUs
[ 1806.589341] Sending NMI to all CPUs:
[ 1816.592787] ath10k_warn: 54 callbacks suppressed
[ 1816.592820] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 1, skipped old beacon
[ 1816.592878] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 2, skipped old beacon
[ 1816.592935] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
[ 1816.593040] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 1, skipped old beacon
[ 1816.593141] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 2, skipped old beacon
[ 1816.593198] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
[ 1816.593253] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 1, skipped old beacon
[ 1816.593352] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 2, skipped old beacon
[ 1816.593405] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 0, skipped old beacon
[ 1816.593506] ath10k_pci 0000:01:00.0: SWBA overrun on vdev 1, skipped old beacon
[ 1821.988657] pppoe-wan: renamed from ppp0

So I have run one more test to make sure my line is clean, the path is stable, and to eliminate the PPPoE software on the R7800. I have built an x86_64 LEDE router out of spare parts and a USB flash drive. I configured a PPPoE client and connected my PC through it: I do not think I have seen such fast internet before! The pages were loading much faster. This x86 router only added approximately 0.3ms of latency, and in about 15 minutes there were fewer than five spikes to 11.X ms. All the other times ping 8.8.8.8 would return in less than 11ms and often in less than 10ms.
Then I connected my R7800 to it (double NAT) and configured it as a DHCP client, with my PC now connected to the R7800. The latency spikes to 100ms came back and the R7800 added another 1ms of latency.
I can now rule out the PPPoE client on R7800, as it was not used in this test.
I do not have experience debugging routers, so not sure what else I can try now.

I would test by running stock firmware on the R7800 with the VDSL modem bridged.

If you get the same thing, it's not OpenWrt/LEDE.

No such luck. Just flashed the latest stock firmware and there were around 20 spikes to between 12 and 18ms over a 30 minute interval, while most of the time the pings were under 11ms (ping 8.8.8.8). With LEDE I get several spikes of 50..100ms a minute. The stock firmware adds at most half the latency of LEDE.

I tested with the router in the same location and using all the same wiring and ports. Also tested over wireless and it is an additional 2..5ms and very stable, with much less jitter than with LEDE.
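For anyone who wants to compare firmwares the same way, here is a rough sketch of how the spikes could be counted from saved ping output instead of by eye. The function name, the 50 ms threshold and the sample lines are made up for illustration (they are not measurements from this thread), and the regex assumes the usual "time=… ms" field that ping prints.

```python
import re

# Matches the "time=10.4 ms" field printed by common ping implementations.
TIME_RE = re.compile(r"time[=<]([\d.]+)\s*ms")

def find_spikes(ping_output, threshold_ms=50.0):
    """Return (all_rtts, spikes) parsed from raw ping output."""
    rtts = [float(m.group(1)) for m in TIME_RE.finditer(ping_output)]
    spikes = [t for t in rtts if t > threshold_ms]
    return rtts, spikes

# Invented sample lines, only to show the expected input format.
sample = """\
64 bytes from 8.8.8.8: seq=0 ttl=57 time=10.412 ms
64 bytes from 8.8.8.8: seq=1 ttl=57 time=9.874 ms
64 bytes from 8.8.8.8: seq=2 ttl=57 time=97.301 ms
64 bytes from 8.8.8.8: seq=3 ttl=57 time=10.950 ms
"""

rtts, spikes = find_spikes(sample)
print(f"{len(spikes)} of {len(rtts)} replies above threshold, max {max(rtts):.1f} ms")
```

Saving the output of a long ping run to a file on the PC and feeding it to this function gives a spikes-per-run number that can be compared between stock and LEDE.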

I understand that (and why) LEDE would be a little slower, but I am hoping that the 100ms ping spikes can be eliminated. Anything else I could try?

I think you've identified the source of the issue.

Open a bug report...

https://bugs.lede-project.org/

Seems like it's a known issue...from reading some of the posts in this community build thread for the R7800.

You might try one of the builds over there.

Have you distributed the IRQs to different CPU cores? By default everything runs on core0 in the R7800.
Search the R7800 exploration thread for discussion on that.

That does not have much impact on normal operations, but as you are looking for the last millisecond, you might be interested.
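As a rough illustration of what "distributing the IRQs" means in practice: Linux exposes a hex CPU bitmask per interrupt in /proc/irq/<n>/smp_affinity, and writing a different mask moves that interrupt to another core. The sketch below only shows the mask arithmetic and the file write. The helper names are my own, and the IRQ numbers differ per device and build, so they have to be looked up in /proc/interrupts first.

```python
from pathlib import Path

def affinity_mask(cpu):
    """Hex bitmask selecting a single CPU core, as smp_affinity expects."""
    if cpu < 0:
        raise ValueError("cpu index must be non-negative")
    return format(1 << cpu, "x")

def pin_irq(irq, cpu, proc_root="/proc"):
    """Write the mask for `cpu` into the IRQ's smp_affinity file (needs root)."""
    path = Path(proc_root) / "irq" / str(irq) / "smp_affinity"
    path.write_text(affinity_mask(cpu))
    return path

# The R7800 has two cores, so only two masks are relevant here.
for core in range(2):
    print(f"core{core} -> mask {affinity_mask(core)}")
```

On the router itself the same thing is normally done from the shell by echoing the mask into that file; the Python version is only meant to make the mask calculation explicit.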

Yes, I have. But I am no longer looking at the last ms: I am looking at the 50..100ms latency spikes twice a minute or so.

I have read all the posts I could find, including that entire thread, and unless I missed something, people there are talking about just a latency increase of 2..3ms, and I have already made my peace with that. I am just puzzled as to why I am the only one experiencing these order-of-magnitude spikes. No-one in this thread said they experience them either.

Maybe no one else has a VDSL modem bridged to a R7800...

It may well be that it is something in your package selection, configuration and resource utilisation that every now and then causes a brief CPU or I/O utilisation spike.

To find the possible culprit package combination, you may need to disable all your packages and, if there are then no spikes, start the packages one by one until you find what adds enough load to cause the spikes.
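The one-by-one procedure could be sketched like this. Everything here is hypothetical: `has_spikes` stands for whatever check you use (for example, start the listed services and run a few minutes of ping), and the function does not start or stop anything by itself.

```python
def find_culprit(services, has_spikes):
    """Enable services one at a time; return the first one after which
    has_spikes(enabled_so_far) reports spikes, or None if none does."""
    enabled = []
    for name in services:
        enabled.append(name)
        if has_spikes(list(enabled)):
            return name
    return None
```

If it returns None even with everything enabled, the spikes probably depend on something other than the package set, such as configuration or connected clients.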

Just for reference, I tested ping to an external site while watching a streamed HD video, and the latency is steadily 5.6-8.2 ms. No spikes for my R7800. (And I have rather similar packages as yours. adblock, luci statistics, sqm, ddns (no actual usage), nlbwmon, but I am not logging anything to USB.)

--- www.tut.fi ping statistics ---
199 packets transmitted, 199 packets received, 0% packet loss
round-trip min/avg/max = 5.588/6.356/8.169 ms

My router is configured with five SSIDs: three for 5GHz (two guest networks) and two for 2.4GHz (one guest network), plus one VLAN on one of the switch ports. All three guest WiFi networks and the VLAN are in the same firewall guest zone. I can easily have at least 15..20 devices connected at any given moment, and most of them are wireless and mostly idle. Besides that, I have added BCP38 and a few firewall rules for the guest network as well as to force using local DNS only.

Is there a reasonable limit to how many SSIDs and/or connected devices this router (R7800) is expected to handle? Right after a factory reset, there is only one device connected and the spikes are smaller and further apart. I am using your latest master build.

I agree with @jwoods...chasing .001 seconds with pinpoint accuracy will be difficult.

I've watched this thread for quite some time, and I'm surprised it's flowed this far.

  • Your "baseline" was connecting a PC directly to your DSL modem, versus through your R7800. This will obviously add latency.
  • Another note, any software you run on the router to test - will add latency.

YES! As @slh mentioned, it's quite normal to see ~1 ms added when attaching a NAT-based router (i.e. a "hop").

Multiple things could be occurring:

  • If you're a gamer...someone could be cheating and crafting/spoofing/generating packets to your WAN port, causing things such as ICMP-Echo-Replies, ICMP-Host/Destination-Unreachable, etc. Many types of these packets cause CPU usage to increase on the router. For an example, see: https://www.cisco.com/c/en/us/about/security-center/ttl-expiry-attack.html (note: this example specifically mitigates Traceroute attacks)
  • Using the Zone/Global Option REJECT instead of DROP will increase CPU usage during attack, mis-configured remote host, recently closed TCP sessions (i.e. during speed tests), etc. This is because the router's CPU must create the proper ICMP or RST messages during REJECT.
  • A high TCP port timeout combined with fast, port-intensive connections (i.e. back-to-back speedtests, P2P, etc. may benefit from setting smaller TCP timeouts)
  • TCP connections take more CPU resources than UDP...if you use P2P such as for file sharing, CPU usage on the router will increase, as these TCP connections are built and destroyed in the NAT table.
  • Utilizing near the maximum of your connection (e.g. during a speed test) will actually cause a CPU increase, as that's when traffic shaping actually "activates."
  • Being logged into LuCI with Autorefresh ON increases CPU usage.
  • An improperly configured QoS table can cause such latency as well. It's possible you dedicated Priority bandwidth to something that should have less priority.

Yeah, that is how it began and I agree that it is ok. But then I started chasing the 50..100ms latency spikes (at least) a couple of times a minute. They do not happen with the stock firmware nor with a PC directly connected to the modem.

  • As I noted about the directly-connected PC

Regarding the stock firmware:

Many have observed that phenomenon as well and noted it in the forums. I can speculate on many reasons for it, but it is mostly attributable to the fact that software improvements in OSes (i.e. the Linux kernel) tend to add some CPU lag. I observe this increase even as I upgrade versions of the same OS on PCs. One major known cause of latency in OpenWrt as compared to stock firmware is discussed in depth here: Hardware NAT For LEDE

Yes, I started looking at the wrong number during a 10 second ping test, but once I ran it for minutes I noticed the real issue, which was latency spikes.

No games played when I run the test

Everything is set to DROP

No-one else was using the network

Just 10 ping 8.8.8.8 sessions; no-one else connected.

Learned that quickly and even disabled uhttpd

Using SQM

I read that, but got the impression that the latency increase would be relatively constant and not an order of magnitude. Did I misunderstand it?

Interestingly the max spikes dropped from over 100ms to under 50ms once I switched the CPU scaling governors from ondemand to performance. Everything looks much better now even though not perfect.
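For reference, a minimal sketch of what switching the governor amounts to, assuming the standard Linux cpufreq sysfs layout (one scaling_governor file per core under /sys/devices/system/cpu/cpuN/cpufreq/). The function name and the `sysfs_root` parameter are my own, and the change does not survive a reboot unless it is rerun at startup.

```python
from pathlib import Path

def set_governor(governor, sysfs_root="/sys/devices/system/cpu"):
    """Write `governor` to every cpuN/cpufreq/scaling_governor file found.

    Returns the list of files that were changed. Needs root on a real system.
    """
    changed = []
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        gov_file.write_text(governor)
        changed.append(str(gov_file))
    return changed
```

Calling `set_governor("performance")` as root would apply it to all cores, and `set_governor("ondemand")` would revert. On the router itself the usual way is to echo the governor name into the same files from a startup script.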

Good catch!

I wonder how that will affect the life of the CPU, since it will now be running at maximum frequency.

This brings back vague memories of working with an underpowered phone running Android on its Linux kernel. There was a lot of tuning of the parameters around the governor required to get "snappy" response without draining the battery or putting the CPU into thermal protection.