[SOLVED] Router (Netgear R7800) introduced latency spikes >100ms

And the final test: disabling both radios made no difference. At some point during testing I had several ping sessions open locally and on the router, and noticed that the spikes are actually more frequent than I realized. I guess with multiple sessions it is more likely to catch those short-lived spikes...

Question, how do you run top? I tend to use "top -d 1" to get at least an update every second (the shortest busybox top allows AFAIK) otherwise short stalls might not show in idle (if the accumulation time is much larger than the cpu stall duration).

Well, that seems to rule out wifi as a prime suspect, which is good. Could you install flent (flent.org) on a client and start a long test (say 600 seconds) through your router to see whether the stalls are periodic?

top -d 1 | awk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }' | tee /tmp/top.log

That gives me a complete history of top with timestamps.

Can you suggest the command to run flent? I cannot find a good example.

Nice, but this still "suffers" from the low sampling/long accumulation time of 1 second...
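
One way around the 1 second limit might be to skip top and sample /proc/stat directly at a shorter interval. This is only a rough sketch: `cpu_busy_pct` and `sample_cpu` are made-up helper names, and it assumes the router's `sleep` accepts fractional seconds, which depends on how busybox was built. With the usual 100 Hz tick, very short intervals only give a handful of ticks per sample, so the percentages get coarse.

```shell
# Busy percentage between two aggregate "cpu" lines taken from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal
cpu_busy_pct() {
    # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            idle = $5 + $6                      # idle + iowait
            if (NR == 1) { total0 = total; idle0 = idle; next }
            dt = total - total0
            di = idle - idle0
            if (dt > 0) printf "%.1f\n", 100 * (dt - di) / dt
            else print "0.0"
        }'
}

# Print a timestamped busy percentage every $1 seconds, $2 times.
# ASSUMPTION: "sleep" accepts fractional values such as 0.2.
sample_cpu() {
    prev=$(head -n 1 /proc/stat)
    n=0
    while [ "$n" -lt "$2" ]; do
        sleep "$1"
        cur=$(head -n 1 /proc/stat)
        echo "$(date '+%H:%M:%S') busy=$(cpu_busy_pct "$prev" "$cur")%"
        prev=$cur
        n=$((n + 1))
    done
}
```

Something like `sample_cpu 0.2 50 | tee /tmp/cpu.log` would then log ten seconds of samples; a stall should show up as one or two lines with a much higher busy value than their neighbours.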

Are you using the stable build? I had the same issue on kernel 4.4; try a snapshot with kernel 4.9 (@hnyman's build), where the issue goes away.

Hmm, I am indeed using the stable build and have been trying to stay away from the bleeding edge. Do I need to manually reconfigure the router, or can I preserve the settings?

You can preserve settings

And one more question: is the snapshot build stable to run?

Yes, I’m using @hnyman latest build r6152, works really well.
Or try @escalade builds, also really good, both on snapshot.

No such luck, unfortunately:

ping 8.8.8.8 | awk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }' | egrep -v " time=11| time=12| time=10| time=13"
2018-02-21 18:26:54 PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
2018-02-21 18:27:02 64 bytes from 8.8.8.8: icmp_seq=9 ttl=60 time=14.0 ms
2018-02-21 18:27:30 64 bytes from 8.8.8.8: icmp_seq=37 ttl=60 time=31.1 ms
2018-02-21 18:28:11 64 bytes from 8.8.8.8: icmp_seq=78 ttl=60 time=25.1 ms
2018-02-21 18:28:39 64 bytes from 8.8.8.8: icmp_seq=106 ttl=60 time=52.8 ms
2018-02-21 18:29:20 64 bytes from 8.8.8.8: icmp_seq=147 ttl=60 time=21.2 ms
2018-02-21 18:29:35 64 bytes from 8.8.8.8: icmp_seq=162 ttl=60 time=35.9 ms
2018-02-21 18:30:20 64 bytes from 8.8.8.8: icmp_seq=207 ttl=60 time=18.0 ms
2018-02-21 18:30:48 64 bytes from 8.8.8.8: icmp_seq=235 ttl=60 time=46.5 ms
2018-02-21 18:35:55 64 bytes from 8.8.8.8: icmp_seq=542 ttl=60 time=35.2 ms
2018-02-21 18:36:43 64 bytes from 8.8.8.8: icmp_seq=589 ttl=60 time=32.0 ms
2018-02-21 18:36:47 64 bytes from 8.8.8.8: icmp_seq=593 ttl=60 time=33.1 ms
2018-02-21 18:36:48 64 bytes from 8.8.8.8: icmp_seq=594 ttl=60 time=15.1 ms
2018-02-21 18:36:49 64 bytes from 8.8.8.8: icmp_seq=595 ttl=60 time=249 ms
2018-02-21 18:38:09 64 bytes from 8.8.8.8: icmp_seq=675 ttl=60 time=40.3 ms
2018-02-21 18:38:15 64 bytes from 8.8.8.8: icmp_seq=681 ttl=60 time=17.8 ms
2018-02-21 18:39:22 64 bytes from 8.8.8.8: icmp_seq=748 ttl=60 time=52.8 ms
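
One caveat with the egrep filter above: a pattern such as " time=10" also matches replies like time=107 ms, so anything between 100 and 139 ms would be hidden along with the normal 10 to 13 ms replies. Filtering on the numeric value avoids that. A minimal sketch, assuming a POSIX shell with awk; `ping_spikes` is a hypothetical helper name, not an existing tool:

```shell
# Keep only ping replies whose round-trip time is at or above a threshold.
ping_spikes() {
    # $1 = threshold in milliseconds; ping output is read from stdin
    awk -v limit="$1" '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^time=/) {
                    rtt = substr($i, 6) + 0     # numeric part after "time="
                    if (rtt >= limit) print
                    next
                }
            # lines without a time= field (headers, summaries) are dropped
        }'
}
```

It can sit at the end of the same pipeline, for example `ping 8.8.8.8 | ping_spikes 14`, and it still works if a timestamp has been prepended to each line, because it searches every field for `time=`.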

If it were my router, I would probably reset LEDE to factory settings and use that as a baseline test.

However, the time spent chasing .001 of a second is probably better spent on things like hardening security and improving other areas of your network.

That is true, but do you have any suggestions about those latency spikes? Are they normal/expected?

Different hardware here, but with a WiFi-attached RPi and a wired path that goes through a couple Cisco SG300-series switches, a FreeBSD firewall, a switch on one Archer C7, over a wired link to the other Archer C7, over WiFi to the RPi, I don't see significant variance in ping time, especially given the wireless link.

100 packets transmitted, 100 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.385/4.896/8.965/1.253 ms

I'd install flent on a known-good "desktop" box with a good network card and look at the distribution of ping time to both the "LAN" port as well as to the "WAN" port (assuming your router has two interfaces internally). You'll need to install netperf on the LEDE/OpenWRT router. The other test would be a straight-through forwarding test, without using the modem (box-to-box -- or tweak the routing table on a two-interface box to prevent using loopback).

Wow, I do not get anything like that here. If I get myself one of those C7’s, is there a hardware version that I need to stay away from?

I would use a tool like MTR, WinMTR, or PingPlotter to see what is happening on the route.

https://bitwizard.nl/mtr/

http://winmtr.net/

Whoa! Before you rush out and buy new hardware, those numbers only include

  • Servicing the interface associated with the LAN through the switch and bridge
  • Forwarding the packet (no NAT)
  • Servicing the interface associated with the WiFi through the bridge

and in a relatively unloaded setting as well.

Since NAT under LEDE/OpenWRT is typically done in software, that can be a load on the system at high rates. Servicing the interfaces can also be a significant load, becoming a limiting factor on throughput.

As I recall, the C7 is good for a few hundred Mbps throughput under LEDE/OpenWRT. By today's standards, it's a decent mid-level router. If I were buying now, I'd be looking at a dual-core router -- not so much to increase throughput, but to help keep other processes from interfering with it as much. With an open-source OS, you'll generally never get the throughput reported in testing such as https://www.smallnetbuilder.com/ since the driver for the hardware NAT accelerator is typically proprietary.

The tests I would look at, for both low rates and "flood", are:

  • to/from LAN interface
  • to/from WAN interface
  • through LAN <-> WAN, kernel forwarding only
  • through LAN <-> WAN, NAT enabled

Well, it might be extreme to replace the router, but I did test LAN/WiFi pings and there is no NAT involved here. Are these normal ping times? WiFi is 5GHz and all devices are on the same network.

Wired to Wireless

64 bytes from 192.168.1.42: icmp_seq=108 ttl=64 time=64.4 ms
64 bytes from 192.168.1.42: icmp_seq=109 ttl=64 time=83.5 ms
64 bytes from 192.168.1.42: icmp_seq=110 ttl=64 time=5.06 ms
64 bytes from 192.168.1.42: icmp_seq=111 ttl=64 time=26.3 ms
64 bytes from 192.168.1.42: icmp_seq=112 ttl=64 time=49.2 ms
64 bytes from 192.168.1.42: icmp_seq=113 ttl=64 time=71.6 ms
64 bytes from 192.168.1.42: icmp_seq=114 ttl=64 time=95.2 ms
64 bytes from 192.168.1.42: icmp_seq=115 ttl=64 time=15.0 ms
64 bytes from 192.168.1.42: icmp_seq=116 ttl=64 time=38.2 ms
64 bytes from 192.168.1.42: icmp_seq=117 ttl=64 time=61.2 ms
64 bytes from 192.168.1.42: icmp_seq=118 ttl=64 time=84.0 ms
64 bytes from 192.168.1.42: icmp_seq=119 ttl=64 time=107 ms
64 bytes from 192.168.1.42: icmp_seq=120 ttl=64 time=26.4 ms
64 bytes from 192.168.1.42: icmp_seq=121 ttl=64 time=49.4 ms
64 bytes from 192.168.1.42: icmp_seq=122 ttl=64 time=71.2 ms
64 bytes from 192.168.1.42: icmp_seq=123 ttl=64 time=94.1 ms

Wireless to Wireless

64 bytes from 192.168.1.42: icmp_seq=178 ttl=64 time=100.934 ms
64 bytes from 192.168.1.42: icmp_seq=179 ttl=64 time=19.201 ms
64 bytes from 192.168.1.42: icmp_seq=180 ttl=64 time=40.173 ms
64 bytes from 192.168.1.42: icmp_seq=181 ttl=64 time=60.225 ms
64 bytes from 192.168.1.42: icmp_seq=182 ttl=64 time=80.121 ms
64 bytes from 192.168.1.42: icmp_seq=183 ttl=64 time=102.256 ms
64 bytes from 192.168.1.42: icmp_seq=184 ttl=64 time=19.270 ms
64 bytes from 192.168.1.42: icmp_seq=185 ttl=64 time=37.216 ms
64 bytes from 192.168.1.42: icmp_seq=186 ttl=64 time=58.799 ms
64 bytes from 192.168.1.42: icmp_seq=187 ttl=64 time=78.216 ms
64 bytes from 192.168.1.42: icmp_seq=188 ttl=64 time=97.292 ms
64 bytes from 192.168.1.42: icmp_seq=189 ttl=64 time=317.608 ms
64 bytes from 192.168.1.42: icmp_seq=190 ttl=64 time=35.863 ms
64 bytes from 192.168.1.42: icmp_seq=191 ttl=64 time=6.736 ms
64 bytes from 192.168.1.42: icmp_seq=192 ttl=64 time=5.320 ms
64 bytes from 192.168.1.42: icmp_seq=193 ttl=64 time=4.146 ms
64 bytes from 192.168.1.42: icmp_seq=194 ttl=64 time=7.696 ms
64 bytes from 192.168.1.42: icmp_seq=195 ttl=64 time=35.684 ms
64 bytes from 192.168.1.42: icmp_seq=196 ttl=64 time=7.312 ms
64 bytes from 192.168.1.42: icmp_seq=197 ttl=64 time=74.690 ms
64 bytes from 192.168.1.42: icmp_seq=198 ttl=64 time=97.320 ms
64 bytes from 192.168.1.42: icmp_seq=199 ttl=64 time=318.934 ms
64 bytes from 192.168.1.42: icmp_seq=200 ttl=64 time=35.296 ms

Wired to Wired

64 bytes from 192.168.1.36: icmp_seq=2 ttl=64 time=0.794 ms
64 bytes from 192.168.1.36: icmp_seq=3 ttl=64 time=0.764 ms
64 bytes from 192.168.1.36: icmp_seq=4 ttl=64 time=0.763 ms
64 bytes from 192.168.1.36: icmp_seq=5 ttl=64 time=0.774 ms
64 bytes from 192.168.1.36: icmp_seq=6 ttl=64 time=0.750 ms
64 bytes from 192.168.1.36: icmp_seq=7 ttl=64 time=0.747 ms
64 bytes from 192.168.1.36: icmp_seq=8 ttl=64 time=0.761 ms
64 bytes from 192.168.1.36: icmp_seq=9 ttl=64 time=0.749 ms
64 bytes from 192.168.1.36: icmp_seq=10 ttl=64 time=0.731 ms
64 bytes from 192.168.1.36: icmp_seq=11 ttl=64 time=0.768 ms
64 bytes from 192.168.1.36: icmp_seq=12 ttl=64 time=0.744 ms

Wired to Router

64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.634 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=0.580 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=0.580 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=64 time=0.569 ms
64 bytes from 192.168.1.1: icmp_seq=5 ttl=64 time=2.56 ms
64 bytes from 192.168.1.1: icmp_seq=6 ttl=64 time=0.547 ms
64 bytes from 192.168.1.1: icmp_seq=7 ttl=64 time=0.649 ms
64 bytes from 192.168.1.1: icmp_seq=8 ttl=64 time=0.529 ms
64 bytes from 192.168.1.1: icmp_seq=9 ttl=64 time=0.651 ms
64 bytes from 192.168.1.1: icmp_seq=10 ttl=64 time=0.974 ms
64 bytes from 192.168.1.1: icmp_seq=11 ttl=64 time=0.629 ms
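
To put the four runs above on the same footing as the min/avg/max/stddev line quoted earlier, the reply lines can be reduced to one summary per run. A sketch, assuming a POSIX shell with awk; `ping_summary` is a hypothetical helper name, and the stddev here is the population standard deviation, which may not be exactly what a given ping implementation prints:

```shell
# Summarize ping reply lines read from stdin as n, min, avg, max, stddev.
ping_summary() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^time=/) {
                    rtt = substr($i, 6) + 0     # numeric part after "time="
                    if (n == 0 || rtt < min) min = rtt
                    if (n == 0 || rtt > max) max = rtt
                    sum += rtt
                    sumsq += rtt * rtt
                    n++
                    next
                }
        }
        END {
            if (n == 0) { print "no replies"; exit }
            avg = sum / n
            var = sumsq / n - avg * avg
            if (var < 0) var = 0                # guard against rounding error
            printf "n=%d min/avg/max/stddev = %.3f/%.3f/%.3f/%.3f ms\n", n, min, avg, max, sqrt(var)
        }'
}
```

Saving each run to a file and running, say, `ping_summary < wired_to_wireless.txt` (file name made up) gives one comparable line per path.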