A question for the experts here. For a while now I have been experiencing significant latency spikes that affect VoIP and page loading. I tested by pinging 8.8.8.8 while the router (Netgear R7800) was pretty much idle. I would start 6 to 8 concurrent ping sessions, and every minute I would see two or three spikes to 50 to 100 ms like the ones below, most of the time occurring simultaneously across multiple sessions.
2018-03-03 15:26:07 64 bytes from 8.8.8.8: icmp_seq=525 ttl=60 time=58.013 ms
2018-03-03 15:26:35 64 bytes from 8.8.8.8: icmp_seq=553 ttl=60 time=37.198 ms
2018-03-03 15:27:07 64 bytes from 8.8.8.8: icmp_seq=585 ttl=60 time=76.856 ms
2018-03-03 15:28:06 64 bytes from 8.8.8.8: icmp_seq=643 ttl=60 time=60.067 ms
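For anyone who wants to pull the spikes out of a long timestamped ping log instead of eyeballing it, here is a minimal sketch. It assumes the line format shown above; the 30 ms threshold, the function name `find_spikes`, and the second sample line (a made-up "normal" reply) are my own choices for illustration, not something from the actual test.

```python
import re

# Matches timestamped ping reply lines in the format shown above, e.g.
# "2018-03-03 15:26:07 64 bytes from 8.8.8.8: icmp_seq=525 ttl=60 time=58.013 ms"
REPLY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"icmp_seq=(?P<seq>\d+) .*time=(?P<rtt>[\d.]+) ms"
)

def find_spikes(lines, threshold_ms=30.0):
    """Return (timestamp, icmp_seq, rtt_ms) for replies slower than threshold_ms."""
    spikes = []
    for line in lines:
        m = REPLY.search(line)
        if m and float(m.group("rtt")) > threshold_ms:
            spikes.append((m.group("ts"), int(m.group("seq")), float(m.group("rtt"))))
    return spikes

sample = [
    "2018-03-03 15:26:07 64 bytes from 8.8.8.8: icmp_seq=525 ttl=60 time=58.013 ms",
    # Hypothetical non-spike reply, added only to show that it is filtered out.
    "2018-03-03 15:26:08 64 bytes from 8.8.8.8: icmp_seq=526 ttl=60 time=11.900 ms",
    "2018-03-03 15:26:35 64 bytes from 8.8.8.8: icmp_seq=553 ttl=60 time=37.198 ms",
]

for ts, seq, rtt in find_spikes(sample):
    print(ts, seq, rtt)
```

Running the same function over the logs of several concurrent sessions and comparing timestamps is one way to check whether the spikes really line up across sessions.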
As suggested in this thread, I tried moving the IRQs between CPU0 and CPU1 in different permutations, tried both 17.01 and master, etc., all with no success. Then I noticed that while the IRQs stay on their assigned CPU exclusively, all the other processes (kernel workers, hostapd, dnsmasq, etc.) are constantly jumping back and forth between the CPUs.
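For reference, moving an IRQ means writing a hex CPU bitmask to /proc/irq/<n>/smp_affinity (1 for CPU0, 2 for CPU1). The sketch below only builds the commands; the helper names and the IRQ numbers 31 and 32 are hypothetical, and the real numbers have to be read from /proc/interrupts on the device.

```python
def affinity_mask(cpu):
    """Hex bitmask selecting a single CPU, as expected by smp_affinity."""
    if cpu < 0:
        raise ValueError("cpu must be non-negative")
    return format(1 << cpu, "x")

def pin_commands(irqs, cpu):
    """Build the shell commands that would pin each IRQ to the given CPU."""
    mask = affinity_mask(cpu)
    return [f"echo {mask} > /proc/irq/{irq}/smp_affinity" for irq in irqs]

# Hypothetical IRQ numbers; look up the real ones in /proc/interrupts.
for cmd in pin_commands([31, 32], cpu=1):
    print(cmd)
```

Generating the commands rather than writing to /proc directly keeps the sketch safe to run anywhere; on the router the printed lines can be pasted into a shell or a startup script.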
I used @hnyman's build env and built a firmware with (and without) isolcpus=1 (based on master code): the only differences from the original here are an additional boot param and a few more recent commits. The rest remained the same.
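To confirm that a given build really booted with the parameter, the kernel command line in /proc/cmdline can be checked. A small sketch, assuming only plain CPU lists and ranges (any non-numeric flags in the value are skipped); the function name and the sample command line are made up for illustration.

```python
def isolated_cpus(cmdline):
    """Return the set of CPU numbers named in an isolcpus= boot parameter."""
    cpus = set()
    for token in cmdline.split():
        if not token.startswith("isolcpus="):
            continue
        for part in token[len("isolcpus="):].split(","):
            if "-" in part:
                # Range such as "1-3"
                lo, hi = part.split("-")
                cpus.update(range(int(lo), int(hi) + 1))
            elif part.isdigit():
                cpus.add(int(part))
            # Anything else (e.g. a flag word) is ignored in this sketch.
    return cpus

# Hypothetical command line; on the router use open("/proc/cmdline").read().
print(isolated_cpus("root=/dev/example isolcpus=1"))
```

An empty set means the parameter is absent, which is what the builds without isolcpus=1 should report.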
Then I tested the four permutations below (using a wired connection and the same source code):
- No isolcpus=1 and IRQs on CPU0
- No isolcpus=1 and IRQs on CPU1
- isolcpus=1 and IRQs on CPU0
- isolcpus=1 and IRQs on CPU1
The first three made no difference, but the last one reduced the spikes to ~20 ms, and they are now 10+ minutes apart instead of several per minute.
2018-03-03 15:34:03 PING 8.8.8.8 (8.8.8.8): 56 data bytes
2018-03-03 15:37:49 64 bytes from 8.8.8.8: icmp_seq=226 ttl=60 time=21.615 ms
2018-03-03 15:50:47
2018-03-03 15:50:47 --- 8.8.8.8 ping statistics ---
2018-03-03 15:50:47 1000 packets transmitted, 1000 packets received, 0.0% packet loss
2018-03-03 15:50:47 round-trip min/avg/max/stddev = 11.082/11.963/21.615/0.475 ms
2018-03-03 15:50:47 PING 8.8.8.8 (8.8.8.8): 56 data bytes
2018-03-03 15:57:12 64 bytes from 8.8.8.8: icmp_seq=382 ttl=60 time=22.734 ms
2018-03-03 16:07:33
2018-03-03 16:07:33 --- 8.8.8.8 ping statistics ---
2018-03-03 16:07:33 1000 packets transmitted, 1000 packets received, 0.0% packet loss
2018-03-03 16:07:33 round-trip min/avg/max/stddev = 10.913/11.839/22.734/0.492 ms
2018-03-03 16:07:33 PING 8.8.8.8 (8.8.8.8): 56 data bytes
2018-03-03 16:24:15
2018-03-03 16:24:15 --- 8.8.8.8 ping statistics ---
2018-03-03 16:24:15 1000 packets transmitted, 1000 packets received, 0.0% packet loss
2018-03-03 16:24:15 round-trip min/avg/max/stddev = 11.192/11.921/15.246/0.326 ms
So CPU1 now only services the IRQs for eth0, eth1, wifi0, and wifi1, while everything else runs on CPU0.
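One way to double-check where the processes end up is field 39 ("processor") of /proc/<pid>/stat, which per proc(5) is the CPU the task last ran on. The sketch below is a rough snapshot tool, not what I actually used; the function names are mine, and it looks only at each process's main thread.

```python
import glob

def last_cpu(stat_line):
    """Extract (comm, cpu) from one /proc/<pid>/stat line.

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' before counting fields.
    """
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()            # fields[0] is field 3 (state)
    return comm, int(fields[39 - 3])  # field 39 = processor

def snapshot():
    """Group process names by the CPU they last ran on."""
    result = {}
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                comm, cpu = last_cpu(f.read())
        except (OSError, IndexError, ValueError):
            continue  # process exited meanwhile, or unexpected format
        result.setdefault(cpu, []).append(comm)
    return result

if __name__ == "__main__":
    for cpu, names in sorted(snapshot().items()):
        print(f"CPU{cpu}: {len(names)} tasks, e.g. {sorted(set(names))[:5]}")
```

This is a single point-in-time reading, so it needs to be run repeatedly to see whether tasks migrate. With isolcpus=1 in effect I would expect ordinary userspace processes to show CPU0 every time, though per-CPU kernel threads can still legitimately appear on CPU1.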
Does this make sense, or am I seeing things? I am not quite sure I can explain why there is such a difference.