Possible cause of R7800 latency issues

So, I've pinged the first hop.

Here's the plot from my computer, connected via Ethernet to the R7800:

pc_til_r7800


This one is with a D-Link DIR-655 A5 between my computer and modem:

pc_til_dir655

Note the lack of spikes.


Computer connected to R7800 without the modem; the R7800 is connected straight to the fibre media converter, running VLAN 102 tagged on the WAN port for internet:

pc_til_r7800_nomodem

I have no idea why there are far fewer spikes here than with the modem in between. The DIR-655 doesn't have any issues with the modem, and neither does my computer when it's connected directly to it, as seen in previous plots.


And lastly, here's a plot of the R7800 pinging itself (192.168.1.1):

r7800_to_self

The ping is so low and stable that it's just a thin line at the bottom of the graph. Here's a summary of the ping session:

--- 192.168.1.1 ping statistics ---
1801 packets transmitted, 1801 packets received, 0% packet loss
round-trip min/avg/max = 0.100/0.288/0.548 ms
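For anyone who wants to put a number on the spike frequency instead of eyeballing the plots, here is a minimal Python sketch that counts replies above a threshold in saved ping output. The 10 ms threshold, the function name, and the sample lines are assumptions for illustration only, not measurements from this thread.

```python
import re

# Matches the RTT field of a ping reply line, e.g. "time=0.288 ms".
RTT_PATTERN = re.compile(r"time[=<]([0-9.]+)\s*ms")


def spike_stats(ping_output, threshold_ms=10.0):
    """Count replies whose RTT exceeds threshold_ms (threshold is an assumption)."""
    rtts = [float(m.group(1)) for m in RTT_PATTERN.finditer(ping_output)]
    spikes = [r for r in rtts if r > threshold_ms]
    fraction = len(spikes) / len(rtts) if rtts else 0.0
    return {
        "replies": len(rtts),
        "spikes": len(spikes),
        "spike_fraction": fraction,
        "max_ms": max(rtts, default=0.0),
    }


# Made-up sample lines, only to show the expected input format.
sample = """\
64 bytes from 192.168.1.1: seq=0 ttl=64 time=0.288 ms
64 bytes from 192.168.1.1: seq=1 ttl=64 time=0.301 ms
64 bytes from 192.168.1.1: seq=2 ttl=64 time=103.412 ms
64 bytes from 192.168.1.1: seq=3 ttl=64 time=0.275 ms
"""

if __name__ == "__main__":
    print(spike_stats(sample))
```

Feeding it the full per-packet log of a session, rather than just the summary, gives a spike fraction that can be compared between firmware builds.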

Interesting, so the spikes really are related to R7800.

So, your occurrence frequency with a "new" R7800 matches what I saw with my ancient one. (Note that I did my own testing without any IRQ optimisation or CPU isolation.)

I am not sure how much effort you should spend on such a minor issue. Tracking down the possible reasons will be very hard, as it may well be something in hardware, Linux itself, PCIe timings or ...

There is not that much R7800-specific source code (is there actually any?), so the same may well be reproducible with other IPQ8065 routers. (I am not sure if the others, like @slh, have really looked for spikes at 0.4% frequency.)

One comment regarding OEM stock firmware showing no spikes: it likely runs a very old Linux. Totally different kernel and routing.

@hnyman I think many people will not agree with you. This router is sold as fit for gamers but does not deliver.
I have two young adults at home, complaining to me every day about the lag spikes they experience during gaming.


The spikes become way more moderate and less frequent if you use cpu isolation and have a dedicated core for network interrupts. My kid has almost stopped complaining.
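For anyone wanting to try the dedicated-core approach: on Linux, IRQ affinity is set by writing a hexadecimal CPU bitmask to /proc/irq/<N>/smp_affinity. Below is a minimal Python sketch of how that mask is computed. The helper name is made up, and which IRQs belong to the network interfaces is not covered here; those numbers differ per device and are listed in /proc/interrupts.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string for a list of CPU indices.

    Bit N of the mask corresponds to CPU N, which is the format that
    /proc/irq/<N>/smp_affinity expects.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


if __name__ == "__main__":
    # On a dual-core SoC, CPU1 only gives mask "2"; both cores give "3".
    print("CPU1 only:", affinity_mask([1]))
    print("CPU0 + CPU1:", affinity_mask([0, 1]))
```

The resulting string is what would be echoed into the smp_affinity file of each network IRQ, with the other core left for everything else.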

Correct, I built my own image with all your suggestions implemented. But as you write, almost is the keyword here...
Testing stock FW now, no complaints until now. I have asked them to specifically look for lag spikes during this week.

I agree that this is an unacceptable issue on a router marketed towards gaming, but then again it's not present on stock firmware, though I haven't performed anywhere near the same level of exhaustive testing on stock.

@hnyman what about porting the R7800 to kernel 4.14? The kernel partition needs to be expanded for that to happen, right? Who knows, maybe 4.14 will solve this issue. I can't hope to find the cause with my limited knowledge if it's something really intricate.

I haven't noticed any issues on stock while gaming, especially with Streamboost on. Maybe it's a client issue interacting with the 9984? Most of my testing is with an Intel 9260ac and 2x 8265ac adapters in my 3 laptops. I did run into some people, though, who had a Killer 1535ac (aka a QCA6174A rebrand) with latency issues even on stock R7800 firmware. Or maybe I haven't tried hard enough to look for it. I will do some testing this weekend. In the meantime, if you have intermittent latency issues on an Intel adapter, disable Packet Coalescing; on the 1535 I believe one solution was to change roaming to minimal, as it was highly aggressive. Can't fault them for the gaming moniker if these spikes are mostly on LEDE, though, as that's not under their control and they don't guarantee performance on third-party firmware. Whatever the case, I'm sure you guys will figure it out, but I'm not as capable as some of you, so all I can do is offer to test any fixes on LEDE/OpenWrt, as I have a spare R7800.

I have tested ~1000 pings and more at a time without significant outliers, not that I'd consider 0.4% an outlier myself. Personally I don't really care about ping times (within reason) at all. If I did, I'd switch to lantiq and halve them across the board.

That is a very correct statement, but how do a newer kernel and routing explain 100ms latency spikes? Added latency is a different thing and this topic is about the latency spikes. Especially peculiar when this router is supposed to have plenty of muscle to handle routing, collectd, nlbwmon, etc.

We are just hoping to get some pointers/advice on how to move forward, what to test, etc. All in the hopes of narrowing down the component at fault.

I have a weird idea...

Can someone try flashing Chaos Calmer and see if there's a notable difference?

Is that to prove that the newer kernel is slower and the routing is all different? Or there is something else that you suspect that can be causing this? Can you elaborate a bit more?

Absolutely...at this time we have 3 devices:

  • the R7800 version B (problem)
  • the R7800 version A (no issues)
  • other devices like it (no issues)

So, we want to see if there's a difference. If Chaos Calmer exists for the board, let's see if it works. You could then successfully (and quickly) eliminate the hardware as the cause of the problem.

e.g. by introducing new sporadic race conditions that affect just certain hardware combinations and use cases. Linux kernel development is not specifically tailored toward router usage.

One example is that HTB qdisc performance deteriorated between kernel 4.4 and 4.8 in a way that affects dual-core devices like the IPQ8065 in the R7800. Discussion about that can be found in the SQM bug discussion:
https://github.com/tohojo/sqm-scripts/issues/48#issuecomment-260991014
Kernel devs noticed that problem and later implemented a fix with
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/net/sched/sch_htb.c?id=a9efad8b24bd22616f6c749a6c029957dc76542b

Another example is the clearly decreased routing power between AA12.09 and BB14.07, four years ago: Linux kernel developers removed the routing cache in kernel 3.6, and that decreased the routing power substantially for BB14.07 with kernel 3.10. OpenWrt devs figured out some ways to mitigate that, but there was still an impact for a while. Fixed in CC15.05 (possibly backported to 14.07, but I haven't bothered to check).
https://forum.openwrt.org/viewtopic.php?id=51726
https://dev.openwrt.org/changeset/43587

So, kernel version changes may cause really large routing and network traffic performance issues that affect only some user groups. It is similarly quite possible that current kernels consume/reserve/queue hardware resources slightly differently than the old kernels used in R7800 OEM firmware. I haven't checked the new OEM firmware versions, but at least the original ones were using kernel 3.4 and were compiled with GCC 4.6.3 from 2012:
https://forum.openwrt.org/viewtopic.php?pid=325409#p325409

My personal guess is that it is some minor race-condition type of critical resource scarcity that pops up sporadically, but affects only certain processors (CPU family, core count, ...), touching just the R7800. Alternatively, some error/imperfection in the hardware design and/or the DTS hardware description.

As it only pops up in something like 0.4% of packets, tracking it down will be very hard, and may be nearly impossible for us laymen, especially if it is tied to multiple factors such as CPU frequency scaling, IRQs, core context switches, the hardware DTS, etc.

It could also be something that CPU oriented patches from @dissent1 could fix:
https://github.com/openwrt/openwrt/pull/632

My point is merely that it can be really hard to track down, as it is likely tightly hardware related if it does not surface on other similar routers using the same chip.

@lleachii
Not sure if there is any problem-free "version A", as I found the problem yesterday with my old R7800. But as the problem is so rare (0.2-0.4% of packets???), it is very hard to notice unless you are doing something really real-time.
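To get a feel for how easy it is to miss something this rare, here is a rough Python sketch of the arithmetic. It assumes each packet independently has the same chance of spiking, which may well not hold if the spikes come in bursts; the function names are just for illustration.

```python
def expected_spikes(packets, spike_rate):
    """Expected number of spiked packets in a session of the given length."""
    return packets * spike_rate


def prob_no_spikes(packets, spike_rate):
    """Chance that a session shows no spike at all.

    Assumes spikes are independent per packet (an assumption, not a
    measured property of the R7800).
    """
    return (1.0 - spike_rate) ** packets


if __name__ == "__main__":
    for rate in (0.002, 0.004):
        for packets in (100, 1000, 1801):
            print(
                f"rate={rate:.1%} packets={packets}: "
                f"expected spikes={expected_spikes(packets, rate):.1f}, "
                f"P(no spike)={prob_no_spikes(packets, rate):.3f}"
            )
```

Under that assumption, a short session can easily come out clean, while a session of a thousand or more pings should almost always show at least one spike if the rate really is in that range.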


Does not look like it does, though.


Thx for the explanation, I agree on all counts. But I would still like to try to identify what is causing the spikes, even with my limited knowledge. Thank you for pointing me to that pull request. I am gonna try it, as well as the other one that includes some fixes and cleanups for this board.

I wonder why that hasn't been merged after being open for almost 3 months now.

Let's help test it! Not enough testing might have something to do with it. Unless @dissent1 can provide more details.


Another reason may be that @blogic, who has mainly taken care of ipq806x target maintenance, is now more involved with ipq40xx.

First things first: @huaracheguarache and I are gonna test it shortly. If it does not fix the latency, we could at least deploy these anyway to test for regressions. That might help move the pull requests along.
