Possible cause of R7800 latency issues

e.g. by introducing new sporadic race conditions that affect just certain hardware combinations and use cases. Linux kernel development is not specifically tailored toward router usage.

One example is that HTB qdisc performance deteriorated between kernels 4.4 and 4.8 in a way that affects dual-core devices like the IPQ8065 in the R7800. That is covered in the SQM bug discussion:
https://github.com/tohojo/sqm-scripts/issues/48#issuecomment-260991014
Kernel devs noticed the problem and later implemented a fix with
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/net/sched/sch_htb.c?id=a9efad8b24bd22616f6c749a6c029957dc76542b

Another example is the clearly decreased routing performance between AA12.09 and BB14.07, four years ago: Linux kernel developers removed the routing cache in kernel 3.6, and that decreased routing performance substantially for BB14.07 with kernel 3.10. OpenWrt devs figured out some ways to mitigate that, but there was still an impact for a while. Fixed in CC15.05 (possibly backported to 14.07, but I haven't bothered to check).
https://forum.openwrt.org/viewtopic.php?id=51726
https://dev.openwrt.org/changeset/43587

So, kernel version changes can cause really large routing & network traffic performance issues that affect only some user groups. It is similarly quite possible that current kernels consume/reserve/queue hardware resources slightly differently than the old kernels used in the R7800 OEM firmware. I haven't checked the new OEM firmware versions, but at least the original ones were using kernel 3.4 and were compiled with GCC 4.6.3 from 2012:
https://forum.openwrt.org/viewtopic.php?pid=325409#p325409

My personal guess is that it is some minor race-condition type of critical resource scarcity that pops up sporadically, but affects only certain processors (CPU family, core count, ...), touching just the R7800. Alternatively, it could be some error/imperfection in the hardware design and/or the DTS hardware description.

As it only pops up in something like 0.4% of packets, tracking it down will be very hard, and may be nearly impossible for us laymen, especially if it is tied to multiple factors such as CPU frequencies, IRQs, core context switches, hardware DTS, etc.

It could also be something that the CPU-oriented patches from @dissent1 could fix:
https://github.com/openwrt/openwrt/pull/632

My point is merely that it can be really hard to track down, since it is likely tightly hardware-related if it does not surface on other similar routers using the same chip.

@lleachii
Not sure if there is any problem-free "version A", as I found the problem yesterday with my old R7800. But as the problem is so rare (0.2-0.4% of packets???), it is very hard to notice unless you are doing something truly real-time.
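
For anyone who wants to put a number on how rare it is on their own unit, here is a rough Python sketch that counts slow replies in saved ping output. It is only an illustration: the function name spike_fraction and the threshold value are my own assumptions, and it assumes the usual "time=12.3 ms" field that Linux/BusyBox ping prints for each reply.

```python
import re

# Matches the per-reply "time=12.3 ms" field (assumed ping output format).
RTT_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def spike_fraction(ping_lines, threshold_ms):
    """Return (spikes, total, fraction) of replies slower than threshold_ms.

    threshold_ms is an arbitrary cut-off chosen by the user; what counts
    as a "spike" is an assumption, not something defined in this thread.
    """
    rtts = []
    for line in ping_lines:
        match = RTT_RE.search(line)
        if match:  # skip headers, summaries and lost-packet lines
            rtts.append(float(match.group(1)))
    spikes = sum(1 for rtt in rtts if rtt > threshold_ms)
    fraction = spikes / len(rtts) if rtts else 0.0
    return spikes, len(rtts), fraction
```

Usage would be something like saving a long ping run to a file and then calling spike_fraction(open("ping.log"), 50.0). With a rate around 0.2-0.4% you need several thousand replies before the count means much.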
