Kernel: normal (throughput) vs voluntary preemption vs forced preemption (low latency)

Is there a reason why the kernels for a lot of routers, including the high-end ones, are not compiled as "Preemptible Kernel (Low-Latency Desktop)" by default? My understanding is that routers should favour response time over raw throughput, but I could be missing a bigger picture here. Does anyone know?

config PREEMPT_NONE
	bool "No Forced Preemption (Server)"
	help
	  This is the traditional Linux preemption model, geared towards
	  throughput. It will still provide good latencies most of the
	  time, but there are no guarantees and occasional longer delays
	  are possible.

	  Select this option if you are building a kernel for a server or
	  scientific/computation system, or if you want to maximize the
	  raw processing power of the kernel, irrespective of scheduling
	  latencies.

config PREEMPT_VOLUNTARY
	bool "Voluntary Kernel Preemption (Desktop)"
	help
	  This option reduces the latency of the kernel by adding more
	  "explicit preemption points" to the kernel code. These new
	  preemption points have been selected to reduce the maximum
	  latency of rescheduling, providing faster application reactions,
	  at the cost of slightly lower throughput.

	  This allows reaction to interactive events by allowing a
	  low priority process to voluntarily preempt itself even if it
	  is in kernel mode executing a system call. This allows
	  applications to run more 'smoothly' even when the system is
	  under load.

	  Select this if you are building a kernel for a desktop system.

config PREEMPT
	bool "Preemptible Kernel (Low-Latency Desktop)"
	help
	  This option reduces the latency of the kernel by making
	  all kernel code (that is not executing in a critical section)
	  preemptible.  This allows reaction to interactive events by
	  permitting a low priority process to be preempted involuntarily
	  even if it is in kernel mode executing a system call and would
	  otherwise not be about to reach a natural preemption point.
	  This allows applications to run more 'smoothly' even when the
	  system is under load, at the cost of slightly lower throughput
	  and a slight runtime overhead to kernel code.

	  Select this if you are building a kernel for a desktop or
	  embedded system with latency requirements in the milliseconds
	  range.
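
(For reference, a quick way to see which of these a running kernel was built with, assuming the kernel was built with CONFIG_IKCONFIG_PROC so that /proc/config.gz exists:)

# Show the preemption options the running kernel was compiled with:
zcat /proc/config.gz | grep -E '^CONFIG_PREEMPT'
# prints e.g. CONFIG_PREEMPT_NONE=y, CONFIG_PREEMPT_VOLUNTARY=y or CONFIG_PREEMPT=y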

Someone will correct me where I'm wrong, but, my understanding is:

Routers aren't "interactive" in the sense of a desktop system. Most of what they do happens in kernel code. The "applications", the userland software, are daemons such as hostapd, dnsmasq, or uhttpd. Letting them preempt the kernel means interrupting the kernel's packet processing, which potentially increases network latency and jitter, and that is counterproductive for the responsiveness of the network.

As for the daemons, they sit around waiting for a packet to come in, via the kernel. When the buffer is ready, the kernel switches execution to the daemon process, which will typically do one of the following:

  • Do a little processing and return a result quickly.
  • Make a network request of its own (e.g. dnsmasq), and then block while waiting for a reply, thus yielding to the kernel.
  • Fork a process, then block until the process returns output.

In most of these cases, the userland process gets the CPU as soon as it has work to do.
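
(A quick way to see that, assuming a single dnsmasq process is running: nearly every time you look, it is in state S, i.e. sleeping in the kernel and waiting for work rather than using the CPU.)

# Show the current scheduling state of a typical router daemon:
grep '^State' /proc/$(pidof dnsmasq)/status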


I do not think it is only about the userland processes: a higher-priority interrupt would also preempt a lower-priority one as well as a userland process, which should be beneficial to network responsiveness. The first column below is the priority (lower value == higher priority), and a fully preemptible kernel would make sure that the network IRQs (28/29/31/32) were serviced as soon as possible, even at the expense of the userland processes and the lower-priority IRQs.
My understanding is that the default kernel does not do that, because it is built for maximum throughput and not for low latency.

           CPU0       CPU1       
 16:   33204276    8031824     GIC-0  18 Edge      gp_timer
 18:         33          0     GIC-0  51 Edge      qcom_rpm_ack
 19:          0          0     GIC-0  53 Edge      qcom_rpm_err
 20:          0          0     GIC-0  54 Edge      qcom_rpm_wakeup
 26:          0          0     GIC-0 241 Edge      ahci[29000000.sata]
 27:          0          0     GIC-0 210 Edge      tsens_interrupt
 28:   52575564          0     GIC-0  67 Edge      qcom-pcie-msi
 29:   16533508          0     GIC-0  89 Edge      qcom-pcie-msi
 30:    1958157          0     GIC-0 202 Edge      adm_dma
 31:         13   18492628     GIC-0 255 Level     eth0
 32:          5    5890525     GIC-0 258 Level     eth1
 33:          0          0     GIC-0 130 Level     bam_dma
 34:          0          0     GIC-0 128 Level     bam_dma
 35:          0          0   PCI-MSI   0 Edge      aerdrv
 36:   52575564          0   PCI-MSI   1 Edge      ath10k_pci
 68:          0          0   PCI-MSI   0 Edge      aerdrv
 69:   16533508          0   PCI-MSI   1 Edge      ath10k_pci
101:         12          0     GIC-0 184 Level     msm_serial0
102:          2          0   msmgpio   6 Edge      keys
103:          2          0   msmgpio  54 Edge      keys
104:          2          0   msmgpio  65 Edge      keys
105:          0          0     GIC-0 142 Level     xhci-hcd:usb1
106:          0          0     GIC-0 237 Level     xhci-hcd:usb3
IPI0:          0          0  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:     187606      13271  Rescheduling interrupts
IPI3:          0         14  Function call interrupts
IPI4:          0          0  CPU stop interrupts
IPI5:       4245          1  IRQ work interrupts
IPI6:          0          0  completion interrupts
Err:          0

As most kernel interrupt handlers are very quick to return, I would be very surprised if this made a noticeable change in routing performance (say 1 ms or more). If tests show otherwise, it would be worth exploring what other impacts it might have.

My suspicion is that it is intended for things like Android phones to improve UI responsiveness (as the UI is managed by the “JVM”).

It is not about getting that extra 1 ms of savings, but rather about making sure that no userland process (collectd, nlbwmon, luci with or without auto-refresh, dnsmasq, dnscrypt-proxy, samba, etc.), none of which have anything to do with routing, is ever allowed to steal CPU cycles from the routing function. It is about consistency.
If a fully preemptible kernel can provide a better response-time guarantee, then why is it not the default? It is the default kernel mode for some targets, but not for all.

egrep -r "^CONFIG_PREEMPT_VOLUNTARY=y|^CONFIG_PREEMPT=y" target
target/linux/pistachio/config-4.14:CONFIG_PREEMPT_VOLUNTARY=y
target/linux/gemini/config-4.19:CONFIG_PREEMPT=y
target/linux/gemini/config-4.14:CONFIG_PREEMPT=y
target/linux/samsung/s5pv210/config-4.14:CONFIG_PREEMPT=y
target/linux/mediatek/mt7623/config-4.14:CONFIG_PREEMPT=y
target/linux/zynq/config-4.14:CONFIG_PREEMPT=y
target/linux/archs38/config-4.14:CONFIG_PREEMPT=y
target/linux/layerscape/armv8_64b/config-4.14:CONFIG_PREEMPT=y
target/linux/sunxi/config-4.19:CONFIG_PREEMPT=y
target/linux/sunxi/config-4.14:CONFIG_PREEMPT=y
target/linux/arc770/config-4.14:CONFIG_PREEMPT=y
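
For anyone who wants to experiment on their own build rather than wait for a default change, a rough sketch of how to flip this in the OpenWrt buildroot (the exact menu location of the "Preemption Model" choice varies by architecture):

# From a configured buildroot, open the kernel config for the selected target:
make kernel_menuconfig
# select "Preemptible Kernel (Low-Latency Desktop)" under "Preemption Model",
# save, then rebuild the images:
make -j$(nproc)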

People have put a great deal of effort into optimizing the Linux networking stack over the past 4-5 years to lower latency and improve consistency. Edge networking devices in general (i.e. consumer routers) were the primary use case for much of that work, and OpenWRT was where a lot of the end result was first deployed widely. Graduate theses and dissertations have been written and published about some of the work.

I don't know whether or not optimizing kernel preemption got much attention. I doubt it did, because my understanding is that the root cause of high and unpredictable latency in Linux-based routers was largely a pathological combination of speed mismatches (e.g. a GigE LAN feeding a DSL link) and naive or misguided queue implementations in the network stack. Much of that has been addressed for a few years now, but there is still ongoing work on WiFi drivers.
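
(As a concrete example of that queueing work: on current OpenWrt the sqm-scripts / luci-app-sqm packages set this up for you, but the effect can be sketched by hand with tc. The interface name and rate below are just placeholders, and this assumes the cake qdisc module is installed and eth0 is a WAN link with roughly 20 Mbit of upstream:)

# Show the qdisc currently attached to the WAN interface:
tc qdisc show dev eth0
# Replace it with cake, shaped to just below the uplink rate so the queue
# stays in the router where it can be managed:
tc qdisc replace dev eth0 root cake bandwidth 19Mbit
# Revert to the default qdisc:
tc qdisc del dev eth0 root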

Ignoring all that, from my naive POV I have doubts that forced preemption of kernel threads would be an obvious win on its own.

Are you concerned about lower latency or not? Because if you are, improving latency means chasing after nanoseconds and microseconds in the hope that enough of them add up to 1 ms.

Why do you think any of them can steal CPU cycles from the packet-forwarding functions? They are all preemptible, and more importantly, they are all networking daemons. On a typical consumer router they spend most of their time blocked on network I/O. The kernel is processing packets, and if it gets some packets destined for one of the network daemons, it transfers execution to them. If they take too long to finish what they are doing, the kernel forces them to take a break (i.e. preempts them) so that it has time to do its own work and so other processes that need execution time can get it.
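
(The kernel even keeps per-process counters for exactly that, which makes it easy to check how often a daemon yields on its own versus gets preempted. Assuming a single dnsmasq process:)

# voluntary_ctxt_switches:    times the process blocked/yielded by itself
# nonvoluntary_ctxt_switches: times the scheduler took the CPU away from it
grep 'ctxt_switches' /proc/$(pidof dnsmasq)/status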

As for forced preemption of kernel tasks in order to service network interrupts, that's not an obvious win. For one thing, most of the interrupts are going to be network interrupts, and those that aren't are going to be various timers.

Further, any kernel tasks that might be forcibly pre-empted by a network interrupt are still going to have to get done. Stopping them in the middle means more overhead.

Finally, context changes, any context changes (both within the kernel and between userland and kernel), are "expensive": they consume CPU time. Avoiding them leaves more CPU time and memory bandwidth to do "real work." Some context changes are necessary, but others can be avoided. For example, preempting a task requires both that pipelines are drained and registers are saved, and then that they are restored again when the task is scheduled back in. On the other hand, letting the task complete allows a simpler and smoother transition.

Avoiding preemption allows more efficient use of computational resources, which improves throughput and can improve latency. It can also benefit consistency of response times.
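
(The cumulative cost is easy to watch: /proc/stat keeps a system-wide counter of context switches, so sampling it twice gives a rough rate.)

# Sample the system-wide context-switch counter ten seconds apart:
grep '^ctxt' /proc/stat; sleep 10; grep '^ctxt' /proc/stat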

Remember too, this isn't a theoretical exercise. These are typically gigabit Ethernet interfaces and 802.11ac WiFi, with code executing on router SoCs. The SoCs have 1-4 cores; some have hardware support for multiple threads/core (which reduces the cost of a context change). Clock speeds are 500 MHz to 1.5 GHz. They don't devote a lot of silicon to extracting instruction-level parallelism, so they probably only retire ~1 instruction/cycle. GigE can receive, at most, 1,488,096 frames/second, or one every ~672 ns. I don't know about WiFi, but it is probably 1-2 orders of magnitude less.
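
(Where the 1,488,096 figure comes from, for anyone checking the arithmetic: a minimum-size frame on the wire is 64 bytes of frame + 8 bytes of preamble/SFD + 12 bytes of inter-frame gap = 84 bytes = 672 bit times, and 1,000,000,000 bit/s ÷ 672 bits ≈ 1,488,095 frames/second, i.e. one frame every ~672 ns.)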

What's forced preemption in the kernel going to gain you within those constraints? What are the costs?

It looks to me like most of the architectures in your list are not network SoCs. The MediaTek one kind of is, but it's the kind that includes dedicated hardware for video and a high-bandwidth camera interface. Zynq is a little ARM SoC with a less-little FPGA attached. Pistachio was/is targeted at things like chromebooks.


If I remember correctly, there was a long thread on issues with ping spikes on the r7800 in which the preemption options were seen to cut not a few ms but potentially 50-100 ms.

So all the theorizing is nice, but voluntary-preemption kernels clearly have their place. The fully preemptible one didn't run on this arch, if I remember correctly.

It was broken at first, but quickly fixed. So both ran better than the default one, but an attempt to make it the default was rejected in favour of fixing the real issue. So the kernel config remained the same, even though a preemptible one has a lot of potential. I was hoping someone would know why that is the case.
As far as I remember, the OEM firmware for the r7800 uses a preemptible kernel, but I might be wrong here.

Preemptible kernels should also help a lot when binary blobs are involved in drivers like ath10k and so forth. We can't "fix" those if they have latency issues, but we can potentially do a better job of managing latency by letting the kernel preempt stuff.

I run Debian on my x86 router and it uses the voluntary-preempt kernel by default, if I remember correctly; it's also tickless. I think this should be the default for x86, ARM, and other higher-end SoCs. I'm not sure what should be done for the lower-end stuff.
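
(Easy enough to confirm on a Debian machine, since the distribution kernel ships its config in /boot:)

# Check the preemption model and tick configuration of the running Debian kernel:
grep -E '^CONFIG_PREEMPT|^CONFIG_NO_HZ|^CONFIG_HZ' /boot/config-$(uname -r)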

The more I read about this, the more confused I get. Everyone is saying that the userland processes are (and have always been) preemptible: if the kernel needs the CPU (for a high-priority interrupt, etc.), a userland process gets kicked out. The preemptible kernel is meant to benefit the user-space processes, so that a single user process in a kernel call does not make every other process wait for too long. The same goes for the timer tick (100 Hz vs 1000 Hz): it is about improving user-space responsiveness at the expense of the kernel doing more work (more context switching, etc.). So according to all of that, I had it backwards: a router spends most of its time in kernel space, so making the kernel preemptible in any way (and using a higher-frequency timer tick) should be detrimental to its performance.
Having said that, I no longer understand how running a preemptible kernel helped remove the 50-100 ms latency spikes on the r7800 a while ago. The root cause is now known: an inefficient implementation of the MIB counter reads on the switch. But I would assume that code runs in kernel space, so it should not have been affected by a preemptible kernel.
Any insight?

Querying the MIB counters on the AR8337 takes time, and apparently the kernel busy-waits for the results (not sleeping and allowing other processes to do something, but blocking and re-checking for results in a tight loop). With preemption and a higher tick rate, the busy-waiting threads can (to some extent) be forcefully interrupted and other tasks rescheduled. But this (increasing the tick rate and allowing preemption) was a workaround, not a bugfix.

No argument here. At this point I am just trying to understand whether spending the time and effort to compile my own custom kernel is worth it in general: I only have one router, so I cannot really test all the permutations. If it isn't, I could just use the Image Builder and spin up new firmware in five minutes.
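
(For reference, that path looks roughly like this with the Image Builder; the profile and package list below are just placeholders for whatever your device and setup actually need:)

# List the available device profiles, then build an image for one of them:
make info
make image PROFILE="netgear_r7800" PACKAGES="luci collectd"
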
The original MIB issue seems to have been fixed (merged to master back in May via https://github.com/openwrt/openwrt/pull/1984). So if my new understanding of how preemption works is correct, I should just stay with all the defaults.