Optimizing smp_affinity for interface IRQs under conditions of max load

I'm looking to tune the interface irqs by assigning them to specific cpus on my 4-core x86_64 box.

What's the best way of doing this? I have a 400mbps connection running off eth0, shaped by sqm using layer_cake, and another ~55mbps connection on eth1, also shaped by cake.

My lan is unshaped.

The aim is to reduce cpu load and optimize latency under conditions of max utilization of the link, which would occur mainly on download, as the connection is heavily asymmetric (400mbps download, 38mbps upload).

So what is the optimal strategy?

  • Try to balance the irqs by assigning them across cpu1, cpu2 and cpu3 (not to cpu0) such that on a maxed-out download each cpu gets more or less the same ratio of interrupts, without regard to whether I'm mixing rx and tx interrupts (and indeed interrupts from different interfaces) on a specific cpu?

  • Or, assign all tx irqs on the wan interface to one cpu, all rx interrupts on the wan interface to another cpu and then both tx and rx on the lan interface to the third cpu

  • Some other way that I've not listed

Each ethernet interface has 4 tx and 4 rx queues, each serviced by a distinct interrupt

Did you ever find resolution to this?

Hmm, wouldn't the irqbalance package do point #1 automatically?

But if I remember correctly, SQM on a single connection will always run off of one core, due to the latency issue that pops up if you have to start sharing data between cores. I wonder if running two separate instances on one connection (one for download, one for upload) would work better if you're being bottlenecked by single threaded performance.

Irqbalance doesn't work efficiently and may even cause more problems on NICs with multiple rx/tx queues. Manually assigning them is always best. Be sure to pin only the receive queues with a hex mask of "e" (every other core besides cpu0). I don't know who is setting up pinning in the background of development, but I wish they would take the time to understand how pinning works before fiddling around. RSS is the hardware mechanism that's supposed to automatically assign interrupts to the proper cores, but the hash function isn't instantiated, so it's useless and just consumes more memory/cpu. RPS is software-based and uses more resources; without RSS, it's the only proper option. The admin needs to set RSS down to a single queue before configuring RPS. There's also RFS, which is more suited to a server-type workload.

How would one set this?

opkg update
opkg install ethtool

This command sets the RX flow hash indirection table so that all traffic lands on a single receive queue. Perform this on each interface; example below:

for NICS in eth0 eth1 eth2 eth3; do ethtool -X $NICS equal 1; done

It was worth a try: Cannot get RX ring count: Not supported (x86-64).

One more question about "Be sure to pin only the receive queues with a hex of 'e' (every other core besides cpu0)": if I am reading this correctly, on an 8-core machine the mask should be set to 10101010? Why "every other core" and why exclude cpu0?

To be clear, spread the workload over all other cores besides cpu0, which is the default core. On an 8-core machine the mask would be 11111110 in binary (cores 1-7, excluding cpu0), i.e. hex "fe".
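For reference, a minimal sketch of what the manual pinning looks like (the IRQ number 47 below is just an example; check /proc/interrupts for the actual numbers on your system):

# list the per-queue interrupts and their IRQ numbers
grep eth0 /proc/interrupts

# pin one RX queue IRQ (47 is an example) to cores 1-3 on a 4-core box: mask 0xe = binary 1110
echo e > /proc/irq/47/smp_affinity

# the software equivalent (RPS) uses the same mask format, per receive queue
echo e > /sys/class/net/eth0/queues/rx-0/rps_cpus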


Ethtool doesn't support all drivers, and some drivers don't implement all of its features.

RPS will still work

But why do you suggest excluding cpu0? Aren't all the cores equal?

CPU0 is the default for everything, including SSD and video interrupts, which negates the benefit of SMP. Most would assume the other processors should only be used as a failover when cpu0 saturates, but that just wastes otherwise idle resources.

Intel suggests:
For IP forwarding, a transmit/receive queue pair should use the same processor core to reduce any cache synchronization between different cores. This means that for the core performing the actual forwarding (cpu0 by default in this case), leave its rx/tx queue pair as is, and feel free to move the other queues around for lan traffic.
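As a rough sketch of keeping a tx/rx queue pair on one core (the IRQ number and core choice are illustrative, not from Intel's documentation):

# keep rx queue 0's interrupt on core 1 (mask 0x2)...
echo 2 > /proc/irq/47/smp_affinity
# ...and use XPS so CPUs in the same mask transmit via tx queue 0
echo 2 > /sys/class/net/eth0/queues/tx-0/xps_cpus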

Lots of replies to a very old thread I created. I've since found a better solution: a script from Intel that sets the irq affinities.

I invoke it thus on my machine (eth0 and eth1 are my WAN and WAN2 interfaces, since I have two internet links; eth3 is my LAN interface). It took a bit of experimentation to get the correct scheme - I recall that attempting to specify 1-6 on the eth3 LAN interface didn't give the right result, although I can't remember why.

I got much better results out of this script than attempting to manually assign interrupt affinities.

/bin/bash -c "/usr/bin/set_irq_affinity 1-6 eth0"
/bin/bash -c "/usr/bin/set_irq_affinity 1-6 eth1"
/bin/bash -c "/usr/bin/set_irq_affinity eth3"

I deliberately don't balance interrupts on core 0 and core 7 as I have certain CPU hungry processes pinned to those cores. On my machine at least, this balances interrupts very nicely such that softirqs are pretty much evenly spread across all cores when maxing out SQM on a 1Gbps stream.
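If you want to sanity-check the spread yourself, the per-CPU NET_RX counters in /proc/softirqs show it directly; a simple loop while the link is loaded is enough:

# crude check that NET_RX softirqs are spreading across cores (Ctrl-C to stop)
while true; do grep NET_RX /proc/softirqs; sleep 1; done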

#!/bin/bash
#
# Copyright (c) 2015, Intel Corporation
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
#     * Redistributions of source code must retain the above copyright notice,
#       this list of conditions and the following disclaimer.
#     * Redistributions in binary form must reproduce the above copyright
#       notice, this list of conditions and the following disclaimer in the
#       documentation and/or other materials provided with the distribution.
#     * Neither the name of Intel Corporation nor the names of its contributors
#       may be used to endorse or promote products derived from this software
#       without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Affinitize interrupts to cores
#
# typical usage is (as root):
# set_irq_affinity -x local eth1 <eth2> <eth3>
#
# to get help:
# set_irq_affinity

usage()
{
	echo
	echo "Usage: $0 [-x] {all|local|remote|one|custom} [ethX] <[ethY]>"
	echo "	options: -x		Configure XPS as well as smp_affinity"
	echo "	options: {remote|one} can be followed by a specific node number"
	echo "	Ex: $0 local eth0"
	echo "	Ex: $0 remote 1 eth0"
	echo "	Ex: $0 custom eth0 eth1"
	echo "	Ex: $0 0-7,16-23 eth0"
	echo
	exit 1
}

if [ "$1" == "-x" ]; then
	XPS_ENA=1
	shift
fi

num='^[0-9]+$'
# Vars
AFF=$1
shift

case "$AFF" in
    remote)	[[ $1 =~ $num ]] && rnode=$1 && shift ;;
    one)	[[ $1 =~ $num ]] && cnt=$1 && shift ;;
    all)	;;
    local)	;;
    custom)	;;
    [0-9]*)	;;
    -h|--help)	usage ;;
    "")		usage ;;
    *)		IFACES=$AFF && AFF=all ;;	# Backwards compat mode
esac

# append the interfaces listed to the string with spaces
while [ "$#" -ne "0" ] ; do
	IFACES+=" $1"
	shift
done

# for now the user must specify interfaces
if [ -z "$IFACES" ]; then
	usage
	exit 1
fi

# support functions

set_affinity()
{
	VEC=$core
	if [ ${VEC} -ge 32 ]
	then
		MASK_FILL=""
		MASK_ZERO="00000000"
		let "IDX = $VEC / 32"
		for ((i=1; i<=$IDX;i++))
		do
			MASK_FILL="${MASK_FILL},${MASK_ZERO}"
		done

		let "VEC -= 32 * $IDX"
		MASK_TMP=$((1<<$VEC))
		MASK=$(printf "%X%s" $MASK_TMP $MASK_FILL)
	else
		MASK_TMP=$((1<<$VEC))
		MASK=$(printf "%X" $MASK_TMP)
	fi

	printf "%s" $MASK > /proc/irq/$IRQ/smp_affinity
	printf "%s %d %s -> /proc/irq/$IRQ/smp_affinity\n" $IFACE $core $MASK
	if ! [ -z "$XPS_ENA" ]; then
		printf "%s %d %s -> /sys/class/net/%s/queues/tx-%d/xps_cpus\n" $IFACE $core $MASK $IFACE $((n-1))
		printf "%s" $MASK > /sys/class/net/$IFACE/queues/tx-$((n-1))/xps_cpus
	fi
}

# Allow usage of , or -
#
parse_range () {
        RANGE=${@//,/ }
        RANGE=${RANGE//-/..}
        LIST=""
        for r in $RANGE; do
		# eval lets us use vars in {#..#} range
                [[ $r =~ '..' ]] && r="$(eval echo {$r})"
		LIST+=" $r"
        done
	echo $LIST
}

# Affinitize interrupts
#
setaff()
{
	CORES=$(parse_range $CORES)
	ncores=$(echo $CORES | wc -w)
	n=1

	# this script only supports interrupt vectors in pairs,
	# modification would be required to support a single Tx or Rx queue
	# per interrupt vector

	queues="${IFACE}-.*TxRx"

	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
	                         do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
	                         done)
	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"

	echo "IFACE CORE MASK -> FILE"
	echo "======================="
	for IRQ in $irqs; do
		[ "$n" -gt "$ncores" ] && n=1
		j=1
		# much faster than calling cut for each
		for i in $CORES; do
			[ $((j++)) -ge $n ] && break
		done
		core=$i
		set_affinity
		((n++))
	done
}

# now the actual useful bits of code

# these next 2 lines would allow script to auto-determine interfaces
#[ -z "$IFACES" ] && IFACES=$(ls /sys/class/net)
#[ -z "$IFACES" ] && echo "Error: No interfaces up" && exit 1

# echo IFACES is $IFACES

CORES=$(</sys/devices/system/cpu/online)
[ "$CORES" ] || CORES=$(grep ^proc /proc/cpuinfo | cut -f2 -d:)


for IFACE in $IFACES; do
	# echo $IFACE being modified

	dev_dir=/sys/class/net/$IFACE/device
	[ -e $dev_dir/numa_node ] && node=$(<$dev_dir/numa_node)
	[ "$node" ] && [ "$node" -gt 0 ] || node=0

	case "$AFF" in
	one)
		[ -n "$cnt" ] || cnt=0
		CORES=$cnt
	;;
	all)
		CORES=$CORES
	;;
	custom)
		echo -n "Input cores for $IFACE (ex. 0-7,15-23): "
		read CORES
	;;
	[0-9]*)
		CORES=$AFF
	;;
	*)
		usage
		exit 1
	;;
	esac

	# call the worker function
	setaff
done

# check for irqbalance running
IRQBALANCE_ON=`ps ax | grep -v grep | grep -q irqbalance; echo $?`
if [ "$IRQBALANCE_ON" == "0" ] ; then
	echo " WARNING: irqbalance is running and will"
	echo "          likely override this script's affinitization."
	echo "          Please stop the irqbalance service and/or execute"
	echo "          'killall irqbalance'"
fi

Running a speedtest: [screenshot]


You can lower the usage even more by using the piece_of_cake keywords "ingress docsis" & "docsis". The OOTB OpenWrt v19.07.10 is faster for my platform, but all cake versions with OpenWrt stock builds are CPU hogs. After numerous tests, I have proof that custom build options do significantly improve performance. Don't sell your device short if you know it has the specs.

200Mbps/20Mbps Waveform test, custom v19.07.10, default (no IRQ pinning), voluntary-preempt + tickless kernel, -O2 -pipe -march=silvermont, kernel PC platform selection "intel-atom"; htop readings +/- 5%
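For anyone wanting to try those keywords, a sketch of where they go with stock sqm-scripts (this assumes the first queue section in /etc/config/sqm; adjust the section reference to match your setup):

uci set sqm.@queue[0].qdisc_advanced='1'
uci set sqm.@queue[0].qdisc_really_really_advanced='1'
uci set sqm.@queue[0].iqdisc_opts='docsis ingress'   # download/ingress side
uci set sqm.@queue[0].eqdisc_opts='docsis'           # upload/egress side
uci commit sqm
/etc/init.d/sqm restart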

I use layer_cake since I have a bunch of custom DSCP tagging going on and already use the ingress and docsis keywords.

To be honest, it's not worth the effort to try and further optimize things, since there's so much cpu headroom anyway...I could add a second gigabit line and still have headroom to shape it in addition to the current gigabit link and backup link.

My build is a customized version of master (using glibc) somewhat later than 19.07 (git version r16346+14-87046e87e2)

I reverted back to v19 because the latest x86 builds have something controlling the smp_affinity assignments. I've posted about it hoping to find information on how to disable it, but got no response.

I went and read a couple of your posts on the subject. I think as more people start moving towards x86_64 platforms, these issues are going to become more frequent and dissatisfaction at network performance is going to come to the fore, since the default OS and hardware configuration is largely optimized for use-cases where the connection terminates on the host and not for a router-use case.

This thread was originally posted 3 years ago when I had the previous iteration of my hardware platform, which used igb drivers and Intel i354 controllers. When upgrading, I chose a more capable platform with network hardware (Intel X553) that uses ixgbe drivers. The poor-performance problem is then magnified by literally an order of magnitude due to misconfiguration.

There was quite a steep learning curve involved in optimizing for the router use-case, as documentation about how everything functions is hard to find and somewhat arcane.

So in the interests of someone landing on this thread seeking info, here's a more detailed description of what I needed to do to optimize ixgbe on Openwrt.

As an aside, I think that when choosing an x86_64 platform, people need to consider the network interface hardware and driver functionality as carefully as they do the CPU and core count. igb interfaces have only two RX/TX RSS queues, which means that the interrupts are only going to be processed on two physical cores, no matter how high the core count of the CPU. For this reason, desktop x86_64 platforms are a bit of a mismatch to a router use-case, since a lot of them use igb drivers or less capable network hardware.

With ixgbe capable X553 hardware, there are 63 channels.

root@openwrt:~# ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Other:          1
Combined:       63

However, this in itself poses a challenge, since left at default, the experience out of the box is not satisfactory, with heavy spikes in individual core usage.

ixgbe / X553 has Intel Flow Director enabled out of the box, which is really only suitable for connections terminating on the host - it's designed to match flows to cores where the packet consuming process is running, mostly irrelevant in a router (unless you have some user-space daemon doing packet captures, and even then, it's a blunt instrument). Intel Flow Director overrides RSS and will cause any manual tuning done to be ignored.

So, firstly, turn it off:

root@openwrt:~# ethtool --features eth0 ntuple off
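To confirm the change took effect, check the feature listing; it should now report "ntuple-filters: off":

root@openwrt:~# ethtool --show-features eth0 | grep ntuple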

Then, RX/TX RSS channels need to be set to the number of physical cores in the CPU:

root@openwrt:~# ethtool -L eth0 combined 8
root@openwrt:~# ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Other:          1
Combined:       63
Current hardware settings:
RX:             0
TX:             0
Other:          1
Combined:       8

root@openwrt:~# ethtool -x eth0
RX flow hash indirection table for eth0 with 8 RX ring(s):
    0:      0     1     2     3     4     5     6     7
    8:      0     1     2     3     4     5     6     7
   16:      0     1     2     3     4     5     6     7
   24:      0     1     2     3     4     5     6     7
   32:      0     1     2     3     4     5     6     7
   40:      0     1     2     3     4     5     6     7
   48:      0     1     2     3     4     5     6     7
   56:      0     1     2     3     4     5     6     7
   64:      0     1     2     3     4     5     6     7
   72:      0     1     2     3     4     5     6     7
   80:      0     1     2     3     4     5     6     7
   88:      0     1     2     3     4     5     6     7
   96:      0     1     2     3     4     5     6     7
  104:      0     1     2     3     4     5     6     7
  112:      0     1     2     3     4     5     6     7
  120:      0     1     2     3     4     5     6     7
  128:      0     1     2     3     4     5     6     7
  136:      0     1     2     3     4     5     6     7
  144:      0     1     2     3     4     5     6     7
  152:      0     1     2     3     4     5     6     7
  160:      0     1     2     3     4     5     6     7
  168:      0     1     2     3     4     5     6     7
  176:      0     1     2     3     4     5     6     7
  184:      0     1     2     3     4     5     6     7
  192:      0     1     2     3     4     5     6     7
  200:      0     1     2     3     4     5     6     7
  208:      0     1     2     3     4     5     6     7
  216:      0     1     2     3     4     5     6     7
  224:      0     1     2     3     4     5     6     7
  232:      0     1     2     3     4     5     6     7
  240:      0     1     2     3     4     5     6     7
  248:      0     1     2     3     4     5     6     7
  256:      0     1     2     3     4     5     6     7
  264:      0     1     2     3     4     5     6     7
  272:      0     1     2     3     4     5     6     7
  280:      0     1     2     3     4     5     6     7
  288:      0     1     2     3     4     5     6     7
  296:      0     1     2     3     4     5     6     7
  304:      0     1     2     3     4     5     6     7
  312:      0     1     2     3     4     5     6     7
  320:      0     1     2     3     4     5     6     7
  328:      0     1     2     3     4     5     6     7
  336:      0     1     2     3     4     5     6     7
  344:      0     1     2     3     4     5     6     7
  352:      0     1     2     3     4     5     6     7
  360:      0     1     2     3     4     5     6     7
  368:      0     1     2     3     4     5     6     7
  376:      0     1     2     3     4     5     6     7
  384:      0     1     2     3     4     5     6     7
  392:      0     1     2     3     4     5     6     7
  400:      0     1     2     3     4     5     6     7
  408:      0     1     2     3     4     5     6     7
  416:      0     1     2     3     4     5     6     7
  424:      0     1     2     3     4     5     6     7
  432:      0     1     2     3     4     5     6     7
  440:      0     1     2     3     4     5     6     7
  448:      0     1     2     3     4     5     6     7
  456:      0     1     2     3     4     5     6     7
  464:      0     1     2     3     4     5     6     7
  472:      0     1     2     3     4     5     6     7
  480:      0     1     2     3     4     5     6     7
  488:      0     1     2     3     4     5     6     7
  496:      0     1     2     3     4     5     6     7
  504:      0     1     2     3     4     5     6     7
RSS hash key:
dd:55:0f:b5:5f:6b:23:dc:e4:58:7c:ce:24:78:79:5a:58:56:39:da:a1:cd:fe:67:76:9b:97:f6:1e:63:23:1f:96:62:05:43:5c:88:5e:c3
RSS hash function:
    toeplitz: on
    xor: off

Only once that's done can interrupt affinities be tuned using the set_irq_affinity script I posted earlier in the thread, resulting in the nice even distribution of interrupts across cores seen in the htop screenshot above. This configuration workflow is a must if using SQM, otherwise you may as well opt for a cpu with only one or two cores.

IFACE CORE MASK -> FILE
=======================
eth0 1 2 -> /proc/irq/47/smp_affinity
eth0 2 4 -> /proc/irq/48/smp_affinity
eth0 3 8 -> /proc/irq/49/smp_affinity
eth0 4 10 -> /proc/irq/50/smp_affinity
eth0 5 20 -> /proc/irq/51/smp_affinity
eth0 6 40 -> /proc/irq/52/smp_affinity
eth0 1 2 -> /proc/irq/53/smp_affinity
eth0 2 4 -> /proc/irq/77/smp_affinity
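One practical note for anyone copying this: none of these settings survive a reboot, so one option is to put the sequence somewhere like /etc/rc.local (a sketch, assuming the script lives at /usr/bin/set_irq_affinity as in my earlier post, with eth0 as a representative interface):

# /etc/rc.local - re-apply NIC tuning at boot
ethtool --features eth0 ntuple off
ethtool -L eth0 combined 8
/usr/bin/set_irq_affinity 1-6 eth0
exit 0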

Soft interrupts are so often the big issue. If they come from a software traffic shaper qdisc, all of this distributing over CPUs is pretty problematic, as there is a single lock required per qdisc IIRC, so the best you can do is move the ingress shaper to CPU A and the egress shaper to CPU B... You cannot, for example, distribute cake for a full interface over multiple CPUs. At least for gigabit ethernet, the shaper overhead will dwarf the hardware interrupt overhead; as a result, on a dual-core CPU one is better off putting the ingress shaper and WAN interrupt processing on one CPU and the egress shaper and LAN interrupt processing on the other, than using one CPU for both interrupt sources and the other for both shapers...

Tl;dr: I fully agree one needs to match hardware capability and load carefully.


Are you saying that if SQM/CAKE is used, then RPS and distributing soft IRQs across multiple CPUs is useless? I am asking because I am using CAKE and network.globals.packet_steering is enabled, which distributes software IRQs across all CPUs, yet I am seeing all four or six cores nicely and evenly loaded during high load times.

A related question: how do I put a shaper on its own CPU?

I guess I was phrasing my point inartfully. The load of traffic shaping per direction is monolithic, that is, you cannot multithread a single cake instance over multiple CPUs. But it is possible to move the ingress and the egress shaping onto different CPUs, and that helps. I think RPS can actually help with that; beyond that, RPS can help move other processing to other CPUs.
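To make that concrete, a rough sketch of separating the two directions with RPS (interface names and masks are placeholders; this assumes the ingress shaper runs on whichever CPU processes the WAN receive path):

# steer WAN (eth0) receive processing to CPU1 (mask 0x2), so cake's ingress/ifb work lands there
for q in /sys/class/net/eth0/queues/rx-*/rps_cpus; do echo 2 > "$q"; done
# steer LAN (eth1) receive processing to CPU2 (mask 0x4), which is where the WAN egress shaper gets fed
for q in /sys/class/net/eth1/queues/rx-*/rps_cpus; do echo 4 > "$q"; done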