Xrx200 IRQ balancing between VPEs

Did you change the lantiq switch driver as well, or only the IRQ patches?
I tried to use all the patches from the mailing list, but I can't get the wifi activated, for example.
EDIT: Never mind. I mixed up IRQ and How can we make the lantiq xrx200 devices faster. The latter one causes some problems (my testing branch is located here).

Great!
I used the files from post #1 plus the script from post #7,
and it looks like it's working:

[    0.076718] smp: Bringing up secondary CPUs ...
[    0.082140] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    0.082154] Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
[    0.082299] !!!!!!!!init secondary smp-mt.c
[    0.082319] CPU1 revision is: 00019556 (MIPS 34Kc)
[    0.113360] Synchronize counters for CPU 1:
[    0.136976] smp finish!
[    0.136980] done.
[    0.145775] === MIPS MT State Dump ===
[    0.149588] -- Global State --
[    0.152715]    MVPControl Passed: deadbeef
[    0.156881]    MVPControl Read: 00000003
[    0.160874]    MVPConf0 : b8008403
[    0.164344] -- per-VPE State --
[    0.167558]   VPE 0
[    0.169727]    VPEControl : 00000001
[    0.173373]    VPEConf0 : 800f0003
[    0.176847]    VPE0.Status : 1100ff00
[    0.180605]    VPE0.EPC : 8000d470 r4k_wait_irqoff+0x1c/0x24
[    0.186308]    VPE0.Cause : 50808000
[    0.189954]    VPE0.Config7 : 80080400
[    0.193771]   VPE 1
[    0.195941]    VPEControl : 00000001
[    0.199588]    VPEConf0 : 802f0003
[    0.203061]    VPE1.Status : 1100ff00
[    0.206810]    VPE1.EPC : 80018ff0 vsmp_smp_finish+0x3c/0x5c
[    0.212522]    VPE1.Cause : 50808000
[    0.216168]    VPE1.Config7 : 80080400
[    0.219985] -- per-TC State --
[    0.223111]   TC 0
[    0.225194]    TCStatus : 18102000
[    0.228666]    TCBind : 00000000
[    0.231973]    TCRestart : 8000d468 r4k_wait_irqoff+0x14/0x24
[    0.237780]    TCHalt : 00000000
[    0.241079]    TCContext : 4a04c89e
[    0.244638]   TC 1 (current TC with VPE EPC above)
[    0.249499]    TCStatus : 00000001
[    0.252971]    TCBind : 00200001
[    0.256278]    TCRestart : 8007daa0 printk+0x10/0x30
[    0.261304]    TCHalt : 00000000
[    0.264603]    TCContext : 284c1256
[    0.268161]   TC 2
[    0.270245]    TCStatus : 00000400
[    0.273717]    TCBind : 00400001
[    0.277020]    TCRestart : fdaeaa4e 0xfdaeaa4e
[    0.281529]    TCHalt : 00000001
[    0.284828]    TCContext : 2a6cc6e3
[    0.288386]   TC 3
[    0.290470]    TCStatus : 00000400
[    0.293942]    TCBind : 00600001
[    0.297244]    TCRestart : 90840b2e 0x90840b2e
[    0.301754]    TCHalt : 00000001
[    0.305053]    TCContext : acfc2301
[    0.308611] ===========================
[    0.312760] smp: Brought up 1 node, 2 CPUs
[    0.321970] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
root@OpenWrt:/# cat /proc/interrupts
           CPU0       CPU1       
  0:       1597       1534      MIPS   0  IPI_resched
  1:        741       7335      MIPS   1  IPI_call
  7:      40889      41039      MIPS   7  timer
  8:          0          0      MIPS   0  IPI call
  9:          0          0      MIPS   1  IPI resched
 62:          0          0       icu  62  1e101000.usb, dwc2_hsotg:usb1
 63:        314          0       icu  63  mei_cpe
 72:       6227          0       icu  72  vrx200_rx
 73:          0       6278       icu  73  vrx200_tx
 75:          0          0       icu  75  vrx200_tx_2
 91:        178          0       icu  91  1e106000.usb, dwc2_hsotg:usb2
112:        891          0       icu 112  asc_tx
113:          0        154       icu 113  asc_rx
114:          0          0       icu 114  asc_err
126:          0          0       icu 126  gptu
127:          0          0       icu 127  gptu
128:          0          0       icu 128  gptu
129:          0          0       icu 129  gptu
130:          0          0       icu 130  gptu
131:          0          0       icu 131  gptu
ERR:          0

I will do more tests when the build is completely ready.

@QAuge

It looks as if I'm abandoning what I'm doing, but I guess not; I do need to take a look into the vr9.dtsi though.
Unfortunately I have not built the wifi driver yet because of build problems.


I've sent a benchmark mail to the openwrt-devel mailing list. If somebody wants a tarball uploaded somewhere, please tell me where.

I have done some small tests, but my VDSL2 speed does not increase; it is between 80 Mb/s and 88 Mb/s and should be 95 Mb/s (data rate).
But I think the VDSL2 modem is not working 100%.
It shows me Annex A, but my provider uses Annex B and I chose an Annex B firmware.
"option ds_snr_offset" has no effect.
Normally I should have 100 Mb/s, and last but not least I have no idea how much my ISP is limiting my line at the moment.

I remember reaching the same speed without the IRQ-balancing patch.

With only one core + telephone support I get only 70 Mb/s.

New combined patches in thread How can we make the lantiq xrx200 devices faster.

Finally sent the patch to openwrt-devel; you can follow the discussion in this mailing list thread.

Hi,

as was already pointed out on the mailing list, your mail client appears to have broken the patch.
Any chance of a resend and/or an upload somewhere?

I'm also interested in the ethernet driver patches, as I only get about 80 Mbit/s throughput on LAN with both of my FRITZ!Box 3370s.
Are these still up to date?

Thanks.


Hi,

All the new patches are here. The last ethernet patch is in the "[OpenWrt-Devel,RFC,v4] lantiq: IRQ balancing, ethernet driver, wave300" thread. I don't have a working snapshot right now, as I'm trying to find the best way to implement the 5+ port variants of the platform.

The broken patch is fine, you just need to correct the lines ;-). I will post a better series when I manage to create the ethernet one.

Well, I'll just stick to the old ones for now; I don't want to rebuild my firmware from scratch. I'll wait for it to get merged so I can maybe use it in the next major update. Hopefully it will be merged by then.

Hi again.

So, I've fixed up the smp patch and applied the latest ethernet driver patch, as suggested.
I've also changed the target optimization flags to:

-O2 -pipe -mno-branch-likely -march=34kc -mmt -mdsp -mtune=34kc
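(For anyone reproducing this: in the OpenWrt buildroot such flags typically go into .config via the target optimization setting; a sketch, assuming a stock tree:)

CONFIG_TARGET_OPTIMIZATION="-O2 -pipe -mno-branch-likely -march=34kc -mmt -mdsp -mtune=34kc"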

With all this, the iperf3 LAN throughput went up by about 10 Mbit/s, so it's roughly around 90 Mbit/s now.
I have no idea if this happened before the patches too, but during the iperf3 tests [ksoftirqd/1] uses up to 50% of the CPU according to top.
So the box just chokes on interrupts.

Also, enabling/disabling software flow offloading does not do anything at all.

Anything else I can try?

If your modem and your host machine both really have 1 Gbit ports, then 90 Mbps is still rather low. But if your modem has MII1 (the WAN port), it could be an incompatibility with the beta version of my ethernet driver, which doesn't support it (I'm currently working on that so my driver can fully replace the old one).

Note that my tests were done with a minimal openwrt configuration (no WIFI enabled, no DSL, only the cable to the host, no NAT, no web configuration). Any of these can impact the speed negatively. Also, does your modem use both VPEs for openwrt, or is one VPE reserved for the voice firmware? There may still be bugs in single-VPE configurations, and a single-VPE system will impact the speed too.

Please try to run iperf on a vanilla kernel so we can exclude vanilla-related speed impacts (there were some vanilla slowdowns around 4.14.100, so if it goes up 10 Mbps it was either fixed or improved by my patches :wink: ). Also run iperf on the different LAN ports of your modem (to see if it really works, and we will know whether your modem uses the WAN port).

BTW, if you run iperf on localhost of your modem (both iperf server and client on the modem), you should observe speeds around 300-400 Mbps. It would be nice to know how fast your openwrt configuration works; localhost is just a memcopy-oriented test, so my patches should not impact it, and it basically benchmarks the raw kernel network stack. The only thing from my patches that could impact the performance there would probably be a high IRQ load, but if there is no other LAN/VDSL/WIFI traffic, it should not matter that much. In my setup the TX speed from the xrx200 never crossed the maximal localhost speed.
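For reference, a minimal way to run that localhost test on the box itself (plain iperf3 usage, nothing specific to these patches):

iperf3 -s -D               # on the modem: start the server as a daemon
iperf3 -c 127.0.0.1 -t 30  # on the modem: client against localhost for 30 s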

The IRQ driver needs to be tested (that's why it is not in the vanilla patch queue yet). There is a define:

#define AUTO_AFFINITY_ROTATION

It basically routes every odd IRQ to the first VPE and every even IRQ to the second VPE, which ensures the IRQs are auto-balanced. For low-frequency interrupts it doesn't impact the speed, and probably even the RX DMA fifo for ethernet is fine. But the TX DMA fifo is affected by it (although only by about 33% in my setup, and the TX speed is much higher than 90 Mbps). You can disable this function by commenting out the macro and recompiling the kernel (but then all interrupts will be on VPE0, which will again impact the TX speed) ... or you can disable the rotation in the running kernel by limiting the VPEs on which it can rotate the affinity:

echo 1 > /proc/irq/73/smp_affinity #TX FIFO for VPE0
echo 2 > /proc/irq/75/smp_affinity #TX FIFO for VPE1

You can set this even with AUTO_AFFINITY_ROTATION commented out, and you can set it for every interrupt from the ICU. The possible interrupts can be seen in:

cat /proc/interrupts
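For clarity, smp_affinity takes a CPU bitmask, so on this two-VPE system 1 means VPE0 only, 2 means VPE1 only, and 3 means both. A sketch of pinning everything to VPE0 and then splitting the two TX FIFOs (IRQ numbers taken from the /proc/interrupts listing above; adjust them for your board):

for irq in /proc/irq/[0-9]*; do
    echo 1 > "$irq/smp_affinity" 2>/dev/null  # default everything to VPE0
done
echo 1 > /proc/irq/73/smp_affinity            # vrx200_tx   -> VPE0
echo 2 > /proc/irq/75/smp_affinity            # vrx200_tx_2 -> VPE1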

Also, enabling/disabling software flow offloading does not do anything at all.
What command did you use? My ethernet driver only adds support for fragmented packets (parts of the packet may be in different locations; in the vanilla driver the packet must first be copied into a linear buffer, while my driver just skips this step).

Anyway, thanks for the test. If your affinity rotation tests don't cause any significant slowdowns, I will format and send new patches for vanilla/openwrt.

The ethernet driver will take longer, as it seems the support for the MII1/WAN port breaks the structure of my driver :frowning: .

Well, I've found my problem:
The ethernet cable used for testing was apparently broken :unamused:
I plugged in another CAT 6 cable and now I get 110-130 Mbit/s :slight_smile:

Right, I'm also doing my test without WiFi, DSL and web configuration, but I do use NAT.

With the patch both VPEs are used; without it, only the first.

I did, same kernel 4.14.123, no patches, and I got about 80 Mbit/s with software flow offloading.
But that was with the broken cable, so I'd have to rebuild and reflash to retest.

Ports 2-4 do about 110 Mbit/s, while port 1 does about 130 Mbit/s.
I've configured port 1 to be WAN and ports 2-4 to be LAN, so that may have something to do with it.

Both my 3370 Boxes do around 490 Mbit/s on localhost iperf3.
If they did half of that on actual LAN I'd be a happy camper :wink:

Well, I tried different combinations of pinning IRQs 72, 73 and 75, and the difference was negligible.
Sometimes it's up by about 5 Mbit/s, sometimes down by 5 Mbit/s.
In any case, setting the same affinity after a reboot changes the results, so I don't think there is some perfect setting here.

I commented out flow_offloading in /etc/config/firewall, rebooted and ran iperf3.
Then I uncommented it again, rebooted and ran iperf3. Is there a better way to do it?

Which iperf -c command options are you using? I use this:

iperf3 -c 10.0.0.80 -t 1000 -R -u -b 0

The client address and duration are irrelevant for comparing our tests; -R is for the TX FIFO, -u is for UDP (so only one direction is tested) and -b 0 is unlimited bandwidth (values like -b 1G or -b 100M are possible, but the rate control in iperf3 has pretty high overhead -> measured speeds are lower). Try testing with my command (though even with TCP I've got TX over 200 Mbit/s).
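For completeness, a sketch of the two-host setup implied by that command (10.0.0.80 would be the box's address in the example above; with -R the server side, i.e. the box, does the transmitting):

iperf3 -s                               # on the xrx200 box (10.0.0.80)
iperf3 -c 10.0.0.80 -t 1000 -R -u -b 0  # on the LAN host: measures box TX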

Try to disable the NAT and have all 4 ports in one group.

Well, i tried different combinations of pinning irq 72, 73 and 75 and the difference was negligible.
It is possible your speed is too low for it to have any impact.

I commented out flow_offloading in /etc/config/firewall

I did the same and didn't get any repeatable results. One time I got a random speedup of 20 Mbit/s by playing with the uci configuration, but I wasn't able to find which commands I had used (it's possible the firewall got disabled). You can change the configuration with something like this: uci set firewall.@defaults[0].flow_offloading='1'
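A sketch of the full toggle, so the change actually takes effect (standard uci/firewall usage, nothing patch-specific):

uci set firewall.@defaults[0].flow_offloading='1'  # or '0' to disable
uci commit firewall                                # persist the change
/etc/init.d/firewall restart                       # reload the rules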

The localhost speed is the same on my modem.

My switch configuration is:

...
config switch
        option name 'eth0'
        option reset '1'
        option enable_vlan '1'

config switch_vlan
        option device 'eth0'
        option vlan '1'
        option ports '5 0 2 4 6'

I have a TD-W8980 on patched 4.14.121 and did some research across reboots. I ran simple iperf3 TCP tests with firewall and wifi enabled and got the following results on the vrx200 side:
Sending varies between two levels: 223±5 Mbit/s or, less often, 260±5 Mbit/s.
Receiving at the same time is either 152±3 Mbit/s with nearly zero loss, or 120±5 Mbit/s with 40-70 packets lost, respectively; on some boots receiving cannot even reach 60 Mbit/s while sending does 220 Mbit/s at the same time.
I observed that better receiving results are achieved when vrx200_rx gets assigned to VPE0. On the other hand, better sending results are achieved when vrx200_rx gets assigned to VPE1. At the same time, both of these interrupts become lazy if assigned to VPE1. Here are two different results for the same 10-second iperf3 test as an example:

iperf3 -c 192.168.1.10 -R

            vrx200_tx (VPE1)   vrx200_rx (VPE0)
bytes            1842991          189831057
pkts               24833             124911
irqs                1998              20968
bytes/IRQ         922.42            9053.37
pkts/IRQ           12.43               5.96

After a reboot it becomes:

            vrx200_tx (VPE0)   vrx200_rx (VPE1)
bytes            2057756          159916482
pkts               27472             104839
irqs               27462                139
bytes/IRQ          74.93         1150478.29
pkts/IRQ            1.00             754.24
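In case anyone wants to reproduce these figures: they can be derived from standard counters, e.g. (interface and IRQ names as in this thread):

# Snapshot byte/packet counters and the vrx200 IRQ counts before and
# after a test run, then compute the deltas and bytes/IRQ by hand.
grep eth0 /proc/net/dev
grep -E 'vrx200_(rx|tx)' /proc/interrupts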

Try longer time periods. About every 20s the scheduler will move the iperf3 process to the other VPE, so the interrupts will be less effective.

Mechanism: in its interrupt handler, RX will clear the HW ring, allocate a new buffer and send the packet to the network stack. For RX it is better to have the handler on the same VPE as the receiving process. TX will just clear the HW ring and deallocate its packet buffer (which is fast), and meanwhile there can be another concurrent packet transmit; it is better to have the TX interrupt handler on the opposite core from the process.

The main idea of my research: IRQs are triggered differently depending on which VPE core they are assigned to, and this phenomenon affects the transfer speed in both directions.

I did not see any differences in these tests or in the IRQ triggering behavior over longer time periods; only a board reboot and IRQ reassignment change it completely.

Is this behavior expected, or is it some kind of software or hardware bug that should be fixed?

One version of the driver has enabled assigning a packet to the IRQ, so it will try to stay at the top speed; you would (probably) have to put both TX interrupts on the same core. BTW, try to measure UDP when you want exact measurements of the IRQ speeds (TCP has TX+RX because of the replies).

So, I've finally got around to flashing the vanilla and then the patched kernel to do some more tests.
This time I did not change the config apart from flow offloading.
All ports are in one group by default.

I've tested all four ports with these commands:

TCP: iperf3 -c 192.168.1.1 -R
UDP: iperf3 -c 192.168.1.1 -R -u -b 0

vanilla, no offloading:

        port1   port2   port3   port4
TCP      44.0    43.3    43.3    43.5   Mbit/s
UDP      67.6    60.7    60.9    60.7   Mbit/s

For some reason port 1 is about 7 Mbit/s faster than ports 2-4.
This is reproducible even after multiple reboots.


vanilla with offloading:

        port1   port2   port3   port4
TCP      43.5    43.3    43.1    43.1   Mbit/s
UDP      60.7    60.3    60.2    60.5   Mbit/s

Apart from "fixing" the 7 Mbit/s oddity on port 1, flow offloading does absolutely nothing.
Turning it off restores said odd behavior.


patched, no offloading:

        port1   port2   port3   port4
TCP       185     185     188     187   Mbit/s
UDP      62.0    61.5    61.9    61.6   Mbit/s

TCP is much improved, but UDP seems to be capped.
What is up with that?


patched with offloading:

        port1   port2   port3   port4
TCP       184     186     186     188   Mbit/s
UDP      61.4    61.5    60.9    61.5   Mbit/s

Again, flow offloading does absolutely nothing.
Shouldn't flow offloading net at least some improvement, or is the CPU on this thing just too slow?

Your test setup is not clear. Did you test routing/NAT speed with and without flow offloading, or was your iperf3 server running on the xrx200 host? IMHO flow offloading is not targeted at the second use case.
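To illustrate the distinction (a sketch with made-up addresses): flow offloading only applies to traffic forwarded through the box, not to traffic terminated on it:

# Offloadable case: the flow is routed/NATed by the xrx200 box.
#   host A (LAN, 192.168.1.100) <-> xrx200 router <-> host B (WAN side, 10.0.0.2)
iperf3 -s               # on host B
iperf3 -c 10.0.0.2      # on host A

# Non-offloadable case: the box itself is an endpoint.
iperf3 -s               # on the xrx200 router
iperf3 -c 192.168.1.1   # on host A; the flow ends in the router's own stack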