High softirq on ISP with VLAN/PPPoE

Hi,

I'm in the process of switching ISPs, and am currently using MultiWAN (mwan3) to switch between the old (Cable) and the new (Fiber) one. I'm having issues with the downstream rate of the new fiber connection, which appears to correspond with high softirq (as seen in top/htop).

The new fiber connection is rated at 400 mbit (symmetric), but using iperf3 I only get ~140 mbit from any device behind the router and ~280 mbit from the router itself. When disabling the new connection, I can easily max out my old cable connection (350 mbit down) from any device behind the router.

I started experiencing this on an older router (TP-Link Archer C7 v5) and figured it might be time to upgrade. The exact same issue, with similar speeds, is now happening on my MikroTik hAP ac2. Neither of these devices should have an issue at these line rates, as they work fine with the old cable connection.

So what's different between the two connections:

  • Fiber uses PPPoE on an explicit VLAN; I think this means the CPU has to tag the packets. (Does this mean software offloading doesn't work anymore?)
  • Fiber is connected to the WAN port, cable is connected to one of the LAN ports (I tested the reverse too, to no avail)

Since iperf3 from the router itself is about twice as fast as from any device behind it, that may be a clue: when testing from the router, it doesn't have to forward (route/NAT) the packets out again.

I'd appreciate any questions or insights into what I might be running into. I've been struggling with this for a few days now, with little progress.

PPPoE means that the router needs to terminate the PPPoE tunnel, which requires more processing for every packet than the cable connection's typical plain-Ethernet-plus-DHCP setup (the PPPoE overhead is likely considerably higher than that of VLAN tagging).
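The real cost is the per-packet tunnel processing, but the framing overhead itself is easy to quantify; a quick back-of-the-envelope sketch (standard PPPoE/VLAN header sizes, not measured on this link):

```shell
#!/bin/sh
# PPPoE adds 8 bytes per frame (6-byte PPPoE header + 2-byte PPP
# protocol field), so the usual IP MTU drops from 1500 to 1492.
# A VLAN tag adds only 4 bytes on the wire and does not reduce the
# IP MTU on a standard Ethernet link.
eth_mtu=1500
pppoe_mtu=$((eth_mtu - 6 - 2))
echo "PPPoE MTU: $pppoe_mtu"
```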
What happens when you disable mwan3 and use the fiber link exclusively?

Could you configure your htop to show individual bars per CPU, check the box for "Detailed CPU time (System/IO-Wait/Hard-IRQ/Soft-IRQ/Steal/Guest)" in F2-Setup -> Display Options, and then get a screenshot while running a speedtest? The question here is: are all CPUs maxed out, or might it help to shift some processes around, either by enabling receive packet steering, by using irqbalance, or by assigning interrupts to CPUs manually?


Thank you!

Does PPPoE affect the downstream more than the upstream?

For testing, I've disabled the Cable interface - not sure how I can disable mwan3 more effectively (other than uninstalling it). It doesn't make a difference.

Only a single CPU is maxed out (testing from a device behind the router).

I have enabled Packet Steering under Interfaces -> Global Network Options. This didn't seem to have any effect either. If memory serves, irqbalance doesn't spread a single load across multiple CPUs, but rather just moves (soft)IRQ assignments from one CPU to another. Do you believe that would still help?

As far as I know it should not differ.

Okay, that means there is a chance that you might be able to spread the load across the other CPUs, unless it is a single single-threaded task that causes this load.

I think it could help, and it is easy to test. Maybe start with posting the output of cat /proc/interrupts to see what all landed on CPU2?

There is a bug in OpenWrt 22.03.x that means that Software Flow Offloading does not work for PPPoE. Therefore here is my suggestion:

  • If you don't use IPv6, then downgrade to 21.02.x, delete WAN6, delete the ULA prefix, and enable software flow offloading.
  • If you use IPv6, then sorry - the downgrade will increase the speed, but will bring bugs such as early termination of idle IPv6 TCP connections. Get something faster, which can deal with the high-speed connection without the flow offloading. E.g., Netgear R7800 (or XR500) or Linksys e8450.
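For reference, on releases where it works, software flow offloading can also be enabled from the CLI; roughly like this (a sketch, run on the router itself; option name as used in /etc/config/firewall, equivalent to the LuCI checkbox):

```shell
# Enable software flow offloading in the firewall defaults section
# (same setting as Network -> Firewall -> "Software flow offloading" in LuCI)
uci set firewall.@defaults[0].flow_offloading='1'
uci commit firewall
/etc/init.d/firewall restart
```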

Here's both /proc/interrupts and /proc/softirqs. I'm surprised nothing really stands out on CPU2.

# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
 26:    3610566    1180079    1208880    1644014     GIC-0  20 Level     arch_timer
 30:      72615          0          0          0     GIC-0 270 Level     bam_dma
 31:     731743          0          0          0     GIC-0 127 Level     78b5000.spi
 32:          0          0          0          0     GIC-0 239 Level     bam_dma
 33:          2          0          0          0     GIC-0 139 Level     msm_serial0
 50:    4152895          0          0          0     GIC-0 200 Level     ath10k_ahb
 67:         23          0          0          0     GIC-0 201 Level     ath10k_ahb
 68:    2465756          0          0          0     GIC-0  97 Edge      c080000.ethernet:txq0
 72:    2371632          0          0          0     GIC-0 101 Edge      c080000.ethernet:txq4
 76:    2631397          0          0          0     GIC-0 105 Edge      c080000.ethernet:txq8
 80:    1500903          0          0          0     GIC-0 109 Edge      c080000.ethernet:txq12
 84:    3206279          0          0          0     GIC-0 272 Edge      c080000.ethernet:rxq0
 86:     932989          0          0          0     GIC-0 274 Edge      c080000.ethernet:rxq2
 88:     826731          0          0          0     GIC-0 276 Edge      c080000.ethernet:rxq4
 90:     897340          0          0          0     GIC-0 278 Edge      c080000.ethernet:rxq6
100:          0          0          0          0   msmgpio  63 Edge      keys
101:          0          0          0          0   msmgpio   5 Edge      keys
102:    5573503          0          0          0     GIC-0 164 Level     xhci-hcd:usb1
IPI0:          0          0          0          0  CPU wakeup interrupts
IPI1:          0          0          0          0  Timer broadcast interrupts
IPI2:      58844      77706      79898      68131  Rescheduling interrupts
IPI3:    1543810    6316816    3991097    5102983  Function call interrupts
IPI4:          0          0          0          0  CPU stop interrupts
IPI5:     540086     819151     716852     544159  IRQ work interrupts
IPI6:          0          0          0          0  completion interrupts
Err:          0
# cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3
          HI:          2          0          0          0
       TIMER:     760603     286488     226174     365211
      NET_TX:      18751       1068        687       1059
      NET_RX:    8677859    1832982    2935474    1368808
       BLOCK:          0          0          0          0
    IRQ_POLL:          0          0          0          0
     TASKLET:    5660669        828       1493        791
       SCHED:    1635784     924584     836389     939993
     HRTIMER:         47         60        116         58
         RCU:     976484     644343     641435     693296

Thank you! That sounds a lot like what I'm struggling with.

I wish I would've known this sooner. After upgrading my router, downgrading to 21.02.x is no longer an option (the hardware isn't supported there). In fact, I'm running off snapshot now due to the need for DSA support for the MikroTik hAP ac2.

I'd still like to know if I can spread the load across multiple CPUs to alleviate the problem for now, while eagerly awaiting a fix for the bug you mentioned.

You can still use the old router (TP-Link Archer C7 v5), it is fast enough when paired with the previous OpenWrt release.

There are two relevant options:

  • Activating packet steering in Network > Interfaces > Global network options
  • Installing irqbalance

I am not sure which of these options will work best for you.
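In case it helps anyone trying the second option: on OpenWrt the irqbalance package ships disabled by default, so just installing it is not enough. A sketch of the usual steps (run on the router; uci section name as shipped in the package's /etc/config/irqbalance):

```shell
# Install irqbalance and enable it (the package ships with
# "option enabled 0" in /etc/config/irqbalance by default)
opkg update
opkg install irqbalance
uci set irqbalance.irqbalance.enabled='1'
uci commit irqbalance
/etc/init.d/irqbalance enable
/etc/init.d/irqbalance restart
```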

Unfortunately, neither of them seem to make any difference. After a reboot the softirq load is now on CPU3, but it doesn't move around or get spread between multiple CPUs.

Ugh... After this much messing around (and buying a new router) - I'm not a fan of going down that path.

If I recall correctly, the MT7621 target has PPPoE offload (in hardware) enabled in OpenWrt. You might want to look at a cheap (second-hand) device; even new, some of those are only around 25-ish euro/usd (something like the Xiaomi 4A Gigabit or a Youhua WR1200JS). Keep your C7 as an additional access point / managed switch.

Alternatively: try using packet steering as you said BUT: actually redirect to a specific CPU.

ssh into your router:
cd /sys/class/net/wan/queues/rx-0
echo 2 > rps_cpus

For some reason on my targets I can't do the same for the tx-0 queues; I don't have an IPQ40xx to try. I am also not sure if you have multi queues (rx-0 rx-1 etc.). Try changing them as well.

If you are not using DSA, change "wan" to eth0.1 or whatever it is on your build.
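For anyone following along: rps_cpus takes a hexadecimal CPU *bitmask*, not a CPU number, so `echo 2` selects CPU1 (bit 1) and `echo 4` would select CPU2. A small sketch of the mask arithmetic (the "wan" interface name is assumed, as above):

```shell
#!/bin/sh
# rps_cpus is a hex bitmask: bit N set = steer RX processing to CPU N.
cpu=2                               # target CPU index
mask=$(printf '%x' $((1 << cpu)))   # 1 << 2 = 4 -> "4"
echo "$mask"
# On the router, this would then be applied as (path as in the post above):
#   echo "$mask" > /sys/class/net/wan/queues/rx-0/rps_cpus
```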

Appreciate your suggestion, but I can live with the current speeds, and I'd rather keep poking around for a real solution on the hardware I invested (mostly effort and bit of money) in.

Interesting, doing these exact commands does give me a little bit of a speed increase (~180 mbit) and the softirq load correctly moves to this CPU. Unfortunately, I don't have multiple queues - which may be why packet steering isn't doing anything for me.
