I'm in the process of switching ISPs, and am currently using MultiWAN (mwan3) to switch between the old (Cable) and the new (Fiber) one. I'm having issues with the downstream rate of the new fiber connection, which appears to correspond with high softirq (as seen in top/htop).
The new fiber connection is rated at 400 mbit (symmetric), but using iperf3 I only get ~140 mbit from any device behind the router and ~280 mbit from the router itself. When I disable the new connection, I can easily max out my old cable connection (350 mbit down) from any device behind the router.
I started experiencing this on an older router (TP-Link Archer C7 v5) and figured it might be time to upgrade. The exact same issue, with similar speeds, is now happening on my MikroTik hAP ac2. Neither of these devices should have an issue at these line rates, as they work fine with the old cable connection.
So what's different between the two connections:
Fiber uses PPPoE with an explicit VLAN. I think this means the CPU has to tag the packets. (Does this mean software offloading doesn't work anymore?)
Fiber is connected to the WAN port, cable is connected to one of the LAN ports (I tested the reverse too, to no avail)
Since iperf from the router is about twice as fast as from any device behind it, I'm thinking this may be a clue as the router doesn't need to output the packets.
I'd appreciate any questions or insights into what I might be running into. I've been struggling with this for a few days now, with little progress.
PPPoE means that the router needs to terminate the PPPoE tunnel, which takes more processing per packet than cable's typical DHCP setup. (The PPPoE overhead is likely to be considerably higher than that of VLAN tagging.)
What happens when you disable mwan3 and use the fiber link exclusively?
Could you configure htop to show individual bars per CPU, and in F2-Setup -> Display Options check the box for "Detailed CPU time (System/IO-Wait/Hard-IRQ/Soft-IRQ/Steal/Guest)", then take a screenshot while running a speed test? The question is whether all CPUs are maxed out, or whether it might help to shift some processing around, either by enabling receive packet steering, by using irqbalance, or by manually assigning interrupts to CPUs.
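If htop is not handy, a rough text-only view of the same question can be had from /proc/softirqs. The sketch below is only an illustration: the function name net_rx_per_cpu is made up for this post, and it assumes the columns of the NET_RX row map to CPU0, CPU1, ... in order (true when no CPU is offline). The counters are cumulative since boot, so run it twice during a speed test and compare which CPU's number grows.

```shell
# Print the NET_RX softirq counter for each CPU, one per line.
# Hypothetical helper; the optional first argument lets you point it
# at a saved copy of /proc/softirqs instead of the live file.
net_rx_per_cpu() {
    awk '$1 == "NET_RX:" {
        for (i = 2; i <= NF; i++) printf "CPU%d %s\n", i - 2, $i
    }' "${1:-/proc/softirqs}"
}

# Usage on the router (uncomment to run):
# net_rx_per_cpu; sleep 5; net_rx_per_cpu
```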
Does PPPoE affect the downstream more than the upstream?
For testing, I've disabled the Cable interface - not sure how I can disable mwan3 more effectively (other than uninstalling it). It doesn't make a difference.
Only a single CPU is maxed out (testing from a device behind the router).
I have enabled Packet Steering under Interfaces -> Global Network Options. This didn't seem to have any effect either. If memory serves, irqbalance doesn't leverage multi-cpu - but rather just switches cpu assignments for the (soft)irq. Do you believe that would still help?
Okay, that means there is a chance that you might be able to spread the load around the other CPUs unless it is one single single-threaded task that causes this load.
I think it could help, and it is easy to test. Maybe start by posting the output of cat /proc/interrupts to see what all landed on CPU2?
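To narrow down that output, something like the following might help. It is a sketch, not a polished tool: irqs_on_cpu is a name invented here, and it assumes the usual /proc/interrupts layout (a header row with one column per CPU, then one row per interrupt with the counts first and the description last). It lists the interrupts that have fired on one CPU, busiest first.

```shell
# List interrupts with a non-zero count on the given CPU, highest first.
# Hypothetical helper: $1 is the CPU index, $2 an optional file
# (defaults to /proc/interrupts).
irqs_on_cpu() {
    awk -v col="$(( $1 + 2 ))" '
        NR == 1 { ncpu = NF; next }          # header: one field per CPU
        $1 ~ /:$/ && col <= ncpu + 1 && $col + 0 > 0 {
            name = ""
            # everything after the per-CPU counters describes the IRQ
            for (i = ncpu + 2; i <= NF; i++) name = name " " $i
            printf "%s %s%s\n", $col, $1, name
        }
    ' "${2:-/proc/interrupts}" | sort -rn
}

# Usage on the router (uncomment to run):
# irqs_on_cpu 2
```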
There is a bug in OpenWrt 22.03.x that means that Software Flow Offloading does not work for PPPoE. Therefore here is my suggestion:
If you don't use IPv6, then downgrade to 21.02.x, delete WAN6, delete the ULA prefix, and enable software flow offloading.
If you use IPv6, then sorry - the downgrade will increase the speed, but will bring bugs such as early termination of idle IPv6 TCP connections. Get something faster, which can deal with the high-speed connection without the flow offloading. E.g., Netgear R7800 (or XR500) or Linksys e8450.
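For the first option, the changes could be scripted roughly as below. This is a hedged sketch: it assumes the stock section names (network.wan6, network.globals.ula_prefix, and the first firewall defaults section), which may differ on a customized config, and the run wrapper and function names are made up for this example. By default it only prints what it would do; set APPLY=1 to actually run the commands.

```shell
# Dry-run wrapper (hypothetical): prints commands unless APPLY=1.
APPLY="${APPLY:-0}"
run() {
    if [ "$APPLY" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Remove the IPv6 pieces; section names assume a default OpenWrt config.
drop_ipv6() {
    run uci delete network.wan6
    run uci delete network.globals.ula_prefix
    run uci commit network
}

# Turn on software flow offloading in the firewall defaults.
enable_sw_offload() {
    run uci set "firewall.@defaults[0].flow_offloading=1"
    run uci commit firewall
    run /etc/init.d/firewall restart
}

# Usage on the router (uncomment to run; review the dry-run output first):
# drop_ipv6; enable_sw_offload
```

The same offloading switch is also available in LuCI under the firewall settings, so the script is only a convenience.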
Thank you! That sounds a lot like what I'm struggling with.
I wish I would've known this sooner. After upgrading my router, downgrading to 22.03 is no longer an option (hardware not supported). In fact, I'm running off snapshot now due to the need for DSA support for the Mikrotik hap ac2.
I'd still like to know if I can spread the load across multiple CPUs to alleviate the problem for now, while eagerly awaiting a fix for the bug you mentioned.
Unfortunately, neither of them seems to make any difference. After a reboot the softirq load is now on CPU3, but it doesn't move around or get spread between multiple CPUs.
Ugh... After this much messing around (and buying a new router) - I'm not a fan of going down that path.
If I recall correctly, the MT7621 target has PPPoE offload (in hardware) enabled with OpenWrt. You might want to look at a cheap (second-hand) device. Even new, some of those are only around 25-ish euro/usd (something like the Xiaomi 4A Gigabit or a Youhua WR1200JS). Keep your C7 as an additional access point / managed switch.
Alternatively: try using packet steering as you said BUT: actually redirect to a specific CPU.
ssh into your router:
cd /sys/class/net/wan/queues/rx-0
echo 2 > rps_cpus
For some reason on my targets I can't do the same for the tx-0 queues; I don't have an IPQ40xx to try. I am also not sure if you have multi queues (rx-0 rx-1 etc.). Try changing them as well.
If you are not using DSA, replace "wan" with eth0.1 or whatever the interface is called on your build.
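One detail worth spelling out: rps_cpus takes a hexadecimal CPU bitmask, not a CPU number, so the value 2 above has bit 1 set and steers to CPU1. A small sketch for building masks and applying them to every rx queue follows. The helper names cpu_mask and set_rps are invented here, and the third argument of set_rps exists only so the function can be pointed at a test directory instead of /sys/class/net.

```shell
# Build the hex bitmask rps_cpus expects: bit N set means CPU N.
# Hypothetical helper; arguments are CPU indices.
cpu_mask() {
    m=0
    for cpu in "$@"; do
        m=$(( m | (1 << cpu) ))
    done
    printf '%x\n' "$m"
}

# Write a mask to every rx queue of an interface (rx-0, rx-1, ...).
# $1 = interface, $2 = hex mask, $3 = optional sysfs base directory.
set_rps() {
    for q in "${3:-/sys/class/net}/$1"/queues/rx-*; do
        if [ -w "$q/rps_cpus" ]; then
            echo "$2" > "$q/rps_cpus"
        fi
    done
}

# Usage on the router (uncomment to run):
# set_rps wan "$(cpu_mask 1)"        # same effect as "echo 2 > rps_cpus"
# set_rps wan "$(cpu_mask 1 2 3)"    # allow CPU1-CPU3
```

Keep in mind that RPS picks a CPU per flow, so a single iperf3 stream will most likely still land on one CPU even with a wider mask; running iperf3 with several parallel streams should show whether the spreading works.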
Appreciate your suggestion, but I can live with the current speeds, and I'd rather keep poking around for a real solution on the hardware I invested (mostly effort and bit of money) in.
Interesting, doing these exact commands does give me a little bit of a speed increase (~180 mbit) and the softirq load correctly moves to this CPU. Unfortunately, I don't have multiple queues - which may be why packet steering isn't doing anything for me.