LAN issue with ipq40xx (24.10 and main)

Hello
I have opened an issue on GitHub describing a LAN-to-LAN problem with the MR8300 (ipq40xx/generic) running main, and therefore the 24.10 snapshot as well.

The issue is easy to reproduce: install a 24.10 snapshot or main with default settings, plug two computers into the LAN ports, and perform large file transfers between them. Speed barely reaches 100 MB/s with a high load average. With 23.05, bandwidth reaches 115 MB/s with a low load average. In both cases, no routing is involved.

Can someone reproduce this on this device, or on any other ipq40xx device?

The GitHub thread also provides information about the IRQ differences between 23.05 and main. This is one avenue to investigate.

Thank you.

2 Likes

I think GitHub would be better for this, as v24.10 is still very early in the oven; it's not even at rc1 yet.

1 Like

This has been around for a while now, and I notice you've participated in old threads around this issue, so I assume you already know about using an IRQ affinity script and possibly the bridger daemon as well. There's also a patch to boost the CPU clock speed a bit.
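For anyone who hasn't seen one of those scripts, here is a rough sketch of the idea, nothing more. The 'eth' pattern and the number of CPUs are placeholders; check /proc/interrupts on your own device first, since the actual interrupt names differ between targets and kernel versions.

#!/bin/sh
# Sketch: spread the ethernet-related interrupts across CPUs 0-3
# instead of leaving them all on CPU 0.
# 'eth' is only an example pattern; adjust it to match the names
# you see in /proc/interrupts on your device.
cpu=0
for irq in $(grep -i 'eth' /proc/interrupts | cut -d: -f1 | tr -d ' '); do
    mask=$((1 << cpu))                       # CPU bitmask: 1, 2, 4, 8, ...
    printf '%x' "$mask" > "/proc/irq/$irq/smp_affinity"
    cpu=$(( (cpu + 1) % 4 ))                 # MR8300 has 4 cores
done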

That said, getting more performance from the ethernet/DSA driver would be great; I just think this regression is something they're probably aware of, and it seems consistent across Qualcomm targets at this point.

1 Like

After testing my DAP-2610 (an ipq40xx device, used as a dumb AP) with snapshots, I had to go back to stable version 23.05.5.

Network performance is affected on snapshots, but not on 23.05.5.

1 Like

One basic task for a router is to handle LAN-to-LAN traffic at full gigabit capacity. There should be no significant CPU resources involved in this task. It is done perfectly with 23.05, which already uses DSA. I have tested/used a lot of routers with various OpenWrt versions and never encountered such a regression.

Does this mean that other devices and/or ipq subtargets are also impacted? Let's hope a solution is on its way.

Thank you.

Hi, I just upgraded my family's MR8300 router from v22.03.7 to v24.10.0. I'm no expert, but I thought I could use iperf to test between two devices on the local network both before and after, so that's what I did. The results are a bit further down, but in short they seem to match/confirm this issue.

I don't think this issue is a big deal for our use case of a basic home wireless router (we rarely share files/data within the local network), but I thought I could test and provide a data point in the process of upgrading at least. In fact, the Wi-Fi connection with my laptop is quite a bit faster on the new version, so I'm certainly not complaining :slight_smile:

NOTE: While I was writing this up I tried running the test a few more times, and once in a while the bandwidth tested much better (in line with before)! The load average is still high each time, however. I also took a snapshot of 'top' while the test was running; it's also below, and as you can see '[napi/eth0-8]' seems to be the high CPU user while the test is running.

Test Results:

Before upgrade (OpenWrt v22.03.7):

Bandwidth: 948 Mbits/s
1min load avg: 0.01 (at end of test)
(from Realtime graphs / System load page, for 1 min load) Average: 0.02, peak: 0.08

After upgrade (OpenWrt v24.10.0):

Test 1:
Bandwidth: 730 Mbits/s
1min load avg: 1.12 (at end of test)
(from Realtime graphs / System load page, for 1 min load) Average: 1.24, peak: 1.46

Test 2:
Bandwidth: 606 Mbits/s
1min load avg: 1.14 (at end of test)
(from Realtime graphs / System load page, for 1 min load) Average: 1.07, peak: 1.29

Test Notes:

Basically I just ran the basic iperf3 test, with one machine running 'iperf3 -s' and the other running 'iperf3 -c <ip_addr> -t 240'. Both are laptops, one running Windows 10, the other Fedora Linux.

Top Sample while test is running:

Mem: 194884K used, 311920K free, 428K shrd, 0K buff, 19828K cached
CPU:   0% usr  16% sys   0% nic  68% idle   0% io   0% irq  14% sirq
Load average: 1.11 0.83 0.62 5/116 4454
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
   69     2 root     RW       0   0%  24% [napi/eth0-8]
   76     2 root     RW       0   0%   4% [napi/eth0-0]
   75     2 root     SW       0   0%   2% [napi/eth0-5]
   14     2 root     SW       0   0%   0% [ksoftirqd/0]
 1150  1105 network  S     3284   1%   0% /usr/sbin/hostapd -s -g /var/run/hostapd/global
 4416  4408 root     R     1144   0%   0% top
1 Like

Thank you for your tests.

I do transfer a lot of data between local computers on the LAN. In any case, LAN-to-LAN traffic should work flawlessly; it's a primary task for a router.
Is any other ipq40xx device impacted by this issue?

I upgraded my MR8300 to 24.10.0 and I can confirm I'm experiencing this too.

1 Like

Did you try to play with packet steering or anything else? I won't be able to experiment myself with 24.10.0 in the near future.
EDIT: in about a week.

Nope, nothing. I have SQM installed but it's not active, and it wasn't active before the upgrade either.

EDIT: Just found out packet steering is enabled but not for all CPUs; I guess I can test this.
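If you want to flip that from the command line, something like the sketch below should do it. The '2' value ("enabled, all CPUs") is how recent builds appear to expose it in the globals section; verify against what LuCI writes on your release before relying on it.

# Sketch: switch global packet steering from 'enabled' (1) to
# 'enabled (all CPUs)' (2), assuming your build supports that value.
uci set network.globals.packet_steering='2'
uci commit network
/etc/init.d/network restart
# To revert: uci set network.globals.packet_steering='1' && uci commit network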

1 Like

I haven't been able to try this since Nov. 2024. Maybe something has changed.

Same high (1.0) CPU load on LAN-to-LAN transfers (a 5 GB file), speed around 750 Mbps, and roughly the same with any packet steering setting.

1 Like

To me, this sounds like packets are getting erroneously trapped from the switch to the CPU port. Normally LAN2LAN packets should not be seen by the CPU at all and should be handled entirely inside the switch block of the IC. That happens at wire speed.

You could check whether the RX/TX stats of the CPU port rise while you transfer LAN2LAN. They shouldn't. If they do, then packets are getting trapped to the CPU port, and the main CPU does the bridging instead of the switch block.

EDIT: Via SSH: ifconfig -a | grep -E '(Link |RX by)', or LuCI, as you found.
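Another quick way to see whether the CPU port is carrying the traffic is to compare its packet counters before and after a transfer. A minimal sketch, assuming eth0 is the CPU-facing interface as on the MR8300:

# Read the CPU-port RX counter, run the LAN-to-LAN transfer, read it again.
# A delta of a few thousand packets (ARP, broadcasts, etc.) is normal;
# millions of packets means the CPU is doing the bridging.
before=$(cat /sys/class/net/eth0/statistics/rx_packets)
sleep 60   # run the iperf3 / file transfer during this window
after=$(cat /sys/class/net/eth0/statistics/rx_packets)
echo "rx_packets delta: $((after - before))"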

2 Likes

Thanks, can you guide me on how to check that? I'm not finding it in LuCI, so it's probably via the CLI? Or is it eth0?

Edit:
The transfer was between lan1 and lan2, and eth0 is showing traffic both in and out.

1 Like

@SC_VT and @Specimen -- can we see your configs to make sure there isn't a problem there? If there is an error in the way the network config is set up, it's possible that it could result in this unexpected behavior.

Please connect to your OpenWrt device using ssh and copy the output of the following commands and post it here using the "Preformatted text </> " button:
Remember to redact passwords, MAC addresses, and any public IP addresses you may have:

ubus call system board
cat /etc/config/network
1 Like

There was an issue recently with the "special" VLANs 0 and 1 and trapping on some other targets/platforms. You could, as an additional data point:

  • set up two LAN ports with e.g. VLAN 7, untagged, as primary PVID
  • don't extend this VLAN to the CPU (unselect 'Local') -> the VLAN then only exists within the switch block
  • give the two computers connected to these VLANed ports static IPs (from a different subnet than your main one)
  • run iperf again

With this setup, packets from/to these VLANed ports should NEVER reach the CPU port. If my suspicion holds, the test should show wire-speed gigabit transfers. And then we go from there.
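For reference, a rough UCI sketch of that test setup, assuming lan3 and lan4 are the two test ports and that your netifd/LuCI version exposes the per-VLAN 'Local' flag as the 'local' option on a bridge-vlan section (double-check against what LuCI writes on your build):

config bridge-vlan
        option device 'br-lan'
        option vlan '7'
        option local '0'
        list ports 'lan3:u*'
        list ports 'lan4:u*'

Keep in mind that adding a bridge-vlan section turns on VLAN filtering for br-lan, so you would probably also need a second bridge-vlan entry (local, covering lan1/lan2 as their PVID) so the rest of the LAN stays reachable during the test; doing the whole thing from the LuCI bridge VLAN filtering tab is likely the safer route.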

1 Like

The last time I tested, I simply used the default settings. I'll run further tests ASAP.

Thanks! Here you go:

root@******:~# ubus call system board
{
        "kernel": "6.6.73",
        "hostname": "******",
        "system": "ARMv7 Processor rev 5 (v7l)",
        "model": "Linksys MR8300 (Dallas)",
        "board_name": "linksys,mr8300",
        "rootfs_type": "squashfs",
        "release": {
                "distribution": "OpenWrt",
                "version": "24.10.0",
                "revision": "r28427-6df0e3d02a",
                "target": "ipq40xx/generic",
                "description": "OpenWrt 24.10.0 r28427-6df0e3d02a",
                "builddate": "1738624177"
        }
}
root@******:~# cat /etc/config/network

config interface 'loopback'
        option device 'lo'
        option proto 'static'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'

config globals 'globals'
        option ula_prefix 'fd...'
        option packet_steering '1'

config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'lan1'
        list ports 'lan2'
        list ports 'lan3'
        list ports 'lan4'
        option stp '1'

config device
        option name 'lan1'
        option macaddr '**:**:**:**:**:**'

config device
        option name 'lan2'
        option macaddr '**:**:**:**:**:**'

config device
        option name 'lan3'
        option macaddr '**:**:**:**:**:**'

config device
        option name 'lan4'
        option macaddr '**:**:**:**:**:**'

config interface 'lan'
        option device 'br-lan'
        option proto 'static'
        option ipaddr '192.168.2.1'
        option netmask '255.255.255.0'
        option ip6assign '60'

config device
        option name 'wan'
        option macaddr '**:**:**:**:**:*2'

config interface 'wan'
        option device 'wan'
        option proto 'dhcp'
        option peerdns '0'
        option metric '10'

config interface 'wan6'
        option device 'wan'
        option proto 'dhcpv6'
        option reqaddress 'try'
        option reqprefix 'auto'
        option peerdns '0'

config interface 'wwan'
        option proto 'dhcp'
        option peerdns '0'
        option metric '20'

What happens if you remove stp from here? (restart and test again)
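If it helps, one way to do that from the CLI, assuming br-lan is the first anonymous 'config device' section as in the config posted above (confirm with 'uci show network' first):

# List the device sections so you can confirm which index br-lan has.
uci show network | grep device
# Drop the stp option from the br-lan device section (index 0 here) and apply.
uci -q delete network.@device[0].stp
uci commit network
/etc/init.d/network restart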

1 Like

Yeah, exactly the same; the 1-minute load has been 1.5 in the last attempts. Sorry, I can't test any more, I don't have time this week.
EDIT: I'll just boot into the old partition and use the previous version (1-minute load ~0.27).

1 Like