LAN issue with ipq40xx (24.10 and main)

Hello
I have opened an issue on GitHub describing a LAN-to-LAN problem with the MR8300 (ipq40xx/generic) running main, and therefore the 24.10 snapshot as well.

The issue is easy to reproduce: install a 24.10 snapshot or main with default settings, plug two computers into the LAN ports, and perform large file transfers between them. Speed barely reaches 100 MB/s with a high load average. With 23.05, bandwidth reaches 115 MB/s with a low load average. In both cases, no routing is involved.

Can someone reproduce this on this device, or on any other ipq40xx device?

The GitHub thread also provides information about the IRQ differences between 23.05 and main. This is one avenue to investigate.

Thank you.

2 Likes

I think GitHub would be better for this, as v24.10 is still very early in the oven; it's not even at rc1 yet.

1 Like

This has been around for a while now, and I notice you've participated in old threads around this issue, so I assume you already know about using an IRQ affinity script and possibly the bridger daemon as well. There's also a patch to boost the CPU clock speed a bit.
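For anyone who hasn't seen one of those scripts, here is a rough sketch of the idea, nothing more. The 'eth' pattern and the number of CPUs are placeholders; check /proc/interrupts on your own device first, since the actual interrupt names differ between targets and kernel versions.

#!/bin/sh
# Sketch: spread the ethernet-related interrupts across CPUs 0-3
# instead of leaving them all on CPU 0.
# 'eth' is only an example pattern; adjust it to match the names
# you see in /proc/interrupts on your device.
cpu=0
for irq in $(grep -i 'eth' /proc/interrupts | cut -d: -f1 | tr -d ' '); do
    mask=$((1 << cpu))                       # CPU bitmask: 1, 2, 4, 8, ...
    printf '%x' "$mask" > "/proc/irq/$irq/smp_affinity"
    cpu=$(( (cpu + 1) % 4 ))                 # MR8300 has 4 cores
done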

That said, getting more performance from the ethernet/DSA driver would be great; I just think this regression is something they're probably aware of, and it seems consistent across Qualcomm targets at this point.

1 Like

After testing my DAP-2610 (an ipq40xx device, used as a dumb AP) with snapshots, I had to go back to stable version 23.05.5.

Network performance is affected on snapshots, but not on 23.05.5.

1 Like

One basic task for a router is to handle LAN-to-LAN traffic at full gigabit capacity. There should be no significant CPU resources involved in this task. It is done perfectly with 23.05, which already uses DSA. I have tested/used a lot of routers with various OpenWrt versions and never encountered such a regression.

Does this mean that other devices and/or ipq subtargets are also impacted? Let's hope a solution is on its way.

Thank you.

Hi, I just upgraded my family's MR8300 router from v22.03.7 to v24.10.0. I'm no expert, but I thought I could use iperf to test between two devices on the local network both before and after, so that's what I did. The results are a bit further down, but in short they seem to match/confirm this issue.

I don't think this issue is a big deal for our use case of a basic home wireless router (we rarely share files/data within the local network), but I thought I could test and provide a data point in the process of upgrading at least. In fact, the Wi-Fi connection with my laptop is quite a bit faster on the new version, so I'm certainly not complaining :slight_smile:

NOTE: While I was writing this up I tried running the test a few more times, and once in a while the bandwidth tested much better (in line with before)! The load average is still high each time, however. I also took a snapshot of 'top' while the test was running; it's also below, and as you can see '[napi/eth0-8]' seems to be the high CPU user while the test is running.

Test Results:

Before upgrade (OpenWrt v22.03.7):

Bandwidth: 948 Mbits/s
1min load avg: 0.01 (at end of test)
(from Realtime graphs / System load page, for 1 min load) Average: 0.02, peak: 0.08

After upgrade (OpenWrt v24.10.0):

Test 1:
Bandwidth: 730 Mbits/s
1min load avg: 1.12 (at end of test)
(from Realtime graphs / System load page, for 1 min load) Average: 1.24, peak: 1.46

Test 2:
Bandwidth: 606 Mbits/s
1min load avg: 1.14 (at end of test)
(from Realtime graphs / System load page, for 1 min load) Average: 1.07, peak: 1.29

Test Notes:

Basically I just ran the basic iperf3 test, with one machine running 'iperf3 -s' and the other running 'iperf3 -c <ip_addr> -t 240'. Both are laptops, one running Windows 10, the other Fedora Linux.

Top Sample while test is running:

Mem: 194884K used, 311920K free, 428K shrd, 0K buff, 19828K cached
CPU:   0% usr  16% sys   0% nic  68% idle   0% io   0% irq  14% sirq
Load average: 1.11 0.83 0.62 5/116 4454
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
   69     2 root     RW       0   0%  24% [napi/eth0-8]
   76     2 root     RW       0   0%   4% [napi/eth0-0]
   75     2 root     SW       0   0%   2% [napi/eth0-5]
   14     2 root     SW       0   0%   0% [ksoftirqd/0]
 1150  1105 network  S     3284   1%   0% /usr/sbin/hostapd -s -g /var/run/hostapd/global
 4416  4408 root     R     1144   0%   0% top
1 Like

Thank you for your tests.

I do transfer a lot of data between local computers on the LAN. In any case, LAN-to-LAN traffic should work flawlessly; it's a primary task for a router.
Is any other ipq40xx device impacted by this issue?

I upgraded my MR8300 to 24.10.0 and I can confirm I'm experiencing this too.

1 Like

Did you try to play with packet steering or anything else? I won't be able to experiment myself with 24.10.0 in the near future.
EDIT: in about a week.

Nope, nothing. I have SQM installed but it's not active, and it wasn't active before the upgrade either.

EDIT: Just found out packet steering is enabled but not for all CPUs; I guess I can test this.
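If you want to flip that from the command line, something like the sketch below should do it. The '2' value ("enabled, all CPUs") is how recent builds appear to expose it in the globals section; verify against what LuCI writes on your release before relying on it.

# Sketch: switch global packet steering from 'enabled' (1) to
# 'enabled (all CPUs)' (2), assuming your build supports that value.
uci set network.globals.packet_steering='2'
uci commit network
/etc/init.d/network restart
# To revert: uci set network.globals.packet_steering='1' && uci commit network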

1 Like

I haven't been able to try this since Nov. 2024. Maybe something has changed.

Same high (1.0) CPU load on LAN-to-LAN transfers (a 5 GB file), speed around 750 Mbps, and roughly the same with any packet steering setting.

1 Like

To me, this sounds like packets are getting erroneously trapped from the switch to the CPU port. Normally LAN2LAN packets should not be seen by the CPU at all and should be handled entirely inside the switch block of the IC. That happens at wire speed.

You could check whether the RX/TX stats of the CPU port rise while you transfer LAN2LAN. They shouldn't. If they do, then packets are getting trapped to the CPU port, and the main CPU does the bridging instead of the switch block.

EDIT: Via SSH: ifconfig -a | grep -E '(Link |RX by)', or LuCI, as you found.
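Another quick way to see whether the CPU port is carrying the traffic is to compare its packet counters before and after a transfer. A minimal sketch, assuming eth0 is the CPU-facing interface as on the MR8300:

# Read the CPU-port RX counter, run the LAN-to-LAN transfer, read it again.
# A delta of a few thousand packets (ARP, broadcasts, etc.) is normal;
# millions of packets means the CPU is doing the bridging.
before=$(cat /sys/class/net/eth0/statistics/rx_packets)
sleep 60   # run the iperf3 / file transfer during this window
after=$(cat /sys/class/net/eth0/statistics/rx_packets)
echo "rx_packets delta: $((after - before))"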

2 Likes

Thanks, can you guide me on how to check that? I'm not finding it in LuCI, so it's probably via the CLI? Or is it eth0?

Edit:
The transfer was between lan1 and lan2, and eth0 is showing traffic both in and out.

1 Like

@SC_VT and @Specimen -- can we see your configs to make sure there isn't a problem there? If there is an error in the way the network config is set up, it's possible that it could result in this unexpected behavior.

Please connect to your OpenWrt device using ssh and copy the output of the following commands and post it here using the "Preformatted text </> " button:
Remember to redact passwords, MAC addresses, and any public IP addresses you may have:

ubus call system board
cat /etc/config/network
1 Like

There was an issue recently with the "special" VLANs 0 and 1 and trapping on some other targets/platforms. You could, as an additional data point:

  • set up two LAN ports with e.g. VLAN 7, untagged, as primary PVID
  • don't extend this VLAN to the CPU (unselect 'Local') -> the VLAN then only exists within the switch block
  • give the two computers connected to these VLANed ports static IPs (from a different subnet than your main one)
  • run iperf again

With this setup, packets from/to these VLANed ports should NEVER reach the CPU port. If my suspicion holds, the test should show wire-speed gigabit transfers. And then we go from there.
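For reference, a rough UCI sketch of that test setup, assuming lan3 and lan4 are the two test ports and that your netifd/LuCI version exposes the per-VLAN 'Local' flag as the 'local' option on a bridge-vlan section (double-check against what LuCI writes on your build):

config bridge-vlan
        option device 'br-lan'
        option vlan '7'
        option local '0'
        list ports 'lan3:u*'
        list ports 'lan4:u*'

Keep in mind that adding a bridge-vlan section turns on VLAN filtering for br-lan, so you would probably also need a second bridge-vlan entry (local, covering lan1/lan2 as their PVID) so the rest of the LAN stays reachable during the test; doing the whole thing from the LuCI bridge VLAN filtering tab is likely the safer route.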

1 Like

The last time I tested, I simply used the default settings. I'll run further tests ASAP.

Thanks! Here you go:

root@******:~# ubus call system board
{
        "kernel": "6.6.73",
        "hostname": "******",
        "system": "ARMv7 Processor rev 5 (v7l)",
        "model": "Linksys MR8300 (Dallas)",
        "board_name": "linksys,mr8300",
        "rootfs_type": "squashfs",
        "release": {
                "distribution": "OpenWrt",
                "version": "24.10.0",
                "revision": "r28427-6df0e3d02a",
                "target": "ipq40xx/generic",
                "description": "OpenWrt 24.10.0 r28427-6df0e3d02a",
                "builddate": "1738624177"
        }
}
root@******:~# cat /etc/config/network

config interface 'loopback'
        option device 'lo'
        option proto 'static'
        option ipaddr '127.0.0.1'
        option netmask '255.0.0.0'

config globals 'globals'
        option ula_prefix 'fd...'
        option packet_steering '1'

config device
        option name 'br-lan'
        option type 'bridge'
        list ports 'lan1'
        list ports 'lan2'
        list ports 'lan3'
        list ports 'lan4'
        option stp '1'

config device
        option name 'lan1'
        option macaddr '**:**:**:**:**:**'

config device
        option name 'lan2'
        option macaddr '**:**:**:**:**:**'

config device
        option name 'lan3'
        option macaddr '**:**:**:**:**:**'

config device
        option name 'lan4'
        option macaddr '**:**:**:**:**:**'

config interface 'lan'
        option device 'br-lan'
        option proto 'static'
        option ipaddr '192.168.2.1'
        option netmask '255.255.255.0'
        option ip6assign '60'

config device
        option name 'wan'
        option macaddr '**:**:**:**:**:*2'

config interface 'wan'
        option device 'wan'
        option proto 'dhcp'
        option peerdns '0'
        option metric '10'

config interface 'wan6'
        option device 'wan'
        option proto 'dhcpv6'
        option reqaddress 'try'
        option reqprefix 'auto'
        option peerdns '0'

config interface 'wwan'
        option proto 'dhcp'
        option peerdns '0'
        option metric '20'

What happens if you remove stp from here? (restart and test again)
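If it helps, one way to do that from the CLI, assuming br-lan is the first anonymous 'config device' section as in the config posted above (confirm with 'uci show network' first):

# List the device sections so you can confirm which index br-lan has.
uci show network | grep device
# Drop the stp option from the br-lan device section (index 0 here) and apply.
uci -q delete network.@device[0].stp
uci commit network
/etc/init.d/network restart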

1 Like

Yeah, exactly the same; the 1-minute load has been 1.5 in the last attempts. Sorry, I can't test any more, I don't have time this week.
EDIT: I'll just boot into the old partition and use the previous version (1-minute load ~0.27).

1 Like