ICMP echo replies get lost often when using the last eth ports

a7ypically · December 3, 2024, 6:58pm

I'm using an x86/64 machine with OpenWrt 23.05 and 24.10. I recently set up luci-app-statistics to watch ping stats for multiple hosts from my wan.
I noticed I got a lot of timeouts and lost packets on the dashboard when I'm pinging 8 hosts or more in parallel.
The way collectd works through liboping is it sends all the ICMP echo requests one after the other and then waits for replies. Using tcpdump I can see all the requests and also see that I get all the replies but they do not reach the socket. Some do, some don't.
I have two wans, one connected to eth1 and one to eth4. If I run the pings on the wan on eth1, all is fine. On eth4, about 50% are lost (you do need to ping more than 5 hosts in parallel). Again, I see all the replies reaching the kernel, but they do not reach the socket of the ping library. To make sure this is not a library bug, I wrote a simple C program that sends ICMP echo requests in a loop and waits for the replies, and I get the same thing: it works fine on the wan on eth1, and 50% are lost on eth4.
That was on 23.05. I tried running on 24.10 with the newer kernel to see if there's any difference. 24.10 changed the order of my eth ports. I have a setup of 4 ethernet ports and 2 SFP ports, with one of the wans running on SFP and one on ethernet. So now on 24.10 the wan that was working fine on eth1 was renamed to eth5, and the wan that didn't work well on eth4 was renamed to eth2. Running the tests, the wan that didn't work previously on 23.05 works fine (now named eth2), and the one that worked fine on 23.05 (now named eth5) is losing packets. As far as I can see, nothing changed other than the name of the port. Same routing rules, same metrics.
So, something in the kernel does not send the packet (only about 50% of the replies) to the socket if it's on eth4 or eth5.
Is there any configuration on OpenWrt that can explain why the packets are lost more often for the last numbered ports?
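
For reference, the send-everything-then-wait pattern described above can be sketched roughly as follows. This is only an illustration, not liboping's code and not the C test program mentioned in this post. The function names (`icmp_checksum`, `build_echo_request`, `ping_burst`) are made up for the example, and it assumes Linux, root privileges for the raw socket, and hosts given as IPv4 address strings.

```python
import os
import socket
import struct
import time


def icmp_checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071) over data, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes = b"probe") -> bytes:
    """Build an ICMP echo request (type 8, code 0) with a valid checksum."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload


def ping_burst(hosts, device=None, timeout=1.0):
    """Send one echo request to every host back to back, then collect replies.

    Returns the set of hosts that did not answer before the timeout.
    Assumption: hosts are IPv4 address strings and the caller is root.
    """
    ident = os.getpid() & 0xFFFF
    sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    try:
        if device:
            # Bind to one interface, similar in spirit to `oping -D <device>`.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                            device.encode())
        for seq, host in enumerate(hosts):
            sock.sendto(build_echo_request(ident, seq), (host, 0))
        pending = set(hosts)
        deadline = time.monotonic() + timeout
        while pending:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                packet, (addr, _port) = sock.recvfrom(2048)
            except socket.timeout:
                break
            ihl = (packet[0] & 0x0F) * 4  # IPv4 header length in bytes
            if len(packet) < ihl + 8:
                continue
            icmp_type, _code, _csum, rid, _seq = struct.unpack(
                "!BBHHH", packet[ihl:ihl + 8])
            if icmp_type == 0 and rid == ident:  # echo reply for our requests
                pending.discard(addr)
        return pending
    finally:
        sock.close()
```

Calling `ping_burst(["1.1.1.1", "8.8.8.8"], device="eth4")` as root would return the hosts whose replies never reached the socket, which is the symptom described here.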

brada4 · December 3, 2024, 7:21pm

Are there any drops in ifconfig -a and cat /proc/net/softnet_stat?

Please also post output from:

ubus call system board
ethtool -i eth0
...
ethtool -i eth5
lspci -nn # (the section about network adapters in question)
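
In case it helps when reading /proc/net/softnet_stat: each line is one CPU, the values are hexadecimal, and the second column is the per-CPU drop counter. A small sketch for pulling that column out follows; the function name `softnet_drops` is made up for this example.

```python
def softnet_drops(text):
    """Return the per-CPU drop counters from /proc/net/softnet_stat contents.

    Each line describes one CPU; column 0 is packets processed and
    column 1 is packets dropped, both written in hexadecimal.
    """
    drops = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            drops.append(int(fields[1], 16))
    return drops
```

On the router this could be used as `softnet_drops(open("/proc/net/softnet_stat").read())`; any non-zero entry means that CPU's backlog queue overflowed at some point.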

a7ypically · December 3, 2024, 8:09pm

No drops in /proc/net/softnet_stat.

I do see drops in ifconfig on eth5 (which loses packets) and none on eth2. However, since I see the replies in tcpdump, doesn't that imply they were not dropped by the interface?

ifconfig -a

eth2      Link encap:Ethernet  HWaddr XX
          inet addr:  Bcast:  Mask:255.255.240.0
          inet6 addr:  Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11683 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6402 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:882001 (861.3 KiB)  TX bytes:565031 (551.7 KiB)
          Memory:50600000-506fffff

eth5      Link encap:Ethernet  HWaddr XX
          inet addr:  Bcast:  Mask:255.255.255.224
          inet6 addr:  Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11253031 errors:0 dropped:11625 overruns:0 frame:0
          TX packets:5532620 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:13592379547 (12.6 GiB)  TX bytes:4090296973 (3.8 GiB)

ubus call system board
{
        "kernel": "6.6.63",
        "hostname": "lede_pc",
        "system": "13th Gen Intel(R) Core(TM) i5-1345U",
        "model": "Fisusen Technology Co., Ltd FSX-ALU4L2S",
        "board_name": "fisusen-technology-co-ltd-fsx-alu4l2s",
        "rootfs_type": "ext4",
        "release": {
                "distribution": "OpenWrt",
                "version": "24.10-SNAPSHOT",
                "revision": "r28046-7ef734deac",
                "target": "x86/64",
                "description": "OpenWrt 24.10-SNAPSHOT r28046-7ef734deac",
                "builddate": "1732661318"
        }
}

ethtool -i eth2
driver: igc
version: 6.6.63
firmware-version: 2014:8877
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ethtool -i eth5
driver: ixgbe
version: 6.6.63
firmware-version: 0x800006d1
expansion-rom-version:
bus-info: 0000:07:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

lspci -nn
03:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I226-V [8086:125c] (rev 04)
04:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I226-V [8086:125c] (rev 04)
05:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I226-V [8086:125c] (rev 04)
06:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I226-V [8086:125c] (rev 04)
07:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
07:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)

So now on 24.10, running the pings on the SFP wan named eth5, the socket does not get all the ICMP replies (even though I can see the replies arriving in tcpdump), while the RJ45 port eth2 is fine.
On 23.05, without changing anything, the SFP port is named eth1 and works fine, while the RJ45 port named eth4 doesn't see all the replies.
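
To put the ifconfig counters quoted above in proportion, here is a short sketch that computes the RX drop ratio from the two counter lines. The helper name `rx_drop_ratio` is made up, and the regular expression assumes the older ifconfig output format shown in this post.

```python
import re


def rx_drop_ratio(rx_line):
    """Return dropped / received for an ifconfig 'RX packets:' counter line."""
    match = re.search(r"RX packets:(\d+).*?dropped:(\d+)", rx_line)
    if match is None:
        raise ValueError("not an ifconfig RX counter line")
    packets, dropped = int(match.group(1)), int(match.group(2))
    return dropped / packets if packets else 0.0


# Counter lines copied from the ifconfig output in this post.
eth5 = "RX packets:11253031 errors:0 dropped:11625 overruns:0 frame:0"
eth2 = "RX packets:11683 errors:0 dropped:0 overruns:0 frame:0"

print(f"eth5: {rx_drop_ratio(eth5):.3%}")  # eth5: 0.103%
print(f"eth2: {rx_drop_ratio(eth2):.3%}")  # eth2: 0.000%
```

The eth5 counter is cumulative and works out to roughly 0.1% of received packets, far below the roughly 50% of replies that go missing, so on its own it does not look like the explanation.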

brada4 · December 3, 2024, 8:43pm

Try raising the following sysctl (doubling it each time):

net.core.netdev_max_backlog=1000

And see if any rx drops occur within an hour.

a7ypically · December 3, 2024, 9:09pm

I reached 10000 and still have the same issue.
Maybe I wasn't clear but this is easily reproducible.
Just running:

oping -c 1 -D eth1 1.1.1.1 8.8.8.8 4.2.2.2 208.67.222.222 208.67.220.220 208.67.220.2 208.67.222.2

This successfully returns results for all hosts.
vs
oping -c 1 -D eth4 1.1.1.1 8.8.8.8 4.2.2.2 208.67.222.222 208.67.220.220 208.67.220.2 208.67.222.2
This returns timeouts for some hosts. If I switch OpenWrt version, which renames the ports, I always see issues with the connection that gets the name eth4 or eth5.
I switched back to 23.05 and I see the issue occurs on eth4 while ifconfig does not show any dropped packets. That makes sense, as I do see the replies arriving in tcpdump. Something in the kernel hijacks the packets before they reach the socket, but only if the interface is named as one of the last ports.

brada4 · December 3, 2024, 9:11pm

Is it always the same physical port glitching?
If so, check ethtool -S ethX for things like checksum failures or overflows.

a7ypically · December 3, 2024, 9:13pm

No. Again, it's just the name of the port. By switching OpenWrt versions, the same physical interfaces get different eth numbers. So the same physical interface works fine if it is named eth1 or eth2 and starts losing replies when it is named eth4 or eth5. Since I see the reply packets in tcpdump, this does not seem like a physical or driver issue.

brada4 · December 3, 2024, 9:29pm

Can you quickly switch to Debian on a USB stick and check the behavior there?

flygarn12 · December 3, 2024, 10:08pm

Does default OpenWrt even support more than one wan port?

brada4 · December 3, 2024, 10:13pm

By default: one wan, one routing table.

flygarn12 · December 3, 2024, 10:14pm

Yeah, but usually everything crashes if you put two ports in the wan vlan.

If you want more than one wan port, you need to install a package whose name I don't remember at the moment.

a7ypically · December 4, 2024, 2:01pm

There is no vlan for the wans.
Anyway, after disabling the firewall it works fine, so I guess something in the firewall rules is causing my issue.

brada4 · December 4, 2024, 2:14pm

Tell me more about disabling firewall.

flygarn12 · December 4, 2024, 3:38pm

You need something to connect the data streams for the defined ports at the L2 level to the logical L3 level.

Lan is vlan1 and wan is vlan2 in OpenWrt, but they are untagged by default, so everything works in the background and everyone forgets about them until someone starts tagging the vlans.

mlichvar · February 26, 2025, 7:59am

I was investigating a similar issue and it turned out it was the banip service I had enabled that was configuring the firewall to limit icmp to 10 packets per second. Setting ban_icmplimit to 0 fixed the issue for me.
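
To illustrate why a packet-rate limit hurts a burst of parallel pings in particular, here is a simplified token-bucket model. It is a sketch under assumptions, not banIP's or nftables' actual implementation: the rate of 10 packets per second comes from the description above, while the burst of 5 is assumed (as far as I know it is what nftables uses when a limit rule does not set a burst). The function name `accepted_packets` is made up.

```python
def accepted_packets(arrivals, rate=10.0, burst=5):
    """Return the arrival times that pass a simple token-bucket limiter.

    arrivals: packet arrival times in seconds.
    rate:     tokens added per second.
    burst:    bucket capacity (assumed value, see the note above).
    """
    tokens = float(burst)
    last = None
    passed = []
    for t in sorted(arrivals):
        if last is not None:
            # Refill for the time elapsed, never above the bucket capacity.
            tokens = min(float(burst), tokens + (t - last) * rate)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0
            passed.append(t)
    return passed


# Seven packets arriving at essentially the same instant: only 5 get through.
print(len(accepted_packets([0.0] * 7)))  # 5
# The same seven packets spread 200 ms apart all pass.
print(len(accepted_packets([i * 0.2 for i in range(7)])))  # 7
```

Under these assumptions, anything beyond the first five packets of a simultaneous burst is dropped, which would be consistent with the loss only showing up when more than about five hosts are pinged in parallel.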

Use nft list ruleset to see the firewall configuration.
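
Since the ruleset can be long, a small filter for rate-limit rules that mention ICMP may help. This is only a sketch: the function name `find_icmp_limits` is made up, and `SAMPLE` is a hypothetical excerpt written for illustration, not actual banIP output.

```python
def find_icmp_limits(ruleset_text):
    """Return rule lines that contain both 'limit rate' and 'icmp'.

    Matching on 'icmp' also catches icmpv6 rules, which is intentional here.
    """
    hits = []
    for line in ruleset_text.splitlines():
        rule = line.strip()
        if "limit rate" in rule and "icmp" in rule:
            hits.append(rule)
    return hits


# Hypothetical ruleset excerpt, for illustration only.
SAMPLE = """\
table inet example {
    chain input {
        meta l4proto icmp limit rate over 10/second counter drop
        tcp dport 22 accept
    }
}
"""

for rule in find_icmp_limits(SAMPLE):
    print(rule)
```

On the router, the text to scan would be the output of `nft list ruleset`, for example captured with `subprocess.run(["nft", "list", "ruleset"], capture_output=True, text=True).stdout`.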