Mtk_soc_eth watchdog timeout after r11573

And you did get timeouts without those patches applied? If so, how often would these trigger? I am still going to be skeptical until we can get 3+ weeks of uptime without the issue. I am currently using a snapshot from the 19.07 branch with the flow control patches, and I see the issue roughly once per week.

Yes, my Xiaomi Mi Router 3G was spamming the log with timeouts when I ran an iperf test with a 5 GHz Wi-Fi client. A test size of 5 GB was sufficient to trigger the bug. The client was an iPhone XS and the server was an ODROID-N2 connected by cable. The transfer rate was about 340 Mbit/s for seconds 1 to 8, then 0 Mbit/s for seconds 9 to 18, with the timeout showing up in the log.

Edit: at least the bug is no longer triggered constantly on trunk.
My mir3g has only 1.5 days of uptime, so I can't say anything about the long term.
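
For anyone who wants to put a number on "spamming the log", here is a minimal sketch that counts timeout lines in whatever log text is piped into it. The function name is made up for illustration, and the match string is an assumption based on the "transmit timed out" wording used in this thread; adjust it to whatever your log actually prints.

```shell
# Hypothetical helper: count log lines that mention a timeout.
# Assumption: the relevant lines contain the words "timed out",
# as in the "transmit timed out" errors quoted in this thread.
count_tx_timeouts() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c 'timed out' || true
}
```

On the router it could be used as `logread | count_tx_timeouts` before and after an iperf run, to compare the two counts.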

Most likely this one: https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=blob;f=target/linux/generic/pending-5.4/770-03-net-ethernet-mtk_eth_soc-fix-unnecessary-tx-queue-st.patch;h=fd28bb86925b773abe332d0fd897cefac2022312;hb=59d236f11df7539cfd9524a8d7857faefe40f74b

Quite interesting...

I didn't see this patch in master. Does that mean master doesn't have this problem, if it's confirmed?

I'm guessing it does. I have no hardware to test.

My D-Link DIR-860L B1 has been up and running for over 4 days, and I just checked and didn't find anything unusual in either dmesg or logread.

However, it has been used as a wireless AP only; no routing function has been used for a long time.

Point-to-point testing creates just one row in the NAT table.
A more representative test would be point (LAN) to multipoint (WAN),
with a few thousand NAT translations.

My test didn't use NAT at all, or did it? It was done entirely on the local network, and the Xiaomi 3G router is set up as a dumb AP. What I meant by "5GB was sufficient to trigger the bug" is that the network became unreliable just from transferring large files. I don't know whether the bug still exists in corner cases, but before the patch, OpenWrt trunk was not reliable on my Xiaomi 3G because the network stalled for around 5 seconds very often.

I think this is more than one problem.
I get "error 320" (the sch_generic.c:320 warning in the log below) exactly when no LAN or Wi-Fi client is connected.
"day": the LAN client is up; "night": the PC is off and only the embedded Transmission client is running, on an HDD in a USB rack.

I also get various kernel panics under heavy load on the router.

day [887759.048154] mtk_soc_eth 1e100000.ethernet eth0: port 3 link up
night [895194.783012] mtk_soc_eth 1e100000.ethernet eth0: port 3 link down
day [895196.445110] mtk_soc_eth 1e100000.ethernet eth0: port 3 link up
night [895210.473327] mtk_soc_eth 1e100000.ethernet eth0: port 3 link down
[909190.837214] ------------[ cut here ]------------
[909190.841931] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:320 0x8038ba10

skip

[909191.130945] mtk_soc_eth 1e100000.ethernet: PPE started
day [930690.538077] mtk_soc_eth 1e100000.ethernet eth0: port 3 link up
night [930716.048694] mtk_soc_eth 1e100000.ethernet eth0: port 3 link down
day [930718.771747] mtk_soc_eth 1e100000.ethernet eth0: port 3 link up
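
To check the "only when no client is connected" claim against a longer log, a small awk sketch like the following could report how long the link had been down when a warning fired. The function name is hypothetical, and it assumes dmesg lines shaped like the ones above (a bracketed timestamp in seconds, "link up" / "link down" messages, and a "WARNING:" line).

```shell
# Hypothetical helper: for each kernel WARNING that fires while the port is
# down, print the whole seconds elapsed since the last "link down" line.
# Assumes lines carry a [seconds.micros] timestamp as in the excerpt above.
gap_before_warning() {
    awk '
        match($0, /\[ *[0-9]+\.[0-9]+\]/) {
            # Strip the surrounding brackets and convert to a number.
            ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if ($0 ~ /link down/)               { down = ts; is_down = 1 }
            else if ($0 ~ /link up/)            { is_down = 0 }
            else if ($0 ~ /WARNING:/ && is_down) { printf "%d\n", ts - down }
        }
    '
}
```

Run as `dmesg | gap_before_warning`. For the excerpt above, the last "link down" is at 895210 and the warning at 909190, so the port had been down for roughly 13980 seconds (a bit under four hours) when the warning appeared.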

I am on 5.4.60 with almost 5 days of uptime and no issues at all. This is on a 1000/300 PPPoE connection; Wi-Fi is also in use, and so is NAT (with software offload). Speed is around 600/300 now, with multiple speedtests done. No mtk_soc_eth timeouts (or anything else).

And that is without the patches in @nbd's staging tree, right?

Yep, just a normal master from 5 days ago.

Of course, we know from the past that at least 2 weeks have to pass before we can say with relative confidence that the issue is really gone.

MOD: I can see that Felix merged his fixes an hour ago. I think I will compile a new image later today and test with that to see if it changes anything.

The patchset has now hit the master branch: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=b5d425af237dc03327078d6b9be178a38b5f8723

Anyone up for trying it out to see if it fixes their issues?

Just installed it. I disabled packet steering under the global network options. Speed is the same: over a couple of speedtests with PPPoE on a 1000/300 line I got 600/324 fairly consistently. During the download speedtest this is the CPU load:

The first core is at a constant 100% during the speedtest; the rest of the cores are as shown in the picture.

And before these patches you needed packet steering in order to hit these speeds, correct? That means performance wise these patches are doing really well. Let's hope the bug this topic is about is also finally fixed! We'll see in a couple of weeks I guess. Fingers crossed.

No. In previous versions I used packet steering and the speed was similar, though the cores were loaded more evenly.

MOD: I have now tried it with packet steering enabled, and it behaves the same as without packet steering on the latest master.

I'm running 19.07.1 with the FC off and interrupt handling patches, and it's been 100% stable for months. Not a single transmit timed out error, 0 interrupt errors, and no hangs or reboots. The longest uptime has been 37 days, cut short only by power outages.

What I did to achieve this was assign each Ethernet port its own VLAN (or several), never putting more than one port in the same VLAN (and thus not using the integrated switch), and remove the OpenVPN tap0 interface from the br-lan bridge.
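
As a rough illustration of the one-VLAN-per-port idea, the sketch below prints swconfig-style stanzas of the kind found in /etc/config/network. It is not the poster's actual configuration: the switch name "switch0", the tagged CPU port 6, and VLAN IDs starting at 10 are all assumptions, so check them against your own device before using anything like this.

```shell
# Hypothetical generator: print one switch_vlan stanza per switch port.
# Assumptions (not from the thread): the switch is called "switch0",
# the CPU port is 6 and carries every VLAN tagged, VLAN IDs start at 10.
one_vlan_per_port() {
    vid=10
    for port in "$@"; do
        printf "config switch_vlan\n"
        printf "\toption device 'switch0'\n"
        printf "\toption vlan '%s'\n" "$vid"
        printf "\toption ports '%s 6t'\n\n" "$port"
        vid=$((vid + 1))
    done
}

# Example: five ports, each alone in its own VLAN.
one_vlan_per_port 0 1 2 3 4
```

Because no VLAN contains two external ports, traffic between ports always has to pass through the CPU rather than being switched in hardware, which is the effect described above. Each VLAN would still need a matching interface definition, which is left out here.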

Which interrupt handling patches specifically? Could you maybe link all the patches that you applied so there is no confusion about the patches you used?

Also, how often did you run into this issue before you applied these patches?

Edit: Also, on what device is that? Is that with WiFi enabled?

Disable packet steering and try this. Does it make any difference?

echo "4" > "/sys/class/net/eth0/queues/rx-0/rps_cpus"
echo "4" > "/proc/irq/21/smp_affinity"
echo "8" > "/proc/irq/23/smp_affinity"
echo "8" > "/proc/irq/24/smp_affinity"
echo "2" > "/sys/class/net/wan/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/pppoe-wan/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan1/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan2/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan3/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan4/queues/rx-0/rps_cpus"
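
The values written above are hexadecimal CPU bitmasks, where bit N selects CPU N. A minimal sketch for decoding them (the function name is made up for illustration):

```shell
# Hypothetical helper: list the CPU indices selected by a hex affinity mask,
# as used by smp_affinity and rps_cpus (bit N set means CPU N is allowed).
mask_to_cpus() {
    mask=$((0x$1))
    cpu=0
    out=""
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="${out:+$out }$cpu"
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    echo "$out"
}

mask_to_cpus 4   # prints 2
mask_to_cpus 8   # prints 3
mask_to_cpus 2   # prints 1
```

Read that way, the commands above steer eth0 receive processing and IRQ 21 to CPU 2, IRQs 23 and 24 to CPU 3, and receive processing for the wan, pppoe-wan, and lan1 to lan4 interfaces to CPU 1. What IRQs 21, 23, and 24 belong to depends on the device; /proc/interrupts will show it.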

The older patches that were discussed in this thread:

https://patchwork.ozlabs.org/project/openwrt/patch/20200211101741.17350-1-ynezz@true.cz/

https://patchwork.ozlabs.org/project/openwrt/patch/20191029172328.85861-2-rosenp@gmail.com/

Only with these two patches do the transmit timed out errors disappear completely. Without them, I had them anywhere from several times a day to once every two days. But even with those patches I had reboots and freezes. Finally, since I separated the Ethernet ports into different VLANs, it has not happened again.

The router is an ER-X. For those who have Wi-Fi, the interrupt errors will not go away completely, because Wi-Fi is what causes them. But with these patches and a separate VLAN for each port, there should be no more transmit timed out errors, reboots, or hangs caused by the internal switch.
