Mtk_soc_eth watchdog timeout after r11573

Mushoz · August 29, 2020, 10:43am

And you did get timeouts without those patches applied? If so, how often would these trigger? I am still going to be skeptical until we can get 3+ weeks of uptime without the issue. I am currently using a snapshot from the 19.07 branch with the flow control patches, and I see the issue roughly once per week.

Rising_Sun · August 29, 2020, 11:43am

Yeah, the Xiaomi Mi R3G was spamming the log with timeouts when I ran iperf with a 5 GHz Wi-Fi client. A test size of 5 GB was sufficient to trigger the bug. The client was an iPhone XS and the server was an Odroid N2 connected by cable. The transfer rate was roughly 340 Mbit/s for seconds 1 to 8, then 0 Mbit/s for seconds 9 to 18, with the timeout showing up in the log.

Edit: at least the bug is no longer triggered constantly on trunk.
My Mi R3G has only 1.5 days of uptime, so I can't say anything about the long term.
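
For anyone who wants to put a number on this kind of log spam, here is a minimal sketch that counts the timeout lines in a kernel log. It assumes the messages contain either "transmit timed out" (the driver's wording) or "transmit queue N timed out" (the generic kernel watchdog wording). The function name count_tx_timeouts is made up for illustration, and the pattern may need adjusting for other kernel versions.

```shell
# Count transmit-timeout lines in kernel log text read from stdin.
# Assumption: the messages contain "transmit timed out" or
# "transmit queue N timed out"; adjust the pattern if yours differ.
count_tx_timeouts() {
    # grep -c prints 0 and returns non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -c -E 'transmit (queue [0-9]+ )?timed out' || true
}
```

On the router this could be fed from the kernel log, for example with dmesg piped into count_tx_timeouts, once before and once after an iperf run.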

neheb · August 30, 2020, 12:47am

Most likely this one: https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=blob;f=target/linux/generic/pending-5.4/770-03-net-ethernet-mtk_eth_soc-fix-unnecessary-tx-queue-st.patch;h=fd28bb86925b773abe332d0fd897cefac2022312;hb=59d236f11df7539cfd9524a8d7857faefe40f74b

Quite interesting...

Enig123 · August 30, 2020, 4:18am

Didn't see this patch in master. Does that mean master has no problem on this, if it's confirmed?

neheb · August 30, 2020, 4:37am

I'm guessing it does. I have no hardware to test.

Enig123 · August 30, 2020, 4:54am

My D-Link DIR-860L B1 has been up and running for over 4 days, and I just checked and didn't find anything unusual in either dmesg or logread.

However, it has been used only as a wireless AP; it hasn't done any routing for a long time.

shevalier · August 30, 2020, 12:17pm

Point-to-point testing is just one row in the NAT table.
A more correct test is point (LAN) to multipoint (WAN),
with a few thousand NAT translations.

Rising_Sun · August 30, 2020, 1:09pm

My test didn't use NAT at all, or did it? It was done completely within the local network, and the Xiaomi 3G router is set up as a dumb AP. What I meant by "5 GB was sufficient to trigger the bug" is that the network became unreliable just from transferring large files. I don't know whether the bug still exists in corner cases, but before the patch OpenWrt trunk was not reliable on my Xiaomi 3G because the network stalled for around 5 seconds very often.

shevalier · August 31, 2020, 6:49am

I don't think this is a single problem.
I get "error 320" exactly when no LAN / Wi-Fi client is connected.
"day" - LAN client is up, "night" - PC is off, just the embedded Transmission running on an HDD in a USB rack

There are also various kernel panics under heavy load on the router.

day [887759.048154] mtk_soc_eth 1e100000.ethernet eth0: port 3 link up
night [895194.783012] mtk_soc_eth 1e100000.ethernet eth0: port 3 link down
day [895196.445110] mtk_soc_eth 1e100000.ethernet eth0: port 3 link up
night [895210.473327] mtk_soc_eth 1e100000.ethernet eth0: port 3 link down
[909190.837214] ------------[ cut here ]------------
[909190.841931] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:320 0x8038ba10

skip

[909191.130945] mtk_soc_eth 1e100000.ethernet: PPE started
day [930690.538077] mtk_soc_eth 1e100000.ethernet eth0: port 3 link up
night [930716.048694] mtk_soc_eth 1e100000.ethernet eth0: port 3 link down
day [930718.771747] mtk_soc_eth 1e100000.ethernet eth0: port 3 link up
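
To line up link changes with the warning in a log like the one above, a small awk filter can reduce it to timestamps and events. This is only a sketch: the function name summarize_events is hypothetical, and it assumes the "port N link up/down" and "WARNING:" wording seen in this log.

```shell
# Reduce kernel log text on stdin to "timestamp event" lines.
# Assumption: link changes look like "... port N link up|down" and
# the watchdog trace starts with a "WARNING:" line, as in this thread.
summarize_events() {
    awk '
    /port [0-9]+ link (up|down)/ || /WARNING:/ {
        # Pull the [seconds.micros] timestamp from anywhere in the line.
        if (!match($0, /\[ *[0-9]+\.[0-9]+\]/)) next
        ts = substr($0, RSTART + 1, RLENGTH - 2)
        gsub(/ /, "", ts)
        if ($0 ~ /WARNING:/) print ts, "warning"
        else print ts, "port", $(NF-2), $NF
    }'
}
```

In the log above, the last link down before the warning is at 895210 and the warning is at 909190, so roughly 3.9 hours passed with no LAN client connected.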

dchard · August 31, 2020, 8:39pm

I am on 5.4.60 with almost 5 days of uptime and no issues at all. This is on a 1000/300 PPPoE connection; Wi-Fi is also used, and NAT is also used (SW offload). Speed is around 600/300 now, with multiple speedtests done. No mtk_soc_eth timeouts (or anything else).

Mushoz · August 31, 2020, 9:05pm

And that is without the patches in @nbd's staging tree, right?

dchard · August 31, 2020, 11:19pm

Yep, just a normal master from 5 days ago.

Of course, we know from the past that at least 2 weeks have to pass before we can say with reasonable confidence that the issue is really gone.

MOD: I can see that Felix merged his fixes an hour ago. I think I will compile a new image later today and test with that to see if it changes anything.

Mushoz · September 1, 2020, 6:10pm

The patchset has now hit the master branch: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=b5d425af237dc03327078d6b9be178a38b5f8723

Anyone up for trying it out to see if it fixes their issues?

dchard · September 1, 2020, 7:16pm

Just installed it. I disabled packet steering under the global network options. Speed is the same: I did a couple of speedtests, and with PPPoE on a 1000/300 line I got 600/324 relatively consistently. During the download speedtest this is the CPU load:

The first core is at a constant 100% during the speedtest; the rest of the cores are as shown in the picture.

Mushoz · September 1, 2020, 7:32pm

And before these patches you needed packet steering in order to hit these speeds, correct? That means performance wise these patches are doing really well. Let's hope the bug this topic is about is also finally fixed! We'll see in a couple of weeks I guess. Fingers crossed.

dchard · September 1, 2020, 8:33pm

No. In previous versions I used packet steering and the speed was similar, though the cores were loaded more evenly.

MOD: now I tried it with packet steering enabled, and it has the same behavior as without packet steering on the latest master

apocalypse · September 4, 2020, 12:49am

I'm running 19.07.1 with the FC off and interrupt handling patches, and it's been 100% stable for months. Not a single transmit timed out error, 0 interrupt errors, and no hangs or reboots. The longest uptime has been 37 days, and only because of power outages.

What I did to achieve this was assign each Ethernet port its own VLAN (or several), with never more than one port in the same VLAN (so the integrated switch is not used), and remove the OpenVPN interface tap0 from the br-lan bridge.
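
As a rough illustration of the one-port-per-VLAN layout, here is a sketch that prints swconfig-style switch_vlan sections for /etc/config/network. It is not my exact configuration: the function name gen_port_vlans, the switch name switch0, the CPU port number 6 and the VLAN IDs counting up from 1 are all assumptions, so check them against your own device before using anything like this.

```shell
# Print one switch_vlan section per port so that no two ports share a VLAN.
# Usage: gen_port_vlans DEVICE CPU_PORT PORT...
# Assumptions: swconfig-style config (as on 19.07), VLAN IDs start at 1,
# and the CPU port is tagged ("t") in every VLAN.
gen_port_vlans() {
    device=$1
    cpu_port=$2
    shift 2
    vlan=1
    for port in "$@"; do
        printf "config switch_vlan\n"
        printf "\toption device '%s'\n" "$device"
        printf "\toption vlan '%s'\n" "$vlan"
        printf "\toption ports '%s %st'\n\n" "$port" "$cpu_port"
        vlan=$((vlan + 1))
    done
}
```

Calling gen_port_vlans switch0 6 0 1 2 3 4 would print five sections, one per port. Each VLAN still needs its own interface section on top of this, which the sketch does not cover.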

Mushoz · September 4, 2020, 6:13am

Which interrupt handling patches specifically? Could you maybe link all the patches that you applied so there is no confusion about the patches you used?

Also, how often did you run into this issue before you applied these patches?

Edit: Also, on what device is that? Is that with WiFi enabled?

MeIsReallyBa · September 4, 2020, 9:47am

Disable packet steering and try this. Does it make any difference?

echo "4" > "/sys/class/net/eth0/queues/rx-0/rps_cpus"
echo "4" > "/proc/irq/21/smp_affinity"
echo "8" > "/proc/irq/23/smp_affinity"
echo "8" > "/proc/irq/24/smp_affinity"
echo "2" > "/sys/class/net/wan/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/pppoe-wan/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan1/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan2/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan3/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan4/queues/rx-0/rps_cpus"

apocalypse · September 4, 2020, 8:39pm

The old ones that were discussed in this thread:

https://patchwork.ozlabs.org/project/openwrt/patch/20200211101741.17350-1-ynezz@true.cz/

https://patchwork.ozlabs.org/project/openwrt/patch/20191029172328.85861-2-rosenp@gmail.com/

Only with these two patches do the transmit timed out errors disappear completely. Without them, I had them anywhere from several times a day to once every two days. But even with those patches I had reboots and freezes. Since I separated the Ethernet ports into different VLANs, it has not happened again.

The router is an ER-X. For those who have Wi-Fi, the interrupt errors will not go away completely, because the Wi-Fi itself causes them. But with these patches and a separate VLAN for each port, there should be no more transmit timed out errors, reboots or hangs caused by the internal switch.