Mtk_soc_eth watchdog timeout after r11573

I am on snapshot (with kernel 5.4) and it is very stable. I have not seen any errors so far, no more mtk_soc_eth timeouts. While when on v19.07.X I experienced them daily.

Mind that kernel 5 introduced different architecture for switches - DSA (distributed switch architecture).

Official v19.07.5 build

Perhaps my lock up is something different, these are first 4 traces that happened in the same second.
A lot more traces in logs until i power cycled the router. I was receiving the traces via UDP log, so router was not completely dead from network prospective.

1:

INFO: rcu_sched self-detected stall on CPU
       0-...: (1 GPs behind) idle=0b2/140000000000001/0 softirq=2840147/2840148 fqs=3000
        (t=6000 jiffies g=600482 c=600481 q=97)
NMI backtrace for cpu 0
CPU: 0 PID: 9933 Comm: kworker/0:2 Not tainted 4.14.209 #0
Workqueue: events 0x802409f4
Stack : 00000000 00000000 804d7568 8fc09d24 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000001 8fc09ce0 ac07f5c8
        8fc09d78 00000000 00000000 00005af8 00000038 8049da98 00000008 00000000
        00000000 80550000 000732bc 00000000 8fc09cc0 00000000 00000000 00000000
        80554500 8054c0cc 000000e0 80550000 00000000 802ae190 00000000 806b0000
        ...
Call Trace:
[<8049da98>] 0x8049da98
[<802ae190>] 0x802ae190
[<8000c1a0>] 0x8000c1a0
[<8000c1a8>] 0x8000c1a8
[<804868d4>] 0x804868d4
[<80072c54>] 0x80072c54
[<8048d8f4>] 0x8048d8f4
[<80008f30>] 0x80008f30
[<80008f30>] 0x80008f30
[<8048d9e0>] 0x8048d9e0
[<80086558>] 0x80086558
[<800859a8>] 0x800859a8
[<80089118>] 0x80089118
[<8009a110>] 0x8009a110
[<8009a168>] 0x8009a168
[<8005f3e4>] 0x8005f3e4
[<80332644>] 0x80332644
[<80079328>] 0x80079328
[<800735e0>] 0x800735e0
[<8025bc44>] 0x8025bc44
[<8025be04>] 0x8025be04
[<800735e0>] 0x800735e0
[<804a475c>] 0x804a475c
[<8025b4c0>] 0x8025b4c0
[<80007488>] 0x80007488

2:

INFO: rcu_sched self-detected stall on CPU
       0-...: (1 GPs behind) idle=0b2/140000000000001/0 softirq=2840147/2840148 fqs=11993
        (t=24003 jiffies g=600482 c=600481 q=473)
NMI backtrace for cpu 0
CPU: 0 PID: 9933 Comm: kworker/0:2 Not tainted 4.14.209 #0
Workqueue: events 0x802409f4
Stack : 00000000 00000000 804d7568 8fc09d24 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000001 8fc09ce0 ac07f5c8
        8fc09d78 00000000 00000000 000062a0 00000038 8049da98 00000008 00000000
        00000000 80550000 00078fca 00000000 8fc09cc0 00000000 00000000 00000000
        80554500 8054c0cc 000000e0 80550000 00000000 802ae190 00000000 806b0000
        ...
Call Trace:
[<8049da98>] 0x8049da98
[<802ae190>] 0x802ae190
[<8000c1a0>] 0x8000c1a0
[<8000c1a8>] 0x8000c1a8
[<804868d4>] 0x804868d4
[<80072c54>] 0x80072c54
[<8048d8f4>] 0x8048d8f4
[<80008f30>] 0x80008f30
[<80008f30>] 0x80008f30
[<8048d9e0>] 0x8048d9e0
[<80086558>] 0x80086558
[<800859a8>] 0x800859a8
[<80089118>] 0x80089118
[<8009a110>] 0x8009a110
[<8009a168>] 0x8009a168
[<80332644>] 0x80332644
[<80079328>] 0x80079328
[<800735e0>] 0x800735e0
[<8025bd98>] 0x8025bd98
[<8025bc44>] 0x8025bc44
[<8025be04>] 0x8025be04
[<800735e0>] 0x800735e0
[<804a475c>] 0x804a475c
[<8025b4c0>] 0x8025b4c0
[<80007488>] 0x80007488

Diff 1 and 2

@@ -1,13 +1,13 @@
 INFO: rcu_sched self-detected stall on CPU
-       0-...: (1 GPs behind) idle=0b2/140000000000001/0 softirq=2840147/2840148 fqs=3000
-        (t=6000 jiffies g=600482 c=600481 q=97)
+       0-...: (1 GPs behind) idle=0b2/140000000000001/0 softirq=2840147/2840148 fqs=11993
+        (t=24003 jiffies g=600482 c=600481 q=473)
 NMI backtrace for cpu 0
 CPU: 0 PID: 9933 Comm: kworker/0:2 Not tainted 4.14.209 #0
 Workqueue: events 0x802409f4
 Stack : 00000000 00000000 804d7568 8fc09d24 00000000 00000000 00000000 00000000
         00000000 00000000 00000000 00000000 00000000 00000001 8fc09ce0 ac07f5c8
-        8fc09d78 00000000 00000000 00005af8 00000038 8049da98 00000008 00000000
-        00000000 80550000 000732bc 00000000 8fc09cc0 00000000 00000000 00000000
+        8fc09d78 00000000 00000000 000062a0 00000038 8049da98 00000008 00000000
+        00000000 80550000 00078fca 00000000 8fc09cc0 00000000 00000000 00000000
         80554500 8054c0cc 000000e0 80550000 00000000 802ae190 00000000 806b0000
         ...
 Call Trace:
@@ -26,10 +26,10 @@
 [<80089118>] 0x80089118
 [<8009a110>] 0x8009a110
 [<8009a168>] 0x8009a168
-[<8005f3e4>] 0x8005f3e4
 [<80332644>] 0x80332644
 [<80079328>] 0x80079328
 [<800735e0>] 0x800735e0
+[<8025bd98>] 0x8025bd98
 [<8025bc44>] 0x8025bc44
 [<8025be04>] 0x8025be04
 [<800735e0>] 0x800735e0

3:

INFO: rcu_sched self-detected stall on CPU
       0-...: (1 GPs behind) idle=0b2/140000000000001/0 softirq=2840147/2840148 fqs=17290
        (t=42006 jiffies g=600482 c=600481 q=872)
rcu_sched kthread starved for 7393 jiffies! g600482 c600481 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=2
rcu_sched       R  running task        0     8      2 0x00100000
Stack : 00000001 8fc3aca0 8fc57e38 8fc57e38 8fc57e38 80550000 814be380 00000000
        81495da0 80052c18 01d9e8ef 814be380 8fc57e38 80550000 814be380 00000000
        00000003 8054c1e8 80554610 8049fe18 8fc57e38 80550000 80550000 00000000
        00000003 804a3630 80550000 00000000 8fc57ebc 80068228 00000000 00000001
        00000200 00000000 01d9e8ef 800886a0 8fc3aca0 0c000003 80554500 00000000
        ...
Call Trace:
[<80052c18>] 0x80052c18
[<8049fe18>] 0x8049fe18
[<804a3630>] 0x804a3630
[<80068228>] 0x80068228
[<800886a0>] 0x800886a0
[<80084bc4>] 0x80084bc4
[<80080000>] 0x80080000
[<80081c0c>] 0x80081c0c
[<80084664>] 0x80084664
[<8004c0f8>] 0x8004c0f8
[<8004bfc8>] 0x8004bfc8
[<8004bfc8>] 0x8004bfc8
[<8004bfc8>] 0x8004bfc8
[<80006f78>] 0x80006f78

4:

NMI backtrace for cpu 0
CPU: 0 PID: 9933 Comm: kworker/0:2 Not tainted 4.14.209 #0
Workqueue: events 0x802409f4
Stack : 00000000 00000000 804d7568 8fc09d24 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000001 8fc09ce0 ac07f5c8
        8fc09d78 00000000 00000000 00006f70 00000038 8049da98 00000008 00000000
        00000000 80550000 0009b1b2 00000000 8fc09cc0 00000000 00000000 00000000
        80554500 8054c0cc 000000e0 80550000 00000000 802ae190 00000000 806b0000
        ...
Call Trace:
[<8049da98>] 0x8049da98
[<802ae190>] 0x802ae190
[<8000c1a0>] 0x8000c1a0
[<8000c1a8>] 0x8000c1a8
[<804868d4>] 0x804868d4
[<80072c54>] 0x80072c54
[<8048d8f4>] 0x8048d8f4
[<80008f30>] 0x80008f30
[<80008f30>] 0x80008f30
[<8048d9e0>] 0x8048d9e0
[<8005384c>] 0x8005384c
[<80051840>] 0x80051840
[<80086558>] 0x80086558
[<8008646c>] 0x8008646c
[<800859a8>] 0x800859a8
[<80089118>] 0x80089118
[<8009a110>] 0x8009a110
[<8009a168>] 0x8009a168
[<80332644>] 0x80332644
[<80079328>] 0x80079328
[<800735e0>] 0x800735e0
[<8025bd98>] 0x8025bd98
[<8025bc44>] 0x8025bc44
[<8025be04>] 0x8025be04
[<800735e0>] 0x800735e0
[<804a475c>] 0x804a475c
[<8025b4c0>] 0x8025b4c0
[<80007488>] 0x80007488

PS: not familiar how to decode these for openwrt, this is:

# cat /etc/openwrt_release
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='19.07.5'
DISTRIB_REVISION='r11257-5090152ae3'
DISTRIB_TARGET='ramips/mt7621'
DISTRIB_ARCH='mipsel_24kc'
DISTRIB_DESCRIPTION='OpenWrt 19.07.5 r11257-5090152ae3'
DISTRIB_TAINTS=''
# uname -a
Linux gw 4.14.209 #0 SMP Sun Dec 6 07:31:03 2020 mips GNU/Linux

For now 7 days and 5 hours with master branch r15223 (Kernel 5.4.82) without any problem.

Spent some time searching and at a loss - what's the way to decode the code traces in openwrt ?

Someone is reporting that disabling TSO fixed the issue for them. They used a patch that can be found here: https://bugs.openwrt.org/index.php?do=details&task_id=2628

It might also be possible to disable TSO via ethtool, but I am not sure if that would also work for fixing it. Has anyone tried this potential solution?

2 Likes

Because of SQM I always disable as much offloads as possible with TSO among them

And do you ever experience the transmit queue has timed out error?

1 Like

Yes, I did but after master switched over to DSA, I have not experienced them anymore

2 Likes

HWNAT is working now under DSA

1 Like

That’s news to me.

You mean on master? Since when? I have seen no commit that would account for such a change. Running a build from a week ago, yet HW NAT is not working.

I tried master yesterday,hwnat is working

I am compiling a fresh build now. How did you tested it?

xiaomir3g working as iperf3 server(192.168.5.1) with it‘s LAN connecting to phicomm K2P(192.168.1.1) 'wan port.

A compuer servring as iperf3 client and was connected to phicomm k2p's lan port.During the iperf3 test,no cpu loads in phicomm k2p.

2 Likes

Did a fresh compile, and of course HW offload does not work:

This is a DL speedtest of about 600megabits via PPPoE:
HW_offload

When HW offload worked, all cores were about 2-5% loaded during the same test, and I was able to reach 900+Mbits in DL.

I tested it again.
Yes,under pppoe it wont't work.But with dhcp environment it works.

1 Like

Hm... Maybe @nbd can comment on this, as previously (on kernel 4.14) it did worked with both pure IP and PPPoE as well, at least with V4 addresses.

1 Like

Hello. If any of you don't mind answering, what's the current status on this situation?

From my understanding in reading this, hwnat works (for some?) right now on OpenWrt built from master but not on PPPoE (server or client?) and some problem with VLANs but I might be wrong.

I would like to know if the following use case will not have any bugs or problems: Edgerouter X with WAN connection using PPPoE (The ER-X will be a client) and multiple VLANs setup on the switch0 interface (some ports are on one VLAN, some others on another VLAN etc.) while using hwnat. Also does it work on the latest stable build or only on master?

As much as I know, VLANs do work on master as well, but you need to configure it via CLI. PPPoE HW NAT for sure does not work (you can get about 600Mbits of speed via PPPoE in its current form with SW offload enabled), and IPv4 HW NAT is confirmed by a single user.