Mtk_soc_eth watchdog timeout after r11573

I am sorry, but I think I am "barking" up the right tree, as I (and many others) have exactly the same issue as the topic starter. The linked conversation, although I did not start it, may be in the wrong place; nonetheless, the linked part is relevant to this topic, and that is what I would like neheb's comment on.

It's an English idiom, and to the best of my knowledge, the developers working on the mt76 wireless driver aren't working on the MT7621 SoC code. You're free to carry on, of course.

Looks legit.

Note commits like this: https://github.com/openwrt/openwrt/commit/83997146e76d4097e30facf6ad89e5fa3bd7c65b

Not for this platform but it might be a similar problem.

The upstream kernel uses a completely different driver: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/net/dsa/mt7530.c?h=next-20191220

AFAIK, it does not have an issue like this.

That conversation is in the wrong place (as they note in the responses there). However, the final comment actually discusses a (possible) bug in the flow control mechanism in the mt7621 gigabit switch, which is resolved by disabling flow control.

Thanks for the answer, much appreciated!

I will try to get a version compiled with the flow control off patch and test this.

@neheb

FYI, I have been using the patched version for 6 days now, and there are no mtk_soc_eth timeouts or any other kernel errors in the log. The router sits on a 1000/300 Mbit PPPoE FTTH connection, and I also ran quite a few full-speed speed tests with SW and HW offload turned on. Of course this is not a conclusive test yet, but given that the last couple of master builds developed this fault rather quickly (usually within 2-3 days), I would say there is a high chance that this fix actually resolves this long-outstanding issue.

Are there any plans to create a pull request for this patch, to get it mainstreamed?

Also, when the issue was triggered for you, did you experience any issues? Or was it just the error in the log?

On my end, sometimes after an error like this the router's switch part was either very slow or not working at all. This means that the WAN interface was reachable externally, but the LAN side was disconnected from the router, although the LAN devices could still talk to each other. This did not happen every time the timeout appeared in the kernel log, though.

Since the early days of 4.14 I have seen this issue with varying frequency, but the one constant is that the router never went 10 days without developing at least one of these timeouts. So I will continue the test and report back, but nonetheless it looks promising.

What would be beneficial is a pull request where the modifications in this patch can be turned on and off without the need to create a custom build. This way more people can test it.

dchard, can you share the firmware you are testing now?

Not my build; it was created by @Bartvz. This is for the DLink 860L B1 (mt7621).

Please, can you tell me where to find the patch to disable flow control? I only found a very old one for kernel 4.4.

I get those errors every few days. Most of the time nothing happens, but sometimes they leave the router inaccessible and I have to restart it by disconnecting the power. It is currently on 19.07-snapshot (a bit old, kernel 4.14.152). If the router cannot be fixed, it will be trash and I will never buy anything with a MediaTek chip again.

@neheb: I am almost at 10 days of uptime, and with this patch applied there are no kernel errors at all. Since this issue was "introduced", I have never before had a clean kernel log for 10 days. So it seems this patch actually fixes or at least eliminates this issue. Adding it to master would allow more people to test it on more platforms.

Thanks a lot. I will try it as soon as I can. I'm glad to hear that. I've had this problem since I bought the ER-X and installed 17.01.7, through 18.06.x and now 19.07.

Is there anyone that wants to create a pull request in order to get it into master?

How can I create the pull request? Once I have tried it, if it really works, I will.

I've sent it to the mailing list. Like most patches, it will take forever to land upstream.

I have no push access to the OpenWrt repo.

I'm building 18.06.6 but cannot apply the patch. "Hunk # 1 FAILED", "Hunk # 1 succeeded at 98", "Patch failed! Please fix ./patches-4.14/220-mt7621-disable-flow-control.patch".

I have manually edited the gsw_mt7621.c file with the changes. I didn't fix the patch because I don't have experience with them. Compiling...

Response from IRC is to try this if condition instead:

 if (ralink_soc == MT762X_SOC_MT7621AT) {

Sounds bogus to me but ¯_(ツ)_/¯

Sounds bogus to me too.

I'm running my ER-X with the patch. I will report back in 15 days.

After only 15 hours of running, there was a kernel crash and one more timeout with this patch. I'm starting to hate OpenWrt and I'm thinking of going back to EdgeOS.

Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.115113] ------------[ cut here ]------------
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.124343] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0x1ac/0x324
Fri Jan 31 23:32:37 2020 kern.info kernel: [52231.140800] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.154680] Modules linked in: pppoe ppp_async pppox ppp_generic nf_nat_pptp nf_conntrack_pptp nf_conntrack_ipv6 iptable_nat ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_FLOWOFFLOAD xt_DSCP xt_CT xt_CLASSIFY ts_fsm ts_bm slhc nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_rtsp nf_nat_redirect nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_conntrack_ipv4 nf_nat_ipv4 nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_tftp nf_conntrack_snmp nf_conntrack_sip nf_conntrack_rtsp nf_conntrack_rtcache
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.298454]  nf_conntrack_proto_gre nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp nf_conntrack_broadcast ts_kmp nf_conntrack_amanda nf_conntrack iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables tun nls_utf8 nls_iso8859_15 nls_cp852 nls_cp850 nls_cp437 nls_base leds_gpio gpio_button_hotplug
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.370342] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.162 #0
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.382457] Stack : 00000000 8ff49240 805a0000 80070278 805d0000 80567b20 00000000 00000000
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.399091]         8053360c 8fc0fdc4 8fc3cffc 805a2947 8052e638 00000001 8fc0fd68 53261643
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.415727]         00000000 00000000 80600000 00004690 00000000 000000ee 00000008 00000000
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.432354]         00000000 805a0000 0005a6a6 00000000 00000000 805d0000 00000000 8037b920
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.448982]         00000009 00000140 00000003 8ff49240 00000000 8029e618 0000000c 8060000c
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.465609]         ...
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.470465] Call Trace:
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.475348] [<800106a0>] show_stack+0x58/0x100
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.484193] [<8046f074>] dump_stack+0xa4/0xe0
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.492858] [<8002e958>] __warn+0xe0/0x114
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.501000] [<8002e9bc>] warn_slowpath_fmt+0x30/0x3c
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.510891] [<8037b920>] dev_watchdog+0x1ac/0x324
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.520266] [<80087394>] call_timer_fn.isra.3+0x24/0x84
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.530655] [<800875b0>] run_timer_softirq+0x1bc/0x248
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.540887] [<8048c758>] __do_softirq+0x128/0x2ec
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.550250] [<800330d4>] irq_exit+0xac/0xc8
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.558573] [<80254c4c>] plat_irq_dispatch+0xfc/0x138
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.568620] [<8000b5e8>] except_vec_vi_end+0xb8/0xc4
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.578493] [<8000cfb0>] r4k_wait_irqoff+0x1c/0x24
Fri Jan 31 23:32:37 2020 kern.warn kernel: [52231.588183] ---[ end trace c3836cca1bde30a8 ]---
Fri Jan 31 23:32:37 2020 kern.err kernel: [52231.597384] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
Fri Jan 31 23:32:37 2020 kern.info kernel: [52231.609704] mtk_soc_eth 1e100000.ethernet eth0: dma_cfg:80000065
Fri Jan 31 23:32:37 2020 kern.info kernel: [52231.621695] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0, base=0e980000, max=0, ctx=123, dtx=123, fdx=122, next=123
Fri Jan 31 23:32:37 2020 kern.info kernel: [52231.642660] mtk_soc_eth 1e100000.ethernet eth0: rx_ring=0, base=0e020000, max=0, calc=2057, drx=2058
Fri Jan 31 23:32:37 2020 kern.info kernel: [52231.663879] mtk_soc_eth 1e100000.ethernet: 0x100 = 0x5a60000c, 0x10c = 0x80818
Fri Jan 31 23:32:37 2020 kern.info kernel: [52231.683626] mtk_soc_eth 1e100000.ethernet: PPE started

Sat Feb  1 01:09:04 2020 kern.err kernel: [58018.076394] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
Sat Feb  1 01:09:04 2020 kern.info kernel: [58018.088733] mtk_soc_eth 1e100000.ethernet eth0: dma_cfg:80000065
Sat Feb  1 01:09:04 2020 kern.info kernel: [58018.100737] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0, base=0e020000, max=0, ctx=1452, dtx=1452, fdx=1451, next=1452
Sat Feb  1 01:09:04 2020 kern.info kernel: [58018.122400] mtk_soc_eth 1e100000.ethernet eth0: rx_ring=0, base=0c580000, max=0, calc=1274, drx=1275
Sat Feb  1 01:09:04 2020 kern.info kernel: [58018.143655] mtk_soc_eth 1e100000.ethernet: 0x100 = 0x5a60000c, 0x10c = 0x80818
Sat Feb  1 01:09:04 2020 kern.info kernel: [58018.163488] mtk_soc_eth 1e100000.ethernet: PPE started