Mtk_soc_eth watchdog timeout after r11573

ffee · December 7, 2019, 9:38am

I am not sure which update introduced this issue.

[100892.575808] mtk_soc_eth 1e100000.ethernet eth0: port 1 link down
[143070.000845] ------------[ cut here ]------------
[143070.010252] WARNING: CPU: 0 PID: 7 at net/sched/sch_generic.c:320 dev_watchdog+0x1ac/0x324
[143070.026913] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[143070.040973] Modules linked in: pppoe ppp_async pppox ppp_generic nf_conntrack_ipv6 mt76x2e mt76x2_common mt76x02_lib mt76 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_socket xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TPROXY xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_FLOWOFFLOAD xt_DSCP xt_CT xt_CLASSIFY slhc nf_socket_ipv6 nf_socket_ipv4 nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack_netlink nf_conntrack libcrc32c iptable_raw iptable_mangle iptable_filter ipt_ECN ip6table_raw
[143070.183731]  ip_tables crc_ccitt compat fuse tcp_bbr sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred ledtrig_usbport xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ifb vfat fat nls_utf8 nls_iso8859_1 nls_cp437 mmc_block mtk_sd mmc_core leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd ahci libahci libata sd_mod scsi_mod gpio_button_hotplug ext4 mbcache
[143070.324755]  jbd2 usbcore nls_base usb_common crc32c_generic
[143070.336349] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 4.14.156 #0
[143070.349015] Stack : 00000000 8fe93f40 ffffffff 80071364 805b0000 80552c80 00000000 00000000
[143070.365868]         8051e440 8fc55ccc 8fc3a8fc 8058c907 805191e0 00000001 8fc55c70 5326167d
[143070.382717]         00000000 00000000 806f0000 00000000 806f8370 0000016f 00000008 00000000
[143070.399554]         00000000 00000000 000521dd 666f736b 00000000 805b0000 00000000 80375830
[143070.416371]         8054cd20 00000140 00000000 8fe93f40 00000018 8029dda0 00000000 806f0000
[143070.433189]         ...
[143070.438224] Call Trace:
[143070.443286] [<8000c3b4>] show_stack+0x58/0x100
[143070.452316] [<80459be4>] dump_stack+0xa4/0xe0
[143070.461167] [<8002f8b0>] __warn+0xe0/0x140
[143070.469486] [<8002f4f4>] warn_slowpath_fmt+0x30/0x3c
[143070.479535] [<80375830>] dev_watchdog+0x1ac/0x324
[143070.489091] [<80087f9c>] call_timer_fn.isra.28+0x24/0x84
[143070.499834] [<800882fc>] run_timer_softirq+0x1bc/0x248
[143070.510250] [<80477b40>] __do_softirq+0x128/0x2e8
[143070.519788] [<80033ca8>] run_ksoftirqd+0x38/0x6c
[143070.529164] [<80050af8>] smpboot_thread_fn+0x1a8/0x1d8
[143070.539576] [<8004ccc8>] kthread+0x130/0x144
[143070.548245] [<80006fb8>] ret_from_kernel_thread+0x14/0x1c
[143070.559397] ---[ end trace 0f4b12fd74c0987f ]---
[143070.568852] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[143070.581528] mtk_soc_eth 1e100000.ethernet eth0: dma_cfg:80000065
[143070.593838] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0, base=0eb50000, max=0, ctx=3602, dtx=3602, fdx=3601, next=3602
[143070.615851] mtk_soc_eth 1e100000.ethernet eth0: rx_ring=0, base=0d2d0000, max=0, calc=2340, drx=2348
[143070.641498] mtk_soc_eth 1e100000.ethernet: 0x100 = 0x6060000c, 0x10c = 0x80818
[143070.663752] mtk_soc_eth 1e100000.ethernet: PPE started
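For anyone comparing logs, here is a minimal Python sketch that pulls the `NETDEV WATCHDOG` lines out of saved `dmesg` output. The bracketed kernel timestamp is seconds since boot, so it also shows how long the router had been up when the timeout fired. The function name `find_watchdog_timeouts` and the `SAMPLE` text are illustrative assumptions, not part of any OpenWrt tooling; the sample lines are copied from the log above.

```python
import re

# Matches the kernel's watchdog warning, for example:
# [143070.026913] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(log_text):
    """Return one dict per NETDEV WATCHDOG line found in log_text."""
    events = []
    for match in WATCHDOG_RE.finditer(log_text):
        events.append({
            "uptime_s": float(match.group("ts")),  # seconds since boot
            "iface": match.group("iface"),
            "driver": match.group("driver"),
            "queue": int(match.group("queue")),
        })
    return events


# Two lines taken from the log in this post (hypothetical variable name).
SAMPLE = (
    "[100892.575808] mtk_soc_eth 1e100000.ethernet eth0: port 1 link down\n"
    "[143070.026913] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out\n"
)

if __name__ == "__main__":
    for event in find_watchdog_timeouts(SAMPLE):
        days = event["uptime_s"] / 86400
        print(f"{event['iface']} ({event['driver']}) queue {event['queue']} "
              f"timed out after {days:.2f} days of uptime")
```

To use it on a real log, read the saved `dmesg` output into a string and pass that in place of `SAMPLE`.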

neheb · December 8, 2019, 7:02pm

Known issue. Unlikely it will ever get fixed.

dchard · December 8, 2019, 11:20pm

@neheb

Can you elaborate on this? I have read quite a few commits and tickets about this same issue (I am experiencing it myself on an 860L), but this is the very first time I have seen an established and well-respected developer say "Unlikely it will ever get fixed."

I kind of hoped that the ongoing initial ramips 4.19 support plus the 5.x backports might fix this, but if you say it is unlikely, that is not very good news. If it is really unfixable, do you have any recommendations for how to avoid it? For example, at some point there was a recommendation to disable flow control on the connected client devices, which did not work, but any other suggestions would be nice to hear.

neheb · December 9, 2019, 2:42am

Make a script to reboot every day maybe. AFAIK, this happens once every 7 days.
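A rough sketch of such a reboot script, in Python, under some loud assumptions: Python is installed on the router (stock OpenWrt images usually do not ship it), the script is run periodically (for example from cron), and the one-day threshold `MAX_UPTIME_S` is just an illustrative value. It reads the uptime from `/proc/uptime` and calls the system `reboot` command once the threshold is passed.

```python
import subprocess

# Hypothetical threshold: reboot once the router has been up for a day.
MAX_UPTIME_S = 24 * 60 * 60


def read_uptime_seconds(path="/proc/uptime"):
    """Return system uptime in seconds (the first field of /proc/uptime)."""
    with open(path) as f:
        return float(f.read().split()[0])


def should_reboot(uptime_s, max_uptime_s=MAX_UPTIME_S):
    """True once the uptime has reached the configured limit."""
    return uptime_s >= max_uptime_s


def main():
    # Assumes a `reboot` command is on PATH and the script runs as root.
    if should_reboot(read_uptime_seconds()):
        subprocess.run(["reboot"], check=True)


if __name__ == "__main__":
    main()
```

On a router without Python, a plain cron entry that runs `reboot` at a fixed time each day would be the simpler way to get the same effect.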

The 4.19 update will not bring the upstream MediaTek driver (which does not have this issue), because it is slower (DSA and no HW NAT) and does not currently support enough platforms.

dchard · January 1, 2020, 3:12pm

@neheb

Thanks for the answer! Can you please comment on this one: https://github.com/openwrt/mt76/issues/211#issuecomment-569944489

I have not tried it, but it seems to show some good results.

Borromini · January 1, 2020, 3:23pm

Mt76 is wireless. Your issue is with the switch/SoC. Read the comments on that thread.

You are barking up the wrong tree there.

dchard · January 1, 2020, 7:57pm

I am sorry, but I think I am "barking" up the right tree, as I (and many others) have exactly the same issue as the topic starter. The linked conversation, although I did not start it, may be in the wrong place; nonetheless, the linked part is relevant to this topic, and that is what I would like neheb's comment on.

Borromini · January 1, 2020, 9:32pm

It's an English idiom, and to the best of my knowledge, the developers working on the mt76 wireless driver aren't working on the MT7621 SoC code. You're free to carry on, of course.

neheb · January 2, 2020, 2:10am

Looks legit.

Note commits like this: https://github.com/openwrt/openwrt/commit/83997146e76d4097e30facf6ad89e5fa3bd7c65b

Not for this platform but it might be a similar problem.

The upstream kernel uses a completely different driver: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/net/dsa/mt7530.c?h=next-20191220

AFAIK, it does not have an issue like this.

bdleonard · January 2, 2020, 2:13pm

That conversation is in the wrong place (as they note in the responses there). However, the final comment actually discusses a (possible) bug in the flow control mechanism of the mt7621 gigabit switch, which is resolved by disabling flow control.

dchard · January 2, 2020, 3:01pm

Thanks for the answer, much appreciated!

I will try to get a version compiled with the flow control off patch and test this.

dchard · January 22, 2020, 11:17am

@neheb

FYI, I have been using the patched version for 6 days now, and there are no mtk_soc_eth timeouts or any other kernel errors in the log. The router sits on a 1000/300 Mbit PPPoE FTTH connection, and I have also run quite a few full-speed speed tests with SW and HW offload turned on. Of course this is not a conclusive test yet, but given that the last couple of master builds developed this fault rather quickly (usually within 2-3 days), I would say there is a high chance that this patch actually fixes this long-standing issue.

Mushoz · January 22, 2020, 12:21pm

Are there any plans to create a pull request for this patch, to get it mainstreamed?

Also, when the issue was triggered for you, did you experience any issues? Or was it just the error in the log?

dchard · January 22, 2020, 12:42pm

On my end, sometimes after an error like this the router's switch was either very slow or not working at all. That is, the WAN interface was reachable externally, but the LAN side was disconnected from the router, although the LAN devices could still talk to each other. But this did not happen every time the timeout appeared in the kernel log.

Since the early days of 4.14 I have seen this issue with varying frequency, but the one constant is that the router never went 10 days without at least one of these timeouts. So I will continue the test and report back, but nonetheless it looks promising.

It would be beneficial to create a pull request where the modifications in this patch can be turned on and off without the need for a custom build. This way more people can test it.

okaonka · January 22, 2020, 7:11pm

dchard, can you share the firmware you are testing now?

dchard · January 22, 2020, 9:01pm

Not my build; it was created by @Bartvz. This is for the DLink 860L B1 (mt7621).

apocalypse · January 25, 2020, 3:12pm

Please, can you tell me where to find the patch to disable flow control? I only found a very old one for kernel 4.4.

I get those errors every few days. Most of the time nothing happens, but sometimes they leave the router inaccessible and I have to restart it by disconnecting the power. It is currently on 19.07-snapshot (a bit old, kernel 4.14.152). If this is not fixed, the router will be trash and I will never buy anything with a MediaTek chip again.

dchard · January 25, 2020, 3:45pm

@neheb: I am almost at 10 days of uptime, and with this patch applied there are no kernel errors at all. Since this issue was "introduced", I have never had a clean kernel log for 10 days. So it seems this patch actually fixes, or at least eliminates, this issue. Adding it to master would allow more people to test it on more platforms.

apocalypse · January 25, 2020, 4:08pm

Thanks a lot. I will try it as soon as I can. I'm glad to hear that. I've had this problem since I bought the ER-X and installed 17.01.7, through 18.06.x and now 19.07.

Mushoz · January 25, 2020, 7:27pm

Is there anyone that wants to create a pull request in order to get it into master?