Mtk_soc_eth watchdog timeout after r11573

Great, one user already commented long ago that with each port in a separate VLAN he had no transmit timeouts or reboots, but I thought it was because he got a "good" SoC. I will try it. At the moment I am using the mt7530_fix patch with fcoff, and although I get transmit timeouts every 2 days, it does not seem to restart or hang. Are you using any patches, or a stable version as it comes? Is there any performance degradation for local transfers between the ports of the software bridge, or higher CPU usage?

I still don't want to switch to kernel 5.4 because of the "complexity" of configuring VLANs with the DSA driver.

Unfortunately, it seems like I jinxed it. The connection was down and the router unreachable for ~20 minutes. The issue eventually cleared itself up, but the dreaded error was in the kernel log at the time of the outage.

@dchard and others that are testing a build with the 5.4 kernel, is this problem fixed on the 5.4 kernel? How is overall stability on these bleeding edge builds? Is it worth using over a 19.07.2 build on production hardware? Or should I bite the bullet and abandon this platform?

My mir3g has an uptime of 3 days with r13042 + my modifications and I haven't run into the problem.
However, I don't have a very fancy setup. Just a USB 4G LTE dongle and, at the moment, 3-4 WiFi clients connected on 5 GHz.

I think it's not a MediaTek-specific problem.

When I had an 80 Mbit uplink on a 100 Mbit port, I had no problems.
Now, with a 100 Mbit uplink on a 100 Mbit port and FC enabled on the provider's switch, I get "transmit queue 0 timed out" on all my OpenWrt devices, including MT7621 (Mi3G).

For example, after 3-10 minutes of torrenting at 100 Mbit download on a
tl-wr741nd ver 2.4, target ar71xx/ath79:

[ 713.044509] ------------[ cut here ]------------
[ 713.049263] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:306 0x802971f8()
[ 713.056534] NETDEV WATCHDOG: eth0 (ag71xx): transmit queue 0 timed out
[ 713.063096] Modules linked in: ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables gpio_button_hotplug
[ 713.120963] CPU: 0 PID: 0 Comm: swapper Not tainted 4.4.153 #0
[ 713.126828] Stack : 803938d8 00000000 00000001 803f0000 00000000 00000000 00000000 00000000
[ 713.126828] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 713.126828] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 713.126828] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 713.126828] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 713.126828] ...
[ 713.162738] Call Trace:[<80071c44>] 0x80071c44
[ 713.167279] [<80071c44>] 0x80071c44
[ 713.170783] [<80081ac8>] 0x80081ac8
[ 713.174289] [<802971f8>] 0x802971f8
[ 713.177811] [<80081b24>] 0x80081b24
[ 713.181320] [<800b0e54>] 0x800b0e54
[ 713.184843] [<802971f8>] 0x802971f8
[ 713.188342] [<8024c57c>] 0x8024c57c
[ 713.191847] [<8029701c>] 0x8029701c
[ 713.195367] [<800b0e54>] 0x800b0e54
[ 713.198871] [<800b10d8>] 0x800b10d8
[ 713.202370] [<800a8fd0>] 0x800a8fd0
[ 713.205894] [<80084054>] 0x80084054
[ 713.209396] [<800ac3ac>] 0x800ac3ac
[ 713.212896] [<800a88a4>] 0x800a88a4
[ 713.216418] [<8006aa50>] 0x8006aa50
[ 713.219926] [<80060bf8>] 0x80060bf8
[ 713.223417]
[ 713.224926] ---[ end trace c952f5c8864e6bd0 ]---
[ 713.229554] eth0: tx timeout
[ 823.044261] eth0: tx timeout
[ 913.043891] eth0: tx timeout
[ 983.859962] device wlan0 left promiscuous mode

mediatek

[883282.702056] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:320 0x8038b620
[883282.709182] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[883282.716192] Modules linked in: pppoe ppp_async pppox ppp_generic nf_conntrack_ipv6 mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat sg ledtrig_usbport nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 msdos vfat fat autofs4 nls_utf8 nls_koi8_r nls_iso8859_1 nls_cp866 nls_cp852 nls_cp850 nls_cp437
[883282.787064]  nls_cp1251 nls_cp1250 uas usb_storage sd_mod scsi_mod ext4 mbcache jbd2 crc32c_generic leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
[883282.804792] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.171 #0
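
For anyone who wants to compare logs like the ones posted here, this is a minimal sketch that pulls the watchdog lines out of saved `dmesg` output and reports the interface, driver and queue. The function name `find_watchdog_timeouts` and the regular expression are my own assumptions, based only on the line format seen in this thread; other kernels may word the message differently.

```python
import re

# Pattern derived from the log lines quoted in this thread (an assumption,
# not an official format): "[ts] NETDEV WATCHDOG: iface (driver): transmit queue N timed out"
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s*NETDEV WATCHDOG: (?P<iface>\S+) "
    r"\((?P<driver>[^)]+)\): transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(log_text):
    """Return one dict per NETDEV WATCHDOG line found in log_text."""
    events = []
    for match in WATCHDOG_RE.finditer(log_text):
        events.append(
            {
                "timestamp": float(match.group("ts")),  # seconds since boot
                "iface": match.group("iface"),
                "driver": match.group("driver"),
                "queue": int(match.group("queue")),
            }
        )
    return events
```

Save the output of `dmesg` to a file, read it into a string and pass it to `find_watchdog_timeouts()`. The `eth0: tx timeout` follow-up lines are deliberately not matched, so each entry corresponds to one watchdog warning.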

It's such a shame flow control is giving us so many issues, despite flow control being super stupid even if it was working properly. Nice comment on why you should hate flow control: https://www.reddit.com/r/networking/comments/1vosfv/flow_control/


Unfortunately it has NOT been fixed in kernel 5.4: ER-X-SFP: VLANs not working properly with kernel 5.4

@pmelange is running a branch which has flow control enabled: https://github.com/openwrt/openwrt/pull/2798#issuecomment-618935412

I am curious whether this issue is still present with the default flow control setting on the master branch.

If flow control is what causes these problems, why do they continue on 4.14 after disabling it? I still think this should have been fixed before switching to kernel 5.4 with the DSA driver.

Sounds SFP specific.

I know this doesn't help much, but I've been running 20 days without issues, having both 100mbit and gigabit devices.

Well, which patches did you apply?

I see high amounts of interrupt errors when checking /proc/interrupts. Most likely from the wifi chips. Is this of any relevance? On my r7800 I don't have any errors ...
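
In case it helps to watch that number over time, here is a small sketch that extracts the error counter from the contents of /proc/interrupts. It assumes the errors appear as an `ERR:` row, which is how many kernels report them; the function name `read_err_count` is hypothetical, and if your errors show up in a different row this will not find them.

```python
import re

# Assumption: the counter of interest is the "ERR:" row of /proc/interrupts.
ERR_RE = re.compile(r"^\s*ERR:\s+(\d+)", re.MULTILINE)


def read_err_count(interrupts_text):
    """Return the ERR counter as an int, or None if there is no ERR row."""
    match = ERR_RE.search(interrupts_text)
    if match is None:
        return None
    return int(match.group(1))
```

Copy the output of `cat /proc/interrupts` into a string at two points in time and compare the two results. A counter that keeps climbing under load is more interesting than a large but static value.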

I tried several times to compile and flash kernel 5.4 firmware for ramips MT7621, but the network and switch never work. Pings to the router always time out. For now I am staying on the 19.07-SNAPSHOT branch with kernel 4.14.176.

When flashing a 5.4 build, did you make sure you did NOT keep settings?

The network and switch never work, no matter whether I keep settings or reset to factory settings.

Thanks, it does help me a lot, as I need to stay on 4.14 / 19.07.x base, and these patches seem to stabilize the router.

I have a specific issue that just surfaced when switching from a Netgear CM600 to a CM1000 cable modem. The router constantly reboots on the CM1000, so I suspect the modem is sending pause frames and that causes the router to crash and reboot. So the FC off patch would likely help, right?

Does anyone know how to turn off flow control on port 5 on an existing deployment?
I can't deploy a new build for several weeks, but I would love to be able to run a CLI command or edit a setting to turn off FC so that the unit can run with the CM1000.

I have 16+ days of uptime on 5.4.35, and have no kernel errors whatsoever. (Dlink 860L). And one leg of the router still has flow control enabled (all gigabit).

After 20 days of error-free uptime, today I upgraded to 5.4.42. Let's see if it continues to be stable.

Any news on HW offload?

Hi all, again :frowning:
OpenWrt GCC 7.5.0 r11093-f99b1d1
with PR "ramips: gsw_mt7621: disable PORT 5 MAC RX/TX flow control by default"
After 7 days of uptime:

[650848.993965] ------------[ cut here ]------------
[650848.998716] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0x1ac/0x324
[650849.007055] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[650849.014079] Modules linked in: pppoe ppp_async pppox ppp_generic nf_conntrack_ipv6 mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat sg ledtrig_usbport nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables
[650849.084514]  nf_reject_ipv6 msdos vfat fat autofs4 nls_utf8 nls_koi8_r nls_iso8859_1 nls_cp866 nls_cp852 nls_cp850 nls_cp437 nls_cp1251 nls_cp1250 uas usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 mbcache jbd2 exfat usbcore nls_base usb_common crc32c_generic
[650849.112529] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.14.180 #0
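
The number in square brackets is seconds since boot, so the timestamp on the warning tells you how long the router had been up when the timeout hit. A minimal sketch of the conversion (the helper name `uptime_from_timestamp` is just for illustration):

```python
def uptime_from_timestamp(seconds_since_boot):
    """Convert a kernel log timestamp to whole (days, hours, minutes)."""
    total = int(seconds_since_boot)
    days, remainder = divmod(total, 86400)   # 86400 seconds per day
    hours, remainder = divmod(remainder, 3600)
    minutes = remainder // 60
    return days, hours, minutes


# Timestamp of the watchdog warning in the log above.
print(uptime_from_timestamp(650848.993965))  # (7, 12, 47)
```

So this log was taken roughly 7.5 days after boot, and the earlier mtk_soc_eth log (883282 s) a little over 10 days after boot.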

I think we need to activate the second RGMII link, as the Padavan firmware does, and then no workaround is needed.
OK, there is no 1 Gbps light, but no errors either.

Or you can give the latest master a try. I have been running it for quite some time and have had no errors whatsoever.

Activating the second RGMII link can cause issues. If I remember correctly, one of the developers explains this in the v5.4 PR. Take a look before you proceed.