Mtk_soc_eth watchdog timeout after r11573

I have been running snapshot r16165-0183ee2eb9 for a while. I moved to it after running into issues with 19.07.7. No issues so far, and I haven't enabled any offloading. BTW, I moved from EdgeOS 2.x to OpenWrt a week ago.

I am using a D-Link D-860L B1 as an AP, running a snapshot that I updated once mt76 got an update. It's very stable now, with far fewer interrupt errors.

While the EdgeRouter is still up and running, I am again experiencing interruptions in the connection several times a day. I will try the snapshot build to see if it makes a difference.

As I was still experiencing connection drops, I reverted to EdgeOS 1.10.11, which is now running without an issue.

I also saw the same issue with the snapshot build until I disabled Software and Hardware Flow Offloading. Since doing that, it has been rock solid, with 12 days of uptime and counting.
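
For anyone who wants to try the same thing from the command line rather than the web interface, here is a minimal sketch of the relevant part of /etc/config/firewall. It assumes the stock firewall package, where both settings live in the existing `defaults` section; the other options in that section are left out here and should be kept as they are.

```
config defaults
	option flow_offloading '0'
	option flow_offloading_hw '0'
```

After editing the file, `/etc/init.d/firewall restart` should apply the change.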

Thank you for this feedback. I am aware that offloading cannot be used together with SQM, but in my case offloading was also disabled. With earlier 19.07 builds I had complete lockups of the router, which occurred randomly. With the newer builds these lockups are gone and the ERX stays up for a long time. Nevertheless, the connection now drops about once or twice a day (also randomly, with messages like "mt7530 mdio-bus:00 lan4: Link is Down"), but only for a minute or so, and everything is back to normal afterwards. This is not a problem for browsing the web, but it is a problem for streaming or video conferencing. That's why I reverted to the stock firmware.

Yes, I am noticing the same now with FaceTime. Video disconnects every 30-40 seconds for a few seconds.

I have been using the 21.02 snapshot without an issue for a while. I use Zoom and the kids use MS Teams, no issues so far.

Are you also using PPPoE with VLAN tagging?

No PPPoE, no VLAN tagging.

So far so good with the 21.02 snapshots for me personally, so very happy with that. Unfortunately, I do still run into another issue, where the port randomly disconnects and reconnects. Anyone experiencing the same issue? This is just getting spammed over and over again in my dmesg log:

[2612337.641359] mt7530 mdio-bus:1f lan1: Link is Down
[2612337.646999] br-lan: port 1(lan1) entered disabled state
[2612378.601396] mt7530 mdio-bus:1f lan1: Link is Up - 1Gbps/Full - flow control rx/tx
[2612378.609641] br-lan: port 1(lan1) entered blocking state
[2612378.615187] br-lan: port 1(lan1) entered forwarding state
[2637278.061031] mt7530 mdio-bus:1f lan1: Link is Down
[2637278.066413] br-lan: port 1(lan1) entered disabled state
[2671089.335938] mt7530 mdio-bus:1f lan1: Link is Up - 100Mbps/Full - flow control off
[2671089.343678] br-lan: port 1(lan1) entered blocking state
[2671089.349144] br-lan: port 1(lan1) entered forwarding state
[2671164.087303] mt7530 mdio-bus:1f lan1: Link is Down
[2671164.092606] br-lan: port 1(lan1) entered disabled state
[2671167.159559] mt7530 mdio-bus:1f lan1: Link is Up - 1Gbps/Full - flow control rx/tx
[2671167.167322] br-lan: port 1(lan1) entered blocking state
[2671167.172800] br-lan: port 1(lan1) entered forwarding state
[2671355.670475] mt7530 mdio-bus:1f lan2: Link is Up - 1Gbps/Full - flow control off
[2671355.678021] br-lan: port 2(lan2) entered blocking state
[2671355.683440] br-lan: port 2(lan2) entered forwarding state
[2671398.678004] mt7530 mdio-bus:1f lan2: Link is Down
[2671398.683057] br-lan: port 2(lan2) entered disabled state
[2671401.750238] mt7530 mdio-bus:1f lan2: Link is Up - 1Gbps/Full - flow control rx/tx
[2671401.757974] br-lan: port 2(lan2) entered blocking state
[2671401.763405] br-lan: port 2(lan2) entered forwarding state
[2671690.516396] mt7530 mdio-bus:1f lan2: Link is Down
[2671690.521708] br-lan: port 2(lan2) entered disabled state
[2671720.212390] mt7530 mdio-bus:1f lan2: Link is Up - 1Gbps/Full - flow control rx/tx
[2671720.220236] br-lan: port 2(lan2) entered blocking state
[2671720.225687] br-lan: port 2(lan2) entered forwarding state

If more people are experiencing this, it might warrant a new topic maybe?
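
To compare ports (or cables) it can help to count the link drops per interface instead of reading the raw log. Below is a minimal sketch that reads kernel log text on stdin and counts the "Link is Down" lines per interface. The function name `count_link_flaps` is made up for this example and is not part of OpenWrt, and it assumes the log lines look like the ones pasted above.

```shell
#!/bin/sh
# Count "Link is Down" events per interface in kernel log text read from stdin.
# count_link_flaps is a hypothetical helper name, not an OpenWrt tool.
count_link_flaps() {
    awk '
        /Link is Down/ {
            # The interface name is the field right before "Link", e.g. "lan1:".
            for (i = 2; i <= NF; i++) {
                if ($i == "Link") {
                    name = $(i - 1)
                    sub(/:$/, "", name)
                    downs[name]++
                    break
                }
            }
        }
        END {
            for (name in downs) print name, downs[name]
        }
    ' | sort
}

# Example: dmesg | count_link_flaps
```

If one port collects nearly all of the drops, swapping its cable to a quiet port is a quick way to tell a cable problem from a port or driver problem.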

I think it may be a bad cable.

It's happening on both port 2 and port 1, though. Two bad cables seem like a stretch?

Edit: Actually, no issues are reported on Port 3. So should be easy enough to test the cable theory. Time to move the cables around a bit :)

Unfortunately, while it's much rarer, this is still an issue on 21.02 RC1:

[2733116.802884] ------------[ cut here ]------------
[2733116.812473] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:448 0x8047e780
[2733116.826901] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[2733116.841147] Modules linked in: xt_connlimit pppoe ppp_async nf_conncount iptable_nat xt_state xt_nat xt_helper xt_conntrack xt_connmark xt_connbytes xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD xt_CT wireguard pppox ppp_generic nf_nat_pptp nf_nat nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack_pptp nf_conntrack_netlink nf_conntrack mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 libchacha20poly1305 libblake2s ipt_REJECT cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_recent xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY ts_kmp ts_fsm ts_bm slhc sch_cake poly1305_mips nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libblake2s_generic iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt compat chacha_mips br_netfilter sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit
[2733116.841336]  act_mirred ledtrig_usbport xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ifb ip6_udp_tunnel udp_tunnel kpp leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
[2733117.117908] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G    B   W         5.4.111 #0
[2733117.132967] Stack : 00000000 80840000 ffffffff 8007d6e0 00000000 00000000 00000000 00000000
[2733117.149937]         00000000 00000000 00000000 00000000 00000000 00000001 87c0dd50 5fc2ce17
[2733117.166908]         87c0dde8 00000000 00000000 00000000 00000038 805e1804 312e342e 23203131
[2733117.183876]         00000000 00021330 00000000 0001cc94 00000000 87c0dd30 00000000 8047e780
[2733117.200851]         00000009 00000000 00200000 00000122 00000001 80359e2c 00000000 80810000
[2733117.217822]         ...
[2733117.223027] Call Trace:
[2733117.223043] [<8007d6e0>] 0x8007d6e0
[2733117.235507] [<805e1804>] 0x805e1804
[2733117.242793] [<8047e780>] 0x8047e780
[2733117.250072] [<80359e2c>] 0x80359e2c
[2733117.257352] [<8000b05c>] 0x8000b05c
[2733117.264633] [<8000b064>] 0x8000b064
[2733117.271907] [<805c6f9c>] 0x805c6f9c
[2733117.279178] [<8007d8ac>] 0x8007d8ac
[2733117.286449] [<8002bfe8>] 0x8002bfe8
[2733117.293729] [<8047e780>] 0x8047e780
[2733117.301004] [<8002c0c0>] 0x8002c0c0
[2733117.308295] [<8047e780>] 0x8047e780
[2733117.315579] [<800a9018>] 0x800a9018
[2733117.322855] [<8047e484>] 0x8047e484
[2733117.330133] [<800965d4>] 0x800965d4
[2733117.337415] [<8007f4c0>] 0x8007f4c0
[2733117.344693] [<80429908>] 0x80429908
[2733117.351966] [<8009681c>] 0x8009681c
[2733117.359245] [<80433bc0>] 0x80433bc0
[2733117.366520] [<80083d2c>] 0x80083d2c
[2733117.373805] [<805e7d1c>] 0x805e7d1c
[2733117.381084] [<80030768>] 0x80030768
[2733117.388355] [<802f8404>] 0x802f8404
[2733117.395626] [<80006c28>] 0x80006c28
[2733117.402895] 
[2733117.406597] ---[ end trace 52e7a6fe65a762ba ]---
[2733117.416180] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[2733117.443658] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
[2733117.483540] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
[2733117.499596] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
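
Since the timeout is this rare, it is easier to scan the log for it than to watch for it. Below is a minimal sketch that reads kernel log text on stdin, picks out the NETDEV WATCHDOG lines, and converts the dmesg timestamp (seconds since boot) into days of uptime. The function name `watchdog_events` is made up for this example, and it assumes the bracketed timestamp format shown in the trace above.

```shell
#!/bin/sh
# List NETDEV WATCHDOG events from kernel log text on stdin, with the dmesg
# timestamp (seconds since boot) converted to days of uptime.
# watchdog_events is a hypothetical helper name, not an OpenWrt tool.
watchdog_events() {
    awk '
        /NETDEV WATCHDOG:/ && match($0, /\[ *[0-9]+\.[0-9]+\]/) {
            # Strip the surrounding brackets from the timestamp.
            seconds = substr($0, RSTART + 1, RLENGTH - 2) + 0
            # The interface name follows the "WATCHDOG:" field.
            for (i = 1; i < NF; i++) {
                if ($i == "WATCHDOG:") {
                    printf "%.1f days: %s\n", seconds / 86400, $(i + 1)
                    break
                }
            }
        }
    '
}

# Example: dmesg | watchdog_events
```

Fed the trace above, this should print `31.6 days: eth0`. The same function can be pointed at a saved log file, as long as the lines still carry the kernel timestamp.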

I have been using the master branch since December, compiling the latest sources every month, and I have not seen that error again.

Are you using a VLAN filter to create a switch (bridge) with ports?

I am not using anything fancy, really. My network configuration file is below. Mind you, the error is super rare (once in 33 days of uptime) and as far as I could see (but maybe I simply wasn't using my network when it happened) I didn't experience any issues due to it. So I am not TOO bothered right now.

root@OpenWrt:~# cat /etc/config/network 

config interface 'loopback'
	option ifname 'lo'
	option proto 'static'
	option ipaddr '127.0.0.1'
	option netmask '255.0.0.0'

config globals 'globals'
	option packet_steering '1'
	option ula_prefix 'redacted'

config interface 'lan'
	option type 'bridge'
	option ifname 'lan1 lan2 lan3 lan4'
	option proto 'static'
	option ipaddr '192.168.103.1'
	option netmask '255.255.255.0'
	option ip6assign '64'

config interface 'wan'
	option ifname 'wan'
	option proto 'static'
	option ipaddr 'redacted'
	option netmask '255.255.255.248'
	option gateway 'redacted'
	list dns '8.8.8.8'
	list dns '8.8.4.4'

config interface 'untrusted'
	option proto 'static'
	option type 'bridge'
	option ipaddr '192.168.104.1'
	option netmask '255.255.255.0'
	option ip6assign '64'

I've had uptimes of over 40 days and haven't seen that error since switching to the master branch. The logs are sent to a syslog server and I check them when I remember.

I don't have any bridge in use. Each interface has only one port assigned.

Looks like the bug is still there. The fixes only reduced the chance of hitting it.

Seems like it. It's super rare, but it's definitely still there. I would say it's usable in the current state, though.

I have a similar problem on OpenWrt 21.02 RC4. I described it in detail here: