Mtk_soc_eth watchdog timeout after r11573

So, I am back on the latest master and the WiFi issue is gone, but the PPPoE link keeps dropping:

[  975.364411] mt7530 mdio-bus:1f wan: Link is Down
[  975.379349] mt7530 mdio-bus:1f wan: configuring for phy/gmii link mode
[  975.392953] 8021q: adding VLAN 0 to HW filter on device wan
[  975.498268] mt7530 mdio-bus:1f wan: configuring for phy/gmii link mode
[  975.511931] 8021q: adding VLAN 0 to HW filter on device wan
[  979.602660] mt7530 mdio-bus:1f wan: Link is Up - 1Gbps/Full - flow control off
[  984.958268] pppoe-digi: renamed from ppp0

It repeats every 5-8 minutes. Seems to be something with the switch driver. Hopefully @nbd can take a look.

I already deleted and recreated the PPPoE connection from scratch.
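
For anyone who wants to measure how often the link actually flaps instead of eyeballing it, here is a minimal sketch that pulls the "Link is Down" timestamps out of saved `dmesg` output and prints the gaps between them. It assumes the log lines look like the ones quoted above; the function name and the `port` parameter are made up for illustration.

```python
import re

# Matches the kernel timestamp and the port name in lines such as
# "[  975.364411] mt7530 mdio-bus:1f wan: Link is Down"
LINK_DOWN = re.compile(r"^\[\s*(\d+\.\d+)\]\s.*\b(\w+): Link is Down")


def link_down_intervals(log_text, port="wan"):
    """Return the seconds between consecutive 'Link is Down' events on `port`."""
    stamps = []
    for line in log_text.splitlines():
        m = LINK_DOWN.match(line.strip())
        if m and m.group(2) == port:
            stamps.append(float(m.group(1)))
    # Pair each event with the next one and take the difference.
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]
```

Feed it the text of `dmesg` (or `logread`) saved to a file; gaps of roughly 300-480 seconds would match the "every 5-8 minutes" pattern described here.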

Very weird. I am also on PPPoE but with zero issues. I am only seeing the log spam regarding the resetting WiFi. Your WiFi issues went away by power cycling? How did you power cycle the device exactly?

"Your WiFi issues went away by power cycling?"

Yes. Just pulled the plug for 10 seconds.

As for PPPoE, I had to revert to a one-month-old version, as the connection drops every 5 minutes on the latest master.

Ah, I am on 19.07.4. Is that version fine for you as well for your PPPoE connection?

I have been on DSA for months now and cannot revert to 4.19 without reconfiguring the whole router, so I would rather not test that, if you don't mind 🙂

But again: reverting to SNAPSHOT r14295-05b8e84362 fixed the PPPoE issue as well.

Hah, there is a mysterious link change bug on ath79 too! 🤔

http://lists.openwrt.org/pipermail/openwrt-devel/2020-September/031466.html

Tried today's snapshot: PPPoE still fails.
Went back to this one: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=2c2fcbd2e0f856f460040b8c67530ca27fa323e7 and it works fine.

The latest 19.07.4 makes my Xiaomi Router 4A 100M hit "NETDEV WATCHDOG: eth0 (mtk_soc_eth)" several times in a single day. At least there is still about 20MB of RAM left, compared to 19.07.3, which eats RAM like crazy. Still, once that NETDEV WATCHDOG pops up, the 2.4GHz radio is killed while 5GHz stays alive. If you give it 5-10 minutes, the 2.4GHz WiFi recovers on its own, with no log entries related to the issue. If you try to force-restart the 2.4GHz radio, it just throws a "device not ready" error, with several "mt76_wmac 10300000.wmac: MCU message 8 (seq 8) timed out" lines in the logs.

On 19.07.3 it works fine for a few days (about 12 days) without any issue before it crashes: LuCI shows an out-of-memory error and SSH stops working too (the process was probably killed because the RAM ran out). I remember that out of 58MB of RAM, only about 2MB was left at 10 days of uptime. I don't know why it got so low, given that I'm only using this as an AP with a VLAN-per-SSID setup (4 SSIDs on different isolated networks) and only about 7 devices connected most of the time. For some unknown reason, the 5GHz part of the WiFi keeps working while 2.4GHz stays dead until I force a power cycle of the Xiaomi 4A.

I reverted to 19.07.3 and just put an auto-reboot cron job on it that runs every 5 days. That works fine for me, at least.

Your problem with the RAM is strange. I have several HG556a units, also with 64MB, running 19.07.4, and 31MB remains free after 18 days of uptime, the same as right after a reboot. Before that they ran a 19.07 snapshot and I had no problems with RAM. And they are working as routers with several packages installed (openvpn etc.).

As of now, only about 5.15MB is total available; yesterday that was around 20MB, with free at around 30+MB.

To be honest, I don't know what's going on: no extra packages installed, no routing, no NAT, no firewall, just a plain AP.
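
Since "free" and "total available" keep getting mixed up in this discussion, here is a rough sketch for logging both numbers over time. It assumes they correspond to the `MemFree` and `MemAvailable` fields of `/proc/meminfo`; I have not checked that this is exactly how LuCI computes its figures. The function names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip anything that is not "Name:   <number> [kB]".
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(text):
    """Return free and available memory in MB, rounded to two decimals."""
    info = parse_meminfo(text)
    return {
        "free_mb": round(info["MemFree"] / 1024, 2),
        "available_mb": round(info["MemAvailable"] / 1024, 2),
    }
```

Running `memory_summary(open("/proc/meminfo").read())` once a day (or on a copy of the file pulled from the router) should show whether available memory really shrinks steadily with uptime.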

You have 12.96MB Free.

I'll just post a later one; it drops to about 3MB available with <5MB free once it reaches 5+ days of uptime.

Edit: it happened sooner than I thought... here's what happened to LuCI now:


I can't access SSH since it was probably killed when the RAM ran out.

Unfortunately, I ran into the same issue again with 19.07.4, so it's definitely not fixed:

[923671.532181] ------------[ cut here ]------------
[923671.541563] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:320 0x8038c0d0
[923671.555786] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[923671.569816] Modules linked in: pppoe ppp_async pppox ppp_generic nf_nat_pptp nf_conntrack_pptp nf_conntrack_ipv6 mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE ebtable_nat ebtable_filter ebtable_broute cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_FLOWOFFLOAD xt_DSCP xt_CT xt_CLASSIFY wireguard ts_fsm ts_bm slhc nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_redirect nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_conntrack_ipv4 nf_nat_ipv4 nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table
[923671.710859]  nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_tftp nf_conntrack_snmp nf_conntrack_sip nf_conntrack_rtcache nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp nf_conntrack_broadcast ts_kmp nf_conntrack_amanda iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables ebtables ebt_vlan ebt_stp ebt_redirect ebt_pkttype ebt_mark_m ebt_mark ebt_limit ebt_among ebt_802_3 crc_ccitt compat sch_cake nf_conntrack sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred ledtrig_usbport xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport
[923671.853050]  ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ifb ip6_udp_tunnel udp_tunnel leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
[923671.915309] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.195 #0
[923671.927608] Stack : 00000000 00000000 00000000 87f49540 00000000 00000000 00000000 00000000
[923671.944415]         00000000 00000000 00000000 00000000 00000000 00000001 87c0bd60 53261622
[923671.961218]         87c0bdf8 00000000 00000000 00007ca8 00000038 8049c858 00000008 00000000
[923671.978023]         00000000 80550000 000df76d 70617773 87c0bd40 00000000 00000000 8050aed8
[923671.994831]         8038c0d0 00000140 00000001 87f49540 00000008 802ad210 00000004 806b0004
[923672.011640]         ...
[923672.016682] Call Trace:
[923672.016704] [<8049c858>] 0x8049c858
[923672.028825] [<8038c0d0>] 0x8038c0d0
[923672.035942] [<802ad210>] 0x802ad210
[923672.043078] [<8000c1a0>] 0x8000c1a0
[923672.050188] [<8000c1a8>] 0x8000c1a8
[923672.057293] [<804856b4>] 0x804856b4
[923672.064397] [<80071ab0>] 0x80071ab0
[923672.071498] [<8002e608>] 0x8002e608
[923672.078604] [<8038c0d0>] 0x8038c0d0
[923672.085783] [<8002e690>] 0x8002e690
[923672.092891] [<871b1b04>] 0x871b1b04 [mt7603e@871b0000+0x9100]
[923672.104486] [<800550e8>] 0x800550e8
[923672.111641] [<8038c0d0>] 0x8038c0d0
[923672.118764] [<8038bf24>] 0x8038bf24
[923672.125870] [<80088568>] 0x80088568
[923672.132972] [<8005f214>] 0x8005f214
[923672.140083] [<80088824>] 0x80088824
[923672.147189] [<80079158>] 0x80079158
[923672.154294] [<804a3658>] 0x804a3658
[923672.161396] [<80032fb4>] 0x80032fb4
[923672.168498] [<8025a5f0>] 0x8025a5f0
[923672.175607] [<80007488>] 0x80007488
[923672.182709] 
[923672.186002] ---[ end trace a950af9663dd6943 ]---
[923672.195392] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[923672.207893] mtk_soc_eth 1e100000.ethernet eth0: dma_cfg:80000065
[923672.220054] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0, base=06de0000, max=0, ctx=4023, dtx=4023, fdx=3913, next=4023
[923672.241895] mtk_soc_eth 1e100000.ethernet eth0: rx_ring=0, base=06380000, max=0, calc=1081, drx=1082
[923672.264375] mtk_soc_eth 1e100000.ethernet: 0x100 = 0x6060000c, 0x10c = 0x80818
[923672.284501] mtk_soc_eth 1e100000.ethernet: PPE started
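
To compare reports like this one across devices, a small sketch along these lines could pick the watchdog events out of a saved kernel log and convert the timestamp to days of uptime. It only assumes the "NETDEV WATCHDOG" line format shown in the trace above; the function name and the returned dict keys are my own choice.

```python
import re

# Matches lines such as
# "[923671.555786] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out"
WATCHDOG = re.compile(
    r"^\[\s*(\d+)\.\d+\]\s+NETDEV WATCHDOG: (\S+) \((\S+)\): "
    r"transmit queue (\d+) timed out"
)


def watchdog_events(log_text):
    """Return one dict per NETDEV WATCHDOG line found in the log text."""
    events = []
    for line in log_text.splitlines():
        m = WATCHDOG.match(line.strip())
        if m:
            seconds = int(m.group(1))  # whole seconds since boot
            events.append({
                "uptime_days": round(seconds / 86400, 2),
                "iface": m.group(2),
                "driver": m.group(3),
                "queue": int(m.group(4)),
            })
    return events
```

For the trace above this reports a single event on `eth0` after roughly ten and a half days of uptime.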

What is your switch configuration?

Even with those patches I still had problems. They did not stop until I put each Ethernet port in a separate VLAN.

My PPPoE issue is still not resolved, but I managed to narrow it down:

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=17e64b9447959858c5c85f7f6c98264775585711 - PPPoE works.

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=770a9c678756462ab0d94656f1fcc30624a31bd0 - PPPoE fails every couple minutes.

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=283cdb30ab70e01ddeae3bb74a4c719bc460d3e9 - PPPoE fails every couple minutes.

So something went wrong between "kernel: bump 5.4 to 5.4.65" and "kernel: bump 5.4 to 5.4.66". As there were only two commits that could cause this (related to ramips/mediatek), I will narrow it down further.

I still find it strange that no one else is having this issue with the latest snapshots...
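
The narrowing-down described here is a plain bisection between a known-good and a known-bad commit, which is what `git bisect` automates. A minimal sketch of the idea follows, assuming the commits are ordered oldest to newest and that once the bug appears it stays in all later commits; `first_bad_commit` and the `is_good` callback are hypothetical names (in practice `is_good` means building, flashing, and watching PPPoE for ten minutes or so).

```python
def first_bad_commit(commits, is_good):
    """Binary search for the first failing commit.

    `commits` is ordered oldest to newest; commits[0] must be known good
    and commits[-1] known bad. `is_good` is a caller-supplied test.
    """
    lo, hi = 0, len(commits) - 1  # lo: newest known good, hi: oldest known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]
```

With N commits between the good and bad builds this takes about log2(N) test builds, so even a few dozen commits need only five or six flashes.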

You are not the only one. See this post and the one right below it: Xiaomi Mi Router 4A Gigabit Edition (R4AG/R4A Gigabit) -- fully supported and flashable with OpenWRTInvasion - #1186 by tsipizic

Same issue after a week of uptime on 19.07.4 (MikroTik RBM33G); this bug is not fixed.

Found it!

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=f0cc5f6c0a72b7da9ed5915cf561e2f81d514c68

This commit breaks PPPoE. With the commit right before it, PPPoE works perfectly.

Sent a mail to @nbd with logs.

MOD: if I revert this single commit, latest snapshot starts to work just fine.

Hi! Same problem here with Xiaomi Mi Router snapshots (model R2100 with SoC MT7621). PPPoE disconnects every few minutes...

The 19.07.4 version from @scp07 works perfectly; the connection is always stable.

I still want to try and disable flow control on ALL ports instead of just the CPU port, to see if that makes any difference for how often this issue crops up. Unfortunately, my knowledge is falling short of being able to write a patch myself. If there are any developers willing to collaborate, please have a look at the new topic I've started: Mt7621 / mt7530 programming: Disabling Flow Control on all ports