Mtk_soc_eth watchdog timeout after r11573

Unfortunately, I ran into the same issue again with 19.07.4, so it's definitely not fixed:

[923671.532181] ------------[ cut here ]------------
[923671.541563] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:320 0x8038c0d0
[923671.555786] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[923671.569816] Modules linked in: pppoe ppp_async pppox ppp_generic nf_nat_pptp nf_conntrack_pptp nf_conntrack_ipv6 mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE ebtable_nat ebtable_filter ebtable_broute cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_FLOWOFFLOAD xt_DSCP xt_CT xt_CLASSIFY wireguard ts_fsm ts_bm slhc nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_redirect nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_conntrack_ipv4 nf_nat_ipv4 nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table
[923671.710859]  nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_tftp nf_conntrack_snmp nf_conntrack_sip nf_conntrack_rtcache nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp nf_conntrack_broadcast ts_kmp nf_conntrack_amanda iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables ebtables ebt_vlan ebt_stp ebt_redirect ebt_pkttype ebt_mark_m ebt_mark ebt_limit ebt_among ebt_802_3 crc_ccitt compat sch_cake nf_conntrack sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred ledtrig_usbport xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport
[923671.853050]  ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ifb ip6_udp_tunnel udp_tunnel leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
[923671.915309] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.195 #0
[923671.927608] Stack : 00000000 00000000 00000000 87f49540 00000000 00000000 00000000 00000000
[923671.944415]         00000000 00000000 00000000 00000000 00000000 00000001 87c0bd60 53261622
[923671.961218]         87c0bdf8 00000000 00000000 00007ca8 00000038 8049c858 00000008 00000000
[923671.978023]         00000000 80550000 000df76d 70617773 87c0bd40 00000000 00000000 8050aed8
[923671.994831]         8038c0d0 00000140 00000001 87f49540 00000008 802ad210 00000004 806b0004
[923672.011640]         ...
[923672.016682] Call Trace:
[923672.016704] [<8049c858>] 0x8049c858
[923672.028825] [<8038c0d0>] 0x8038c0d0
[923672.035942] [<802ad210>] 0x802ad210
[923672.043078] [<8000c1a0>] 0x8000c1a0
[923672.050188] [<8000c1a8>] 0x8000c1a8
[923672.057293] [<804856b4>] 0x804856b4
[923672.064397] [<80071ab0>] 0x80071ab0
[923672.071498] [<8002e608>] 0x8002e608
[923672.078604] [<8038c0d0>] 0x8038c0d0
[923672.085783] [<8002e690>] 0x8002e690
[923672.092891] [<871b1b04>] 0x871b1b04 [mt7603e@871b0000+0x9100]
[923672.104486] [<800550e8>] 0x800550e8
[923672.111641] [<8038c0d0>] 0x8038c0d0
[923672.118764] [<8038bf24>] 0x8038bf24
[923672.125870] [<80088568>] 0x80088568
[923672.132972] [<8005f214>] 0x8005f214
[923672.140083] [<80088824>] 0x80088824
[923672.147189] [<80079158>] 0x80079158
[923672.154294] [<804a3658>] 0x804a3658
[923672.161396] [<80032fb4>] 0x80032fb4
[923672.168498] [<8025a5f0>] 0x8025a5f0
[923672.175607] [<80007488>] 0x80007488
[923672.182709] 
[923672.186002] ---[ end trace a950af9663dd6943 ]---
[923672.195392] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[923672.207893] mtk_soc_eth 1e100000.ethernet eth0: dma_cfg:80000065
[923672.220054] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0, base=06de0000, max=0, ctx=4023, dtx=4023, fdx=3913, next=4023
[923672.241895] mtk_soc_eth 1e100000.ethernet eth0: rx_ring=0, base=06380000, max=0, calc=1081, drx=1082
[923672.264375] mtk_soc_eth 1e100000.ethernet: 0x100 = 0x6060000c, 0x10c = 0x80818
[923672.284501] mtk_soc_eth 1e100000.ethernet: PPE started
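
For anyone comparing dumps from several hangs, here is a minimal sketch that pulls the `key=value` fields out of the `tx_ring` line above. The function name `parse_ring_line` is made up for illustration. The values are kept as strings because the log does not say whether the counters are decimal or hexadecimal.

```python
import re

# One line copied from the dump above.
line = ("[923672.220054] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0, "
        "base=06de0000, max=0, ctx=4023, dtx=4023, fdx=3913, next=4023")


def parse_ring_line(text):
    """Return every key=value pair on the line as a dict of strings.

    Hypothetical helper: values stay as strings because the number base
    of the counters is not stated in the log.
    """
    return dict(re.findall(r"(\w+)=([0-9a-fA-F]+)", text))


fields = parse_ring_line(line)
print(fields["ctx"], fields["dtx"], fields["fdx"])
```

Running the same helper over the `rx_ring` line gives `calc` and `drx` in the same way, which makes it easier to line up the values from different reports.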

What is your switch configuration?

Even with those patches I still had problems. They did not stop until I put each Ethernet port in its own VLAN.

My PPPoE issue is still not resolved, but I managed to narrow it down:

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=17e64b9447959858c5c85f7f6c98264775585711 - PPPoE works.

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=770a9c678756462ab0d94656f1fcc30624a31bd0 - PPPoE fails every couple minutes.

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=283cdb30ab70e01ddeae3bb74a4c719bc460d3e9 - PPPoE fails every couple minutes.

So something went wrong between "kernel: bump 5.4 to 5.4.65" and "kernel: bump 5.4 to 5.4.66". As only two commits in that range (those related to ramips/mediatek) can cause this, I will narrow it down further.
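
The narrowing-down described here is a bisection: with one known-good and one known-bad commit, test the midpoint and keep the half that still contains the breakage. A minimal sketch of that logic follows. The function `first_bad_commit`, the `is_bad` callback and the commit labels are hypothetical; in practice `is_bad` means building and flashing that revision and watching whether PPPoE drops, and `git bisect` automates the bookkeeping.

```python
def first_bad_commit(commits, is_bad):
    """Return the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good and commits[-1] is known bad.
    is_bad is a caller-supplied test (hypothetical here).
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Placeholder labels, not real commit hashes.
history = ["good-base", "c1", "c2", "c3", "bad-tip"]
broken = {"c3", "bad-tip"}
print(first_bad_commit(history, lambda c: c in broken))
```

Each round halves the remaining range, so even a few dozen commits need only a handful of test builds.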

I still find it strange that no one else is having this issue with the latest snapshots...

You are not the only one. See this post and the one right below it: Xiaomi Mi Router 4A Gigabit Edition (R4AG/R4A Gigabit) -- fully supported and flashable with OpenWRTInvasion

Same issue after a week of uptime on 19.07.4 (MikroTik RBM33G); this bug is not fixed.

Found it!

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=f0cc5f6c0a72b7da9ed5915cf561e2f81d514c68

This commit breaks PPPoE. With the one before it, PPPoE works perfectly.

Sent a mail to @nbd with logs.

MOD: if I revert this single commit, the latest snapshot works just fine.

Hi! Same problem here with Xiaomi Mi Router snapshots (model R2100 with SoC MT7621). PPPoE disconnects every few minutes...

The 19.07.4 @scp07 version works perfectly; the connection is always stable.

I still want to try and disable flow control on ALL ports instead of just the CPU port, to see if that makes any difference for how often this issue crops up. Unfortunately, my knowledge is falling short of being able to write a patch myself. If there are any developers willing to collaborate, please have a look at the new topic I've started: Mt7621 / mt7530 programming: Disabling Flow Control on all ports

For those who are interested, the topic I linked above contains patches to disable flow control on ALL MACs instead of only one, disable it globally as well, AND disable pause frame advertisement on the PHYs. I have tested them on my home router and everything is running fine.

However, this router has always been stable, so I am not sure whether they actually fix the "transmit queue timed out" issue. I will deploy them to the router that has the issue this week. If anyone else wants to test them, feel free :slight_smile:

It seems this patch https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=b59d5c8f0eebb6d15d7cefe487c17fad0ee4a524 solves the PPPoE drop issue.

On the other hand, can someone tell me what the recommended settings for mt7621 are? In the last couple of weeks I noticed that with default settings (packet steering and soft flow offload ON) I cannot reach more than 350 Mbit/s and only a single core is utilized (with PPPoE).

That flow-offload commit?!

Does it solve it for all cases of flow-offload (off, software, hardware)?

It only solves the PPPoE disconnect issue. It does not bring HW offload; I tried it. However, it would be very nice if someone could clarify what these commits actually achieve and what the recommended settings are, as it is quite clear that the default settings with only software offload enabled are far from enough.

Ohh, that's sad that it doesn't enable HW offloading. :frowning:


Ahh sorry!

I wanted to ask:

  • Does it solve the disconnect issue when the flow-offload is off?
  • Does it solve the disconnect issue when the flow-offload is software?
  • Does it solve the disconnect issue when the flow-offload is hardware?

When the PPPoE disconnect issue presented itself, it affected all of the above cases. It did not matter whether any type of offload was enabled or not.

I am testing it now on my R6850 router (mt7621a/t). I was getting lots of modem hangups on PPPoE, roughly every hour or two. So far, 5 hours in, PPPoE is still up with that commit.

Packet steering with software offload became the default in one of the commits months ago. I can't recall which one, but it said that enabling packet steering along with SW offload gives more performance. Maybe you can try it now with HW NAT, since with that latest MT76 patch, HW NAT works on my device (based on the fact that with it enabled, SQM is ignored as intended; that's as far as I can test with my capabilities).

Packet steering and SW offload are enabled, yet without tweaking kernel parameters this default combination tops out at 350 Mbit/s and is limited to a single core. This is clearly not the desired operation.

What kernel parameters are required to use multiple cores? I am running into the same issue with the master branch.

I am using this post as base, but not completely following it:

https://forum.openwrt.org/t/mtk-soc-eth-watchdog-timeout-after-r11573/50000/282
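
The thread does not spell out which parameters are meant. Assuming they are the usual receive packet steering (RPS) CPU masks, which Linux exposes as a hexadecimal bitmask in `/sys/class/net/<iface>/queues/rx-0/rps_cpus`, here is a minimal sketch of how such a mask is computed. The helper name `cpu_mask` is made up, and the four logical CPUs are an assumption about the mt7621a/t variants discussed here.

```python
def cpu_mask(cpus):
    """Return a hex bitmask string with one bit set per CPU index.

    Hypothetical helper; the result is in the format that the
    rps_cpus sysfs file expects (hex digits, no 0x prefix).
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


# Assumption: four logical CPUs, numbered 0..3.
print(cpu_mask(range(4)))   # steer to all four CPUs
print(cpu_mask([1, 2, 3]))  # keep CPU 0 out of the set
```

Whether spreading the load this way lifts the single-core limit for PPPoE traffic is something to measure on the device; the sketch only shows how the mask value is formed.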

I found an interesting patch set among the preliminary 5.9 kernel support in Felix's repository:

The PPE (packet processing engine) is used to offload NAT/routed or even bridged flows. This patch brings up the PPE and uses it to get a packet hash. It also contains some functionality that will be used to bring up flow offloading later.

https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=blob;f=target/linux/generic/pending-5.9/770-15-net-ethernet-mediatek-mtk_eth_soc-add-support-for-in.patch;h=a68f3f6307d4c7b9de42178a2549e06feada5500;hb=97992f99bcb5c8a7bad54317d76e0eaa946ef3f4

As far as I can see, this also affects mt7621, so there is a good chance we will eventually see flow offload on our platform.

Software flow offload works. Hardware offload is another matter...