Mtk_soc_eth watchdog timeout after r11573

For those who are interested, the topic I linked above contains patches that disable flow control on ALL MACs instead of only one, disable it globally as well, AND disable pause frame advertisement on the PHYs. I have tested them on my home router and everything is running fine.

However, this router has always been stable, so I'm not sure whether the patches actually fix the "transmit queue timed out" issue. I will deploy them to the router that has the issue this week. If anyone else wants to test them, feel free :)
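For anyone who wants to check the flow control state at runtime before or after applying such patches, here is a minimal sketch. It is not part of the patches; it assumes the `ethtool` package is installed and that the interface is called `eth0`. The helper name `pause_state` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -a <iface>` output on stdin and
# prints "on" if RX or TX pause is reported as on, otherwise "off".
pause_state() {
    if grep -Eq '^(RX|TX):[[:space:]]*on'; then
        echo on
    else
        echo off
    fi
}
```

Usage on the router would be `ethtool -a eth0 | pause_state`. Where the driver supports it, `ethtool -A eth0 autoneg off rx off tx off` asks for pause frames to be turned off at runtime, but the driver may reject this, and it is no substitute for the driver patches.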

It seems this patch https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=b59d5c8f0eebb6d15d7cefe487c17fad0ee4a524 solves the PPPoE drop issue.

On the other hand, can someone tell me what the recommended settings for mt7621 are? In the last couple of weeks I noticed that with the default settings (packet steering and software flow offload ON) I cannot reach more than 350 Mbit/s, and only a single core is utilized (with PPPoE).

That flow-offload commit?!

Does it solve it for all cases of flow-offload (off, software, hardware)?

It only solves the PPPoE disconnect issue. It does not bring HW offload; I tried it. However, it would be very nice if someone could clarify what these commits actually achieve and what the recommended settings are, since it is quite clear that the default settings plus software offload alone are far from enough.

Ohh, that's sad that it doesn't enable HW offloading. :(

Ahh sorry!

I wanted to ask:

  • Does it solve the disconnect issue when the flow-offload is off?
  • Does it solve the disconnect issue when the flow-offload is software?
  • Does it solve the disconnect issue when the flow-offload is hardware?

When the PPPoE disconnect issue presented itself, it affected all of the above cases. It did not matter whether any type of offload was enabled or not.

I'm testing it now on my R6850 router (mt7621a/t). I was getting lots of modem hangups on PPPoE, roughly every hour or two. So far, 5 hours in, PPPoE is still up with that commit.

Packet steering with software offload became the default in one of the commits months ago. I can't recall which one, but it said that packet steering provides more performance when enabled along with SW offload. Maybe you can try it now with HW NAT: with that latest MT76 patch, HW NAT works on my device (judging by the fact that SQM is ignored when it is enabled, as intended; that's as far as I can test with my capabilities).

Packet steering and SW offload are enabled, yet without tweaking kernel parameters this default combination tops out at 350 Mbit/s and is limited to a single core. This is clearly not the desired behavior.

What kernel parameters are required to use multiple cores? I am running into the same issue with the master branch.

I am using this post as base, but not completely following it:

https://forum.openwrt.org/t/mtk-soc-eth-watchdog-timeout-after-r11573/50000/282
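The linked post is not reproduced here, so the following is only a sketch of one common way to spread receive processing over several cores (Receive Packet Steering via sysfs), not necessarily what that post recommends. It assumes the interface is `eth0` and that there are four logical CPUs, as the `CPU: 3` in the traces in this thread suggests. The function names are made up for illustration.

```shell
#!/bin/sh
# Print the hex bitmask that selects the first N CPUs (N=4 gives "f").
cpu_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

# Write that mask to the rps_cpus file of every RX queue of an interface.
# The sysfs root is the optional third argument, so the function can be
# tried against a scratch directory before touching the real /sys.
set_rps() {
    iface=$1
    ncpu=$2
    root=${3:-/sys}
    mask=$(cpu_mask "$ncpu")
    for q in "$root"/class/net/"$iface"/queues/rx-*; do
        [ -e "$q/rps_cpus" ] || continue
        echo "$mask" > "$q/rps_cpus"
    done
}
```

On the router this would be called as `set_rps eth0 4`. The setting does not survive a reboot, and OpenWrt's own packet steering option may overwrite it.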

I found an interesting patch set among the preliminary 5.9 kernel support in Felix's repository:

The PPE (packet processing engine) is used to offload NAT/routed or even bridged flows. This patch brings up the PPE and uses it to get a packet hash. It also contains some functionality that will be used to bring up flow offloading later.

https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=blob;f=target/linux/generic/pending-5.9/770-15-net-ethernet-mediatek-mtk_eth_soc-add-support-for-in.patch;h=a68f3f6307d4c7b9de42178a2549e06feada5500;hb=97992f99bcb5c8a7bad54317d76e0eaa946ef3f4

As far as I can see, this also affects mt7621, so there is a good chance we will eventually see hardware flow offload on our platform.

Software flow offload works. Hardware offload is another matter...

I was almost sure this bug was fixed in the latest trunk with the DSA changes and patches, but it occurred again. The router continued to work after the exception.

Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.792725] ------------[ cut here ]------------
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.797500] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x2fc/0x304
Fri Nov  6 18:21:44 2020 kern.info kernel: [178602.805855] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.812890] Modules linked in: ksmbd pppoe ppp_async iptable_nat batman_adv xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD pppox ppp_generic nf_nat nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack mt76x2u mt76x2e mt76x2_common mt76x02_usb mt76x02_lib mt7603e mt76_usb mt76 mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG wireguard slhc nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_filter ip_tables crc_ccitt compat ledtrig_usbport ledtrig_heartbeat nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ip6_udp_tunnel udp_tunnel nls_utf8 sha512_generic sha256_generic libsha256 seqiv jitterentropy_rng drbg md5 md4 hmac ghash_generic gf128mul gcm ecb des_generic libdes ctr cmac ccm arc4 leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common crc32c_generic
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.900247] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.74 #0
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.906239] Stack : ffffffff 8007d454 80680000 80681564 806e0000 8068152c 80680680 8fc11db4
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.914655]         80820000 8fc3c724 806c8ce3 80618ff0 00000002 00000001 8fc11d58 3a466414
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.923072]         00000000 00000000 80860000 00000000 00000030 00000189 2e352064 34372e34
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.931492]         00000000 000022b6 00000000 70617773 00000000 806e0000 00000000 8044f254
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.939904]         00000009 00000002 00200000 00000122 00000003 80338a50 00000008 80820008
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.948326]         ...
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.950864] Call Trace:
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.953418] [<8000b72c>] show_stack+0x30/0x100
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.957939] [<805605d0>] dump_stack+0xa4/0xdc
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.962379] [<8002c00c>] __warn+0xc0/0x10c
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.966550] [<8002c0e4>] warn_slowpath_fmt+0x8c/0xac
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.971604] [<8044f254>] dev_watchdog+0x2fc/0x304
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.976391] [<80096280>] call_timer_fn.isra.34+0x20/0x90
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.981768] [<800964c8>] run_timer_softirq+0x1d8/0x230
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.986974] [<80581134>] __do_softirq+0x16c/0x334
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.991761] [<80030778>] irq_exit+0x98/0xb0
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.996020] [<802da714>] plat_irq_dispatch+0x64/0x104
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178603.001139] [<80006de8>] except_vec_vi_end+0xb8/0xc4
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178603.006192] [<805807e8>] r4k_wait_irqoff+0x1c/0x24
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178603.011200] ---[ end trace 11e8faf90187f74e ]---
Fri Nov  6 18:21:44 2020 kern.err kernel: [178603.015904] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
Fri Nov  6 18:21:44 2020 kern.info kernel: [178603.022607] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
Fri Nov  6 18:21:44 2020 kern.err kernel: [178603.029898] mtk_soc_eth 1e100000.ethernet: PPE table busy
Fri Nov  6 18:21:44 2020 kern.info kernel: [178603.061390] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
Fri Nov  6 18:21:44 2020 kern.info kernel: [178603.069470] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
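
To see how often this happens on a given router, the watchdog lines can be counted in the system log. A minimal sketch, assuming the log lines look like the `NETDEV WATCHDOG` line above; the function name is made up for illustration:

```shell
#!/bin/sh
# Count "NETDEV WATCHDOG: ... timed out" lines in a log read from stdin.
# grep -c exits non-zero when the count is 0, so the status is masked.
count_tx_timeouts() {
    grep -c 'NETDEV WATCHDOG:.*timed out' || true
}
```

On OpenWrt this can be fed from the log buffer with `logread | count_tx_timeouts`.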

I found that the DSA-driven mt7530 switch can now have its VLANs configured through UCI. netifd added support for this in its latest commit.

These are my UCI settings:
uci set network.sw=interface
uci set network.sw.type='bridge'
uci add network bridge-vlan
uci set network.@bridge-vlan[0].device='br-sw'
uci set network.@bridge-vlan[0].vlan='1'
uci set network.@bridge-vlan[0].ports='lan1:t lan2 lan3'
uci add network bridge-vlan
uci set network.@bridge-vlan[1].device='br-sw'
uci set network.@bridge-vlan[1].vlan='3'
uci set network.@bridge-vlan[1].ports='lan1:t lan4'
uci add network bridge-vlan
uci set network.@bridge-vlan[2].device='br-sw'
uci set network.@bridge-vlan[2].vlan='4'
uci set network.@bridge-vlan[2].ports='lan1:t'
uci set network.lan.ifname='br-sw.1 bat0'

The VLANs are set correctly, and the IPTV multicast data is transmitted stably on the VLAN, but batman-adv triggers a kernel warning:

[   27.882353] batman_adv: bat0: Adding interface: br-sw.4
[   27.887639] batman_adv: bat0: The MTU of interface br-sw.4 is too small (1500) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1560 would solve the problem.
[   27.911932] batman_adv: bat0: Interface activated: br-sw.4
[   32.067719] ------------[ cut here ]------------
[   32.072383] WARNING: CPU: 3 PID: 0 at net/bridge/br_switchdev.c:46 br_handle_frame_finish+0xac/0x4ac
[   32.081510] Modules linked in: xt_connlimit pppoe ppp_async nf_conncount iptable_nat batman_adv xt_state xt_nat xt_helper xt_conntrack xt_connmark xt_connbytes xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD pppox ppp_generic nft_redir nft_nat nft_masq nft_flow_offload nft_ct nft_chain_nat nf_nat nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack_netlink nf_conntrack mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_recent xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG wireguard usblp ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda slhc nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject_bridge nft_reject nft_quota nft_objref nft_numgen nft_meta_bridge nft_log nft_limit nft_hash nft_fwd_netdev nft_dup_netdev nft_counter nf_tables_set nf_tables nf_reject_ipv4 nf_log_ipv4 nf_dup_netdev
[   32.081719]  nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_filter ip_tables crc_ccitt compat ledtrig_usbport ledtrig_heartbeat xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 msdos ip_gre gre ip_tunnel vfat fat fscache nls_utf8 nls_iso8859_1 nls_cp437 geneve udp_tunnel ip6_udp_tunnel uas usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 mbcache jbd2 usbcore nls_base usb_common crc32c_generic
[   32.240618] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.4.75 #0
[   32.246518] Stack : 00000000 80083974 00000001 00000001 806f0000 806f3000 806f2104 8fc25c7c
[   32.254845]         808d0000 8fc8ccc4 8073f0a7 8068589c 00000003 00000001 8fc25c20 0a0b3b2e
[   32.263175]         00000000 00000000 80910000 00000000 00000030 00000165 342e3520 2035372e
[   32.271498]         00000000 00000001 00000000 0003abea 00000000 80770000 00000000 0000002e
[   32.279822]         00000009 8073cf6c 8fc25e90 80740000 00000002 80380cc0 0000000c 808d000c
[   32.288145]         ...
[   32.290583] Call Trace:
[   32.293053] [<8000c11c>] show_stack+0x30/0x100
[   32.297510] [<805b7d50>] dump_stack+0xa4/0xdc
[   32.301873] [<8002d4b8>] __warn+0xc0/0x120
[   32.305955] [<8002d574>] warn_slowpath_fmt+0x5c/0xac
[   32.310908] [<80593f80>] br_handle_frame_finish+0xac/0x4ac
[   32.316372] [<80594718>] br_handle_frame+0x398/0x4e4
[   32.321345] [<80455734>] __netif_receive_skb_core+0x268/0xb10
[   32.327093] [<80456000>] __netif_receive_skb_one_core+0x24/0x50
[   32.333010] [<8045631c>] process_backlog+0x9c/0x178
[   32.337896] [<80457e88>] __napi_poll+0x40/0x1a0
[   32.342426] [<8045818c>] net_rx_action+0x114/0x28c
[   32.347224] [<805d9c90>] __do_softirq+0x198/0x458
[   32.351927] [<800328b0>] irq_exit+0x98/0xb0
[   32.356121] [<80320050>] plat_irq_dispatch+0x64/0x104
[   32.361165] [<80007328>] except_vec_vi_end+0xb8/0xc4
[   32.366114] [<805d9120>] r4k_wait_irqoff+0x1c/0x24
[   32.371053] ---[ end trace 9a2b408b492abc33 ]---

The router does not seem to be affected and the mesh node can be seen, but data cannot be forwarded on the VLAN.

Finally, with 19.07.4 the router has locked up again: after several consecutive reboots I had to disconnect it from the power supply, yet there was no transmit timed out error. The configuration is the same one it had on 19.07.1, which had been stable for months. The lockup coincided with my swapping the 2A power supply I installed in May back for the original one.

I don't know whether this was the cause; it seems strange to me that the original power supply cannot deliver the current the router needs (the router draws 5W maximum and the power supply is rated for 6W). I have gone back to 19.07.1 to see what happens with the original power supply, which I never used during the stable period.

I have reverted to 19.07.1 with the same configuration that was not giving me problems, and it keeps restarting or hanging after 24 hours. Using iperf from WAN to LAN, I can make it hang or reboot almost immediately, which speeds up the testing process.

I have tried different Switch VLAN configurations, disabling all ports except eth0 (in VLAN2) and eth1 (in VLAN1). I have also disabled Software Flow Offloading, stopped OpenVPN, etc., all without success. Everything indicates that the power adapter that comes with the EdgeRouter X is either insufficient or defective.

This may be useful for developers, or not.

[    0.000000] Linux version 4.14.206 (builder@buildhost) (gcc version 7.5.0 (OpenWrt GCC 7.5.0 r11242-6703abb7ca)) #0 SMP Wed Nov 25 05:02:08 2020
[    0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
*
*
[ 2103.572330] conntrack: generic helper won't handle protocol 47. Please consider loading the specific helper module.
[ 4153.099302] ------------[ cut here ]------------
[ 4153.103943] WARNING: CPU: 0 PID: 7 at net/sched/sch_generic.c:320 0x8038cd70

I don't have any services that use protocol 47 (Cisco GRE).
Most often, the "protocol 47" error is followed by the "sch_generic.c:320" warning.

This is useful. Never buy hardware that the developers don't have.

GRE protocol is used by PPTP. Make sure no device on the network is using it. It has nothing to do with the kernel panic.

There are definitely no devices or services on the network that use any type of VPN.
If a VPN were in use, the entries in the log would appear constantly.
This is a one-time error occurring a few minutes before the sch_generic.c:320 warning.
It is obviously connected with overloading of some of the SoC's blocks or buses, or with wrong timings or buffers.

PS. I have been playing around with the CPU/OCP/SYS divider via the bootstrap resistors.
880/293/220 MHz is more stable than 880/220/220 MHz.
The bandwidth of the OCP bus matters.

The hangs and reboots came back after four stable months, and I can't find a way to make the router work properly again.

While it was working correctly, it was running 19.07.1 with GMAC Port 5 FC Off and the Interrupt Handling Patch (https://patchwork.ozlabs.org/project/openwrt/patch/20190306040846.21746-1-rosenp@gmail.com/). Each Ethernet port was in a different VLAN (or in more than one). In this setup it was 100% stable, but I made the changes below, and since then I have had hangs and reboots when pushing the connection hard:

  1. I updated to 19.07.4.

  2. I added another managed switch in cascade to port eth1 where I connected the machines that were on ports eth2, eth3 and eth4. Thus I eliminated the need to use software-bridge between the different VLANs of each port. I disabled the ports eth2, eth3 and eth4 (they do not belong to any VLAN). Ports eth0 (WAN) and eth1 (LAN) remain unchanged, VLAN2 and VLAN1 respectively.

  3. I reconnected the original power supply; while the router was stable, I had another one in use to rule out problems.

I have tried to revert these changes to the previous working setup, while keeping the Switch and not connecting anything to ports eth2, eth3 and eth4 (although they have a VLAN assigned and belong to a software bridge). The same problem continues: if I use the WAN intensively, communication between the Ethernet ports is lost and the router cannot be accessed, or it reboots outright. The syslog is clean, with no transmit timed out or kernel crash.

I cannot find a logical explanation, having rolled the setup back (almost) to how it was before. Actually, the only change right now is that there is nothing connected to ports eth2, eth3 and eth4. I already had a Switch connected to eth1; I just added another one in cascade.

I will compile 19.07.1 without Interrupt Handling Patch, which was reverted in branch 19.07 for alleged problems. But I remember that without that patch, the logs were flooded with transmit timed out errors on a daily basis.