Mtk_soc_eth watchdog timeout after r11573

My PPPoE issue is still not resolved, but I managed to narrow it down:

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=17e64b9447959858c5c85f7f6c98264775585711 - PPPoE works.

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=770a9c678756462ab0d94656f1fcc30624a31bd0 - PPPoE fails every couple minutes.

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=283cdb30ab70e01ddeae3bb74a4c719bc460d3e9 - PPPoE fails every couple minutes.

So something went wrong between "kernel: bump 5.4 to 5.4.65" and "kernel: bump 5.4 to 5.4.66". As only two commits in that range could cause this (the ones related to ramips/mediatek), I will narrow it down further.
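
For anyone who wants to repeat this kind of narrowing, the search can be handed to git bisect. This is only a sketch: it assumes a local clone of openwrt.git, takes the good and bad hashes from the links above, and bisect_start is a hypothetical helper name, not something from the OpenWrt tree.

```shell
#!/bin/sh
# Sketch: bisect the OpenWrt tree between the commits quoted in the post.
# Run from inside a clone of openwrt.git.
GOOD=17e64b9447959858c5c85f7f6c98264775585711   # PPPoE works
BAD=770a9c678756462ab0d94656f1fcc30624a31bd0    # PPPoE fails every couple of minutes

# Hypothetical helper: mark the known-bad and known-good endpoints.
bisect_start() {
    git bisect start "$BAD" "$GOOD"
}

# After each build-and-flash cycle, report what was observed:
#   git bisect good    # PPPoE stayed up
#   git bisect bad     # PPPoE dropped
# When git names the first bad commit, clean up with:
#   git bisect reset
```

Each step still needs a full build and a few minutes of PPPoE uptime to judge, so this automates only the bookkeeping, not the testing.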

Still find it strange that no one is having this issue with the latest snapshots...


You are not the only one. See this post and the one right below it: Xiaomi Mi Router 4A Gigabit Edition (R4AG/R4A Gigabit) -- fully supported and flashable with OpenWRTInvasion - #1186 by tsipizic

Same issue after a week of uptime on 19.07.4 (MikroTik RBM33G); this bug is not fixed.

Found it!

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=f0cc5f6c0a72b7da9ed5915cf561e2f81d514c68

This commit breaks PPPoE. With the one before it works perfectly.

Sent a mail to @nbd with logs.

MOD: if I revert this single commit, the latest snapshot works just fine.


Hi! Same problem here with Xiaomi Mi Router snapshots (model R2100 with SoC MT7621). PPPoE disconnects every few minutes...

19.07.4 @scp07 version works perfectly, connection is always stable.

I still want to try and disable flow control on ALL ports instead of just the CPU port, to see if that makes any difference for how often this issue crops up. Unfortunately, my knowledge is falling short of being able to write a patch myself. If there are any developers willing to collaborate, please have a look at the new topic I've started: Mt7621 / mt7530 programming: Disabling Flow Control on all ports


For those who are interested, the above linked topic by me contains patches to disable flow control on ALL MACs instead of only one, disable it globally as well AND disable pause frame advertisement on the PHYs. I have tested it on my home router and everything is running fine.

However, this router has always been stable, so I am not sure whether it actually fixes the "transmit queue timed out" issue. I will deploy it to the router having issues this week. If anyone else wants to test it out, feel free :slight_smile:
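
One way to check whether flow control really ended up disabled is to read the link-up lines the kernel prints, which end in "flow control rx/tx" or "flow control off". A minimal sketch, assuming that log format; flow_control_state is a hypothetical helper name, and the log text is expected on stdin.

```shell
#!/bin/sh
# Print "<interface> <flow control state>" for every link-up line on stdin.
# Assumes the phylink message format:
#   "<dev>: Link is Up - <speed>/<duplex> - flow control <state>"
flow_control_state() {
    sed -n 's/.* \([^ ]*\): Link is Up - .* - flow control \(.*\)$/\1 \2/p'
}

# On the router (assumption: logread or dmesg holds the boot messages):
#   logread | flow_control_state
#   dmesg   | flow_control_state
```

With the patches applied, the expectation would be "off" instead of "rx/tx" for the affected ports, but that is something to verify on the device rather than take from this sketch.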


It seems this patch https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=b59d5c8f0eebb6d15d7cefe487c17fad0ee4a524 solves the PPPoE drop issue.

On the other hand, can someone tell me what the recommended settings for mt7621 are? In the last couple of weeks I noticed that with default settings (packet steering and soft flow offload ON) I cannot reach more than 350 Mbit/s and only a single core is utilized (with PPPoE).


That flow-offload commit?!

Does it solve it for all cases of flow-offload (off, software, hardware)?

It only solves the PPPoE disconnect issue. No HW offload, I tried it. However it would be very nice if someone can clarify what these commits are actually achieving and what are the recommended settings, as it is quite clear that the default settings and only enabling software offload is far from enough.

Ohh, that's sad it doesn't enable HW offloading. :frowning:


Ahh sorry!

I wanted to ask:

  • Does it solve the disconnect issue when the flow-offload is off?
  • Does it solve the disconnect issue when the flow-offload is software?
  • Does it solve the disconnect issue when the flow-offload is hardware?

When PPPoE disconnect issue presented itself, it affected all of the above cases. It did not matter if any type of offload was enabled or not.

Testing it now on my R6850 router (mt7621a/t). Was getting lots of modem hangups on PPPoE, about every hour or two. So far 5 hours in and PPPoE is still up with that commit.

The packet steering with software offload became the default in one of the commits months ago. I can't recall, but it said it provides more performance with it enabled along with SW offload. Maybe you can try it now with HW NAT, since along with that latest MT76 patch, HW NAT on my device works (based on the fact that with it enabled, SQM is ignored as intended. That's as far as I can test with my capabilities).

Packet steering and SW offload are enabled, yet without tweaking kernel parameters, this default combination tops out at 350 Mbit/s and uses a single core. This is clearly not the desired operation.

What kernel parameters are required to use multiple cores? I am running into the same issue with the master branch.

I am using this post as base, but not completely following it:

https://forum.openwrt.org/t/mtk-soc-eth-watchdog-timeout-after-r11573/50000/282
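
The linked post is not reproduced here. As a generic illustration only, one knob that is commonly tuned for multi-core receive processing is Receive Packet Steering, set through the rps_cpus files in sysfs. The sketch below assumes four logical CPUs (as on mt7621) and that eth0 is the relevant device; cpu_mask and set_rps are hypothetical helper names, and whether this actually lifts the 350 Mbit/s limit with PPPoE is untested.

```shell
#!/bin/sh
# Build a hex CPU bitmask covering the first N logical CPUs.
cpu_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

# Write a mask to every RX queue of a device (needs root on the router).
# The third argument overrides the sysfs root, which allows a dry run
# against a scratch directory.
set_rps() {
    dev=$1; mask=$2; root=${3:-/sys}
    for q in "$root"/class/net/"$dev"/queues/rx-*; do
        if [ -e "$q/rps_cpus" ]; then
            echo "$mask" > "$q/rps_cpus"
        fi
    done
}

# Example (assumption: eth0, four logical CPUs):
#   set_rps eth0 "$(cpu_mask 4)"
```

Values written this way do not survive a reboot, and OpenWrt's own packet steering option may overwrite them, so treat it as an experiment rather than a recommended setting.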


I found an interesting patch set among the preliminary 5.9 kernel support in Felix's repository:

The PPE (packet processing engine) is used to offload NAT/routed or even bridged flows. This patch brings up the PPE and uses it to get a packet hash. It also contains some functionality that will be used to bring up flow offloading later.

https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=blob;f=target/linux/generic/pending-5.9/770-15-net-ethernet-mediatek-mtk_eth_soc-add-support-for-in.patch;h=a68f3f6307d4c7b9de42178a2549e06feada5500;hb=97992f99bcb5c8a7bad54317d76e0eaa946ef3f4

As far as I can see, this also affects mt7621, so there is a good chance we will eventually see flow offload on our platform.


Software flow offload works. But hardware is another thing...

I was almost sure this bug was fixed in the latest trunk with the DSA changes and patches, but it occurred again. The router continued to work after the exception.

Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.792725] ------------[ cut here ]------------
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.797500] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x2fc/0x304
Fri Nov  6 18:21:44 2020 kern.info kernel: [178602.805855] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.812890] Modules linked in: ksmbd pppoe ppp_async iptable_nat batman_adv xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD pppox ppp_generic nf_nat nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack mt76x2u mt76x2e mt76x2_common mt76x02_usb mt76x02_lib mt7603e mt76_usb mt76 mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG wireguard slhc nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_filter ip_tables crc_ccitt compat ledtrig_usbport ledtrig_heartbeat nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ip6_udp_tunnel udp_tunnel nls_utf8 sha512_generic sha256_generic libsha256 seqiv jitterentropy_rng drbg md5 md4 hmac ghash_generic gf128mul gcm ecb des_generic libdes ctr cmac ccm arc4 leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common crc32c_generic
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.900247] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.74 #0
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.906239] Stack : ffffffff 8007d454 80680000 80681564 806e0000 8068152c 80680680 8fc11db4
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.914655]         80820000 8fc3c724 806c8ce3 80618ff0 00000002 00000001 8fc11d58 3a466414
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.923072]         00000000 00000000 80860000 00000000 00000030 00000189 2e352064 34372e34
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.931492]         00000000 000022b6 00000000 70617773 00000000 806e0000 00000000 8044f254
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.939904]         00000009 00000002 00200000 00000122 00000003 80338a50 00000008 80820008
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.948326]         ...
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.950864] Call Trace:
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.953418] [<8000b72c>] show_stack+0x30/0x100
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.957939] [<805605d0>] dump_stack+0xa4/0xdc
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.962379] [<8002c00c>] __warn+0xc0/0x10c
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.966550] [<8002c0e4>] warn_slowpath_fmt+0x8c/0xac
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.971604] [<8044f254>] dev_watchdog+0x2fc/0x304
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.976391] [<80096280>] call_timer_fn.isra.34+0x20/0x90
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.981768] [<800964c8>] run_timer_softirq+0x1d8/0x230
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.986974] [<80581134>] __do_softirq+0x16c/0x334
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.991761] [<80030778>] irq_exit+0x98/0xb0
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178602.996020] [<802da714>] plat_irq_dispatch+0x64/0x104
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178603.001139] [<80006de8>] except_vec_vi_end+0xb8/0xc4
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178603.006192] [<805807e8>] r4k_wait_irqoff+0x1c/0x24
Fri Nov  6 18:21:44 2020 kern.warn kernel: [178603.011200] ---[ end trace 11e8faf90187f74e ]---
Fri Nov  6 18:21:44 2020 kern.err kernel: [178603.015904] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
Fri Nov  6 18:21:44 2020 kern.info kernel: [178603.022607] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
Fri Nov  6 18:21:44 2020 kern.err kernel: [178603.029898] mtk_soc_eth 1e100000.ethernet: PPE table busy
Fri Nov  6 18:21:44 2020 kern.info kernel: [178603.061390] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
Fri Nov  6 18:21:44 2020 kern.info kernel: [178603.069470] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
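
To see how often this happens over a long uptime, the watchdog lines can be counted per interface. A small sketch, assuming log lines shaped like the "NETDEV WATCHDOG" line above; tx_timeouts is a hypothetical helper name, and the log text is read from stdin.

```shell
#!/bin/sh
# Count "NETDEV WATCHDOG ... timed out" events per interface.
# Prints one "<interface> <count>" line per interface seen.
tx_timeouts() {
    awk '/NETDEV WATCHDOG:/ && /timed out/ {
             # The interface name is the field right after "WATCHDOG:".
             for (i = 1; i < NF; i++)
                 if ($i == "WATCHDOG:") n[$(i + 1)]++
         }
         END { for (d in n) print d, n[d] }'
}

# On the router:
#   logread | tx_timeouts
```

Note that logread only holds what fits in the log ring buffer, so on a router with a week of uptime older events may already be gone.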

I found that the DSA-driven mt7530 switch can now set VLAN through UCI. Netifd provided support in the latest submission:


These are my UCI settings:
uci set network.sw=interface
uci set network.sw.type='bridge'
uci add network bridge-vlan
uci set network.@bridge-vlan[0].device='br-sw'
uci set network.@bridge-vlan[0].vlan='1'
uci set network.@bridge-vlan[0].ports='lan1:t lan2 lan3'
uci add network bridge-vlan
uci set network.@bridge-vlan[1].device='br-sw'
uci set network.@bridge-vlan[1].vlan='3'
uci set network.@bridge-vlan[1].ports='lan1:t lan4'
uci add network bridge-vlan
uci set network.@bridge-vlan[2].device='br-sw'
uci set network.@bridge-vlan[2].vlan='4'
uci set network.@bridge-vlan[2].ports='lan1:t'
uci set network.lan.ifname='br-sw.1 bat0'

The VLAN is set correctly, and the IPTV multicast data is transmitted stably on the VLAN; but batman-adv triggers a kernel warning:

[   27.882353] batman_adv: bat0: Adding interface: br-sw.4
[   27.887639] batman_adv: bat0: The MTU of interface br-sw.4 is too small (1500) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1560 would solve the problem.
[   27.911932] batman_adv: bat0: Interface activated: br-sw.4
[   32.067719] ------------[ cut here ]------------
[   32.072383] WARNING: CPU: 3 PID: 0 at net/bridge/br_switchdev.c:46 br_handle_frame_finish+0xac/0x4ac
[   32.081510] Modules linked in: xt_connlimit pppoe ppp_async nf_conncount iptable_nat batman_adv xt_state xt_nat xt_helper xt_conntrack xt_connmark xt_connbytes xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD pppox ppp_generic nft_redir nft_nat nft_masq nft_flow_offload nft_ct nft_chain_nat nf_nat nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack_netlink nf_conntrack mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_recent xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG wireguard usblp ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda slhc nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject_bridge nft_reject nft_quota nft_objref nft_numgen nft_meta_bridge nft_log nft_limit nft_hash nft_fwd_netdev nft_dup_netdev nft_counter nf_tables_set nf_tables nf_reject_ipv4 nf_log_ipv4 nf_dup_netdev
[   32.081719]  nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_filter ip_tables crc_ccitt compat ledtrig_usbport ledtrig_heartbeat xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 msdos ip_gre gre ip_tunnel vfat fat fscache nls_utf8 nls_iso8859_1 nls_cp437 geneve udp_tunnel ip6_udp_tunnel uas usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 mbcache jbd2 usbcore nls_base usb_common crc32c_generic
[   32.240618] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.4.75 #0
[   32.246518] Stack : 00000000 80083974 00000001 00000001 806f0000 806f3000 806f2104 8fc25c7c
[   32.254845]         808d0000 8fc8ccc4 8073f0a7 8068589c 00000003 00000001 8fc25c20 0a0b3b2e
[   32.263175]         00000000 00000000 80910000 00000000 00000030 00000165 342e3520 2035372e
[   32.271498]         00000000 00000001 00000000 0003abea 00000000 80770000 00000000 0000002e
[   32.279822]         00000009 8073cf6c 8fc25e90 80740000 00000002 80380cc0 0000000c 808d000c
[   32.288145]         ...
[   32.290583] Call Trace:
[   32.293053] [<8000c11c>] show_stack+0x30/0x100
[   32.297510] [<805b7d50>] dump_stack+0xa4/0xdc
[   32.301873] [<8002d4b8>] __warn+0xc0/0x120
[   32.305955] [<8002d574>] warn_slowpath_fmt+0x5c/0xac
[   32.310908] [<80593f80>] br_handle_frame_finish+0xac/0x4ac
[   32.316372] [<80594718>] br_handle_frame+0x398/0x4e4
[   32.321345] [<80455734>] __netif_receive_skb_core+0x268/0xb10
[   32.327093] [<80456000>] __netif_receive_skb_one_core+0x24/0x50
[   32.333010] [<8045631c>] process_backlog+0x9c/0x178
[   32.337896] [<80457e88>] __napi_poll+0x40/0x1a0
[   32.342426] [<8045818c>] net_rx_action+0x114/0x28c
[   32.347224] [<805d9c90>] __do_softirq+0x198/0x458
[   32.351927] [<800328b0>] irq_exit+0x98/0xb0
[   32.356121] [<80320050>] plat_irq_dispatch+0x64/0x104
[   32.361165] [<80007328>] except_vec_vi_end+0xb8/0xc4
[   32.366114] [<805d9120>] r4k_wait_irqoff+0x1c/0x24
[   32.371053] ---[ end trace 9a2b408b492abc33 ]---

The router itself does not seem to be affected and the mesh node can be seen, but data cannot be forwarded on the VLAN.