Mtk_soc_eth watchdog timeout after r11573

Thank you @apocalypse for sharing your experience! This issue with the EdgeRouter X was already driving me nuts! With the current master branch, OpenWrt is finally stable again. There is still one question left though: have you ever tried using a PoE adapter to power the EdgeRouter? I heard of one case where switching from PoE to the 12V power supply seemed to have caused the problems.

I have not tried feeding it by PoE, but I have tried with 12V 2A and 16V 4.5A power supplies and the result in 19.07 has been the same. It is only stable on Master branch. I have reached 22 days of error free uptime. Then there was a power outage. I have to connect the router to a UPS. But after 22 days I can say that it is 100% stable.

It is confusing because it used to be said that with DSA there will not be hwnat on mt7621. Has that now changed? If VLAN tagging and IPv4 HW NAT offloading work, should we expect PPPoE HW offload to work at some point too? What about IPv6 offloading?

Well, this can be answered by @nbd; I am also interested in it (at least PPPoE HW offload). To be honest, I have some doubts about whether even simple IPv4 HW offload works at the moment, as only a single user has confirmed it. Maybe something will change with the move to kernel 5.10.

No it has not. @blogic has stated he was working on getting it working on mt7621, but we'll see.

Alright, so I'm running the latest snapshot (at least 2 days old by now) and I got VLANs working (switch0/br-lan using DSA, set up manually with an iface hotplug script) without any issues so far.

I get ~375Mbit/s with an iperf3 test between the router and a PC (the same PC can handle ~931Mbit/s against an EdgeRouter X on stock firmware). I enabled SW offload in the Firewall section of LuCI but got the same results, and later HW offload as well, again with the same results.

Am I enabling it incorrectly or is that just the current state of it? If so, how should I enable it (so that I can use it or at least test it and report back)?

HW offload is not currently supported. Unfortunately DSA is slower than swconfig.

I disagree a little.
Yesterday I also tested with iperf3 and got 900/900 Mbps over IPoE with 1.7% CPU load.

Laptop (Debian + Marvell) 192.168.x.x as the client <- 1Gbps -> lan-Mi3G-wan <- 1Gbps -> desktop (Windows + Realtek) 10.10.x.x as the server.
Maybe I am doing something wrong, but I get 900 Mbps.
OpenWrt SNAPSHOT r15635-799fca7602 / LuCI Master git-21.020.56896-af422b1

If HW offload were available, it would show up under /proc/interrupts. AFAIK, no such thing exists for mt7621 right now.

I also get low CPU load (1%) and 900Mbps IPoE speed. This wouldn't happen if hwnat didn't work.

Wait, the forum was hacked. Are you really neheb?

I have no mt7621 hardware, so I can neither confirm nor deny anything.

Anyway, that's good news if what you're saying is true.

Good news, a potential workaround for this age-old bug is coming to 19.07.7 according to a post on the mailing list: https://lists.openwrt.org/pipermail/openwrt-devel/2021-February/033649.html

I hope Baptiste succeeds with it, since disabling TSO doesn't fix the problem 100%, as @Bartvz already said earlier:

But thankfully Jaap already asked about the fix: https://lists.openwrt.org/pipermail/openwrt-devel/2021-February/033652.html

I think disabling TSO definitely doesn't fix anything. It just softens the problem.

The real solution is the DSA driver.
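
For anyone comparing results with and without TSO, it helps to confirm what state the interface is actually in first. Below is a minimal sketch, assuming the text comes from `ethtool -k <iface>`, which prints one `feature: on|off` line per feature. The sample text and the interface name `eth0` are made up for illustration, and the helper name `tso_enabled` is hypothetical.

```python
def tso_enabled(ethtool_k_output):
    """Return True/False for the tcp-segmentation-offload line, or None if absent."""
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name == "tcp-segmentation-offload":
            fields = value.split()  # e.g. ["on"], ["off"] or ["off", "[fixed]"]
            return fields[0] == "on" if fields else None
    return None


# Illustrative sample only; real output lists many more features.
SAMPLE = """\
Features for eth0:
rx-checksumming: on
tcp-segmentation-offload: on
generic-receive-offload: on
"""

if __name__ == "__main__":
    print("TSO enabled:", tso_enabled(SAMPLE))
```

Turning it off for a test would then be `ethtool -K <iface> tso off`, with the interface name depending on the device and on whether the build uses swconfig or DSA.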

There you go, DSA:

[3587566.117516] ------------[ cut here ]------------
[3587566.127133] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x2fc/0x304
[3587566.143966] NETDEV WATCHDOG: dsa (mtk_soc_eth): transmit queue 0 timed out
[3587566.158018] Modules linked in: pppoe ppp_async l2tp_ppp iptable_nat xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD xt_CT pppox ppp_generic nf_nat nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack ipt_REJECT xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG slhc nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_filter ip_tables crc_ccitt sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ip6_gre ip_gre gre l2tp_eth l2tp_netlink l2tp_core udp_tunnel ip6_udp_tunnel ip6_tunnel tunnel6 ip_tunnel leds_gpio gpio_button_hotplug
[3587566.300596] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.4.58 #0
[3587566.312719] Stack : ffffffff 8007d450 806d0000 806d3f7c 80740000 806d3f44 806d3098 8fc13db4
[3587566.329705]         80880000 8fc3ccc8 8071ed43 80667dfc 00000003 00000001 8fc13d58 eb1d7caf
[3587566.346677]         00000000 00000000 808c0000 00000000 746e6961 000005c9 352e342e 30232038
[3587566.363647]         00000000 0002b941 00000000 00049634 00000000 80740000 00000000 804718f8
[3587566.380619]         00000009 00000003 00200000 00000122 00000000 8034f834 0000000c 8088000c
[3587566.397591]         ...
[3587566.402802] Call Trace:
[3587566.408038] [<8000b72c>] show_stack+0x30/0x100
[3587566.417256] [<805a8e08>] dump_stack+0xa4/0xdc
[3587566.426288] [<8002bea0>] __warn+0xc0/0x10c
[3587566.434789] [<8002bf78>] warn_slowpath_fmt+0x8c/0xac
[3587566.445043] [<804718f8>] dev_watchdog+0x2fc/0x304
[3587566.454765] [<80096278>] call_timer_fn.isra.34+0x20/0x90
[3587566.465701] [<800964c0>] run_timer_softirq+0x1d8/0x230
[3587566.476310] [<805c98f4>] __do_softirq+0x16c/0x334
[3587566.486044] [<8003060c>] irq_exit+0x98/0xb0
[3587566.494720] [<802eedc0>] plat_irq_dispatch+0x64/0x104
[3587566.505128] [<80006de8>] except_vec_vi_end+0xb8/0xc4
[3587566.515370] [<805c8fa8>] r4k_wait_irqoff+0x1c/0x24
[3587566.525511] ---[ end trace c4587fabce800a9b ]---
[3587566.535072] mtk_soc_eth 1e100000.ethernet dsa: transmit timed out
[3587566.548229] mtk_soc_eth 1e100000.ethernet dsa: Link is Down
[3587566.596716] mtk_soc_eth 1e100000.ethernet dsa: configuring for fixed/rgmii link mode
[3587566.612609] mtk_soc_eth 1e100000.ethernet dsa: Link is Up - 1Gbps/Full - flow control rx/tx

However, this happened maybe once a month. Definitely far less often than with the previous driver.
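
Since the timeout is this rare, it is easier to count occurrences from a saved log than to watch for them. Here is a rough sketch, assuming the kernel log has been saved as text (for example from `dmesg`) and that the line format matches the trace above. The function name `find_watchdog_timeouts` is hypothetical, and the sample is just three lines copied from that trace.

```python
import re

# Matches kernel lines such as:
# [3587566.143966] NETDEV WATCHDOG: dsa (mtk_soc_eth): transmit queue 0 timed out
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: (?P<iface>\S+) "
    r"\((?P<driver>[^)]+)\): transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(log_text):
    """Return one dict per NETDEV WATCHDOG line found in the log text."""
    events = []
    for line in log_text.splitlines():
        m = WATCHDOG_RE.search(line)
        if m:
            events.append({
                "uptime_s": float(m.group("ts")),  # seconds since boot
                "iface": m.group("iface"),
                "driver": m.group("driver"),
                "queue": int(m.group("queue")),
            })
    return events


# Three lines taken from the trace above.
SAMPLE = """\
[3587566.127133] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x2fc/0x304
[3587566.143966] NETDEV WATCHDOG: dsa (mtk_soc_eth): transmit queue 0 timed out
[3587566.535072] mtk_soc_eth 1e100000.ethernet dsa: transmit timed out
"""

if __name__ == "__main__":
    for ev in find_watchdog_timeouts(SAMPLE):
        days = ev["uptime_s"] / 86400
        print(f"{ev['iface']} ({ev['driver']}) queue {ev['queue']} "
              f"at {days:.1f} days uptime")
```

The timestamp in brackets is seconds since boot, so dividing by 86400 gives the uptime in days at which the timeout hit, which makes reports from different devices easier to compare.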

This log is from the old kernel revision (5.4.58). I have been using 5.4.82 for more than a month and have not seen that message again. Nor have I been able to reproduce the error in any way, the way I could on 19.07.x.

Some devices may just be unstable, like a newly assembled PC. :slight_smile:
https://4pda.ru/forum/index.php?showtopic=837667&st=5500#entry75122115
It is debatable whether a flaky router is caused by the software or by the hardware.
We need extensive testing.

Actually, this is dmesg from an EdgeRouter X, so it should be OK hardware-wise.

But you never know.

Looking at the latest comments on https://bugs.openwrt.org/index.php?do=details&task_id=2628, folks are mentioning that the latest snapshots have fixed this issue. Has anyone here tried that?

I’m using 19.07.7 on an EdgeRouter X with a PPPoE WAN and VLAN tagging, and the device goes down almost daily.

Edit: OK, reading further up I see people having success with kernel 5 and above. I’ll give the latest snapshot a try.