Transmit Queue timed out on Netgear WAX206 / MT7622BV / mtk_soc_eth

Dear community,

I observed a network outage where my device connected to the Netgear WAX206 could no longer reach the internet (provided by a FritzBox 7530 connected via ethernet to the WAX).

I rebooted the Netgear and then it worked again.

Syslog:

Mon Jul 17 06:22:01 2023 kern.warn kernel: [8025799.999008] ------------[ cut here ]------------
Mon Jul 17 06:22:01 2023 kern.info kernel: [8025800.003817] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 7 timed out
Mon Jul 17 06:22:01 2023 kern.warn kernel: [8025800.010970] WARNING: CPU: 1 PID: 0 at dev_watchdog+0x330/0x33c
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.016977] Modules linked in: pppoe ppp_async nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mac80211 cfg80211 slhc nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c hwmon crc_ccitt compat tun seqiv leds_gpio gpio_button_hotplug
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.073098] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.106 #0
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.079358] Hardware name: Netgear WAX206 (DT)
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.083966] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.091094] pc : dev_watchdog+0x330/0x33c
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.095269] lr : dev_watchdog+0x330/0x33c
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.099444] sp : ffffffc00800bd90
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.102921] x29: ffffffc00800bd90 x28: 0000000000000140 x27: 00000000ffffffff
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.110226] x26: 0000000000000000 x25: 0000000000000001 x24: ffffff8000e994c0
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.117530] x23: 0000000000000000 x22: 0000000000000001 x21: ffffffc008b06000
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.124834] x20: ffffff8000e99000 x19: 0000000000000007 x18: ffffffc008b1a3a0
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.132138] x17: ffffffc0173c6000 x16: ffffffc00800c000 x15: 0000000000000ca5
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.139442] x14: 0000000000000437 x13: ffffffc00800bab8 x12: ffffffc008b723a0
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.146746] x11: 712074696d736e61 x10: ffffffc008b723a0 x9 : 0000000000000000
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.154050] x8 : ffffffc008b1a350 x7 : ffffffc008b1a3a0 x6 : 0000000000000001
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.161354] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.168657] x2 : 0000000000000002 x1 : 0000000000000005 x0 : 000000000000003f
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.175962] Call trace:
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.178573]  dev_watchdog+0x330/0x33c
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.182400]  call_timer_fn.constprop.0+0x20/0x80
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.187186]  __run_timers.part.0+0x208/0x284
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.191621]  run_timer_softirq+0x38/0x70
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.195709]  _stext+0x10c/0x28c
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.199017]  __irq_exit_rcu+0xdc/0xfc
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.202846]  irq_exit+0xc/0x1c
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.206065]  handle_domain_irq+0x60/0x8c
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.210153]  gic_handle_irq+0x64/0x8c
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.213982]  call_on_irq_stack+0x28/0x44
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.218072]  do_interrupt_handler+0x4c/0x54
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.222420]  el1_interrupt+0x2c/0x4c
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.226162]  el1h_64_irq_handler+0x14/0x20
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.230424]  el1h_64_irq+0x74/0x78
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.233991]  arch_cpu_idle+0x14/0x20
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.237732]  do_idle+0xc0/0x150
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.241042]  cpu_startup_entry+0x24/0x60
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.245132]  secondary_start_kernel+0x130/0x140
Mon Jul 17 06:22:01 2023 kern.debug kernel: [8025800.249828]  __secondary_switched+0x50/0x54
Mon Jul 17 06:22:01 2023 kern.warn kernel: [8025800.254183] ---[ end trace 257eb2115bb44383 ]---

The FritzBox also ended up with stuck transmit queues as a result of that. I wonder whether this points to some underlying problem? Its syslog:

Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203206.858422] ------------[ cut here ]------------
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203206.858485] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x2bc/0x2c0
Mon Jul 17 06:22:31 2023 kern.info kernel: [1203206.862154] NETDEV WATCHDOG: eth0 (ipqess-edma): transmit queue 2 timed out
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203206.870729] Modules linked in: pppoe ppp_async nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet ath10k_pci ath10k_core ath wireguard pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_compat nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mac80211 libchacha20poly1305 iptable_mangle iptable_filter ipt_REJECT ipt_ECN ip_tables curve25519_neon cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY x_tables slhc sch_cake poly1305_arm nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libcrc32c hwmon crc_ccitt compat clip chacha_neon sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_route cls_matchall
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203206.871645]  cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact drv_dsl_cpe_api drv_mei_cpe ifb ip6_udp_tunnel udp_tunnel tun br2684 vrx518_tc atm vrx518 drv_ifxos md5 ghash_arm_ce cmac leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom gpio_button_hotplug crc32c_generic
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203206.967940] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.105 #0
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203206.990164] Hardware name: Generic DT based system
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203206.996331] [<c030cdbc>] (unwind_backtrace) from [<c030985c>] (show_stack+0x10/0x14)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.001539] [<c030985c>] (show_stack) from [<c0617cc0>] (dump_stack_lvl+0x40/0x4c)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.009264] [<c0617cc0>] (dump_stack_lvl) from [<c03220b8>] (__warn+0x8c/0x100)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.017163] [<c03220b8>] (__warn) from [<c0322194>] (warn_slowpath_fmt+0x68/0x78)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.024540] [<c0322194>] (warn_slowpath_fmt) from [<c08348a0>] (dev_watchdog+0x2bc/0x2c0)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.032008] [<c08348a0>] (dev_watchdog) from [<c038a420>] (call_timer_fn.constprop.0+0x24/0x88)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.040341] [<c038a420>] (call_timer_fn.constprop.0) from [<c038ab40>] (__run_timers.part.0+0x1f0/0x25c)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.049369] [<c038ab40>] (__run_timers.part.0) from [<c038abe4>] (run_timer_softirq+0x38/0x68)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.058917] [<c038abe4>] (run_timer_softirq) from [<c03012b4>] (__do_softirq+0x10c/0x2c4)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.067858] [<c03012b4>] (__do_softirq) from [<c0326c1c>] (irq_exit+0xbc/0x100)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.075931] [<c0326c1c>] (irq_exit) from [<c036fcec>] (handle_domain_irq+0x60/0x78)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.083568] [<c036fcec>] (handle_domain_irq) from [<c063018c>] (gic_handle_irq+0x7c/0x90)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.091383] [<c063018c>] (gic_handle_irq) from [<c0300b3c>] (__irq_svc+0x5c/0x78)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.099539] Exception stack(0xc0d01f10 to 0xc0d01f58)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.107179] 1f00:                                     940b8d6c 00000000 00000001 c0312880
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.112394] 1f20: 00000000 c0d04f28 c0d00000 00000000 00000000 ffffe000 c0d04ec8 c0d04f5c
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.120726] 1f40: c0d04fcc c0d01f60 c03070ac c03070b0 60000013 ffffffff
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.129054] [<c0300b3c>] (__irq_svc) from [<c03070b0>] (arch_cpu_idle+0x38/0x3c)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.135999] [<c03070b0>] (arch_cpu_idle) from [<c0351064>] (do_idle+0x238/0x298)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.143463] [<c0351064>] (do_idle) from [<c03513c8>] (cpu_startup_entry+0x18/0x1c)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.151017] [<c03513c8>] (cpu_startup_entry) from [<c0c01140>] (start_kernel+0x668/0x678)
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.158970] ---[ end trace a3b252e3ef3109a9 ]---
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.167018] ipqess-edma c080000.ethernet eth0: hardware queue 8 is in stuck?
Mon Jul 17 06:22:31 2023 kern.warn kernel: [1203207.173346] ath10k_ahb a000000.wifi: SWBA overrun on vdev 1, skipped old beacon
Mon Jul 17 06:22:37 2023 kern.warn kernel: [1203213.098338] ipqess-edma c080000.ethernet eth0: hardware queue 8 is in stuck?
Mon Jul 17 06:22:43 2023 kern.warn kernel: [1203218.858265] ipqess-edma c080000.ethernet eth0: hardware queue 8 is in stuck?
Mon Jul 17 06:22:48 2023 kern.warn kernel: [1203223.898213] ipqess-edma c080000.ethernet eth0: hardware queue 8 is in stuck?
Mon Jul 17 06:22:53 2023 kern.warn kernel: [1203228.858137] ipqess-edma c080000.ethernet eth0: hardware queue 8 is in stuck?

(After rebooting the WAX, the "hardware queue 8 is in stuck?" messages disappeared.)

The Netgear is running OpenWrt SNAPSHOT r22565-877ec78e23 (yep, not up to date, a few months old, I need to update...).
The FritzBox runs a similar snapshot, maybe a bit older.

This was the first problem with this hardware setup; it has been in production since April, so I cannot really test or try to reproduce the error anyway.

In the meantime, I replaced the ethernet cable, tried different ports, and rebooted both devices. After that, the error came back.

Unfortunately, it now shows up quite reliably. As a first measure, I swapped out the WAX206 for a new one I luckily got yesterday (now on yesterday's snapshot). Since the error needs a few minutes to show, I am not yet sure whether it is fixed.

As the setup was running perfectly fine and nothing has changed, I am left wondering what went wrong.
I also checked the frame counters in ifconfig and there were no errors on either device. Any ideas on how to investigate further?
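So far, the only additional things I know to look at next time it happens are these (assuming eth0 is the interface from the trace; ethtool is an extra package on OpenWrt, and the sysfs path only says something if the driver supports BQL, which I have not verified for mtk_soc_eth):

    ip -s link show dev eth0        # detailed RX/TX error and drop counters
    tc -s qdisc show dev eth0       # per-qdisc drops, overlimits, requeues
    ethtool -S eth0                 # driver/hardware counters, if exposed (needs the ethtool package)
    grep eth /proc/interrupts       # check whether the NIC interrupts still increment
    cat /sys/class/net/eth0/queues/tx-7/byte_queue_limits/inflight   # in-flight bytes of the timed-out queue, only meaningful with BQL support

If the in-flight byte count of the timed-out queue stays non-zero while traffic is stalled, that would point at the hardware ring no longer completing descriptors rather than at a qdisc problem.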

The WAX206 itself was still generally reachable via ethernet - another AP connected to a different port showed no signs of ethernet/connection problems (apart from having no internet, of course), and I was able to reach the WAX through that ethernet port...

There are related issues on GitHub:
https://github.com/openwrt/openwrt/issues/13122 (MT7986a)
https://github.com/openwrt/openwrt/issues/12143 (Also MT7986 - Bananapi BPI R3)

All share the same mtk_soc_eth driver.
I don't know how to reproduce the error, but it would be nice to have a plan for what to check when it occurs again (a rough sketch of what I would capture is below)...
For me, all flow offloading is turned off and the SoC was running with the performance governor.
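As a rough sketch of such a plan (assuming the affected interface is eth0 and the upstream FritzBox sits at 192.168.178.1 - both are placeholders for my setup), something like this could run on the WAX and dump state the moment the upstream stops answering:

    #!/bin/sh
    # Watch connectivity towards the upstream gateway and dump interface,
    # qdisc and kernel state as soon as it breaks, instead of after a reboot.
    GW=192.168.178.1      # placeholder: upstream gateway (the FritzBox here)
    IF=eth0               # interface named in the watchdog trace
    while true; do
        if ! ping -c 1 -W 2 "$GW" > /dev/null 2>&1; then
            {
                date
                ip -s link show dev "$IF"
                tc -s qdisc show dev "$IF"
                dmesg | tail -n 50
            } >> /tmp/txqueue-debug.log
        fi
        sleep 5
    done

The point is just to have counters and the last kernel messages from the moment the queue stalls, so they can be compared against the NETDEV WATCHDOG timestamp.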

I also had a similar issue with my WAX206:
I simplified my network setup to rule out errors there, and even disabled the firewall for testing purposes.
I've narrowed it down to a test case where the WAX206 only has to forward ping requests from a LAN port to the WAN port. I let it run overnight and got 78% packet loss (a timestamped variant of that test is sketched below).
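For reference, the test roughly looks like this when run from a LAN client (a minimal sketch, not the exact commands I used; 192.168.1.1 is just a placeholder for a host that is only reachable via the WAN port):

    # Log each lost probe with a timestamp, one probe per second,
    # so the loss windows can be lined up with the kernel log later.
    TARGET=192.168.1.1
    while true; do
        ping -c 1 -W 1 "$TARGET" > /dev/null 2>&1 || echo "$(date '+%F %T') lost"
        sleep 1
    done | tee /tmp/ping-loss.log

With per-second timestamps, the roughly 20-second outage bursts and the very late replies described below show up as contiguous blocks of lost probes followed by a burst of answers.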

The observable symptom is that all network traffic ceases in bursts of about 20 seconds until the device returns to a working state for an interval of random length. Once it is working again, some of the stuck ping requests still get answered, leading to extremely high response times for those packets (>2 s).

That sounds like a very different problem - but still a bad one. I did not see anything occurring that looked similar to your problem.