Mtk_soc_eth watchdog timeout after r11573

refelian · June 17, 2020, 1:42pm

It seems to be happening every hour now for me. Any idea how to extract kernel logging after power cycling from a previous boot? I can't access the device through luci/ssh.

lantsem · June 19, 2020, 8:38am

Try to disable packet steering (RPS) that was enabled by default a few days ago.

dchard · June 23, 2020, 7:39pm

I am using RPS for 2 months now on a DIR-860L and it does not create any issue, at least not on my end (with kernel 5.4).

lantsem · June 23, 2020, 8:26pm

I got tx timed out with RPS pretty easily with my 1Gbps uplink and kernel 5.4. After disabling RPS so far not once.

dchard · June 24, 2020, 3:13pm

Maybe you want to try with latest master, as there was a few additions in the last 23 days. I only have 300Mbps uplink, so I cannot test up to a gig on the uplink side.

lantsem · June 25, 2020, 12:12pm

Already tested. Performance increased after added optimizations to ethernet driver, but transmit timed out is still there with RPS enabled and transfer speed around ~450 Mbps:

[274768.477705] ------------[ cut here ]------------
[274768.487131] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x2fc/0x304
[274768.503782] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[274768.517852] Modules linked in: mt7615e mt7615_common mt76 mac80211 iptable_nat ipt_REJECT cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_MASQUERADE xt_LOG xt_HL xt_FLOWOFFLOAD xt_DSCP xt_CT xt_CLASSIFY nf_reject_ipv4 nf_nat_ftp nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack_ftp nf_conncount iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables compat act_connmark nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ifb leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
[274768.690284] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G        W         5.4.46 #0
[274768.705005] Stack : ffffffff 80077050 80670000 80673774 806d0000 8067373c 80672890 8fc13db4
[274768.721815]         80810000 8fc3ccc4 806bad03 8060aaf4 00000003 00000001 8fc13d58 48e5ef65
[274768.738611]         00000000 00000000 80850000 00000000 00000030 0000019d 20202020 20202020
[274768.755405]         00000000 00003566 00000000 20202020 00000000 806d0000 00000000 80438604
[274768.772214]         00000009 00000003 00200000 00000122 00000003 80324bf0 0000000c 8081000c
[274768.789026]         ...
[274768.794054] Call Trace:
[274768.799110] [<8000b72c>] show_stack+0x30/0x100
[274768.808134] [<80553484>] dump_stack+0xa4/0xdc
[274768.816978] [<80028058>] __warn+0xc0/0x10c
[274768.825296] [<80028130>] warn_slowpath_fmt+0x8c/0xac
[274768.835359] [<80438604>] dev_watchdog+0x2fc/0x304
[274768.844888] [<80090210>] call_timer_fn.isra.34+0x20/0x90
[274768.855621] [<80090458>] run_timer_softirq+0x1d8/0x230
[274768.866022] [<80573fec>] __do_softirq+0x16c/0x334
[274768.875562] [<8002c7a4>] irq_exit+0x98/0xb0
[274768.884059] [<802c6860>] plat_irq_dispatch+0x64/0x104
[274768.894283] [<80006de8>] except_vec_vi_end+0xb8/0xc4
[274768.904333] [<805736a0>] r4k_wait_irqoff+0x1c/0x24
[274768.914377] ---[ end trace b5b1f56d83bf2de2 ]---
[274768.923778] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[274768.938362] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
[274768.976925] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
[274768.993071] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[274975.322681] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[274975.335751] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
[274975.374134] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
[274975.390317] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx

Ethernet recovers pretty quickly, I don't lose access to router, but I didn't test for a long uptime with many timed out errors. Without RPS I got only ~240 Mbps, but it is stable.

P.S. I don't used software offloading in this tests.

dchard · June 25, 2020, 1:22pm

Well, I am using RPS and SW offload, and had no issues with it. Without RPS the speed is quite bad.

What you can try is set this in your rc.local:

# echo e > /sys/class/net/eth0/queues/rx-0/rps_cpus

This is how I use it. The trick is that this is not allowing the RPS to work on all 4 threads, only on the first three. This allows one core to be kept free for other processes/scheduling/IRQ etc. maybe this is why I dont see any of these errors. The master branch is using "f" instead of "e" which allows RPS on all cores.

Will update my build to latest master today.

refelian · June 25, 2020, 1:33pm

Disabling RPS didn't help. Perhaps it became a bit better but it's difficult to say. My device has two threads, perhaps limiting RPS to a single thread will help. Will give it a go later this evening.

Anyway, currently I am using an ugly workaround whenever it drops. I am using watchcat which looks at my access point to determine if it needs to restart

lantsem · June 25, 2020, 2:02pm

I use my own startup script which sets affinity for ethernet and wifi cards (2, 4, 8) and RPS for ethernet with value '5' to use only additional thread with cpu0 and thread from 2.4G wifi with cpu1. In my tests this setting is more optimal than using 3/4 threads like with your 'e' value - efficiency isn't much lower and 5G wifi has more performance at the same time. I even tried to use RPS for wlan interfaces (yeah, it helps in one my test case), but stopped while waiting for the mt76 interrupt errors bug to be fixed.

Mushoz · July 28, 2020, 8:25pm

It's been a while since this bug has last been discussed, and the master branch has progressed substantially during that time. Is there anyone that used to have this issue, but for whom it is now 100% fixed? If so, what build are you running that fixed it for you? I am running a snapshot of the 19.07 branch that includes the pause frame patch, but it is still occuring every now and then. The router hasn't crashed, which is an improvement, but the router does not allow traffic for up to 30 minutes when the bug is triggered.

refelian · July 29, 2020, 3:59pm

It's still there. I can also trigger it by unplugging my work laptop from the USB-C/ Ethernet (I know oddly specific but very reproducible)

Mushoz · July 29, 2020, 4:22pm

Interesting! And on which build are you? Being able to reliably trigger it might prove useful for getting it fixed (hopefully) down the road.

refelian · July 29, 2020, 5:12pm

OpenWrt 19.07-SNAPSHOT r11109-18cf2383b6 / LuCI openwrt-19.07 branch git-20.167.61968-87da00a

It's 19.07.3 with a couple of fixes pulled in for this chipset. (Also I didn't know you are a fellow dutchie).

shevalier · August 1, 2020, 7:54am

I tested 19.7 before commit f4985a2 - 3-5 minutes and router rebooted under heavy load on 100Mbps uplink
Linux version 4.14.187 (xx@debian) (gcc version 7.5.0 (OpenWrt GCC 7.5.0 r11093-f99b1d1)) #0 SMP Sun Jul 5 22:12:04 2020
CPU 0 Unable to handle kernel paging request at virtual address ffffad99, epc == 8ead1c38, ra == 8ead1c84

Oops[#1]:

and reboot

Release version 19.7.03 - up to 30 days uptime, but there is always a transmit queue 0 timed out in kernel log.
But no reboots

dchard · August 15, 2020, 8:39pm

@nbd @neheb

I would like to kindly ask any of you if there is any news on the HW offload for mt7621 on DSA? Couple months passed since the last time, thought I would ask

neheb · August 16, 2020, 7:00am

There is none.

dchard · August 17, 2020, 6:19pm

Sorry to hear that. I tried to follow the known sources (mainly the "next" branch of the kernel git), but found nothing in this regard. I hoped that @nbd might have some sort of an update on this topic.

phenobarbital · August 29, 2020, 12:34am

I'm currently using 5.4.60 with snapshot r14281-09faa73c53 over a ZBT-WG3526, I'm getting the same "transmit timed out", almost instantly after boot, but placing this on rc.local:

# wifi0 - 5GHz
echo 2 > /proc/irq/25/smp_affinity
# wifi1 - 2GHz
echo 1 > /proc/irq/24/smp_affinity
# lan
echo 5 > /proc/irq/22/smp_affinity

# RPS
echo 8 > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 6 > /sys/class/net/wlan0/queues/rx-0/rps_cpus
echo 4 > /sys/class/net/wlan1/queues/rx-0/rps_cpus

The interrupts were reduced drastically:

           CPU0       CPU1       CPU2       CPU3       
  8:     677031     677109     677119     677087  MIPS GIC Local   1  timer
  9:      49624          0          0          0  MIPS GIC  63  IPI call
 10:          0      59795          0          0  MIPS GIC  64  IPI call
 11:          0          0      50433          0  MIPS GIC  65  IPI call
 12:          0          0          0      43069  MIPS GIC  66  IPI call
 13:     262118          0          0          0  MIPS GIC  67  IPI resched
 14:          0      90961          0          0  MIPS GIC  68  IPI resched
 15:          0          0     121369          0  MIPS GIC  69  IPI resched
 16:          0          0          0     100677  MIPS GIC  70  IPI resched
 17:          0          0          0          0  MIPS GIC  19  1e000600.gpio-bank0, 1e000600.gpio-bank1, 1e000600.gpio-bank2
 19:         15          0          0          0  MIPS GIC  33  ttyS0
 20:          0          0          0          0  MIPS GIC  27  1e130000.sdhci
 21:          0          0          0          0  MIPS GIC  29  xhci-hcd:usb1
 22:     256614          0          0          0  MIPS GIC  10  1e100000.ethernet
 24:        240          0     131972          0  MIPS GIC  11  mt7603e
 25:          2        112          0     353291  MIPS GIC  31  mt76x2e
 26:          0          0          0          0  MIPS GIC  32  ahci[0000:03:00.0]
 27:          0          0          0          0  1e000600.gpio  18  keys
ERR:        130

also, I have 2 hours uptime right now and I don't see any kernel issues, this is an old device that I try to recover, and this is the first time (what I remember) that so long has passed without an error on dmesg.
I remember this error in the past (that's why I stopped using this device, almost 4 years ago), it's incredible that this error persists.

neheb · August 29, 2020, 3:12am

Sounds bad.

When the DSA transition happened, people were reporting that this was fixed. Now people report that it's broken. Guess an upstream kernel change broke it...

Rising_Sun · August 29, 2020, 9:31am

WIth this commit from nbd's staging tree merged into master i don't get any timeouts. Iperf with 5GB data runs smoothly without hanging. Xiaomi 3G
https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=commit;h=59d236f11df7539cfd9524a8d7857faefe40f74b