Mtk_soc_eth watchdog timeout after r11573

@neheb @apocalypse

I am also dissecting a Huawei CPE that has exactly the same mt7621 SoC, and I found this script in the factory firmware (based on kernel 2.6 ....):

The file name is "adjust_cpu_offload.sh":

#!/bin/sh

# CPU0 IRQ23(Ralink_ESW) + IRQ24(ra0)
# CPU1 RPS(ra0) + IRQ3(eth2)
# CPU2 RPS(ra0) + RPS(rai0)
# CPU3 RPS(eth2) + IRQ4(rai0)

smp.sh wifi >/dev/null 2>&1

# SMP affinity
echo 2 > /proc/irq/3/smp_affinity
echo 8 > /proc/irq/4/smp_affinity
echo 1 > /proc/irq/24/smp_affinity

# RPS
echo 8 > /sys/class/net/eth2/queues/rx-0/rps_cpus
echo 6 > /sys/class/net/ra0/queues/rx-0/rps_cpus
echo 4 > /sys/class/net/rai0/queues/rx-0/rps_cpus

I have no idea if this can have any effect, but someone previously mentioned that this issue might be a scheduling/affinity issue.

On OpenWrt the above eth2 is eth0. I checked the value and it is an "e", not a "6" (rps_cpus is a hexadecimal CPU bitmask, so "e" = 1110 = CPUs 1-3, while the vendor's "6" = 0110 = CPUs 1 and 2):

root@dir860lb1:~# cat /sys/class/net/eth0/queues/rx-0/rps_cpus
e

This commit, "ramips: gsw_mt7621: disable PORT 5 MAC RX/TX flow control by default", looks like a fix for the problem, and as it is in a staging tree it might be applied to master soon.

Would be nice to see that in master indeed.

Since the reboot at 9 Feb 19:51:

Uptime is 3d 2h 36m 25s, and so far there has been no kernel crash or reboot, and only 1 interrupt error on CPU0 (used by the switch).

How come the patch now seems to be working, when it did not the first time, before the restart?

Are you using a D-Link DIR-860L B1 too?

Why am I getting a lot more interrupt errors?

root@OpenWrt:~# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  8:   33061974   33061948   33061914   33061941  MIPS GIC Local   1  timer
  9:     242093          0          0          0  MIPS GIC  63  IPI call
 10:          0   70970738          0          0  MIPS GIC  64  IPI call
 11:          0          0   69551870          0  MIPS GIC  65  IPI call
 12:          0          0          0   70699212  MIPS GIC  66  IPI call
 13:    4509834          0          0          0  MIPS GIC  67  IPI resched
 14:          0    7228594          0          0  MIPS GIC  68  IPI resched
 15:          0          0    1128570          0  MIPS GIC  69  IPI resched
 16:          0          0          0    1319830  MIPS GIC  70  IPI resched
 19:         14          0          0          0  MIPS GIC  33  ttyS0
 20:          0          0          0          0  MIPS GIC  29  xhci-hcd:usb1
 21:  560229579          0          0          0  MIPS GIC  10  1e100000.ethernet
 22:          2          0          0          0  MIPS GIC  30  gsw
 23:        277          0    9010747          0  MIPS GIC  11  mt76x2e
 24:   30387205          0          0          0  MIPS GIC  31  mt76x2e
 26:          0          0          0          0      GPIO   7  keys
 27:          0          0          0          0      GPIO  18  keys
ERR:       1677

and my uptime is:

root@OpenWrt:~# uptime
 15:16:50 up 3 days, 19:50,  load average: 0.02, 0.02, 0.00

No, I'm using an EdgeRouter X.

What I am thinking is that the CPU pointer overtakes the DMA pointer and gets stuck.
This may happen when the DMA is slowed down by PAUSE frames.
But when I tried to reproduce it, it turned out not to be that easy to do.
I was hoping that reducing the DMA ring size to 32 packets might trigger this timeout quicker.
But it doesn't.

That is my theory.

I added a bit more debug output to the timeout handler. The DMA poll_tx() now also checks whether the DMA done bit is set.
I also reduced the DMA ring size from 4096 to 256 entries to cut down the debug output when a timeout hits.

BTW: my code also enables the SFP port on the Ubiquiti ER-X-SFP.

A crash now looks like this.

I added these two lines. The first shows the status of the DMA done bit for every entry.

fe_tx_timeout: 00: 0x00000000 0x00000000
Set cpu pointer behind dma. idx = 15
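
Roughly, this is the kind of code that produces those two lines (just a sketch of the idea, not my exact patch; names like struct fe_priv, tx_dma[].txd2 and FE_REG_TX_CTX_IDX0 / FE_REG_TX_DTX_IDX0 are written from memory and may not match the current mtk_eth_soc source):

/* Sketch, not the exact patch: dump the per-descriptor DMA-done bit of the
 * TX ring, then as a recovery experiment move the CPU index back to one
 * slot behind the DMA index. Struct and register names are from memory. */
static void fe_dump_tx_done_bits(struct fe_priv *priv)
{
	struct fe_tx_ring *ring = &priv->tx_ring;
	int i;

	/* one line per pair of descriptors; bit 31 of txd2 is the DDONE bit */
	for (i = 0; i < ring->tx_ring_size; i += 2)
		pr_info("fe_tx_timeout: %02d: 0x%08x 0x%08x\n",
			i, ring->tx_dma[i].txd2, ring->tx_dma[i + 1].txd2);
}

static void fe_set_cpu_idx_behind_dma(struct fe_priv *priv)
{
	/* read the hardware DMA index and park the CPU index one slot behind it */
	u32 dtx = fe_reg_r32(FE_REG_TX_DTX_IDX0);
	u32 idx = (dtx + priv->tx_ring.tx_ring_size - 1) %
		  priv->tx_ring.tx_ring_size;

	pr_info("Set cpu pointer behind dma. idx = %u\n", idx);
	fe_reg_w32(idx, FE_REG_TX_CTX_IDX0);
}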

Full log:

[   34.529713] ------------[ cut here ]------------
[   34.538936] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0x1ac/0x324
[   34.555397] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[   34.569244] Modules linked in: pppoe ppp_async nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet pppox ppp_generic nft_set_rbtree nft_set_hash nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir_ipv4 nft_redir nft_quota nft_objref nft_numgen nft_nat nft_meta nft_masq_ipv4 nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_exthdr nft_ct nft_counter nft_chain_route_ipv6 nft_chain_route_ipv4 nft_chain_nat_ipv4 nf_tables_ipv6 nf_tables_ipv4 nf_tables_inet nf_tables nf_conntrack_ipv6 mt76x2e mt76x2_common mt76x02_lib mt76 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD wireguard ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot
[   34.710711]  ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda slhc nfnetlink nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat i2c_gpio i2c_algo_pca i2c_algo_bit gpio_pca953x i2c_dev ledtrig_usbport nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ip6_udp_tunnel udp_tunnel uas mmc_block usb_storage mtk_sd mmc_core leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd dwc3 sd_mod scsi_mod gpio_button_hotplug usbcore nls_base usb_common
[   34.833658] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.169 #0
[   34.845770] Stack : 00000000 8fea7040 ffffffff 80079724 80610000 805aaf94 00000000 00000000
[   34.862399]         80575178 8fc0bdcc 8fc3c70c 805e4a47 8056fb80 00000001 8fc0bd70 53261648
[   34.879028]         00000000 00000000 80b90000 00000000 80b88500 000000d7 00000007 00000000
[   34.895656]         00000000 00000000 000cb87a ffffffff 00000000 80610000 00000000 803c2b50
[   34.912284]         805a511c 00000140 00000001 8fea7040 00000000 802e7618 00000004 80b80004
[   34.928913]         ...
[   34.933771] Call Trace:
[   34.938648] [<8000c4d4>] show_stack+0x58/0x100
[   34.947491] [<804a8274>] dump_stack+0xa4/0xe0
[   34.956167] [<80031db0>] __warn+0xe0/0x140
[   34.964310] [<800319f4>] warn_slowpath_fmt+0x30/0x3c
[   34.974182] [<803c2b50>] dev_watchdog+0x1ac/0x324
[   34.983543] [<800913ac>] call_timer_fn.isra.28+0x24/0x84
[   34.994105] [<8009170c>] run_timer_softirq+0x1bc/0x248
[   35.004326] [<804c5898>] __do_softirq+0x128/0x2e8
[   35.013686] [<800367b0>] irq_exit+0xa8/0xc4
[   35.022007] [<8029a4c4>] plat_irq_dispatch+0xf0/0x13c
[   35.032051] [<800074c8>] except_vec_vi_end+0xb8/0xc4
[   35.041921] [<80008e48>] r4k_wait_irqoff+0x1c/0x24
[   35.051594] ---[ end trace 8c2028659346c630 ]---
[   35.060790] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[   35.073091] mtk_soc_eth 1e100000.ethernet eth0: dma_cfg:80000065
[   35.085049] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0, base=0dfca000, max=16, ctx=1, dtx=1, fdx=1, next=1
[   35.104784] mtk_soc_eth 1e100000.ethernet eth0: rx_ring=0, base=0da80000, max=0, calc=57, drx=58
[   35.122277] fe_tx_timeout: 00: 0x00000000 0x00000000
[   35.132149] Set cpu pointer behind dma. idx = 0
[   35.532124] mtk_soc_eth 1e100000.ethernet: 0x100 = 0x6060000c, 0x10c = 0x80818
[   35.560414] mtk_soc_eth 1e100000.ethernet: PPE started
[   45.169725] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[   45.182051] mtk_soc_eth 1e100000.ethernet eth0: dma_cfg:80000067
[   45.194035] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0, base=0dfca000, max=16, ctx=1, dtx=0, fdx=0, next=1
[   45.213796] mtk_soc_eth 1e100000.ethernet eth0: rx_ring=0, base=0e320000, max=0, calc=39, drx=40
[   45.231312] fe_tx_timeout: 00: 0x00000000 0x00000000
[   45.241208] Set cpu pointer behind dma. idx = 15
[   45.642123] mtk_soc_eth 1e100000.ethernet: 0x100 = 0x5d60000c, 0x10c = 0x80818
[   45.670574] mtk_soc_eth 1e100000.ethernet: PPE started
[   55.169725] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[   55.182047] mtk_soc_eth 1e100000.ethernet eth0: dma_cfg:80000067
[   55.194029] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0, base=0dfca000, max=16, ctx=1, dtx=0, fdx=0, next=1
[   55.213790] mtk_soc_eth 1e100000.ethernet eth0: rx_ring=0, base=0e320000, max=0, calc=33, drx=34
[

I also have interrupt errors with the patched kernel:

root@dir860lb1:~# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  8:   21924295   21924267   21924268   21924259  MIPS GIC Local   1  timer
  9:     103543          0          0          0  MIPS GIC  63  IPI call
 10:          0   15989902          0          0  MIPS GIC  64  IPI call
 11:          0          0   28501957          0  MIPS GIC  65  IPI call
 12:          0          0          0   25001554  MIPS GIC  66  IPI call
 13:     503328          0          0          0  MIPS GIC  67  IPI resched
 14:          0     541985          0          0  MIPS GIC  68  IPI resched
 15:          0          0     622890          0  MIPS GIC  69  IPI resched
 16:          0          0          0     668015  MIPS GIC  70  IPI resched
 19:         12          0          0          0  MIPS GIC  33  ttyS0
 20:          0          0          0          0  MIPS GIC  29  xhci-hcd:usb1
 21:  183488361          0          0          0  MIPS GIC  10  1e100000.ethernet
 22:          3          0          0          0  MIPS GIC  30  gsw
 23:        490          0   26558211          0  MIPS GIC  11  mt76x2e
 24:    4989051          0          0          0  MIPS GIC  31  mt76x2e
 26:          0          0          0          0      GPIO   7  keys
 27:          0          0          0          0      GPIO  18  keys
ERR:       1042

But no kernel errors of any kind. Uptime is only 2 days 13 hours. Below 10 days of uptime it is not meaningful; between 10 and 15 days it is promising, and when you hit 20 days, that's when you know the problem is really gone.

Maybe those high interrupt errors are due to Wi-Fi.

I don't have any kind of kernel error for now. Uptime is 4 days 6 hours. But the first time I tested a firmware with the patch, a kernel panic occurred at 24 hours, and 4 days later it rebooted itself. My theory is that on the first boot the if condition did not detect the correct chip revision ID and fell through to the else branch parameters, while on this second boot it seems to have detected it correctly (maybe it's a random bug). Anyway, I have another firmware ready without the if condition, to test whether it fails again.

The network has constant WAN traffic (IPTV multicast included), mixes Fast Ethernet and Gigabit Ethernet devices (some connected directly to the router's switch), and uses VLAN tagged/untagged ports plus trunk links to other switches.
I hope to reach at least 15 days with a clean kernel log, and keep it that way.

Today, after 7 days and 18 hours without any errors, it froze :frowning:

This is with the unconditional patch already?

No, it is with the conditional one. But since I had no kernel crash, I was hopeful. Now I'm testing the patch without the conditional.

Great news: a version of the patch has just been committed to OpenWrt master.
https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=c8f8e59816eca49d776562d2d302bf990a87faf0

Just installed the latest master with this patch in it. Let's see if it fixes it.

MOD: it did not help. The same error is still present...

I am testing with the patch without the if condition; after 4 days without errors, yesterday I got the same kernel panic again. This is on 18.06.7.

According to the patch comment, it only turns off flow control on port 5. Maybe we should switch it off on all ports and give that a try?
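
In case anyone wants to experiment with that, here is a rough sketch of the idea for gsw_mt7621.c. The register offset (0x3000 + 0x100 * port), the force-flow-control bit positions (4 = RX, 5 = TX) and the mt7530_mdio_r32()/mt7530_mdio_w32() helpers are quoted from memory, so treat them as assumptions to be checked against the driver source and the MT7530 datasheet before flashing anything:

/* Sketch only: clear the force TX/RX flow-control bits in every port's PMCR
 * instead of only port 5. Offsets, bit positions and helper names are
 * assumptions (from memory), not verified against the current driver. */
#define GSW_REG_PORT_PMCR(p)	(0x3000 + (p) * 0x100)	/* assumed offset */
#define PMCR_FORCE_RX_FC	BIT(4)			/* assumed bit */
#define PMCR_FORCE_TX_FC	BIT(5)			/* assumed bit */

static void gsw_disable_port_flow_control(struct mt7620_gsw *gsw)
{
	u32 pmcr;
	int port;

	/* ports 0-4 are the PHY ports, 5 and 6 face the SoC GMACs */
	for (port = 0; port <= 6; port++) {
		pmcr = mt7530_mdio_r32(gsw, GSW_REG_PORT_PMCR(port));
		pmcr &= ~(PMCR_FORCE_RX_FC | PMCR_FORCE_TX_FC);
		mt7530_mdio_w32(gsw, GSW_REG_PORT_PMCR(port), pmcr);
	}
}

No idea whether that would actually change anything, but it would at least extend the experiment from the patch to the other ports.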

The patch in the master branch does not seem to be enough for the timeout to go away; I just encountered one crash.

The patch in the master branch is actually the same as the one with the if condition, I suppose? I am asking because I don't recall noticing any crashes with the if-condition patch.

Not sure if the new drivers in the 'RFT' 4.19 branch are able to eliminate the crashes; I didn't run a long test with it. Perhaps it deserves another round of long-term testing.

As neheb explained, the current RFT 4.19 branch is not going to resolve this, as it needs the driver from linux-next to be backported, and I think the current RFT branch does not have that.

Nonetheless: given how many of us still have issues with this switch driver, it would be lovely to see it treated as a priority. When I bought this router almost 3 years ago it was the crappy Wi-Fi driver, now it's the switch chip...

Testing 19.07.1 without the conditional. 18.06.7 causes reboots and freezes.

If it continues to fail, I may try the reverse: FC ON and/or removing the line that sets the RX length to 2K.

I tested the latest master (without the conditional), and it also produced an error and a crash within 24 hours...