It seems to be happening every hour now for me. Any idea how to extract kernel logging after power cycling from a previous boot? I can't access the device through luci/ssh.
Try to disable packet steering (RPS) that was enabled by default a few days ago.
I am using RPS for 2 months now on a DIR-860L and it does not create any issue, at least not on my end (with kernel 5.4).
I got tx timed out with RPS pretty easily with my 1Gbps uplink and kernel 5.4. After disabling RPS so far not once.
Maybe you want to try with latest master, as there was a few additions in the last 23 days. I only have 300Mbps uplink, so I cannot test up to a gig on the uplink side.
Already tested. Performance increased after added optimizations to ethernet driver, but transmit timed out is still there with RPS enabled and transfer speed around ~450 Mbps:
[274768.477705] ------------[ cut here ]------------
[274768.487131] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x2fc/0x304
[274768.503782] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[274768.517852] Modules linked in: mt7615e mt7615_common mt76 mac80211 iptable_nat ipt_REJECT cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_MASQUERADE xt_LOG xt_HL xt_FLOWOFFLOAD xt_DSCP xt_CT xt_CLASSIFY nf_reject_ipv4 nf_nat_ftp nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack_ftp nf_conncount iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables compat act_connmark nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ifb leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd gpio_button_hotplug usbcore nls_base usb_common
[274768.690284] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G W 5.4.46 #0
[274768.705005] Stack : ffffffff 80077050 80670000 80673774 806d0000 8067373c 80672890 8fc13db4
[274768.721815] 80810000 8fc3ccc4 806bad03 8060aaf4 00000003 00000001 8fc13d58 48e5ef65
[274768.738611] 00000000 00000000 80850000 00000000 00000030 0000019d 20202020 20202020
[274768.755405] 00000000 00003566 00000000 20202020 00000000 806d0000 00000000 80438604
[274768.772214] 00000009 00000003 00200000 00000122 00000003 80324bf0 0000000c 8081000c
[274768.789026] ...
[274768.794054] Call Trace:
[274768.799110] [<8000b72c>] show_stack+0x30/0x100
[274768.808134] [<80553484>] dump_stack+0xa4/0xdc
[274768.816978] [<80028058>] __warn+0xc0/0x10c
[274768.825296] [<80028130>] warn_slowpath_fmt+0x8c/0xac
[274768.835359] [<80438604>] dev_watchdog+0x2fc/0x304
[274768.844888] [<80090210>] call_timer_fn.isra.34+0x20/0x90
[274768.855621] [<80090458>] run_timer_softirq+0x1d8/0x230
[274768.866022] [<80573fec>] __do_softirq+0x16c/0x334
[274768.875562] [<8002c7a4>] irq_exit+0x98/0xb0
[274768.884059] [<802c6860>] plat_irq_dispatch+0x64/0x104
[274768.894283] [<80006de8>] except_vec_vi_end+0xb8/0xc4
[274768.904333] [<805736a0>] r4k_wait_irqoff+0x1c/0x24
[274768.914377] ---[ end trace b5b1f56d83bf2de2 ]---
[274768.923778] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[274768.938362] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
[274768.976925] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
[274768.993071] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[274975.322681] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[274975.335751] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
[274975.374134] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
[274975.390317] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
Ethernet recovers pretty quickly, I don't lose access to router, but I didn't test for a long uptime with many timed out errors. Without RPS I got only ~240 Mbps, but it is stable.
P.S. I don't used software offloading in this tests.
Well, I am using RPS and SW offload, and had no issues with it. Without RPS the speed is quite bad.
What you can try is set this in your rc.local:
# echo e > /sys/class/net/eth0/queues/rx-0/rps_cpus
This is how I use it. The trick is that this is not allowing the RPS to work on all 4 threads, only on the first three. This allows one core to be kept free for other processes/scheduling/IRQ etc. maybe this is why I dont see any of these errors. The master branch is using "f" instead of "e" which allows RPS on all cores.
Will update my build to latest master today.
Disabling RPS didn't help. Perhaps it became a bit better but it's difficult to say. My device has two threads, perhaps limiting RPS to a single thread will help. Will give it a go later this evening.
Anyway, currently I am using an ugly workaround whenever it drops. I am using watchcat which looks at my access point to determine if it needs to restart
I use my own startup script which sets affinity for ethernet and wifi cards (2, 4, 8) and RPS for ethernet with value '5' to use only additional thread with cpu0 and thread from 2.4G wifi with cpu1. In my tests this setting is more optimal than using 3/4 threads like with your 'e' value - efficiency isn't much lower and 5G wifi has more performance at the same time. I even tried to use RPS for wlan interfaces (yeah, it helps in one my test case), but stopped while waiting for the mt76 interrupt errors bug to be fixed.
It's been a while since this bug has last been discussed, and the master branch has progressed substantially during that time. Is there anyone that used to have this issue, but for whom it is now 100% fixed? If so, what build are you running that fixed it for you? I am running a snapshot of the 19.07 branch that includes the pause frame patch, but it is still occuring every now and then. The router hasn't crashed, which is an improvement, but the router does not allow traffic for up to 30 minutes when the bug is triggered.
It's still there. I can also trigger it by unplugging my work laptop from the USB-C/ Ethernet (I know oddly specific but very reproducible)
Interesting! And on which build are you? Being able to reliably trigger it might prove useful for getting it fixed (hopefully) down the road.
OpenWrt 19.07-SNAPSHOT r11109-18cf2383b6 / LuCI openwrt-19.07 branch git-20.167.61968-87da00a
It's 19.07.3 with a couple of fixes pulled in for this chipset. (Also I didn't know you are a fellow dutchie).
I tested 19.7 before commit f4985a2 - 3-5 minutes and router rebooted under heavy load on 100Mbps uplink
Linux version 4.14.187 (xx@debian) (gcc version 7.5.0 (OpenWrt GCC 7.5.0 r11093-f99b1d1)) #0 SMP Sun Jul 5 22:12:04 2020
CPU 0 Unable to handle kernel paging request at virtual address ffffad99, epc == 8ead1c38, ra == 8ead1c84
Oops[#1]:
and reboot
Release version 19.7.03 - up to 30 days uptime, but there is always a transmit queue 0 timed out in kernel log.
But no reboots
I would like to kindly ask any of you if there is any news on the HW offload for mt7621 on DSA? Couple months passed since the last time, thought I would ask
There is none.
Sorry to hear that. I tried to follow the known sources (mainly the "next" branch of the kernel git), but found nothing in this regard. I hoped that @nbd might have some sort of an update on this topic.
I'm currently using 5.4.60 with snapshot r14281-09faa73c53 over a ZBT-WG3526, I'm getting the same "transmit timed out", almost instantly after boot, but placing this on rc.local:
# wifi0 - 5GHz
echo 2 > /proc/irq/25/smp_affinity
# wifi1 - 2GHz
echo 1 > /proc/irq/24/smp_affinity
# lan
echo 5 > /proc/irq/22/smp_affinity
# RPS
echo 8 > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 6 > /sys/class/net/wlan0/queues/rx-0/rps_cpus
echo 4 > /sys/class/net/wlan1/queues/rx-0/rps_cpus
The interrupts were reduced drastically:
CPU0 CPU1 CPU2 CPU3
8: 677031 677109 677119 677087 MIPS GIC Local 1 timer
9: 49624 0 0 0 MIPS GIC 63 IPI call
10: 0 59795 0 0 MIPS GIC 64 IPI call
11: 0 0 50433 0 MIPS GIC 65 IPI call
12: 0 0 0 43069 MIPS GIC 66 IPI call
13: 262118 0 0 0 MIPS GIC 67 IPI resched
14: 0 90961 0 0 MIPS GIC 68 IPI resched
15: 0 0 121369 0 MIPS GIC 69 IPI resched
16: 0 0 0 100677 MIPS GIC 70 IPI resched
17: 0 0 0 0 MIPS GIC 19 1e000600.gpio-bank0, 1e000600.gpio-bank1, 1e000600.gpio-bank2
19: 15 0 0 0 MIPS GIC 33 ttyS0
20: 0 0 0 0 MIPS GIC 27 1e130000.sdhci
21: 0 0 0 0 MIPS GIC 29 xhci-hcd:usb1
22: 256614 0 0 0 MIPS GIC 10 1e100000.ethernet
24: 240 0 131972 0 MIPS GIC 11 mt7603e
25: 2 112 0 353291 MIPS GIC 31 mt76x2e
26: 0 0 0 0 MIPS GIC 32 ahci[0000:03:00.0]
27: 0 0 0 0 1e000600.gpio 18 keys
ERR: 130
also, I have 2 hours uptime right now and I don't see any kernel issues, this is an old device that I try to recover, and this is the first time (what I remember) that so long has passed without an error on dmesg.
I remember this error in the past (that's why I stopped using this device, almost 4 years ago), it's incredible that this error persists.
Sounds bad.
When the DSA transition happened, people were reporting that this was fixed. Now people report that it's broken. Guess an upstream kernel change broke it...
WIth this commit from nbd's staging tree merged into master i don't get any timeouts. Iperf with 5GB data runs smoothly without hanging. Xiaomi 3G
https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=commit;h=59d236f11df7539cfd9524a8d7857faefe40f74b