Mtk_soc_eth watchdog timeout after r11573

@apocalypse this project does not have massive funding. Meaning it makes sense that old "releases" are unstable. Not only that, figuring out this issue requires understanding of the kernel's networking stack and driver infrastructure. I doubt anyone would have the patience for such a thing.

In my experience, LEDE 17.01 was the most stable for mt7621. kernel 4.4 did not change enough to cause problems with the out of tree drivers used for mt7621.

@db260179 hardware bug is probably correct. And yes, the reason people want to use 19.07 is because of HWNAT. Adding HWNAT support for the upstream driver is apparently trivial but no one has done it yet. swconfig also happens to faster than DSA but stability is more important than speed.

Out of tree driver has no future. I pushed to get the upstream driver in the tree because of issues like this. There are more developers on kernel.org than openwrt.org.

master is quite good actually. non upstream PCI, ethernet, swconfig, GPIO, and SPI drivers were removed. I'm still pushing to get the mmc driver removed but it seems some hardware doesn't work with the upstream one.

Off-topic: there's a crypto driver in development for mt7621: https://github.com/vschagen/mtk-eip93

Only really useful for people running dm-crypt.

1 Like

@neheb Even on LEDE 17.01 i had these problems, also on 18.06. Eth0: transmit timed out on ER-X with LEDE 17.01.7

For now in master it is running 5 days without any problem. What do you mean that swconfig driver is faster?

I am running my own build based on master with kernel patches to overclock the CPU and more magic that I will not discuss in detail here. However, I see that my WG3526 works very "stable" (in the "does not crash" sense of the word). In fact, while there are several devices randomly crashing since the kernel patch 82, this one does keep running uninterrupted for the whole week (my releases are weekly, so I can't say if it will work for a month, for example).

Take a look at my kernel log:

[    0.000000] Linux version 5.4.82 (notengobattery@gmail.com) (gcc version 10.2.0 (OpenWrt GCC 10.2.0 r0+15161-36449b38ae)) #0 SMP PREEMPT Sat Dec 12 22:18:12 2020
[    0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
[    0.000000] printk: bootconsole [early0] enabled
[    0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
[    0.000000] MIPS: machine is Zbtlink ZBT-WG3526 (Glucose)
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] VPE topology {2,2} total 4
[    0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000000fffffff]
[    0.000000]   HighMem  [mem 0x0000000010000000-0x0000000023ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001bffffff]
[    0.000000]   node   0: [mem 0x0000000020000000-0x0000000023ffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000023ffffff]
[    0.000000] On node 0 totalpages: 131072
[    0.000000]   Normal zone: 576 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 65536 pages, LIFO batch:15
[    0.000000]   HighMem zone: 65536 pages, LIFO batch:15
[    0.000000] percpu: Embedded 14 pages/cpu s26992 r8192 d22160 u57344
[    0.000000] pcpu-alloc: s26992 r8192 d22160 u57344 alloc=14*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130496
[    0.000000] Kernel command line: console=ttyS0,115200 rootfstype=squashfs,jffs2 rcutree.kthread_prio=49
[    0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[    0.000000] Writing ErrCtl register=000213a0
[    0.000000] Readback ErrCtl register=000213a0
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504684K/524288K available (6365K kernel code, 204K rwdata, 1280K rodata, 1228K init, 233K bss, 19604K reserved, 0K cma-reserved, 262144K highmem)
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] rcu: Preemptible hierarchical RCU implementation.
[    0.000000] 	Tasks RCU enabled.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 256
[    0.000000] random: get_random_bytes called from start_kernel+0x34c/0x570 with crng_init=0
[    0.000000] CPU Overclock PLL: 0xc2005bb2
[    0.000000] CPU Clock: 1200MHz
[    0.000000] clocksource: GIC: mask: 0xffffffffffffffff max_cycles: 0x114c1bbbeec, max_idle_ns: 440795229572 ns
[    0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 3185434077 ns
[    0.000008] sched_clock: 32 bits at 600MHz, resolution 1ns, wraps every 3579139583ns
[    0.007813] Calibrating delay loop... 798.72 BogoMIPS (lpj=1597440)
[    0.038001] pid_max: default: 32768 minimum: 301
[    0.042753] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.049969] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.060164] rcu: Hierarchical SRCU implementation.
[    0.065606] smp: Bringing up secondary CPUs ...
[    2.948034] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    2.948043] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    2.948052] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    2.948141] CPU1 revision is: 0001992f (MIPS 1004Kc)
[    0.094350] Synchronize counters for CPU 1: done.
[    3.009179] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    3.009188] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    3.009195] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    3.009242] CPU2 revision is: 0001992f (MIPS 1004Kc)
[    0.155032] Synchronize counters for CPU 2: done.
[    3.062594] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    3.062601] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    3.062608] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    3.062656] CPU3 revision is: 0001992f (MIPS 1004Kc)
[    0.208891] Synchronize counters for CPU 3: done.
[    0.238739] smp: Brought up 1 node, 4 CPUs
[    0.246512] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.256253] futex hash table entries: 1024 (order: 3, 32768 bytes, linear)
[    0.263328] pinctrl core: initialized pinctrl subsystem
[    0.270131] NET: Registered protocol family 16
[    0.285533] FPU Affinity set after 6384 emulations
[    0.646613] printk: console [pstore-1] enabled
[    0.651344] pstore: Registered ramoops as persistent store backend
[    0.657631] ramoops: using 0x480000@0x7b80000, ecc: 16
[    0.677395] random: fast init done
[    0.683326] workqueue: max_active 576 requested for napi_workq is out of range, clamping between 1 and 512
[    0.694760] clocksource: Switched to clocksource GIC
[    0.701488] NET: Registered protocol family 2
[    0.706794] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 6144 bytes, linear)
[    0.715293] TCP established hash table entries: 2048 (order: 1, 8192 bytes, linear)
[    0.722984] TCP bind hash table entries: 2048 (order: 2, 16384 bytes, linear)
[    0.730225] TCP: Hash tables configured (established 2048 bind 2048)
[    0.736816] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.743388] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.750950] NET: Registered protocol family 1
[    0.755374] PCI: CLS 0 bytes, default 32
[    0.854714] 4 CPUs re-calibrate udelay(lpj = 1593344)
[    0.861361] workingset: timestamp_bits=14 max_order=17 bucket_order=3
[    0.878392] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.884271] squashfs: patched 4.0 (2020/11/10) Oever Gonzalez
[    0.890101] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[    0.901361] bounce: pool size: 64 pages
[    0.907281] mt7621_gpio 1e000600.gpio: registering 32 gpios
[    0.913205] mt7621_gpio 1e000600.gpio: registering 32 gpios
[    0.919189] mt7621_gpio 1e000600.gpio: registering 32 gpios
[    0.925583] Serial: 8250/16550 driver, 3 ports, IRQ sharing disabled
[    0.933281] printk: console [ttyS0] disabled
[    0.937677] 1e000c00.uartlite: ttyS0 at MMIO 0x1e000c00 (irq = 19, base_baud = 3125000) is a 16550A
[    0.946884] printk: console [ttyS0] enabled
[    0.955244] printk: bootconsole [early0] disabled
[    0.967349] spi-mt7621 1e000b00.spi: sys_freq: 300000000
[    0.983458] spi-nor spi0.0: w25q128 (16384 Kbytes)
[    0.988490] 4 fixed-partitions partitions found on MTD device spi0.0
[    0.994947] Creating 4 MTD partitions on "spi0.0":
[    0.999928] 0x000000000000-0x000000030000 : "u-boot"
[    1.006190] 0x000000030000-0x000000040000 : "u-boot-env"
[    1.012859] 0x000000040000-0x000000050000 : "factory"
[    1.019307] 0x000000050000-0x000001000000 : "firmware"
[    1.025945] 2 uimage-fw partitions found on MTD device firmware
[    1.032136] Creating 2 MTD partitions on "firmware":
[    1.037195] 0x000000000000-0x00000027aea9 : "kernel"
[    1.043511] 0x00000027aea9-0x000000fb0000 : "rootfs"
[    1.049800] mtd: device 5 (rootfs) set to be root filesystem
[    1.055854] 1 squashfs-split partitions found on MTD device rootfs
[    1.062127] 0x000000c30000-0x000000fb0000 : "rootfs_data"
[    1.070085] libphy: Fixed MDIO Bus: probed
[    1.101118] libphy: mdio: probed
[    1.104684] mt7530 mdio-bus:1f: MT7530 adapts as multi-chip module
[    1.114898] mtk_soc_eth 1e100000.ethernet eth0: mediatek frame engine at 0xbe100000, irq 22
[    1.124629] i2c-mt7621 1e000900.i2c: clock 100 kHz
[    1.131069] mt7621-pci 1e140000.pcie: Parsing DT failed
[    1.139141] NET: Registered protocol family 10
[    1.145286] Segment Routing with IPv6
[    1.149168] NET: Registered protocol family 17
[    1.153922] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[    1.167296] 8021q: 802.1Q VLAN Support v1.8
[    1.171911] registered taskstats version 1
[    1.176591] zswap: default zpool zbud not available
[    1.181567] zswap: pool creation failed
[    1.189206] pstore: Using crash dump compression: zstd
[    1.195348] mt7530 mdio-bus:1f: MT7530 adapts as multi-chip module
[    1.214333] libphy: dsa slave smi: probed
[    1.219041] mt7530 mdio-bus:1f lan1 (uninitialized): PHY [dsa-0.0:00] driver [Generic PHY]
[    1.228782] mt7530 mdio-bus:1f lan2 (uninitialized): PHY [dsa-0.0:01] driver [Generic PHY]
[    1.238668] mt7530 mdio-bus:1f lan3 (uninitialized): PHY [dsa-0.0:02] driver [Generic PHY]
[    1.248509] mt7530 mdio-bus:1f lan4 (uninitialized): PHY [dsa-0.0:03] driver [Generic PHY]
[    1.258451] mt7530 mdio-bus:1f wan (uninitialized): PHY [dsa-0.0:04] driver [Generic PHY]
[    1.268228] mt7530 mdio-bus:1f: configuring for fixed/rgmii link mode
[    1.279702] DSA: tree 0 setup
[    1.283141] rt2880-pinmux pinctrl: pcie is already enabled
[    1.288818] mt7621-pci 1e140000.pcie: Error applying setting, reverse things back
[    1.296559] mt7621-pci-phy 1e149000.pcie-phy: PHY for 0xbe149000 (dual port = 1)
[    1.304300] mt7621-pci-phy 1e14a000.pcie-phy: PHY for 0xbe14a000 (dual port = 0)
[    1.412000] mt7621-pci-phy 1e149000.pcie-phy: Xtal is 40MHz
[    1.417663] mt7621-pci-phy 1e14a000.pcie-phy: Xtal is 40MHz
[    1.523365] mt7621-pci 1e140000.pcie: pcie2 no card, disable it (RST & CLK)
[    1.530401] mt7621-pci 1e140000.pcie: PCIE0 enabled
[    1.535443] mt7621-pci 1e140000.pcie: PCIE1 enabled
[    1.540407] mt7621-pci 1e140000.pcie: PCI coherence region base: 0x60000000, mask/settings: 0xf0000002
[    1.550022] mt7621-pci 1e140000.pcie: PCI host bridge to bus 0000:00
[    1.556512] pci_bus 0000:00: root bus resource [io  0x1e160000-0x1e16ffff]
[    1.563577] pci_bus 0000:00: root bus resource [mem 0x60000000-0x6fffffff]
[    1.570615] pci_bus 0000:00: root bus resource [bus 00-ff]
[    1.576251] pci 0000:00:00.0: [0e8d:0801] type 01 class 0x060400
[    1.582450] pci 0000:00:00.0: reg 0x10: [mem 0x00000000-0x7fffffff]
[    1.588825] pci 0000:00:00.0: reg 0x14: [mem 0x60400000-0x6040ffff]
[    1.595332] pci 0000:00:00.0: supports D1
[    1.599431] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[    1.605707] pci 0000:00:01.0: [0e8d:0801] type 01 class 0x060400
[    1.611883] pci 0000:00:01.0: reg 0x10: [mem 0x00000000-0x7fffffff]
[    1.618318] pci 0000:00:01.0: reg 0x14: [mem 0x60410000-0x6041ffff]
[    1.624732] pci 0000:00:01.0: supports D1
[    1.628913] pci 0000:00:01.0: PME# supported from D0 D1 D3hot
[    1.635919] pci 0000:01:00.0: [14c3:7603] type 00 class 0x028000
[    1.642167] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x000fffff]
[    1.648689] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
[    1.655969] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
[    1.661308] pci 0000:00:00.0:   bridge window [io  0x0000-0x0fff]
[    1.667607] pci 0000:00:00.0:   bridge window [mem 0x60000000-0x600fffff]
[    1.674475] pci 0000:00:00.0:   bridge window [mem 0x60100000-0x601fffff pref]
[    1.681881] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    1.688871] pci 0000:02:00.0: [14c3:7662] type 00 class 0x028000
[    1.695060] pci 0000:02:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit]
[    1.702046] pci 0000:02:00.0: reg 0x30: [mem 0x00000000-0x0000ffff pref]
[    1.708941] pci 0000:02:00.0: PME# supported from D0 D3hot D3cold
[    1.716225] pci 0000:00:01.0: PCI bridge to [bus 02-ff]
[    1.721559] pci 0000:00:01.0:   bridge window [io  0x0000-0x0fff]
[    1.727856] pci 0000:00:01.0:   bridge window [mem 0x60200000-0x602fffff]
[    1.734832] pci 0000:00:01.0:   bridge window [mem 0x60300000-0x603fffff pref]
[    1.742136] pci_bus 0000:02: busn_res: [bus 02-ff] end is updated to 02
[    1.748978] pci 0000:00:00.0: BAR 0: no space for [mem size 0x80000000]
[    1.755692] pci 0000:00:00.0: BAR 0: failed to assign [mem size 0x80000000]
[    1.762840] pci 0000:00:01.0: BAR 0: no space for [mem size 0x80000000]
[    1.769613] pci 0000:00:01.0: BAR 0: failed to assign [mem size 0x80000000]
[    1.776676] pci 0000:00:00.0: BAR 8: assigned [mem 0x60000000-0x600fffff]
[    1.783647] pci 0000:00:00.0: BAR 9: assigned [mem 0x60100000-0x601fffff pref]
[    1.790948] pci 0000:00:01.0: BAR 8: assigned [mem 0x60200000-0x602fffff]
[    1.797895] pci 0000:00:01.0: BAR 9: assigned [mem 0x60300000-0x603fffff pref]
[    1.805277] pci 0000:00:00.0: BAR 1: assigned [mem 0x60400000-0x6040ffff]
[    1.812149] pci 0000:00:01.0: BAR 1: assigned [mem 0x60410000-0x6041ffff]
[    1.819101] pci 0000:00:00.0: BAR 7: assigned [io  0x1e160000-0x1e160fff]
[    1.825965] pci 0000:00:01.0: BAR 7: assigned [io  0x1e161000-0x1e161fff]
[    1.832920] pci 0000:01:00.0: BAR 0: assigned [mem 0x60000000-0x600fffff]
[    1.839872] pci 0000:00:00.0: PCI bridge to [bus 01]
[    1.844927] pci 0000:00:00.0:   bridge window [io  0x1e160000-0x1e160fff]
[    1.851876] pci 0000:00:00.0:   bridge window [mem 0x60000000-0x600fffff]
[    1.858745] pci 0000:00:00.0:   bridge window [mem 0x60100000-0x601fffff pref]
[    1.866130] pci 0000:02:00.0: BAR 0: assigned [mem 0x60200000-0x602fffff 64bit]
[    1.873608] pci 0000:02:00.0: BAR 6: assigned [mem 0x60300000-0x6030ffff pref]
[    1.880905] pci 0000:00:01.0: PCI bridge to [bus 02]
[    1.886036] pci 0000:00:01.0:   bridge window [io  0x1e161000-0x1e161fff]
[    1.892901] pci 0000:00:01.0:   bridge window [mem 0x60200000-0x602fffff]
[    1.899854] pci 0000:00:01.0:   bridge window [mem 0x60300000-0x603fffff pref]
[    1.907696] mt7530 mdio-bus:1f: Link is Up - 1Gbps/Full - flow control off
[    1.915237] hctosys: unable to open rtc device (rtc0)
[    1.939103] VFS: Mounted root (squashfs filesystem) readonly on device 31:5.
[    1.951085] Freeing unused kernel memory: 1228K
[    1.955913] This architecture does not have kernel memory protection.
[    1.962433] Run /sbin/init as init process
[    3.726884] init: Console is alive
[    3.730608] init: - watchdog -
[    6.367227] kmodloader: loading kernel modules from /etc/modules-boot.d/*
[    6.961009] io scheduler bfq registered
[    6.966040] io scheduler kyber registered
[    6.971110] io scheduler mq-deadline registered
[    6.986919] usbcore: registered new interface driver usbfs
[    6.992654] usbcore: registered new interface driver hub
[    6.998393] usbcore: registered new device driver usb
[    7.051999] SCSI subsystem initialized
[    7.067688] libata version 3.00 loaded.
[    7.078308] xhci-mtk 1e1c0000.xhci: 1e1c0000.xhci supply vbus not found, using dummy regulator
[    7.087396] xhci-mtk 1e1c0000.xhci: 1e1c0000.xhci supply vusb33 not found, using dummy regulator
[    7.096572] xhci-mtk 1e1c0000.xhci: xHCI Host Controller
[    7.102013] xhci-mtk 1e1c0000.xhci: new USB bus registered, assigned bus number 1
[    7.114911] xhci-mtk 1e1c0000.xhci: hcc params 0x01401198 hci version 0x96 quirks 0x0000000000210010
[    7.124302] xhci-mtk 1e1c0000.xhci: irq 21, io mem 0x1e1c0000
[    7.131100] hub 1-0:1.0: USB hub found
[    7.135161] hub 1-0:1.0: 2 ports detected
[    7.139730] xhci-mtk 1e1c0000.xhci: xHCI Host Controller
[    7.145311] xhci-mtk 1e1c0000.xhci: new USB bus registered, assigned bus number 2
[    7.152915] xhci-mtk 1e1c0000.xhci: Host supports USB 3.0 SuperSpeed
[    7.159554] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    7.168375] hub 2-0:1.0: USB hub found
[    7.172338] hub 2-0:1.0: 1 port detected
[    7.218015] usbcore: registered new interface driver usb-storage
[    7.228473] usbcore: registered new interface driver uas
[    7.234291] kmodloader: done loading kernel modules from /etc/modules-boot.d/*
[    7.251101] init: - preinit -
[    7.265526] mmc0: new high speed SDXC card at address 0007
[    7.277556] mmcblk0: mmc0:0007 SD64G 57.9 GiB 
[    7.304366]  mmcblk0: p1
[    7.514920] usb 2-1: new SuperSpeed Gen 1 USB device number 2 using xhci-mtk
[    7.545400] hub 2-1:1.0: USB hub found
[    7.549900] hub 2-1:1.0: 4 ports detected
[    7.783566] random: procd: uninitialized urandom read (4 bytes read)
[    7.798800] usb 1-1: new high-speed USB device number 2 using xhci-mtk
[    7.960383] hub 1-1:1.0: USB hub found
[    7.964892] hub 1-1:1.0: 4 ports detected
[    8.095639] extroot-attach: running bootfs attach pre-init script...
[    8.246792] usb 1-2: new high-speed USB device number 3 using xhci-mtk
[    9.171137] usb 2-1.4: new SuperSpeed Gen 1 USB device number 3 using xhci-mtk
[    9.289717] jffs2: notice: (632) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
[    9.307113] extroot-attach: successfully mounted the boot file system
[    9.855817] extroot-attach: successfully mounted pstore
[    9.897921] random: crng init done
[   10.396878] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
[   10.405446] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   10.414123] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   11.650879] mt7530 mdio-bus:1f lan1: configuring for phy/gmii link mode
[   11.658315] 8021q: adding VLAN 0 to HW filter on device lan1
[   16.054910] mount_root: loading kmods from internal overlay
[   16.092402] kmodloader: loading kernel modules from //etc/modules-boot.d/*
[   16.102713] kmodloader: done loading kernel modules from //etc/modules-boot.d/*
[   17.248117] jffs2: notice: (723) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
[   17.264450] block: attempting to load /tmp/jffs_cfg/upper/etc/config/fstab
[   17.275116] block: check_filesystem: /usr/sbin/fsck.f2fs not found
[   17.883121] F2FS-fs (mmcblk0p1): Mounted with checkpoint version = 531ea93c
[   17.903317] mount_root: switched to extroot
[   17.912420] urandom-seed: Seeding with /etc/urandom.seed
[   17.940651] urandom-seed: Seeding with /etc/urandom.log
[   17.974143] procd: - early -
[   17.977293] procd: - watchdog -
[   18.695006] procd: - watchdog -
[   18.701593] procd: - ubus -
[   19.060681] procd: - init -
[   21.296465] I/O scheduler: non-rotational device 'mtdblock4' was '[none] bfq kyber mq-deadline ', now 'bfq [kyber] mq-deadline none'
[   21.361686] urngd: v1.0.2 started.
[   21.430397] kmodloader: loading kernel modules from /etc/modules.d/*
[   21.448866] zram: Added device: zram0
[   21.457166] zbud: loaded
[   21.542825] I/O scheduler: non-rotational device 'mtdblock2' was '[none] bfq kyber mq-deadline ', now 'bfq [kyber] mq-deadline none'
[   22.123664] loop: module loaded
[   22.389883] Mirror/redirect action on
[   22.406626] u32 classifier
[   22.409565]     input device check on
[   22.413440]     Actions configured
[   22.414122] I/O scheduler: non-rotational device 'mtdblock0' was '[none] bfq kyber mq-deadline ', now 'bfq [kyber] mq-deadline none'
[   22.434270] usbcore: registered new interface driver cdc_wdm
[   22.453423] Loading modules backported from Linux version v5.8-0-gbcf876870b95
[   22.460892] Backport generated by backports.git v5.8-1-0-g79400d9e
[   22.520328] usbcore: registered new interface driver ums-alauda
[   22.528261] usbcore: registered new interface driver ums-cypress
[   22.536255] usbcore: registered new interface driver ums-datafab
[   22.544109] usbcore: registered new interface driver ums-freecom
[   22.551983] usbcore: registered new interface driver ums-isd200
[   22.559812] usbcore: registered new interface driver ums-jumpshot
[   22.567572] usbcore: registered new interface driver ums-karma
[   22.575540] usbcore: registered new interface driver ums-sddr09
[   22.583283] usbcore: registered new interface driver ums-sddr55
[   22.591402] usbcore: registered new interface driver ums-usbat
[   22.605481] usbcore: registered new interface driver usbserial_generic
[   22.612511] usbserial: USB Serial support registered for generic
[   22.648742] xt_time: kernel timezone is -0000
[   22.654388] I/O scheduler: non-rotational device 'mtdblock5' was '[none] bfq kyber mq-deadline ', now 'bfq [kyber] mq-deadline none'
[   22.657236] usbcore: registered new interface driver asix
[   22.811049] I/O scheduler: non-rotational device 'mtdblock3' was '[none] bfq kyber mq-deadline ', now 'bfq [kyber] mq-deadline none'
[   22.983170] I/O scheduler: non-rotational device 'mtdblock1' was '[none] bfq kyber mq-deadline ', now 'bfq [kyber] mq-deadline none'
[   23.007279] ax88179_178a 2-1.4:1.0 eth1: register 'ax88179_178a' at usb-1e1c0000.xhci-1.4, ASIX AX88179 USB 3.0 Gigabit Ethernet, 00:05:1b:d4:6a:b2
[   23.033939] usbcore: registered new interface driver ax88179_178a
[   23.129778] mt7621-pci 1e140000.pcie: bus=1 slot=0 irq=24
[   23.135602] pci 0000:00:00.0: enabling device (0006 -> 0007)
[   23.141545] mt7603e 0000:01:00.0: enabling device (0000 -> 0002)
[   23.148241] mt7603e 0000:01:00.0: ASIC revision: 76030010
[   23.276920] I/O scheduler: non-rotational device 'mmcblk0' was 'bfq kyber [mq-deadline] none', now 'bfq [kyber] mq-deadline none'
[   23.380341] I/O scheduler: non-rotational device 'mtdblock6' was '[none] bfq kyber mq-deadline ', now 'bfq [kyber] mq-deadline none'
[   23.572387] I/O scheduler: non-rotational device 'loop0' was 'bfq kyber [mq-deadline] none', now 'bfq [kyber] mq-deadline none'
[   23.688593] I/O scheduler: non-rotational device 'loop1' was 'bfq kyber [mq-deadline] none', now 'bfq [kyber] mq-deadline none'
[   23.772003] mt7603e 0000:01:00.0: Firmware Version: ap_pcie
[   23.789453] mt7603e 0000:01:00.0: Build Time: 20160107100755
[   23.822857] mt7603e 0000:01:00.0: firmware init done
[   23.837244] I/O scheduler: non-rotational device 'loop2' was 'bfq kyber [mq-deadline] none', now 'bfq [kyber] mq-deadline none'
[   23.940464] I/O scheduler: non-rotational device 'loop3' was 'bfq kyber [mq-deadline] none', now 'bfq [kyber] mq-deadline none'
[   24.008348] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
[   24.033785] mt7621-pci 1e140000.pcie: bus=2 slot=1 irq=25
[   24.039362] pci 0000:00:01.0: enabling device (0006 -> 0007)
[   24.045191] mt76x2e 0000:02:00.0: enabling device (0000 -> 0002)
[   24.051514] mt76x2e 0000:02:00.0: ASIC revision: 76120044
[   24.104568] I/O scheduler: non-rotational device 'loop4' was 'bfq kyber [mq-deadline] none', now 'bfq [kyber] mq-deadline none'
[   24.212568] I/O scheduler: non-rotational device 'loop5' was 'bfq kyber [mq-deadline] none', now 'bfq [kyber] mq-deadline none'
[   24.344753] I/O scheduler: non-rotational device 'loop6' was 'bfq kyber [mq-deadline] none', now 'bfq [kyber] mq-deadline none'
[   24.455609] mt76x2e 0000:02:00.0: ROM patch build: 20141115060606a
[   24.477617] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[   24.483476] mt76x2e 0000:02:00.0: Build: 1
[   24.487755] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[   24.499346] I/O scheduler: non-rotational device 'loop7' was 'bfq kyber [mq-deadline] none', now 'bfq [kyber] mq-deadline none'
[   24.510847] mt76x2e 0000:02:00.0: Firmware running!
[   24.529501] ieee80211 phy1: Selected rate control algorithm 'minstrel_ht'
[   24.580276] PPP generic driver version 2.4.2
[   24.588036] NET: Registered protocol family 24
[   24.598909] qmi_wwan 1-2:1.4: cdc-wdm0: USB WDM device
[   24.607267] qmi_wwan 1-2:1.4 wwan0: register 'qmi_wwan' at usb-1e1c0000.xhci-2, WWAN/QMI device, 06:76:4e:7c:d9:8f
[   24.618287] usbcore: registered new interface driver qmi_wwan
[   24.651190] usbcore: registered new interface driver option
[   24.657237] usbserial: USB Serial support registered for GSM modem (1-port)
[   24.665095] option 1-2:1.0: GSM modem (1-port) converter detected
[   24.672080] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB0
[   24.679734] option 1-2:1.1: GSM modem (1-port) converter detected
[   24.686668] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB1
[   24.694369] option 1-2:1.2: GSM modem (1-port) converter detected
[   24.701392] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB2
[   24.709032] option 1-2:1.3: GSM modem (1-port) converter detected
[   24.715859] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB3
[   24.728914] kmodloader: done loading kernel modules from /etc/modules.d/*
[   29.793376] zram: setup backing device /dev/loop0
[   29.841985] zram0: detected capacity change from 0 to 1036120064
[   30.070937] Adding 1011832k swap on /dev/zram0.  Priority:1000 extents:1 across:1011832k SSDscFS
[   37.694635] mtk_soc_eth 1e100000.ethernet eth0: Link is Down
[   37.711848] mtk_soc_eth 1e100000.ethernet: PPE table busy
[   37.733966] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
[   37.742460] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   37.747580] mt7530 mdio-bus:1f lan1: configuring for phy/gmii link mode
[   37.759244] 8021q: adding VLAN 0 to HW filter on device lan1
[   37.769074] br-lan: port 1(lan1) entered blocking state
[   37.775871] br-lan: port 1(lan1) entered disabled state
[   37.783564] device lan1 entered promiscuous mode
[   37.788557] device eth0 entered promiscuous mode
[   37.847606] mt7530 mdio-bus:1f lan2: configuring for phy/gmii link mode
[   37.857204] 8021q: adding VLAN 0 to HW filter on device lan2
[   37.868053] br-lan: port 2(lan2) entered blocking state
[   37.874444] br-lan: port 2(lan2) entered disabled state
[   37.882972] device lan2 entered promiscuous mode
[   37.900174] mt7530 mdio-bus:1f lan3: configuring for phy/gmii link mode
[   37.908007] 8021q: adding VLAN 0 to HW filter on device lan3
[   37.918552] br-lan: port 3(lan3) entered blocking state
[   37.925568] br-lan: port 3(lan3) entered disabled state
[   37.935256] device lan3 entered promiscuous mode
[   37.955427] mt7530 mdio-bus:1f lan4: configuring for phy/gmii link mode
[   37.966797] 8021q: adding VLAN 0 to HW filter on device lan4
[   37.979537] br-lan: port 4(lan4) entered blocking state
[   37.985532] br-lan: port 4(lan4) entered disabled state
[   37.995444] device lan4 entered promiscuous mode
[   38.014263] mt7530 mdio-bus:1f wan: configuring for phy/gmii link mode
[   38.025862] 8021q: adding VLAN 0 to HW filter on device wan
[   38.036855] br-lan: port 5(wan) entered blocking state
[   38.046283] br-lan: port 5(wan) entered disabled state
[   38.055798] device wan entered promiscuous mode
[   38.252957] jffs2: notice: (4437) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
[   40.311800] br-lan: port 6(wlan0) entered blocking state
[   40.320117] br-lan: port 6(wlan0) entered disabled state
[   40.326348] device wlan0 entered promiscuous mode
[   40.781014] br-lan: port 7(wlan1) entered blocking state
[   40.786939] br-lan: port 7(wlan1) entered disabled state
[   40.800551] device wlan1 entered promiscuous mode
[   40.808057] br-lan: port 7(wlan1) entered blocking state
[   40.813651] br-lan: port 7(wlan1) entered forwarding state
[   40.828966] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
[   41.079701] ax88179_178a 2-1.4:1.0 eth1: ax88179 - Link status is: 1
[   41.098007] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   41.107462] mt7530 mdio-bus:1f wan: Link is Up - 1Gbps/Full - flow control rx/tx
[   41.115514] br-lan: port 5(wan) entered blocking state
[   41.121093] br-lan: port 5(wan) entered forwarding state
[   41.406478] br-lan: port 7(wlan1) entered disabled state
[   41.734488] IPv6: br-lan: IPv6 duplicate address fd00:192:168::1 used by e4:95:6e:47:1f:55 detected!
[   45.973245] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[   45.981094] br-lan: port 7(wlan1) entered blocking state
[   45.986681] br-lan: port 7(wlan1) entered forwarding state
[   47.823120] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[   47.830318] br-lan: port 6(wlan0) entered blocking state
[   47.836080] br-lan: port 6(wlan0) entered forwarding state
[  121.118105] pstore: collected an compressed console-ramoops-0 from last boot
[  121.135233] pstore: collected an compressed pmsg-ramoops-0 from last boot
[85719.163811] ax88179_178a 2-1.4:1.0 eth1: ax88179 - Link status is: 1
[85722.107765] ax88179_178a 2-1.4:1.0 eth1: ax88179 - Link status is: 1
[85722.121596] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[155843.369462] mt7530 mdio-bus:1f wan: Link is Down
[155843.466576] br-lan: port 5(wan) entered disabled state
[155875.336121] mt7530 mdio-bus:1f wan: Link is Up - 1Gbps/Full - flow control rx/tx
[155875.343896] br-lan: port 5(wan) entered blocking state
[155875.349327] br-lan: port 5(wan) entered forwarding state
[155878.407819] mt7530 mdio-bus:1f wan: Link is Down
[155878.412766] br-lan: port 5(wan) entered disabled state
[155890.695417] mt7530 mdio-bus:1f wan: Link is Up - 1Gbps/Full - flow control rx/tx
[155890.703061] br-lan: port 5(wan) entered blocking state
[155890.708493] br-lan: port 5(wan) entered forwarding state
[178367.985725] mt7530 mdio-bus:1f wan: Link is Down
[178367.990987] br-lan: port 5(wan) entered disabled state
[178399.728591] mt7530 mdio-bus:1f wan: Link is Up - 1Gbps/Full - flow control rx/tx
[178399.736352] br-lan: port 5(wan) entered blocking state
[178399.741699] br-lan: port 5(wan) entered forwarding state
[178408.944057] mt7530 mdio-bus:1f wan: Link is Down
[178408.948998] br-lan: port 5(wan) entered disabled state
[178413.040057] mt7530 mdio-bus:1f wan: Link is Up - 1Gbps/Full - flow control rx/tx
[178413.047702] br-lan: port 5(wan) entered blocking state
[178413.053126] br-lan: port 5(wan) entered forwarding state

I agree with you, snapshots are really good as far as I can report.

1 Like

Unfortunately you and others have devices with some bug in the SOC, that gets triggered by the v19.07 mtk ethernet driver.

In my case on zbt-wg3526-16M 19.07.3 was rock solid (months of uptime). I skipped .4 due to known issues, now .05 locks up. Same config and usage pattern.

Oh that's unfortunate then.

See https://github.com/openwrt/openwrt/pull/2798#issuecomment-599154521 and subsequent comments.

I'm sure there are hacks that can be used to speed certain aspects of the driver up. https://gitlab.com/db260179/xiaomi-m4a/-/commit/4f1a3c5223e253c58085d82f0f01fc58575512f5 caught my eye actually.

Hmm now that I'm looking at the comments, https://github.com/openwrt/openwrt/pull/2798#issuecomment-600006326 seems interesting...

@db260179 trying your iperf example doesn't trigger a crash or errors on zbt-wg3526-16M/19.07.5.

This is udp:

iperf3 -c 192.168.1.1 -u -t 100 -b 1000M -i10 -P2
[SUM]   0.00-100.00 sec  12.1 GBytes  1.04 Gbits/sec  0.000 ms  0/8992148 (0%)  sender
[SUM]   0.00-100.22 sec  3.40 GBytes   291 Mbits/sec  0.075 ms  6471825/8992143 (72%)  receiver

During udp run - see interrupts in 'in' column (router was unusable during the run)

# vmstat -t -w --one-header 10
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu-------- -----timestamp-----
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st                 PST
 0  0            0       460792         2524        14856    0    0     0     0  637  184   0   0  99   0   0 2020-12-19 23:03:34
 2  0            0       460728         2524        14864    0    0     0     0  698  345   3  17  81   0   0 2020-12-19 23:03:44
 2  0            0       460912         2524        14864    0    0     0     0  609  284   7  43  50   0   0 2020-12-19 23:03:54
 2  0            0       460920         2524        14864    0    0     0     0  599  273   7  43  50   0   0 2020-12-19 23:04:04
 2  0            0       460852         2524        14864    0    0     0     0  590  246   7  43  50   0   0 2020-12-19 23:04:14
 2  0            0       460728         2524        14864    0    0     0     0  636  285   6  40  54   0   0 2020-12-19 23:04:24
 2  0            0       460812         2524        14864    0    0     0     0  589  252   7  43  50   0   0 2020-12-19 23:04:34
 3  0            0       461064         2524        14864    0    0     0     0  641  356   7  44  50   0   0 2020-12-19 23:04:44
 2  0            0       460968         2524        14864    0    0     0     0  635  304   8  43  50   0   0 2020-12-19 23:04:54
 2  0            0       461000         2524        14864    0    0     0     0  583  225   8  38  54   0   0 2020-12-19 23:05:04
 2  0            0       461060         2524        14864    0    0     0     0  607  288   7  43  50   0   0 2020-12-19 23:05:14
 0  0            0       461244         2524        14856    0    0     0     0  627  269   5  27  68   0   0 2020-12-19 23:05:24
 0  0            0       461212         2524        14856    0    0     0     0  663  198   0   0  99   0   0 2020-12-19 23:05:34

With tcp (note a lot more interrupts in 'in' compared to udp)

iperf3 -c 192.168.1.1 -t 100 -b 1000M -i10 -Z -P2
[SUM]   0.00-100.00 sec  8.97 GBytes   771 Mbits/sec  91866             sender
[SUM]   0.00-100.00 sec  8.96 GBytes   770 Mbits/sec                  receiver

# vmstat -t -w --one-header 10
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu-------- -----timestamp-----
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st                 PST
 2  0            0       458604         2524        15112    0    0     0     0 5655  970   1  33  66   0   0 2020-12-19 23:01:44
 2  0            0       457220         2524        15112    0    0     0     0 9520  840   1  48  51   0   0 2020-12-19 23:01:54
 3  0            0       457252         2524        15112    0    0     0     0 8174 1013   1  48  51   0   0 2020-12-19 23:02:04
 2  0            0       459176         2524        15112    0    0     0     0 6285  933   1  46  53   0   0 2020-12-19 23:02:14
 2  0            0       456976         2524        15112    0    0     0     0 4562  686   1  48  51   0   0 2020-12-19 23:02:24
 3  0            0       460488         2524        15112    0    0     0     0 10176  779   1  49  50   0   0 2020-12-19 23:02:34
 2  0            0       457156         2524        15112    0    0     0     0 8067  921   1  48  51   0   0 2020-12-19 23:02:44
 2  0            0       457368         2524        15112    0    0     0     0 8250  946   1  49  50   0   0 2020-12-19 23:02:54
 2  0            0       457536         2524        15112    0    0     0     0 7704  893   1  48  52   0   0 2020-12-19 23:03:04
 2  0            0       459236         2524        15112    0    0     0     0 7804  952   1  48  52   0   0 2020-12-19 23:03:14
 0  0            0       460884         2524        14856    0    0     0     0 7073  856   1  41  58   0   0 2020-12-19 23:03:24
 0  0            0       460792         2524        14856    0    0     0     0  637  184   0   0  99   0   0 2020-12-19 23:03:34

only 'rev' in dmesg:

# dmesg|grep rev
[    0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
[    0.105763] CPU1 revision is: 0001992f (MIPS 1004Kc)
[    0.206133] CPU2 revision is: 0001992f (MIPS 1004Kc)
[    0.287022] CPU3 revision is: 0001992f (MIPS 1004Kc)
[   14.024365] mt7603e 0000:01:00.0: ASIC revision: 76030010
[   15.299007] mt76x2e 0000:02:00.0: ASIC revision: 76120044

notably, i delete stock 20-smp-tune (, instead i manually assign ethernet interrupt to cpu 2 and only use 2.4 Ghz wifi on router:

# cat /proc/interrupts; grep . /proc/irq/{22,24,25}/smp_affinity_list; grep . /sys/class/net/*/queues/*/?ps_cpus
           CPU0       CPU1       CPU2       CPU3
  8:   13764565   13764547   13764539   13764532  MIPS GIC Local   1  timer
  9:      91742          0          0          0  MIPS GIC  63  IPI call
 10:          0      48303          0          0  MIPS GIC  64  IPI call
 11:          0          0     123406          0  MIPS GIC  65  IPI call
 12:          0          0          0      60066  MIPS GIC  66  IPI call
 13:     105351          0          0          0  MIPS GIC  67  IPI resched
 14:          0      87430          0          0  MIPS GIC  68  IPI resched
 15:          0          0     172782          0  MIPS GIC  69  IPI resched
 16:          0          0          0     128434  MIPS GIC  70  IPI resched
 19:         12          0          0          0  MIPS GIC  33  ttyS0
 20:          0          0          0          0  MIPS GIC  27  1e130000.sdhci
 21:          0          0          0          0  MIPS GIC  29  xhci-hcd:usb1
 22:         20          0   13099950          0  MIPS GIC  10  1e100000.ethernet
 23:          4          0          0          0  MIPS GIC  30  gsw
 24:         23          0          0   10341658  MIPS GIC  11  mt7603e
 25:          2          0          0          0  MIPS GIC  31  mt76x2e
ERR:          1

/proc/irq/22/smp_affinity_list:2
/proc/irq/24/smp_affinity_list:3
/proc/irq/25/smp_affinity_list:2
/sys/class/net/br-lan/queues/rx-0/rps_cpus:0
/sys/class/net/br-lan/queues/tx-0/xps_cpus:0
/sys/class/net/eth0.1/queues/rx-0/rps_cpus:0
/sys/class/net/eth0.1/queues/tx-0/xps_cpus:0
/sys/class/net/eth0.2/queues/rx-0/rps_cpus:0
/sys/class/net/eth0.2/queues/tx-0/xps_cpus:0
/sys/class/net/eth0/queues/rx-0/rps_cpus:0
/sys/class/net/eth0/queues/tx-0/xps_cpus:0
/sys/class/net/lo/queues/rx-0/rps_cpus:0
/sys/class/net/lo/queues/tx-0/xps_cpus:0
/sys/class/net/wlan0/queues/rx-0/rps_cpus:0
/sys/class/net/wlan0/queues/tx-0/xps_cpus:0
/sys/class/net/wlan0/queues/tx-1/xps_cpus:0
/sys/class/net/wlan0/queues/tx-2/xps_cpus:0
/sys/class/net/wlan0/queues/tx-3/xps_cpus:0
/sys/class/net/wlan1/queues/rx-0/rps_cpus:0
/sys/class/net/wlan1/queues/tx-0/xps_cpus:0
/sys/class/net/wlan1/queues/tx-1/xps_cpus:0
/sys/class/net/wlan1/queues/tx-2/xps_cpus:0
/sys/class/net/wlan1/queues/tx-3/xps_cpus:0

This patch i found, and added (seemed to have got lost in the mountain of patches, i guess?)

To allow 4.14 to have this NAPI threaded polling.

Yes i investigated this, and the mt7621 part of the old v19.07 ethernet driver doesn't use the port status feature, it appears its brute forced (speed,fc,duplex etc) and only a small part in the mtk_eth_soc.c to check port up or down.

This was on the original v19.07.5 source build and not my source build?

The gsw:rev, will only be in my source, as i added a patch to see what soc versions exist.

@neheb as much as I see, the threaded NAPI patch is part of 5.4 and 4.14 as well.

If someone take a look at Felix's git, there are a lot of new patches coming along with the 5.10 support. https://git.openwrt.org/?p=openwrt/staging/nbd.git;a=summary
I am not sure how "WIP" the 5.10 support is at the moment, but if @neheb or @nbd say its ready for beta testing, I would compile a version and load it.

That's here: https://github.com/openwrt/openwrt/pull/3693

1 Like

I am on snapshot (with kernel 5.4) and it is very stable. I have not seen any errors so far, no more mtk_soc_eth timeouts. While when on v19.07.X I experienced them daily.

Mind that kernel 5 introduced different architecture for switches - DSA (distributed switch architecture).

Official v19.07.5 build

Perhaps my lock up is something different, these are first 4 traces that happened in the same second.
A lot more traces in logs until i power cycled the router. I was receiving the traces via UDP log, so router was not completely dead from network prospective.

1:

INFO: rcu_sched self-detected stall on CPU
       0-...: (1 GPs behind) idle=0b2/140000000000001/0 softirq=2840147/2840148 fqs=3000
        (t=6000 jiffies g=600482 c=600481 q=97)
NMI backtrace for cpu 0
CPU: 0 PID: 9933 Comm: kworker/0:2 Not tainted 4.14.209 #0
Workqueue: events 0x802409f4
Stack : 00000000 00000000 804d7568 8fc09d24 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000001 8fc09ce0 ac07f5c8
        8fc09d78 00000000 00000000 00005af8 00000038 8049da98 00000008 00000000
        00000000 80550000 000732bc 00000000 8fc09cc0 00000000 00000000 00000000
        80554500 8054c0cc 000000e0 80550000 00000000 802ae190 00000000 806b0000
        ...
Call Trace:
[<8049da98>] 0x8049da98
[<802ae190>] 0x802ae190
[<8000c1a0>] 0x8000c1a0
[<8000c1a8>] 0x8000c1a8
[<804868d4>] 0x804868d4
[<80072c54>] 0x80072c54
[<8048d8f4>] 0x8048d8f4
[<80008f30>] 0x80008f30
[<80008f30>] 0x80008f30
[<8048d9e0>] 0x8048d9e0
[<80086558>] 0x80086558
[<800859a8>] 0x800859a8
[<80089118>] 0x80089118
[<8009a110>] 0x8009a110
[<8009a168>] 0x8009a168
[<8005f3e4>] 0x8005f3e4
[<80332644>] 0x80332644
[<80079328>] 0x80079328
[<800735e0>] 0x800735e0
[<8025bc44>] 0x8025bc44
[<8025be04>] 0x8025be04
[<800735e0>] 0x800735e0
[<804a475c>] 0x804a475c
[<8025b4c0>] 0x8025b4c0
[<80007488>] 0x80007488

2:

INFO: rcu_sched self-detected stall on CPU
       0-...: (1 GPs behind) idle=0b2/140000000000001/0 softirq=2840147/2840148 fqs=11993
        (t=24003 jiffies g=600482 c=600481 q=473)
NMI backtrace for cpu 0
CPU: 0 PID: 9933 Comm: kworker/0:2 Not tainted 4.14.209 #0
Workqueue: events 0x802409f4
Stack : 00000000 00000000 804d7568 8fc09d24 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000001 8fc09ce0 ac07f5c8
        8fc09d78 00000000 00000000 000062a0 00000038 8049da98 00000008 00000000
        00000000 80550000 00078fca 00000000 8fc09cc0 00000000 00000000 00000000
        80554500 8054c0cc 000000e0 80550000 00000000 802ae190 00000000 806b0000
        ...
Call Trace:
[<8049da98>] 0x8049da98
[<802ae190>] 0x802ae190
[<8000c1a0>] 0x8000c1a0
[<8000c1a8>] 0x8000c1a8
[<804868d4>] 0x804868d4
[<80072c54>] 0x80072c54
[<8048d8f4>] 0x8048d8f4
[<80008f30>] 0x80008f30
[<80008f30>] 0x80008f30
[<8048d9e0>] 0x8048d9e0
[<80086558>] 0x80086558
[<800859a8>] 0x800859a8
[<80089118>] 0x80089118
[<8009a110>] 0x8009a110
[<8009a168>] 0x8009a168
[<80332644>] 0x80332644
[<80079328>] 0x80079328
[<800735e0>] 0x800735e0
[<8025bd98>] 0x8025bd98
[<8025bc44>] 0x8025bc44
[<8025be04>] 0x8025be04
[<800735e0>] 0x800735e0
[<804a475c>] 0x804a475c
[<8025b4c0>] 0x8025b4c0
[<80007488>] 0x80007488

Diff 1 and 2

@@ -1,13 +1,13 @@
 INFO: rcu_sched self-detected stall on CPU
-       0-...: (1 GPs behind) idle=0b2/140000000000001/0 softirq=2840147/2840148 fqs=3000
-        (t=6000 jiffies g=600482 c=600481 q=97)
+       0-...: (1 GPs behind) idle=0b2/140000000000001/0 softirq=2840147/2840148 fqs=11993
+        (t=24003 jiffies g=600482 c=600481 q=473)
 NMI backtrace for cpu 0
 CPU: 0 PID: 9933 Comm: kworker/0:2 Not tainted 4.14.209 #0
 Workqueue: events 0x802409f4
 Stack : 00000000 00000000 804d7568 8fc09d24 00000000 00000000 00000000 00000000
         00000000 00000000 00000000 00000000 00000000 00000001 8fc09ce0 ac07f5c8
-        8fc09d78 00000000 00000000 00005af8 00000038 8049da98 00000008 00000000
-        00000000 80550000 000732bc 00000000 8fc09cc0 00000000 00000000 00000000
+        8fc09d78 00000000 00000000 000062a0 00000038 8049da98 00000008 00000000
+        00000000 80550000 00078fca 00000000 8fc09cc0 00000000 00000000 00000000
         80554500 8054c0cc 000000e0 80550000 00000000 802ae190 00000000 806b0000
         ...
 Call Trace:
@@ -26,10 +26,10 @@
 [<80089118>] 0x80089118
 [<8009a110>] 0x8009a110
 [<8009a168>] 0x8009a168
-[<8005f3e4>] 0x8005f3e4
 [<80332644>] 0x80332644
 [<80079328>] 0x80079328
 [<800735e0>] 0x800735e0
+[<8025bd98>] 0x8025bd98
 [<8025bc44>] 0x8025bc44
 [<8025be04>] 0x8025be04
 [<800735e0>] 0x800735e0

3:

INFO: rcu_sched self-detected stall on CPU
       0-...: (1 GPs behind) idle=0b2/140000000000001/0 softirq=2840147/2840148 fqs=17290
        (t=42006 jiffies g=600482 c=600481 q=872)
rcu_sched kthread starved for 7393 jiffies! g600482 c600481 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=2
rcu_sched       R  running task        0     8      2 0x00100000
Stack : 00000001 8fc3aca0 8fc57e38 8fc57e38 8fc57e38 80550000 814be380 00000000
        81495da0 80052c18 01d9e8ef 814be380 8fc57e38 80550000 814be380 00000000
        00000003 8054c1e8 80554610 8049fe18 8fc57e38 80550000 80550000 00000000
        00000003 804a3630 80550000 00000000 8fc57ebc 80068228 00000000 00000001
        00000200 00000000 01d9e8ef 800886a0 8fc3aca0 0c000003 80554500 00000000
        ...
Call Trace:
[<80052c18>] 0x80052c18
[<8049fe18>] 0x8049fe18
[<804a3630>] 0x804a3630
[<80068228>] 0x80068228
[<800886a0>] 0x800886a0
[<80084bc4>] 0x80084bc4
[<80080000>] 0x80080000
[<80081c0c>] 0x80081c0c
[<80084664>] 0x80084664
[<8004c0f8>] 0x8004c0f8
[<8004bfc8>] 0x8004bfc8
[<8004bfc8>] 0x8004bfc8
[<8004bfc8>] 0x8004bfc8
[<80006f78>] 0x80006f78

4:

NMI backtrace for cpu 0
CPU: 0 PID: 9933 Comm: kworker/0:2 Not tainted 4.14.209 #0
Workqueue: events 0x802409f4
Stack : 00000000 00000000 804d7568 8fc09d24 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000001 8fc09ce0 ac07f5c8
        8fc09d78 00000000 00000000 00006f70 00000038 8049da98 00000008 00000000
        00000000 80550000 0009b1b2 00000000 8fc09cc0 00000000 00000000 00000000
        80554500 8054c0cc 000000e0 80550000 00000000 802ae190 00000000 806b0000
        ...
Call Trace:
[<8049da98>] 0x8049da98
[<802ae190>] 0x802ae190
[<8000c1a0>] 0x8000c1a0
[<8000c1a8>] 0x8000c1a8
[<804868d4>] 0x804868d4
[<80072c54>] 0x80072c54
[<8048d8f4>] 0x8048d8f4
[<80008f30>] 0x80008f30
[<80008f30>] 0x80008f30
[<8048d9e0>] 0x8048d9e0
[<8005384c>] 0x8005384c
[<80051840>] 0x80051840
[<80086558>] 0x80086558
[<8008646c>] 0x8008646c
[<800859a8>] 0x800859a8
[<80089118>] 0x80089118
[<8009a110>] 0x8009a110
[<8009a168>] 0x8009a168
[<80332644>] 0x80332644
[<80079328>] 0x80079328
[<800735e0>] 0x800735e0
[<8025bd98>] 0x8025bd98
[<8025bc44>] 0x8025bc44
[<8025be04>] 0x8025be04
[<800735e0>] 0x800735e0
[<804a475c>] 0x804a475c
[<8025b4c0>] 0x8025b4c0
[<80007488>] 0x80007488

PS: not familiar how to decode these for openwrt, this is:

# cat /etc/openwrt_release
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='19.07.5'
DISTRIB_REVISION='r11257-5090152ae3'
DISTRIB_TARGET='ramips/mt7621'
DISTRIB_ARCH='mipsel_24kc'
DISTRIB_DESCRIPTION='OpenWrt 19.07.5 r11257-5090152ae3'
DISTRIB_TAINTS=''
# uname -a
Linux gw 4.14.209 #0 SMP Sun Dec 6 07:31:03 2020 mips GNU/Linux

For now 7 days and 5 hours with master branch r15223 (Kernel 5.4.82) without any problem.

Spent some time searching and at a loss - what's the way to decode the code traces in openwrt ?

Someone is reporting that disabling TSO fixed the issue for them. They used a patch that can be found here: https://bugs.openwrt.org/index.php?do=details&task_id=2628

It might also be possible to disable TSO via ethtool, but I am not sure if that would also work for fixing it. Has anyone tried this potential solution?

2 Likes

Because of SQM I always disable as much offloads as possible with TSO among them

And do you ever experience the transmit queue has timed out error?

1 Like

Yes, I did but after master switched over to DSA, I have not experienced them anymore

2 Likes