Mtk_soc_eth watchdog timeout after r11573

Just installed it. Disabled packet steering under the global network options. Speed is the same: I did a couple of speedtests, and with PPPoE on a 1000/300 line I got 600/324 relatively consistently. During the DL speedtest this is the CPU load:

The first core is at a constant 100% during the speedtest; the rest of the cores are as shown in the picture.

And before these patches you needed packet steering in order to hit these speeds, correct? That means performance wise these patches are doing really well. Let's hope the bug this topic is about is also finally fixed! We'll see in a couple of weeks I guess. Fingers crossed.

No. In previous versions I used packet steering and the speed was similar, though the cores were loaded more evenly.

MOD: now I tried it with packet steering enabled, and it has the same behavior as without packet steering on the latest master

I'm running 19.07.1 with the FC Off and Interrupt handling patches and it's been 100% stable for months. Not a single transmit timed out error, 0 interrupt errors, and no hangs or reboots. The longest uptime has been 37 days, and only because of power outages.

What I did to achieve this was assign each ethernet port its own VLAN (or several), with never more than one port in the same VLAN (so the integrated switch is not used for switching), and remove the OpenVPN interface tap0 from the br-lan bridge.

Which interrupt handling patches specifically? Could you maybe link all the patches that you applied so there is no confusion about the patches you used?

Also, how often did you run into this issue before you applied these patches?

Edit: Also, on what device is that? Is that with WiFi enabled?

Disable packet steering and try this. Does it make any difference?

echo "4" > "/sys/class/net/eth0/queues/rx-0/rps_cpus"
echo "4" > "/proc/irq/21/smp_affinity"
echo "8" > "/proc/irq/23/smp_affinity"
echo "8" > "/proc/irq/24/smp_affinity"
echo "2" > "/sys/class/net/wan/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/pppoe-wan/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan1/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan2/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan3/queues/rx-0/rps_cpus"
echo "2" > "/sys/class/net/lan4/queues/rx-0/rps_cpus"
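For anyone wondering what the numbers mean: the values written to rps_cpus and smp_affinity are hexadecimal CPU bitmasks, where bit N selects core N, so 2 is core 1, 4 is core 2 and 8 is core 3. Below is a minimal sketch for computing them. `cpu_mask` is a hypothetical helper name, not an OpenWrt tool, and IRQ numbers such as 21, 23 and 24 differ between devices, so check /proc/interrupts first.

```shell
# Hypothetical helper: build a hex CPU bitmask from a list of core indices.
# Bit N of the mask selects core N (core 0 is the lowest bit).
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))
    done
    printf '%x\n' "$mask"
}

cpu_mask 2        # prints 4 (core 2 only)
cpu_mask 0 1 2 3  # prints f (all four cores)
```

To apply a computed mask you could run something like `echo "$(cpu_mask 2)" > /sys/class/net/eth0/queues/rx-0/rps_cpus`, which is equivalent to the first command above.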

The old ones that were discussed in this thread:

https://patchwork.ozlabs.org/project/openwrt/patch/20200211101741.17350-1-ynezz@true.cz/

https://patchwork.ozlabs.org/project/openwrt/patch/20191029172328.85861-2-rosenp@gmail.com/

Only with these two patches do the transmit timed out errors disappear completely. Without them, I had them several times a day or every two days. But even with those patches I had reboots and freezes. Finally, since I separated the ethernet ports into different VLANs, it has not happened again.

The router is an ER-X. For those who have wifi, the interrupt errors will not go away completely, because the wifi is what causes them. But with these patches and a separate VLAN for each port, there should be no more transmit timed out errors, reboots or hangs caused by the internal switch.

Is this on 19.07? Those patches are not for the upstream driver.

Yes, that is on 19.07. He said so himself one post earlier :)

Tried it. Slightly less performance, only two cores are loaded.

These patches are included in release 19.07.4.

Disable Flow Control was added to the 19.07 branch on May 26. Interrupt handling patch on September 6.
19.07.4 was released on September 7th.

Hardware offloading also seems to have been fixed (I haven't tested it).

Yes, this looks like a really solid release for mt7621 devices. I am about to do the upgrade. Can't wait to test stability!

I think... YEP!!!!
In my usage scenario: 10 minutes and reboot.
19.04 - more than 1 hr of heavy load, and no reboot.
It's cool.

That is promising! I wonder what usage scenario causes it to crash that quickly. I only ran into issues about ~once a week.

What version were you running before this one by the way?

  • release 19.07.3 - no reboot, but sometimes "sch_generic.c:320 error" in log

  • snapshot, including "Update kernel 4.14 to version 4.14.195", "generic: fix flow table hw offload" and "ramips: gsw_mt7621: disable PORT 5 MAC RX/TX flow control by default" - reboots with various kernel panics (logged via serial console)

  • release 19.07.4 (previous patches + "ramips: ethernet: fix to interrupt handling") - no reboot, and no "sch_generic.c:320 error" for now

In release 19.07.4, hardware NAT is only partially fixed: it works now, but after some time the network becomes inaccessible and the router must be rebooted. I switched to software flow offloading, which seems to work stably.

Just installed today's snapshot, and now there is no way to spread the load over more than 1 core, which creates a bottleneck of 350 Mbit/s on a gigabit line. That is with SW offload. I also tried HW offload, thinking it might have started working after this commit: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=4fb58813f94ac6cc8167138e23a92189fe50b258, but no.

I also tried to enable/disable packet steering, tried my previous tricks with RPS, nothing works, limited to single core and terrible speeds...
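For anyone who wants to check the same thing on their own device, here is a rough sketch that prints the NET_RX softirq counters per core from /proc/softirqs. If only one counter grows during a speedtest, receive processing is stuck on a single core. `net_rx_per_cpu` is a made-up helper name, and the sketch assumes the usual /proc/softirqs layout (a CPU header row followed by one row per softirq type).

```shell
# Hypothetical helper: read /proc/softirqs-formatted text on stdin and
# print one "CPU<n> <count>" line per core for the NET_RX row.
net_rx_per_cpu() {
    awk '$1 == "NET_RX:" {
        for (i = 2; i <= NF; i++)
            printf "CPU%d %s\n", i - 2, $i
    }'
}

# Usage on the router: run it twice, before and after a speedtest,
# and compare which counters increased.
# net_rx_per_cpu < /proc/softirqs
```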

MOD:

And my kernel log is full with this:

[  105.362573] mt76x2e 0000:02:00.0: MCU message 2 (seq 9) timed out
[  105.731965] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  105.743041] mt76x2e 0000:02:00.0: Build: 1
[  105.751312] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  105.778561] mt76x2e 0000:02:00.0: Firmware running!
[  105.790708] ieee80211 phy1: Hardware restart was requested
[  106.834540] mt76x2e 0000:02:00.0: MCU message 2 (seq 12) timed out
[  107.203695] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  107.214697] mt76x2e 0000:02:00.0: Build: 1
[  107.222905] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  107.250568] mt76x2e 0000:02:00.0: Firmware running!
[  107.262695] ieee80211 phy1: Hardware restart was requested

If I disable all wifi adapters, the kernel log stops shooting this message.

MOD2: only the 2.4GHz wifi is affected.
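To see which radio keeps restarting, something like the sketch below can count the "Hardware restart was requested" lines per phy in the kernel log. `restarts_per_phy` is a hypothetical helper name, and it only assumes the log line format shown above.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<phy> <count>" for every phy that logged a hardware restart request.
restarts_per_phy() {
    awk '/Hardware restart was requested/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^phy[0-9]+:$/) {
                sub(":", "", $i)   # strip the trailing colon from "phy1:"
                count[$i]++
            }
    }
    END { for (p in count) print p, count[p] }'
}

# Usage on the router:
# dmesg | restarts_per_phy
```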

I see none of these log messages here, running

OpenWrt SNAPSHOT, r14465-04d3b517dc

for 3 days, 18:41.

BTW, as I mentioned before, my D-Link DIR-860L B1 is only used as an AP, not as the main router.

@dchard Please let the mt76 developers know you are also experiencing the same problem. Hopefully they will look into it if they know it's a widespread issue. Out of curiosity, what mt76 device are you using? Curious if yours is using the same WiFi chips as my router is.

I commented. On my end, besides the wifi issue, PPPoE is also dropping (not the ISP's fault).