Possible Kernel 5.10 regression issue with MT7621 and SW/HW offload enabled

After upgrading to a newer snapshot build which switched ramips to kernel 5.10, I started experiencing random reboots on the Archer C6 v3.2 (MT7621) I use as my router. The three other Archer C6 v3.2 devices I use as access points (also on kernel 5.10) have not shown this issue.

At this moment I don't have further log details. Anyway @denk also reported a similar issue with Edgerouter X SFP (also MT7621), see details here.

He reported that disabling flow offloading (sw/hw) solved the issue. This may explain why only my router (which has sw/hw flow offloading enabled) is having this issue, while the access points are OK (not being used as routers, just as "dumb" access points).

For now I've reverted to a previous snapshot build (r18298-8261b85844) with kernel 5.4.162. This solved the problem: even with sw/hw flow offload enabled, everything is working fine.

I'm just posting here to give OpenWrt developers visibility into this issue.

Thanks! :+1:

Hi
I'm using a R6220 as an AP with latest master (kernel 5.10), nothing to complain about. I'll give it a try as a router for some hours, with SW and HW offloading on.

EDIT : running for 2h now

My TP-Link C6U rebooted too. I wondered why it rebooted many times a day.

Edgerouter X also reboots randomly; I went back to stable 21.2.0.1 and disabled IPv6 for now.

@badulesia, I believe then that just enabling the HW/SW flow offload is not sufficient to trigger the issue. Since you are using your device as Access Point, you are not actually using the routing feature (which in fact uses the flow offload).

Yes of course :wink:
I have reset it, and put it as main router, with both SW and HW offloading, just as I said I would do.

I have done performance tests, huge ISO downloads, in order to stress the routing process. So far (> 6h) nothing to report.

EDIT : I'm using OpenWrt SNAPSHOT r18400-f9782f5bcd
EDIT 2 : 18h running

Hi.

Netgear R6220 (MT7621ST)
OpenWrt SNAPSHOT r18400-f9782f5bcd / LuCI Master git-21.343.55550-008bd89 (kernel 5.10.87)
both SW and HW offloading enabled
I didn't use the default build, but a custom one. Nothing unusual; here is the make command:

make image PROFILE=netgear_r6220 PACKAGES="-ppp -ppp-mod-pppoe -wpad-basic-wolfssl wpad-wolfssl luci luci-theme-material"

I performed a reset (within LuCI) to restart with default values. I set up some minimal settings (password, ssh, ntp...) and added a wifi SSID. Then I used it as a main router in various ways (web, gaming, ftp ...).

I have run the router for 24h now without any issue, so no random reboot.

I have noticed that we don't all have the same CPU and switch variant; maybe that is an avenue to investigate?
R6220 --> MT7621ST
Archer C6 3.2 --> MT7621DAT
@jookk Tp-link c6u --> MT7621DAT
@YummyHamster Edgerouter X --> MT7621AT

just throwing in something random

can try disabling tcp timestamps to see if it does anything

echo 0 > /proc/sys/net/ipv4/tcp_timestamps

Yesterday I installed a new snapshot build I did (r18468-4a2cca7824). Nothing fancy, just a base build plus LuCI, WireGuard, DDNS and some additional CLIs (nano, iperf3, htop). The only significant change I did was to include the latest mt76 driver (2ef775c10bd36537fadbde81754f64242e35bfd0).

I have both SW and HW flow offload enabled (Archer C6 v3.2/mt7621).

After 24 hours it is still up and running without any random reboot. I don’t know if any commit fixed this issue, but so far it seems that the new snapshot build with kernel 5.10[.89] is stable.

@YummyHamster, could you please try a newer Snapshot build on your Edgerouter X to see if the reboots are also solved for you? Thanks!

Well, I think I spoke too soon. After almost 5 days running rock solid, my router just rebooted itself about an hour ago.

I configured logging to a remote syslog server, but nothing was logged. I believe this is in fact a kernel issue that cannot be logged remotely (kernel ring buffer).

EDIT: now just rebooted after less than 2 hours. I will disable HW Flow Offloading (while keeping SW Flow Offloading) to check if there is any improvement.
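
For anyone repeating this test from the shell rather than LuCI, here is a minimal sketch. It assumes the `flow_offloading` and `flow_offloading_hw` options in the firewall `defaults` section are what the LuCI checkboxes map to (check with `uci show firewall` on your build). `set_offload` is a hypothetical helper name, not something shipped with OpenWrt.

```shell
#!/bin/sh
# Sketch: toggle software/hardware flow offloading through UCI.
# Assumption: the options live in the first firewall "defaults" section.

# set_offload SW HW  -- each argument must be 0 or 1
set_offload() {
    sw="$1"; hw="$2"
    for v in "$sw" "$hw"; do
        case "$v" in
            0|1) ;;
            *) echo "usage: set_offload 0|1 0|1" >&2; return 1 ;;
        esac
    done
    # Quote the whole argument so the shell does not treat [0] as a glob.
    uci set "firewall.@defaults[0].flow_offloading=$sw" &&
    uci set "firewall.@defaults[0].flow_offloading_hw=$hw" &&
    uci commit firewall
}
```

On the router, `set_offload 1 0 && /etc/init.d/firewall restart` would keep SW offload on and turn HW offload off.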

Same behavior on mir3p. Kernel panics every few hours. Uploaded logs collected by pstore.
After disabling HW/SW NAT - no reboots. Also, this commit seems to fix the issue. Not sure exactly, but I have 2 days of uptime. Maybe @nbd or someone else can look at this.
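
For others who want to collect the same kind of logs, here is a rough sketch of listing what pstore kept after a crash. It assumes pstore is mounted at `/sys/fs/pstore` and that the build has a working ramoops backend for the device, which may not be true of a stock snapshot. `list_pstore_logs` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list kernel crash records preserved by pstore across a reboot.
# Assumption: records appear as dmesg-* files (e.g. dmesg-ramoops-0).

# list_pstore_logs [DIR]  -- prints one record path per line
list_pstore_logs() {
    dir="${1:-/sys/fs/pstore}"
    if [ ! -d "$dir" ]; then
        echo "no pstore directory at $dir" >&2
        return 1
    fi
    for f in "$dir"/dmesg-*; do
        # The glob stays literal when nothing matches, so check existence.
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}
```

After an unexpected reboot, run `list_pstore_logs` and copy any records somewhere persistent before they are overwritten.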

I did not have luck with the above commit. Even with it, the random reboots happened. I am now testing with SW offload enabled but HW offload disabled. So far so good (1 day up and running without any reboot).

To me it is clear that Kernel 5.10 has serious problems with mt7621 HW offload. After this test I will likely go back to kernel 5.4, which was rock solid even with HW offload enabled.

From my tests and observations I have concluded that the cause of the random reboots is an issue with Kernel 5.10 and mt7621 HW flow offload. While I do not have any kernel panic log to share, I'm pretty confident of this finding (the only way to capture the kernel ring buffer across reboots is to attach a UART and capture via serial console, something I cannot do due to the location where my main router is installed).

After a full week running SNAPSHOT r18539-f2c3875dfc with Kernel 5.10 on an Archer C6 v3.2 with HW flow offload disabled, the reboot issue went away. Notice that SW flow offload was enabled and it is working fine with kernel 5.10. However with HW flow offload enabled the device randomly reboots itself (the reboot frequency varies a lot, from a few minutes after power up to a few days).
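
Since the interval between reboots varies so much and nothing reaches the remote syslog, one way to put timestamps on the reboots is to sample the router's uptime from a machine that stays up and flag any sample where it went backwards. This is a sketch only; `uptime_seconds` and `rebooted_since` are hypothetical helper names, and how the samples are collected (cron, ssh, and so on) is left open.

```shell
#!/bin/sh
# Sketch: detect a reboot by comparing two uptime samples.

# uptime_seconds [FILE]  -- whole seconds since boot, read from a file
# in /proc/uptime format ("12345.67 54321.00")
uptime_seconds() {
    cut -d. -f1 "${1:-/proc/uptime}"
}

# rebooted_since PREV CUR  -- succeeds if uptime went backwards
rebooted_since() {
    [ "$2" -lt "$1" ]
}
```

For example, if the previous sample was 432000 (5 days) and the current one is 95, `rebooted_since 432000 95` succeeds and the reboot happened roughly 95 seconds before the sample was taken.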

With Kernel 5.4 and HW flow offload enabled this problem does not happen. So, before I switch back to an older build with kernel 5.4, I may give Kernel 5.10 + firewall4 + nftables a try (since I understand that the mt7621 HW acceleration is implemented as part of the firewall).

The same seems to be happening also for MT7622 devices. We initially thought only MT7622 was affected and now I'm finding this thread basically telling me that on MT7621 it's the same, just most users probably won't even notice the occasional reboot or blame it on the electricity grid or whatever.
See Belkin RT3200/Linksys E8450 WiFi AX discussion - #1375 by Mushoz for a full crash-log (poisoned pointer dereference), maybe anyone has an idea where that poisoned pointer comes from (the hardware? or is it a use-after-free thing?) and what we should do to prevent a crash in this case.

Hardware flow offloading with the 5.4 kernel is not fully supported, as it does not work with PPPoE and VLANs. Kernel 5.10 restored that support, but introduced instability. Maybe that's the culprit?

@daniel @dsouza I'm using a firmware that was supposed to fix this issue; the version I have gives mixed results, but it points the way to a fix.

MT7621 - 21.02/Master feedback firmware image test - IPV6 offload and disabled Flow Control - For Developers - OpenWrt Forum

Well, I can say that my experience with kernel 5.10 has not been good.

With mt7621 and HW Offload enabled, I got random reboots. And I am using only IPv4, so I have not tried the IPv6 patch above. I have four Archer C6 v3.2, one router plus 3 access points. All APs are running custom "slim" builds which do not include firewall, iptables, or dnsmasq. With the AP build, kernel 5.10 is more stable, so I believe the issue is related to firewall/iptables with Kernel 5.10.

However, I also have a TP-Link RE305 v3 (mt7628) which is also running the "slim" build without firewall/iptables/dnsmasq. On this device with Kernel 5.10, JFFS2 is getting corrupted every day and OpenWrt fails with different problems due to the corrupted overlay file system.

So at least in my setup (four mt7621 and one mt7628 devices) Kernel 5.10 is really unstable and unreliable at this point. For now I reverted all devices back to the latest snapshot build with kernel 5.4 and everything is rock solid.

It is still in my plans to test Kernel 5.10 with firewall4/nftables with HW offload acceleration in the device running as router, but due to the reasons above I will roll back to previous builds with kernel 5.4 until snapshot builds are stable again.

I can give a try with 5.10 kernel and firewall4/nftables.
Currently snapshots have the 5.10.92 kernel; an update will come soon. I'll just wait for it, maybe tomorrow.
I'll use a Netgear R6220 with SW/HW offload. I don't have IPv6 on the WAN side.

Well, I just tried a snapshot I built today and it bricked my device. I connected a UART cable; as you can see below, it is using Kernel 5.10.92 and just hangs at "Starting kernel ..." with no further error.

I will now try to recover this device, and I will definitely stay away from Kernel 5.10 on the Archer C6 v3 devices until it is included in a stable build... :slightly_frowning_face:

U-Boot 1.1.3 (May 13 2020 - 19:39:06)

Board: Ralink APSoC DRAM:  128 MB
relocate_code Pointer at: 87f58000

Config XHCI 40M PLL
flash manufacture id: c8, device id 40 18
find flash: GD25Q128C
*** Warning - bad CRC, using default environment

============================================
Ralink UBoot Version: 5.0.0.0
--------------------------------------------
ASIC MT7621A DualCore (MAC to MT7530 Mode)
DRAM_CONF_FROM: Auto-Detection
DRAM_TYPE: DDR3
DRAM bus: 16 bit
Xtal Mode=3 OCP Ratio=1/3
Flash component: SPI Flash
Date:May 13 2020  Time:19:39:06
============================================
THIS IS uboot
icache: sets:256, ways:4, linesz:32 ,total:32768
dcache: sets:256, ways:4, linesz:32 ,total:32768

 ##### The CPU freq = 880 MHZ ####
 estimate memory size =128 Mbytes

Press '4' or 't' to break the booting process

Press 'x' to enter recovery web server                                        0
nm_init:791
nm_initFwupPtnStruct:276
nm_lib_readPtnTable:738
[NM_Debug](nm_lib_readPtnTable) 00743: NM_PTN_TABLE_BASE = 0xfe0000
[NM_Debug](nm_lib_readPtnFromNvram) 00569: partition_used_len = 1054, requried len = 8192
[NM_Debug](nm_lib_readPtnTable) 00751: Reading Partition Table from NVRAM ... OK

[NM_Debug](nm_lib_readPtnTable) 00759: Parsing Partition Table ... OK

[NM_Debug](nm_lib_readPtnFromNvram) 00569: partition_used_len = 2, requried len = 2
factory boot check integer ok.


3: System Boot system code via Flash.
## Booting image at bc040000 ...
   Image Name:   MIPS OpenWrt Linux-5.10.92
   Image Type:   MIPS Linux Kernel Image (lzma compressed)
   Data Size:    2731273 Bytes =  2.6 MB
   Load Address: 82000000
   Entry Point:  82000000
   Verifying Checksum ... OK
   Uncompressing Kernel Image ... OK
No initrd
## Transferring control to Linux (at address 82000000) ...
## Giving linux memsize in MB, 128

Starting kernel ...



OK, recovery was easier than I expected (opening the device and connecting the UART was the "hard" part, the rest was easy).

After I successfully recovered the device, I wanted to be sure the issue was not with my build. So I downloaded today's snapshot image from the OpenWrt website and the results are the same as above. So be aware, Archer C6 v3 users (as well as A6 v3): as of January 30th, 2022, the snapshot builds will brick your device (and a UART connection will be required to recover it).