Possible Kernel 5.10 regression issue with MT7621 and SW/HW offload enabled

Yesterday I installed a new snapshot build I made (r18468-4a2cca7824). Nothing fancy, just a base build plus LuCI, WireGuard, DDNS and some additional CLI tools (nano, iperf3, htop). The only significant change I made was to include the latest mt76 driver (2ef775c10bd36537fadbde81754f64242e35bfd0).

I have both SW and HW flow offload enabled (Archer C6 v3.2/mt7621).

After 24 hours it is still up and running without any random reboot. I don’t know if any commit fixed this issue, but so far it seems that the new snapshot build with kernel 5.10[.89] is stable.

@YummyHamster, could you please try a newer Snapshot build on your Edgerouter X to see if the reboots are also solved for you? Thanks!

Well, I think I spoke too soon. After almost 5 days running rock solid, my router just rebooted itself about an hour ago.

I configured logging to a remote syslog server, but nothing was logged. I believe this is in fact a kernel issue that cannot be logged remotely (kernel ring buffer).

EDIT: it just rebooted again after less than 2 hours. I will disable HW Flow Offloading (while keeping SW Flow Offloading) to check if there is any improvement.
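
For anyone who wants to repeat this test, here is a rough sketch of how the setting can be changed from the command line. It assumes the stock firewall config, where both flags live in the defaults section as flow_offloading and flow_offloading_hw; the same toggles are also available in LuCI's firewall settings.

```shell
# Keep software flow offloading enabled, turn hardware flow offloading off.
# Assumes the stock /etc/config/firewall with a single "defaults" section.
uci set firewall.@defaults[0].flow_offloading='1'
uci set firewall.@defaults[0].flow_offloading_hw='0'
uci commit firewall
/etc/init.d/firewall restart
```

Setting flow_offloading_hw back to '1' and restarting the firewall reverts the change.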

Same behavior on mir3p. Kernel panics every few hours. Uploaded logs collected by pstore.
After disabling HW/SW NAT - no reboots. Also, this commit seems to fix the issue. Not sure exactly, but I have 2 days of uptime. Maybe @nbd or someone else can look at this.

I did not have any luck with the above commit. Even with it, the random reboots happened. I am now testing with SW offload enabled but HW offload disabled. So far so good (1 day up and running without any reboot).

To me it is clear that Kernel 5.10 has serious problems with mt7621 HW offload. After this test I will likely go back to kernel 5.4, which was rock solid even with HW offload enabled.

From my tests and observations I have come to the conclusion that the cause of the random reboots is in fact an issue with Kernel 5.10 and mt7621 HW flow offload. While I do not have any kernel panic log to share, I'm pretty confident of this finding (the only way I know to capture the kernel ring buffer across reboots is to attach a UART and capture via serial console, something I cannot do due to the location where my main router is installed).

After a full week running SNAPSHOT r18539-f2c3875dfc with Kernel 5.10 on an Archer C6 v3.2 with HW flow offload disabled, the reboot issue went away. Notice that SW flow offload was enabled and it is working fine with kernel 5.10. However with HW flow offload enabled the device randomly reboots itself (the reboot frequency varies a lot, from a few minutes after power up to a few days).

With Kernel 5.4 and HW flow offload enabled this problem does not happen. So, before I switch back to an older build with kernel 5.4, I may give Kernel 5.10 + firewall4 + nftables a try (since I understand that the mt7621 HW acceleration is implemented as part of the firewall).

The same seems to be happening also for MT7622 devices. We initially thought only MT7622 was affected and now I'm finding this thread basically telling me that on MT7621 it's the same, just most users probably won't even notice the occasional reboot or blame it on the electricity grid or whatever.
See Belkin RT3200/Linksys E8450 WiFi AX discussion - #1375 by Mushoz for a full crash log (poisoned pointer dereference). Maybe someone has an idea where that poisoned pointer comes from (the hardware, or is it a use-after-free?) and what we should do to prevent a crash in this case.

Hardware flow offloading with the 5.4 kernel is not fully supported, as it does not work with PPPoE and VLANs. Kernel 5.10 restored that support, but introduced instability. Maybe that's the culprit?

@daniel @dsouza I'm using a firmware that was supposed to fix this issue. The version I have gives mixed results, but it points the way to a fix.

MT7621 - 21.02/Master feedback firmware image test - IPV6 offload and disabled Flow Control - For Developers - OpenWrt Forum

Well, I can say that my experience with kernel 5.10 has not been good.

With mt7621 and HW Offload enabled, I got random reboots. I am using only IPv4, so I have not tried the IPv6 patch above. I have four Archer C6 v3.2 devices, one router plus 3 access points. All APs are running custom "slim" builds that do not include firewall, iptables, or dnsmasq. With the AP build, kernel 5.10 is more stable, so I believe the issue is related to firewall/iptables with Kernel 5.10.

However, I also have a TP-Link RE305 v3 (mt7628) which is also running the "slim" build without firewall/iptables/dnsmasq. On this device with Kernel 5.10, JFFS2 is getting corrupted every day and OpenWrt fails with different problems due to the corrupted overlay file system.

So at least in my setup (four mt7621 devices and one mt7628 device) Kernel 5.10 is really unstable and unreliable at this point. For now I have reverted all devices back to the latest snapshot build with kernel 5.4 and everything is rock solid.

It is still in my plans to test Kernel 5.10 with firewall4/nftables with HW offload acceleration in the device running as router, but due to the reasons above I will roll back to previous builds with kernel 5.4 until snapshot builds are stable again.

I can give the 5.10 kernel with firewall4/nftables a try.
Currently snapshots have the 5.10.92 kernel, and an update will come soon. I'll just wait for it, maybe tomorrow.
I'll use a Netgear R6220 with SW/HW offload. I don't have IPv6 on the WAN side.

Well, I just tried a snapshot I built today and it bricked my device. I connected a UART cable and, as you can see in the log below, it is using Kernel 5.10.92 and just hangs at "Starting kernel ..." with no further error.

I will now try to recover this device, and I will definitely stay away from Kernel 5.10 on the Archer C6 v3 devices until it is included in a stable build...

U-Boot 1.1.3 (May 13 2020 - 19:39:06)

Board: Ralink APSoC DRAM:  128 MB
relocate_code Pointer at: 87f58000

Config XHCI 40M PLL
flash manufacture id: c8, device id 40 18
find flash: GD25Q128C
*** Warning - bad CRC, using default environment

============================================
Ralink UBoot Version: 5.0.0.0
--------------------------------------------
ASIC MT7621A DualCore (MAC to MT7530 Mode)
DRAM_CONF_FROM: Auto-Detection
DRAM_TYPE: DDR3
DRAM bus: 16 bit
Xtal Mode=3 OCP Ratio=1/3
Flash component: SPI Flash
Date:May 13 2020  Time:19:39:06
============================================
THIS IS uboot
icache: sets:256, ways:4, linesz:32 ,total:32768
dcache: sets:256, ways:4, linesz:32 ,total:32768

 ##### The CPU freq = 880 MHZ ####
 estimate memory size =128 Mbytes

Press '4' or 't' to break the booting process

Press 'x' to enter recovery web server                                        0
nm_init:791
nm_initFwupPtnStruct:276
nm_lib_readPtnTable:738
[NM_Debug](nm_lib_readPtnTable) 00743: NM_PTN_TABLE_BASE = 0xfe0000
[NM_Debug](nm_lib_readPtnFromNvram) 00569: partition_used_len = 1054, requried len = 8192
[NM_Debug](nm_lib_readPtnTable) 00751: Reading Partition Table from NVRAM ... OK

[NM_Debug](nm_lib_readPtnTable) 00759: Parsing Partition Table ... OK

[NM_Debug](nm_lib_readPtnFromNvram) 00569: partition_used_len = 2, requried len = 2
factory boot check integer ok.


3: System Boot system code via Flash.
## Booting image at bc040000 ...
   Image Name:   MIPS OpenWrt Linux-5.10.92
   Image Type:   MIPS Linux Kernel Image (lzma compressed)
   Data Size:    2731273 Bytes =  2.6 MB
   Load Address: 82000000
   Entry Point:  82000000
   Verifying Checksum ... OK
   Uncompressing Kernel Image ... OK
No initrd
## Transferring control to Linux (at address 82000000) ...
## Giving linux memsize in MB, 128

Starting kernel ...



OK, recovery was easier than I expected (opening the device and connecting the UART was the "hard" part, the rest was easy).

After I successfully recovered the device, I wanted to be sure the issue was not with my build. So I downloaded today's snapshot image from the OpenWrt website, and the results are the same as above. So be aware, Archer C6 v3 users (as well as A6 v3): as of January 30th, 2022 the snapshot builds will brick your device (and a UART connection will be required to recover it).

Lol (?) ...
OK, maybe I'll wait for the next snapshot. I flash a snapshot 2 or 3 times a week on the mt7621 and it has always worked. I know that snapshots are not guaranteed to work, but so far I have had no real issue. Today's snapshot is still based on 5.10.92. I'll wait for the next one, based on 5.10.95.

Note that this issue might affect only my devices (Archer C6 v3 and A6 v3) and not all mt7621 devices.

So even with kernel 5.10.92, which is problematic in my experience, you may have better luck with a different device.

BTW, due to a complete lack of error messages I am assuming the kernel is the issue (since it hung while loading), but it might be something else in this build.

Please continue the debate here (and maybe help by trying the possible fix) as this problem is not related to SW/HW offload (which is the topic of this thread):

I bricked my A6v3 2 days ago. Can you tell me how to recover via UART connection?

@AashishAS, let's continue the discussion about unbricking Archer C6/A6 v3.x in the new topic below:

I just tested a new snapshot build today (r18710-dc2da6a233, kernel 5.10.96). Things got worse:

  1. With the now default firewall4, enabling HW flow offload breaks the firewall and all connected devices lose internet connectivity. Basically firewall4 is not able to recognize the flag option flow_offloading_hw '1' in the firewall config file. More details here.

  2. Redoing the build but selecting firewall3 instead, enabling HW flow offload now has no effect. I monitored the CPU usage (medium/high) during a heavy download, and with HW flow offload enabled the CPU usage is the same as with it disabled. With firewall3 and kernel 5.4, HW flow offload works perfectly and the CPU usage is minimal.
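
The CPU comparison in item 2 can be reproduced without extra tools by sampling /proc/stat twice while the download is running. This is only a sketch: the helper name cpu_busy_pct and the 2-second interval are my own choices, and it reports the combined busy share of all cores rather than per-core load.

```shell
# Busy CPU percentage between two "cpu ..." lines read from /proc/stat.
# Counts only the first eight counters (user .. steal), since the guest
# counters that follow are already included in user and nice.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            idle = $5 + $6          # idle + iowait
        }
        NR == 1 { total0 = total; idle0 = idle }
        NR == 2 {
            dt = total - total0
            busy = dt - (idle - idle0)
            if (dt > 0) { printf "%.1f\n", busy / dt * 100 }
            else { print "0.0" }
        }'
}

# Sample once, wait while the download is running, sample again.
if [ -r /proc/stat ]; then
    before=$(head -n 1 /proc/stat)
    sleep 2
    after=$(head -n 1 /proc/stat)
    cpu_busy_pct "$before" "$after"
fi
```

Running it once with HW flow offload enabled and once with it disabled, under the same download, gives two numbers that can be compared directly.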

Once again rolled back to snapshot r18324-794e8123ce (kernel 5.4.162), the last snapshot build that has HW offload working and stable.

Just tried on an R6220 with the Feb 3 OpenWrt SNAPSHOT r18717-0e32c6baf3 (kernel 5.10.96).
Firewall is firewall4 (nftables).

basic settings : 610 Mbit/s
SW offloading : 630 Mbit/s
SW/HW Offloading : 600 Mbit/s
All results are very close and offloading doesn't seem to be active.
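
To put a number on "very close", here is a small sketch that expresses each result as a percentage change from the no-offload baseline. The helper name pct_diff is made up for illustration; the inputs are the three figures above.

```shell
# Percentage difference of a throughput reading relative to a baseline.
# Usage: pct_diff BASELINE VALUE
pct_diff() {
    awk -v base="$1" -v val="$2" \
        'BEGIN { printf "%+.1f%%\n", (val - base) / base * 100 }'
}

pct_diff 610 630   # SW offloading vs. basic settings    -> +3.3%
pct_diff 610 600   # SW/HW offloading vs. basic settings -> -1.6%
```

Both differences are within a few percent of the baseline, which is consistent with offloading not actually being active.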

When I click on status/firewall I get no response, and apparently there is no firewall process in system/processes.

BTW I have a symmetric 1 Gbit/s fiber connection, which I normally use with an x86 router.

You need to reboot the router now.

Hardware offload has many incompatibilities:

VLAN, STP, stats...

For the future I recommend a powerful CPU without offload.

I'm waiting for something with 2.5GbE for now.