Master regression - boot loop due to kernel panic on latest snapshot (mt7621/Archer C6 v3)

In case a developer is looking at this topic: I've tracked the issue to the code below. I believe this code has not changed recently, though, and I don't understand its logic well enough to investigate why it started failing (I suspect that the "detect_magic" check is flawed):

/arch/mips/ralink/mt7621.c

static void __init mt7621_memory_detect(void)
{
	void *dm = &detect_magic;
	phys_addr_t size;

	for (size = 32 * SZ_1M; size < 256 * SZ_1M; size <<= 1) {
		if (!__builtin_memcmp(dm, dm + size, sizeof(detect_magic)))
			break;
	}

	if ((size == 256 * SZ_1M) &&
	    (CPHYSADDR(dm + size) < MT7621_LOWMEM_MAX_SIZE) &&
	    __builtin_memcmp(dm, dm + size, sizeof(detect_magic))) {
		memblock_add(MT7621_LOWMEM_BASE, MT7621_LOWMEM_MAX_SIZE);
		memblock_add(MT7621_HIGHMEM_BASE, MT7621_HIGHMEM_SIZE);
	} else {
		memblock_add(MT7621_LOWMEM_BASE, size);
	}
}

mt7621_memory_detect() determines the amount of memory installed by testing whether the physical memory content is mirrored to the next-higher possible memory size. For example if you have 128MB of memory, the value (detect_magic) at (say) 1MB and 33MB would be different, the value at 1MB and 65MB would be different, but the value at 1MB and 129MB would be identical because the 0-128MB range is mirrored to 128MB-256MB.
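
The mirroring idea can be tried out in userspace with a small simulation. This is only a sketch under stated assumptions, not the kernel code: memory is modelled as one 32-bit cell per MiB, address wrap-around is modelled with a modulo, and MAGIC, read_cell() and detect_size_mib() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_MIB 256u          /* largest size the probe loop considers */
#define MAGIC      0xdeadbeefu   /* hypothetical stand-in for detect_magic */

/* Simulated DRAM: one 32-bit cell per MiB, enough for the full window. */
static uint32_t cells[WINDOW_MIB];

/* Read through the simulated address bus: addresses beyond the installed
 * size wrap around, which is the mirroring the detection relies on. */
static uint32_t read_cell(unsigned addr_mib, unsigned installed_mib)
{
	return cells[addr_mib % installed_mib];
}

/* Follows the loop structure of mt7621_memory_detect(), in MiB units.
 * installed_mib is assumed to be a power of two between 32 and 256.
 * Returns the detected size in MiB. */
static unsigned detect_size_mib(unsigned installed_mib)
{
	const unsigned dm = 1;  /* magic assumed to sit at the 1 MiB mark */
	unsigned size;

	assert(installed_mib >= 32 && installed_mib <= WINDOW_MIB);
	memset(cells, 0, sizeof(cells));
	cells[dm] = MAGIC;

	for (size = 32; size < 256; size <<= 1) {
		if (read_cell(dm, installed_mib) ==
		    read_cell(dm + size, installed_mib))
			break;  /* mirror found: memory ends at this size */
	}
	return size;
}
```

With 128MB simulated, the probes at 33MB and 65MB read zero and the probe at 129MB wraps back to the magic at 1MB, so the loop stops at 128. With 256MB simulated no probe wraps, and the loop runs out with size equal to 256.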

The if-statement in the second half of the function is just due to the way MT7621 lays out its memory; if you have 512MB memory then the I/O region will end up in the middle of your physical memory. So that conditional sets up the first 448MB as lowmem and the remaining 64MB on the other side of the I/O region as highmem.
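
To make the 448MB + 64MB split concrete, here is a small userspace C sketch. The base addresses and sizes are taken from the memory ranges printed in the boot log later in this thread (0x00000000-0x1bffffff and 0x20000000-0x23ffffff). struct mem_range and layout_ranges() are hypothetical names, and passing the installed size in directly is a simplification: the kernel has to probe for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIB             (1024u * 1024u)
#define LOWMEM_BASE     0x00000000u
#define LOWMEM_MAX_SIZE (448u * MIB)   /* 0x1c000000, ends where the I/O region starts */
#define HIGHMEM_BASE    0x20000000u    /* first address past the I/O region */
#define HIGHMEM_SIZE    (64u * MIB)    /* 0x04000000 */

struct mem_range {
	uint32_t base;
	uint32_t size;
};

/* Fill out[] with the ranges that would be registered for the given
 * installed size (in MiB) and return how many ranges were written. */
static int layout_ranges(uint32_t installed_mib, struct mem_range out[2])
{
	assert(out != NULL);

	if (installed_mib == 512) {
		/* 512MB: split around the I/O region */
		out[0].base = LOWMEM_BASE;
		out[0].size = LOWMEM_MAX_SIZE;
		out[1].base = HIGHMEM_BASE;
		out[1].size = HIGHMEM_SIZE;
		return 2;
	}

	/* Anything smaller fits below the I/O region in one range. */
	out[0].base = LOWMEM_BASE;
	out[0].size = installed_mib * MIB;
	return 1;
}
```

For 512MB this yields two ranges whose sizes add up to 512MB, with a 64MB hole between the end of the first range (448MB) and the start of the second (0x20000000).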

Unfortunately this means disabling highmem isn't an effective workaround; when the memory size is misdetected, you'll just end up with 448MB of lowmem, which is obviously still quite wrong. I suspect that your no-highmem build has merely perturbed the kernel layout in a way that avoids the problem occurring.

If you're able to reproduce the problem with different builds, it might be worth instrumenting it to see why mt7621_memory_detect() is failing, but if you just want a reliable workaround I suggest removing target/linux/ramips/patches-5.4/105-mt7621-memory-detect.patch which should return to the upstream memory detection algorithm.

Thank you! I've now updated all four of my Archer C6 v3 units with "CONFIG_HIGHMEM=n" and so far the issue has not reappeared, but as you said, that might be a (positive) side effect.

This week I will try your suggestion for removing 105-mt7621-memory-detect.patch and check the results.

Good news! I followed your suggestion, re-enabled "CONFIG_HIGHMEM=y" and removed the patch **target/linux/ramips/patches-5.4/105-mt7621-memory-detect.patch**.

Issue solved! So it seems that in fact this patch is the culprit.

I will mark your post as the solution for future reference and add this information to the bug report.

cc @981213

Well, I think I spoke too soon. In my rush to test it last night, I just removed the patch and rebooted the router. It rebooted OK, without a boot loop.

But today, on closer inspection of the log, I noticed that the RAM size is being detected incorrectly: 256MB instead of 128MB:

[    0.000000] Linux version 5.4.162 (dsouza@dsouza00) (gcc version 11.2.0 (OpenWrt GCC 11.2.0 r18233-0a4f5d06c2)) #0 SMP Sun Nov 28 20:15:10 2021
[    0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
[    0.000000] printk: bootconsole [early0] enabled
[    0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
[    0.000000] MIPS: machine is TP-Link Archer C6 v3
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] VPE topology {2,2} total 4
[    0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000000fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001bffffff]
[    0.000000]   node   0: [mem 0x0000000020000000-0x0000000023ffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000023ffffff]
[    0.000000] On node 0 totalpages: 65536
[    0.000000]   Normal zone: 576 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 65536 pages, LIFO batch:15
[    0.000000] percpu: Embedded 14 pages/cpu s26480 r8192 d22672 u57344
[    0.000000] pcpu-alloc: s26480 r8192 d22672 u57344 alloc=14*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 64960
[    0.000000] Kernel command line: console=ttyS0,115200n8 rootfstype=squashfs,jffs2
[    0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[    0.000000] Writing ErrCtl register=0005a010
[    0.000000] Readback ErrCtl register=0005a010
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 250320K/262144K available (6107K kernel code, 201K rwdata, 1240K rodata, 1272K init, 206K bss, 11824K reserved, 0K cma-reserved)

I will do a new build disabling HIGHMEM to see if it has any effect. So while removing the patch prevented the bootloop, the memory is still being incorrectly identified.
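
For anyone checking the numbers in the log above, the conversion is simple arithmetic. The sketch below assumes the 4096-byte page size that the "alloc=14*4096" log line suggests; pages_to_mib() and kib_to_mib() are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u  /* assumed from "alloc=14*4096" in the log */

/* Convert a page count (e.g. "totalpages") to MiB. */
static uint32_t pages_to_mib(uint32_t pages)
{
	return (uint32_t)(((uint64_t)pages * PAGE_SIZE_BYTES) >> 20);
}

/* Convert a KiB figure (e.g. the "Memory: .../262144K" line) to MiB. */
static uint32_t kib_to_mib(uint32_t kib)
{
	assert(kib % 1024u == 0);  /* keep the example to whole MiB values */
	return kib >> 10;
}
```

Both "totalpages: 65536" and "262144K" work out to 256 MiB, which is twice the 128MB actually fitted to this device.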

This should skip the memory detection and force the memory to be 128M:

--- a/target/linux/ramips/dts/mt7621_tplink_archer-x6-v3.dtsi
+++ b/target/linux/ramips/dts/mt7621_tplink_archer-x6-v3.dtsi
@@ -18,6 +18,11 @@
 		bootargs = "console=ttyS0,115200n8";
 	};
 
+	memory@0 {
+		device_type = "memory";
+		reg = <0x0 0x8000000>;
+	};
+
 	keys {
 		compatible = "gpio-keys";
 
Thank you! This change solved the problem. I'm not sure why the memory size detection started failing on this device. Should this .dtsi change be a permanent solution for this issue?

Well, some good and unexpected news.

Today (Dec 5th, 2021) I did a new build (r18287-f9a28d216d) without the above patch to skip the memory detection in the .dtsi file, and now everything is working fine again (tested with 4 Archer C6 v3).

So I'm convinced this was in fact a regression that has now been fixed (by some unidentified commit), and the above patch is no longer required.

The mystery now is to identify which change/commit fixed this issue... :upside_down_face:

Hi @981213!

I noticed that you landed the patch "ramips: mt7621: do memory detection on KSEG1".

I contacted @dsouza about revisiting this unresolved boot loop issue.
We exchanged messages, and I tried backporting your fix to kernel 5.4, exactly on top of the failing commit: r18195-d1c7df9c4b

@dsouza compiled it and reported back that it fixes the boot loop.

While he reported that a later build (r18287-f9a28d216d) doesn't reproduce the issue, it's clear to me that your patch is the real solution for this problem.

@981213, as kernel 5.4 is still used in release 21.02, may I backport your commit 2f024b793311 to 21.02, or ask for it to be backported?

Hi! As you've already done the backport, feel free to submit your work to the mailing list or as a GitHub pull request :smiley:

@dsouza, I just pushed the 21.02 backport:

Please test this too!
And if you don't mind, provide a Tested-by: too, thanks!

@xabolcs, I'm willing to test, but how do you want me to test?

I mean, in the previous test, I applied your backported patch (for kernel 5.4) to the failing build (r18195-d1c7df9c4b) and confirmed that it worked.

I understand that the patch is now committed to master, but the current master has dropped support for kernel 5.4 and only builds with kernel 5.10.

So how do you suggest I test it? Perhaps by applying this patch to the 21.02 head (which is still on kernel 5.4) and testing with a 21.02 head build?

As you can see, this second branch is based on openwrt-21.02.
If you check its log, you will see it's just after the 21.02.2 release! :+1:

Just fetch my branch from my fork into your working copy as a local branch, checkout and build it as usual.

I can provide the git commands later, if you need them. :slightly_smiling_face:


Here you are, @dsouza

# go to your local clone
$ cd openwrt

# add my fork to your working copy and name it e.g. "xabolcs-github"
$ git remote add xabolcs-github https://github.com/xabolcs/openwrt.git
$ git remote -v show
...<snip>...
origin	https://github.com/openwrt/openwrt.git (fetch)
origin	https://github.com/openwrt/openwrt.git (push)
...<snip>...
xabolcs-github	https://github.com/xabolcs/openwrt.git (fetch)
xabolcs-github	https://github.com/xabolcs/openwrt.git (push)

# fetch my "branch-21.02-backport-ramips-memory-detect" branch to your working copy and name it e.g. "21.02-ramips-memory-detect"
$ git fetch xabolcs-github branch-21.02-backport-ramips-memory-detect:21.02-ramips-memory-detect
From https://github.com/xabolcs/openwrt
 * [new branch]            branch-21.02-backport-ramips-memory-detect -> 21.02-ramips-memory-detect
 * [new branch]            branch-21.02-backport-ramips-memory-detect -> xabolcs-github/branch-21.02-backport-ramips-memory-detect

# switch to it and check its history
$ git checkout 21.02-ramips-memory-detect 
Switched to branch '21.02-ramips-memory-detect'

# (I'm syncing OpenWrt's GitHub mirror as "origin"; that's why you are seeing those "openwrt-21.02" branches. You can check them at https://github.com/openwrt/openwrt/commits/openwrt-21.02)
$ git log --oneline
855f60e85e (HEAD -> 21.02-ramips-memory-detect, xabolcs-github/branch-21.02-backport-ramips-memory-detect) ramips: mt7621: do memory detection on KSEG1
7fc336484b (origin/openwrt-21.02, openwrt-21.02) rpcd: backport 802.11ax support
d1c15c41d9 OpenWrt v21.02.2: revert to branch defaults
30e2782e06 (tag: v21.02.2) OpenWrt v21.02.2: adjust config defaults
bf0c965af0 ramips: fix NAND flash driver ECC bit position mask
adb65008c8 kernel: backport fix for initializing skb->cb in the bridge code to 5.4
b7af850bd2 tools/mtools: update to 4.0.35
5d553d8767 tools/fakeroot: fix unresolved symbols on arm64 macOS
c8d6a7c84e tools/fakeroot: fix build on MacOS arm64
83bf22ba2e tools/fakeroot: explicitly pass CPP variable
230ec4c69c bcm4908: backport watchdog and I2C changes
87b9ba9ed9 bcm4908: backport first 5.18 DTS changes
e6a718239f bcm4908: backport bcm_sf2 patch for better LED registers support
e6aaa061d0 bcm4908: backport BCM4908 pinctrl driver
59e7ae8d65 tcpdump: Fix CVE-2018-16301
de948a0bce glibc: update to latest 2.33 HEAD
0c0db6e66b hostapd: Apply SAE/EAP-pwd side-channel attack update 2
5b13b0b02c wolfssl: update to 5.1.1-stable
7d376e6e52 libs/wolfssl: add SAN (Subject Alternative Name) support
5ea2e1d5ba wolfssl: enable ECC Curve 25519 by default
4108d02a29 ustream-ssl: update to Git version 2022-01-16

# and then: build as usual
$ make clean
$ scripts/feeds update -a
$ scripts/feeds install -a
$ make menuconfig
$ make

Thank you @xabolcs for the instructions! Test done and passed (r0+16499-855f60e85e)!

Since my device (Archer C6 v3.2) is not yet supported in 21.02, I built and tested with an image of Archer A6 v3, which has the same hardware.

Boot is OK, test passed (rebooted it 5 times, all boots completed successfully)!

BTW, how do I update github with "Tested by"? My user there is "d-me3".

Thank you for the verification!
I'll handle the commit update! :wink:

If you don't mind revealing your real name and email address, then just provide the information here, as a comment:

Tested-by: Your Realname <valid@email.adress>

Of course I could use @dsouza on OpenWrt Forum otherwise.

Sure! I just sent you a private message with my name and email address.

Thanks!

Edited: I also added this info as a comment on your commit on GitHub.

And it's now in the kernel 5.4 based openwrt-21.02 branch, so it will be included in the next 21.02 release: 21.02.3! :tada:
