Get ramoops, mtdoops or mtdpstore working on more devices

I've got a Linksys MX4200v2 and no UART access to it. Recently, I was experiencing some kernel crashes and I wasn't able to capture the logs necessary to troubleshoot the issue.

That led me to investigate ramoops and mtdoops, which are existing kmods in OpenWRT, and mtdpstore, which isn't, although it didn't take much effort to add and test.

There is a PR, #15688, that adds ramoops regions, and therefore ramoops support, on qualcommax routers.

That PR has already established that ramoops doesn't work on all devices. In many of the UART logs I've seen, the error is pstore: backend (ramoops) writing error (-28). Error -28 is ENOSPC, which would mean the storage is full, but that's not the case.

In my testing of ramoops, mtdoops and mtdpstore, nothing ever gets written, whether to RAM or MTD. It isn't just kernel crashes: I've also tried console and user space messages while the device is completely stable and generating those logs, but they are not written to the backend, regardless of whether it's RAM or MTD. All this time, the dmesg messages confirm everything is loaded and in good order, but nothing is recorded.

This has me a bit puzzled and I am looking for some assistance to crack this one.

Outside the OpenWRT world, a common issue with ramoops is SELinux, but that's not enabled in OpenWRT and, I believe, there are no other similar security modules in place. Besides, it would affect all devices, and some of them do get oops logs using ramoops.

Since I get the same results in ram and mtd, it doesn't seem to be an issue particularly linked to either of them.

Another totally speculative idea I've considered is that something in the DTS files of some devices inadvertently prevents these logs from going anywhere, but that doesn't make much sense. A binary firmware somewhere? Qualcomm's minidump?

What I would really like with this post is to see if there is anyone with the technical ability, and of course the desire to use it for this, to work with me to narrow down the likely sources of this problem and hopefully find a solution. I am not a developer but I am proficient at troubleshooting and have many years of experience working with Linux systems.

For the community, I think it would be very beneficial to have tools available to retrieve crash logs on more devices for people without UART access, who I am pretty sure are the majority.

I'll bet @daniel could give you some hints on what's going on here and suggest some avenues to pursue.

I have seen files in my RT3200's /sys/fs/pstore after oops crashes, but it has been maybe a couple years so I have no real recollection of what...

Memory mapping (of RAM reserved for IP cores and their firmware) is the most likely cause of problems.

Would that also be the case for mtdoops or mtdpstore? With ramoops, I've tried reserving many different RAM regions. Since that didn't work, I tested the other two, which don't write to RAM, but the result is seemingly the same: nothing is ever written. However, I don't have a log message such as pstore: backend (mtdoops) writing error (-28) to confirm it's exactly the same.

All right, I finally cracked this one. For ramoops at least.

It's CONFIG_IO_STRICT_DEVMEM=y in target/linux/generic/config-6.6

So, it is a security feature which is what I thought when I saw reports about SELinux causing problems with ramoops.

It seems that it can be dealt with using the boot parameter iomem=relaxed as well as obviously CONFIG_IO_STRICT_DEVMEM=n which is what I've tried.
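For anyone wanting to check their own build, here is a minimal sketch that reports how a kernel option is set in a config file and whether a parameter is present on the kernel command line. The helper names kconfig_state and cmdline_has are made up for illustration and are not part of OpenWRT. It assumes you point it at the kernel .config from your build tree, since /proc/config.gz is often not available on the router itself.

```shell
#!/bin/sh
# Hypothetical helpers for illustration, not part of OpenWRT.

# kconfig_state FILE SYMBOL
# Prints the value (y, m, ...) if SYMBOL is set in FILE, "n" if FILE carries
# the "# SYMBOL is not set" marker, and "absent" if SYMBOL is not mentioned.
kconfig_state() {
    if grep -q "^$2=" "$1"; then
        sed -n "s/^$2=//p" "$1" | tail -n 1
    elif grep -q "^# $2 is not set" "$1"; then
        echo n
    else
        echo absent
    fi
}

# cmdline_has PARAM [FILE]
# Succeeds if PARAM appears as a whole word on the kernel command line.
# FILE defaults to /proc/cmdline.
cmdline_has() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

Usage would look like kconfig_state /path/to/kernel/.config CONFIG_IO_STRICT_DEVMEM on the build host, and cmdline_has iomem=relaxed && echo relaxed on the router.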

I need to purge my test environment of any unnecessary changes and test if just adding CONFIG_IO_STRICT_DEVMEM=n to kmod-pstore in package/kernel/linux/modules/fs.mk solves the issue. That'd be ideal and lead to the first PR.

Then I'll investigate mtdoops and eventually try to add support for mtdpstore in OpenWRT but one step at a time. This is just a quick update in case someone is in immediate need of collecting a crash log using ramoops.

EDIT: I've opened #16638 to fix ramoops

An update in regards to mtdoops. I don't know what I did wrong when I tried it before, but it works. It's not the easiest, and certainly something only for very advanced users who know what they are doing, but you can get the logs.

I am going to document here how I've done it for my MX4200v2. It is very important that you understand that this method writes to NAND, and if you get it wrong you can overwrite essential partitions and brick the device.

  1. Edited target/linux/qualcommax/files/arch/arm64/boot/dts/qcom/ipq8174-mx4200.dtsi to add the oops partition. I used what DD-WRT uses since it makes sense. syscfg is originally given 8MB on Linksys stock, but it's literally a text file and it will never get even near 1MB.

So this:

			partition@13f00000 {
				label = "syscfg";
				reg = <0x13f00000 0xb800000>;
				read-only;
			};

became this:

			partition@13f00000 {
				label = "syscfg";
				reg = <0x13f00000 0xb100000>;
				read-only;
			};
			partition@1f000000 {
				label = "oops";
				reg = <0x1f000000 0x700000>;
			};

Note that the size value for syscfg goes from 0xb800000 to 0xb100000.
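Since a mistake here can overwrite the wrong part of NAND, it is worth double-checking the arithmetic before building. The sketch below only does the hex sums for the values in the two DTS snippets above; part_end is a made-up helper name.

```shell
#!/bin/sh
# part_end START SIZE
# Prints the first address after a partition, in hex.
# (Hypothetical helper, for illustration only.)
part_end() {
    printf '0x%x\n' $(( $1 + $2 ))
}

# Values copied from the DTS snippets above.
old_syscfg_end=$(part_end 0x13f00000 0xb800000)
new_syscfg_end=$(part_end 0x13f00000 0xb100000)
oops_end=$(part_end 0x1f000000 0x700000)

echo "syscfg (old) ends at $old_syscfg_end"
echo "syscfg (new) ends at $new_syscfg_end, oops starts at 0x1f000000"
echo "oops ends at $oops_end"
```

The shrunk syscfg should end exactly where oops starts, and oops should end where the original syscfg ended, so the two partitions together cover the same range as before.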

  2. Built the firmware with mtdoops enabled as a module. Save yourself some time and build in openssh-sftp-server, as you will need to transfer the module to the router. If you want mtdoops built in, you will need to add the required boot arguments to the DTS file, but I am not going to explain that way of doing it.
  3. Flashed the new firmware, connected over SSH and used hexdump -C /dev/mtd28 to confirm that the partition was in fact empty. Remember that devices are numbered from 0, so be careful with your counting.
  4. Took a copy of oops. I used LuCI but hexdump would have done. Whatever you do, get the copy off the router; with LuCI that's automatic, as you download it.
  5. Transferred the kmod-mtdoops module, which is under bin/targets/qualcommax/ipq807x/packages/, to the router. Installed it with opkg.
  6. Loaded the module with insmod mtdoops mtddev="oops". Confirmed with dmesg that the partition is being used for mtdoops.
  7. Triggered a panic with sh -c 'echo 10 > /proc/sys/kernel/panic; echo c > /proc/sysrq-trigger'.
  8. Once the router was back up, checked that the dump had been written using hexdump -C /dev/mtd28.
  9. Not really necessary, but I restored the empty copy I took first with mtd -e oops -n write your-back-up.bin oops
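To avoid counting devices by hand, the index can be looked up by label in /proc/mtd. This is a minimal sketch: mtd_index_by_label is a made-up helper name, and it assumes the usual /proc/mtd line format (mtdN: size erasesize "name") and a label without characters that are special to sed.

```shell
#!/bin/sh
# mtd_index_by_label LABEL [FILE]
# Prints N for the mtdN entry whose name is exactly LABEL.
# Prints nothing if there is no such entry. FILE defaults to /proc/mtd.
# (Hypothetical helper, for illustration only.)
mtd_index_by_label() {
    sed -n "s/^mtd\([0-9][0-9]*\): .* \"$1\"\$/\1/p" "${2:-/proc/mtd}"
}
```

On the router this could be used as idx=$(mtd_index_by_label oops) followed by hexdump -C "/dev/mtd$idx", after checking that idx is not empty.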

While it's good to have a second option in case ramoops fails to get the logs that you need, mtdoops is not as convenient as it's not writing to a file system where you can easily see the log after a crash.

Next stop, mtdpstore or whatever you want to call it. It seems that the developers had a difficult time settling on the name and used a different one in every patch they sent to the kernel :slight_smile: . Spoiler: no errors on load, but nothing gets written.

@qosmio
I've been testing your NSS fork vs regular OpenWRT, and ramoops doesn't work on an NSS build with exactly the same changes.

On your fork I added the following to target/linux/qualcommax/files/arch/arm64/boot/dts/qcom/ipq8174-mx4200v2.dts

/delete-node/ &ramoops_region;

/ { 
	reserved-memory {
		ramoops_region: ramoops@51100000 {
			compatible = "ramoops";
			reg = <0x0 0x51100000 0x0 0x100000>;
			no-map;
			record-size = <0x1000>;
			console-size = <0x1000>;
		};
	};
};

And edited ramoops in package/kernel/linux/modules/other.mk

define KernelPackage/ramoops
  SUBMENU:=$(OTHER_MENU)
  TITLE:=Ramoops (pstore-ram)
  DEFAULT:=m if ALL_KMODS
  KCONFIG:=CONFIG_PSTORE_RAM \
	CONFIG_PSTORE_CONSOLE=y \
	CONFIG_IO_STRICT_DEVMEM=n
  DEPENDS:=+kmod-pstore +kmod-reed-solomon
  FILES:= $(LINUX_DIR)/fs/pstore/ramoops.ko
  AUTOLOAD:=$(call AutoLoad,30,ramoops,1)
endef

And I can't get any logs using ramoops whereas if I do the same (minus /delete-node/ &ramoops_region;) on regular OpenWRT, I do get logs. Might be something to do with NSS itself or the binaries.

If you see my previous post about mtdoops, now I realise that when I tried the first time I was using your fork, and when I tried it afterwards and it worked with no changes, I was using regular OpenWRT. So it is also possible that this is a problem with dumps on NSS builds in general, not just ramoops.

Did the region 0x51200000 also not work with NSS fork?

Start Address       End Address         Size (MB)  Desc
0x0000000040000000  0x0000000040ffffff  16         NSS
0x000000004a400000  0x000000004a5fffff  2          TZ App
0x000000004a600000  0x000000004a9fffff  4          Bootloader
0x000000004aa00000  0x000000004aafffff  1          Secondary Bootloader
0x000000004ab00000  0x000000004abfffff  1          Shared Mem
0x000000004ac00000  0x000000004affffff  4          Memory
0x000000004b000000  0x0000000050efffff  95         WCNSS (wifi)
0x0000000050f00000  0x0000000050ffffff  1          Q6 ETR Dump
0x0000000051000000  0x00000000510fffff  1          M3 Dump
0x0000000051200000  0x00000000512fffff  1          Ramoops

I'm not sure why, to be honest. I know NSS allots the first 16 MB, but it could still be accessing non-specified regions. In the past, on ipq806x, I remember having to shift the ramoops region much further down to get it to work. So far, the devices confirmed not working with those regions are the dl-wrx36 (which I use as my main router) and now the mx4200v2, as you've mentioned.

I ran some tests to compare regular builds with NSS builds from your fork.

On a regular OpenWRT with no ramoops_region for MX4200v2:

cat /proc/iomem
40000000-40ffffff : reserved
41000000-4a3fffff : System RAM
  41010000-41a9ffff : Kernel code
  41aa0000-41c1ffff : reserved
  41c20000-41d4ffff : Kernel data
  4a3f1000-4a3fcfff : reserved
4a400000-510fffff : reserved
51100000-7fffffff : System RAM
  7eb00000-7fbfffff : reserved
  7fc18000-7fc1afff : reserved
  7fc1b000-7fd9bfff : reserved
  7fd9c000-7fde3fff : reserved
  7fde6000-7fde6fff : reserved
  7fde7000-7fde8fff : reserved
  7fde9000-7fdf9fff : reserved
  7fdfa000-7fffffff : reserved

I decided to test 0x51100000, 0x51200000, 0x67e00000 and 0x7ea00000 on regular OpenWRT and these are the results:

0x51100000: works.
0x51200000: works.
0x67e00000: works.
0x7ea00000: works.

All of them worked as you can see.

Each time I'd use dmesg | egrep 'pstore|ramoops' to confirm the region and trigger the crash with echo 10 > /proc/sys/kernel/panic; echo c > /proc/sysrq-trigger.
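Before rebuilding with a new region, a candidate address can be checked against a saved copy of /proc/iomem taken from a build without the ramoops node. The sketch below is only an illustration: region_status is a made-up helper, and it assumes the two-level layout shown in the listings above (top-level ranges with indented children). It reports "free" only if the whole region sits inside a top-level System RAM range and overlaps no indented child range. Note that /proc/iomem shows zeroed addresses to non-root users, in which case everything will be reported as busy.

```shell
#!/bin/sh
# region_status START SIZE [FILE]
# Prints "free" if [START, START+SIZE) lies inside a top-level "System RAM"
# range and overlaps no nested (indented) range, otherwise prints "busy".
# FILE defaults to /proc/iomem. (Hypothetical helper, for illustration only.)
region_status() {
    rs_start=$(( $1 ))
    rs_end=$(( $1 + $2 - 1 ))
    rs_file=${3:-/proc/iomem}
    rs_in_ram=0
    rs_overlap=0
    while IFS= read -r rs_line; do
        [ -n "$rs_line" ] || continue
        case $rs_line in
            ' '*) rs_nested=1 ;;
            *)    rs_nested=0 ;;
        esac
        rs_name=${rs_line#* : }        # text after " : "
        rs_range=${rs_line%% : *}      # "lo-hi", possibly indented
        rs_range=${rs_range##* }       # drop the indentation
        rs_lo=$(( 0x${rs_range%-*} ))
        rs_hi=$(( 0x${rs_range#*-} ))
        if [ "$rs_nested" -eq 0 ]; then
            # Top-level entry: the region must fit entirely in System RAM.
            if [ "$rs_name" = "System RAM" ] &&
               [ "$rs_lo" -le "$rs_start" ] && [ "$rs_end" -le "$rs_hi" ]; then
                rs_in_ram=1
            fi
        elif [ "$rs_lo" -le "$rs_end" ] && [ "$rs_start" -le "$rs_hi" ]; then
            # Nested entry (kernel code, reserved, ...) overlapping the region.
            rs_overlap=1
        fi
    done < "$rs_file"
    if [ "$rs_in_ram" -eq 1 ] && [ "$rs_overlap" -eq 0 ]; then
        echo free
    else
        echo busy
    fi
}
```

For example, region_status 0x51100000 0x100000 iomem.txt checks a 1 MB region at 0x51100000 against a saved listing in iomem.txt (a hypothetical file name). A "free" result only says the range is not claimed in that listing; it cannot tell whether firmware touches the memory without declaring it.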

On a NSS build with /delete-node/ &ramoops_region; on target/linux/qualcommax/files/arch/arm64/boot/dts/qcom/ipq8174-mx4200v2.dts

cat /proc/iomem
40000000-40ffffff : reserved
41000000-4a3fffff : System RAM
  41010000-41b1ffff : Kernel code
  41b20000-41ccffff : reserved
  41cd0000-41e0ffff : Kernel data
  4a3f0000-4a3fdfff : reserved
4a400000-510fffff : reserved
51100000-7fffffff : System RAM
  7eb00000-7fbfffff : reserved
  7fc0f000-7fc11fff : reserved
  7fc12000-7fd92fff : reserved
  7fd93000-7fddefff : reserved
  7fde0000-7fde2fff : reserved
  7fde3000-7fdf3fff : reserved
  7fdf4000-7fdf4fff : reserved
  7fdf5000-7fffffff : reserved

I decided to test the same areas, as none of them were reserved: 0x51100000, 0x51200000, 0x67e00000 and 0x7ea00000.

0x51100000: works.
0x51200000: works. I let your patch do its thing and didn't add anything to ipq8174-mx4200v2.dts.
0x67e00000: works.
0x7ea00000: works.

The takeaways are:

  • ramoops works on NSS builds as well as on regular OpenWRT on the MX4200v2. Going by my tests, that wasn't the case until recently. You can check #16638, as I closed the PR when ramoops started working. I think it got fixed upstream, or maybe it was something on OpenWRT's side and it eventually made it to your fork.
  • When it works, it works. You can reserve this or that region, but for me it's always been all or nothing. As long as you reserve memory that's not reserved already, of course.
  • You can add the MX4200v2 as working on #15688. I'll add my comment. I still stand by my previous comment that ramoops regions are better dealt with at a device level, because of the many differences in reserved memory that I think you will find between devices. For example, 0x51100000 would be the better option for the MX4200v2. Or, in its current form, customised reserved mappings should be preceded by /delete-node/ &ramoops_region; and then the full definition added, to avoid breaking the convention.

@smaller09 I hope you don't mind my responding to you on another thread. I don't want to pollute the NSS post too much with pstore stuff.

Thanks for the additional information about triggering subsequent crashes.

I can indeed get more than one log too:

ls -lah /sys/fs/pstore/
drwxr-x---    2 root     root           0 Oct 14 21:27 .
drwxr-xr-x    7 root     root           0 Jan  1  1970 ..
-r--r--r--    1 root     root       31.5K Oct 14 21:28 dmesg-pstore_blk-0
-r--r--r--    1 root     root       33.7K Oct 14 21:47 dmesg-pstore_blk-1
-r--r--r--    1 root     root       33.0K Oct 14 21:36 dmesg-pstore_blk-2

The problem is that sometimes it logs the crash, sometimes it doesn't. The success ratio is very poor, below 40%. I'm documenting my numerous tests so I can put a number to it.
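One way to keep a tally across reboots is to copy whatever records exist out of pstore after each boot and note how many there were. This is only a sketch: save_pstore_records is a made-up helper name and the destination directory is whatever persistent location you choose. It leaves the originals in /sys/fs/pstore untouched.

```shell
#!/bin/sh
# save_pstore_records DEST [SRC]
# Copies every dmesg-* record from SRC into DEST and prints how many were
# copied. SRC defaults to /sys/fs/pstore.
# (Hypothetical helper, for illustration only.)
save_pstore_records() {
    sp_dest=$1
    sp_src=${2:-/sys/fs/pstore}
    sp_count=0
    mkdir -p "$sp_dest" || return 1
    for sp_rec in "$sp_src"/dmesg-*; do
        [ -e "$sp_rec" ] || continue    # the glob matched nothing
        cp "$sp_rec" "$sp_dest/" && sp_count=$((sp_count + 1))
    done
    echo "$sp_count"
}
```

Running it after each triggered panic and comparing the printed count with the number of files already saved gives a simple record of which attempts actually produced a dump.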

I'll continue to investigate. Since I don't have UART access, which is why I got interested in all these ways of logging panics in the first place, I might be missing key information about the failed attempts.