Can't flash MikroTik RB951G-2HnD

I use OpenWrt on the MikroTik RB951G-2HnD.
However, I can no longer flash the devices I received recently. They boot the RAM image as they should, but when I run sysupgrade -n image-name.bin it does nothing, and after a power cycle the device just loads RouterOS again.

I compared versions with an older device, which works as expected.
RouterOS on the working device is v6.44.5, whereas the failing device runs v6.46.4. I cannot downgrade below v6.46.1, so I cannot tell whether this is the cause.

When I boot the RAM image on both devices and compare the boot logs with dmesg, I see a difference in the NAND flash controller output.
The output of the working one:

...
[    4.216908] bootconsole [early0] disabled
[    4.232674] nand: device found, Manufacturer ID: 0x98, Chip ID: 0xf1
[    4.239292] nand: Toshiba NAND 128MiB 3,3V 8-bit
[    4.244078] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[    4.251934] Scanning device for bad blocks
[    4.363426] Bad eraseblock 768 at 0x000006000000
[    4.424183] Creating 3 MTD partitions on "ar934x-nfc":
[    4.429508] 0x000000000000-0x000000040000 : "booter"
[    4.456596] 0x000000040000-0x000000400000 : "kernel"
[    4.483654] 0x000000400000-0x000008000000 : "ubi"
[    4.513761] libphy: Fixed MDIO Bus: probed
...

The output of the failing one:

...
[    4.216765] bootconsole [early0] disabled
[    4.232570] nand: device found, Manufacturer ID: 0xc8, Chip ID: 0xf1
[    4.239186] nand: ESMT NAND 128MiB 3,3V 8-bit
[    4.243700] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 32
[    4.251528] ar934x-nfc ar934x-nfc: unsupported OOB size: 32 bytes
[    4.257832] ar934x-nfc ar934x-nfc: init tail failed, err:-6
[    4.264503] libphy: Fixed MDIO Bus: probed
...

If I run cat /proc/mtd on the failing device, I get:

root@OpenWrt:/proc# cat /proc/mtd 
dev:    size   erasesize  name
root@OpenWrt:/proc#

Running it on the working device, I get:

root@OpenWrt:~# cat /proc/mtd 
dev:    size   erasesize  name
mtd0: 00040000 00020000 "booter"
mtd1: 003c0000 00020000 "kernel"
mtd2: 07c00000 00020000 "ubi"
root@OpenWrt:~#

I get the same results with a self-compiled 18.06 build and with the current 19.07.03 release from the TOH page.

So I think the problem is that the NAND controller has changed and the new one is not supported yet?

I would like confirmation that I'm on the right track, and then I could use some help getting support for the changed NAND controller. Any other direction to solve the problem would also be appreciated.

I have the same issue: https://forum.openwrt.org/t/mikrotik-rb951ui-2hnd-unsupported-nand/72052

It looks similar, so the NAND is probably indeed the problem. Now the question is how to solve it.

The NAND chip has indeed changed. The working device uses a Toshiba TC58BVG0S3HTA00, whereas the newer device has a GigaDevice GD9FU1G8F3AMGI. Comparing the datasheets, the biggest difference is that internal ECC seems to be disabled on the GigaDevice NAND.
Because the OOB size should be derived from the ID but comes out wrong, I tried adding support for this NAND manually in ./build_dir/target-mips_24kc_musl/linux-ar71xx_mikrotik/linux-4.14.193/drivers/mtd/nand/nand_ids.c, in the existing list of incompatible NAND chips:

{"GD9FU1G8F3A 1G 3.3V 8-bit",
    { .id = {0xc8, 0xf1, 0x80, 0x19, 0x42} },
      SZ_2K, SZ_128, SZ_128K, 0, 5, 64, NAND_ECC_INFO(4, SZ_512) }, 

I compiled it and booted the RAM image. dmesg now shows that the NAND is recognized, although I also get a warning that the ECC is too weak compared to the one required by the NAND chip:

[    6.665970] nand: device found, Manufacturer ID: 0xc8, Chip ID: 0xf1
[    6.672585] nand: ESMT GD9FU1G8F3A 1G 3.3V 8-bit
[    6.677368] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[    6.685219] nand: WARNING: ar934x-nfc: the ECC used on your system is too weak compared to the one required by the NAND chip
[    6.696819] Scanning device for bad blocks
[    6.778870] Creating 3 MTD partitions on "ar934x-nfc":
[    6.784203] 0x000000000000-0x000000040000 : "booter"
[    6.814511] 0x000000040000-0x000000400000 : "kernel"
[    6.844970] 0x000000400000-0x000008000000 : "ubi"
[    6.878746] libphy: Fixed MDIO Bus: probed

According to the datasheet, the ECC requirement is 4 bit/512 bytes, which is what I had configured. The function that prints the warning is static bool nand_ecc_strength_good(struct mtd_info *mtd) in nand_base.c. Looking at that function, I expected the message to disappear when I changed NAND_ECC_INFO(4, SZ_512) to NAND_ECC_INFO(2, SZ_512) or NAND_ECC_INFO(8, SZ_512), but in both cases I still got the same message. So I changed it to NAND_ECC_INFO(0, 0) to disable the warning, which results in the following code block:

{"GD9FU1G8F3A 1G 3.3V 8-bit",
    { .id = {0xc8, 0xf1, 0x80, 0x19, 0x42} },
      SZ_2K, SZ_128, SZ_128K, 0, 5, 64, NAND_ECC_INFO(0, 0) }, 

I recompiled it and the warning is gone:

[    0.590582] nand: device found, Manufacturer ID: 0xc8, Chip ID: 0xf1
[    0.597201] nand: ESMT GD9FU1G8F3A 1G 3.3V 8-bit
[    0.601970] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[    0.609844] Scanning device for bad blocks
[    0.619245] random: fast init done
[    0.672264] Creating 3 MTD partitions on "ar934x-nfc":
[    0.677638] 0x000000000000-0x000000040000 : "booter"
[    0.684607] 0x000000040000-0x000000400000 : "kernel"
[    0.691285] 0x000000400000-0x000008000000 : "ubi"

I then wrote the generated openwrt-ar71xx-mikrotik-nand-large-squashfs-sysupgrade.bin to flash with sysupgrade -n -v /tmp/openwrt-ar71xx-mikrotik-nand-large-squashfs-sysupgrade.bin and the device booted OpenWrt :wink:

But my big questions now are:

  1. Why did I get the warning message? Is it because this NAND chip doesn't have internal ECC?
  2. Is there a way to test whether the NAND is working as it should, including error correction?
  3. Why does the datasheet state an ECC requirement of 4 bit/512 bytes while the internal ECC is disabled?

I have a similar problem with a recent MikroTik RB2011UiAS-RM that uses the same NAND chip (GD9FU1G8F3AMGI), but in my case I get an ECC error even when using NAND_ECC_INFO(0,0):

[   10.956750] __nand_correct_data: uncorrectable ECC error
[   10.962294] UBI: auto-attach mtd6
[   10.965733] ubi0: attaching mtd6

I modified int nand_scan_tail(struct mtd_info *mtd) in nand_base.c to print both the system and the chip ECC configuration. With NAND_ECC_INFO(4, SZ_512), which is supposed to be the correct config, I get:

[    7.877168] nand: WARNING: ar934x-nfc: the ECC used on your system (1b/256B) is too weak compared to the one required by the NAND chip (4b/512B)

Do you know why the system is using (1b/256B)?

I think I also had the __nand_correct_data: uncorrectable ECC error until I flashed the OpenWrt image to the NAND; after that it seems to work.
I still get the warning with ECC set to 4/512, though.

I don't know why the system seems to use 1/256. On Monday I will add some lines to my code to see what it reports on a MikroTik with the Toshiba NAND and on one with the GigaDevice NAND.

Can you share the code you used to print the values?

This is the code:

	if (!nand_ecc_strength_good(mtd))
		pr_warn("WARNING: %s: the ECC used on your system (%db/%dB) is too weak compared to the one required by the NAND chip (%db/%dB)\n",
			mtd->name, ecc->strength, ecc->size,
			chip->ecc_strength_ds,
			chip->ecc_step_ds);

It is based on the code of nand_ecc_strength_good(struct mtd_info *mtd):

static bool nand_ecc_strength_good(struct mtd_info *mtd)
{
	struct nand_chip *chip = mtd_to_nand(mtd);
	struct nand_ecc_ctrl *ecc = &chip->ecc;
	int corr, ds_corr;

	if (ecc->size == 0 || chip->ecc_step_ds == 0)
		/* Not enough information */
		return true;

	/*
	 * We get the number of corrected bits per page to compare
	 * the correction density.
	 */
	corr = (mtd->writesize * ecc->strength) / ecc->size;
	ds_corr = (mtd->writesize * chip->ecc_strength_ds) / chip->ecc_step_ds;

	return corr >= ds_corr && ecc->strength >= chip->ecc_strength_ds;
}

As you can see, NAND_ECC_INFO(0,0) disables the warning but not ECC.

c.ausema thank you!

I believe the system is doing the ECC itself, using the Hamming algorithm, which can correct only one bit.

To verify which algorithm is used, you can check the chip->ecc.algo attribute. It takes values such as NAND_ECC_HAMMING and NAND_ECC_BCH.

I have seen that there is an implementation of the BCH algorithm (which can fix more than one bit) in the already mentioned nand_base.c. When I tried to force its use, I got another error:

[    6.249722] UBI error: unable to read from mtd2

I didn't have much time today, but I did some quick testing.
It seems that the ECC in OpenWrt is indeed configured as 1/256 for both the Toshiba and the GigaDevice NAND. See my debug output:

[    7.493257] nand: device found, Manufacturer ID: 0x98, Chip ID: 0xf1
[    7.499873] nand: Toshiba NAND 128MiB 3,3V 8-bit
[    7.504640] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[    7.512515] nand: 
               
               DEBUG: values of ecc strength test:
                 mtd-writesize: 2048
                 ecc->strength: 1
                 ecc->size: 256
                 chip->ecc_strength_ds: 0
                 chip->ecc_step_ds: 0
               
                 ecc->mode: 1
                 ecc->algo: 1
               Debugging done

[    7.549071] Scanning device for bad block

[    7.493035] nand: device found, Manufacturer ID: 0xc8, Chip ID: 0xf1
[    7.499648] nand: ESMT GD9FU1G8F3A 1G 3.3V 8-bit
[    7.504417] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[    7.512290] nand: 
               
               DEBUG: values of ecc strength test:
                 mtd-writesize: 2048
                 ecc->strength: 1
                 ecc->size: 256
                 chip->ecc_strength_ds: 4
                 chip->ecc_step_ds: 512
               
                 ecc->mode: 1
                 ecc->algo: 1
               Debugging done

[    7.549027] nand: 
               
               DEBUG: values of ecc strength test:
                 mtd-writesize: 2048
                 ecc->strength: 1
                 ecc->size: 256
                 chip->ecc_strength_ds: 4
                 chip->ecc_step_ds: 512
                 corr: 8
                 ds_corr: 16
               
                 ecc->algo: 1
               Debugging done

[    7.588001] nand: WARNING: ar934x-nfc: the ECC used on your system (1b/256B) is too weak compared to the one required by the NAND chip (4b/512B)

So my conclusion is that we can also configure the GigaDevice entry with 0/0 instead of 4/512 to suppress the system ECC warning.

The only remaining question is: do we need to configure a stronger system ECC, given that the Toshiba had on-chip ECC and the GigaDevice doesn't seem to have it according to the datasheets, if I'm right?

I think it would be preferable to change the system's ECC to 4/512 using the BCH algorithm, as pointed out by @tpawlowski_ha. Do you know how?

The Toshiba NAND flash seems to use the same configuration, although it seems to have some extra on-chip ECC.
I don't know how to change to the BCH algorithm, though, and it works for now with the Hamming algorithm.
