Configuration: MediaTek MT7621AT booting from NAND.
Does anybody know why, in my router, ECC correction does not work for the U-Boot stage1 (L2) bootloader, while the same configuration in, for example, the Asus AC85P successfully corrects up to 4-bit errors in the bootloader?
Does it mean that something is missing in my MT7621 hardware configuration (GPIO pull, external oscillator, etc.), or that Asus has a "licensed" chip with ECC correction enabled from initial boot, while other chips can enable NFI ECC correction only later, in U-Boot after the relocation phase, when the NAND driver is initialized?
I ask because in my router any 1-bit error in the executable code of stage1 bricks the router.
Thanks a lot.
I suppose you are dealing with parallel NAND connected to MT7621AT (not SPI-NAND for which there is only very limited support), right?
According to the datasheet, boot mode and NAND page layout are set up by bootstrap pins (RTS2, RTS3, TXD1, GPIO0), and the supported layouts are 2k+64, 2k+128, 4k+128 and 4k+224.
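For reference, the spare area available per ECC sector follows directly from these four layouts. This is a minimal sketch, assuming ECC is computed per 512-byte sector (as mentioned later in this thread); the struct and function names are illustrative and not taken from any MediaTek source.

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SIZE 512u /* assumption: one ECC codeword per 512-byte sector */

struct nand_layout {
    unsigned page_size;  /* main data bytes per page */
    unsigned spare_size; /* OOB (spare) bytes per page */
};

/* The four page layouts listed above. */
static const struct nand_layout layouts[] = {
    { 2048,  64 },
    { 2048, 128 },
    { 4096, 128 },
    { 4096, 224 },
};

/* Number of 512-byte sectors in one page. */
static unsigned sectors_per_page(const struct nand_layout *l)
{
    assert(l->page_size % SECTOR_SIZE == 0);
    return l->page_size / SECTOR_SIZE;
}

/* Spare bytes that belong to each sector (free OOB plus ECC parity). */
static unsigned spare_per_sector(const struct nand_layout *l)
{
    return l->spare_size / sectors_per_page(l);
}
```

For the 2k+64 layout discussed here, `spare_per_sector(&layouts[0])` gives 16 bytes per sector; how those bytes are split between free OOB and parity is not covered by this sketch.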
There may also be more secrets in the not fully documented NAND boot header stored at the beginning of the flash:
/* AP BROM Header for NAND */
union nand_boot_header {
	struct {
		char name[12];
		char version[4];
		char id[8];
		uint16_t ioif;            /* I/O interface */
		uint16_t pagesize;        /* NAND page size */
		uint16_t addrcycles;      /* Address cycles */
		uint16_t oobsize;         /* NAND page spare size */
		uint16_t pages_of_block;  /* Pages of one block */
		uint16_t numblocks;       /* Total blocks of NAND chip */
		uint16_t writesize_shift;
		uint16_t erasesize_shift;
		uint8_t dummy[60];
		uint8_t ecc_parity[28];   /* ECC parity of this header */
	};
	uint8_t data[0x80];
};
/* NAND header for SPI-NAND with 2KB page + 64B spare */
static const union nand_boot_header snand_hdr_2k_64_data = {
	.data = {
		0x42, 0x4F, 0x4F, 0x54, 0x4C, 0x4F, 0x41, 0x44,
		0x45, 0x52, 0x21, 0x00, 0x56, 0x30, 0x30, 0x36,
		0x4E, 0x46, 0x49, 0x49, 0x4E, 0x46, 0x4F, 0x00,
		0x00, 0x00, 0x00, 0x08, 0x03, 0x00, 0x40, 0x00,
		0x40, 0x00, 0x00, 0x08, 0x10, 0x00, 0x16, 0x00,
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0x7B, 0xC4, 0x17, 0x9D,
		0xCA, 0x42, 0x90, 0xD0, 0x98, 0xD0, 0xE0, 0xF7,
		0xDB, 0xCD, 0x16, 0xF6, 0x03, 0x73, 0xD2, 0xB8,
		0x93, 0xB2, 0x56, 0x5A, 0x84, 0x6E, 0x00, 0x00
	}
};
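To compare such headers between two dumps, it may help to decode the fields rather than read the raw bytes. The following is only a sketch: it assumes the 16-bit fields are stored little-endian (which is consistent with the sample above, where `0x00, 0x08` at the `pagesize` position decodes to 2048), and `struct nand_hdr_info`, `rd_le16` and `decode_nand_header` are hypothetical helper names, not part of U-Boot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 16-bit little-endian value (endianness is an assumption). */
static uint16_t rd_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decoded subset of the 0x80-byte header shown above. */
struct nand_hdr_info {
    char name[13];    /* 12 bytes + terminator */
    char version[5];  /* 4 bytes + terminator */
    char id[9];       /* 8 bytes + terminator */
    uint16_t pagesize;
    uint16_t oobsize;
    uint16_t pages_of_block;
    uint16_t numblocks;
};

/* Offsets follow the field order of union nand_boot_header. */
static void decode_nand_header(const uint8_t *hdr, size_t len,
                               struct nand_hdr_info *out)
{
    assert(len >= 0x80);
    memcpy(out->name, hdr, 12);         out->name[12] = '\0';
    memcpy(out->version, hdr + 12, 4);  out->version[4] = '\0';
    memcpy(out->id, hdr + 16, 8);       out->id[8] = '\0';
    out->pagesize       = rd_le16(hdr + 26); /* after ioif at offset 24 */
    out->oobsize        = rd_le16(hdr + 30); /* after addrcycles at 28 */
    out->pages_of_block = rd_le16(hdr + 32);
    out->numblocks      = rd_le16(hdr + 34);
}
```

Calling `decode_nand_header(snand_hdr_2k_64_data.data, 0x80, &info)` on the sample above should report a 2048-byte page with 64 bytes of spare, matching the comment on that header.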
So comparing the actual header of the two devices and understanding each bit can help as well.
Also, I suppose you are aware that MediaTek SoCs use a vendor-specific BCH scheme rather than simple 1-bit ECC. Some details about that have previously been discussed here:
Yes, it's NAND 2k+64. The bootstrap pins are correct.
I do not think the problem is in the bootloader image header (see below).
The bootloader contains the stage1 (L2) boot code followed by U-Boot, as noted in the MTK SDK. Stage1 is a known binary bundle without sources (from the known links for MT7621 U-Boot development).
The header is present as well.
So the bootloader looks like: HDR, Stage1, offset, U-Boot.
The BCH tool is also used to create the binary with ECC before flashing it to the chip with a programmer.
It turned out that some routers occasionally brick without warning. After desoldering the NAND and comparing the dump against the original bootloader binary, I found that the device had a 1-bit error (not even 4-bit) in block 0, either in the stage1 area or in U-Boot.
So I did more tests: I manually added a 1-bit error to the bootloader image (with ECC already added) before flashing it into the NAND (raw copy) with a programmer and soldering the NAND back onto the device. Summary: the NAND NFI ECC controller does not correct errors in the stage1 and U-Boot data. At all. Correction only starts working once U-Boot is fully loaded and its NAND driver has been initialized. As a result, an error in the stage1 data gives us a brick.
Fortunately, I have a personal Asus AC85P with the same configuration: MT7621AT + NAND 2k+64.
I did some tests with it using the same approach: desolder the NAND, read the dump, add a 1-bit error, solder it back. I was very surprised to see that the Asus successfully corrects all 1-bit errors and even 4-bit errors. So in the Asus, the ECC controller works immediately after power-on, when the MT7621 accesses the NAND to read stage1.
I then checked whether the bootloader content matters. I flashed the dump from my bricked device (with my header in the bootloader) to the Asus NAND, and the Asus loaded it successfully (correcting the 1-bit errors in the dump).
So the cause is definitely in hardware: either Asus has "cool" MT7621 chips, or it is something else.
The last test would be to desolder the MT7621 from the Asus and solder it onto my dev router, but I do not want to do that (BGA soldering is not my strong side).
Any ideas?
Ok, so maybe there is another secret bootstrap pin, or there is indeed some difference between hardware revisions of the MT7621 SoC.
Maybe @hackpascal can share some details?
No. The AP BROM Header is only used by ARM chips.
Actually there's nothing secret.
The MT7621 bootloader header definition can be found at:
And the BootROM flow:
No efuse, no undocumented bootstrap pins.
So normally, if the bootrom can't correct even 1 bit error, I believe the ECC parity data in OOB doesn't match the actual NAND configuration.
@evgvi I suggest you build a fresh u-boot image and use BCH tool to add correct OOB data, and then try writing it to NAND and retry your bitflip tests.
P.S.
I have a modified EVB that makes it easy to test different NAND chips. I tested various bitflips years ago, and what you described never happened on my test board.
@hackpascal @daniel thanks a lot for the detailed response. I will compare mkimage.c in our U-Boot sources against the one provided above.
I am a little confused by the fact that, even if my U-Boot header or the ECC parity inside it is wrong, the Asus MT7621 was able to run it successfully.
But I am glad to know that there are no hidden bootstrap pins or secrets.
I will check and respond with the results.
@hackpascal @daniel I see a strange thing in our U-Boot config.in file. See below
The choice ON_BOARD_NAND_BOOTSTRAP is hardcoded and the choice ON_BOARD_NAND_HEADER is disabled.
As a result, the NAND parameters in the header that we discussed before are not filled in. So, from the point of view of header NAND integrity, I have a wrong header. Maybe this is the reason why ECC is not working in stage1.
Do you know anything about the choice ON_BOARD_NAND_BOOTSTRAP? Is it known to be restricted, obsolete or undocumented? Does it mean that when the NAND parameters are taken from the hardware bootstrap pins, we do not need them in the header?
It looks like providing the NAND information via bootstrap pins does not enable ECC correction, but your way (providing it via header parameters) works fine, as you noted above.
P.S. It is still unclear how the Asus corrected my bricked U-Boot dump (with the wrong header).
This is optional, as I showed in the flow graph.
The upstream u-boot does not use this parameter field either. Without it, bootrom will still enable ECC according to the bootstrap pins.
I still suggest checking the OOB data. You can send me the NAND dump binary (with OOB data; the u-boot part is enough) for which ECC correction fails. I can check whether the ECC parity data is correct.
Will be glad. But how to attach a file? Only images are allowed
@hackpascal Which OOB do you mean? The OOB after each 512-byte sector in a page? Or OOB somewhere in the header?
The flow to reproduce the problem with my U-Boot is the following:
- build the U-Boot binary
- run the BCH tool to add ECC to the binary, in the OOB after each sector (512 bytes) in a page
- add a 1-bit error somewhere. For example, in stage1 there is a string "Change MPLL source from XTAL" => "Change MPLM source ..."
- if we now run BCH d 2048 64, we will see a 1-bit error in the sector
- flash the image to NAND
- boot from it
In my case I see MPLM in the console, so the bit was not corrected by the controller.
In the case of the Asus, I see the corrected "MPLL".
Note: if such a 1-bit error is not in a string but in a vital code instruction, I will have a brick.
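The error-injection step in the flow above can be sketched in a few lines of C. This is an illustration only: it works on an in-memory copy of the image, `find_bytes`, `count_bit_diffs` and `inject_mpll_error` are hypothetical helper names, and it assumes the string "MPLL" does not straddle a sector boundary in the ECC-augmented image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Find the first occurrence of needle in buf; return its offset or -1. */
static long find_bytes(const uint8_t *buf, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || n > len)
        return -1;
    for (size_t i = 0; i + n <= len; i++)
        if (memcmp(buf + i, needle, n) == 0)
            return (long)i;
    return -1;
}

/* Count the bits that differ between two equally sized buffers. */
static unsigned count_bit_diffs(const uint8_t *a, const uint8_t *b, size_t len)
{
    unsigned bits = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t x = (uint8_t)(a[i] ^ b[i]);
        while (x) {
            bits += x & 1u;
            x >>= 1;
        }
    }
    return bits;
}

/*
 * Turn the last 'L' of the first "MPLL" into 'M' (0x4C -> 0x4D),
 * i.e. flip exactly one bit. Returns the modified offset, or -1.
 */
static long inject_mpll_error(uint8_t *img, size_t len)
{
    assert(img != NULL);
    long off = find_bytes(img, len, "MPLL");
    if (off < 0)
        return -1;
    img[off + 3] ^= 0x01;
    return off + 3;
}
```

Running `count_bit_diffs` on the original and modified images before flashing confirms that exactly one bit was changed, so any failure to boot cannot be blamed on an error count beyond what the ECC should handle.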
So I did one more test. I added the NAND info to the header as shown below. If my bootstrap pins somehow do not work, the controller should in any case be re-initialized using the parameters in the header, according to the boot algorithm above.
I added ECC using the BCH tool, changed the string to MPLM, and flashed it to the chip.
It booted, and I have MPLM in the log. It wasn't corrected.
@hackpascal, any ideas?
I could also provide the U-Boot binary so you can check it on your device. But where should I upload it?
Thanks in advance
this one. I need full data in NAND to reproduce on my board.
you can send it via email
hackpascal AT gmail.com
Have sent it
I realized that you are not simulating a regular NAND bitflip (from 1 to 0). Changing L to M is actually changing a bit from 0 to 1 (0x4C -> 0x4D or 0b01001100 -> 0b01001101).
Currently I cannot try it this way due to a limitation of the NAND controller. But I tried changing L to H (0x4C -> 0x48 or 0b01001100 -> 0b01001000), and the ECC correction just worked as expected.
Update:
Tried to change L to M, and it also worked:
ECC parity code remains unchanged in both cases.
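The difference between the two test changes in this post (L to M versus L to H) is only the direction of the flipped bit. A small sketch to classify such a change, with hypothetical names `classify_flip` and `flipped_bit_index`:

```c
#include <assert.h>
#include <stdint.h>

enum flip_kind { FLIP_NONE, FLIP_0_TO_1, FLIP_1_TO_0, FLIP_MULTI };

/* Classify how a byte changed: no change, one bit set, one bit cleared,
 * or more than one bit changed. */
static enum flip_kind classify_flip(uint8_t before, uint8_t after)
{
    uint8_t diff = (uint8_t)(before ^ after);
    if (diff == 0)
        return FLIP_NONE;
    if (diff & (diff - 1))          /* more than one bit differs */
        return FLIP_MULTI;
    return (after & diff) ? FLIP_0_TO_1 : FLIP_1_TO_0;
}

/* Index (0 = least significant bit) of the single differing bit. */
static unsigned flipped_bit_index(uint8_t before, uint8_t after)
{
    uint8_t diff = (uint8_t)(before ^ after);
    unsigned idx = 0;
    assert(diff != 0 && (diff & (diff - 1)) == 0);
    while (!(diff & 1u)) {
        diff >>= 1;
        idx++;
    }
    return idx;
}
```

With these helpers, `classify_flip('L', 'M')` is a 0-to-1 change of bit 0, while `classify_flip('L', 'H')` is a 1-to-0 change of bit 2, which is the kind of flip an already-programmed NAND cell can undergo without an erase.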
Hi @hackpascal
Did you do this with your own image or with the image I sent you by email?
I think my problem is somewhere in the hardware.
- the MT7621 chip is ok
- the bootstrap pins were checked twice and are ok
- the NAND chip is ok
- clocks for the MT7621? The NAND controller clock depends on them too, so maybe XTAL MODE needs checking? Update: checked. It is mode 0b011 (40 MHz self-oscillation mode).
Do you have the schematics of the board you used to test the MT7621 ECC bootloader?
I just used the image you sent to me.
No. But it does use a 40 MHz xtal.
Edit:
The NAND controller is ok, since both bootrom and u-boot use it and ECC correction works in u-boot stage.
Perhaps you've got a mt7621 chip with some unknown defects?
I currently have no ideas. Everything has been checked; the only remaining suspect is the chip itself. Do you also have an MT7621AT?
You mean an unsoldered MT7621 chip? No.
I mean that I have an MT7621AT marked 1944-AMTH. What is your MT7621 model?