Boot issues on the QHora-322 / Puzzle-M902

The router

I have a QNAP QHora-322 router. This is an iEi Puzzle M902 with a
different software load. (iEi owns QNAP, so it's an in-house reuse.)
The software on both the QHora-322 and the Puzzle M902 is derived
from OpenWRT, I believe. The QHora software has a lot of extra stuff,
however, to enable scalable management.

It's a pretty nice box: six 2.5 Gb/s ethernet ports and three 10 Gb/s
ports, plenty of flash & RAM, and an M.2 slot for an NVME SSD.

OpenWRT supports this hardware and has a page for it, with a link from
its entry in the Table of Hardware:
https://openwrt.org/toh/iei/puzzle-m902
But there are some problems with the information on that page, as I'll
relate below. Some of these stem from tiny differences in the boot
setup between the Puzzle M902 box and the QHora-322, which go all the
way down to the u-boot configuration. The OpenWRT web page gives
firmware-flashing details for the Puzzle M902; they're close to, but not
identical to, what you need to do when the hardware is running the
QHora-322 firmware.

I'm trying to understand the low-level details of the setup and have
run into some mysteries. I'm wondering if anyone knowledgeable can
educate me.

Puzzle M902 & QHora-322 boot are not the same

There's a mystery environment variable, $current_entry, that must
be set in u-boot. It is not mentioned in the OpenWRT web page, and
because of this, the firmware-flashing instructions will fail on a
QHora-322. If you hunt around this forum, you can find one or two
posts that show this setting in firmware-installation instructions,
but it's not explained, just presented.

Here's the root of the problem: u-boot is configured to load the
kernel and its accompanying device-tree blob (dtb) from a pair of
files that live on a filesystem in partition /dev/mmcblk0p1 of the 4GB
eMMC chip. Here are the names of these two files that u-boot loads:

System                   Kernel   DTB
iEi Puzzle M902 stock    Image    cn9132-puzzle-m902.dtb
OpenWRT 24.10.0          Image    cn9132-puzzle-m902.dtb
QHora-322 stock          Image    cn9132-db-A.dtb

This means that on a system running Puzzle or OpenWRT, u-boot does a
    ext4load mmc 0:1 0x6000000 cn9132-puzzle-m902.dtb
to load the DTB into RAM; but booting QHora-322 requires
    ext4load mmc 0:1 0x6000000 cn9132-db-A.dtb

This little piece of the boot process is contained in a u-boot
environment variable $bootcmd. This inconsistency is annoying, but
it's not that big a deal to fix. You can change $bootcmd to the
right thing from the serial-port console, or from a running linux with
the fw_setenv(1) program. Fine.

But it doesn't work. Here's what happens. I install the OpenWRT DTB
cn9132-puzzle-m902.dtb onto the flash, change u-boot's $bootcmd to
refer to it by its right name, do a u-boot saveenv command to save
the new $bootcmd back to flash, and reboot. The hardware comes back
up to u-boot and I interrupt to get a u-boot interactive prompt --
where I discover that $bootcmd has somehow been reset to the old
command string, the one that tries to load the DTB from cn9132-db-A.dtb!
Which won't work, of course: the kernel will come up without a DTB and
quietly fail.

The mystery variable $current_entry

As far as I can tell, the culprit is this other, mystery u-boot
environment variable: $current_entry. The QHora-322 install sets it
to 1. If you change it to 0 (and change $bootcmd to the correct boot
commands for OpenWRT, and then save both changes with a saveenv)...
then you win.
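
The combination described above can be written down as a short sequence of u-boot console commands. Here is a minimal Python sketch that generates them from an existing $bootcmd string. The function name is made up for illustration, and the sample input is only the DTB-loading step quoted earlier, not the box's full $bootcmd, which isn't reproduced in this post.

```python
OLD_DTB = "cn9132-db-A.dtb"          # DTB name used by the QHora-322 stock firmware
NEW_DTB = "cn9132-puzzle-m902.dtb"   # DTB name used by OpenWRT and the Puzzle M902


def uboot_commands(old_bootcmd: str) -> list[str]:
    """Return the u-boot console commands for the switch described above."""
    if OLD_DTB not in old_bootcmd:
        raise ValueError(f"{OLD_DTB} not found in bootcmd")
    if "'" in old_bootcmd:
        # Keep the quoting trivial; anything fancier should be typed by hand.
        raise ValueError("bootcmd contains a single quote; quote it by hand")
    new_bootcmd = old_bootcmd.replace(OLD_DTB, NEW_DTB)
    return [
        "setenv current_entry 0",
        f"setenv bootcmd '{new_bootcmd}'",
        "saveenv",
    ]


if __name__ == "__main__":
    # Hypothetical input: just the ext4load step, not the complete bootcmd.
    sample = "ext4load mmc 0:1 0x6000000 cn9132-db-A.dtb"
    for line in uboot_commands(sample):
        print(line)
```

The output is meant to be read and pasted at the u-boot prompt, not run blindly; on a real box the input should be the complete $bootcmd as printed by printenv.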

This little bit of $current_entry magic is not shown in the
firmware-flashing instructions on the OpenWRT web page for the Puzzle
M902. If you poke around on the OpenWRT forum, you will stumble across
firmware-flashing instructions for the QHora-322 that do specify
it... but they don't say why. Even more mysteriously, you can print
out the entire set of u-boot environment variables and look for a
command string that uses $current_entry -- or, indeed, any
reference to this variable at all. Nothing. It's simply not
referenced. I couldn't find any mention of it in the documentation on
the u-boot web site https://docs.u-boot.org/en/latest/. I looked
around with google: nada.

Some people clearly know about it, because some of the
firmware-updating instructions provided by posts in this forum
manipulate it correctly. But none of these posts explain how it works,
what it does. It's just part of the firmware-flashing voodoo you are
told to type in.

I suspect/guess that this variable has something to do with the fact
that the Puzzle M902's 4GB eMMC comes with a dual-boot partition
structure. There are two partitions for a kernel/dtb pair, and two
more partitions for the root filesystem. Here's the total partition
structure:

Device      Label         What's in the filesystem
mmcblk0p1   kernel_1      Ext4: Image, cn9132-puzzle-m902.dtb, boot.scr
mmcblk0p2   kernel_2      Ext4: Image, cn9132-db-A.dtb
mmcblk0p3   rootfs_1      Squashfs: read-only root
mmcblk0p4   rootfs_2      Ext4: usr.squashfs rootfs.squashfs
mmcblk0p5   sys_log       Ext4: /var/log QHora setup
mmcblk0p6   reserved      Ext4: empty
mmcblk0p7   rootfs_data   Ext4: r/w overlay root

(Oddly, on the QHora setup, the rootfs partition has an ext4 filesystem
that contains only two files, filesystem images for /usr and /, which
are mounted via loop devices. I don't know why they do this; they could
have just made them two subdirectories and then mounted them into the
correct places with bind mounts.)

Theories

Here are my guesses as to what's going on:

  • The QNAP people have a dual-boot structure for ease of upgrade,
    so that the upgrade can be done from a linux running out of
    kernel_1 and rootfs_1 writing into kernel_2 and rootfs_2. Once
    the new firmware is installed, the running linux can frob
    $current_entry so that the bootloader will use the newly
    written partitions on a reboot. On the next upgrade, the roles
    of the partitions are reversed.

    Or maybe it's to provide rollback ability in case an upgrade turns
    out to have been a mistake. Maybe?

  • The $current_entry env-var setting affects some piece of u-boot that
    I can't see that is supposed to select between the two boot choices.

  • And somehow there are two different places on the flash where u-boot
    stores two independent sets of env vars, and somehow the
    $current_entry setting affects which set u-boot uses? Which is odd
    and which I really don't understand.

But that's all conjecture. In short, I can see the effects of this machinery,
but I don't understand what the specific machinery is.

Can anyone clue me in?

Thanks.

EKH

Secure boot.
Setting $current_entry = 1 tells U-Boot to boot only signed firmware.

TP-Link calls this tp_boot_idx. Same thing: it's a code that tells U-Boot to boot only signed firmware.

Thank you very much. I got a laugh out of your "Guidance Counsellor" tag on your forum account. Very apt.

May I ask you some follow-up (mostly meta-) questions?

  1. How did you know this?
  2. Is there a good way I could have figured this out myself without consuming expert bandwidth on this forum?
  3. Is this documented anywhere?

(These three questions are all related.)

Also, do you know why rebooting with $current_entry=1 also resets the $bootcmd variable? Where is the value used to reset it stored?

While I am writing, just a note for other people who are scanning this thread in the future: people should note that hecatae's answer completely invalidates the theory I proposed in my previous post: $current_entry does not somehow indicate which of the two kernel/rootfs partition-pairs (1 = kernel_1 & 3 = rootfs_1, or 2 = kernel_2 & 4 = rootfs_2) that u-boot chooses for boot.

I'd like to understand, definitively, what the details of this mechanism are, because I expect that people are going to do a search in this forum in the future about this issue. And it probably also should be put into the TOH page for the Puzzle M902 / QHora-322.

EKH

Thank you. I agree with your paranoia and question-everything approach.

Have you dug into the u-boot on your device? If you head to docs.u-boot.org, there is a section on U-Boot Verified Boot.

Why would I know this? I have been battling signed firmware since the early days of Android.
Early Android devices booted using u-boot and signed firmware, which prevented modification. As part of the boot process, the signed firmware reset the u-boot environment variables to prevent tampering with the official firmware.

Since Openwrt has increased in popularity and it has been proven that chipset manufacturers use forked Openwrt sdks for the base of their firmware, hardware manufacturers have increasingly looked for ways to lock down their hardware so that it cannot run unsigned code.

If you request the u-boot source code from QNAP, you would be able to find the public key that corresponds to the private key used to sign the firmware.

Yes, you would have been able to find this out yourself, but I don't think you have been running Openwrt since Backfire 10.03.1, when you burnt out a Netgear DG834 modem on an LLU SMPF internet connection.

You're not wasting anyone's time here asking questions. If you have any further ones, please ask away; that's what this forum is for: to speak to forum members who have bricked devices and done everything else, so that you don't do the same thing.

Is this documented anywhere apart from the u-boot link above? Not really. It could be documented on the QNAP QHora-322 device page on the Openwrt wiki.

Have you dug into the u-boot on your device? If you head to docs.u-boot.org, there is a section on U-Boot Verified Boot.

I read some of the doc there; there's a lot, and I was somewhat
demand-driven. After reading your reply, I went back & read the
"verified boot" section, although it doesn't say much. Just that
this facility exists in u-boot. There's no detail -- nothing about
sensitivity to specific env variables or how that machinery works.
And I suppose these are bits that each manufacturer compiles into
its specific u-boot.

Why would I know this? I have been battling signed firmware since the
early days of Android.

I extend my commiserations.

Since Openwrt has increased in popularity and it has been proven that
chipset manufacturers use forked Openwrt sdks for the base of their
firmware, hardware manufacturers have increasingly looked for ways to
lock down their hardware so that it cannot run unsigned code.

This is weird, to me. What's in it, for them, to block out enthusiasts
flashing their own OpenWRT or other firmware? Doing so doesn't cost them
business. It helps them make more sales. They get paid by the box, not
the software license.

Also, it is easy to subvert the machinery: you just set the env var,
done. You can do it from the u-boot prompt or you can do it from a
running linux. So... (1) someone with physical access can corrupt your
box, and (2) someone who can get root access into your OS can corrupt
your box. It seems... performative. Security theatre. Am I missing
something? This is not my area of expertise.

Yes, you would have been able to find this out yourself, but I don't
think you have been running Openwrt since Backfire 10.03.1, when you
burnt out a Netgear DG834 modem on an LLU SMPF internet connection.

I have other scar tissue from misadventures down in the guts of digital
systems, but not with OpenWRT, which I have only been using for about a
month. (Wow, it's very nice.)

You're not wasting anyone's time here asking questions. If you have
any further ones, please ask away; that's what this forum is for: to
speak to forum members who have bricked devices and done everything
else, so that you don't do the same thing.

I just don't want to waste people's time with RTFM questions -- so they
have more cycles to spend on illuminating the actual darker corners of
the terrain. And part of immigrating into a new ecosystem is learning
where the answers can be found, so one can R the FM on one's own.

Is this documented anywhere apart from the u-boot link above? Not
really. It could be documented on the QNAP QHora-322 device page on
the Openwrt wiki.

Yeah, that definitely seems like a good thing to do. I suspect the
QHora-322 & -301W are going to get more attention from enthusiasts,
because they have pretty impressive specs. I've been chastised on the
forum for not following "best practices" and trying to do too much
on the router -- I stuck an M.2 SSD in my QHora-322 and use it for
backups and to serve my home's music out onto the LAN to Sonos boxes.
But it is very tempting to do this, because it cleans up the untidy pile of
hardware one has to have for digital infrastructure. Someone living
in a studio apartment, say, can get by with two boxes: a cable modem
and a QHora-301W (plus actual client computers). That's it. Simple &
minimal.

I will write up this stuff and figure out how to submit it for inclusion
on the actual TOH entry for the QHora-322 when I'm completely done
setting up my box. At the boot-loader level, I still have one remaining
unresolved issue: the proper setting of /etc/fw_env.config. The one
that comes with the QNAP firmware is wrong. Here is what the QNAP firmware uses:

  /dev/mtd1	0x1f0000	0x10000		0x10000		1

This causes fw_printenv to barf -- it prints out

Warning: Bad CRC, using default environment

and then prints out a minimal, default set of env-var assignments. Here is
the different /etc/fw_env.config that the iEi Puzzle-M902 firmware uses:

/dev/mtd3	0x0		0x10000		0x10000		1

This works. And here is a third entry that I constructed from looking
at the dmesg log's contents that refer to the SPI flash:

/dev/mtd3	0x3f0000	0x10000		0x10000		1

Here is what I pulled out of dmesg that seemed relevant:

[    0.808678] spi-nor spi2.0: mx25l3205d (4096 Kbytes)
[    0.813774] 4 fixed-partitions partitions found on MTD device spi2.0
[    0.820155] Creating 4 MTD partitions on "spi2.0":
[    0.824997] 0x000000000000-0x0000001f0000 : "U-Boot"
[    0.830192] 0x0000001f0000-0x000000200000 : "U-Boot ENV Factory"
[    0.836405] 0x000000200000-0x0000003f0000 : "Reserved"
[    0.841733] 0x0000003f0000-0x000000400000 : "U-Boot ENV"
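
The partition layout in that dmesg excerpt can be checked mechanically. Below is a minimal Python sketch that parses the four partition lines and reports where each partition starts on the chip and how big it is. It rests on two assumptions that are not confirmed anywhere in this thread: that the kernel numbers these partitions /dev/mtd0 through /dev/mtd3 in the order printed, and that the offset field in fw_env.config is counted from the start of the named device, not from the start of the SPI chip. The function names are made up for illustration.

```python
import re

# The four partition lines from the dmesg excerpt above.
DMESG = """\
[    0.824997] 0x000000000000-0x0000001f0000 : "U-Boot"
[    0.830192] 0x0000001f0000-0x000000200000 : "U-Boot ENV Factory"
[    0.836405] 0x000000200000-0x0000003f0000 : "Reserved"
[    0.841733] 0x0000003f0000-0x000000400000 : "U-Boot ENV"
"""

PART_RE = re.compile(r'(0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+) : "([^"]+)"')


def mtd_partitions(dmesg: str):
    """Return (device, name, start on chip, size) for each partition line."""
    parts = []
    for match in PART_RE.finditer(dmesg):
        start, end = int(match.group(1), 16), int(match.group(2), 16)
        # Assumption: partitions become /dev/mtd0, /dev/mtd1, ... in this order.
        parts.append((f"/dev/mtd{len(parts)}", match.group(3), start, end - start))
    return parts


def offset_fits(part_size: int, env_offset: int, env_size: int) -> bool:
    """True if an environment of env_size at env_offset lies inside the device."""
    return env_offset + env_size <= part_size


if __name__ == "__main__":
    for dev, name, start, size in mtd_partitions(DMESG):
        print(f"{dev}  {name:<20} chip offset {start:#x}  size {size:#x}")
```

Under those assumptions, /dev/mtd3 is only 0x10000 bytes long, so an offset of 0x0 covers exactly the "U-Boot ENV" partition, while 0x3f0000 is the partition's address on the whole chip and would lie past the end of the device.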

I have not had the courage to use my quote-working-unquote
fw_env.config to set env variables with fw_setenv; only print them
with fw_printenv. What gives me pause is that I have two different
settings for the "U-Boot ENV" partition at /dev/mtd3 that both
appear to work: the iEi Puzzle-M902 one (0x0) and the one I calculated
from dmesg (0x3f0000). I do not understand this. If I'm right, then the
iEi setting must be wrong, and vice versa. Right? So... something's not
right here.
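
One way to tell whether a given setting really points at a valid environment is to check the CRC by hand on a read-only dump of the flash. This is a minimal sketch, assuming the usual u-boot layout (a 32-bit CRC32, an optional one-byte flag when a redundant environment is configured, then NUL-separated name=value strings) and little-endian byte order on this ARM board. Whether QNAP's u-boot build follows that layout is an assumption; check_env and make_env are hypothetical helper names.

```python
import struct
import zlib


def check_env(blob: bytes, redundant: bool = False):
    """Return (crc_ok, variables) for a raw u-boot environment block."""
    header = 5 if redundant else 4      # CRC32, plus a flag byte if redundant
    (stored_crc,) = struct.unpack_from("<I", blob, 0)
    data = blob[header:]
    crc_ok = (zlib.crc32(data) & 0xFFFFFFFF) == stored_crc
    variables = {}
    for entry in data.split(b"\0"):
        if not entry:
            break                       # an empty string ends the list
        name, _, value = entry.partition(b"=")
        variables[name.decode("ascii", "replace")] = value.decode("ascii", "replace")
    return crc_ok, variables


def make_env(pairs: dict, size: int = 0x10000) -> bytes:
    """Build a synthetic, non-redundant environment block for testing."""
    data = b"".join(f"{k}={v}".encode() + b"\0" for k, v in pairs.items())
    data = data.ljust(size - 4, b"\0")
    return struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF) + data


if __name__ == "__main__":
    blob = make_env({"current_entry": "0"})
    print(check_env(blob))
```

Run against a copy of the 0x10000 bytes that a candidate fw_env.config line selects, a False CRC result would correspond to the "Bad CRC" warning from fw_printenv, and no write to flash is needed to find out.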

Also, I don't really understand the boilerplate comment at the top
of the standard fw_env.config files:

# Up to two entries are valid, in this case the redundant
# environment sector is assumed present.

I have no idea what this means. Up to two entries? Does that mean I can
have a third, fourth and so on entry that is invalid? What is an
"entry," anyway? Is it a field/column in the file (device offset, env
size, flash sector size)? Or does it refer to an entire record/row in
the file -- you get "up to two" valid rows? I'm lost.
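
For what it's worth, my reading of that comment is that an "entry" is a whole row: each non-comment line describes one copy of the environment, and a second line, if present, describes the redundant copy. The sketch below parses the file on that reading. It is an interpretation, not something confirmed in this thread; the function name and the dictionary keys are made up for illustration.

```python
def parse_fw_env_config(text: str):
    """Return one dict per entry (non-comment line) of an fw_env.config file."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"too few fields: {line!r}")
        entry = {
            "device": fields[0],
            "offset": int(fields[1], 0),       # base 0 accepts 0x... and decimal
            "env_size": int(fields[2], 0),
        }
        if len(fields) > 3:
            entry["sector_size"] = int(fields[3], 0)
        if len(fields) > 4:
            entry["sectors"] = int(fields[4], 0)
        entries.append(entry)
    if len(entries) > 2:
        raise ValueError("more than two entries; only two are meaningful")
    return entries


def has_redundant_env(entries) -> bool:
    """Two entries means a primary and a redundant environment copy."""
    return len(entries) == 2


if __name__ == "__main__":
    puzzle = "/dev/mtd3\t0x0\t\t0x10000\t\t0x10000\t\t1\n"
    print(parse_fw_env_config(puzzle))
```

On this reading, all three of the single-line files shown above describe a non-redundant environment, and the columns are, in order, device, offset, environment size, flash sector size and number of sectors.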

If you can clue me in on this, I can include it in the stuff I submit
for inclusion in the TOH page for the QHora-322. And I can also set
my u-boot env vars from linux, without having to use my serial-line
cable.

Thanks very much for the advice.

EKH