Here's a story about my recently acquired used R7800. The device had been
unable to boot any firmware images, but through some serious hacking I figured
out why and fixed it.
I learned a number of things along the way and am leaving my story and tips
here as a resource for anyone who may benefit from them.
Table of Contents
- Background
- Story
- debug_ll kernel
- Kernel Load Address (kla) Modification
Background
I'm the same Bazz that got the Netgear WNR2000v4 Board Support Package (BSP)
working properly in OpenWrt following the work of franz.flasch and
bukington.
https://patchwork.ozlabs.org/project/openwrt/list/?submitter=66219&state=*
I'm happy to say that, 5 years later, that code is still alive in the OpenWrt
kernel codebase, and it has gone on to inspire work from others.
https://forum.archive.openwrt.org/viewtopic.php?id=65576
Story
Upon receiving my device upgrade, the R7800, I was surprised to see that it
had been shipped in only its bare item packaging, with no additional padding or
boxes to help protect it. Apart from the promotional packaging, that was it.
Well, with the power on, the unit would exhibit only a steady amber power
light (after all the LEDs flashed once). No SSID. No ping over ethernet at the
default gateway address. It was time to try a factory reset.
So while the unit was powered on, I tried holding down the factory reset button
on the back for a while, expecting a reset, but nothing happened. Then I
learned about holding that button down before powering up, continuing to hold
while the power light blinks amber, and letting go after it blinks white. This is
TFTP recovery mode.
I tried uploading stock firmware images from Netgear's site
https://www.netgear.com/support/product/R7800.aspx#download as well as
firmware from hnyman and Kong, but they all had the same problem on my
device: when the device rebooted, it once again produced a steady amber
light (kernel not booting), and occasionally the device would suddenly
soft-reboot itself (boot loop). But the fact that I could enter TFTP
recovery mode was certainly hopeful.
So it was time to get serial hooked up. The picture that is shown on OpenWrt's ToH
(https://openwrt.org/toh/netgear/r7800) of how to open the device is
kinda intimidating, but that's OK. I've been down this road before, and
it's easier than it looks. You just gotta get started, then it'll work
itself out. Honestly, the most frustrating part of opening the damned thing is
the very last step, where you need to somehow "unhinge" the back. Unhinge is
really not the right word for it. It's more like gently struggling with it until
somehow you do the right thing. I'm sure the modders who come after us
would like better guidance on that step. I can't explain it lol.
So happy that the pin headers were already on the board. That is love.
Ah time to find my Arduino Uno, which I find so convenient for 3.3v serial
despite it being 5v. Short the reset pin to GND and wham! Good to go.
screen /dev/ttyACM0 115200
That did the trick. Yay, and it's my ol' friend U-Boot. Love this
guy, he always comes to the rescue. <3
I dove into the help and learned all I could from my env. I've posted my U-Boot env
along with various other U-Boot printouts near the end of this post.
But here is the bootup sequence at that point in time.
U-Boot 2012.07 [local,local] (Sep 03 2015 - 17:33:28)
U-boot 2012.07 dni1 V0.4 for DNI HW ID: 29764958 NOR flash 0MB; NAND flash 128MB; RAM 512MB; 1st Radio 4x4; 2nd Radio 4x4; Cascade
smem ram ptable found: ver: 0 len: 5
DRAM: 491 MiB
NAND: SF: Unsupported manufacturer 00
ipq_spi: SPI Flash not found (bus/cs/speed/mode) = (0/0/48000000/0)
128 MiB
MMC:
*** Warning - bad CRC, using default environment
PCI0 Link Intialized
PCI1 Link Intialized
In: serial
Out: serial
Err: serial
131072 bytes read: OK
MMC Device 0 not found
cdp: get part failed for 0:HLOS
Net: MAC1 addr:3c:37:xx:xx:xx:xx
athrs17_reg_init: complete
athrs17_vlan_config ...done
S17c init done
MAC2 addr:3c:37:xx:xx:xx:xx
eth0, eth1
Hit any key to stop autoboot: 0
Client starts...[Listening] for ADVERTISE...TTT
Retry count exceeded; boot the image as usual
nmrp server is stopped or failed !
Loading from device 0: nand0 (offset 0x1480000)
** check kernel image **
Verifying Checksum ... OK
** check rootfs image **
Verifying Checksum ... OK
MMC Device 0 not found
Loading from nand0, offset 0x1480000
Image Name: Linux-3.4.103
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 2157608 Bytes = 2.1 MiB
Load Address: 41508000
Entry Point: 41508000
Automatic boot of image at addr 0x44000000 ...
Image Name: Linux-3.4.103
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 2157608 Bytes = 2.1 MiB
Load Address: 41508000
Entry Point: 41508000
Verifying Checksum ... OK
Loading Kernel Image ... OK
OK
mtdparts variable not set, see 'help mtdparts'
no partitions defined
defaults:
mtdids : nand0=msm_nand
mtdparts: none
info: "mtdparts" not set
Using machid 0x136c from environment
Starting kernel ...
It was nice to see more of what was happening. I was so glad for the boot
logs that others have posted. If it weren't for you, I wouldn't have known
that some of those scary error messages are normal and have nothing to do
with the problem I was trying to solve!
I won't let the story become tiresomely long, but I gotta tell
you about this one moment. After a certain tedious set of experiments, I
hit a point where the tty, at boot time, showed what seemed like the usual kernel
hang, but after a few seconds, the power light turned white and was
blinking faster than usual!! Unfortunately, a few seconds later it would soft reboot.
But I remember it was a very exciting moment for me in this journey.
So anyways, the serial connection was giving me more info. I could now see I was stuck
at an inconclusive "Starting kernel ..." message.
I uncovered the problem by compiling an OpenWrt RAM image with
debug_ll enabled. I provide plenty of info on making one of these later in
this post.
At this point, it's important to note that the stock
Netgear U-Boot env bootargs have a console (tty) setting that will prevent output on
OpenWrt firmware. In order to make serial output visible, the bootargs
"console" setting must be cleared.
Original bootargs:
bootargs=console=ttyHSL1,115200n8
To clear bootargs:
setenv bootargs
To make that permanent across reboots:
saveenv
Once I had tty output, I got a big hint from Linux on what was
failing:
Starting kernel ...
Uncompressing Linux...
XZ-compressed data is corrupt
-- System halted
This confirmed that the system was able to get past the bootloader stage
and into the beginning of kernel execution. That gave me a lot of hope
that I could restore full functionality to the device. The kernel
decompression (which is done "in-place" / on top of itself) was failing
its CRC32 check. Therefore, I concluded that a portion of RAM at the
decompression location was faulty.
Here are the default RAM locations that kernels get loaded to:
- OpenWrt: 0x42208000
- Netgear: 0x41508000
From these locations, the kernel is decompressed. From the images I built,
I noticed the kernel alone (vmlinux) was about 8MB uncompressed. So
technically speaking, the problem likely occurs in the ~8MB window above each
of those starting addresses. But, just to steer clear of any further headaches, I
decided not to interrogate the area and to simply avoid using it.
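To make those windows concrete, here is a small shell sketch of the arithmetic. It treats the 8MB figure as exact, which it is not (it is only an approximation of the vmlinux size), and the helper names window_end and mib_between are made up for illustration. The 0x43208000 value is the elevated load address from the patch at the end of this post.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Print the end address of a SIZE_MIB-mebibyte window starting at BASE.
window_end() {
    printf '0x%x\n' $(( $1 + $2 * 1024 * 1024 ))
}

# Print the distance in MiB between two addresses (LOW, HIGH).
mib_between() {
    echo $(( ($2 - $1) / 1024 / 1024 ))
}

# Load addresses from the post; 8 is the approximate vmlinux size in MiB.
echo "Netgear window: 0x41508000 .. $(window_end 0x41508000 8)"
echo "OpenWrt window: 0x42208000 .. $(window_end 0x42208000 8)"
echo "0x43208000 is $(mib_between 0x42208000 0x43208000) MiB above the OpenWrt default"
```

The last line shows that the elevated address starts well past the end of the default OpenWrt window, which is the whole point of moving it.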
Getting back to the story: of course I had no guarantee that higher RAM
was not also damaged, but I knew the next step was to try to
run a kernel from a higher location. If I could succeed in doing that, I
would be home free.
So, I customized my kernel to boot from an elevated address to bypass the
problematic RAM area. It's been running well for 5 days now. (success!)
This wasn't so easily done, though. I underwent other trials and
tribulations, but that's enough story time. From here on out, I just want
to leave helpful stuff.
To make everything automatic, I had to modify my bootcmd U-Boot
environment variable. I ran the following from U-Boot:
setenv bootcmd 'sleep 2; nmrp; if loadn_dniimg 0 0x1480000 0x55000000 && chk_dniimg 0x55000000; then bootm 0x55000000; else fw_recovery; fi'
Notice that I'm using bootm. I had worked so hard to get everything working right
without using "bootipq2" that I was scared to even try it, lol. I don't think
I'm missing anything from it, though.
From here on out, I'd like to share various information that I have
accumulated from this journey.
debug_ll kernel
Compiling a kernel with debug_ll enabled is useful when you can't debug
the system directly (e.g. via JTAG) and need a way to diagnose an early boot
issue that is perhaps kernel related. debug_ll lets you work out
how far the kernel has executed, and provides more verbose
debug messages. Build one for the R7800 as follows.
Here are the OpenWrt menuconfig items:
Global Build Settings ->
Kernel Build Options ->
Enable Support for printk
Compile the kernel with early printk (NEW)
Here is a diffconfig made with scripts/diffconfig.sh. I also specified
not to suppress preinit and init stderr messages.
CONFIG_TARGET_ipq806x=y
CONFIG_TARGET_ipq806x_generic=y
CONFIG_TARGET_ipq806x_generic_DEVICE_netgear_r7800=y
CONFIG_DEVEL=y
CONFIG_IMAGEOPT=y
CONFIG_INITOPT=y
CONFIG_KERNEL_DEBUG_LL=y
CONFIG_KERNEL_DEBUG_LL_UART_NONE=y
CONFIG_KERNEL_EARLY_PRINTK=y
CONFIG_PREINITOPT=y
# CONFIG_TARGET_INIT_SUPPRESS_STDERR is not set
# CONFIG_TARGET_PREINIT_SUPPRESS_STDERR is not set
You may incorporate these settings into your .config file as guided in
https://openwrt.org/docs/guide-developer/build-system/use-buildsystem#using_diff_file
cat diffconfig >> .config # append changes to bottom of .config
make defconfig # apply changes
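As far as I know, make defconfig quietly drops any symbol whose dependencies are not met, so it can be worth confirming that the debug options survived. The following is only a sketch: check_config is a made-up helper name, and it simply greps .config for "SYMBOL=y" lines.

```shell
#!/bin/sh
# Hypothetical helper: report which symbols are set to =y in a config file.
# Usage: check_config FILE SYMBOL...
# Returns non-zero if any symbol is missing.
check_config() {
    file=$1
    shift
    missing=0
    for sym in "$@"; do
        if grep -q "^${sym}=y\$" "$file"; then
            echo "ok $sym"
        else
            echo "missing $sym"
            missing=1
        fi
    done
    return "$missing"
}

# Run from the top of the OpenWrt tree after `make defconfig`.
if [ -f .config ]; then
    check_config .config \
        CONFIG_KERNEL_DEBUG_LL \
        CONFIG_KERNEL_EARLY_PRINTK ||
        echo "some debug symbols did not survive make defconfig"
fi
```

If either symbol shows up as missing, the diffconfig was probably not appended, or a prerequisite option from it got lost.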
Now let's talk about the build stage. The debug_ll setting will cause
OpenWrt to require some info from you during the kernel config stage. You
have a couple of ways to handle this.
One way is to build with make V=s. Doing a regular make will not
provide the results we are after. I can't quite recall what
happens: either the build will fail, or the default debug UART settings
will get selected, and we don't want those. We need the custom R7800
values. With make V=s, we will be able to do the user interaction when
the kernel config stage occurs.
The other way is doing a kernel config, separate from the higher level
OpenWrt config. The benefit of doing this is that you won't have to input the
same values every time in the middle of building a kernel. So if you are
doing a lot of testing, I would recommend this.
First I'll detail the user-interaction stage, as this will answer a lot about
what to do for both methods.
Eventually the build will stop and ask you some questions. Here
are the prompts along with the answers needed for the R7800.
Some of these will get filled in automatically.
Kernel low-level debugging functions (read help!) (DEBUG_LL) [Y/n/?] y
Kernel low-level debugging port
> 1. Kernel low-level debugging messages via QCOM UARTDM
(DEBUG_QCOM_UARTDM) (NEW)
2. Kernel low-level debugging via EmbeddedICE DCC channel
(DEBUG_ICEDCC)
3. Kernel low-level debug output via semihosting I/O
(DEBUG_SEMIHOSTING)
4. Kernel low-level debugging via 8250 UART (DEBUG_LL_UART_8250)
5. Kernel low-level debugging via ARM Ltd PL01x Primecell UART
(DEBUG_LL_UART_PL01X)
choice[1-5]: 1
Physical base address of debug UART (DEBUG_UART_PHYS) [0xf991e000]
(NEW) 0x16340000
Virtual base address of debug UART (DEBUG_UART_VIRT) [0xfa71e000]
(NEW) [doesn't matter; you can hit enter here]
So you select 1, the QCOM UARTDM, followed by specifying physical base
address 0x16340000.
You might be wondering where 0x16340000 comes from. It
can be found in the R7800 dts file,
target/linux/ipq806x/files-4.14/arch/arm/boot/dts/qcom-ipq8065-r7800.dts:207
serial@16340000 {
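If you want to pull that address out of a dts file without reading through it, something like the following sketch may help. uart_bases is a made-up helper name, and the sed pattern only assumes that the UART node is named "serial@<hex address>", as it is in the line quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print the unit address of every serial@... node,
# formatted as a hex constant for the DEBUG_UART_PHYS prompt.
# Reads the files given as arguments, or stdin if there are none.
uart_bases() {
    sed -n 's/.*serial@\([0-9A-Fa-f]\{1,\}\).*/0x\1/p' "$@"
}

# Demo on the node line quoted in this post; prints 0x16340000.
echo 'serial@16340000 {' | uart_bases
```

Against the real tree you would run it as uart_bases target/linux/ipq806x/files-4.14/arch/arm/boot/dts/qcom-ipq8065-r7800.dts. If a dts lists more than one serial node, you still have to pick the one wired to the console header yourself.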
Now, the virtual address is irrelevant here since our
platform does not call iotable_init() (if it did, we would take the
virtual address from the table passed to that call). I learned
this from:
https://community.arm.com/developer/ip-products/processors/f/classic-processors-forum/8962/how-to-specify-virtual-address-for-pl011-uart-in-linux-kernel
https://lists.openwrt.org/pipermail/openwrt-devel/2016-August/002173.html
Now that you fully understand these settings, I will show how to
make them permanent.
make kernel_menuconfig CONFIG_TARGET=subtarget
Kernel Hacking ->
[*] Kernel low-level debugging functions (read help!)
Once you enable this, it'll open up the menus to apply the rest of the
settings that have already been discussed.
Kernel Load Address (kla) Modification
For ipq806x, the kernel load address (kla) must be kept in sync between two
different files. This is due to a certain OpenWrt Linux patch,
target/linux/ipq806x/patches-4.14/0060-HACK-arch-arm-force-ZRELADDR-on-arch-qcom.patch.
Actually, the write-up in that patch file is a good one. To put its comments
in my own words: although the R7800 supports Device Tree (DT), the patch
forcibly sets zreladdr so that older devices using ATAGS are also supported.
As a result, the kernel obeys the patch's setting even on DT-capable devices
and ignores the reserved memory settings in their DT.
This zreladdr setting affects where the kernel is decompressed to (the xz
kernel sorta decompresses on top of itself).
Here is my patch that you can learn from.
diff --git a/target/linux/ipq806x/image/Makefile b/target/linux/ipq806x/image/Makefile
index e1eb090de3..c0504135da 100644
--- a/target/linux/ipq806x/image/Makefile
+++ b/target/linux/ipq806x/image/Makefile
@@ -15,7 +15,7 @@ define Device/Default
KERNEL_DEPENDS = $$(wildcard $(DTS_DIR)/$$(DEVICE_DTS).dts)
KERNEL_INITRAMFS_PREFIX := $$(IMG_PREFIX)-$(1)-initramfs
KERNEL_PREFIX := $$(IMAGE_PREFIX)
- KERNEL_LOADADDR = 0x42208000
+ KERNEL_LOADADDR = 0x43208000
SUPPORTED_DEVICES := $(subst _,$(comma),$(1))
IMAGE/sysupgrade.bin = sysupgrade-tar | append-metadata
IMAGE/sysupgrade.bin/squashfs :=
diff --git a/target/linux/ipq806x/patches-4.14/0060-HACK-arch-arm-force-ZRELADDR-on-arch-qcom.patch b/target/linux/ipq806x/patches-4.14/0060-HACK-arch-arm-force-ZRELADDR-on-arch-qcom.patch
index f810f6ac46..a503a1c898 100644
--- a/target/linux/ipq806x/patches-4.14/0060-HACK-arch-arm-force-ZRELADDR-on-arch-qcom.patch
+++ b/target/linux/ipq806x/patches-4.14/0060-HACK-arch-arm-force-ZRELADDR-on-arch-qcom.patch
@@ -59,4 +59,4 @@ Signed-off-by: Mathieu Olivari <mathieu@codeaurora.org>
--- /dev/null
+++ b/arch/arm/mach-qcom/Makefile.boot
@@ -0,0 +1 @@
-+zreladdr-y+= 0x42208000
++zreladdr-y+= 0x43208000
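Since the two values have to match, a quick check before building can save a wasted compile. This is a sketch only: the function names are made up, it takes the first KERNEL_LOADADDR it finds in the image Makefile (assumed to be the one in Device/Default), and it expects the "+zreladdr-y+= 0x..." line format shown in the patch above.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Extract the first KERNEL_LOADADDR value from an image Makefile.
makefile_loadaddr() {
    sed -n 's/^[[:space:]]*KERNEL_LOADADDR[[:space:]]*:\{0,1\}=[[:space:]]*\(0x[0-9A-Fa-f]*\).*/\1/p' "$1" | head -n 1
}

# Extract the zreladdr-y value that the HACK patch adds to Makefile.boot.
patch_zreladdr() {
    sed -n 's/^+zreladdr-y[[:space:]]*+=[[:space:]]*\(0x[0-9A-Fa-f]*\).*/\1/p' "$1" | head -n 1
}

# Print both values; succeed only when they are present and identical.
kla_in_sync() {
    a=$(makefile_loadaddr "$1")
    b=$(patch_zreladdr "$2")
    echo "KERNEL_LOADADDR=$a zreladdr-y=$b"
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Run from the top of the OpenWrt tree.
mk=target/linux/ipq806x/image/Makefile
pt=target/linux/ipq806x/patches-4.14/0060-HACK-arch-arm-force-ZRELADDR-on-arch-qcom.patch
if [ -f "$mk" ] && [ -f "$pt" ]; then
    kla_in_sync "$mk" "$pt" || echo "kernel load addresses are out of sync"
fi
```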
SO YES!! I HAVE DONE IT!!!