Netgear R7800 exploration (IPQ8065, QCA9984)

Now that I have my serial console working, I've been able to install a LEDE build downloaded from the LEDE downloads tree: lede-17.01.0-rc2-r3131-42f3c1f-ipq806x-R7800-squashfs-factory.img

However, having installed this image, when I try to install a sysupgrade image that I've built from a tree updated about an hour ago, I get this (in the LuCI upgrade UI):

The uploaded image file does not contain a supported format. Make sure that you choose the generic image format for your platform.

If I try to flash the sysupgrade that's equivalent to what's already running, it accepts it as a valid image (I didn't actually proceed to flash it). Am I missing something? Should I not expect a build from the head of the tree to be usable? There's such a long stream of discussion on this thread that although I am sure the answer to my question is in there somewhere, I haven't been able to find it.

I'm wondering if some of this stuff might belong on the wiki... :slight_smile:

(BTW, I should say that the factory image from the same build also fails; what I get in this case is:

MODEL ID on image: D7800
Firmware Image MODEL ID do not match open source firmware ID
131072 bytes read: OK
HW ID on board: 29764958+0+128+512+4x4+4x4+cascade
HW ID on image: 29764958+0+128+512+4x4+4x4
Firmware Image HW ID do not match Board HW ID
Board HW ID mismatch,it is forbidden to be written to flash!!)

You have built the image for a different router...
D7800 is not R7800.
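
A quick pre-flash sanity check along these lines can catch this kind of mix-up before the bootloader does. This is only a minimal sketch: it assumes the board name appears as its own dash-separated field in the image file name (as in the R7800 file name quoted at the top of the thread), the helper name `image_matches_board` is hypothetical, and the D7800 file name in the example is made up for illustration.

```python
def image_matches_board(filename, board):
    """Return True if `board` is one of the dash-separated fields of `filename`.

    Assumption: image names follow the pattern seen in this thread, e.g.
    lede-<version>-<target>-<BOARD>-squashfs-factory.img
    """
    return board in filename.split("-")


r7800_image = "lede-17.01.0-rc2-r3131-42f3c1f-ipq806x-R7800-squashfs-factory.img"
# Hypothetical file name for a build made for the wrong router.
d7800_image = "lede-ipq806x-D7800-squashfs-factory.img"

print(image_matches_board(r7800_image, "R7800"))  # True
print(image_matches_board(d7800_image, "R7800"))  # False
```

The bootloader's own MODEL ID and HW ID checks remain the real safeguard; this only compares names.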

That would do it. :slight_smile:

@blogic @dissent1

With r3824 my R7800 fails to boot. The kernel starts but then seems to stall. Roughly every 60 seconds it prints a note about the stall:

[    2.504598]  TX Checksum insertion supported
[    2.506857]  Wake-Up On Lan supported
[    2.511457]  Enable RX Mitigation via HW Watchdog Timer
[   23.499306] INFO: rcu_sched detected stalls on CPUs/tasks:
[   23.503683]  1-...: (2 ticks this GP) idle=145/140000000000000/0 softirq=28/28 fqs=1050
[   23.503775]  (detected by 0, t=2102 jiffies, g=-282, c=-283, q=13)
[   23.513058] Task dump for CPU 1:
[   23.517917] swapper/0       R  running task        0     1      0 0x00000002
[   23.525581] [<c05c5c54>] (__schedule) from [<00000058>] (0x58)
[   86.549311] INFO: rcu_sched detected stalls on CPUs/tasks:
[   86.553679]  1-...: (2 ticks this GP) idle=145/140000000000000/0 softirq=28/28 fqs=4202
[   86.553773]  (detected by 0, t=8407 jiffies, g=-282, c=-283, q=13)

The router booted ok yesterday with r3799 with no other kernel-related changes than using 4.9 instead of 4.4. Otherwise vanilla build.

Bootlogs:

Looking at the changelog, this looks to me like the most suspicious commit:

"ipq806x: add ipq4019 support"
https://git.lede-project.org/?p=source.git;a=commit;h=c2d50bdeb34cfc359f28aeb2fe7648cc335bc623

(As it modifies config symbols, adds several patches and makes DTS changes.)

Other commits look more innocent to me.

Will investigate on my AP148 later today. Sorry for the inconvenience.

I am trying to find the regression range, and sadly it seems that there are multiple failures. I have so far made two minimal test builds and it looks like this:

  • with r3811-eb3ac8281b everything still works at first glance
  • with r3816-5c617aec05 wifi firmware is broken and wifi is unusable but the router boots and there is normal wired connectivity. Most likely the ath10k-firmware update has broken things :frowning:

Wifi breakage:

 Reboot (SNAPSHOT, r3816-5c617aec05)

[   12.808966] procd: - init -
[   12.961880] kmodloader: loading kernel modules from /etc/modules.d/*
[   12.964879] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   12.972196] Loading modules backported from Linux version wt-2017-01-31-0-ge882dff19e7f
[   12.972593] Backport generated by backports.git backports-20160324-13-g24da7d3c
[   12.999550] ath10k_pci 0000:01:00.0: enabling device (0140 -> 0142)
[   12.999637] ath10k_pci 0000:01:00.0: enabling bus mastering
[   13.000105] ath10k_pci 0000:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[   13.130838] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:01:00.0.bin failed with error -2
[   13.130888] ath10k_pci 0000:01:00.0: Falling back to user helper
[   13.183995] firmware ath10k!pre-cal-pci-0000:01:00.0.bin: firmware_loading_store: map pages failed
[   13.184338] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/cal-pci-0000:01:00.0.bin failed with error -2
[   13.191960] ath10k_pci 0000:01:00.0: Falling back to user helper
[   13.673417] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[   13.673451] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[   13.684393] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.4-00074 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 fa32e88e
[   15.728598] ath10k_pci 0000:01:00.0: unable to read from the device
[   15.728621] ath10k_pci 0000:01:00.0: could not execute otp for board id check: -110
[   15.733663] ath10k_pci 0000:01:00.0: failed to get board id from otp: -110
[   15.741474] ath10k_pci 0000:01:00.0: could not probe fw (-110)
[   15.749117] ath10k_pci 0001:01:00.0: enabling device (0140 -> 0142)
[   15.754119] ath10k_pci 0001:01:00.0: enabling bus mastering
[   15.754552] ath10k_pci 0001:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[   15.890686] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0001:01:00.0.bin failed with error -2
[   15.890727] ath10k_pci 0001:01:00.0: Falling back to user helper
[   15.941501] firmware ath10k!pre-cal-pci-0001:01:00.0.bin: firmware_loading_store: map pages failed
[   15.941716] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/cal-pci-0001:01:00.0.bin failed with error -2
[   15.949434] ath10k_pci 0001:01:00.0: Falling back to user helper
[   16.201093] ath10k_pci 0001:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[   16.201123] ath10k_pci 0001:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[   16.211923] ath10k_pci 0001:01:00.0: firmware ver 10.4-3.4-00074 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 fa32e88e
[   18.248608] ath10k_pci 0001:01:00.0: unable to read from the device
[   18.248634] ath10k_pci 0001:01:00.0: could not execute otp for board id check: -110
[   18.253677] ath10k_pci 0001:01:00.0: failed to get board id from otp: -110
[   18.261457] ath10k_pci 0001:01:00.0: could not probe fw (-110)
[   18.269981] ip_tables: (C) 2000-2006 Netfilter Core Team

EDIT:
I am currently recompiling those two builds and will re-test. I forgot that "git reset --hard" also reset the tree back to kernel 4.4, so those two test builds are invalid from the kernel 4.9 perspective. :frowning2:

EDIT2:
The result is the same with properly compiled kernel 4.9 versions. Wifi is broken at r3816 but the router boots.

I have narrowed down the non-boot to somewhere between r3818 and r3820.

  • r3818-1cb406d019 boots ok (but wifi is broken, already by the earlier commits)
  • r3820-d5b10bb560 does not boot. (Note: I tried both as it is and with the MTD config item change reverted, as I can't see how that item is related to the crypto stuff that the commit is about)

So it looks like the non-boot condition is caused by one of these:

  • d5b10bb560 ipq806x: make the dwc3 driver and required phy drivers built-in
  • 7dc5617173 ipq806x: enable QCE hardware crypto inside the kernel

I am currently compiling r3819

EDIT:
r3819-7dc5617173 boots ok (without wifi) when the MTD change was reverted, so it looks like the culprit is r3820:

  • d5b10bb560 ipq806x: make the dwc3 driver and required phy drivers built-in
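
The narrowing-down above is in effect a manual bisection over revisions. A minimal sketch of the idea, assuming a single good-to-bad transition and using the boot results reported in this thread as a lookup table (the function name `first_bad` is hypothetical; on a real tree the "is it good?" step means build, flash and watch the serial console):

```python
def first_bad(revisions, is_good):
    """Return the first revision for which is_good() is false.

    Assumes revisions are ordered oldest to newest, the first one is good,
    the last one is bad, and there is exactly one good->bad transition.
    """
    lo, hi = 0, len(revisions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(revisions[mid]):
            lo = mid
        else:
            hi = mid
    return revisions[hi]


# "Does the router boot?" as reported above (r3819 with the MTD change reverted).
revs = ["r3811", "r3816", "r3818", "r3819", "r3820"]
boot_ok = {"r3811": True, "r3816": True, "r3818": True, "r3819": True, "r3820": False}

print(first_bad(revs, boot_ok.get))  # r3820
```

`git bisect` automates the same search on the actual tree. It only works cleanly when one regression is tracked at a time, which is why the wifi breakage and the non-boot had to be separated first.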

Ps. Luckily the whole ipq806x seems to be broken in the phase1 buildbot, so the faulty builds will not reach the general public.

@hnyman

Sorry, I just saw that John merged the IPQ40XX. I guess I'll be here answering questions in the meantime.
The wifi breakage is caused by the removal of 936-ath10k_skip_otp_check.patch from the mac80211 package.

I talked to Michal Kazior about this back in November 2016, but he hasn't come up with a solution either.
The issue with the 936- is that it breaks IPQ40XX device identification and it gets detected as a pcie device
(rather than AHB) and this in turn causes the ath10k driver to abort since it can't find the matching boarddata.

Since I don't have an IPQ806x, I can't test what would be a good solution/WA for this. However, this needs to
be fixed in some way since the upcoming IPQ807X (from what I know) also integrates the WIFI-MAC into the
SoC (so it is going to be AHB/AXI too).

From IPQ40XX perspective:

  /* otp and board file not needed if calibration data is present */
  if (calret) {
          ret = ath10k_core_get_board_id_from_otp(ar);
          if (ret && ret != -EOPNOTSUPP) {

is the problematic part. Qualcomm switched from "cal" to "pre-cal". The pre-cal data contains the actual project/board id, that gets later returned back to the driver. Since the pre-cal is mandatory, the calret value is 0 for the IPQ40XX and the
board identification step is skipped. (And I think ath10k just assumes it's a PCIE chip).

So I wonder, what's wrong with the QCA988x/QCA99xx, since ath10k_core_get_board_id_from_otp() should work there as well?

Thanks for your response.
Good to know that the reason for the wifi breakage is identified.
Hopefully the wifi functionality can be first restored to the existing ipq806x device base and then a solution suitable for both the new and old devices can be looked for.

But like I said in the previous message, luckily the whole target is currently broken in buildbot, so the damage is not spreading very widely at the moment.

Looks like the commit preventing the boot is:

r3820 d5b10bb560 
ipq806x: make the dwc3 driver and required phy drivers built-in

When I revert that one commit and build r3845, the router boots ok (but without wifi).

@blogic @chunkeey

Thanks for the hint.
Restoring that mac80211 patch restores wifi functionality to R7800 with r3845.

To summarise findings so far:

I reverted the dwc3 commit and restored the 936-ath10k... patch and now R7800 boots again and has wifi.

@hnyman
The FritzBox 4040 boots with r3853-a8164bd171 just fine. There's no hang when it initializes the dwc3, and wifi works there as well. As for the hang on ipq806x: chances are it is caused by a reversed(?) loading order of the USB phy drivers, dwc3 and dwc3-of-simple. I've seen similar issues with IPQ40XX (it was fixed by making dwc3 in charge of the clocks).

Hey, have you tried passing the pre-cal data through the device tree? Though you might need a parser to insert it during boot.

@chunkeey
I looked more closely at your wifi-breaking commit https://git.lede-project.org/?p=source.git;a=commit;h=cc189c0b7fa015978b04bb663a75b1da726376b5 and noticed that it also decreases the wifi driver's buffer sizes (pci buffer, htt rx ring) for all ath10k users, not just for the new chip.

I am not a wifi driver expert, so I wonder what is the expected performance impact from this change?

@blogic
Thanks for reverting the dwc3 commit. Any chance that this wifi-breaking commit could also be reverted for now, until a better solution can be figured out? It seems pretty strange that a hack that makes the radios of the existing devices work is deleted because a new radio chip does not like it. Yeah, the existing hack may be theoretically wrong (as upstream seems to think), but it has made the radios of the current devices work.

Ps. I tried to find the origins of that wifi hack. It appeared on the openwrt-devel mailing list in April 2016 as part of introducing the C2600 router, but no explanation or reasoning was given for it.
https://lists.openwrt.org/pipermail/openwrt-devel/2016-April/040880.html

Sure, there's this utility:

which I think would allow you to patch the dtb "live". This might also be interesting for the RPI and APU-folks,
since they usually need it to change GPIO assignments, provide initialization stuff/enumeration of connected
spi and i2c devices, ...

However, it doesn't solve the problem at hand, since providing the cal- or pre-cal data isn't the problem.
The question is whether or not you can ignore the error from ath10k_core_get_board_id_from_otp()...

I don't know where exactly the IPQ806x fails. In theory the following patch for the mac80211-package:

--- a/drivers/net/wireless/ath/ath10k/core.c       2017-03-23 12:44:41.899549793 +0100
+++ b/drivers/net/wireless/ath/ath10k/core.c    2017-03-23 12:44:11.912856777 +0100
@@ -686,7 +686,7 @@ static int ath10k_core_get_board_id_from
    if (ret) {
            ath10k_err(ar, "could not execute otp for board id check: %d\n",
                       ret);
-               return ret;
+               return -EOPNOTSUPP;
    }

    board_id = MS(result, ATH10K_BMI_BOARD_ID_FROM_OTP);

might fix it. @hnyman ?

Edit:

Yes, the buffer-size reduction is necessary for the RT-AC58U. Without it, the device will panic during
operation because it runs out of memory. Note: The AHB code shares the same buffers with the
PCI implementation.

As for the impact on performance: This has to be measured. I don't have any IPQ806x. But there
was no difference for the IPQ40XX (Tested with FB4040, which has 256 MiB) and the QCA9880 in
my C7. What are your numbers?

[quote="chunkeey, post:276, topic:285"]
I don't know where exactly the IPQ806x fails. In theory the following patch for the mac80211-package:
...
might fix it. @hnyman ?
[/quote]
Thanks.
I tested it and at first glance it really does fix wifi.

Explanation of why it works:
The patch makes "ath10k_core_get_board_id_from_otp" return -EOPNOTSUPP on failure, because the function "ath10k_core_probe_fw" accepts only that error code as a "harmless" failure of the otp board id check call. When it sees that error, it uses the board file as a backup source.
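
To make the control flow concrete, here is a toy model of the two functions. It is a sketch only, not driver code: the Python names are stand-ins for the C functions mentioned above, the errno numbers are the Linux values (the "-110" in the kernel log is -ETIMEDOUT), and the `patched` flag is a hypothetical switch for the one-line change.

```python
# Linux errno values; "-110" in the kernel log is -ETIMEDOUT.
ETIMEDOUT = 110
EOPNOTSUPP = 95


def board_id_from_otp(otp_exec_result, patched):
    """Toy stand-in for ath10k_core_get_board_id_from_otp().

    otp_exec_result: 0 on success, or the negative errno from running OTP.
    patched: whether the one-line change from this thread is applied.
    """
    if otp_exec_result:
        # Unpatched: pass the real error up. Patched: report "not supported".
        return -EOPNOTSUPP if patched else otp_exec_result
    return 0


def probe_fw(calret, otp_ret):
    """Toy model of the check quoted earlier in the thread.

    Returns 0 if probing continues (board file is used), else the fatal errno.
    """
    if calret:  # calibration data not present, so the OTP check result matters
        if otp_ret and otp_ret != -EOPNOTSUPP:
            return otp_ret  # corresponds to "could not probe fw (-110)"
    return 0


print(probe_fw(1, board_id_from_otp(-ETIMEDOUT, patched=False)))  # -110
print(probe_fw(1, board_id_from_otp(-ETIMEDOUT, patched=True)))   # 0
```

The first call reproduces the failing case from the r3816 log; the second shows why the patched return value lets the probe carry on.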

I am currently re-compiling the whole firmware to verify the result (I first installed the fix manually with opkg).

It would be great if other users of ipq806x devices could verify the result (for C2600 etc.)

kernel log with this patch looks like this for one radio:

[   16.163318] ath10k_pci 0000:01:00.0: enabling device (0140 -> 0142)
[   16.163401] ath10k_pci 0000:01:00.0: enabling bus mastering
[   16.163850] ath10k_pci 0000:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[   16.337294] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:01:00.0.bin failed with error -2
[   16.337351] ath10k_pci 0000:01:00.0: Falling back to user helper
[   22.837360] firmware ath10k!pre-cal-pci-0000:01:00.0.bin: firmware_loading_store: map pages failed
[   23.212157] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[   23.212211] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[   23.226748] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.4-00074 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 fa32e88e
[   25.259266] ath10k_pci 0000:01:00.0: unable to read from the device
[   25.259288] ath10k_pci 0000:01:00.0: could not execute otp for board id check: -110
[   25.277326] ath10k_pci 0000:01:00.0: failed to fetch board data for bus=pci,vendor=168c,device=0046,subsystem-vendor=168c,subsystem-device=cafem...from ath10k/QCA9984/hw1.0/board-2.bin
[   25.277588] ath10k_pci 0000:01:00.0: board_file api 1 bmi_id N/A crc32 dd636801
[   26.800717] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal file max-sta 512 raw 0 hwcrypto 1
[   26.882020] ath: EEPROM regdomain: 0x0
[   26.882030] ath: EEPROM indicates default country code should be used
[   26.882036] ath: doing EEPROM country->regdmn map search
[   26.882046] ath: country maps to regdmn code: 0x3a
[   26.882055] ath: Country alpha2 being used: US
[   26.882062] ath: Regpair used: 0x3a

The solution also worked after the full build and flash, so I submitted it as a PR:
https://github.com/lede-project/source/pull/995

Ok, I think I see it now. There's a second part to this 936-ath10k_skip_otp_check.patch in the
ath10k-firmware package as well:

define Package/ath10k-firmware-qca9984/install
     $(INSTALL_DIR) $(1)/lib/firmware/ath10k/QCA9984/hw1.0
     ln -s \
            ../../cal-pci-0000:01:00.0.bin \
            $(1)/lib/firmware/ath10k/QCA9984/hw1.0/board.bin
     $(INSTALL_DATA) \
            $(DL_DIR)/$(QCA9984_BOARD_FILE_DL) \
            $(1)/lib/firmware/ath10k/QCA9984/hw1.0/board-2.bin

That is, /lib/firmware/ath10k/QCA9984/hw1.0/board.bin is a symbolic link pointing to /lib/firmware/ath10k/cal-pci-0000:01:00.0.bin.

So the cal-data of the 2.4GHz radio (I think so?) is being used as the board data for both.
This wouldn't work for some of the IPQ40XX. For example the RT-AC58U uses different
PA/FE chips (the 2.4GHz radio has two RTC6649E, the 5GHz radio two SKY85728-11). So each
radio will only work correctly if the right configuration is selected from the board-2.bin file.

With just one board.bin for both radios, how is this working for the QCA9984?

Note: I tried using just board.bin with IPQ40XX. While the device would boot, the WIFI
performance was abysmal.

Note2: board-2.bin houses a collection of different board data. There's a tool, ath10k-bdencoder,
in the qca-swiss-army-knife project that can create board-2.bin files and extract boards from them.
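
The selection step can be pictured roughly like this. It is a sketch under loud assumptions: the key format is copied from the "failed to fetch board data for bus=pci,..." line in the kernel log earlier in the thread, a plain dict stands in for the binary board-2.bin container, and both function names are hypothetical.

```python
def board_name(bus, vendor, device, sub_vendor, sub_device):
    """Build a lookup key in the format shown in the kernel log above."""
    return (
        f"bus={bus},vendor={vendor:04x},device={device:04x},"
        f"subsystem-vendor={sub_vendor:04x},subsystem-device={sub_device:04x}"
    )


def pick_board_data(board2_entries, name, board_bin):
    """Return (api, data): the board-2.bin entry if present, else board.bin."""
    if name in board2_entries:
        return 2, board2_entries[name]
    return 1, board_bin


key = board_name("pci", 0x168C, 0x0046, 0x168C, 0xCAFE)
# Hypothetical contents: no entry in the collection matches these radios,
# so the lookup falls back to board.bin (the symlinked cal data).
api, data = pick_board_data({}, key, b"cal-data-used-as-board.bin")

print(key)  # bus=pci,vendor=168c,device=0046,subsystem-vendor=168c,subsystem-device=cafe
print(api)  # 1
```

This matches what the R7800 log shows: the board-2.bin fetch fails and the driver then reports "board_file api 1". On a device like the RT-AC58U, each radio would need its own matching entry for the lookup to return the right configuration.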

Meanwhile I've found a correct and fully functional ipq806x tsens driver. I haven't tested it yet, but it should be ok.
https://github.com/dissent1/r7800/commit/b699ec91058dbe6fddd0504d9434eaab9042d51b
