Build for Netgear R7800

[quote="burningjoe, post:192, topic:316"]
I have problems compiling your build lede-r4214-822ee54544-20170526. This is part of the log, which contains the error: https://pastebin.com/CUwfk5Cs

I've followed your instructions for rebuilding your build environment.

Is this an error on my side or is there something wrong in the line?
[/quote]

You need to delete that file: package/network/config/firewall/patches/100-test.patch

It contained a firewall fix from Jow from early May, and that code has now been mainlined into the firewall with the latest firewall3 version bump. So the patch is now unnecessary (it tries to add source code that already exists).
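
For anyone scripting the rebuild, a minimal Python sketch of that cleanup step could look like this. The function name `remove_stale_patch` and the idea of passing the source tree root as an argument are only an illustration; the patch path is the one named above.

```python
from pathlib import Path

# Path of the now-redundant patch, relative to the root of the source tree.
STALE_PATCH = Path("package/network/config/firewall/patches/100-test.patch")


def remove_stale_patch(tree_root):
    """Delete the stale patch if it exists; return True if a file was removed."""
    patch = Path(tree_root) / STALE_PATCH
    if patch.is_file():
        patch.unlink()
        return True
    return False


if __name__ == "__main__":
    # Assumption: the script is started from the root of the source tree.
    print("removed" if remove_stale_patch(".") else "nothing to remove")
```

Running it a second time is harmless, since it only reports that there is nothing left to remove.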

Specifically, the patch matches this commit in the firewall3 sources:
https://git.lede-project.org/?p=project/firewall3.git;a=commitdiff;h=e5dfc8253bebb7cfed06f81f34bbe1afdf285735;hp=f62595480555f4034841cfbdec5858645528ae7d
That commit was imported into LEDE on Saturday (just after the build you are using) by this firewall upgrade:
https://git.lede-project.org/?p=source.git;a=commitdiff;h=6e46f6edc4ee8ad127658c55616bb9d32a8f2d1a

The patch will get removed from my next build. So, you can also wait a few hours and then download new firmware creation patches that will apply cleanly.

EDIT:
new version without the patch: lede-r4235-61eb18d3f7-20170529

I'm noticing a bunch of problems with the 5GHz radio on r4235. Changing any frequency option kills the radio entirely, and only a reboot brings it back up. It also has trouble selecting the configured channel on reboot: I had forced channel 36, and it came up on 40. I tried forcing 48, and it came up on 48 fine. I then tried 52, and it failed to associate after reboot. Lastly, I tried auto, and it also failed to associate. All these tests were done at 80 MHz channel width in AC mode. I'm just going to leave it on 36 (40) in the meantime.

Is anybody else noticing this?

That is new. Probably visible in system log as:

Wed May 31 12:08:20 2017 daemon.notice netifd: radio0 (10144): WARNING (wireless_add_process): executable path /usr/sbin/wpad does not match process 1953 path ()
Wed May 31 12:08:20 2017 daemon.notice netifd: radio0 (10144): Device setup failed: HOSTAPD_START_FAILED
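
To check quickly whether a radio hit this failure, the system log can be filtered for the "Device setup failed" message. A minimal Python sketch, assuming log lines shaped like the two quoted above (the regular expression and the `failed_radios` helper are illustrative, not part of netifd):

```python
import re

# Assumption: the pattern is derived only from the two sample lines above.
FAILED_RE = re.compile(r"netifd: (radio\d+) \(\d+\): Device setup failed: (\w+)")


def failed_radios(log_text):
    """Return {radio_name: [failure_reason, ...]} for every setup failure."""
    failures = {}
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            radio, reason = match.groups()
            failures.setdefault(radio, []).append(reason)
    return failures


SAMPLE = """\
Wed May 31 12:08:20 2017 daemon.notice netifd: radio0 (10144): WARNING (wireless_add_process): executable path /usr/sbin/wpad does not match process 1953 path ()
Wed May 31 12:08:20 2017 daemon.notice netifd: radio0 (10144): Device setup failed: HOSTAPD_START_FAILED
"""

if __name__ == "__main__":
    print(failed_radios(SAMPLE))
```

The log text could be copied from the router's `logread` output; the sample here is just the two lines above.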

Channel selection depends heavily on the country/CRDA regulatory settings, so it is quite possible that some channels will not work due to DFS restrictions and the like.
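
As a rough illustration of the channel arithmetic involved, here is a small Python sketch. The 80 MHz block table follows the usual 5 GHz channel plan; the DFS range of 52-144 is an assumption that matches typical ETSI/FCC rules, while the real limits come from the regulatory database for the configured country. All function names are made up for this example.

```python
# Usual 80 MHz blocks in the 5 GHz band: (first, last, center channel).
VHT80_BLOCKS = [
    (36, 48, 42), (52, 64, 58), (100, 112, 106),
    (116, 128, 122), (132, 144, 138), (149, 161, 155),
]


def center_mhz(channel):
    """Center frequency in MHz of a 5 GHz channel number."""
    return 5000 + 5 * channel


def vht80_block(channel):
    """Return the (first, last, center) 80 MHz block containing the channel."""
    for first, last, center in VHT80_BLOCKS:
        if first <= channel <= last and (channel - first) % 4 == 0:
            return first, last, center
    return None


def likely_dfs(channel):
    # Assumption: channels 52-144 need DFS; check the local regulatory rules.
    return 52 <= channel <= 144


if __name__ == "__main__":
    for ch in (36, 40, 48, 52):
        print(ch, center_mhz(ch), vht80_block(ch), likely_dfs(ch))
```

Channels 36, 40 and 48 all fall into the same 80 MHz block, so at that width they differ only in which 20 MHz channel is primary, whereas 52 already sits in the range that usually requires DFS. That may be part of why 52 behaved differently in the tests above.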

I followed dissent1's advice to change to a fixed channel.

Changed the location to world:
For 5 GHz the AP was not visible on several channels, but I finally found one that worked.

For 2.4 GHz I chose channel 11 and the wifi crashed anyway.
Testing channel 6 now.

@chunkeey @hnyman QCA9984
I've done some extensive testing on the 2.4 GHz issue. I haven't encountered crashes like some users have, but I do see very low throughput in the router-to-client direction, tested on 2 different devices:
2.4 GHz:
router to client device - 25-50 Mbit/s
client device to router - 100+ Mbit/s
For reference, on 5 GHz I get 500/500 in both directions.
I've tested all firmware branches: 10.4-3.2, 10.4-3.3 and 10.4-3.4. I have also tested the CT firmware. The ath10k-CT driver constantly produces errors on both bands, so I'm not taking it into account.
I've also extracted board.bin from board-2.bin and tried symlinking the 2.4 GHz cal file to board.bin instead of the 5 GHz one.
The result is the same in all cases.
Also, the crap that gets printed in the identification line
Sun Jun 4 03:05:22 2017 kern.err kernel: [ 26.344681] ath10k_pci 0001:01:00.0: failed to fetch board data for bus=pci,vendor=168c,device=0046,subsystem-vendor=168c,subsystem-device=cafeOh???
Sun Jun 4 03:05:22 2017 kern.err kernel: [ 26.344681] m
Sun Jun 4 03:05:22 2017 kern.err kernel: [ 26.344681] ??????,? from ath10k/QCA9984/hw1.0/board-2.bin

happens on all firmwares and board files with current compat-wireless.

So, to conclude from the above: the stability and performance issue on the 2.4 GHz band is mostly related to the ath10k driver, not to the firmware.

Also, it seems that our device does have OTP to get calibration and identification from, because I've found a lot of mailing list posts with various logs from the Netgear R7800 on the ath10k driver that have OTP working.
Adding to that, there's an upstream patch to get BMI identification working for the pre-cal file on QCA99xx: https://patchwork.kernel.org/patch/9748097/
But it doesn't work in our case, because we do not have a pre-cal file (pure calibration data) but a cal file (pre-cal + board data). So maybe the offset is shifted, or it may be possible to extract the pre-cal data from it.
@nbd

Just a note: The ?????,x crap was fixed

This commit will probably be in the next compat-wireless refresh. So no need to worry about it.

I can't say much about the bad performance on the 2.4G though. :worried:

I was looking for those messages. Do you have a link to the "various logs from Netgear R7800 on ath10k driver that have OTP working"? I can't seem to find anything since most of the OEM bootlogs are not that verbose when it comes to wifi.

Note: The QCA9984 cards from compex/unex/... do have an eeprom.
That's why the BMI Identification is/was working for them, since this was always supported by ath10k in the kernel.

As for pre-cal, cal and OTP: Adrian Chadd explained in his post what's going on behind the scenes (https://www.mail-archive.com/ath10k@lists.infradead.org/msg06233.html):

[...]
Each board data is custom for the board layout / part selection - it's
a template that is used during calibration. The data in OTP is just a
diff against the board template (board.bin / board-2.bin.) [...]

If people aren't using unique BMI IDs (which is another question we
have for QCA) then it's possible you don't have enough information to
"know" which board data to use, so it has to be overridden by a custom
package. We do this at work for our own boards as well - they're
sufficiently different to a reference board that indeed we need to "know".

Now, the reason for pre-loading the calibration data is because it's
needed early in the boot process so the firmware/driver has some idea
of what the hardware is.

So, the driver steps should be:

  • If you have a pre-calibration file, you load that in before you kick
    the firmware too hard;
  • then you read the calibration data /back/ - then the normal firmware
    process will fetch the board ID;
  • then it loads the board-2.bin matching the board/BMI ID, then
  • starts things normally.

Now, I forget if the pre-cal data (and say, data in flash versus data
in OTP) is the whole thing or a diff against the board data. I'd have
to triple-check. The OTP data is certainly just a diff against the
board data.
[...]
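
To make the "board data matching the board/BMI ID" step a bit more concrete, here is a Python sketch of how the lookup key for board-2.bin might be built. The PCI-ID form is taken from the "failed to fetch board data" line quoted earlier in the thread; the BMI form (`bmi-chip-id` / `bmi-board-id`) is my understanding of the ath10k naming and should be treated as an assumption. The function itself is hypothetical and not driver code.

```python
def board_name(bus, bmi_chip_id=None, bmi_board_id=None, pci_ids=None):
    """Build the key used to look up board data inside board-2.bin."""
    if bmi_chip_id is not None and bmi_board_id is not None:
        # Assumed format when the BMI ID could be read from the device.
        return f"bus={bus},bmi-chip-id={bmi_chip_id},bmi-board-id={bmi_board_id}"
    # Fallback on PCI IDs, in the format seen in the kernel error message.
    vendor, device, sub_vendor, sub_device = pci_ids
    return (
        f"bus={bus},vendor={vendor:04x},device={device:04x},"
        f"subsystem-vendor={sub_vendor:04x},subsystem-device={sub_device:04x}"
    )


if __name__ == "__main__":
    print(board_name("pci", pci_ids=(0x168C, 0x0046, 0x168C, 0xCAFE)))
    print(board_name("pci", bmi_chip_id=0, bmi_board_id=1))
```

If the BMI ID cannot be read, both radios end up with the same PCI-based key, which fits the point above about not having enough information to "know" which board data to use.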

It seems I was using some truly magic keywords when searching for the OTP issue on QCA9984 and R7800, because I can't find those logs at first glance now.
Still, here are some brief findings:
http://lists.infradead.org/pipermail/lede-dev/2016-December/004987.html - Netgear R9000 with QCA9984, very likely in the same boat
https://www.spinics.net/lists/linux-wireless/msg160696.html but there are some extras

TP-link C2600 with qca9980 is in the same boat.
OEM bootlog, post 559 https://forum.openwrt.org/viewtopic.php?id=54973&p=23

Moving the discussion here

lede-r4362-5e4bb476c0-20170610

Build contains wifi firmware calibration fixes, as the ath10k wifi driver is patched to properly read the router-specific board data from flash. Details:

I have tried the changes for a few days myself and have not noticed any negative effects.

It is not clear from the discussions or the commit description what issue this change fixes. Any chance you can help and describe it in a few sentences?

It fixes the detection and utilisation of the proper device/radio-specific calibration files that are included in the art/caldata partition on flash.

Earlier, the same generic file was used for both the 2.4 and 5 GHz radios, because the ath10k wifi driver failed to read the correct board info from the router itself. That was a kludge to get the radios working.

The patches first increase the timeout used for reading the ID (the read seems to take slightly over 2 seconds, and 2 seconds was the earlier timeout limit). As the ID reading now works, the second change is the proper naming of the radio-specific calibration files read from flash, so that the 2.4 and 5 GHz radios use separate calibration data.
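
The per-radio naming can be seen in the ath10k kernel log posted later in this thread, where each radio requests a file named after its own PCI address. A small Python sketch of that naming scheme, assuming the `ath10k/pre-cal-<bus>-<device>.bin` pattern visible in that log (the helper name is hypothetical):

```python
def precal_file_name(bus, device):
    """Build the pre-calibration file name requested for one radio."""
    return f"ath10k/pre-cal-{bus}-{device}.bin"


# PCI addresses of the two radios as they appear in the kernel log.
RADIOS = ["0000:01:00.0", "0001:01:00.0"]

if __name__ == "__main__":
    for device in RADIOS:
        print(precal_file_name("pci", device))
```

Because the PCI address is part of the name, each radio gets its own file instead of sharing one generic blob.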

It is not yet certain what the net effect of the changes is, but the performance of one of the radios likely improves somewhat.

PS: note that the discussion started in this thread about two weeks ago, was moved to the exploration thread, and finally resulted in the PR. (The whole thing was first highlighted in March, when adding IPQ4xxx support killed all QCA9XXX radios because the original hack was temporarily removed and a new hack was implemented; see the discussion in "Netgear R7800 exploration (IPQ8065, QCA9984)" from that point onward.)

Thx for the explanation. I personally have not noticed any issues myself, but I only have a few clients mostly on 5GHz. Will this be applied to 17.01?

Possibly, but first get it to the LEDE master...

EDIT:
I added the wifi otp change to my 17.01 build in
lede1701-r3437-a6b5ddfd9b-20170611

@hnyman @dissent1 @chunkeey

Unfortunately with the build hnyman uploaded a few days ago this error is back:

ath10k_pci 0001:01:00.0: rx ring became corrupted: -5

@tetsuo55
Like yourself, I've been experiencing random ath10k crashes since I started using LEDE with the R7800. I have two of these installed in different buildings and it's usually the 2.4GHz SSID that disappears. Sometimes once a day, sometimes once a week. I've tried countless combinations of countries and channels without improvement. I've tried the last few hnyman builds.

Do you know of a particular build that you have used that does not exhibit these crashes?

Unfortunately no.

My best results are with @dissent1 build.

@dissent1 could you make a new build with the new patches and your buffer reverts?

In my case, I have run the @hnyman r3437 build for a couple of days; the log is below. It loads the firmware correctly.

root@LEDE:~# dmesg | grep ath10k
[ 24.242114] ath10k_pci 0000:01:00.0: enabling device (0140 -> 0142)
[ 24.242239] ath10k_pci 0000:01:00.0: enabling bus mastering
[ 24.242849] ath10k_pci 0000:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[ 24.374453] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:01:00.0.bin failed with error -2
[ 24.374499] ath10k_pci 0000:01:00.0: Falling back to user helper
[ 31.252484] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[ 31.252520] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[ 31.263451] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.4-00082 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 f301de65
[ 33.542766] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 0:1 crc32 751efba1
[ 39.392884] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 512 raw 0 hwcrypto 1
[ 39.477843] ath10k_pci 0001:01:00.0: enabling device (0140 -> 0142)
[ 39.477970] ath10k_pci 0001:01:00.0: enabling bus mastering
[ 39.478669] ath10k_pci 0001:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[ 39.619878] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0001:01:00.0.bin failed with error -2
[ 39.619909] ath10k_pci 0001:01:00.0: Falling back to user helper
[ 39.998219] ath10k_pci 0001:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[ 39.998251] ath10k_pci 0001:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[ 40.009097] ath10k_pci 0001:01:00.0: firmware ver 10.4-3.4-00082 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 f301de65
[ 42.284021] ath10k_pci 0001:01:00.0: board_file api 2 bmi_id 0:2 crc32 751efba1
[ 48.192570] ath10k_pci 0001:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 512 raw 0 hwcrypto 1
[71418.792044] ath10k_pci 0001:01:00.0: peer-unmap-event: unknown peer id 1
[71418.792107] ath10k_pci 0001:01:00.0: peer-unmap-event: unknown peer id 1
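
A log like this can be condensed per radio with a short script. The Python sketch below pulls out the firmware version, the BMI ID and the calibration mode, and measures the gap between the "firmware ver" and "board_file" lines. Reading that gap as the time spent fetching the board ID is my assumption; it comes out at roughly 2.28 s here, which would be consistent with the "slightly over 2 seconds" mentioned earlier. The regular expressions are based only on the lines shown above.

```python
import re

# Timestamp, PCI device and message of an ath10k_pci kernel log line.
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\] ath10k_pci (\S+): (.*)")


def summarize(dmesg_text):
    """Return {pci_device: {firmware, bmi_id, cal, id_read_s}} from dmesg text."""
    radios = {}
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        ts, dev, msg = float(match.group(1)), match.group(2), match.group(3)
        info = radios.setdefault(dev, {})
        if msg.startswith("firmware ver "):
            info["firmware"] = msg.split()[2]
            info["fw_ts"] = ts
        elif msg.startswith("board_file "):
            bmi = re.search(r"bmi_id (\S+)", msg)
            info["bmi_id"] = bmi.group(1) if bmi else None
            if "fw_ts" in info:
                # Assumption: this gap approximates the board ID read time.
                info["id_read_s"] = ts - info["fw_ts"]
        elif msg.startswith("htt-ver "):
            cal = re.search(r"\bcal (\S+)", msg)
            info["cal"] = cal.group(1) if cal else None
    return radios


SAMPLE = """\
[ 31.263451] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.4-00082 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 f301de65
[ 33.542766] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 0:1 crc32 751efba1
[ 39.392884] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 512 raw 0 hwcrypto 1
[ 40.009097] ath10k_pci 0001:01:00.0: firmware ver 10.4-3.4-00082 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 f301de65
[ 42.284021] ath10k_pci 0001:01:00.0: board_file api 2 bmi_id 0:2 crc32 751efba1
"""

if __name__ == "__main__":
    for dev, info in summarize(SAMPLE).items():
        print(dev, info)
```

In this log the two radios report different BMI IDs (0:1 and 0:2) and both use `pre-cal-file` as the calibration source.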

That sounds more like a bug in the ath10k driver than a purely R7800 related problem. I have not encountered that myself, but I have rather modest wireless usage.

It could be due to the ath10k buffer size reduction commit by @chunkeey that @dissent1 has tested reverting in his build, but intuitively I think that the "corruption" points more to an actual bug in ath10k than just buffer exhaustion.

Perhaps @hnyman could consider including the revert in his newer builds? Those buffer sizes were obviously set for a reason, and I'm not sure it's a good idea to decrease them.

There are not many updates concerning the R7800, so I'm not sure you need the bleeding-edge version if you don't experience any other issues, because you won't notice any changes.
