Netgear R7800 exploration (IPQ8065, QCA9984)

I have compiled the qca8k branch successfully, but when I configure the WAN for PPPoE it crashes. Has anyone managed to get traffic offload working? Currently the throughput is substantially affected under heavy load.

blogic's qca8k branch has a bug that crashes when PPPoE is used; he is working on a fix, but there is no ETA. Beyond that, offloading is not stable yet.

@dissent1 @chunkeey @blogic

I noticed that kernel log error messages seem to indicate that R7800 ath10k initialisation is now looking for firmware-6.bin instead of firmware-5.bin.

But the ath10k-firmware Makefile is still installing firmware-5.bin:
https://git.lede-project.org/?p=source.git;a=blob;f=package/firmware/ath10k-firmware/Makefile;hb=HEAD#l311

[   19.211571] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:01:00.0.bin failed with error -2
[   19.211608] ath10k_pci 0000:01:00.0: Falling back to user helper
[   31.985753] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-6.bin failed with error -2
[   31.985783] ath10k_pci 0000:01:00.0: Falling back to user helper
[   32.031964] firmware ath10k!QCA9984!hw1.0!firmware-6.bin: firmware_loading_store: map pages failed
[   32.315733] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[   32.315780] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[   32.329100] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.4-00082 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 f301de65
[   34.607539] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 0:1 crc32 751efba1
[   40.455033] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 512 raw 0 hwcrypto 1

https://github.com/torvalds/linux/commit/aad1fd7f7677d05013b5fe247a5a6e1464c69a0f#diff-53fd3aaa018dbabcdbd37eb543e1abae

TL;DR: ATH10K_FW_API_MAX was bumped from 5 to 6 because of the (new) QCA6174 hw3.0 firmware. Since all ath10k devices share the same firmware loading code, every one of them will now try to load firmware-6.bin first (and currently all chips other than the QCA6174 will fail).

But it shouldn't be that bad: as your log shows, the driver then carries on and loads the api 5 firmware.
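
To illustrate the fallback, here is a minimal Python sketch (not the driver's actual C code) that tries firmware API versions from the highest down to the lowest and takes the first file that exists. FW_API_MAX = 6 comes from the commit above; FW_API_MIN = 2 and both function names are assumptions made for the example.

```python
FW_API_MAX = 6  # bumped from 5 to 6 by the commit linked above
FW_API_MIN = 2  # assumed lower bound, for illustration only


def candidate_firmware_files(fw_dir):
    """Yield firmware file names from the newest API down to the oldest."""
    for api in range(FW_API_MAX, FW_API_MIN - 1, -1):
        yield f"{fw_dir}/firmware-{api}.bin"


def pick_firmware(fw_dir, installed):
    """Return the first candidate that is present in `installed`, or None."""
    for name in candidate_firmware_files(fw_dir):
        if name in installed:
            return name
    return None


# Only firmware-5.bin is installed, as in the Makefile mentioned above.
installed = {"ath10k/QCA9984/hw1.0/firmware-5.bin"}
print(pick_firmware("ath10k/QCA9984/hw1.0", installed))
```

With only firmware-5.bin installed, the firmware-6.bin lookup fails first (the "error -2" lines in the log) and the api 5 file is picked on the next attempt.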

But yeah, something seems horribly wrong with the QCA9984 on the IPQ806x platform. I mean, the IPQ40XX does manage to hit 550 Mbit/s throughput (see Roman's post on the ML). Not to mention that the QCA9980 and QCA9984 are much better than the QCA4019. After all, they support 2x2 at 160 MHz VHT or 4x4 at 80 MHz VHT channels (from what I know), so they should be much faster!

Actually, I'm starting to think that this is kernel related and that the wireless driver is not what's causing it. After some unrelated backports commits I've started having the issue on both bands
(if you remember, I previously had it on 2.4 GHz only).
Apart from that, the nbg6817 doesn't seem to have the issue despite having similar hardware.
My assumption is that something is wrong with the byte alignment, or maybe u-boot messes with the same RAM addresses the wireless driver uses?
There are already 2 MiB at the end of the RAM region that u-boot uses (I discovered this while investigating stability issues at the beginning of LEDE support). Maybe it also does something somewhere in the middle?

Update: weirdly enough, current HEAD seems to crash the device for some people within 10 seconds after the first client attaches over wifi, and that happens right after a factory reset. With settings saved it doesn't crash, but throughput is extremely low and accompanied by broken frames (the issue that is being discussed).
@blogic @chunkeey @jcadduono

Well, the IPQ806x isn't alone. The Marvell mamba(?) has crashes/problems with 4.9 too:

It could be that you (and others) are hitting a SoC or CPU erratum. You could enable ARM_ERRATA_798181
and ARM_ERRATA_773022 in the kernel and test whether it helps or not; at least it's something easy to test.
But it might also just be a waste of time.
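
Before rebuilding, it may be worth checking whether a given kernel .config already has these two options switched on. The following is only a sketch: the helper name and the sample excerpt are made up, and it relies on nothing more than the usual `CONFIG_FOO=y` / `# CONFIG_FOO is not set` syntax.

```python
def errata_status(config_text, symbols):
    """Map each symbol to True (=y), False (not set) or None (absent)."""
    status = {sym: None for sym in symbols}
    for line in config_text.splitlines():
        line = line.strip()
        for sym in symbols:
            key = f"CONFIG_{sym}"
            if line == f"{key}=y":
                status[sym] = True
            elif line == f"# {key} is not set":
                status[sym] = False
    return status


# Hypothetical .config excerpt, not taken from a real R7800 build.
sample = """\
CONFIG_ARM_ERRATA_798181=y
# CONFIG_ARM_ERRATA_773022 is not set
"""
print(errata_status(sample, ["ARM_ERRATA_798181", "ARM_ERRATA_773022"]))
```

Feeding it the text of the real kernel .config from the build tree shows at a glance which of the two still needs enabling.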

You could also try 4.14. I know that hauke has been busy with porting the LEDE patches to 4.14
https://git.lede-project.org/?p=lede/hauke/staging.git;a=shortlog;h=refs/heads/kernel-4.14

I too have the RT-AC58U running on 4.14. But I can't realistically port the IPQ806X parts without the hardware.

@blogic what's your comment? Will you look into this issue? Or could you make the ipq40xx its own target?
That way the ipq806x and ipq40xx could have separate kernels, and a lot of grief from conflicts of interest and potential merge conflicts could be avoided in the future.

Tx got twice as bad: it was 50 Mbit/s, now it's 25.

That's a result too. Did you check whether ustream-ssl still messes up in the same way, or did something change (for better or worse)? There are likely more errata to test. Usually the Kconfig description of each one contains the information you need to decide whether it could apply to your device or not.

Your best bet would be to run git bisect, since, as you said, the change happened recently.
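
For anyone who hasn't used it: git bisect is a binary search over the commit history between a known-good and a known-bad revision. The Python sketch below shows only the idea. The commit names and the `is_bad` check are placeholders; in reality each check means building and flashing an image and measuring throughput, and `git bisect start` / `git bisect good` / `git bisect bad` do the bookkeeping for you.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is bad as well.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad]


# Hypothetical history: placeholder names, regression introduced at "c5".
history = [f"c{i}" for i in range(9)]
print(first_bad(history, lambda c: int(c[1:]) >= 5))
```

The number of builds grows only with the logarithm of the commit count, so even a few hundred commits between a good and a bad snapshot need fewer than ten test images.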

If you want to do another "long shot", you could also play around with the custom board-2.bin again.

I've run through the ath10k commits and didn't notice any that would break 5 GHz in addition to the 2.4 GHz that was already broken in my case.
I think I'll try my luck with kernel 4.14 soon.

Update: forgot to mention - I've already played with board bins recently - gpl and not, without any effect. I've even echoed 0 into pre-cal to check what happens.

FYI, that Marvell issue with the 4.9 kernel (which has been around forever and nobody was ever able to find/fix) was only with the WRT1900ACv1 devices. All other Marvell devices seemed to work fine with the 4.9 kernel. Unfortunately I had a v1 device and needed to run a custom build based on the 4.4 kernel to avoid the issue.

Ultimately, that issue is what brought me to buy the R7800.

@hnyman

I have this error... is this normal?
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.278570] blk_update_request: I/O error, dev mtdblock0, sector 0
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.279108] blk_update_request: I/O error, dev mtdblock0, sector 8
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.284228] blk_update_request: I/O error, dev mtdblock0, sector 16
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.290371] blk_update_request: I/O error, dev mtdblock0, sector 24
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.296514] blk_update_request: I/O error, dev mtdblock0, sector 0
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.302214] Buffer I/O error on dev mtdblock0, logical block 0, async page read
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.406396] blk_update_request: I/O error, dev mtdblock0, sector 0
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.406421] Buffer I/O error on dev mtdblock0, logical block 0, async page read
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.412466] blk_update_request: I/O error, dev mtdblock1, sector 0
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.419285] blk_update_request: I/O error, dev mtdblock1, sector 8
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.425618] blk_update_request: I/O error, dev mtdblock1, sector 16
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.431774] blk_update_request: I/O error, dev mtdblock1, sector 24
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.437936] Buffer I/O error on dev mtdblock1, logical block 0, async page read
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.447137] Buffer I/O error on dev mtdblock1, logical block 0, async page read

I tried different builds and it keeps crashing when I connect over wifi...

Now, with the stock LEDE stable release, it looks like it doesn't, at least so far... any idea?

And also... why is there only 20/15 MB of space?? It has 128 MB of flash, so where is all the space?

OK, I found this... I think it should be placed in the first post so that a new R7800 user understands the small space... I bought this router for LEDE... not for that shitty Netgear firmware (which, from the source, looks like it's based on OpenWrt....)

@chunkeey
What do you think, could the following patches ruin the frame alignment for the qca9984 and lead to the issues under discussion?
https://github.com/lede-project/source/blob/master/package/kernel/mac80211/patches/307-mac80211-add-hdrlen-to-ieee80211_tx_data.patch
https://github.com/lede-project/source/blob/master/package/kernel/mac80211/patches/308-mac80211-add-NEED_ALIGNED4_SKBS-hw-flag.patch
These patches haven't made their way upstream for some reason. Besides, it seems that padding is already done within the hw. Or is it a different kind of padding?
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.9.65&id=9e19e13261423eeb4398177001daa874c2128aa4
Edit: the same issue seems to happen on the R7500v2 with the qca9980:
https://bugs.lede-project.org/index.php?do=details&task_id=1173

Both of these patches seem to be rotting in patchwork, marked as "changes requested".
https://patchwork.kernel.org/patch/8359111/
https://patchwork.kernel.org/patch/8359121/ <-- has the discussion.

I don't think that these cause the regression you are suffering from. In fact, these patches seem to be missing the ath10k part, and it's kinda weird that they introduce an unused HW flag, IEEE80211_HW_NEEDS_ALIGNED4_SKBS. There's no code that sets or checks the flag. From what I can tell, these patches do very little.

As for patches: you should look at the treasure trove Qualcomm is hoarding on their site:
https://source.codeaurora.org/quic/qsdk/oss/system/feeds/wlan-open/tree/mac80211/patches?h=coconut (Yeah, so much for "I find it frustrating OpenWRT/LEDE doesn't try to work with upstream on fixing these things right." :zipper_mouth_face: :man_facepalming: )

More precisely, the patches starting with a00-xxx.
The highlights include the smart antenna stuff, a MU-MIMO fix for QCA99x0, and this ce5 full fix:
https://source.codeaurora.org/quic/qsdk/oss/system/feeds/wlan-open/tree/mac80211/patches/a00-800-008-ath10k-fix-ce5-full-issue.patch?h=coconut

This memory-leak fix (looks a bit weird in this context, as the memory leak would need to happen when the device is down / in reset?!)
https://source.codeaurora.org/quic/qsdk/oss/system/feeds/wlan-open/tree/mac80211/patches/a00-900-0013-ath10k-fix-rx-ring-memory-leak.patch?h=coconut

And maybe more.

However, you and others report that this problem started with (recent?) 4.9, and that 4.4 with the same compat-wireless/backports doesn't show the issue, so I don't think this will help much at all. Again, the issue(s) might be hiding in plain sight, but without hardware I can't really debug this myself. It could be lurking in the SoC (errata, bad clocks), or be an issue with PCIe, ... At most I can speak from my experience with the IPQ4019 (which is a totally different platform) and the more recent experience of porting it to 4.14.

If you want to take another shot in the dark: can you test whether iperf3 performance over a localhost loopback stayed the same between 4.4 and 4.9, or whether it degraded as well?
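
A loopback run takes the radio and the switch out of the picture, so a drop between the two kernels would point at the CPU/kernel side. Here is a rough sketch of how such a run could be scripted and compared. It assumes Python and iperf3 are both installed on the box (neither is there by default), and the function names and the sample report are made up for the example.

```python
import json
import subprocess
import time


def received_mbits(report):
    """Extract receiver-side throughput in Mbit/s from an iperf3 JSON report."""
    return report["end"]["sum_received"]["bits_per_second"] / 1e6


def loopback_test(seconds=10):
    """Start a one-off iperf3 server, then run a client against 127.0.0.1."""
    server = subprocess.Popen(["iperf3", "-s", "-1"],
                              stdout=subprocess.DEVNULL)
    try:
        time.sleep(1)  # crude wait for the server to start listening
        out = subprocess.run(
            ["iperf3", "-c", "127.0.0.1", "-t", str(seconds), "-J"],
            check=True, capture_output=True, text=True).stdout
    finally:
        server.terminate()
        server.wait()
    return received_mbits(json.loads(out))


# Hand-made report fragment (not a real measurement) to show the parsing.
sample_report = {"end": {"sum_received": {"bits_per_second": 25_000_000}}}
print(received_mbits(sample_report))
```

Calling `loopback_test()` once on a 4.4 build and once on a 4.9 build gives two numbers that can be compared directly.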

Woah... how could I miss this

It would be good to at least add those patches to LEDE, and ideally upstream them.

Well, it's even worse. Please take the time to read: https://www.spinics.net/lists/linux-wireless/msg167883.html
This is/was all well known by the ath10k maintainer for some time now. At this point, I have to ask, what can you even do?

Furthermore, it's not like QCA9984 is the only issue. The customer QCA6174 boards have massive issues with performance too. Just look at the threads over at github:
https://github.com/kvalo/ath10k-firmware/pull/2
https://github.com/kvalo/ath10k-firmware/pull/3
(Bonus points: some of the posts link to Fedora and Ubuntu bug reports)

Thanks for the links.
To me it really makes no sense not to upstream anything.
That is classic corporate style: let's just fork everything and maintain our own fork, which gets harder and harder to maintain as everything else moves forward and we have to backport.

So for the R7800, which patch do we need to apply to solve the problem with the 4.9 kernel?

Anyone have this error?

Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.251798] ath10k_pci 0000:01:00.0: failed to flush transmit queue (skip 0 ar-state 1): 0
Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.296253] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.296281] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.309466] ath10k_pci 0000:01:00.0: firmware crashed! (guid c95b3960-6710-4367-ad39-3706a2029428)
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.309501] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.317563] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.329756] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.4-00082 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 f301de65
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.336109] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 0:1 crc32 751efba1
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.349056] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 512 raw 0 hwcrypto 1
Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.368374] ath10k_pci 0000:01:00.0: failed to get memcpy hi address for firmware address 4: -16
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.368399] ath10k_pci 0000:01:00.0: failed to read firmware dump area: -16
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.376263] ath10k_pci 0000:01:00.0: Copy Engine register dump:
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.382956] ath10k_pci 0000:01:00.0: [00]: 0x0004a000 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.388808] ath10k_pci 0000:01:00.0: [01]: 0x0004a400 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.397885] ath10k_pci 0000:01:00.0: [02]: 0x0004a800 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.406757] ath10k_pci 0000:01:00.0: [03]: 0x0004ac00 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.415598] ath10k_pci 0000:01:00.0: [04]: 0x0004b000 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.424470] ath10k_pci 0000:01:00.0: [05]: 0x0004b400 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.433311] ath10k_pci 0000:01:00.0: [06]: 0x0004b800 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.442152] ath10k_pci 0000:01:00.0: [07]: 0x0004bc00 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.450960] ath10k_pci 0000:01:00.0: [08]: 0x0004c000 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.459872] ath10k_pci 0000:01:00.0: [09]: 0x0004c400 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.468738] ath10k_pci 0000:01:00.0: [10]: 0x0004c800 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.477587] ath10k_pci 0000:01:00.0: [11]: 0x0004cc00 3735928559 3735928559 3735928559 3735928559

My wifi is very, very unstable... can someone help me?

I think the wifi is losing packets along the way...
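
One detail in the dump above: every Copy Engine value is 3735928559, which is 0xdeadbeef in hex. That is a common placeholder pattern, so it probably means the registers could not be read after the crash rather than that they hold real data (this is an interpretation, not something the log itself states). A small sketch of the conversion, with a helper name chosen just for this example:

```python
def as_hex(value):
    """Format a value as a zero-padded 32-bit hex number."""
    return f"0x{value & 0xFFFFFFFF:08x}"


# One row from the Copy Engine dump above: index, base address, four values.
row = "[00]: 0x0004a000 3735928559 3735928559 3735928559 3735928559"
index, base, *values = row.split()
print(base, [as_hex(int(v)) for v in values])
```

This fits the "failed to read firmware dump area: -16" line just before the dump, which already says the driver could not get at the crash data.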

The only people who know the internals of the firmware and can fix ath10k-firmware issues are working for QCA in one way or another. From past experience, I can safely say they will not visit this forum... ever... not even if you provide them with a direct link...

You should post this to ath10k@lists.infradead.org. However, you'll have to put some effort into the mail. Simply posting a dump will get you nowhere. You have to describe your setup a bit and include information about the WIFI clients you use (and the ones in your vicinity). You have to make it crystal clear that this is a regression and that it was working before in the same setup but with a different fw. Furthermore, you'll have to hope the issue gains some attention (like serious posts from other affected users, and there are always some!). The bigger the commotion, the better! It would be best if you coordinated this issue with others.

"Still not convinced?" :thinking: Look at this "QCA9984 bmi identification failure" case study:
https://www.spinics.net/lists/linux-wireless/msg160469.html :rofl: