Netgear R7800 exploration (IPQ8065, QCA9984)

I've run through the ath10k commits and didn't notice any that would break 5 GHz, on top of the already broken 2.4 GHz in my case.
I think I'll try my luck with k4.14 soon.

Update: forgot to mention, I've already played with board bins recently, GPL and non-GPL, without any effect. I've even echoed 0 into pre-cal to check what happens.

FYI, that Marvell issue with the 4.9 kernel (which has been around forever and nobody was ever able to find/fix) was only with the WRT1900ACv1 devices. All other Marvell devices seemed to work fine with the 4.9 kernel. Unfortunately I had a v1 device and needed to run a custom build based on the 4.4 kernel to avoid the issue.

Ultimately, that issue is what brought me to buy the R7800.

@hnyman

I get these errors... is this normal?
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.278570] blk_update_request: I/O error, dev mtdblock0, sector 0
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.279108] blk_update_request: I/O error, dev mtdblock0, sector 8
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.284228] blk_update_request: I/O error, dev mtdblock0, sector 16
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.290371] blk_update_request: I/O error, dev mtdblock0, sector 24
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.296514] blk_update_request: I/O error, dev mtdblock0, sector 0
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.302214] Buffer I/O error on dev mtdblock0, logical block 0, async page read
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.406396] blk_update_request: I/O error, dev mtdblock0, sector 0
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.406421] Buffer I/O error on dev mtdblock0, logical block 0, async page read
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.412466] blk_update_request: I/O error, dev mtdblock1, sector 0
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.419285] blk_update_request: I/O error, dev mtdblock1, sector 8
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.425618] blk_update_request: I/O error, dev mtdblock1, sector 16
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.431774] blk_update_request: I/O error, dev mtdblock1, sector 24
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.437936] Buffer I/O error on dev mtdblock1, logical block 0, async page read
Wed Nov 22 17:52:43 2017 kern.err kernel: [ 14.447137] Buffer I/O error on dev mtdblock1, logical block 0, async page read

I tried different builds and it keeps crashing when I access the Wi-Fi...

With the stock LEDE stable release it doesn't seem to crash, for now... any idea?

And also... why is there only 15-20 MB of space? It has 128 MB of flash, so where is all the space?

OK, I found this... I think it should be placed in the first post so that a new R7800 user understands the small space... I bought this router for LEDE, not for that shitty Netgear firmware (which, from the source, looks like it is based on OpenWrt...).
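For anyone else wondering where the flash goes: a minimal sketch that sums up the partitions listed in /proc/mtd, which is where the kernel lists MTD partitions with their sizes in hex. The partition names and sizes in the sample text are made up for illustration and are not the real R7800 layout; the function names are my own. If Python isn't installed on the router, you can paste the output of `cat /proc/mtd` into the sample string on a PC instead.

```python
import re

# Sample text in the /proc/mtd format. The names and sizes below are
# hypothetical and NOT the real R7800 partition layout.
SAMPLE_PROC_MTD = '''dev:    size   erasesize  name
mtd0: 00c80000 00020000 "boot"
mtd1: 00200000 00020000 "kernel"
mtd2: 01e00000 00020000 "ubi"
mtd3: 04000000 00020000 "reserved"
'''

# mtdN: <size hex> <erasesize hex> "<name>"
LINE_RE = re.compile(r'^(mtd\d+):\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+"(.*)"$')


def parse_proc_mtd(text):
    """Return a list of (device, size_in_bytes, name) tuples."""
    partitions = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # the header line does not match and is skipped
            dev, size_hex, _erase_hex, name = match.groups()
            partitions.append((dev, int(size_hex, 16), name))
    return partitions


def format_report(partitions):
    """Render one line per partition plus a total, sizes in MiB."""
    lines = [f"{dev:6} {size / 2**20:8.2f} MiB  {name}"
             for dev, size, name in partitions]
    total = sum(size for _dev, size, _name in partitions)
    lines.append(f"total  {total / 2**20:8.2f} MiB")
    return "\n".join(lines)


if __name__ == "__main__":
    try:
        with open("/proc/mtd") as handle:
            text = handle.read()
    except OSError:
        text = SAMPLE_PROC_MTD  # fall back to the made-up sample
    print(format_report(parse_proc_mtd(text)))
```

If the report shows most of the 128 MB sitting in partitions other than the one holding the overlay, that would explain the small free space.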

@chunkeey
What do you think, could the following patches ruin the alignment of frames for QCA9984 and lead to the issues under discussion?
https://github.com/lede-project/source/blob/master/package/kernel/mac80211/patches/307-mac80211-add-hdrlen-to-ieee80211_tx_data.patch
https://github.com/lede-project/source/blob/master/package/kernel/mac80211/patches/308-mac80211-add-NEED_ALIGNED4_SKBS-hw-flag.patch
These patches haven't made their way upstream for some reason. Besides, it seems that padding is already done within the hardware. Or is it a different kind of padding?
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.9.65&id=9e19e13261423eeb4398177001daa874c2128aa4
edit: the same issue seems to happen on the R7500v2 with QCA9980
https://bugs.lede-project.org/index.php?do=details&task_id=1173

Both of these patches seem to be rotting in patchwork, marked as "changes requested".
https://patchwork.kernel.org/patch/8359111/
https://patchwork.kernel.org/patch/8359121/ <-- has the discussion.

I don't think that these cause the regression you are suffering from. In fact, these patches seem to be missing the ath10k part, and it's kinda weird that they introduce an unused HW flag, IEEE80211_HW_NEEDS_ALIGNED4_SKBS. There's no code that sets or checks the flag. From what I can tell, these patches do very little.

As for patches: you should look at the treasure trove Qualcomm is hoarding on their site:
https://source.codeaurora.org/quic/qsdk/oss/system/feeds/wlan-open/tree/mac80211/patches?h=coconut (Yeah, so much for " I find it frustrating OpenWRT/LEDE doesn't try to work with upstream on fixing these things right. " :zipper_mouth_face: :man_facepalming: )

More precisely, the patches starting with a00-xxx.
The highlights include the smart antenna stuff, a MU-MIMO fix for QCA99x0, and this ce5 full fix:
https://source.codeaurora.org/quic/qsdk/oss/system/feeds/wlan-open/tree/mac80211/patches/a00-800-008-ath10k-fix-ce5-full-issue.patch?h=coconut

This memory-leak fix (looks a bit weird in this context, as the memory leak would need to happen when the device is down / in reset?!)
https://source.codeaurora.org/quic/qsdk/oss/system/feeds/wlan-open/tree/mac80211/patches/a00-900-0013-ath10k-fix-rx-ring-memory-leak.patch?h=coconut

And maybe more.

However, you and others report that this problem started with (recent?) 4.9, and that 4.4 with the same compat-wireless/backports doesn't show the issue, so I don't think this will help much at all. Again, the issue(s) might be hiding in plain sight, but without hardware I can't really debug this myself. It could be lurking in the SoC (errata, bad clocks), or be an issue with PCIe, ... At most I can speak from what I know about IPQ4019 (which is a totally different platform) and from the more recent experience of porting it to 4.14.

If you want to take another shot in the dark: can you test whether the iperf3 performance stayed the same or also degraded between 4.4 and 4.9 when you run a localhost loopback?
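A rough sketch of how such a loopback comparison could be scripted, so the same test is run on both kernels. It assumes iperf3 is installed and that its JSON output (`-J`) for a TCP test carries the receiver throughput under `end.sum_received.bits_per_second`; the function names and the 5 second duration are my own choices, not anything prescribed in this thread.

```python
import json
import shutil
import subprocess
import time


def received_mbps(result):
    """Receiver-side throughput in Mbit/s from parsed iperf3 JSON.

    Assumes the TCP result layout end -> sum_received -> bits_per_second.
    """
    return result["end"]["sum_received"]["bits_per_second"] / 1e6


def change_percent(before, after):
    """Relative change from 'before' to 'after' in percent."""
    return (after - before) * 100.0 / before


def run_loopback_test(seconds=5):
    """Run a one-shot iperf3 server and a client against 127.0.0.1."""
    server = subprocess.Popen(["iperf3", "-s", "-1"],
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
    try:
        time.sleep(1)  # give the server a moment to start listening
        client = subprocess.run(
            ["iperf3", "-c", "127.0.0.1", "-t", str(seconds), "-J"],
            capture_output=True, text=True, check=True)
        return received_mbps(json.loads(client.stdout))
    finally:
        server.terminate()
        server.wait()


if __name__ == "__main__":
    if shutil.which("iperf3") is None:
        print("iperf3 not found, nothing to measure")
    else:
        print(f"loopback throughput: {run_loopback_test():.0f} Mbit/s")
```

Run it once on a 4.4 build and once on a 4.9 build, then feed the two numbers to `change_percent` to see whether the CPU-bound loopback path itself got slower.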

Woah... how could I miss this

It would be good to at least add those patches to LEDE, and ideally upstream them.

Well, it's even worse. Please take the time to read: https://www.spinics.net/lists/linux-wireless/msg167883.html
This is/was all well known by the ath10k maintainer for some time now. At this point, I have to ask, what can you even do?

Furthermore, it's not like QCA9984 is the only issue. The customer QCA6174 boards have massive issues with performance too. Just look at the threads over at github:

(Bonus points: some of the posts link to Fedora and Ubuntu bug reports)


Thanks for the links.
To me it really does not make sense to not upstream anything.
That is classic corporate style: let's just fork everything and maintain our own fork, which gets harder and harder to maintain as everything else moves forward and we have to backport.

So for the R7800, which patch do we need to apply to solve the problem with the 4.9 kernel?

Does anyone else have this error?

Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.251798] ath10k_pci 0000:01:00.0: failed to flush transmit queue (skip 0 ar-state 1): 0
Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.296253] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.296281] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 1
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.309466] ath10k_pci 0000:01:00.0: firmware crashed! (guid c95b3960-6710-4367-ad39-3706a2029428)
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.309501] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.317563] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.329756] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.4-00082 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast crc32 f301de65
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.336109] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 0:1 crc32 751efba1
Wed Nov 29 21:42:37 2017 kern.info kernel: [  468.349056] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 512 raw 0 hwcrypto 1
Wed Nov 29 21:42:37 2017 kern.warn kernel: [  468.368374] ath10k_pci 0000:01:00.0: failed to get memcpy hi address for firmware address 4: -16
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.368399] ath10k_pci 0000:01:00.0: failed to read firmware dump area: -16
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.376263] ath10k_pci 0000:01:00.0: Copy Engine register dump:
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.382956] ath10k_pci 0000:01:00.0: [00]: 0x0004a000 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.388808] ath10k_pci 0000:01:00.0: [01]: 0x0004a400 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.397885] ath10k_pci 0000:01:00.0: [02]: 0x0004a800 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.406757] ath10k_pci 0000:01:00.0: [03]: 0x0004ac00 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.415598] ath10k_pci 0000:01:00.0: [04]: 0x0004b000 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.424470] ath10k_pci 0000:01:00.0: [05]: 0x0004b400 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.433311] ath10k_pci 0000:01:00.0: [06]: 0x0004b800 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.442152] ath10k_pci 0000:01:00.0: [07]: 0x0004bc00 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.450960] ath10k_pci 0000:01:00.0: [08]: 0x0004c000 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.459872] ath10k_pci 0000:01:00.0: [09]: 0x0004c400 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.468738] ath10k_pci 0000:01:00.0: [10]: 0x0004c800 3735928559 3735928559 3735928559 3735928559
Wed Nov 29 21:42:37 2017 kern.err kernel: [  468.477587] ath10k_pci 0000:01:00.0: [11]: 0x0004cc00 3735928559 3735928559 3735928559 3735928559
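A side note on reading the dump above: the value 3735928559 repeated in every row is 0xdeadbeef in hex. That is a well-known filler pattern, so these are probably not real register contents; my guess (an assumption, not confirmed anywhere in this thread) is that the reads failed because the chip was no longer responding, which would fit the "failed to read firmware dump area" line. A small sketch to convert the rows, with a regex and function name of my own:

```python
import re

# One row from the Copy Engine dump above, trimmed to the driver message.
SAMPLE = ("ath10k_pci 0000:01:00.0: [00]: 0x0004a000 "
          "3735928559 3735928559 3735928559 3735928559")

# [index]: base address, followed by four decimal values
CE_LINE_RE = re.compile(r"\[(\d+)\]: (0x[0-9a-fA-F]+)((?: +\d+){4})\s*$")


def decode_ce_line(line):
    """Return (index, base_address, [hex values]) or None if no match."""
    match = CE_LINE_RE.search(line)
    if match is None:
        return None
    index = int(match.group(1))
    base = int(match.group(2), 16)
    values = [f"0x{int(v):08x}" for v in match.group(3).split()]
    return index, base, values


if __name__ == "__main__":
    print(decode_ce_line(SAMPLE))
```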

My Wi-Fi is very, very unstable... can someone help me?

I think the Wi-Fi is losing packets on the way...

The only people who have knowledge about the internals of the firmware and can fix ath10k-firmware issues are working for QCA in one way or another. From past experience, I can safely say they will not visit this forum... ever... not even if you provide them with a direct link...

You should post this to ath10k@lists.infradead.org . However, you'll have to put some effort into the mail. Simply posting a dump will get you nowhere. You have to describe your setup a bit and include information about the Wi-Fi clients you use (and the ones in your vicinity). You have to make it crystal clear that this is a regression and that it was working before in the same setup but with a different fw. Furthermore, you'll have to hope the issue gains some attention (like serious posts from other affected users, and there are always some!). The bigger the commotion, the better! It would be best if you coordinated this issue with others.

"Still not convinced?" :thinking: Look at this "QCA9984 bmi identification failure" case study:
https://www.spinics.net/lists/linux-wireless/msg160469.html :rofl:

The problem is that, for now, the R7800 is completely broken in master...
Just look at the LEDE bugs page... The first bug sums up all the problems... I think we need to coordinate and create one big report to make it clear that there are lots of problems with this specific router.

Those can get really fierce and long.

So I've just compiled current trunk with k4.4 and the issue is gone.
I guess the only way to verify is to bisect to the point where the issue was first introduced, and confirm that it is related to k4.9 itself and not to the corresponding LEDE patches for ipq806x or the kernel.

Anyway, for now I'm emailing with one guy and testing a custom firmware build meant to fix a bug related to power save (I sent him the bug report and that's what he told me).

@chunkeey

Anyway, it is very strange that with my PC I get a complete Wi-Fi crash, while with my phone I get the corrupted RX ring.

edit: actually... with this custom version I get very poor throughput on 5 GHz (apart from some random ring corruption sh*t), but it is very stable... no crashes... LuCI is still buggy anyway, so I think we are losing some packets on the way...

edit2: spoke too soon, the RX ring got corrupted and the Wi-Fi crashed... well, one problem solved...

If anyone is interested and has Wi-Fi issues, you can compile current trunk against k4.4 instead of k4.9. The Wi-Fi issues are fixed for me with k4.4. Just apply the following patch to your tree:
https://github.com/dissent1/r7800/commit/bc9d2d9896255c0481e91e336042ce17cc2737e8

Thanks to @nbd for providing mac80211 backports patch!

That makes me wonder if something has gone slightly wrong in the ipq806x patches during the 4.4 --> 4.9 kernel bump.

If I remember right from my own kernel bump experiments of that time, it was rather easy to get the refresh wrong for some of the patches (e.g. the context of some of the kernel hardware flags).

Yes, I will try to bisect the breaking commit when I have time, so we can be sure what it's exactly related to.

But, unfortunately, I fear that I may miss the real commit, because the issue tends to be elusive and sometimes doesn't trigger on some builds.
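To put a rough number on that risk, here is a back-of-the-envelope sketch. It assumes each test run of a bad build triggers the issue independently with the same probability, which is a simplification; the 50% trigger rate and the 5% target are made-up example values, not measurements from this thread.

```python
import math


def miss_probability(trigger_rate, runs):
    """Chance that a bad build survives 'runs' tests without the bug showing."""
    return (1.0 - trigger_rate) ** runs


def runs_needed(trigger_rate, max_miss):
    """Smallest number of clean runs so the miss chance is at most max_miss."""
    return math.ceil(math.log(max_miss) / math.log(1.0 - trigger_rate))


if __name__ == "__main__":
    rate = 0.5     # hypothetical: bug shows up in half of the test runs
    target = 0.05  # hypothetical: accept a 5% chance of a wrong "good"
    runs = runs_needed(rate, target)
    print(f"{runs} clean runs per bisect step, "
          f"miss chance {miss_probability(rate, runs):.3f}")
```

In other words, a build should only be marked "good" after several clean runs, since a single wrong "good" sends the whole bisect down the wrong half of the history.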

Which NICs support the smart antenna algorithm?

@dissent1 I was checking the commit that added 4.9 support; could the problem be here?
https://git.lede-project.org/?p=source.git;a=blob;f=target/linux/ipq806x/patches-4.9/0071-pcie-qcom-fixes.patch;h=dd403e29574f199f94a237d247251c53ea731ab8;hb=7ffaf71d53f71a901f41cb87dff3dfe0f68e442d


Ben Greear fixed my crash... (the only problems now are msdu, poor performance and ring corruption)

Here is the firmware... it's from Candela Technologies (ath10k-ct), so not the official one... Can someone test it?
https://drive.google.com/file/d/1ThmKbdJUNEQ9yMXJ248to-PFr5jfhFkm/view?usp=sharing