Netgear R7800 exploration (IPQ8065, QCA9984)

Hi, I'm running the OpenWrt 18.06-SNAPSHOT r7297-13dccfc8e4 build, and if I enable both 2.4GHz and 5GHz wifi simultaneously my R7800 just hard resets once clients connect. It just reboots, and then it works fine, until the clients connect again, which causes it to reset again.

I'm running out of things to try to debug this. Do any of you have an idea how to investigate it? I've tried all kinds of different configurations, and individually both 5GHz and 2.4GHz run well.

Since it just resets, I'm not getting any handle on what might be wrong. I've tried running while true; do dmesg -c; done over a wired connection to try to get any output before it crashes, but there's nothing in there.

I've even tried the ath10k-ct firmware, but with it the router behaves exactly the same.
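One alternative to polling dmesg over the network, sketched here under assumptions: write every log line to persistent storage the moment it arrives, so that whatever was logged last survives the reset. The function name and the /root/crash.log path are hypothetical. The sketch assumes `logread -f` is available (as on stock OpenWrt) and that the target file lives on flash or USB rather than in /tmp. If the reset is abrupt enough that the kernel never logs anything, this will not capture anything either.

```shell
# Append each line read on stdin to the file given as $1 and flush it
# to storage immediately, so the last lines survive a sudden reset.
# Note: syncing per line is slow and wears flash; use only while debugging.
log_lines_synced() {
    while IFS= read -r line; do
        printf '%s\n' "$line" >>"$1"
        sync
    done
}

# Possible use on the router (path is an assumption):
#   logread -f | log_lines_synced /root/crash.log
```

After the next reset, the tail of the file shows the last messages that reached the log.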

Any help would be greatly appreciated.

Config looks like this:

config wifi-device 'radio0'
	option type 'mac80211'
	option hwmode '11a'
	option path 'soc/1b500000.pci/pci0000:00/0000:00:00.0/0000:01:00.0'
	option htmode 'VHT80'
	option channel '52'
	option country 'US'
	option legacy_rates '1'

config wifi-iface 'default_radio0'
	option device 'radio0'
	option network 'lan'
	option mode 'ap'
	option encryption 'psk-mixed'
	option key 'secret'
	option wps_pushbutton '0'
	option ssid 'OpenWrt5'
	option disabled '1'

config wifi-device 'radio1'
	option type 'mac80211'
	option channel '11'
	option hwmode '11g'
	option path 'soc/1b700000.pci/pci0001:00/0001:00:00.0/0001:01:00.0'
	option htmode 'HT20'
	option country 'US'
	option legacy_rates '1'

config wifi-iface 'default_radio1'
	option device 'radio1'
	option network 'lan'
	option mode 'ap'
	option ssid 'OpenWrt'
	option encryption 'psk-mixed'
	option key 'secret'
	option wps_pushbutton '0'

dmesg | grep ath looks like this, maybe you can spot something in there?

[   18.377486] ath10k_pci 0000:01:00.0: assign IRQ: got 67
[   18.377864] ath10k_pci 0000:01:00.0: enabling device (0140 -> 0142)
[   18.377942] ath10k_pci 0000:01:00.0: enabling bus mastering
[   18.378389] ath10k_pci 0000:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[   18.568383] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-6.bin failed with error -2
[   18.568430] ath10k_pci 0000:01:00.0: Falling back to user helper
[   18.595136] firmware ath10k!QCA9984!hw1.0!firmware-6.bin: firmware_loading_store: map pages failed
[   18.708116] ath10k_pci 0000:01:00.0: Unknown FW IE: 30
[   18.708143] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[   18.712145] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[   18.723700] ath10k_pci 0000:01:00.0: firmware ver 10.4-ct-9984-xtW-010-868495e api 5 features peer-flow-ctrl crc32 b68bff6b
[   21.042564] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 0:1 crc32 dd6d039c
[   26.865910] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 32 raw 0 hwcrypto 1
[   26.956529] ath: EEPROM regdomain: 0x0
[   26.956542] ath: EEPROM indicates default country code should be used
[   26.956552] ath: doing EEPROM country->regdmn map search
[   26.956571] ath: country maps to regdmn code: 0x3a
[   26.956582] ath: Country alpha2 being used: US
[   26.956592] ath: Regpair used: 0x3a
[   26.961759] ath10k_pci 0001:01:00.0: assign IRQ: got 100
[   26.962811] ath10k_pci 0001:01:00.0: enabling device (0140 -> 0142)
[   26.962946] ath10k_pci 0001:01:00.0: enabling bus mastering
[   26.963622] ath10k_pci 0001:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[   27.135905] ath10k_pci 0001:01:00.0: Direct firmware load for ath10k/QCA9984/hw1.0/firmware-6.bin failed with error -2
[   27.135936] ath10k_pci 0001:01:00.0: Falling back to user helper
[   27.255579] firmware ath10k!QCA9984!hw1.0!firmware-6.bin: firmware_loading_store: map pages failed
[   27.255729] ath10k_pci 0001:01:00.0: Unknown FW IE: 30
[   27.263544] ath10k_pci 0001:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[   27.268557] ath10k_pci 0001:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 1
[   27.280507] ath10k_pci 0001:01:00.0: firmware ver 10.4-ct-9984-xtW-010-868495e api 5 features peer-flow-ctrl crc32 b68bff6b
[   29.596422] ath10k_pci 0001:01:00.0: board_file api 2 bmi_id 0:2 crc32 dd6d039c
[   35.409898] ath10k_pci 0001:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 32 raw 0 hwcrypto 1
[   35.495752] ath: EEPROM regdomain: 0x0
[   35.495763] ath: EEPROM indicates default country code should be used
[   35.495773] ath: doing EEPROM country->regdmn map search
[   35.495789] ath: country maps to regdmn code: 0x3a
[   35.495801] ath: Country alpha2 being used: US
[   35.495810] ath: Regpair used: 0x3a

Hi, I’m new here. I recently moved from stock firmware because I need multi-WAN for my IPTV. The stock firmware’s QoS did a great job. However, on OpenWrt 18.06.1 r7258-5eb055306f I can’t even ping 8.8.8.8 (request timed out) while there is heavy download traffic, e.g. torrents or IDM. Normal YouTube playback doesn’t cause any packet loss; it only happens under heavy download traffic. My gaming is badly affected (spikes and freezes). All of this happens with SQM enabled, using cake with the piece_of_cake.qos script. My advertised bandwidth is 20/5 Mbps, and the download/upload rates set in SQM are 17/3.7 Mbps. Link layer adaptation is set to none (the default). Advanced configuration: none/default.

Any reply would be appreciated.

Here are my tc -d qdisc output and my sqm config:
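As a quick check of the numbers in this post: 17 and 3.7 Mbps are 85% and 74% of the advertised 20/5 Mbps (the pasted config further down uses 18100 for download, which is about 90%). A minimal sketch of that arithmetic follows. The function name is hypothetical and the percentages are simply those implied by the post, not a recommendation.

```shell
# Shaper rate in kbit/s, given the advertised rate in Mbit/s and the
# percentage of it that should be handed to the shaper.
shaper_kbit() {
    advertised_mbit=$1
    percent=$2
    echo $(( advertised_mbit * 1000 * percent / 100 ))
}

# Examples:
#   shaper_kbit 20 85   # download
#   shaper_kbit 5 74    # upload
```

The results are the values that would go into `option download` and `option upload`, which sqm expects in kbit/s.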

tc -d qdisc
qdisc noqueue 0: dev lo root refcnt 2 
qdisc mq 0: dev eth0 root 
qdisc fq_codel 0: dev eth0 parent :1 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn 
qdisc mq 0: dev eth1 root 
qdisc fq_codel 0: dev eth1 parent :1 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn 
qdisc noqueue 0: dev br-lan root refcnt 2 
qdisc noqueue 0: dev eth1.1 root refcnt 2 
qdisc noqueue 0: dev eth0.3 root refcnt 2 
qdisc noqueue 0: dev wlan1 root refcnt 2 
qdisc noqueue 0: dev wlan0 root refcnt 2 
qdisc noqueue 0: dev eth0.2 root refcnt 2 
qdisc cake 8031: dev pppoe-wan root refcnt 2 bandwidth 3700Kbit besteffort dual-srchost nat split-gso rtt 100.0ms raw overhead 0 
qdisc ingress ffff: dev pppoe-wan parent ffff:fff1 ---------------- 
qdisc fq_codel 0: dev tun0 root refcnt 2 limit 10240p flows 1024 quantum 1500 target 5.0ms interval 100.0ms memory_limit 4Mb ecn 
qdisc cake 8032: dev ifb4pppoe-wan root refcnt 2 bandwidth 18100Kbit besteffort dual-dsthost nat wash split-gso rtt 100.0ms raw overhead 0
/etc/config/sqm
config queue 'eth0'
        option linklayer 'none'
        option verbosity '5'
        option debug_logging '0'
        option qdisc 'cake'
        option upload '3700'
        option interface 'pppoe-wan'
        option enabled '1'
        option download '18100'
        option script 'piece_of_cake.qos'
        option qdisc_advanced '1'
        option squash_dscp '1'
        option squash_ingress '1'
        option ingress_ecn 'ECN'
        option egress_ecn 'NOECN'
        option qdisc_really_really_advanced '1'
        option iqdisc_opts 'nat dual-dsthost'
        option eqdisc_opts 'nat dual-srchost'
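
To compare what cake is actually applying against the values in /etc/config/sqm, the relevant fields can be pulled out of the tc output. This is a small sketch with a hypothetical function name; it only reads text on stdin and assumes the field layout shown in the output above. For the output above it shows an overhead of 0 on both cake instances.

```shell
# Summarise cake qdiscs from `tc -d qdisc` output read on stdin.
# Prints one line per cake instance: <device> <bandwidth> <overhead>
cake_summary() {
    awk '$1 == "qdisc" && $2 == "cake" {
        dev = bw = oh = "?"
        for (i = 3; i < NF; i++) {
            if ($i == "dev")       dev = $(i + 1)
            if ($i == "bandwidth") bw  = $(i + 1)
            if ($i == "overhead")  oh  = $(i + 1)
        }
        print dev, bw, oh
    }'
}

# Possible use on the router:
#   tc -d qdisc | cake_summary
```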

config queue 'eth0' should be pppoe, and it is important to select the link layer correctly, as I told you. For fiber, the overhead should be 38 or 39.

I thought pppoe should be put in the “interface” option?

But according to the wiki, the overhead for fiber is hard to determine:

fiber and ethernet it is much harder to figure out the exact overhead to specify… (the question is typically how is the ISP's upstream traffic shaper configured

Src: https://openwrt.org/docs/guide-user/network/traffic-shaping/sqm-details

I’m currently out of town. I will try specifying the overhead as soon as I get back and will update you.

I'm running fiber 300/300. When I was tweaking the shaper parameters I tested many options. I asked @moeller0 some questions about this and followed his guidance.
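
For reference, a sketch of how the overhead could be set from the command line instead of through LuCI. The function name is hypothetical. It only prints the uci commands so they can be reviewed first, and it assumes the queue section is named eth0 as in the config above and that the suggested value of 38 is the one to try.

```shell
# Print (do not run) the uci commands that set the link-layer overhead
# for one sqm queue section.
sqm_overhead_cmds() {
    section=$1   # section name in /etc/config/sqm, e.g. eth0
    overhead=$2  # per-packet overhead in bytes, e.g. 38
    cat <<EOF
uci set sqm.$section.linklayer='ethernet'
uci set sqm.$section.overhead='$overhead'
uci commit sqm
/etc/init.d/sqm restart
EOF
}

# Review:  sqm_overhead_cmds eth0 38
# Apply:   sqm_overhead_cmds eth0 38 | sh
```

Afterwards, tc -d qdisc should show the new overhead on the cake lines instead of overhead 0.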
Maybe this thread can help you out:

About the queue interface, I've also read (in the OpenWrt forums) that if you're using PPPoE it is better to set the queue on the wan interface itself, but I can't recall which thread it was.

OK, I tried changing the line to pppoe as you instructed, from

config queue 'eth0'
        option linklayer 'none'
        option verbosity '5'
        option debug_logging '0'
        ...

to

config queue 'pppoe'
        option linklayer 'none'
        option verbosity '5'
        option debug_logging '0'
        ...

However, after that change the config disappeared from LuCI (as if it had been deleted, with literally no config at all), and starting the sqm script didn't produce any log. Maybe it detects the config based on what LuCI shows, so a missing LuCI sqm config is treated as no config. After changing the name back to eth0, the config showed up as before. When I deleted the sqm config and recreated it through the LuCI interface, the queue section had no name; it just looked like this:

config queue
        option linklayer 'none'
        option verbosity '5'
        option debug_logging '0'
        ...

and sqm logs that it has started again.

I took a quick look at the post, but I don't think it is relevant to my issue. I do see an improvement in my network, with almost no latency spikes. Regular browser downloads and 1080p streaming don't cause any packet loss or spikes; however, torrenting and downloading game data (Fortnite on iOS, which I suspect uses multiple connections) sometimes cause packet loss. There aren't even any spikes, just pure packet loss, with not a single packet received during the download (literally 100% packet loss). Browsing still performs well, though, and streaming doesn't buffer either (IT'S FABULOUS). BUT PACKET LOSS still happens. It doesn't make sense to me at all.

UPDATE: I just noticed something strange. If I stream 1080p video during a game download that otherwise causes pure packet loss, the pings get responses (with rather good latency of about 30 ms and no timed-out requests at all). However, once the streaming stops, the packet loss returns.
I will try to record my screen in the next few days. I'm so busy :sleepy:

Sorry! The pppoe-wan queue was in my 17.01 settings. Now in 18.06 the queue is eth0 and you specify the interface with the option interface pppoe-wan. You're correct. As for my sqm settings:

config queue 'eth0'
	option debug_logging '0'
	option qdisc 'cake'
	option verbosity '5'
	option qdisc_advanced '1'
	option squash_dscp '1'
	option squash_ingress '1'
	option ingress_ecn 'ECN'
	option egress_ecn 'NOECN'
	option linklayer 'ethernet'
	option qdisc_really_really_advanced '1'
	option iqdisc_opts 'nat dual-dsthost'
	option eqdisc_opts 'nat dual-srchost'
	option script 'piece_of_cake.qos'
	option interface 'pppoe-wan'
	option overhead '38'
	option enabled '1'
	option download '298000'
	option upload '298000'

Do I have to use -Os when compiling the kernel, or will -O2 or -O3 work? I'm wondering because of the size issue.
When using -Ofast the vmlinux.elf size was slightly above 6MB and bzImage was 2.2MB; I guess that's too much? I'm not sure which of those files it uses.

EDIT:
-Ofast worked but dropbear had issues. I'm not sure if it's because of that flag, but I tried -O2 instead and all was good.

I seem to be losing 5GHz regularly. I've tried a few different builds; as of this moment I am running my own build off master from today, and I see this:

[ 6473.681862] ath10k_pci 0000:01:00.0: firmware crashed! (guid n/a)
[ 6473.681907] ath10k_pci 0000:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[ 6473.686951] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 0
[ 6473.698720] ath10k_pci 0000:01:00.0: firmware ver 10.4-ct-9984-fW-011-cf79c7f api 5 features peer-flow-ctrl,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT,cust-stats-CT crc32 25783e66
[ 6473.705588] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 0:1 crc32 dd6d039c
[ 6473.727475] ath10k_pci 0000:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 32 raw 0 hwcrypto 1

as well as lots of messages like this:

[ 6055.825542] ath10k_pci 0000:01:00.0: failed to submit frame: -19
[ 6055.825596] ath10k_pci 0000:01:00.0: failed to push frame: -19
[ 6055.837842] ath10k_pci 0000:01:00.0: failed to submit frame: -19
[ 6055.837902] ath10k_pci 0000:01:00.0: failed to push frame: -19
[ 6055.889570] ath10k_pci 0000:01:00.0: failed to submit frame: -19
[ 6055.889649] ath10k_pci 0000:01:00.0: failed to push frame: -19
[ 6055.923108] ath10k_pci 0000:01:00.0: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[ 6474.262001] ieee80211 phy0: Hardware restart was requested

Has anyone else seen this?
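
When filing a report, it may help to state how often each of these events occurs rather than pasting the whole log. The sketch below counts them from dmesg output on stdin. The function name is hypothetical and the patterns are taken from the messages quoted above. For context, error -19 corresponds to ENODEV ("No such device") on Linux.

```shell
# Count ath10k firmware crashes, frame submit failures and mac80211
# hardware restarts in kernel log text read on stdin.
ath10k_summary() {
    awk '
        /ath10k_pci/ && /firmware crashed!/      { crashes++ }
        /ath10k_pci/ && /failed to submit frame/ { submit++ }
        /Hardware restart was requested/         { restarts++ }
        END {
            printf "crashes=%d submit_failures=%d restarts=%d\n",
                   crashes, submit, restarts
        }
    '
}

# Possible use on the router:
#   dmesg | ath10k_summary
```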

You should report this to ath10k-ct.

OK, I did: https://github.com/greearb/ath10k-ct/issues/38

Can anyone here give me an honest comparison between the normal ath10k and the ath10k-ct drivers? I'm currently holding off on the -ct ones and wondering whether they are good enough yet to be used as a daily driver.

ath10k-ct has a problem with frequent disconnections on some devices (the S8, for example).

There's an issue with the default ath10k-ct firmware for qca9984 that leads to miserable speeds when 802.11w is enabled. However, I'm running a beta firmware that fixes this issue.

Default firmware:
http://candelatech.com/downloads/ath10k-9984-10-4/firmware-5-ct-full-htt-mgt-community-11.bin-lede.001

Beta firmware:
http://candelatech.com/downloads/ath10k-9984-10-4b/ath10k-fw-beta/firmware-5-ct-full-htt-mgt-community.bin

EDIT: added link to newer beta firmware.

Also watching this. It seems ath10k-ct has some issues. Does anyone know why ath10k-ct is used as the default?

I believe the reason behind the change to ath10k-ct as default is that the developer/maintainer is far more responsive when it comes to bug reports and feedback in general than QCA has ever been.

Hi there. I had it running LEDE 17.01.2 and it was rock solid, but after upgrading to 18.06.1 the device is no longer stable. Wifi is not stable, and I've even had the device itself need a reboot (with a blinking light). For the most part it is stable for at least a few days, until it requires a reboot. I've read through the thread and it seems there are some problems, but just to be clear: do you all have the same experience with 18.06.1? Are there any nightlies that are (more) stable? I don't want to go back to 17.x since I like the repartitioned space.

I am also having stability problems with 18.06.1 compared to v17, although my issues are slightly different to what I've seen described by others. At random intervals (2-7 days), I will lose access to the LAN and WAN connections but wireless access (ssh and luci) is fine. Rebooting corrects the issue. I've configured watchcat to reboot if this happens when I am not at home. Unfortunately nothing's logged to indicate what the problem might be. If there is a way to capture more useful debugging information I'd be happy to do that.
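
One way to capture more information, sketched under assumptions: save a snapshot of the kernel log, system log and interface state to persistent storage just before the watchdog reboot. The function name and the /root/snapshots path are hypothetical. `swconfig dev switch0 show` is included on the assumption that the wired ports sit behind a swconfig-managed switch named switch0; commands that are missing or fail are noted in the file rather than aborting the snapshot. The output directory must not be under /tmp, which does not survive a reboot.

```shell
# Write a timestamped snapshot of logs and link state into directory $1
# and print the path of the file that was written.
collect_snapshot() {
    outdir=$1
    ts=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$outdir" || return 1
    out="$outdir/snapshot-$ts.log"
    {
        echo "== snapshot at $ts =="
        for cmd in "dmesg" "logread" "ip -s link" "swconfig dev switch0 show"; do
            echo "--- $cmd ---"
            # $cmd is left unquoted on purpose so it splits into words.
            $cmd 2>&1 || echo "(command failed or not available)"
        done
    } >"$out"
    sync
    echo "$out"
}

# Possible use from cron, with a wired host that should always answer
# (the address is an assumption):
#   ping -c 3 -W 2 192.168.1.2 >/dev/null || collect_snapshot /root/snapshots
```

Comparing a snapshot taken while the wired side is down with one taken while it is healthy may show whether the switch ports, the interface counters or the kernel log differ.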