Build for Netgear R7800

wired · May 17, 2019, 7:30pm

Anyone seeing lockups/freezes? I have a very simple setup running in AP mode, and I disabled most non-essential services (dns, dhcp, firewall, ...). But I just found it frozen: wifi lights solid, but no wifi connections. The switch was still alive, as I could ping between devices attached to it, but I could not ssh to the AP via the wired ports. I had to power cycle. Not sure whether there are any crash logs, since dmesg is gone after a reboot.

fantom-x · May 17, 2019, 8:37pm

Is it overheating? Could also be a faulty power supply.

wired · May 18, 2019, 12:56am

Temperature looks stable at 50-53C, it doesn't feel hot, and it's not getting a lot of traffic either. I hope it's not a hardware issue.

I see this crash in the log but it seems to recover from it, or maybe not always?

Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.497781] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.118 #0
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.519918] Hardware name: Generic DT based system
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.525940] [<c030f3b8>] (unwind_backtrace) from [<c030b5a8>] (show_stack+0x14/0x20)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.530610] [<c030b5a8>] (show_stack) from [<c07a0e78>] (dump_stack+0x88/0x9c)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.538514] [<c07a0e78>] (dump_stack) from [<c0322d08>] (__warn+0xf0/0x11c)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.545537] [<c0322d08>] (__warn) from [<c0322df4>] (warn_slowpath_null+0x20/0x28)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.552441] [<c0322df4>] (warn_slowpath_null) from [<bf803364>] (ath10k_htt_t2h_msg_handler+0x11fc/0x217c [ath10k_core])
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.560207] [<bf803364>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf804108>] (ath10k_htt_t2h_msg_handler+0x1fa0/0x217c [ath10k_core])
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.571148] [<bf804108>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf8049b0>] (ath10k_htt_txrx_compl_task+0x6cc/0xb10 [ath10k_core])
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.583704] [<bf8049b0>] (ath10k_htt_txrx_compl_task [ath10k_core]) from [<bf84ab1c>] (ath10k_pci_napi_poll+0x78/0x118 [ath10k_pci])
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.596461] [<bf84ab1c>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c069ea20>] (net_rx_action+0x144/0x31c)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.608302] [<c069ea20>] (net_rx_action) from [<c03015c8>] (__do_softirq+0xf0/0x264)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.617767] [<c03015c8>] (__do_softirq) from [<c0327148>] (irq_exit+0xdc/0x148)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.625663] [<c0327148>] (irq_exit) from [<c0364190>] (__handle_domain_irq+0xa8/0xc8)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.632688] [<c0364190>] (__handle_domain_irq) from [<c0301488>] (gic_handle_irq+0x6c/0xb8)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.640676] [<c0301488>] (gic_handle_irq) from [<c030c18c>] (__irq_svc+0x6c/0x90)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.648828] Exception stack(0xc0b01f50 to 0xc0b01f98)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.656478] 1f40:                                     00000001 00000000 00000000 60000013
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.661529] 1f60: ffffe000 c0b03cbc c0b03c70 00000000 00000000 c0a2ca28 00000000 00000000
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.669681] 1f80: c0b01f90 c0b01fa0 c0359f8c c03735a0 60000013 ffffffff
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.677839] [<c030c18c>] (__irq_svc) from [<c03735a0>] (rcu_idle_exit+0x0/0x98)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.684257] [<c03735a0>] (rcu_idle_exit) from [<c0a00cd0>] (start_kernel+0x3fc/0x408)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.691616] ---[ end trace ab9d08b17c1e5a5e ]---

fantom-x · May 18, 2019, 1:13am

Are you running a ct variant or ath10k? I had lots of lock ups and crashes with ct.

wired · May 18, 2019, 1:48am

No, CT doesn't work well at all for me, the freeze was on "old".

wired · May 20, 2019, 3:57pm

@hnyman any plans for builds that include the high CPU fix as a 1-off cherrypick before the fix is officially merged?

hnyman · May 20, 2019, 4:10pm

Sorry, I have not had time to look into this.
I might include the patch, but based on his PR 1984 message 3 days ago, @ynezz might be close to merging the MIB counter fix officially. So hopefully the official fix gets implemented soon.

ynezz · May 20, 2019, 7:38pm

Fixes just landed in the master.

hnyman · May 20, 2019, 8:18pm

master-r10061-0f6b944c92-20190520-ath10k build contains the fixes.
(no -ct build, yet)

wired · May 21, 2019, 3:56pm

Thank you for the build with the CPU fix. Now what will I use all that power for?

Question on irqbalance: do you guys use it, any pros/cons? Where do you start it from? rc.local?

shelterx · May 22, 2019, 8:23pm

Unrelated to hnyman's build, but I've been running with the MIB fix and counters off for a while, with the ath10k-ct driver and the official QCA 3.9.0.1 firmware.

I must say the router is rock stable and zero load avg.
OpenWrt SNAPSHOT, r9859-66e2acad9c
22:16:45 up 40 days, 2:03, load average: 0.00, 0.00, 0.00

@wired
I just run irqbalance with no arguments from rc.local. That makes it start as a background service. I'm sure it helps some things, but it's not like I benchmarked it.

perceival · May 22, 2019, 9:44pm

Does the ct build from May 22nd contain the MIB fix?

slh · May 22, 2019, 9:59pm

diff --git a/files/etc/Compile_info.txt b/files/etc/Compile_info.txt
new file mode 100644
index 0000000000..8d3c47c60b
--- /dev/null
+++ b/files/etc/Compile_info.txt
@@ -0,0 +1,6 @@
+OpenWrt master r10069-33b81b5721 / 2019-05-22 23:31
+---
+main      2019-05-22 33b81b5 Revert "bc: update to 1.07.1"
+luci      2019-05-20 2cdc5f1 Merge pull request #2691 from sumpfralle/
+packages  2019-05-22 7a64d25 Merge pull request #9031 from James-TR/dn
+routing   2019-05-07 040b8e8 nodogsplash: Release v3.3.2-1 (#468)

Which means, yes.
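
The reasoning is simply that r10069 is later than r10061, and hnyman's master-r10061 build is the first one reported above as containing the fixes. A rough sketch of that comparison, in case someone wants to script it. The function names `rev_of` and `is_at_least_rev` are made up for illustration, and this only makes sense for unpatched builds from master (a build with the patch applied by hand can have a lower number and still contain the fix).

```shell
#!/bin/sh
# Hypothetical helpers, not part of OpenWrt or of any build script.

# Print the numeric revision from a string such as
# "master-r10061-0f6b944c92-20190520-ath10k" or "r10069-33b81b5721".
rev_of() {
    s=${1#master-}            # drop the optional branch prefix
    s=${s#r}                  # drop the leading "r"
    printf '%s\n' "${s%%-*}"  # keep everything before the first "-"
}

# Succeeds when the revision in build string $1 is >= number $2.
is_at_least_rev() {
    [ "$(rev_of "$1")" -ge "$2" ]
}

# Example: is the May 22nd build at or past the r10061 build?
if is_at_least_rev 'r10069-33b81b5721' 10061; then
    echo "r10069 is at or after r10061"
fi
```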

hnyman · May 23, 2019, 5:03am

Sure. The fix is now in the master sources repo, and that -ct build has been compiled from a later commit. That can be seen from the build name string, and the details are in the info and patch files included with the build (like slh shows above).

wired · May 25, 2019, 4:16pm

Just curious, anyone else seeing these crashes in dmesg/logread? Known issue? Harmless? Depends on certain wifi clients? Still trying to figure out why my router freezes every few days and not sure whether this is related.

[18348.278269] WARNING: CPU: 0 PID: 0 at backports-4.19.32-1/drivers/net/wireless/ath/ath10k/htt_rx.c:1179 ath10k_htt_t2h_msg_handler+0x11fc/0x217c [ath10k_core]
...
[18348.439243] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.119 #0
[18348.461342] Hardware name: Generic DT based system
[18348.467372] [<c030f3b8>] (unwind_backtrace) from [<c030b5a8>] (show_stack+0x14/0x20)
[18348.472040] [<c030b5a8>] (show_stack) from [<c07a0fb8>] (dump_stack+0x88/0x9c)
[18348.479942] [<c07a0fb8>] (dump_stack) from [<c0322d08>] (__warn+0xf0/0x11c)
[18348.486967] [<c0322d08>] (__warn) from [<c0322df4>] (warn_slowpath_null+0x20/0x28)
[18348.493880] [<c0322df4>] (warn_slowpath_null) from [<bf803364>] (ath10k_htt_t2h_msg_handler+0x11fc/0x217c [ath10k_core])
[18348.501656] [<bf803364>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf804108>] (ath10k_htt_t2h_msg_handler+0x1fa0/0x217c [ath10k_core])
[18348.512592] [<bf804108>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf8049b0>] (ath10k_htt_txrx_compl_task+0x6cc/0xb10 [ath10k_core])
[18348.525142] [<bf8049b0>] (ath10k_htt_txrx_compl_task [ath10k_core]) from [<bf84ab1c>] (ath10k_pci_napi_poll+0x78/0x118 [ath10k_pci])
[18348.537897] [<bf84ab1c>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c069eb68>] (net_rx_action+0x144/0x31c)
[18348.549729] [<c069eb68>] (net_rx_action) from [<c03015c8>] (__do_softirq+0xf0/0x264)
[18348.559196] [<c03015c8>] (__do_softirq) from [<c0327148>] (irq_exit+0xdc/0x148)
[18348.567092] [<c0327148>] (irq_exit) from [<c0364190>] (__handle_domain_irq+0xa8/0xc8)
[18348.574117] [<c0364190>] (__handle_domain_irq) from [<c0301488>] (gic_handle_irq+0x6c/0xb8)
[18348.582102] [<c0301488>] (gic_handle_irq) from [<c030c18c>] (__irq_svc+0x6c/0x90)
[18348.590254] Exception stack(0xc0b01f48 to 0xc0b01f90)
[18348.597913] 1f40:                   00000001 00000000 00000000 c0315160 ffffe000 c0b03cbc
[18348.602959] 1f60: c0b03c70 00000000 00000000 c0a2ca28 00000000 00000000 c0b01f90 c0b01f98
[18348.611102] 1f80: c030878c c0308790 60000013 ffffffff
[18348.619257] [<c030c18c>] (__irq_svc) from [<c0308790>] (arch_cpu_idle+0x38/0x44)
[18348.624298] [<c0308790>] (arch_cpu_idle) from [<c0359efc>] (do_idle+0xe8/0x1bc)
[18348.631756] [<c0359efc>] (do_idle) from [<c035a244>] (cpu_startup_entry+0x1c/0x20)
[18348.638789] [<c035a244>] (cpu_startup_entry) from [<c0a00cd0>] (start_kernel+0x3fc/0x408)
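
To get a feel for how often this warning fires, something like the following could be used. It is only a minimal sketch: `count_ath10k_warnings` is an illustrative name, and it assumes the log was first saved to a file (for example with `logread > /tmp/log.txt`, where the path is arbitrary).

```shell
#!/bin/sh
# Count kernel WARNING lines that mention ath10k in a saved log file.
# Prints 0 when there are none.
count_ath10k_warnings() {
    grep -c 'WARNING: .*ath10k' "$1"
}
```

Usage would be `count_ath10k_warnings /tmp/log.txt`. Comparing the count with the uptime should show whether the warnings line up with the freezes or happen all the time.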

perceival · May 26, 2019, 4:19am

I had it once a few days back, but I was juggling different versions, so I'm not really sure which one it was. In general, my impression is that the -ct driver is unstable and crashes frequently.

wired · May 26, 2019, 5:46am

I agree that -ct is not ready for prime time. My problems happen on "old" though.

perceival · May 27, 2019, 10:43am

I am new to OpenWrt and my experience so far isn't bad. Still, from time to time some process (usually nlbwmon) goes wild and causes high CPU utilization, so I decided to put some simple monitoring and alerting in place. I am posting it here since it has been tested on the R7800 only, but it should be pretty universal.

Sendmail first (configured for outlook.com)

opkg install ssmtp

/etc/ssmtp/ssmtp.conf

root=myemail@outlook.com
mailhub=smtp-mail.outlook.com:587
rewriteDomain=outlook.com
UseStartTLS=YES
AuthUser=myemail@outlook.com
AuthPass=myOutlookApplicationPassword

/etc/ssmtp/revaliases

root:myemail@outlook.com:smtp-mail.outlook.com:587

And last but not least, add the following line to /etc/rc.local before exit 0:

(while true; do sleep 15m; if [ "$(cat /sys/devices/virtual/thermal/thermal_zone9/temp)" -ge 60000 ] && sleep 15m && [ "$(cat /sys/devices/virtual/thermal/thermal_zone9/temp)" -ge 60000 ]; then printf 'Subject: RouterName Thermal/High CPU Alert\n\n%s\n%s\n%s\n' "$(cat /sys/devices/virtual/thermal/thermal_zone9/temp)" "$(uptime)" "$(top -b -n 3)" | sendmail mymail@gmail.com; sleep 3h; fi; done) &

The first sleep skips the higher CPU load at boot-up and also governs how often the temperature is checked. The second one makes sure an alert is sent only if the temperature is still high 15 minutes later. The third one limits alerts to at most one every 3 hours.
I found it easier to monitor temperature, but the script can be modified to check actual CPU utilization instead.
The mail contains some diagnostic info (the output of uptime overlaps with top but is easier to read).
The threshold temperature is set to 60 °C (the sysfs file reports millidegrees, hence 60000).
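
For readability, the same logic can be spread over a small script. This is only a sketch under a few assumptions: the function names and the `run` argument are made up, and the zone path, limit and mail address are the ones used in this post, so adjust them for your own router.

```shell
#!/bin/sh
# Sketch of the rc.local one-liner as a standalone script.
# ZONE, LIMIT and MAILTO are placeholders taken from this post.
ZONE=${ZONE:-/sys/devices/virtual/thermal/thermal_zone9/temp}
LIMIT=${LIMIT:-60000}              # millidegrees Celsius, i.e. 60 °C
MAILTO=${MAILTO:-mymail@gmail.com}

# Succeeds when the value stored in file $1 is at or above limit $2.
is_hot() {
    [ "$(cat "$1")" -ge "$2" ]
}

# Write the alert mail (subject, blank line, diagnostics) to stdout.
alert_mail() {
    printf 'Subject: RouterName Thermal/High CPU Alert\n\n'
    printf 'temp: %s\n' "$(cat "$1")"
    uptime
    top -b -n 3
}

# Check every 15 minutes, require two hot readings 15 minutes apart,
# then stay quiet for 3 hours after sending an alert.
monitor() {
    while true; do
        sleep 15m
        if is_hot "$ZONE" "$LIMIT" && sleep 15m && is_hot "$ZONE" "$LIMIT"; then
            alert_mail "$ZONE" | sendmail "$MAILTO"
            sleep 3h
        fi
    done
}

# Start the loop only when called with "run", so that the functions
# can be sourced and tried out on their own.
if [ "${1:-}" = "run" ]; then
    monitor
fi
```

Saved under some path of your choice (say /root/tempwatch.sh, a made-up name), it would be started from /etc/rc.local with `/root/tempwatch.sh run &`.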

Nague · May 27, 2019, 3:02pm

Same issue here with 18.06.2 and nlbwmon.

perceival · May 27, 2019, 7:30pm

The version using uptime is simpler. Everything is the same except for the entry in /etc/rc.local:

(while true; do sleep 15m; if [ "$(uptime | awk '{print int($NF*100)}')" -ge 300 ]; then printf 'Subject: RouterName Load Alert\n\n%s\n%s\n%s\n' "$(cat /sys/devices/virtual/thermal/thermal_zone9/temp)" "$(uptime)" "$(top -b -n 3)" | sendmail mymail@gmail.com; sleep 3h; fi; done) &

The threshold for the 15-minute load average is set to 3.00.
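
The load check can also be written as a small function that reads /proc/loadavg, whose third field is the 15-minute average. That avoids depending on the layout of the uptime output, which changes with how long the router has been up. A minimal sketch; the name `load15_at_least` is made up, and the file is passed as an argument so the function can be tried on a sample file.

```shell
#!/bin/sh
# Succeeds when the 15-minute load average (third field of the
# /proc/loadavg-style file $1) is at or above threshold $2.
load15_at_least() {
    awk -v limit="$2" '{ exit !($3 + 0 >= limit + 0) }' "$1"
}
```

In the loop above, the condition would then become `if load15_at_least /proc/loadavg 3.00; then ...`.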