Build for Netgear R7800

Anyone seeing lockups/freezes? I have a very simple setup running in AP mode, and I disabled most non-essential services (dns, dhcp, firewall, ...). But I just found it frozen: wifi lights solid, but no wifi connections. The switch was still alive, as I could ping between devices attached to it, but I could not ssh to the AP via the wired ports. I had to power cycle. Not sure whether there are any crash logs; dmesg is gone after a reboot.

Is it overheating? Could also be a faulty power supply.

Temperature looks stable at 50-53 °C. It doesn't feel hot, and it isn't getting a lot of traffic either. I hope it's not a hardware issue.

I see this crash in the log but it seems to recover from it, or maybe not always?

Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.497781] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.118 #0
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.519918] Hardware name: Generic DT based system
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.525940] [<c030f3b8>] (unwind_backtrace) from [<c030b5a8>] (show_stack+0x14/0x20)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.530610] [<c030b5a8>] (show_stack) from [<c07a0e78>] (dump_stack+0x88/0x9c)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.538514] [<c07a0e78>] (dump_stack) from [<c0322d08>] (__warn+0xf0/0x11c)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.545537] [<c0322d08>] (__warn) from [<c0322df4>] (warn_slowpath_null+0x20/0x28)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.552441] [<c0322df4>] (warn_slowpath_null) from [<bf803364>] (ath10k_htt_t2h_msg_handler+0x11fc/0x217c [ath10k_core])
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.560207] [<bf803364>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf804108>] (ath10k_htt_t2h_msg_handler+0x1fa0/0x217c [ath10k_core])
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.571148] [<bf804108>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf8049b0>] (ath10k_htt_txrx_compl_task+0x6cc/0xb10 [ath10k_core])
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.583704] [<bf8049b0>] (ath10k_htt_txrx_compl_task [ath10k_core]) from [<bf84ab1c>] (ath10k_pci_napi_poll+0x78/0x118 [ath10k_pci])
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.596461] [<bf84ab1c>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c069ea20>] (net_rx_action+0x144/0x31c)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.608302] [<c069ea20>] (net_rx_action) from [<c03015c8>] (__do_softirq+0xf0/0x264)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.617767] [<c03015c8>] (__do_softirq) from [<c0327148>] (irq_exit+0xdc/0x148)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.625663] [<c0327148>] (irq_exit) from [<c0364190>] (__handle_domain_irq+0xa8/0xc8)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.632688] [<c0364190>] (__handle_domain_irq) from [<c0301488>] (gic_handle_irq+0x6c/0xb8)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.640676] [<c0301488>] (gic_handle_irq) from [<c030c18c>] (__irq_svc+0x6c/0x90)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.648828] Exception stack(0xc0b01f50 to 0xc0b01f98)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.656478] 1f40:                                     00000001 00000000 00000000 60000013
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.661529] 1f60: ffffe000 c0b03cbc c0b03c70 00000000 00000000 c0a2ca28 00000000 00000000
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.669681] 1f80: c0b01f90 c0b01fa0 c0359f8c c03735a0 60000013 ffffffff
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.677839] [<c030c18c>] (__irq_svc) from [<c03735a0>] (rcu_idle_exit+0x0/0x98)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.684257] [<c03735a0>] (rcu_idle_exit) from [<c0a00cd0>] (start_kernel+0x3fc/0x408)
Thu May 16 08:54:47 2019 kern.warn kernel: [ 7429.691616] ---[ end trace ab9d08b17c1e5a5e ]---

Are you running a ct variant or ath10k? I had lots of lockups and crashes with ct.

No, CT doesn't work well at all for me, the freeze was on "old".

@hnyman any plans for builds that include the high CPU fix as a one-off cherry-pick before the fix is officially merged?

Sorry, I have not had time to look into this.
I might include the patch, but based on his PR 1984 message 3 days ago, @ynezz might be close to merging the MIB counter fix officially. So, hopefully the official fix gets implemented soon.


Fixes just landed in the master.


master-r10061-0f6b944c92-20190520-ath10k build contains the fixes.
(no -ct build, yet)


Thank you for the build with the CPU fix. Now what will I use all that power for? :slight_smile:

Question on irqbalance: do you guys use it, any pros/cons? Where do you start it from? rc.local?

Unrelated to hnyman's build, but I've been running with the MIB fix and counters off for a while, with the ath10k-ct driver and the official QCA 3.9.0.1 firmware.

I must say the router is rock stable, with a load average of zero.
OpenWrt SNAPSHOT, r9859-66e2acad9c
22:16:45 up 40 days, 2:03, load average: 0.00, 0.00, 0.00

@wired
I just run irqbalance with no arguments from rc.local. That makes it start as a background service. I'm sure it helps some things, but it's not like I benchmarked it. :slight_smile:
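
For anyone who wants a rough before/after comparison, one possible approach is to total the per-CPU interrupt counts from /proc/interrupts under similar traffic with and without irqbalance. This is only a sketch: the function name `irq_totals` is made up for illustration, and it says nothing about throughput, only about how interrupts are spread across cores.

```shell
#!/bin/sh
# Sketch only: irq_totals is a hypothetical helper, not part of irqbalance or OpenWrt.
# Sums the per-CPU interrupt counters of a file in /proc/interrupts format.
# $1 is the file to read (defaults to /proc/interrupts).
irq_totals() {
    awk 'NR == 1 { ncpu = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
         { for (i = 1; i <= ncpu; i++)
               if ($(i + 1) ~ /^[0-9]+$/) sum[i] += $(i + 1) }
         END { for (i = 1; i <= ncpu; i++) printf "%s %.0f\n", name[i], sum[i] }' \
        "${1:-/proc/interrupts}"
}
```

Running `irq_totals` twice a few minutes apart and comparing the growth per CPU gives a crude picture of whether one core is handling nearly all interrupts.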


Does the -ct build from May 22nd contain the MIB fix?

diff --git a/files/etc/Compile_info.txt b/files/etc/Compile_info.txt
new file mode 100644
index 0000000000..8d3c47c60b
--- /dev/null
+++ b/files/etc/Compile_info.txt
@@ -0,0 +1,6 @@
+OpenWrt master r10069-33b81b5721 / 2019-05-22 23:31
+---
+main      2019-05-22 33b81b5 Revert "bc: update to 1.07.1"
+luci      2019-05-20 2cdc5f1 Merge pull request #2691 from sumpfralle/
+packages  2019-05-22 7a64d25 Merge pull request #9031 from James-TR/dn
+routing   2019-05-07 040b8e8 nodogsplash: Release v3.3.2-1 (#468)

Which means, yes.

Sure. The fix is now in the master sources repo, and that -ct build has been compiled from a later commit. That can be seen already from the build name string, and in more detail from the info and patch files included with the build (like slh shows above).
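
The same check can be sketched in shell by comparing the commit counters in the revision strings quoted in this thread (r10061 for the ath10k build with the fixes, r10069 for the -ct build). The helper names are hypothetical, and the comparison only makes sense when both builds come from the same branch.

```shell
#!/bin/sh
# Sketch only: rev_number and rev_at_least are illustrative names.
# Extract the commit counter from a revision string such as "r10069-33b81b5721".
rev_number() {
    echo "$1" | sed -n 's/^r\([0-9][0-9]*\)-.*/\1/p'
}

# Succeed when revision $1 is at or after revision $2 (same branch assumed).
rev_at_least() {
    [ "$(rev_number "$1")" -ge "$(rev_number "$2")" ]
}

# Revision strings taken from the posts above.
if rev_at_least r10069-33b81b5721 r10061-0f6b944c92; then
    echo "-ct build is at or after the build that contains the fixes"
fi
```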

Just curious, anyone else seeing these crashes in dmesg/logread? Known issue? Harmless? Depends on certain wifi clients? Still trying to figure out why my router freezes every few days and not sure whether this is related.

[18348.278269] WARNING: CPU: 0 PID: 0 at backports-4.19.32-1/drivers/net/wireless/ath/ath10k/htt_rx.c:1179 ath10k_htt_t2h_msg_handler+0x11fc/0x217c [ath10k_core]
...
[18348.439243] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.119 #0
[18348.461342] Hardware name: Generic DT based system
[18348.467372] [<c030f3b8>] (unwind_backtrace) from [<c030b5a8>] (show_stack+0x14/0x20)
[18348.472040] [<c030b5a8>] (show_stack) from [<c07a0fb8>] (dump_stack+0x88/0x9c)
[18348.479942] [<c07a0fb8>] (dump_stack) from [<c0322d08>] (__warn+0xf0/0x11c)
[18348.486967] [<c0322d08>] (__warn) from [<c0322df4>] (warn_slowpath_null+0x20/0x28)
[18348.493880] [<c0322df4>] (warn_slowpath_null) from [<bf803364>] (ath10k_htt_t2h_msg_handler+0x11fc/0x217c [ath10k_core])
[18348.501656] [<bf803364>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf804108>] (ath10k_htt_t2h_msg_handler+0x1fa0/0x217c [ath10k_core])
[18348.512592] [<bf804108>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf8049b0>] (ath10k_htt_txrx_compl_task+0x6cc/0xb10 [ath10k_core])
[18348.525142] [<bf8049b0>] (ath10k_htt_txrx_compl_task [ath10k_core]) from [<bf84ab1c>] (ath10k_pci_napi_poll+0x78/0x118 [ath10k_pci])
[18348.537897] [<bf84ab1c>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c069eb68>] (net_rx_action+0x144/0x31c)
[18348.549729] [<c069eb68>] (net_rx_action) from [<c03015c8>] (__do_softirq+0xf0/0x264)
[18348.559196] [<c03015c8>] (__do_softirq) from [<c0327148>] (irq_exit+0xdc/0x148)
[18348.567092] [<c0327148>] (irq_exit) from [<c0364190>] (__handle_domain_irq+0xa8/0xc8)
[18348.574117] [<c0364190>] (__handle_domain_irq) from [<c0301488>] (gic_handle_irq+0x6c/0xb8)
[18348.582102] [<c0301488>] (gic_handle_irq) from [<c030c18c>] (__irq_svc+0x6c/0x90)
[18348.590254] Exception stack(0xc0b01f48 to 0xc0b01f90)
[18348.597913] 1f40:                   00000001 00000000 00000000 c0315160 ffffe000 c0b03cbc
[18348.602959] 1f60: c0b03c70 00000000 00000000 c0a2ca28 00000000 00000000 c0b01f90 c0b01f98
[18348.611102] 1f80: c030878c c0308790 60000013 ffffffff
[18348.619257] [<c030c18c>] (__irq_svc) from [<c0308790>] (arch_cpu_idle+0x38/0x44)
[18348.624298] [<c0308790>] (arch_cpu_idle) from [<c0359efc>] (do_idle+0xe8/0x1bc)
[18348.631756] [<c0359efc>] (do_idle) from [<c035a244>] (cpu_startup_entry+0x1c/0x20)
[18348.638789] [<c035a244>] (cpu_startup_entry) from [<c0a00cd0>] (start_kernel+0x3fc/0x408)

I had it once a few days back, but I was juggling different versions, so I'm not really sure which one that was. In general my impression is that the -ct driver is unstable and crashes frequently.

I agree that -ct is not ready for prime time. My problems happen on "old" though.

I am new to OpenWrt and my experience so far isn't bad. Still, from time to time some process (usually nlbwmon) goes wild and causes high CPU utilization, so I decided to put some simple monitoring and alerting in place. I am posting it here since it has been tested on the R7800 only, but it should be pretty universal.

  1. Sendmail first (configured for outlook.com)
opkg install ssmtp

/etc/ssmtp/ssmtp.conf

root=myemail@outlook.com
mailhub=smtp-mail.outlook.com:587
rewriteDomain=outlook.com
UseStartTLS=YES
AuthUser=myemail@outlook.com
AuthPass=myOutlookApplicationPassword

/etc/ssmtp/revaliases

root:myemail@outlook.com:smtp-mail.outlook.com:587
  2. And last but not least, add the following line to /etc/rc.local before exit 0
(Z=/sys/devices/virtual/thermal/thermal_zone9/temp; while true; do sleep 15m; if [ "$(cat $Z)" -ge 60000 ] && [ "$(sleep 15m; cat $Z)" -ge 60000 ]; then printf 'Subject: RouterName Thermal/High CPU Alert\n\n%s\n%s\n%s\n' "$(cat $Z)" "$(uptime)" "$(top -b -n 3)" | sendmail mymail@gmail.com; sleep 3h; fi; done) &

The first sleep skips the higher CPU load at boot-up and also sets how often the temperature is checked. The second one makes sure an alert is sent only if the temperature is still high 15 minutes later. The third one limits alerts to at most one every 3 hours.
I found it easier to monitor temperature, but the script can be modified to check actual CPU utilization instead.
The mail contains some diagnostic info (the output of uptime overlaps with top but is easier to read).
The threshold temperature is set to 60 °C (the sysfs file reports millidegrees Celsius, hence 60000).
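
For readability, the same logic can be spread over a few functions. This is a minimal sketch, not a tested replacement: the function names and the ZONE/LIMIT variables are made up for illustration, and the thermal zone path and mail address are simply copied from the one-liner above.

```shell
#!/bin/sh
# Sketch only: names below are illustrative assumptions.
ZONE=/sys/devices/virtual/thermal/thermal_zone9/temp   # path from the one-liner
LIMIT=60000                                            # millidegrees Celsius (60 °C)

# Succeed when the value stored in file $1 is a number at or above $2.
temp_at_or_above() {
    reading=$(cat "$1" 2>/dev/null) || return 1
    [ "$reading" -ge "$2" ] 2>/dev/null
}

# Print the alert mail (subject, blank line, body) on stdout.
alert_mail() {
    printf 'Subject: RouterName Thermal/High CPU Alert\n\n%s\n%s\n%s\n' \
        "$(cat "$1")" "$(uptime)" "$(top -b -n 3)"
}

# Check every 15 minutes; alert only when two readings 15 minutes apart
# are both high, then stay quiet for 3 hours.
monitor_temp() {
    while true; do
        sleep 15m
        if temp_at_or_above "$ZONE" "$LIMIT" && sleep 15m &&
           temp_at_or_above "$ZONE" "$LIMIT"; then
            alert_mail "$ZONE" | sendmail mymail@gmail.com
            sleep 3h
        fi
    done
}
```

Saved as a file, it could be sourced from /etc/rc.local and started with `monitor_temp &` before exit 0.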


Same issue here with 18.06.2 and nlbwmon.

The version with uptime is simpler. Everything is the same except for the entry in /etc/rc.local:

(while true; do sleep 15m; if [ "$(uptime | awk -F'load average: ' '{split($2, a, ", "); print int(a[3]*100)}')" -ge 300 ]; then printf 'Subject: RouterName Load Alert\n\n%s\n%s\n%s\n' "$(cat /sys/devices/virtual/thermal/thermal_zone9/temp)" "$(uptime)" "$(top -b -n 3)" | sendmail mymail@gmail.com; sleep 3h; fi; done) &

The threshold for the 15-minute load average is set to 3.00.
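
One caveat with parsing uptime by field number: the position of the load figures shifts depending on whether the output reads "up 40 days, 2:03", "up 2:03" or "up 5 min". A possible alternative is to read /proc/loadavg, where the 15-minute average is always the third field. The function names below are hypothetical, and this is only a sketch of the check, not of the whole alert loop.

```shell
#!/bin/sh
# Sketch only: load15_x100 and load15_at_or_above are illustrative names.
# Print the 15-minute load average multiplied by 100, as an integer.
# $1 is a file in /proc/loadavg format (defaults to /proc/loadavg).
load15_x100() {
    awk '{ print int($3 * 100 + 0.5) }' "${1:-/proc/loadavg}"
}

# Succeed when the 15-minute load (times 100) in file $1 is at or above $2.
load15_at_or_above() {
    [ "$(load15_x100 "$1")" -ge "$2" ]
}
```

In the rc.local loop the test would then become `if load15_at_or_above /proc/loadavg 300; then ...`.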