Ipq806x NSS build (Netgear R7800 / TP-Link C2600 / Linksys EA8500)

I've been running a 21.02 build (r0+16646-b7fa062d2f) that @vochong kindly compiled for me two weeks ago. It ran stably for over a week with NSS acceleration working. I left SQM disabled. Yesterday I got a spontaneous reboot. This particular build does not have ramoops enabled, but to my surprise I did see a kernel error in the syslog:

kernel error
Sep  4 15:09:15 OpenWrt kernel: [ 3707.494762] rcu: INFO: rcu_sched self-detected stall on CPU
Sep  4 15:09:15 OpenWrt kernel: [ 3707.494804] rcu: #0110-....: (2100 ticks this GP) idle=37e/0/0x3 softirq=170530/170530 fqs=1048
Sep  4 15:09:15 OpenWrt kernel: [ 3707.499148] #011(t=2101 jiffies g=685793 q=6563)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.507820] NMI backtrace for cpu 0
Sep  4 15:09:15 OpenWrt kernel: [ 3707.512163] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.203 #0
Sep  4 15:09:15 OpenWrt kernel: [ 3707.515460] Hardware name: Generic DT based system
Sep  4 15:09:15 OpenWrt kernel: [ 3707.521733] [<c030fad4>] (unwind_backtrace) from [<c030b9f0>] (show_stack+0x14/0x20)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.526334] [<c030b9f0>] (show_stack) from [<c0904398>] (dump_stack+0x94/0xa8)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.534231] [<c0904398>] (dump_stack) from [<c090b228>] (nmi_cpu_backtrace+0xa4/0xd8)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.541256] [<c090b228>] (nmi_cpu_backtrace) from [<c090b39c>] (nmi_trigger_cpumask_backtrace+0x140/0x154)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.549163] [<c090b39c>] (nmi_trigger_cpumask_backtrace) from [<c037e370>] (rcu_dump_cpu_stacks+0xa4/0xcc)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.558709] [<c037e370>] (rcu_dump_cpu_stacks) from [<c03826b8>] (rcu_sched_clock_irq+0x668/0x858)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.568344] [<c03826b8>] (rcu_sched_clock_irq) from [<c0388948>] (update_process_times+0x38/0x6c)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.577291] [<c0388948>] (update_process_times) from [<c039aeb4>] (tick_sched_timer+0x54/0xb8)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.586226] [<c039aeb4>] (tick_sched_timer) from [<c0389154>] (__hrtimer_run_queues+0x168/0x22c)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.594731] [<c0389154>] (__hrtimer_run_queues) from [<c0389fe4>] (hrtimer_interrupt+0x13c/0x2e0)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.603675] [<c0389fe4>] (hrtimer_interrupt) from [<c075aa9c>] (msm_timer_interrupt+0x38/0x48)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.612445] [<c075aa9c>] (msm_timer_interrupt) from [<c0376140>] (handle_percpu_devid_irq+0x84/0x168)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.620944] [<c0376140>] (handle_percpu_devid_irq) from [<c036fd48>] (generic_handle_irq+0x28/0x40)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.630232] [<c036fd48>] (generic_handle_irq) from [<c0370464>] (__handle_domain_irq+0x68/0xd0)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.639093] [<c0370464>] (__handle_domain_irq) from [<c05e15d8>] (gic_handle_irq+0x5c/0xb8)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.647768] [<c05e15d8>] (gic_handle_irq) from [<c0301b0c>] (__irq_svc+0x6c/0x90)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.656089] Exception stack(0xc0c01c78 to 0xc0c01cc0)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.663731] 1c60:                                                       d9158c40 00000002
Sep  4 15:09:15 OpenWrt kernel: [ 3707.668786] 1c80: 95335a68 d999d46c 00000002 d9158d38 d9158c40 00000000 00000002 00000ab4
Sep  4 15:09:15 OpenWrt kernel: [ 3707.676945] 1ca0: d915e628 00000010 95335a69 c0c01ccc d9158d50 bf9b2c9c 80000113 ffffffff
Sep  4 15:09:15 OpenWrt kernel: [ 3707.685215] [<c0301b0c>] (__irq_svc) from [<bf9b2c9c>] (ieee80211_ctstoself_get+0x218/0x258 [mac80211])
Sep  4 15:09:15 OpenWrt kernel: [ 3707.693489] [<bf9b2c9c>] (ieee80211_ctstoself_get [mac80211]) from [<bf9b2f18>] (ieee80211_txq_schedule_start+0x40/0x90 [mac80211])
Sep  4 15:09:15 OpenWrt kernel: [ 3707.702603] [<bf9b2f18>] (ieee80211_txq_schedule_start [mac80211]) from [<bfa51da0>] (ath10k_mac_tx_push_pending+0x50/0xfc [ath10k_core])
Sep  4 15:09:15 OpenWrt kernel: [ 3707.714389] [<bfa51da0>] (ath10k_mac_tx_push_pending [ath10k_core]) from [<bfa5fd6c>] (ath10k_htt_t2h_msg_handler+0x2e0/0x1eac [ath10k_core])
Sep  4 15:09:15 OpenWrt kernel: [ 3707.726814] [<bfa5fd6c>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bfab04c0>] (ath10k_pci_htt_rx_cb+0x178/0x354 [ath10k_pci])
Sep  4 15:09:15 OpenWrt kernel: [ 3707.739524] [<bfab04c0>] (ath10k_pci_htt_rx_cb [ath10k_pci]) from [<bfa7ee8c>] (ath10k_ce_per_engine_service+0x78/0xc4 [ath10k_core])
Sep  4 15:09:15 OpenWrt kernel: [ 3707.751499] [<bfa7ee8c>] (ath10k_ce_per_engine_service [ath10k_core]) from [<bfa7ef48>] (ath10k_ce_per_engine_service_any+0x70/0x208 [ath10k_core])
Sep  4 15:09:15 OpenWrt kernel: [ 3707.763443] [<bfa7ef48>] (ath10k_ce_per_engine_service_any [ath10k_core]) from [<bfab1b1c>] (ath10k_pci_napi_poll+0x50/0x114 [ath10k_pci])
Sep  4 15:09:15 OpenWrt kernel: [ 3707.776429] [<bfab1b1c>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c07910b0>] (__napi_poll+0x34/0x168)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.788916] [<c07910b0>] (__napi_poll) from [<c0791404>] (net_rx_action+0xd8/0x21c)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.798200] [<c0791404>] (net_rx_action) from [<c0302318>] (__do_softirq+0x130/0x2d4)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.805839] [<c0302318>] (__do_softirq) from [<c0323300>] (irq_exit+0xbc/0xe0)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.813821] [<c0323300>] (irq_exit) from [<c0370468>] (__handle_domain_irq+0x6c/0xd0)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.820945] [<c0370468>] (__handle_domain_irq) from [<c05e15d8>] (gic_handle_irq+0x5c/0xb8)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.828842] [<c05e15d8>] (gic_handle_irq) from [<c0301b0c>] (__irq_svc+0x6c/0x90)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.836991] Exception stack(0xc0c01ee0 to 0xc0c01f28)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.844649] 1ee0: 00000000 0000035a 1ce4e000 dd98fb00 dcc37000 00000000 dd98eeb0 0000035a
Sep  4 15:09:15 OpenWrt kernel: [ 3707.849688] 1f00: 0000035a 00000000 54757fe0 546f7460 00000015 c0c01f30 c0732ac0 c0732ac4
Sep  4 15:09:15 OpenWrt kernel: [ 3707.857833] 1f20: 20000013 ffffffff
Sep  4 15:09:15 OpenWrt kernel: [ 3707.865994] [<c0301b0c>] (__irq_svc) from [<c0732ac4>] (cpuidle_enter_state+0x94/0x498)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.869292] [<c0732ac4>] (cpuidle_enter_state) from [<c0732f0c>] (cpuidle_enter+0x30/0x4c)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.877278] [<c0732f0c>] (cpuidle_enter) from [<c034b5b4>] (do_idle+0x1d8/0x240)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.885609] [<c034b5b4>] (do_idle) from [<c034b8c4>] (cpu_startup_entry+0x1c/0x20)
Sep  4 15:09:15 OpenWrt kernel: [ 3707.893162] [<c034b8c4>] (cpu_startup_entry) from [<c0b00fb4>] (start_kernel+0x4c8/0x4d8)
Sep  4 15:09:20 OpenWrt kernel: [ 3713.114578] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-... } 2104 jiffies s: 845 root: 0x1/.
Sep  4 15:09:20 OpenWrt kernel: [ 3713.114628] rcu: blocking rcu_node structures:
Sep  4 15:09:20 OpenWrt kernel: [ 3713.123829] Task dump for CPU 0:
Sep  4 15:09:20 OpenWrt kernel: [ 3713.128435] swapper/0       R  running task        0     0      0 0x00000002
Sep  4 15:09:20 OpenWrt kernel: [ 3713.131760] [<c091b880>] (__schedule) from [<c0732f0c>] (cpuidle_enter+0x30/0x4c)
Sep  4 15:09:20 OpenWrt kernel: [ 3713.138881] [<c0732f0c>] (cpuidle_enter) from [<c034b5b4>] (do_idle+0x1d8/0x240)
Sep  4 15:09:20 OpenWrt kernel: [ 3713.146241] [<c034b5b4>] (do_idle) from [<c034b8c4>] (cpu_startup_entry+0x1c/0x20)
Sep  4 15:09:20 OpenWrt kernel: [ 3713.153618] [<c034b8c4>] (cpu_startup_entry) from [<c0b00fb4>] (start_kernel+0x4c8/0x4d8)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.522773] rcu: INFO: rcu_sched self-detected stall on CPU
Sep  4 15:10:18 OpenWrt kernel: [ 3770.522810] rcu: #0110-....: (8364 ticks this GP) idle=37e/0/0x3 softirq=170530/170530 fqs=4178
Sep  4 15:10:18 OpenWrt kernel: [ 3770.527156] #011(t=8404 jiffies g=685793 q=7562)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.535828] NMI backtrace for cpu 0
Sep  4 15:10:18 OpenWrt kernel: [ 3770.540171] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.203 #0
Sep  4 15:10:18 OpenWrt kernel: [ 3770.543468] Hardware name: Generic DT based system
Sep  4 15:10:18 OpenWrt kernel: [ 3770.549736] [<c030fad4>] (unwind_backtrace) from [<c030b9f0>] (show_stack+0x14/0x20)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.554342] [<c030b9f0>] (show_stack) from [<c0904398>] (dump_stack+0x94/0xa8)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.562239] [<c0904398>] (dump_stack) from [<c090b228>] (nmi_cpu_backtrace+0xa4/0xd8)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.569264] [<c090b228>] (nmi_cpu_backtrace) from [<c090b39c>] (nmi_trigger_cpumask_backtrace+0x140/0x154)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.577172] [<c090b39c>] (nmi_trigger_cpumask_backtrace) from [<c037e370>] (rcu_dump_cpu_stacks+0xa4/0xcc)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.586716] [<c037e370>] (rcu_dump_cpu_stacks) from [<c03826b8>] (rcu_sched_clock_irq+0x668/0x858)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.596352] [<c03826b8>] (rcu_sched_clock_irq) from [<c0388948>] (update_process_times+0x38/0x6c)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.605299] [<c0388948>] (update_process_times) from [<c039aeb4>] (tick_sched_timer+0x54/0xb8)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.614234] [<c039aeb4>] (tick_sched_timer) from [<c0389154>] (__hrtimer_run_queues+0x168/0x22c)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.622738] [<c0389154>] (__hrtimer_run_queues) from [<c0389fe4>] (hrtimer_interrupt+0x13c/0x2e0)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.631683] [<c0389fe4>] (hrtimer_interrupt) from [<c075aa9c>] (msm_timer_interrupt+0x38/0x48)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.640454] [<c075aa9c>] (msm_timer_interrupt) from [<c0376140>] (handle_percpu_devid_irq+0x84/0x168)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.648952] [<c0376140>] (handle_percpu_devid_irq) from [<c036fd48>] (generic_handle_irq+0x28/0x40)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.658239] [<c036fd48>] (generic_handle_irq) from [<c0370464>] (__handle_domain_irq+0x68/0xd0)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.667100] [<c0370464>] (__handle_domain_irq) from [<c05e15d8>] (gic_handle_irq+0x5c/0xb8)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.675775] [<c05e15d8>] (gic_handle_irq) from [<c0301b0c>] (__irq_svc+0x6c/0x90)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.684097] Exception stack(0xc0c01c78 to 0xc0c01cc0)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.691740] 1c60:                                                       d9158c40 00000002
Sep  4 15:10:18 OpenWrt kernel: [ 3770.696793] 1c80: 523f6ca2 d999d46c 00000002 d9158d38 d9158c40 00000000 00000002 00000ab4
Sep  4 15:10:18 OpenWrt kernel: [ 3770.704954] 1ca0: d915e628 00000010 523f6ca3 c0c01ccc d9158d50 bf9b2c9c 80000113 ffffffff
Sep  4 15:10:18 OpenWrt kernel: [ 3770.713220] [<c0301b0c>] (__irq_svc) from [<bf9b2c9c>] (ieee80211_ctstoself_get+0x218/0x258 [mac80211])
Sep  4 15:10:18 OpenWrt kernel: [ 3770.721404] [<bf9b2c9c>] (ieee80211_ctstoself_get [mac80211]) from [<bf9b2f18>] (ieee80211_txq_schedule_start+0x40/0x90 [mac80211])
Sep  4 15:10:18 OpenWrt kernel: [ 3770.730609] [<bf9b2f18>] (ieee80211_txq_schedule_start [mac80211]) from [<bfa51da0>] (ath10k_mac_tx_push_pending+0x50/0xfc [ath10k_core])
Sep  4 15:10:18 OpenWrt kernel: [ 3770.742355] [<bfa51da0>] (ath10k_mac_tx_push_pending [ath10k_core]) from [<bfa5fd6c>] (ath10k_htt_t2h_msg_handler+0x2e0/0x1eac [ath10k_core])
Sep  4 15:10:18 OpenWrt kernel: [ 3770.754822] [<bfa5fd6c>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bfab04c0>] (ath10k_pci_htt_rx_cb+0x178/0x354 [ath10k_pci])
Sep  4 15:10:18 OpenWrt kernel: [ 3770.767490] [<bfab04c0>] (ath10k_pci_htt_rx_cb [ath10k_pci]) from [<bfa7ee8c>] (ath10k_ce_per_engine_service+0x78/0xc4 [ath10k_core])
Sep  4 15:10:18 OpenWrt kernel: [ 3770.779505] [<bfa7ee8c>] (ath10k_ce_per_engine_service [ath10k_core]) from [<bfa7ef48>] (ath10k_ce_per_engine_service_any+0x70/0x208 [ath10k_core])
Sep  4 15:10:18 OpenWrt kernel: [ 3770.791451] [<bfa7ef48>] (ath10k_ce_per_engine_service_any [ath10k_core]) from [<bfab1b1c>] (ath10k_pci_napi_poll+0x50/0x114 [ath10k_pci])
Sep  4 15:10:18 OpenWrt kernel: [ 3770.804437] [<bfab1b1c>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c07910b0>] (__napi_poll+0x34/0x168)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.816924] [<c07910b0>] (__napi_poll) from [<c0791404>] (net_rx_action+0xd8/0x21c)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.826208] [<c0791404>] (net_rx_action) from [<c0302318>] (__do_softirq+0x130/0x2d4)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.833846] [<c0302318>] (__do_softirq) from [<c0323300>] (irq_exit+0xbc/0xe0)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.841831] [<c0323300>] (irq_exit) from [<c0370468>] (__handle_domain_irq+0x6c/0xd0)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.848953] [<c0370468>] (__handle_domain_irq) from [<c05e15d8>] (gic_handle_irq+0x5c/0xb8)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.856851] [<c05e15d8>] (gic_handle_irq) from [<c0301b0c>] (__irq_svc+0x6c/0x90)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.864999] Exception stack(0xc0c01ee0 to 0xc0c01f28)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.872657] 1ee0: 00000000 0000035a 1ce4e000 dd98fb00 dcc37000 00000000 dd98eeb0 0000035a
Sep  4 15:10:18 OpenWrt kernel: [ 3770.877698] 1f00: 0000035a 00000000 54757fe0 546f7460 00000015 c0c01f30 c0732ac0 c0732ac4
Sep  4 15:10:18 OpenWrt kernel: [ 3770.885841] 1f20: 20000013 ffffffff
Sep  4 15:10:18 OpenWrt kernel: [ 3770.894000] [<c0301b0c>] (__irq_svc) from [<c0732ac4>] (cpuidle_enter_state+0x94/0x498)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.897301] [<c0732ac4>] (cpuidle_enter_state) from [<c0732f0c>] (cpuidle_enter+0x30/0x4c)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.905288] [<c0732f0c>] (cpuidle_enter) from [<c034b5b4>] (do_idle+0x1d8/0x240)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.913618] [<c034b5b4>] (do_idle) from [<c034b8c4>] (cpu_startup_entry+0x1c/0x20)
Sep  4 15:10:18 OpenWrt kernel: [ 3770.921170] [<c034b8c4>] (cpu_startup_entry) from [<c0b00fb4>] (start_kernel+0x4c8/0x4d8)
Sep  4 15:10:23 OpenWrt kernel: [ 3776.472733] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-... } 8440 jiffies s: 845 root: 0x1/.
Sep  4 15:10:23 OpenWrt kernel: [ 3776.472782] rcu: blocking rcu_node structures:
Sep  4 15:10:23 OpenWrt kernel: [ 3776.481982] Task dump for CPU 0:
Sep  4 15:10:23 OpenWrt kernel: [ 3776.486580] swapper/0       R  running task        0     0      0 0x00000002
Sep  4 15:10:23 OpenWrt kernel: [ 3776.489915] [<c091b880>] (__schedule) from [<c0732f0c>] (cpuidle_enter+0x30/0x4c)
Sep  4 15:10:23 OpenWrt kernel: [ 3776.497011] [<c0732f0c>] (cpuidle_enter) from [<c034b5b4>] (do_idle+0x1d8/0x240)
Sep  4 15:10:23 OpenWrt kernel: [ 3776.504399] [<c034b5b4>] (do_idle) from [<c034b8c4>] (cpu_startup_entry+0x1c/0x20)
Sep  4 15:10:23 OpenWrt kernel: [ 3776.511771] [<c034b8c4>] (cpu_startup_entry) from [<c0b00fb4>] (start_kernel+0x4c8/0x4d8)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.541976] rcu: INFO: rcu_sched self-detected stall on CPU
Sep  4 15:11:21 OpenWrt kernel: [ 3833.542011] rcu: #0110-....: (14627 ticks this GP) idle=37e/0/0x3 softirq=170530/170530 fqs=7308
Sep  4 15:11:21 OpenWrt kernel: [ 3833.546357] #011(t=14706 jiffies g=685793 q=8026)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.555031] NMI backtrace for cpu 0
Sep  4 15:11:21 OpenWrt kernel: [ 3833.559459] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.203 #0
Sep  4 15:11:21 OpenWrt kernel: [ 3833.562844] Hardware name: Generic DT based system
Sep  4 15:11:21 OpenWrt kernel: [ 3833.569113] [<c030fad4>] (unwind_backtrace) from [<c030b9f0>] (show_stack+0x14/0x20)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.573716] [<c030b9f0>] (show_stack) from [<c0904398>] (dump_stack+0x94/0xa8)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.581612] [<c0904398>] (dump_stack) from [<c090b228>] (nmi_cpu_backtrace+0xa4/0xd8)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.588640] [<c090b228>] (nmi_cpu_backtrace) from [<c090b39c>] (nmi_trigger_cpumask_backtrace+0x140/0x154)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.596547] [<c090b39c>] (nmi_trigger_cpumask_backtrace) from [<c037e370>] (rcu_dump_cpu_stacks+0xa4/0xcc)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.606093] [<c037e370>] (rcu_dump_cpu_stacks) from [<c03826b8>] (rcu_sched_clock_irq+0x668/0x858)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.615727] [<c03826b8>] (rcu_sched_clock_irq) from [<c0388948>] (update_process_times+0x38/0x6c)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.624675] [<c0388948>] (update_process_times) from [<c039aeb4>] (tick_sched_timer+0x54/0xb8)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.633608] [<c039aeb4>] (tick_sched_timer) from [<c0389154>] (__hrtimer_run_queues+0x168/0x22c)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.642114] [<c0389154>] (__hrtimer_run_queues) from [<c0389fe4>] (hrtimer_interrupt+0x13c/0x2e0)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.651058] [<c0389fe4>] (hrtimer_interrupt) from [<c075aa9c>] (msm_timer_interrupt+0x38/0x48)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.659829] [<c075aa9c>] (msm_timer_interrupt) from [<c0376140>] (handle_percpu_devid_irq+0x84/0x168)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.668328] [<c0376140>] (handle_percpu_devid_irq) from [<c036fd48>] (generic_handle_irq+0x28/0x40)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.677616] [<c036fd48>] (generic_handle_irq) from [<c0370464>] (__handle_domain_irq+0x68/0xd0)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.686476] [<c0370464>] (__handle_domain_irq) from [<c05e15d8>] (gic_handle_irq+0x5c/0xb8)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.695151] [<c05e15d8>] (gic_handle_irq) from [<c0301b0c>] (__irq_svc+0x6c/0x90)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.703473] Exception stack(0xc0c01c78 to 0xc0c01cc0)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.711115] 1c60:                                                       d9158c40 00000002
Sep  4 15:11:21 OpenWrt kernel: [ 3833.716170] 1c80: 0f3beab7 d999d46c 00000002 d9158d38 d9158c40 00000000 00000002 00000ab4
Sep  4 15:11:21 OpenWrt kernel: [ 3833.724330] 1ca0: d915e628 00000010 0f3beab8 c0c01ccc d9158d50 bf9b2c9c 80000113 ffffffff
Sep  4 15:11:21 OpenWrt kernel: [ 3833.732598] [<c0301b0c>] (__irq_svc) from [<bf9b2c9c>] (ieee80211_ctstoself_get+0x218/0x258 [mac80211])
Sep  4 15:11:21 OpenWrt kernel: [ 3833.740780] [<bf9b2c9c>] (ieee80211_ctstoself_get [mac80211]) from [<bf9b2f18>] (ieee80211_txq_schedule_start+0x40/0x90 [mac80211])
Sep  4 15:11:21 OpenWrt kernel: [ 3833.749983] [<bf9b2f18>] (ieee80211_txq_schedule_start [mac80211]) from [<bfa51da0>] (ath10k_mac_tx_push_pending+0x50/0xfc [ath10k_core])
Sep  4 15:11:21 OpenWrt kernel: [ 3833.761730] [<bfa51da0>] (ath10k_mac_tx_push_pending [ath10k_core]) from [<bfa5fd6c>] (ath10k_htt_t2h_msg_handler+0x2e0/0x1eac [ath10k_core])
Sep  4 15:11:21 OpenWrt kernel: [ 3833.774198] [<bfa5fd6c>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bfab04c0>] (ath10k_pci_htt_rx_cb+0x178/0x354 [ath10k_pci])
Sep  4 15:11:21 OpenWrt kernel: [ 3833.786866] [<bfab04c0>] (ath10k_pci_htt_rx_cb [ath10k_pci]) from [<bfa7ee8c>] (ath10k_ce_per_engine_service+0x78/0xc4 [ath10k_core])
Sep  4 15:11:21 OpenWrt kernel: [ 3833.798882] [<bfa7ee8c>] (ath10k_ce_per_engine_service [ath10k_core]) from [<bfa7ef48>] (ath10k_ce_per_engine_service_any+0x70/0x208 [ath10k_core])
Sep  4 15:11:21 OpenWrt kernel: [ 3833.810826] [<bfa7ef48>] (ath10k_ce_per_engine_service_any [ath10k_core]) from [<bfab1b1c>] (ath10k_pci_napi_poll+0x50/0x114 [ath10k_pci])
Sep  4 15:11:21 OpenWrt kernel: [ 3833.823815] [<bfab1b1c>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c07910b0>] (__napi_poll+0x34/0x168)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.836299] [<c07910b0>] (__napi_poll) from [<c0791404>] (net_rx_action+0xd8/0x21c)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.845583] [<c0791404>] (net_rx_action) from [<c0302318>] (__do_softirq+0x130/0x2d4)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.853221] [<c0302318>] (__do_softirq) from [<c0323300>] (irq_exit+0xbc/0xe0)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.861205] [<c0323300>] (irq_exit) from [<c0370468>] (__handle_domain_irq+0x6c/0xd0)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.868328] [<c0370468>] (__handle_domain_irq) from [<c05e15d8>] (gic_handle_irq+0x5c/0xb8)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.876225] [<c05e15d8>] (gic_handle_irq) from [<c0301b0c>] (__irq_svc+0x6c/0x90)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.884375] Exception stack(0xc0c01ee0 to 0xc0c01f28)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.892033] 1ee0: 00000000 0000035a 1ce4e000 dd98fb00 dcc37000 00000000 dd98eeb0 0000035a
Sep  4 15:11:21 OpenWrt kernel: [ 3833.897072] 1f00: 0000035a 00000000 54757fe0 546f7460 00000015 c0c01f30 c0732ac0 c0732ac4
Sep  4 15:11:21 OpenWrt kernel: [ 3833.905216] 1f20: 20000013 ffffffff
Sep  4 15:11:21 OpenWrt kernel: [ 3833.913376] [<c0301b0c>] (__irq_svc) from [<c0732ac4>] (cpuidle_enter_state+0x94/0x498)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.916676] [<c0732ac4>] (cpuidle_enter_state) from [<c0732f0c>] (cpuidle_enter+0x30/0x4c)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.924664] [<c0732f0c>] (cpuidle_enter) from [<c034b5b4>] (do_idle+0x1d8/0x240)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.932994] [<c034b5b4>] (do_idle) from [<c034b8c4>] (cpu_startup_entry+0x1c/0x20)
Sep  4 15:11:21 OpenWrt kernel: [ 3833.940546] [<c034b8c4>] (cpu_startup_entry) from [<c0b00fb4>] (start_kernel+0x4c8/0x4d8)
Sep  4 15:11:30 OpenWrt kernel: [ 3843.032007] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-... } 15096 jiffies s: 845 root: 0x1/.
Sep  4 15:11:30 OpenWrt kernel: [ 3843.032056] rcu: blocking rcu_node structures:
Sep  4 15:11:30 OpenWrt kernel: [ 3843.041603] Task dump for CPU 0:
Sep  4 15:11:30 OpenWrt kernel: [ 3843.045973] swapper/0       R  running task        0     0      0 0x00000002
Sep  4 15:11:30 OpenWrt kernel: [ 3843.049272] [<c091b880>] (__schedule) from [<c0732f0c>] (cpuidle_enter+0x30/0x4c)
Sep  4 15:11:30 OpenWrt kernel: [ 3843.056380] [<c0732f0c>] (cpuidle_enter) from [<c034b5b4>] (do_idle+0x1d8/0x240)
Sep  4 15:11:30 OpenWrt kernel: [ 3843.063749] [<c034b5b4>] (do_idle) from [<c034b8c4>] (cpu_startup_entry+0x1c/0x20)
Sep  4 15:11:30 OpenWrt kernel: [ 3843.071131] [<c034b8c4>] (cpu_startup_entry) from [<c0b00fb4>] (start_kernel+0x4c8/0x4d8)
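
The three reports above all carry the same grace-period number (g=685793) while the stalled time keeps growing (t=2101, 8404, 14706 jiffies), so this looks like one long stall on CPU 0 rather than three separate ones. A minimal sketch for pulling those two fields out of a saved log; the function name `rcu_stall_summary` is made up for illustration, and it assumes the report lines look like the ones pasted here:

```shell
#!/bin/sh
# Print "g=<grace period> t=<stalled jiffies>" for every RCU stall
# report line read from stdin. Lines without a "(t=... jiffies g=...)"
# field are ignored.
rcu_stall_summary() {
    sed -n 's/.*(t=\([0-9][0-9]*\) jiffies g=\([0-9][0-9]*\) .*/g=\2 t=\1/p'
}

# Usage on the router (logread prints the current syslog buffer):
#   logread | rcu_stall_summary
```

If the g= value changes between reports, the CPU did make progress in between and you are looking at separate stalls.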

I did not expect to see this error on a 21.02 build; it looks like the errors @Mpilon and @vochong recently posted. So whatever is crashing my R7800, it's heavy enough to make a "previously known stable" 21.02 reboot randomly. Which of us is running the mainline ath10k driver with the very latest firmware version QCA-ATH10Kw10.4-3.9.0.2-00157?

Are there ath10k-ct builds out there that experience random reboots when NSS is confirmed working?

Please explain where to get this, how to use it, and how to embed it in an image :slight_smile:
(and also the latest drivers, maybe..)
@vochong is using it for sure..

Ah, I assumed most OpenWrt-savvy users would know off the top of their heads, my mistake :sweat_smile:.

The easiest way to share is via command line (SSH). Log in to your router and do:

opkg list-installed | grep ath10k

You get output like:

ath10k-board-qca9984 - 20220708-1
ath10k-firmware-qca9984 - 20220708-1
kmod-ath10k - 5.10.138+5.15.58-1-2

The example above :point_up_2: shows the mainline ath10k driver. If you have the -ct driver, the output will look like this :point_down::

ath10k-board-qca99x0 - 20220708-1
ath10k-firmware-qca99x0-ct - 2020-11-08-1
kmod-ath10k-ct - 5.10.136+2022-05-13-f808496f-1

In the -ct example above :point_up_2: you can see the -ct suffix appended to the names of the ath10k driver and firmware packages.
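
If you want to report this without pasting the whole package list, the same check can be scripted. A minimal sketch, assuming the package names look like the two examples above; the function name `ath10k_flavour` is made up for illustration:

```shell
#!/bin/sh
# Classify the ath10k driver flavour from `opkg list-installed`
# output read on stdin. Prints "ath10k-ct", "mainline" or "none".
ath10k_flavour() {
    pkgs=$(cat)
    if printf '%s\n' "$pkgs" | grep -q '^kmod-ath10k-ct'; then
        echo "ath10k-ct"
    elif printf '%s\n' "$pkgs" | grep -q '^kmod-ath10k'; then
        echo "mainline"
    else
        echo "none"
    fi
}

# Usage on the router:
#   opkg list-installed | ath10k_flavour
```

The -ct check comes first because `kmod-ath10k-ct` also starts with `kmod-ath10k`.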

If you have the mainline ath10k driver and want to check which firmware version is active, there are several ways to do that. The first is:

dmesg | grep firmware

You get output like:

[   24.277724] ath10k_pci 0000:01:00.0: firmware ver 10.4-3.9.0.2-00157 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast,no-ps,peer-fixed-rate,iram-recovery crc32 6cdc6ff9
[   31.443419] ath10k_pci 0001:01:00.0: firmware ver 10.4-3.9.0.2-00157 api 5 features no-p2p,mfp,peer-flow-ctrl,btcoex-param,allows-mesh-bcast,no-ps,peer-fixed-rate,iram-recovery crc32 6cdc6ff9
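
To post just the version instead of the full line, the field after "firmware ver" can be cut out. A minimal sketch, assuming the kernel log lines have the format shown above; the function name `ath10k_fw_versions` is made up for illustration:

```shell
#!/bin/sh
# Print the distinct ath10k firmware versions found in kernel log
# lines read from stdin (one line per version, duplicates removed).
ath10k_fw_versions() {
    sed -n 's/.*ath10k_pci [^ ]*: firmware ver \([^ ]*\) .*/\1/p' | sort -u
}

# Usage on the router:
#   dmesg | ath10k_fw_versions
```

With both radios on the same firmware, as in the output above, this prints a single line.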

If you have the mainline ath10k driver, there is another way to check the firmware version:

head -1 /lib/firmware/ath10k/QCA9984/hw1.0/firmware-5.bin

You then get output like:

QCA-ATH10Kw10.4-3.9.0.2-00157ww4"b$SGMT@D@IAAP@$wAp@IA&@-IAxyADu4yAIA:|8
                                                                        JAJAJAJAJAIALALAPA.YABnYAaApA
                                                                                                     ',JA-JAJAJAgAyA /IA
root@ap-R7800:~# 

In the example above :point_up_2: you see the full version string: QCA-ATH10Kw10.4-3.9.0.2-00157.

If you build your own image, the second post of this topic explains how to select the mainline ath10k driver or the -ct driver. When you pick a master image it will contain the latest driver and firmware for both mainline ath10k and the -ct variant. If you pick a 22.03 image it depends on who built it... The commands I showed above will allow you to verify what you're actually running.

I know there are pros and cons to both the mainline ath10k driver and the -ct driver. I'm not sure whether there's a stability problem with PPPoE and the ath10k-ct driver. Stability issues have also appeared recently, in the same timeframe in which we started using the latest ath10k firmware. If I'm correct, this NSS build also offloads WiFi traffic for acceleration; am I right, @ACwifidude? What if we're chasing NSS issues with PPPoE offloading, while the real problem might be NSS combined with the most recent ath10k firmware? So I thought: what if we do a "quick check" on who experiences issues and which driver/firmware they're using?

This was the part I missed. So OK, I should be using the latest drivers and firmware of mainline ath10k (I'll confirm as soon as I get home from the office). I still can't get 160MHz to work, though; it is still a mess..

Tishipp uses ath10k-ct and his router also had a random reboot with SWBA overruns followed by "rcu_sched self-detected stall on CPU". Did you see any SWBA overrun logs in your syslog?
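
For anyone who wants to check quickly, both messages can be counted in one pass over the syslog. A minimal sketch; the function name `count_wifi_stall_events` is made up for illustration, and it only matches the two message texts quoted in this thread ("SWBA overrun" and "self-detected stall on CPU"):

```shell
#!/bin/sh
# Count SWBA overrun and RCU self-detected stall messages in syslog
# text read from stdin. `|| true` keeps grep's "no match" exit status
# from aborting scripts that run with `set -e`.
count_wifi_stall_events() {
    log=$(cat)
    swba=$(printf '%s\n' "$log" | grep -c 'SWBA overrun' || true)
    stalls=$(printf '%s\n' "$log" | grep -c 'self-detected stall on CPU' || true)
    echo "swba_overruns=$swba rcu_stalls=$stalls"
}

# Usage on the router:
#   logread | count_wifi_stall_events
```

Keep in mind that logread only covers the in-memory log since the last boot, so after a reboot you need a log that was saved elsewhere.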

Ath10k addicts :frowning: (who tend to be more talkative in their drunk state. Lol)

Mpilon
sppmaster
vochong
D43m0n
pattagghiu

Ath10k-rehabilitated :slight_smile: (who tend to be more taciturn)

ACwifidude
Tishipp
And most other people

I don't see hardware (CPU or cache) issues as the source of this latest problem. It's too reproducible and too well-behaved, and the same thing is happening on multiple systems.

Hardware issues tend to be more spectacular.

And I've been running the performance governor in all of my tests.

I gratefully received a reply from @hansd on how to enable coredumps in procd:

I take it to be an example of how to configure a new service, or how to enable coredumps for one of the existing init.d services.

I'm still unable to get unlimited core files for any process ... I added the relevant line to /etc/rc.common to no good effect - unless procd isn't involved in setting limits for the ssh client login shell.

I'm still wrestling with getting objdump to correctly show C source.


The master branch was updated to the latest linux-firmware-2022-0708 (which has the mainline ath10k firmware v157 for the R7800) about 2-3 months ago. Only the 22.03 branch (including the official builds) is still stuck with the old firmware (the horrible mainline ath10k firmware v131 from 4 years ago for the R7800).

To get the latest ath10k firmware v157 for the R7800 in 22.03 and earlier, you need to install it manually or include the change when building your own image.

@ACwifidude @Mpilon

If I remember correctly, the reboots triggered by CPU/cache frequency scaling always tend to produce a core dump with the error "Internal error: Oops" and "clk_change_rate" in the dump, as shown below. For our recent random reboots, no core dump was generated at all. It looks exactly like the CPU gets into a hung/stalled state, preventing "procd" from updating the SoC watchdog timer. When the SoC watchdog timer is not reset and expires, it automatically resets the CPU, leaving it no chance to generate a core dump.

<1>[227356.676523] pgd = b8925557
<1>[227356.683804] [e1a07000] *pgd=00000000
<0>[227356.686595] Internal error: Oops: 80000005 [#1] SMP ARM

<0>[227357.135403] [<c062c778>] (__timer_delay) from [<c069478c>] (__krait_mux_set_sel+0x7c/0x9c)
<0>[227357.143635] [<c069478c>] (__krait_mux_set_sel) from [<c069480c>] (krait_mux_set_parent+0x60/0x64)
<0>[227357.151882] [<c069480c>] (krait_mux_set_parent) from [<c06961cc>] (krait_notifier_cb+0x58/0xb8)
<0>[227357.160919] [<c06961cc>] (krait_notifier_cb) from [<c0341b00>] (srcu_notifier_call_chain+0x7c/0xf4)
<0>[227357.169851] [<c0341b00>] (srcu_notifier_call_chain) from [<c0684bc0>] (__clk_notify+0x70/0x94)
<0>[227357.178965] [<c0684bc0>] (__clk_notify) from [<c068902c>] (clk_change_rate+0xfc/0x2b8)
<0>[227357.187381] [<c068902c>] (clk_change_rate) from [<c0689070>] (clk_change_rate+0x140/0x2b8)
<0>[227357.195368] [<c0689070>] (clk_change_rate) from [<c068929c>] (clk_core_set_rate_nolock+0xb4/0x1f8)
<0>[227357.203701] [<c068929c>] (clk_core_set_rate_nolock) from [<c068941c>] (clk_set_rate+0x3c/0x170)
<0>[227357.212737] [<c068941c>] (clk_set_rate) from [<c07a9dbc>] (dev_pm_opp_set_rate+0x348/0x674)
<0>[227357.221757] [<c07a9dbc>] (dev_pm_opp_set_rate) from [<c07af00c>] (__cpufreq_driver_target+0x1a0/0x5b4)
<0>[227357.230178] [<c07af00c>] (__cpufreq_driver_target) from [<c07b25a4>] (od_dbs_update+0xcc/0x1a0)
<0>[227357.239291] [<c07b25a4>] (od_dbs_update) from [<c07b328c>] (dbs_work_handler+0x38/0x74)
<0>[227357.248325] [<c07b328c>] (dbs_work_handler) from [<c0338940>] (process_one_work+0x1fc/0x470)
<0>[227357.256395] [<c0338940>] (process_one_work) from [<c0338c28>] (worker_thread+0x74/0x5d4)
<0>[227357.264812] [<c0338c28>] (worker_thread) from [<c033eb0c>] (kthread+0x15c/0x160)
<0>[227357.272970] [<c033eb0c>] (kthread) from [<c0300148>] (ret_from_fork+0x14/0x2c)
<0>[227357.280426] Exception stack(0xc6dcbfb0 to 0xc6dcbff8)

Yes, I do, on all the recent 22.03-rc6, master and 21.02 builds. They all have the most recent ath10k firmware. I don't know how the kernel interacts with this closed-source firmware (NSS and ath10k), or whether, for instance, some buffer overflows.

One more attempt to make nlbwmon happy.

nlbwmon's buffer size is set to:
netlink_buffer_size='1048576'

to match whatever is set in @ACwifidude's overall configuration.

At bootup I'm seeing another buffer alloc issue:

Mon Sep  5 12:20:57 2022 daemon.err nlbwmon[1870]: The netlink receive buffer size of 1048576 bytes will be capped to 180224 bytes
Mon Sep  5 12:20:57 2022 daemon.err nlbwmon[1870]: by the kernel. The net.core.rmem_max sysctl limit needs to be raised to
Mon Sep  5 12:20:57 2022 daemon.err nlbwmon[1870]: at least 1048576 in order to sucessfully set the desired receive buffer size!
Mon Sep  5 12:20:57 2022 user.notice firewall: Reloading firewall due to ifup of lan (br-lan)

I looked at what's set in uci: there is no 'net' section, 'network' has no 'core' section, and I can't see where in the kernel it may be set, if at all.
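
For what it's worth, the limit named in that log message is a kernel sysctl rather than a uci option, which would explain why nothing shows up under uci. The current value can be read from /proc/sys/net/core/rmem_max. A minimal sketch, assuming the requested size from the log above (1048576); the helper name `rmem_max_fix` is made up for illustration, and nothing in this thread confirms that raising the limit makes nlbwmon behave:

```shell
#!/bin/sh
# Print the sysctl setting needed to allow a receive buffer of $2
# bytes, given the current net.core.rmem_max value in $1.
# Prints nothing if the current limit is already large enough.
rmem_max_fix() {
    current=$1
    wanted=$2
    if [ "$current" -lt "$wanted" ]; then
        echo "net.core.rmem_max=$wanted"
    fi
}

# Usage on the router:
#   rmem_max_fix "$(cat /proc/sys/net/core/rmem_max)" 1048576
# Apply the printed setting at runtime with:
#   sysctl -w net.core.rmem_max=1048576
# To keep it across reboots, add the printed line to /etc/sysctl.conf.
```

The helper only prints the setting; applying it is left as a manual step so nothing changes on the router by accident.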

What does nlbwmon support (what's it used for)?

I cannot find any SWBA entries in the syslog of five R7800s or at least I haven't seen them yet.

You don't need nlbwmon. Just stop and disable it with /etc/init.d/nlbwmon stop and /etc/init.d/nlbwmon disable.

It's just a tool to give you a general idea of the amounts of download and upload traffic for each connected client.

IMHO, nlbwmon is a very buggy tool. No matter how much recv buffer you give it, it will still eventually complain of being out of memory. The more clients and the more traffic going on in your network, the sooner it will run out of memory. Any properly designed piece of software needs to allocate and release memory accordingly so it does not run out of memory. Nlbwmon is like a goldfish without a stomach. You can feed it Tetra flakes (aka memory) as much as you want and it will devour them all while polluting the water through its other end :-).

If you search Google, this SWBA overrun problem has been seen by lots of people, with both ath10k and ath10k-ct firmware/drivers. I believe how often it occurs depends on the number and types of WIFI clients in your network, so some people may encounter it frequently while others barely see it.

According to this link, the root cause for "SWBA overrun on vdev 0, skipped old beacon" is as follows:
https://lore.kernel.org/all/CA+BoTQm735L-TQ0ZjSZ2gHud_AwvP+kZAh+Wc4iQtaGSiC2GLw@mail.gmail.com/

This is the dreaded tx credit starvation.

In some cases if disassoc+deauth is sent and target station is asleep
and unresponsive it'll cause firmware to stall causing ath10k timeouts
during sta_state station removal. Due to insufficient credits beacons
can't be sent for ~10 seconds, sta_state station removal fails causing
mac80211 call trace splat and later spurious kickout events because
peer was never removed from firmware.

Reading through that thread, the -ct variant should have had fixes for this since 2015, according to its author. Others mention that it should also have been fixed in mainline ath10k around the same time.

And then I also see a reply from kvalo (whose GitHub repo we use to find the latest mainline ath10k firmware):

Unfortunately the firmware team does not provide me release notes for
new firmware releases.

I hope this has changed for the better... :face_with_spiral_eyes:

Some report that this goes away with the -ct-htt or the -ct-full-htt firmware.
Most of the search results I saw seem to point to the firmware as the culprit, on both mainline ath10k and the -ct variant. Over the years this "SWBA overrun" problem apparently comes and goes, and I haven't found a fix in the results. Some issues about it are still open; others are closed, but sometimes only because the stalebot closed them. I can't reboot now, but I've got the previous ath10k firmware lined up for testing. I don't see the "SWBA overrun" every day, so I need a few days to see if it comes back.
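Since the message only shows up every few days, counting occurrences in the log is one way to compare firmware versions over time. A rough sketch, assuming the message text contains "SWBA overrun" as quoted earlier; the function name count_swba is made up for illustration.

```shell
# Hypothetical helper: count log lines on stdin that mention the SWBA overrun.
count_swba() {
    # grep -c prints the count; "|| true" keeps a zero count from being treated as an error
    grep -c 'SWBA overrun' || true
}

# On OpenWrt the system log is read with logread; skip quietly where it is not available.
if command -v logread >/dev/null 2>&1; then
    echo "SWBA overruns in current log: $(logread | count_swba)"
fi
```

Keep in mind the default log is a ring buffer in RAM, so the count only covers what is still in the buffer since the last reboot.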

I'm finally going to try 22.03, the latest version from ACwifidude, thanks! PPPoE works correctly at last!! I'll wait and see over the coming days. In my previous test on 22.03 I had 37 days of uptime, and the only restarts came within hours of enabling promiscuous mode on br-lan, which is confirmed to lock up the router on NSS images. On the other hand, adjusting the time on the 2.4 GHz wifi added a lot of stability on my equipment, so I gather that the wifi driver plays a big part in the instability we are seeing.

I’ve stopped using nlbwmon. It spams the log and doesn’t seem to be keeping accurate numbers. I have it turned off in startup on my three r7800s. Hope they fix it. Might consider removing it next build.

I’m using ath10k-ct, 2 SSIDs total, the build's default CPU settings, and the stock VLAN setup, and I'm getting no crashes. Eliminating extras should help find the culprit for the crashes.

the only 'extra' I had running was nlbwmon ... I removed that and patched a busy-wait loop of the form

while( something )
;

deep inside some kernel filesystem code, so that it calls cond_resched(); in the loop body instead of spinning.

been up for 5 hours now. if it stays up a while I'll either restart nlbwmon or remove the patch ...

for that matter, maybe you could change your whole build to exclude nlbwmon ...

When you search Google for issues relating to Wifi, please keep in mind that:

  • There are many WIFI chipsets, each with its own firmware and driver (some may use the same driver).
  • A bug / resolution for a certain WIFI chipset does not mean it is applicable for others.
  • Depending on the number of WIFI clients and/or their types in a network, some may observe an issue while others may never see it.

@Mpilon @D43m0n

I got two more of the random reboots (very hot day 110 F, no darn AC) with these clean logs in console-ramoops (no SWBA errors or anything before and after)

[26206.532451] rcu: INFO: rcu_sched self-detected stall on CPU
[26206.532504] rcu: 	1-...!: (1 GPs behind) idle=436/1/0x40000002 softirq=279305/279306 fqs=93 
[26206.536863] rcu: rcu_sched kthread starved for 1914 jiffies! g761821 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[26206.545525] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[26206.555677] rcu: RCU grace-period kthread stack dump:

In light of this post: https://lore.kernel.org/lkml/20150114221228.GV9719@linux.vnet.ibm.com/T/

I think irqbalance with only 2 CPU cores (the R7800's 2 Krait cores) would cause a lot of context switching between them when you run multiple other processes (hostapd, nlbwmon, collectd, VPN, SMB file sharing etc.) in addition to processing WAN and WIFI traffic. That would hurt overall operation rather than help it. As I understand it, a CPU can only context-switch once its RCU read-side operations have completed.

Let's disable irqbalance (make sure you stop the irqbalance service after disabling it, or just reboot the router) to see if this RCU stalling issue still happens with relatively high frequency (every few hours, or within less than a day).
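A minimal sketch of that test, assuming irqbalance is installed as a regular OpenWrt service with its init script at /etc/init.d/irqbalance (the path is an assumption about this build; if the build starts irqbalance some other way, this will not cover it):

```shell
svc=/etc/init.d/irqbalance   # assumed location of the init script on this build

if [ -x "$svc" ]; then
    "$svc" stop       # stop the running daemon now
    "$svc" disable    # keep it from starting at the next boot
    status="disabled"
else
    status="not installed"
fi
echo "irqbalance: $status"

# Per-CPU interrupt counts can be checked afterwards with: cat /proc/interrupts
```

To go back to the previous behaviour, the same script takes enable and start.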
