Ipq806x NSS build (Netgear R7800 / TP-Link C2600 / Linksys EA8500)

ACwifidude · September 6, 2022, 9:52pm

I’ve stopped using nlbwmon. It spams the log and doesn’t seem to keep accurate numbers. I have it disabled at startup on my three R7800s. Hope they fix it. I might remove it from the next build.

I’m using ath10k-ct, 2 SSIDs total, the build's default CPU settings, and the stock VLAN setup, and I'm getting no crashes. Eliminating extras should help find the culprit for the crashes.

Mpilon · September 6, 2022, 11:22pm

the only 'extra' I had running was nlbwmon ... I removed that and patched a busy-wait loop of the form

while( something )
;

deep inside some kernel filesystem code, so that it calls cond_resched(); in the loop body instead of spinning on the empty statement.

been up for 5 hours now. if it stays up a while I'll either restart nlbwmon or remove the patch ...

Mpilon · September 6, 2022, 11:24pm

for that matter, maybe you could change your whole build to exclude nlbwmon ...

vochong · September 7, 2022, 3:11am

When you search Google for issues relating to Wi-Fi, please keep in mind that:

There are many Wi-Fi chipsets, each with its own firmware and driver (some may share a driver).
A bug or fix for one Wi-Fi chipset is not necessarily applicable to others.
Depending on the number and types of Wi-Fi clients in a network, some people may observe an issue while others never do.

vochong · September 7, 2022, 4:01am

@Mpilon @D43m0n

I got two more of the random reboots (very hot day 110 F, no darn AC) with these clean logs in console-ramoops (no SWBA errors or anything before and after)

[26206.532451] rcu: INFO: rcu_sched self-detected stall on CPU
[26206.532504] rcu: 	1-...!: (1 GPs behind) idle=436/1/0x40000002 softirq=279305/279306 fqs=93 
[26206.536863] rcu: rcu_sched kthread starved for 1914 jiffies! g761821 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[26206.545525] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[26206.555677] rcu: RCU grace-period kthread stack dump:

In light of this post: https://lore.kernel.org/lkml/20150114221228.GV9719@linux.vnet.ibm.com/T/

I think irqbalance with only 2 CPU cores (the R7800's 2 Krait cores) would cause a lot of context switching between them when you run multiple other processes (hostapd, nlbwmon, collectd, VPN, SMB file sharing etc.) in addition to processing WAN and Wi-Fi traffic. That would hurt overall operation rather than help it. A CPU can only context-switch once its RCU read-side operations have completed.

Let's disable irqbalance (make sure you stop/start the irqbalance service after disabling it, or just reboot the router) to see if this RCU stalling issue still happens with relatively high frequency (every few hours, or in less than a day).

SiXX · September 7, 2022, 4:02am

I agree with removing nlbwmon. One of those things that is great on paper, but in reality just gives little headaches. I've disabled it as well.

GollyJer · September 7, 2022, 4:38am

Does kernel 5.10 have issues with pppoe hardware offloading?

With my gigabit connection, I was getting around 800 Mb/s and 90%+ CPU load. Today I updated to the latest build and am getting much slower speeds and less CPU load.

vochong · September 7, 2022, 4:50am

Were you running the netperf client on the router itself? If so, most of the CPU utilization would come from the client itself.

Please run your netperf client on a PC connected to the router's LAN port. That's when you will experience the benefit of NSS-offloading of your NAT overloaded connections.

vochong · September 7, 2022, 4:54am

And do not trust the throughput numbers when using these public netperf* servers at all. They're not dedicated and distributed servers like the ones used by speedtest.

Mpilon · September 7, 2022, 4:57am

It's a thing to try - I've been using irqbalance without apparent problems with 21.02.1. if irqbalance is a problem I'd expect it's in the realm of cache coherency issues.

The fastest path to figuring this out could be for @vochong to look at all the differences between his 3-week-old build and the latest. Not config differences, C code diffs, line by line.

This kernel is widely used all over the place - I expect something in NSS code or driver is mis-using a syscall, other kernel resource, or similar. That's as close as I've gotten and it's still guesswork.

vlan_dev_get_stats64 seems to be consistently close to the center of the failures. Looking at the call stack with that func at the top, multiple tasks seem to be converging there.

And I think all it's doing is updating /proc.

vochong · September 7, 2022, 5:01am

D43m0n got the same RCU stalling crash using 21.02.x after 5 days, and I'm pretty sure he was using irqbalance in his setup.

vochong · September 7, 2022, 5:11am

My previous 22.03 image had kernel 5.10.136 while the latest has 5.10.138.

Since it's very easy to see the crash on your setup, you can just try to load the official (non-NSS) 22.03.0 image to see how it goes.

https://downloads.openwrt.org/releases/22.03.0/targets/ipq806x/generic/openwrt-22.03.0-ipq806x-generic-netgear_r7800-squashfs-sysupgrade.bin

GollyJer · September 7, 2022, 5:11am

Yes. I run it from a client.

More than the speeds, it seems the CPU load is much different, which is why I asked the question.

I've seen mention of NSS with kernel 5.10 not working when using pppoe but it doesn't seem to be confirmed anywhere.

vochong · September 7, 2022, 5:16am

PPPoE offloading in 5.10 was confirmed to work with the latest patches. A few people have confirmed it.

From a PC connected to your R7800 LAN port, run a speedtest https://www.speedtest.net/

When the speedtest is in progress, issue uptime and check the first load average. It should stay close to 0.00, assuming you don't run any other CPU heavy app during that time.

uptime
22:15:22 up 6:44, load average: 0.00 <-(this one), 0.01, 0.02
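
If you would rather log that figure during the speedtest than eyeball it, the first load average can be pulled out of the uptime line with a few lines of script. This is only a minimal sketch in Python, assuming the usual "load average: a, b, c" layout shown above; the function name is made up for illustration. On the router itself the same three numbers are also in /proc/loadavg.

```python
import re


def first_load_average(uptime_line: str) -> float:
    """Return the 1-minute load average from one line of `uptime` output."""
    # The first number after "load average:" is the 1-minute figure.
    match = re.search(r"load average:\s*([0-9.]+)", uptime_line)
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return float(match.group(1))


# Sample line copied from the post above.
sample = "22:15:22 up 6:44, load average: 0.00, 0.01, 0.02"
print(first_load_average(sample))
```

Run it a few times while the speedtest is in progress; with NSS offloading working, the value should stay near zero as described above.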

Mpilon · September 7, 2022, 5:28am

configured how? irqbalance or not?

Mpilon · September 7, 2022, 5:34am

I just loaded the release R7800 - to see if it tolerates irqbalance I'll enable that.

if it doesn't, I'll disable it.
mmmv,
M.

vochong · September 7, 2022, 5:39am

I think the extreme heat in the Grizzly Bear state is also wreaking some havoc on my R7800. Just had another SWBA overrun and RCU stalling crash. That's a total of 3 crashes today.

The R7800 CPU temperatures are normally around 52C. They're hovering around 65-67 today.

vochong · September 7, 2022, 5:57am

I've just built and loaded a new rebased 22.03 image (and updated feeds) with "make clean" before make. I've disabled irqbalance as well.

Tomorrow will be another crazy hot day. I'll hide in the ice cold office and let the "guy" suffer from heat torture the whole day. If he's alive throughout tomorrow, then irqbalance may have some role in it

vochong · September 7, 2022, 6:22am

I have a remote Linksys EA8500 running a 22.03 NSS image that I built yesterday, around the same time as the one for the R7800. It has also just crashed with RCU stalls. Before that it was running an older private NSS 22.03 image (kernel 5.10.136) with an uptime of about 3 weeks.

1.374808] [<c04abae0>] (seq_read_iter) from [<c04abd1c>] (seq_read+0xe8/0x124)
[ 6111.382712] [<c04abd1c>] (seq_read) from [<c0500740>] (proc_reg_read+0xcc/0x104)
[ 6111.390175] [<c0500740>] (proc_reg_read) from [<c047f15c>] (vfs_read+0xac/0x2e8)
[ 6111.397548] [<c047f15c>] (vfs_read) from [<c047f7e8>] (ksys_read+0x58/0xd4)
[ 6111.404925] [<c047f7e8>] (ksys_read) from [<c0300060>] (ret_fast_syscall+0x0/0x54)
[ 6111.411603] Exception stack(0xc7b49fa8 to 0xc7b49ff0)
[ 6111.419254] 9fa0:                   b6f40ab0 00000001 00000003 b6f40b48 00000400 00000000
[ 6111.424385] 9fc0: b6f40ab0 00000001 bea9d68f 00000003 000001ff 00000000 0001c083 0001aa9a
[ 6111.432535] 9fe0: 0002eea4 bea9d660 b6f07c68 b6f079ac
[ 6174.184761] rcu: INFO: rcu_sched self-detected stall on CPU
[ 6174.184812] rcu: 	1-....: (1 GPs behind) idle=106/1/0x40000002 softirq=74096/74097 fqs=4188 
[ 6174.189154] 	(t=8404 jiffies g=143045 q=490)
[ 6174.197828] NMI backtrace for cpu 1
[ 6174.202084] CPU: 1 PID: 4150 Comm: vnstatd Not tainted 5.10.138 #0
[ 6174.205295] Hardware name: Generic DT based system
[ 6174.211564] [<c030e4ac>] (unwind_backtrace) from [<c030a22c>] (show_stack+0x14/0x20)
[ 6174.216335] [<c030a22c>] (show_stack) from [<c0634328>] (dump_stack+0x94/0xa8)
[ 6174.224233] [<c0634328>] (dump_stack) from [<c063c9b0>] (nmi_cpu_backtrace+0xdc/0x108)
[ 6174.231262] [<c063c9b0>] (nmi_cpu_backtrace) from [<c063cafc>] (nmi_trigger_cpumask_backtrace+0x120/0x158)
[ 6174.239170] [<c063cafc>] (nmi_trigger_cpumask_backtrace) from [<c03819f0>] (rcu_dump_cpu_stacks+0xe8/0x118)
[ 6174.248804] [<c03819f0>] (rcu_dump_cpu_stacks) from [<c0387504>] (rcu_sched_clock_irq+0x728/0x8f8)
[ 6174.258443] [<c0387504>] (rcu_sched_clock_irq) from [<c038e624>] (update_process_times+0x64/0x90)
[ 6174.267474] [<c038e624>] (update_process_times) from [<c03a1064>] (tick_sched_timer+0x88/0x130)
[ 6174.276407] [<c03a1064>] (tick_sched_timer) from [<c038ec28>] (__hrtimer_run_queues+0x184/0x254)
[ 6174.284912] [<c038ec28>] (__hrtimer_run_queues) from [<c038fc60>] (hrtimer_interrupt+0x130/0x374)
[ 6174.293939] [<c038fc60>] (hrtimer_interrupt) from [<c07e90a0>] (msm_timer_interrupt+0x3c/0x4c)
[ 6174.302712] [<c07e90a0>] (msm_timer_interrupt) from [<c0377cb4>] (handle_percpu_devid_irq+0x84/0x178)
[ 6174.311215] [<c0377cb4>] (handle_percpu_devid_irq) from [<c037178c>] (__handle_domain_irq+0x90/0xf4)
[ 6174.320502] [<c037178c>] (__handle_domain_irq) from [<c064d6bc>] (gic_handle_irq+0x90/0xb8)
[ 6174.329702] [<c064d6bc>] (gic_handle_irq) from [<c0300b8c>] (__irq_svc+0x6c/0x90)
[ 6174.337763] Exception stack(0xc7b49c40 to 0xc7b49c88)
[ 6174.345418] 9c40: 00000000 00000002 0000b22b ff7edd98 c7b49d68 c0d04f1c 00000000 00000000
[ 6174.350458] 9c60: c0d05038 c0d04f1c c0d04fe0 ffffffff 00000000 c7b49c90 c09c74fc c09c751c
[ 6174.358603] 9c80: 00000013 ffffffff
[ 6174.366764] [<c0300b8c>] (__irq_svc) from [<c09c751c>] (vlan_dev_get_stats64+0x74/0x19c)
[ 6174.370076] [<c09c751c>] (vlan_dev_get_stats64) from [<c081aa80>] (dev_get_stats+0x44/0xdc)
[ 6174.378408] [<c081aa80>] (dev_get_stats) from [<c085c8f0>] (dev_seq_printf_stats+0x20/0x14c)
[ 6174.386474] [<c085c8f0>] (dev_seq_printf_stats) from [<c085ca30>] (dev_seq_show+0x14/0x40)
[ 6174.395155] [<c085ca30>] (dev_seq_show) from [<c04abae0>] (seq_read_iter+0x3f0/0x544)
[ 6174.403224] [<c04abae0>] (seq_read_iter) from [<c04abd1c>] (seq_read+0xe8/0x124)
[ 6174.411128] [<c04abd1c>] (seq_read) from [<c0500740>] (proc_reg_read+0xcc/0x104)
[ 6174.418590] [<c0500740>] (proc_reg_read) from [<c047f15c>] (vfs_read+0xac/0x2e8)
[ 6174.425965] [<c047f15c>] (vfs_read) from [<c047f7e8>] (ksys_read+0x58/0xd4)
[ 6174.433342] [<c047f7e8>] (ksys_read) from [<c0300060>] (ret_fast_syscall+0x0/0x54)
[ 6174.440019] Exception stack(0xc7b49fa8 to 0xc7b49ff0)
[ 6174.447670] 9fa0:                   b6f40ab0 00000001 00000003 b6f40b48 00000400 00000000
[ 6174.452802] 9fc0: b6f40ab0 00000001 bea9d68f 00000003 000001ff 00000000 0001c083 0001aa9a
[ 6174.460951] 9fe0: 0002eea4 bea9d660 b6f07c68 b6f079ac
[ 6198.345720] br-lan: received packet on wlan1 with own address as source address (addr:36:c7:eb:65:e6:df, vlan:0)
[ 6198.375732] br-lan: received packet on wlan1 with own address as source address (addr:36:c7:eb:65:e6:df, vlan:0)
[ 6248.413004] rcu: INFO: rcu_sched self-detected stall on CPU
[ 6248.413058] rcu: 	1-....: (2099 ticks this GP) idle=e02/1/0x40000002 softirq=74252/74252 fqs=1032 
[ 6248.417399] 	(t=2100 jiffies g=143393 q=26)
[ 6248.426422] NMI backtrace for cpu 1
[ 6248.430499] CPU: 1 PID: 4150 Comm: vnstatd Not tainted 5.10.138 #0
[ 6248.433972] Hardware name: Generic DT based system
[ 6248.440243] [<c030e4ac>] (unwind_backtrace) from [<c030a22c>] (show_stack+0x14/0x20)
[ 6248.445016] [<c030a22c>] (show_stack) from [<c0634328>] (dump_stack+0x94/0xa8)
[ 6248.452913] [<c0634328>] (dump_stack) from [<c063c9b0>] (nmi_cpu_backtrace+0xdc/0x108)
[ 6248.459940] [<c063c9b0>] (nmi_cpu_backtrace) from [<c063cafc>] (nmi_trigger_cpumask_backtrace+0x120/0x158)
[ 6248.467851] [<c063cafc>] (nmi_trigger_cpumask_backtrace) from [<c03819f0>] (rcu_dump_cpu_stacks+0xe8/0x118)
[ 6248.477484] [<c03819f0>] (rcu_dump_cpu_stacks) from [<c0387504>] (rcu_sched_clock_irq+0x728/0x8f8)
[ 6248.487122] [<c0387504>] (rcu_sched_clock_irq) from [<c038e624>] (update_process_times+0x64/0x90)
[ 6248.496151] [<c038e624>] (update_process_times) from [<c03a1064>] (tick_sched_timer+0x88/0x130)
[ 6248.505088] [<c03a1064>] (tick_sched_timer) from [<c038ec28>] (__hrtimer_run_queues+0x184/0x254)
[ 6248.513590] [<c038ec28>] (__hrtimer_run_queues) from [<c038fc60>] (hrtimer_interrupt+0x130/0x374)
[ 6248.522616] [<c038fc60>] (hrtimer_interrupt) from [<c07e90a0>] (msm_timer_interrupt+0x3c/0x4c)
[ 6248.531393] [<c07e90a0>] (msm_timer_interrupt) from [<c0377cb4>] (handle_percpu_devid_irq+0x84/0x178)
[ 6248.539892] [<c0377cb4>] (handle_percpu_devid_irq) from [<c037178c>] (__handle_domain_irq+0x90/0xf4)
[ 6248.549178] [<c037178c>] (__handle_domain_irq) from [<c064d6bc>] (gic_handle_irq+0x90/0xb8)
[ 6248.558378] [<c064d6bc>] (gic_handle_irq) from [<c0300b8c>] (__irq_svc+0x6c/0x90)
[ 6248.566441] Exception stack(0xc7b49c40 to 0xc7b49c88)
[ 6248.574096] 9c40: 00000000 00000002 0000b267 ff7edd98 c7b49d68 c0d04f1c 00000000 00000000
[ 6248.579134] 9c60: c0d05038 c0d04f1c c0d04fe0 ffffffff 00000000 c7b49c90 c09c74fc c09c751c
[ 6248.587279] 9c80: 00000013 ffffffff
[ 6248.595445] [<c0300b8c>] (__irq_svc) from [<c09c751c>] (vlan_dev_get_stats64+0x74/0x19c)
[ 6248.598753] [<c09c751c>] (vlan_dev_get_stats64) from [<c081aa80>] (dev_get_stats+0x44/0xdc)
[ 6248.607088] [<c081aa80>] (dev_get_stats) from [<c085c8f0>] (dev_seq_printf_stats+0x20/0x14c)
[ 6248.615154] [<c085c8f0>] (dev_seq_printf_stats) from [<c085ca30>] (dev_seq_show+0x14/0x40)
[ 6248.623832] [<c085ca30>] (dev_seq_show) from [<c04abae0>] (seq_read_iter+0x3f0/0x544)
[ 6248.631902] [<c04abae0>] (seq_read_iter) from [<c04abd1c>] (seq_read+0xe8/0x124)
[ 6248.639806] [<c04abd1c>] (seq_read) from [<c0500740>] (proc_reg_read+0xcc/0x104)
[ 6248.647268] [<c0500740>] (proc_reg_read) from [<c047f15c>] (vfs_read+0xac/0x2e8)
[ 6248.654642] [<c047f15c>] (vfs_read) from [<c047f7e8>] (ksys_read+0x58/0xd4)
[ 6248.662018] [<c047f7e8>] (ksys_read) from [<c0300060>] (ret_fast_syscall+0x0/0x54)
[ 6248.668697] Exception stack(0xc7b49fa8 to 0xc7b49ff0)
[ 6248.676348] 9fa0:                   b6f40040 00000001 00000003 b6f400d8 00000400 00000000
[ 6248.681480] 9fc0: b6f40040 00000001 bea9d68f 00000003 000001ff 00000000 0001c083 0001aa9a
[ 6248.689628] 9fe0: 0002eea4 bea9d660 b6f07c68 b6f079ac
[ 6311.441578] rcu: INFO: rcu_sched self-detected stall on CPU
[ 6311.441630] rcu: 	1-....: (8375 ticks this GP) idle=e02/1/0x40000002 softirq=74252/74252 fqs=4137 
[ 6311.445972] 	(t=8403 jiffies g=143393 q=150)
[ 6311.454992] NMI backtrace for cpu 1
[ 6311.459421] CPU: 1 PID: 4150 Comm: vnstatd Not tainted 5.10.138 #0
[ 6311.462630] Hardware name: Generic DT based system
[ 6311.468900] [<c030e4ac>] (unwind_backtrace) from [<c030a22c>] (show_stack+0x14/0x20)
[ 6311.473672] [<c030a22c>] (show_stack) from [<c0634328>] (dump_stack+0x94/0xa8)
[ 6311.481568] [<c0634328>] (dump_stack) from [<c063c9b0>] (nmi_cpu_backtrace+0xdc/0x108)
[ 6311.488599] [<c063c9b0>] (nmi_cpu_backtrace) from [<c063cafc>] (nmi_trigger_cpumask_backtrace+0x120/0x158)
[ 6311.496507] [<c063cafc>] (nmi_trigger_cpumask_backtrace) from [<c03819f0>] (rcu_dump_cpu_stacks+0xe8/0x118)
[ 6311.506141] [<c03819f0>] (rcu_dump_cpu_stacks) from [<c0387504>] (rcu_sched_clock_irq+0x728/0x8f8)
[ 6311.515779] [<c0387504>] (rcu_sched_clock_irq) from [<c038e624>] (update_process_times+0x64/0x90)
[ 6311.524812] [<c038e624>] (update_process_times) from [<c03a1064>] (tick_sched_timer+0x88/0x130)
[ 6311.533744] [<c03a1064>] (tick_sched_timer) from [<c038ec28>] (__hrtimer_run_queues+0x184/0x254)
[ 6311.542252] [<c038ec28>] (__hrtimer_run_queues) from [<c038fc60>] (hrtimer_interrupt+0x130/0x374)
[ 6311.551277] [<c038fc60>] (hrtimer_interrupt) from [<c07e90a0>] (msm_timer_interrupt+0x3c/0x4c)
[ 6311.560056] [<c07e90a0>] (msm_timer_interrupt) from [<c0377cb4>] (handle_percpu_devid_irq+0x84/0x178)
[ 6311.568556] [<c0377cb4>] (handle_percpu_devid_irq) from [<c037178c>] (__handle_domain_irq+0x90/0xf4)
[ 6311.577843] [<c037178c>] (__handle_domain_irq) from [<c064d6bc>] (gic_handle_irq+0x90/0xb8)
[ 6311.587042] [<c064d6bc>] (gic_handle_irq) from [<c0300b8c>] (__irq_svc+0x6c/0x90)
[ 6311.595104] Exception stack(0xc7b49c40 to 0xc7b49c88)
[ 6311.602758] 9c40: 00000000 00000002 0000b267 ff7edd98 c7b49d68 c0d04f1c 00000000 00000000
[ 6311.607796] 9c60: c0d05038 c0d04f1c c0d04fe0 ffffffff 00000000 c7b49c90 c09c74fc c09c751c
[ 6311.615939] 9c80: 00000013 ffffffff
[ 6311.624102] [<c0300b8c>] (__irq_svc) from [<c09c751c>] (vlan_dev_get_stats64+0x74/0x19c)
[ 6311.627414] [<c09c751c>] (vlan_dev_get_stats64) from [<c081aa80>] (dev_get_stats+0x44/0xdc)
[ 6311.635744] [<c081aa80>] (dev_get_stats) from [<c085c8f0>] (dev_seq_printf_stats+0x20/0x14c)
[ 6311.643811] [<c085c8f0>] (dev_seq_printf_stats) from [<c085ca30>] (dev_seq_show+0x14/0x40)
[ 6311.652492] [<c085ca30>] (dev_seq_show) from [<c04abae0>] (seq_read_iter+0x3f0/0x544)
[ 6311.660561] [<c04abae0>] (seq_read_iter) from [<c04abd1c>] (seq_read+0xe8/0x124)
[ 6311.668464] [<c04abd1c>] (seq_read) from [<c0500740>] (proc_reg_read+0xcc/0x104)
[ 6311.675927] [<c0500740>] (proc_reg_read) from [<c047f15c>] (vfs_read+0xac/0x2e8)
[ 6311.683303] [<c047f15c>] (vfs_read) from [<c047f7e8>] (ksys_read+0x58/0xd4)
[ 6311.690679] [<c047f7e8>] (ksys_read) from [<c0300060>] (ret_fast_syscall+0x0/0x54)
[ 6311.697357] Exception stack(0xc7b49fa8 to 0xc7b49ff0)
[ 6311.705006] 9fa0:                   b6f40040 00000001 00000003 b6f400d8 00000400 00000000
[ 6311.710139] 9fc0: b6f40040 00000001 bea9d68f 00000003 000001ff 00000000 0001c083 0001aa9a
[ 6311.718288] 9fe0: 0002eea4 bea9d660 b6f07c68 b6f079ac
[ 6345.470937] ath10k_warn: 56 callbacks suppressed
[ 6345.470952] ath10k_pci 0001:01:00.0: wmi command 36892 timeout, restarting hardware
[ 6345.474714] ath10k_pci 0001:01:00.0: failed to recalculate rts/cts prot for vdev 0: -11
[ 6348.510875] ath10k_pci 0001:01:00.0: failed to set cts protection for vdev 0: -108
[ 6348.510934] ath10k_pci 0001:01:00.0: failed to set preamble for vdev 0: -108
[ 6348.517398] ath10k_pci 0001:01:00.0: failed to set mgmt tx rate -108
[ 6351.550808] ath10k_pci 0000:01:00.0: wmi command 36892 timeout, restarting hardware
[ 6351.550869] ath10k_pci 0000:01:00.0: failed to recalculate rts/cts prot for vdev 0: -11
[ 6354.590747] ath10k_pci 0000:01:00.0: failed to set cts protection for vdev 0: -108
[ 6354.590805] ath10k_pci 0000:01:00.0: failed to set erp slot for vdev 0: -108
[ 6354.597278] ath10k_pci 0000:01:00.0: failed to set preamble for vdev 0: -108
[ 6354.604543] ath10k_pci 0000:01:00.0: failed to set mgmt tx rate -108
[ 6354.615936] ath10k_pci 0001:01:00.0: failed to set beacon mode for vdev 0: -108
[ 6354.617826] ath10k_pci 0001:01:00.0: failed to set dtim period for vdev 0: -108
[ 6354.625902] ath10k_pci 0001:01:00.0: failed to recalculate rts/cts prot for vdev 0: -108
[ 6354.632246] ath10k_pci 0001:01:00.0: failed to set cts protection for vdev 0: -108
[ 6360.621145] ath10k_warn: 2 callbacks suppressed
[ 6360.621161] ath10k_pci 0000:01:00.0: failed to set beacon mode for vdev 0: -108
[ 6360.624837] ath10k_pci 0000:01:00.0: failed to set dtim period for vdev 0: -108
[ 6360.633042] ath10k_pci 0000:01:00.0: failed to recalculate rts/cts prot for vdev 0: -108
[ 6360.639150] ath10k_pci 0000:01:00.0: failed to set cts protection for vdev 0: -108
[ 6360.647541] ath10k_pci 0000:01:00.0: failed to set erp slot for vdev 0: -108
[ 6360.654890] ath10k_pci 0000:01:00.0: failed to set preamble for vdev 0: -108
[ 6360.662107] ath10k_pci 0000:01:00.0: failed to set mgmt tx rate -108
[ 6360.670910] ath10k_pci 0001:01:00.0: failed to set beacon mode for vdev 0: -108
[ 6360.675434] ath10k_pci 0001:01:00.0: failed to set dtim period for vdev 0: -108
[ 6360.683527] ath10k_pci 0001:01:00.0: failed to recalculate rts/cts prot for vdev 0: -108
[ 6363.150455] rcu: INFO: rcu_sched self-detected stall on CPU
[ 6363.150491] rcu: 	0-....: (1 GPs behind) idle=262/1/0x40000002 softirq=93018/93019 fqs=1048 
[ 6363.154836] 	(t=2101 jiffies g=143449 q=67)
[ 6363.163513] NMI backtrace for cpu 0
[ 6363.167418] CPU: 0 PID: 6580 Comm: dropbear Not tainted 5.10.138 #0
[ 6363.170890] Hardware name: Generic DT based system
[ 6363.177155] [<c030e4ac>] (unwind_backtrace) from [<c030a22c>] (show_stack+0x14/0x20)
[ 6363.182012] [<c030a22c>] (show_stack) from [<c0634328>] (dump_stack+0x94/0xa8)
[ 6363.189911] [<c0634328>] (dump_stack) from [<c063c9b0>] (nmi_cpu_backtrace+0xdc/0x108)
[ 6363.196941] [<c063c9b0>] (nmi_cpu_backtrace) from [<c063cafc>] (nmi_trigger_cpumask_backtrace+0x120/0x158)
[ 6363.204846] [<c063cafc>] (nmi_trigger_cpumask_backtrace) from [<c03819f0>] (rcu_dump_cpu_stacks+0xe8/0x118)
[ 6363.214480] [<c03819f0>] (rcu_dump_cpu_stacks) from [<c0387504>] (rcu_sched_clock_irq+0x728/0x8f8)
[ 6363.224116] [<c0387504>] (rcu_sched_clock_irq) from [<c038e624>] (update_process_times+0x64/0x90)
[ 6363.233150] [<c038e624>] (update_process_times) from [<c03a1064>] (tick_sched_timer+0x88/0x130)
[ 6363.242083] [<c03a1064>] (tick_sched_timer) from [<c038ec28>] (__hrtimer_run_queues+0x184/0x254)
[ 6363.250588] [<c038ec28>] (__hrtimer_run_queues) from [<c038fc60>] (hrtimer_interrupt+0x130/0x374)
[ 6363.259615] [<c038fc60>] (hrtimer_interrupt) from [<c07e90a0>] (msm_timer_interrupt+0x3c/0x4c)
[ 6363.268388] [<c07e90a0>] (msm_timer_interrupt) from [<c0377cb4>] (handle_percpu_devid_irq+0x84/0x178)
[ 6363.276891] [<c0377cb4>] (handle_percpu_devid_irq) from [<c037178c>] (__handle_domain_irq+0x90/0xf4)
[ 6363.286179] [<c037178c>] (__handle_domain_irq) from [<c064d6bc>] (gic_handle_irq+0x90/0xb8)
[ 6363.295379] [<c064d6bc>] (gic_handle_irq) from [<c0300b8c>] (__irq_svc+0x6c/0x90)
[ 6363.303445] Exception stack(0xc15d1c40 to 0xc15d1c88)
[ 6363.311093] 1c40: 00000000 00000002 0000b271 ff7edd98 c15d1d68 c0d04f1c 00000000 00000000
[ 6363.316131] 1c60: c0d05038 c0d04f1c c0d04fe0 ffffffff 00000000 c15d1c90 c09c74fc c09c751c
[ 6363.324282] 1c80: 00000013 ffffffff
[ 6363.332441] [<c0300b8c>] (__irq_svc) from [<c09c751c>] (vlan_dev_get_stats64+0x74/0x19c)
[ 6363.335751] [<c09c751c>] (vlan_dev_get_stats64) from [<c081aa80>] (dev_get_stats+0x44/0xdc)
[ 6363.344082] [<c081aa80>] (dev_get_stats) from [<c085c8f0>] (dev_seq_printf_stats+0x20/0x14c)
[ 6363.352149] [<c085c8f0>] (dev_seq_printf_stats) from [<c085ca30>] (dev_seq_show+0x14/0x40)
[ 6363.360830] [<c085ca30>] (dev_seq_show) from [<c04abae0>] (seq_read_iter+0x3f0/0x544)
[ 6363.368901] [<c04abae0>] (seq_read_iter) from [<c04abd1c>] (seq_read+0xe8/0x124)
[ 6363.376805] [<c04abd1c>] (seq_read) from [<c0500740>] (proc_reg_read+0xcc/0x104)
[ 6363.384267] [<c0500740>] (proc_reg_read) from [<c047f15c>] (vfs_read+0xac/0x2e8)
[ 6363.391644] [<c047f15c>] (vfs_read) from [<c047f7e8>] (ksys_read+0x58/0xd4)
[ 6363.399020] [<c047f7e8>] (ksys_read) from [<c0300060>] (ret_fast_syscall+0x0/0x54)
[ 6363.405701] Exception stack(0xc15d1fa8 to 0xc15d1ff0)
[ 6363.413346] 1fa0:                   00000000 00000000 00000003 befc9ba4 00001000 00000000
[ 6363.418474] 1fc0: 00000000 00000000 00000000 00000003 befcac10 befc9ba4 00000fff befcae74
[ 6363.426629] 1fe0: befc9b60 befc9b50 b6f03350 b6f02830
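
A side note on reading the numbers in this dump: the t= values are in jiffies, so they only mean something once the tick rate is known. Two of the reports above belong to the same grace period (g=143393), which makes it possible to estimate HZ from the log itself by comparing the jiffies delta with the timestamp delta. A minimal sketch in Python; the helper names are made up for illustration, and the two lines are copied from the log with whitespace normalized.

```python
import re

# Two stall reports for the same grace period, copied from the log above.
REPORTS = [
    "[ 6248.417399] (t=2100 jiffies g=143393 q=26)",
    "[ 6311.445972] (t=8403 jiffies g=143393 q=150)",
]

PATTERN = re.compile(r"\[\s*(\d+\.\d+)\]\s*\(t=(\d+) jiffies g=(\d+)")


def parse_report(line):
    """Return (timestamp_seconds, t_jiffies, grace_period) for one report line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not an RCU stall summary line: %r" % line)
    return float(match.group(1)), int(match.group(2)), int(match.group(3))


def estimate_hz(first, second):
    """Estimate the tick rate from two reports of the same grace period."""
    ts1, t1, g1 = parse_report(first)
    ts2, t2, g2 = parse_report(second)
    if g1 != g2:
        raise ValueError("reports belong to different grace periods")
    return (t2 - t1) / (ts2 - ts1)


def jiffies_to_seconds(jiffies, hz):
    return jiffies / hz


hz = round(estimate_hz(*REPORTS))
print(hz)                            # prints 100
print(jiffies_to_seconds(2100, hz))  # prints 21.0
print(jiffies_to_seconds(8403, hz))  # prints 84.03
```

So the first report for a stall shows up after about 21 seconds, which matches the kernel's usual default RCU stall timeout, and the repeat comes about 63 seconds later with the same task (vnstatd, PID 4150) still stuck at the same address.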

Mpilon · September 7, 2022, 6:52am

vochong:

6363.332441] [<c0300b8c>] (__irq_svc) from [<c09c751c>] (vlan_dev_get_stats64+0x74/0x19c)
[ 6363.335751] [<c09c751c>] (vlan_dev_get_stats64) from [<c081aa80>] (dev_get_stats+0x44/0xdc)
[ 6363.344082] [<c081aa80>] (dev_get_stats) from [<c085c8f0>] (dev_seq_printf_stats+0x20/0x14c)
[ 6363.352149] [<c085c8f0>] (dev_seq_printf_stats) from [<c085ca30>] (dev_seq_show+0x14/0x40)
[ 6363.360830] [<c085ca30>] (dev_seq_show) from [<c04abae0>] (seq_read_iter+0x3f0/0x544)
[ 6363.368901] [<c04abae0>] (seq_read_iter) from [<c04abd1c>] (seq_read+0xe8/0x124)
[ 6363.376805] [<c04abd1c>] (seq_read) from [<c0500740>] (proc_reg_read+0xcc/0x104)
[ 6363.384267] [<c0500740>] (proc_reg_read) from [<c047f15c>] (vfs_read+0xac/0x2e8)
[ 6363.391644] [<c047f15c>] (vfs_read) from [<c047f7e8>] (ksys_read+0x58/0xd4)
[ 6363.399020] [<c047f7e8>] (ksys_read) from [<c0300060>] (ret_fast_syscall

@vochong
Note the 2nd line in the backtrace: vlan_dev_get_stats64 is ultimately called from ksys_read, i.e. from a read() on a /proc file (proc_reg_read), which is core Linux filesystem code; here the read is collecting per-interface statistics.

I've seen this same failure, as well as others, all ending up in vlan_dev_get_stats64. I don't know if it's the cause or an innocent bystander.

It's down to what devices are executing this very common core code. A couple of my crashes were from kernel service of the LEDs, calling this very same code.
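
To compare crashes without reading every dump by eye, the two interesting facts per backtrace (which task was running, and which function the timer interrupt landed in) can be extracted mechanically. A rough sketch in Python, assuming the ARM backtrace layout seen in the logs in this thread; the function name is hypothetical and the sample lines are copied from vochong's EA8500 log.

```python
import re
from collections import Counter

# "CPU: 1 PID: 4150 Comm: vnstatd Not tainted ..." names the running task.
COMM_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")
# The frame that __irq_svc returns to is the code that was interrupted.
IRQ_RE = re.compile(r"\(__irq_svc\) from \[<[0-9a-f]+>\] \(([\w.]+)\+0x")


def interrupted_functions(log_text):
    """Yield (comm, function) for each NMI backtrace found in log_text."""
    comm = None
    for line in log_text.splitlines():
        match = COMM_RE.search(line)
        if match:
            comm = match.group(3)
            continue
        match = IRQ_RE.search(line)
        if match and comm is not None:
            yield comm, match.group(1)
            comm = None  # wait for the next backtrace header


SAMPLE = """\
[ 6174.202084] CPU: 1 PID: 4150 Comm: vnstatd Not tainted 5.10.138 #0
[ 6174.366764] [<c0300b8c>] (__irq_svc) from [<c09c751c>] (vlan_dev_get_stats64+0x74/0x19c)
[ 6363.167418] CPU: 0 PID: 6580 Comm: dropbear Not tainted 5.10.138 #0
[ 6363.332441] [<c0300b8c>] (__irq_svc) from [<c09c751c>] (vlan_dev_get_stats64+0x74/0x19c)
"""

pairs = list(interrupted_functions(SAMPLE))
print(pairs)
print(Counter(function for _, function in pairs))
```

On the sample this pairs vnstatd and dropbear with vlan_dev_get_stats64, which is the pattern described above: different tasks, same function. Feeding it a full console-ramoops dump would show whether any other function ever turns up in that position.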