I could be wrong, but I'm pretty sure the NSS cores in this thing aren't smart enough for that.
Interesting. I haven’t seen any similar problems in the logs. Both CPU 0 and 1 are being used on my devices without those hiccups. Let me know if you find a source. Interesting that 22.03 works but 23.05 doesn’t.
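For anyone wanting to check their own logs for the same thing, here is a rough sketch. The function name `count_rcu_stalls` is made up for illustration, and the match pattern is a loose guess at what the kernel's stall reports look like, so treat a count of zero as "nothing obvious" rather than proof.

```shell
#!/bin/sh
# Count log lines that look like RCU stall reports.
# Reads log text on stdin, so it works with dmesg, logread, or a saved file.
# The pattern is an assumption: any line containing "rcu" followed by "stall".
count_rcu_stalls() {
    # grep -c prints the count even when it is 0 but then exits non-zero,
    # so "|| true" keeps the function's exit status clean.
    grep -ciE 'rcu.*stall' || true
}

# Typical use on the router:
#   dmesg | count_rcu_stalls
#   logread | count_rcu_stalls
```

Since a stall shorter than the timeout never gets logged, this only catches the ones long enough to trigger a report.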
I think one reason may be that 21 seconds of no scheduling activity before logging a fail is a really high bar. I dropped this timeout in the kernel to 3 sec and it happens more frequently.
Second, the only overt symptom on the outside is a hang. Things stop for a bit - but that's not readily apparent unless you're watching.
IIRC there were reports of brief stuttering ... not enough to throw a 21-second RCU splat.
RCU stopping for whole seconds suggests a lock without an unlock. I'd look in the NSS code for that kind of issue, but I don't know where that code lives in the build tree.
(The stack traces show cpu0 waiting in idle when the wakeup-NMI pokes it.)
(Not picking on NSS here. If it were in a widely shared driver or the kernel, it'd have attracted more attention.)
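The shorter stall timeout mentioned above can, as far as I know, also be tried without rebuilding the kernel, through the rcupdate module parameter in sysfs. A minimal sketch follows. The function names are hypothetical, and whether the parameter is exposed and writable on a given build is an assumption to check first. The kernel limits the value to a range of 3 to 300 seconds.

```shell
#!/bin/sh
# Parameter path as described in the kernel's RCU stall-warning docs.
# Assumption: it exists and is writable on your build. Override RCU_PARAM to test.
RCU_PARAM="${RCU_PARAM:-/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout}"

# Print the current stall timeout in seconds, or "unavailable".
show_stall_timeout() {
    if [ -r "$RCU_PARAM" ]; then
        echo "rcu_cpu_stall_timeout=$(cat "$RCU_PARAM")"
    else
        echo "rcu_cpu_stall_timeout=unavailable"
    fi
}

# Set the stall timeout. $1 = seconds (digits only).
set_stall_timeout() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: set_stall_timeout <seconds>" >&2; return 1 ;;
    esac
    [ -w "$RCU_PARAM" ] || { echo "cannot write $RCU_PARAM" >&2; return 1; }
    echo "$1" > "$RCU_PARAM"
}

# Typical use (as root on the router):
#   show_stall_timeout
#   set_stall_timeout 3
```

A value set this way does not survive a reboot, which is handy for experimenting.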
I'm using a usb+sdcard drive on my r7800, I'll remove that from the mix.
I'll see if I can capture the callers of the RCU lock and unlock funcs and dump them with the RCU splat dump.
Have you tried the master snapshot? 23.05 and master are currently really close code-wise, if I'm not wrong.
I'm using the @ACwifidude git branch:
Under the 'branches' pulldown on that page:
-- I don't see a master branch.
Perhaps I could use a sanity check:
Am I looking in the right place? I'm downloading the .zip under the "Code" pulldown.
I then build by doing:
./scripts/feeds update -a && \
./scripts/feeds install -a && \
./scripts/feeds install libpam libnetsnmp liblzma && \
cp diffconfig .config && \
make defconfig && \
./scripts/getver.sh

then just 'make'.
If I'm skipping a step, or need to use a different git tree, please advise! Where does the R7800 NSS master live?
Master, 23.05, and 22.03 links are here:
Master is current as of when master was still on the 5.15 kernel. I don’t have a 6.1 kernel master NSS build.
I went with the 23.05 tree as it had been rebased a week ago; the kernel 5.15 tree hadn't been updated in 2 months. If you think a recent change broke something I'm happy to pull that older tree.
I'm trying that 23.05 NSS git tree with one change: I configured the kernel for low-latency preemptible. The Linux notes on debugging RCU stalls say one failure mode occurs when a thread loops too long without invoking schedule(), in non-preempt kernels.
2 changes actually: I also reset the RCU stall timeout to 3 sec, down from the Linux default of 21.
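To double-check that both changes actually landed in the kernel config that got built, something like the sketch below should do. `show_rcu_debug_config` is a made-up helper name, and the location of the generated kernel .config inside the build tree varies, so the path you pass in is yours to find. `CONFIG_PREEMPT` is the "low-latency" preemption model and `CONFIG_RCU_CPU_STALL_TIMEOUT` is the build-time stall timeout.

```shell
#!/bin/sh
# Print the preemption-model and RCU stall-timeout lines from a kernel config file.
# $1 = path to a kernel .config-style file (defaults to ./.config, an assumption).
show_rcu_debug_config() {
    cfg="${1:-.config}"
    [ -r "$cfg" ] || { echo "cannot read $cfg" >&2; return 1; }
    # Matches both "CONFIG_X=..." and "# CONFIG_X is not set" forms.
    # The trailing [= ] keeps CONFIG_PREEMPT from also matching CONFIG_PREEMPT_RCU etc.
    grep -E '^(# )?CONFIG_(PREEMPT|PREEMPT_NONE|PREEMPT_VOLUNTARY|RCU_CPU_STALL_TIMEOUT)[= ]' "$cfg"
}

# Typical use, from wherever the built kernel's .config lives:
#   show_rcu_debug_config /path/to/linux/.config
```

If the preempt line comes back as "is not set" after a rebuild, the target's own kernel config is probably overriding the change.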
It'll be interesting if this issue changes or disappears. Still going to be a pig to find.
(Working from memory here, don't kill me if I'm mangling some details...)