Ipq806x NSS build (Netgear R7800 / TP-Link C2600 / Linksys EA8500)

i could be wrong but i'm pretty sure the nss cores in this thing aren't smart enough for that

Interesting. I haven’t seen any similar problems in the logs. Both CPU 0 and 1 are being used on my devices without those hiccups. Let me know if you find a source. Interesting that 22.03 works but 23.05 doesn’t.

I think one reason may be that 21 seconds of no scheduling activity before logging a fail is a really high bar. I dropped this timeout in the kernel to 3 sec and it happens more frequently.

Second, the only overt symptom on the outside is a hang. Things stop for a bit - but that's not readily apparent unless you're watching.

IIRC there were reports of brief stuttering ... not enough to throw a 21 second rcu splat.

That RCU stops for whole seconds suggests a lock without an unlock. I'd look in the nss code for that kind of issue but don't know where that code lives in the build tree.

(The stack traces show cpu0 waiting in idle when the wakeup-NMI pokes it.)

(Not picking on nss here; if it were in a widely shared driver or the kernel, it would have attracted more attention.)

I'm using a usb+sdcard drive on my r7800, I'll remove that from the mix.

I'll see if i can capture the callers of the rcu lock, unlock funcs and dump them with the rcu splat dump.

Ideas, thoughts?

@Mpilon
Have you tried the master snapshot? If I'm not wrong, the 23.05 and master code are currently really close.

I'm using the @ACwifidude git branch:

https://github.com/ACwifidude/openwrt/tree/openwrt-23.05-nss-qsdk11

Under the 'branches' pulldown on that page I don't see a master branch.
Perhaps I could use a sanity check:

Am I looking in the right place? I'm downloading the .zip under the "Code" pulldown.

I then build by doing:

./scripts/feeds update -a && \
 ./scripts/feeds install -a && \
 ./scripts/feeds install libpam libnetsnmp liblzma && \
 cp diffconfig .config && \
 make defconfig && \
 ./scripts/getver.sh

then just 'make'

If I'm skipping a step, or need to use a different git tree, please advise! Where does the R7800 NSS master live?

Master, 23.05, and 22.03 links are here:

The master build is current as of when master was still on the 5.15 kernel. I don’t have a 6.1 kernel master NSS build.

I went with the 23.05 tree as it had been rebased a week ago; the kernel 5.15 tree hadn't been updated in 2 mos. If you think a recent change broke something I'm happy to pull that older tree.

I'm trying that 23.05 nss git tree with one change: i configured the kernel for low-latency preemptable. The linux notes on debugging RCU stalls noted one failure mode occurs when a thread loops too long without invoking schedule() - in non-preempt kernels.

2 changes actually: i also reset the rcu stall timeout to 3sec, down from the linux default of 21.
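For anyone who wants to experiment with the stall timeout without rebuilding, here is a minimal sketch. It assumes the kernel exposes the `rcu_cpu_stall_timeout` parameter under /sys/module/rcupdate/parameters, as mainline kernels do; I have not verified that on this NSS build. The function names and the RCU_PARAM variable are made up for illustration.

```shell
#!/bin/sh
# Assumed sysfs location of the RCU stall timeout (mainline path).
# Override RCU_PARAM if your kernel puts it somewhere else.
: "${RCU_PARAM:=/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout}"

# Print the current stall timeout in seconds.
rcu_stall_get() {
    cat "$RCU_PARAM"
}

# Set a new stall timeout in seconds (needs root on a real system).
# Rejects anything that is not a plain non-negative integer.
rcu_stall_set() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: rcu_stall_set SECONDS" >&2
            return 1
            ;;
    esac
    echo "$1" > "$RCU_PARAM"
}
```

On the router this would be used as `rcu_stall_set 3 && rcu_stall_get`. The change does not survive a reboot, so it only helps for reproducing the splat faster, not as a permanent setting.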

It'll be interesting if this issue changes or disappears. Still going to be a pig to find.

(Working from memory here, don't kill me if I'm mangling some details...)


Hey @ACwifidude, is it possible to rebase your image on the latest stable OpenWrt version? For example on 23.03.5 rather than on 23.03-SNAPSHOT? That would solve the kmod packages installation issue too.

I think I got it

After rebasing using your commands

#Remove “rebase” commit (this gives you a clean build environment - it deletes the final bin content and diffconfig files, I’d copy the diffconfig to a separate folder before running this command)
git reset --hard HEAD~1

git remote add upstream https://git.openwrt.org/openwrt/openwrt.git

I do

git fetch upstream && git rebase --onto v22.03.5 upstream/openwrt-22.03

Is that ok?
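To see what that form of the command does before running it on the real tree, here is a toy sketch in a scratch repository. Everything in it is a stand-in: `stable` plays the role of upstream/openwrt-22.03, the tag `v1` plays v22.03.5, and `custom` plays the NSS branch. The function name is hypothetical.

```shell
#!/bin/sh
# Toy demonstration of "git rebase --onto <newbase> <upstream>".
# Runs in a subshell and a temp dir, so it does not touch your real tree.
rebase_onto_demo() (
    set -e
    demo=$(mktemp -d)
    cd "$demo"
    git init -q .
    git config user.name "demo"
    git config user.email "demo@example.invalid"
    git symbolic-ref HEAD refs/heads/stable   # stand-in for upstream/openwrt-22.03
    echo release > base.txt
    git add base.txt
    git commit -q -m "release"
    git tag v1                                # stand-in for v22.03.5
    echo fix >> base.txt
    git commit -q -a -m "post-release fix"
    git checkout -q -b custom                 # stand-in for the NSS branch
    echo nss > nss.txt
    git add nss.txt
    git commit -q -m "nss patch"
    # Replay the commits that are on custom but not on stable, on top of v1.
    git rebase -q --onto v1 stable
    # Print the resulting history of custom, newest first.
    git log --format=%s
    cd /
    rm -rf "$demo"
)
```

Calling `rebase_onto_demo` prints "nss patch" followed by "release": the custom commit ends up directly on the tag, and the "post-release fix" that only exists on the stable branch is left behind. Whether that is what you want for the real build is a separate question.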

The NSS packages are kmods and they modify the kernel, which makes the main OpenWrt repositories unusable for installing new kmods (a consequence of the kernel version check used when installing software). That check is both good (it prevents making a disaster of your router) and bad (custom builds can't get kmod software updates). With that said, I recommend staying with the 23.05 snapshot instead of a fixed version number. A fixed version is an arbitrary point in time and isn't particularly focused on any one platform. The stable branches get more conservative, stability-focused updates, so the snapshot probably offers the best option.

The only way to install new kmods is to build from scratch using the instructions in post #2 above. The software feature does not offer a clean method to install kmods on custom builds. The build has what the majority of people would use to get their router working. With more specialized packages / setups, it may seem daunting but building from scratch isn’t too hard following the instructions above. With the question you are posing - you probably have the skills needed to build an image exactly like you want it. :sunglasses:


@Mpilon, you may try to disable CONFIG_ARM_QCOM_SPM_CPUIDLE in the kernel config and build your image again to see whether it helps your issue wrt CPU0 not taking any scheduling/clock interrupts for a long time

Great! What does it actually do/mean?

Thanks for the suggestion,
M.

is anyone trying to get this working? or are we abandoning ipq806x? :frowning:

Currently running 22.03.5 stable on my 3x R7800s being used as APs.

Is there any benefit to me (performance or stability wise) in using this NSS build or any of hnyman's builds when using the R7800s as access points, compared to using generic OpenWrt 22.03 or 23.05 builds?

WiFi is a hit or miss for me both with and without the ath10k-ct firmware on 22.03.5, especially with 802.11r enabled. No matter what I try, clients stay on APs with poorer signal for way too long until I manually disconnect them and turn WiFi on/off (on the client) and then it'll connect to a nearby AP with stronger signal.

Ok, a little clarity on the rcu stall issue.

  1. it's not gone.
  2. configuring the kernel for preemptable resolved / masked the issue.
  3. i tried @vochong's suggestion to disable CONFIG_ARM_QCOM_SPM_CPUIDLE and reverted to the base scheduler - which failed again for me. I.e. disabling this option didn't fix the issue.

[An aside - I'd run long enough with the preemptable scheduler without problems that i was feeling spooked that I'd encountered a 1-off problem. So relieved when returning to the default scheduler brought it back!]

So. @ACwifidude - my strong suggestion is to make the default kernel preemption model low-latency preemptable - if for no other reason than it may make for better net response.

I'm going to mull how to bisect this further - this is a pretty large universe of possibilities to carve down to size.

Right after i return to my low-latency build.

Cheers,
M.


If you have gig internet service or want 500mbps+ wifi - NSS will be of benefit to get max speed out of your devices.

For your wifi I’d suggest some additional signal level testing in your environment. I recommend:

  1. put all your faster devices on 5ghz only (phones, tablets, laptops, etc)
  2. put your slower internet of things devices on 2.4ghz only
  3. I’d position your r7800’s far enough apart that there is predictable signal overlap for their given maximum transmit power. There are several free or cheap apps that measure dBm signal levels. Surprisingly often, turning down your transmit power helps a lot with roaming, since the client is the one that decides when to roam. Apple devices like to roam at -70dBm, other wifi devices are similar / sometimes tweakable. Tune your transmit power and access point physical positioning so that your access points overlap at about -70dBm for 5ghz. Depending on your clients they might like a higher threshold but -70dBm is a good starting point. I have my transmit power at 20 for reference.

Unfortunately no gig here. I've got 100/100 from my ISP.

I'll try your suggestions out and reduce power on all APs today, but ever since I upgraded to 22.03.5, 5 GHz behaviour has been very weird and it drops out a lot. Especially with Intel WiFi cards (AX200, AX201 and AX210).

It'll connect to the nearest AP for 15-20 seconds and then shortly after will switch to a further away AP and will refuse to connect to the closer AP no matter what I try. I've been using 2.4 GHz on all devices till I figure out how to fix this. I thought it had to do with 802.11r initially, so I did a clean install of OpenWrt and set up my 5 GHz network without 802.11r and I have the same issues.

This only happens with OpenWrt, if I run stock firmware on my R7800s, all clients will connect to the closest AP by default.

RCU stall issue: Asking for @ACwifidude help here!

OK, I'm officially puzzled, looking at the .config files for what works and what doesn't, and at the kernel .configs produced with 'make kernel_menuconfig' vs without.

Thinking I was going to go to some golden long-term state, I built a non-ct img w/o some of the packages included in the defconfig, did a 'make kernel_menuconfig' to adjust the rcu stall tmout down to 3 seconds and set the preemption model to ... preempt.

and I got an RCU TMOUT in the early hours of my morning, when not much should have been going on.

I thought I'd set the RCU tmout to 3 from 21 in my 'working' tree, but I hadn't. The scheduler model change to preempt did take ...

The kernel .config for my custom build vs the 'working' tree: the custom one had ~4.8K lines with a header,
# Automatically generated file; DO NOT EDIT.
and section comments throughout.

The 'working' kernel .config had 7.8K lines, no header, no section comments ...

What is the correct sequence to unzip a git 23.05 archive, make kernel menuconfig, and build? I'm stepping on something here.

Going back to diffconfig (no -ath10k), w/o deleting packages, build the whole thing and then look at the kernel .config -- where I'll manually edit the preemption model and RCU timeout. I think.

I'm going to write something to spit out what's unique in terms of CONFIG_ defines in the kernel .config files.
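A rough sketch of such a comparison, assuming both files are ordinary kernel .config files. The function names are made up. "# CONFIG_FOO is not set" lines are normalised to "CONFIG_FOO=n" so that a symbol which is commented out in one file and set in the other shows up as a difference.

```shell
#!/bin/sh
# Print every CONFIG_ symbol in a kernel .config as NAME=value, sorted.
# "# CONFIG_FOO is not set" becomes "CONFIG_FOO=n".
config_symbols() {
    sed -n \
        -e '/^CONFIG_[A-Za-z0-9_]*=/p' \
        -e 's/^# \(CONFIG_[A-Za-z0-9_]*\) is not set$/\1=n/p' \
        "$1" | LC_ALL=C sort
}

# Show settings unique to each file:
#   "< ..." appears only in the first file, "> ..." only in the second.
config_diff() {
    a=$(mktemp)
    b=$(mktemp)
    config_symbols "$1" > "$a"
    config_symbols "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/< /'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/> /'
    rm -f "$a" "$b"
}
```

Usage would be something like `config_diff working/.config custom/.config`, where both paths are placeholders for wherever the two kernel .config files live. The kernel source tree also ships a `scripts/diffconfig` helper that does a similar job.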


Our story so far - kernel .config changes made with 'make kernel_menuconfig' or with a manual edit aren't preserved for at least 2 items:

CONFIG_RCU_CPU_STALL_TIMEOUT=21
 -- reset to 21 after I had manually edited it to 3,

CONFIG_PREEMPT_NONE=y
 -- set again instead of being commented out.

Does anyone know if this reset-to-defaults (automatically generated kernel .config) is Openwrt -wide or specific to this build?

[I've never encountered this before -- all hints to the underlying mechanism gratefully accepted!]

Thanks!
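One way to confirm which values actually ended up in a given kernel .config after a build is a small lookup helper like the sketch below. The function name is hypothetical, and the path to the kernel .config inside the build tree is left as a placeholder, since it depends on target and kernel version.

```shell
#!/bin/sh
# Report how SYMBOL is set in a kernel .config file.
# Prints the value, "n" if the symbol is explicitly "is not set",
# or "absent" if the file does not mention it at all.
kconfig_get() {
    file=$1
    sym=$2
    if val=$(grep "^${sym}=" "$file"); then
        echo "${val#*=}"
    elif grep -q "^# ${sym} is not set" "$file"; then
        echo n
    else
        echo absent
    fi
}
```

For example:

```shell
for s in CONFIG_PREEMPT_NONE CONFIG_PREEMPT CONFIG_RCU_CPU_STALL_TIMEOUT; do
    printf '%s=%s\n' "$s" "$(kconfig_get path/to/kernel/.config "$s")"
done
```

If the running kernel was built with CONFIG_IKCONFIG_PROC, the same check can be run on the router against a decompressed copy of /proc/config.gz, which shows what the flashed image really contains.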

One thought: the RCU stall log messages are just warnings. Warnings don’t always mean there is an issue, and they are not always associated with any ongoing problem. RCU issues are not prevalent in OpenWrt. If you are having freezes, there is usually a software cause (an additional package or a setting) or a hardware cause (bad RAM? bad CPU?). I’d simplify your setup and peel things back to work out whether something (usually a package or a setting) is causing your issues.