Ipq806x NSS build (Netgear R7800 / TP-Link C2600 / Linksys EA8500)

Mpilon · August 31, 2022, 4:54pm

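This repeatedly logs the top-most process that 'top' finds ...
You can stick it at the bottom of your /etc/rc.local (above the exit 0 line!). The loop never exits, so in rc.local end it with "done &" to run it in the background; otherwise rc.local never returns.

... or scp it over and run it from the command line ...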

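#!/bin/ash
# Every 5 seconds, log the process at the top of top's list plus its /proc status.
set +e # ignore invoked command errors (e.g. the process exits mid-check)
while sleep 5; do

    # assumes busybox top's 4 header lines, so line 5 is the first process row
    top_task=$(top -b -n 1 | head -5 | tail -1)
    top_pid=$(echo "$top_task" | awk '{print $1}')
    # skip the round if no PID was parsed or the process is already gone
    if [ -n "$top_pid" ] && [ -d "/proc/$top_pid" ]; then
        logger "$top_task"
        # flatten the multi-line status file into a single log message
        logger "$(tr '\n' ' ' < "/proc/$top_pid/status")"
    #   logger "$(date)"
    fi

done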

if you exec it from the cmd line, chmod +x the file you write this to ...

Mpilon · August 31, 2022, 5:12pm

@ACwifidude - getting a compile error for rpcd-mod-luci ... in brief:

in luci.c:

1086 |                                 if (nret & IWINFO_CIPHER_CCMP256)
      |                                            ^~~~~~~~~~~~~~~~~~~~~
      |                                            IWINFO_CIPHER_CCMP

 IWINFO_CIPHER_CCMP256 not defined ...

FULL compile line:

FAILED: CMakeFiles/rpcd-mod-luci.dir/luci.c.o 
/home/2TB_EXT/projects/OpenWRT/openwrt-22.03/ACwifidude-master/openwrt/staging_dir/toolchain-arm_cortex-a15+neon-vfpv4_gcc-11.3.0_musl_eabi/bin/arm-openwrt-linux-muslgnueabi-gcc -Drpcd_mod_luci_EXPORTS  -Os -pipe -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -mfloat-abi=hard -fmacro-prefix-map=/home/2TB_EXT/projects/OpenWRT/openwrt-22.03/ACwifidude-master/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/rpcd-mod-luci-20210614=rpcd-mod-luci-20210614 -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -I/home/2TB_EXT/projects/OpenWRT/openwrt-22.03/ACwifidude-master/openwrt/staging_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/usr/include/libnl-tiny -I/home/2TB_EXT/projects/OpenWRT/openwrt-22.03/ACwifidude-master/openwrt/staging_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/usr/include -DNDEBUG -fPIC   -Os -Wall -Werror --std=gnu99 -g3 -Wmissing-declarations -MD -MT CMakeFiles/rpcd-mod-luci.dir/luci.c.o -MF CMakeFiles/rpcd-mod-luci.dir/luci.c.o.d -o CMakeFiles/rpcd-mod-luci.dir/luci.c.o -c /home/2TB_EXT/projects/OpenWRT/openwrt-22.03/ACwifidude-master/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/rpcd-mod-luci-20210614/luci.c
/home/2TB_EXT/projects/OpenWRT/openwrt-22.03/ACwifidude-master/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/rpcd-mod-luci-20210614/luci.c: In function 'rpc_luci_get_iwinfo':
/home/2TB_EXT/projects/OpenWRT/openwrt-22.03/ACwifidude-master/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/rpcd-mod-luci-20210614/luci.c:1086:44: error: 'IWINFO_CIPHER_CCMP256' undeclared (first use in this function); did you mean 'IWINFO_CIPHER_CCMP'?
 1086 |                                 if (nret & IWINFO_CIPHER_CCMP256)
      |                                            ^~~~~~~~~~~~~~~~~~~~~
      |                                            IWINFO_CIPHER_CCMP
/home/2TB_EXT/projects/OpenWRT/openwrt-22.03/ACwifidude-master/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/rpcd-mod-luci-20210614/luci.c:1086:44: note: each undeclared identifier is reported only once for each function it appears in
/home/2TB_EXT/projects/OpenWRT/openwrt-22.03/ACwifidude-master/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/rpcd-mod-luci-20210614/luci.c:1092:44: error: 'IWINFO_CIPHER_GCMP256' undeclared (first use in this function); did you mean 'IWINFO_CIPHER_GCMP'?
 1092 |                                 if (nret & IWINFO_CIPHER_GCMP256)
      |                                            ^~~~~~~~~~~~~~~~~~~~~
      |                                            IWINFO_CIPHER_GCMP
ninja: build stopped: subcommand failed.

for my local build I'll just hack it to bypass these 2 checks - thoughts?
M.

Mpilon · August 31, 2022, 5:23pm

while I'm here - @ACwifidude -- prior builds had traditionally named packages in the bin/targets ... dir as well as some named R7800-20220820-MasterNSS-ath10k-sysupgrade.bin

what's the difference?
My latest build efforts have only yielded the traditionally named images. Am I getting hardware NSS support with these?

noblem · August 31, 2022, 5:37pm

From what I remember, the dts has all the necessary definitions to support pstore, but I think it was left to whoever was compiling the firmware to enable the necessary kernel features to support it. The fact that you don't have console-ramoops-0 probably means that option for the kernel log wasn't enabled.

Crash logging is a separate option, so it's entirely possible that one is enabled and working without the other. I don't think there's any easy way to tell without having ikconfig enabled, which isn't standard either, but which exposes the whole kernel config via /proc/config.gz.

I think it might be a case of asking the person who built the image to add it in, if that's @ACwifidude then I'm sure he wants to fix the random reboots as much as everyone else and CONFIG_PSTORE_CONSOLE may help...
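If the image does happen to have ikconfig, a check along these lines would show whether the pstore options made it into the kernel. This is only a sketch: kconfig_enabled is a helper name made up here, and CONFIG_PSTORE is added to the list as an assumption alongside the CONFIG_PSTORE_CONSOLE option mentioned above.

```shell
#!/bin/sh
# kconfig_enabled NAME: read kernel config text on stdin and succeed
# if NAME is set to y (built in) or m (module).
# (hypothetical helper name, not part of OpenWrt)
kconfig_enabled() {
    grep -Eq "^$1=(y|m)\$"
}

# Only meaningful on a kernel built with ikconfig, where /proc/config.gz exists.
if [ -r /proc/config.gz ]; then
    for opt in CONFIG_PSTORE CONFIG_PSTORE_CONSOLE; do
        if zcat /proc/config.gz | kconfig_enabled "$opt"; then
            echo "$opt is enabled"
        else
            echo "$opt is not enabled"
        fi
    done
else
    echo "/proc/config.gz not present; kernel was built without ikconfig" >&2
fi
```

Without ikconfig the script can only report that the config is not exposed, which is the situation described above.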

Mpilon · August 31, 2022, 5:55pm

iirc, ntp only will set system time if it's not too far different from what's expected ... once it deviates too far it stays that way. this is a distant memory here ...

mmmv,
M.

pattagghiu · August 31, 2022, 6:06pm

same error here, i was trying to build a master with pppoe to test..

btw, my 21 is

Uptime	21d 11h 51m 44s

i'd really REALLY like to have wifi fixes on this version..

noblem · August 31, 2022, 6:10pm

ntp always used to default to slew rather than step and that never worked well when the clock was too far out, I think some implementations may have better logic, but I doubt the default busybox implementation is one of them...

Having something touch a file in /etc should mean that sysfixtime will get the clock somewhere close to what it should, allowing ntp to get things to where they need to be, but I think a hotplug script to forcibly step time to the right value as soon as the wan comes up is the safer option as you aren't constantly writing to flash
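A rough sketch of such a hotplug script, untested on an R7800. The file name (/etc/hotplug.d/iface/99-steptime), the wan_came_up helper and the server name are all assumptions; it relies on busybox ntpd's -q (quit once the clock is set), -n (stay in the foreground) and -p (peer) options and on the ACTION/INTERFACE variables that iface hotplug scripts receive.

```shell
#!/bin/sh
# Hypothetical /etc/hotplug.d/iface/99-steptime
# Sets the clock once each time the wan interface comes up.

# Assumed server; use whatever your ntp config already points at.
NTP_SERVER="${NTP_SERVER:-0.openwrt.pool.ntp.org}"

# wan_came_up ACTION INTERFACE: succeed only for an ifup event on wan.
wan_came_up() {
    [ "$1" = "ifup" ] && [ "$2" = "wan" ]
}

# ACTION and INTERFACE are provided by the hotplug caller; when they are
# unset (e.g. the file is run by hand) nothing happens.
if wan_came_up "${ACTION:-}" "${INTERFACE:-}"; then
    if ntpd -q -n -p "$NTP_SERVER"; then
        logger -t steptime "clock set from $NTP_SERVER"
    else
        logger -t steptime "ntpd could not set the clock"
    fi
fi
```

Nothing is written to flash here, which is the point of doing it this way rather than touching a file in /etc.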

Mpilon · August 31, 2022, 7:23pm

I changed file:
/home/2TB_EXT/projects/OpenWRT/openwrt-22.03/ACwifidude-master/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/rpcd-mod-luci-20210614/luci.c

at lines: 1086, 1092 like so:

				///// HACK: was IWINFO_CIPHER_CCMP256
				if (nret & IWINFO_CIPHER_CCMP/*256*/)
					blobmsg_add_string(buf, NULL, "ccmp-256");

				if (nret & IWINFO_CIPHER_GCMP)
					blobmsg_add_string(buf, NULL, "gcmp");
				///// HACK: was IWINFO_CIPHER_GCMP256
				if (nret & IWINFO_CIPHER_GCMP/*256*/)
					blobmsg_add_string(buf, NULL, "gcmp-256");

and the compile succeeded -- no idea if/what I've screwed up but since it's a luci module, maybe it's just reporting via the web interface.
mmmv,
M.
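One visible side effect of the substitution: "ccmp-256" and "gcmp-256" now get added whenever the plain CCMP/GCMP bits are set, so it should only change what is reported. To see whether the iwinfo header in the staging dir actually knows the newer constants, something like the following could be run from the top of the build tree. It is a sketch only: the header location under usr/include and the header_mentions helper are assumptions, and a missing constant merely suggests that libiwinfo is older than what rpcd-mod-luci expects.

```shell
#!/bin/sh
# header_mentions FILE NAME: succeed if NAME appears in FILE as a whole word.
# (hypothetical helper name)
header_mentions() {
    grep -qw -- "$2" "$1"
}

# Assumed location of the installed header; override STAGING for your tree.
STAGING="${STAGING:-staging_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi}"
HDR="$STAGING/usr/include/iwinfo.h"

if [ -r "$HDR" ]; then
    for c in IWINFO_CIPHER_CCMP256 IWINFO_CIPHER_GCMP256; do
        if header_mentions "$HDR" "$c"; then
            echo "$c: present"
        else
            echo "$c: missing from $HDR"
        fi
    done
else
    echo "no readable header at $HDR" >&2
fi
```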

Mpilon · August 31, 2022, 11:36pm

I've built a new nss openwrt tree, with a few changes from default; "desktop" version of BusyBox (haven't verified that's what I actually have) - because why not?

ath10k, not ath10k-ct.

And I may have been causing some of my problems because I missed the comment about nss sqm not being compatible with its luci interface page.

I un-configured the luci sqm module so I'm forced to command-line config sqm.

Going to live with this a while ... It's already much more stable than my crashing build. Time to establish a base line.

ACwifidude · September 1, 2022, 3:11am

Let us know what you find out.

pattagghiu · September 1, 2022, 6:13am

ok i'm up with a master with pppoe, totally clean image (so i'll need to take this out quite soon), but let's see what happens

NSS works

sppmaster · September 1, 2022, 7:16am

I see that the ds-lite & l2tpv2 packages exist in your last master build (at least in your diffconfig file).
Are you going to remove them in your next build? It was stated that they may interfere with NSS-pppoe.

tishipp · September 1, 2022, 8:36am

DS-Lite is stable.

I have been testing ACwifidude's master branch plus my patches with a simple .config.
pppoe and ds-lite have stayed connected for 5 days with NSS offloading.
I will report if anything happens.

sppmaster · September 1, 2022, 8:38am

Then it's just l2tpv2, isn't it?
I run an older snapshot without ds-lite & l2tpv2 and its current state is excellent. But it doesn't use pppoe.

pattagghiu · September 1, 2022, 11:05am

waiting for the end

pattagghiu · September 1, 2022, 1:06pm

While i wait for a reboot, have you any idea on this ksmbd error on startup?
i had never seen anything like this

Thu Sep  1 15:03:21 2022 daemon.notice ksmbd: Stopping Ksmbd userspace service.
Thu Sep  1 15:03:21 2022 kern.err kernel: [25019.344445] Module has invalid ELF structures
Thu Sep  1 15:03:21 2022 daemon.err ksmbd: modprobe of ksmbd module failed, can't start ksmbd!

ksmbd was built with the image, not installed after

JayJax · September 2, 2022, 1:59pm

For some reason the NBG6817-20220831-MasterNSS-ath10k-sysupgrade.bin does not seem to work in the wifi area. I can't seem to start the radios at all.

Mpilon · September 2, 2022, 5:12pm

Another crash after I changed 1 thing -- had been up for 24+ hours.

Don't know if this is significant or an example of me being an idiot.
@ACwifidude -- I'd appreciate an overview of how this works; confirming that I'm an idiot is optional.

Seeing the nlbwmon/collectd out-of-memory msgs in the syslog, I did the sensible (!) thing and disabled nlbwmon.

soon after, a reboot occurred.

./scripts/getver.sh ---> r20385-e972c6aee5

I don't know how deep the dependencies for nlbwmon go but syslog showed a couple of errors which happened just before prior reboots:

the logged error: NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!

and the relevant bit of syslog -- NineNet is the R7800, zorro its log host.

Sep  2 09:27:17 NineNet dnsmasq[1]: possible DNS-rebind attack detected: tag.1rx.io
Sep  2 09:27:40 NineNet kernel: [  856.649025] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
Sep  2 09:27:40 NineNet kernel: [  856.649056] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
Sep  2 09:27:47 NineNet dnsmasq[1]: possible DNS-rebind attack detected: ap.lijit.com
.
.
.
Sep  2 09:29:17 NineNet dnsmasq[1]: possible DNS-rebind attack detected: ap.lijit.com
Sep  2 09:29:17 NineNet dnsmasq[1]: possible DNS-rebind attack detected: htlb.casalemedia.com
Sep  2 09:29:17 NineNet dnsmasq[1]: possible DNS-rebind attack detected: fastlane.rubiconproject.com
Sep  2 09:29:29 NineNet kernel: [  965.022070] ath10k_pci 0000:01:00.0: wmi command 36967 timeout, restarting hardware
Sep  2 09:29:29 NineNet kernel: [  965.022121] ath10k_pci 0001:01:00.0: wmi command 36967 timeout, restarting hardware
Sep  2 09:29:30 NineNet kernel: [  965.852749] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  2 09:29:30 NineNet kernel: [  965.852793] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  2 09:29:30 NineNet kernel: [  965.858879] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  2 09:29:30 NineNet kernel: [  965.866332] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  2 09:29:30 NineNet kernel: [  965.873529] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  2 09:29:30 NineNet kernel: [  965.880755] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  2 09:29:30 NineNet kernel: [  965.888182] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  2 09:29:30 NineNet kernel: [  965.895396] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep  2 09:29:30 NineNet kernel: [  966.035385] ieee80211 phy0: Hardware restart was requested
Sep  2 09:29:30 NineNet kernel: [  966.038956] ieee80211 phy1: Hardware restart was requested
Sep  2 09:29:36 NineNet kernel: [  972.277503] ath10k_warn: 113 callbacks suppressed
Sep  2 09:29:36 NineNet kernel: [  972.277519] ath10k_pci 0000:01:00.0: Unknown eventid: 36933
Sep  2 09:30:01 zorro CRON[3249555]: (root) CMD ([ -x /etc/init.d/anacron ] && if [ ! -d /run/systemd/system ]; then /usr/sbin/invoke-rc.d anacron start >/dev/null; fi)
Sep  2 09:30:15 zorro kernel: [79884.582395] r8169 0000:05:00.0 enp5s0: Link is Down

I don't know if this gets us closer to understanding the crashes or if this is something I explicitly (unwittingly) caused - if so, it may be a candidate for being made more idiot-proof, even if I'm the idiot. my 0.000002

I don't have enough big-picture of this to attempt a fix but have re-enabled nlbwmon and fixed the buffer size in its config file.

Thoughts?

Mpilon · September 2, 2022, 5:38pm

AND setting the buffersize correctly didn't help - it rebooted again. Those were the only changes in over a day of operation. I'm resetting nlbwmon's config size to 524288

vochong · September 2, 2022, 6:45pm

Please always disable packet steering. Packet steering serves no purpose in the context of NSS offloading for regular home users. Only the first few packets of a new connection are punted to the Krait cores for netfilter processing. Once a connection's tuple characteristics and connection state have been established, all subsequent packets are offloaded to the NSS cores for processing. Enabling packet steering also causes this annoying error to show up in dmesg very frequently:

NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
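For anyone who prefers the command line, a minimal sketch of turning it off with uci, assuming the stock globals section in /etc/config/network (the packet_steering option under network.globals). The function name is made up for illustration.

```shell
#!/bin/sh
# Turn off packet steering in /etc/config/network and save the change.
# (hypothetical function name)
disable_packet_steering() {
    uci set network.globals.packet_steering='0' &&
    uci commit network
}

# Only act when uci is available, i.e. when run on the router itself.
if command -v uci >/dev/null 2>&1; then
    disable_packet_steering &&
        echo "packet steering disabled; reload the network to apply"
fi
```

Afterwards reload the network (/etc/init.d/network reload) or reboot. The same switch is in LuCI under the global network options on the Interfaces page.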