Ipq806x NSS build (Netgear R7800 / TP-Link C2600 / Linksys EA8500)

2 ideas -
0) try it

  1. update your archive's idea of feeds and packages:
./scripts/feeds update -a
./scripts/feeds install -a
./scripts/feeds install libpam libnetsnmp liblzma

shouldn't libpam libnetsnmp liblzma be installed with the install -a command?

probably ... this is what I have in my notes. although if they're coming up missing, maybe install -a would miss them too ...

That doesn't seem to make sense:

massi@greenbook:~/rutto/22$ ./openwrt/scripts/feeds install libpam libnetsnmp liblzma
Collecting package info: done
Collecting target info: done
WARNING: Makefile 'package/utils/busybox/Makefile' has a dependency on 'libpam', which does not exist
WARNING: Makefile 'package/utils/busybox/Makefile' has a dependency on 'libpam', which does not exist
WARNING: Makefile 'package/utils/busybox/Makefile' has a build dependency on 'libpam', which does not exist
WARNING: Makefile 'package/boot/kexec-tools/Makefile' has a dependency on 'liblzma', which does not exist
WARNING: Makefile 'package/network/services/lldpd/Makefile' has a dependency on 'libnetsnmp', which does not exist
WARNING: Makefile 'package/utils/policycoreutils/Makefile' has a dependency on 'libpam', which does not exist
WARNING: Makefile 'package/utils/policycoreutils/Makefile' has a dependency on 'libpam', which does not exist
WARNING: Makefile 'package/utils/policycoreutils/Makefile' has a build dependency on 'libpam', which does not exist
Installing package 'libpam' from packages
Installing package 'net-snmp' from packages
Installing package 'pciutils' from packages
Installing package 'kmod' from packages
Installing package 'hwdata' from packages
Installing package 'xz' from packages

Seems like an ordering issue, doesn't it?
(I'm starting from an empty folder on every try.)

OK, @ACwifidude and anyone else interested in the crashes/reboots going on - first update.

running package: R7800-20220820-MasterNSS-ath10k-sysupgrade.bin
built from cloned tree: ./scripts/getver.sh --> r20385-e972c6aee5
settings generally as recommended in @ACwifidude post #2, performance cpu governor, no _min _max limits specified.

initial instance of a hang and watchdog reboot as described above in THIS post:

A kworker thread freerunning:

PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
13258     2 root     RW       0   0%  50% [kworker/1:2+eve]

I added a simple script to dump the most-active task as seen by top to my syslog - to dump that task's /proc/[pid]/status ... unfortunately only the top-most task ...

caught multiple instances of rpcd running (apparently) continuously for multiple seconds, using all of one core:

Aug 29 12:16:52 NineNet root:  1313     1 root     R     2028   0%  46% /sbin/rpcd -s /var/run/ubus/ubus.sock -t 30

Aug 29 12:16:52 NineNet root: Name: rpcd
Umask: 0077 State: R (running) Tgid: 1313 Ngid: 0 Pid: 1313 PPid: 1
...

46% in one sample, 45% in others; at this point there was still enough CPU left to allow the writes to the log server.

I suspect there's an ISR / kernel worker thread spinning as well but I'll change my dump-top script to show the top 2 or 3 ...

this was accompanied by multiples of:

Aug 29 12:16:52 NineNet root: Mon Aug 29 12:16:52 PDT 2022
Aug 29 12:16:53 NineNet kernel: [  897.477931] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Aug 29 12:16:53 NineNet kernel: [  897.477985] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Aug 29 12:16:53 NineNet kernel: [  897.484076] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon

I don't know if SWBA overrun is a cause or an effect.

When this happens, I suspect a power cycle reboot is necessary as system timekeeping breaks - router shows 1 time, syslog server shows another.

It seems the @ACwifidude build uses busybox's ntp client ... I'm going to see if there's a package for the full client and server.

At this point I've dropped my backup Archer A6 into my network. Even though it's a bit slower on my net, I'll reconfigure my R7800 to serve just my development area and see if that smaller universe can still trigger crashes.

mmmv,
M.

That’s interesting. If we isolate issues to the ntp client that would be potentially an easy fix!

again, don't know if that's a cause or effect ... we have some pretty low-level stuff running amok that I want to track down.

Depending on the size of the code base, I want to walk thru rpcd and look at where it might be spinning on some syscall ... almost looking for an unwise while (1) there.

I'm on the fence about doing any more, after living with my backup Archer A6 for an hour of web browsing. half tempted to wipe it clean, reinstall 21.02.1 and restore a prior config.

And half tempted to set the R7800 up as an access point for my development area and see if I can dump what else is looping when things go south.

This smells like multiple bugs; getting much closer to these is going to be a pig.

where / how would I build or load ntp client and luci ntp config packages??

I let my OpenWRT devices send their log to a remote syslog server. In the logs on that remote server I see the date/time going back in time right after a random reboot. AFAIK this is expected behavior since during startup syslog is able to send data to a remote syslog server sooner than ntp is able to synchronize time with a ntp-server on the Internet.

Therefore I'd guess the time going back and forth is an effect of a random reboot. I believe these are the details on how that goes:

As soon as OpenWRT boots, it determines the current system time from the most recent modification time of a certain file. Until ntp has contacted an NTP server, OpenWRT counts time from that point. This is what causes illogical timestamps on a remote syslog server.

But interesting find about rpcd and ubus. I checked the objects on my R7800 (running as router) and see that it really depends on your local configuration how many objects you have active on ubus.

ubus list example
root@OpenWrt:/usr/libexec# ubus list
dawn
dhcp
dnsmasq
file
hostapd
hostapd.wlan0
hostapd.wlan0-1
hostapd.wlan1
hostapd.wlan1-1
hostapd.wlan1-2
hotplug.block
hotplug.dhcp
hotplug.firmware
hotplug.ieee80211
hotplug.iface
hotplug.neigh
hotplug.net
hotplug.ntp
hotplug.tftp
iwinfo
log
luci
luci-rpc
luci.wireguard
network
network.device
network.interface
network.interface.IOT
network.interface.IPTV_WAN
network.interface.guest
network.interface.lan
network.interface.loopback
network.interface.wan
network.interface.wan_6
network.rrdns
network.wireless
rc
service
session
system
uci
umdns
wpa_supplicant

I ran ubus monitor for a minute and it is quite chatty, although system load is low (Load average: 0.10 0.06 0.04). I'd nominate the dawn package as a suspect since, unlike the rest of the list, it is not a default OpenWRT package, but I'm guessing others have the same random reboot behavior without dawn?

Thanks for the thoughts ...
as far as ntp goes, /usr/sbin/ntpd --> busybox. depending on how busybox implements ntp it may not update system time once the system time is too different from what's expected ... so the time just falls farther and farther back with each reboot.

I don't know if it has a "force" option.

This may well have an impact on system services that depend on accurate timing of longer intervals. maybe not.

As far as system configuration cases / issues causing or preventing reboots, my 0.00003 is:

  1. Those corner cases are not to be avoided, but fixed - they're bugs.
  2. If an optional setting is actually required for system stability, it should be moved to an R7800-specific (as apropos) module that runs at startup and forcibly sets it.

i.e. take the required settings out of the realm of folklore and make them transparently set.

I'm backing away from debugging these reboots which don't leave oops dumps - something is locking up until the system grinds to a halt. A bit o' hardware, probably close to the NSS cores or ath10k.

I need to think of a way to get a kernel backtrace and hardware status dump when the watchdog bites.

thanks again,
M.

Probably I'm wrong, but doesn't time start from 0 at every boot, since our router has no RTC? I'm not 100% sure about this, but I remember looking at the system and kernel logs some time ago and seeing a run of timestamps close to 0 in the startup log until (I assume) ntpd syncs with the time server. That doesn't seem like strange behavior (since, as above, we have no RTC).
I think it's much more interesting to try to understand what locks up system processes, but sadly I'm not good enough for this, I'm just a banking guy having fun with it :slight_smile:

From memory, I don't think it starts from 0; I seem to recall it's somewhat close to what it should be, although far enough out to cause problems and also to not go back in sync easily when ntp tries to slew the clock. I've no idea how it gets a value approx right though, given the lack of RTC...

I added the below hotplug script a while back. It gets called as soon as the lan comes up, with the -b (step) option basically forcing the correct date. I'm running bind myself and dns resolution doesn't work with the wrong time, hence the hard-coded IP....

/etc/hotplug.d/iface
router:> cat 25-synctime
#!/bin/sh

if [ "$ACTION" = "ifup" -a "$INTERFACE" = "wan" ]; then
  /usr/sbin/ntpdate -s -b -u -t 2 <NTP Server IP>
fi

It's been a while since I've done this, but I seem to recall ntpdate needs to be specifically installed. I'm also running chrony rather than ntpd on my system and I don't get random reboots, so it might be worth someone with that issue making the switch. I can't see it making any difference, but it would at least rule it out as the trigger...

Note: Edited the script to reference wan rather than lan, which would be the correct interface for a public ntp server. Simply swap to lan if you have a suitable internal ntp server, time should sync up a little quicker...

I guess it would also make sense to pick an NTP server that doesn't require a working WAN connection, then? I need to use PPPoE. I could run an NTP server on my NAS? But I won't be doing this until the weekend.

I think you've just spotted a bug in my script, I'm using a public ntp server, so INTERFACE should probably be WAN, not lan. It's obviously been working by luck up to now...

A local ntp server would probably get things up and running quicker and would be a good choice if you've got something suitable. Some sort of USB GPS device would be perfect, although I've yet to find one for a sensible price, or figure out a simple way of repurposing an old mobile to perform the task...

The clock seems to pick up where it left off ... but add'l time has elapsed thru the hang and reboot and it accumulates. I think a file with last current time is written which gets read at boot ... although it's a pretty coarse update interval.

it'd be nice if someone made that forced time sync a boot-time option.

@noblem thanks for the word on ntpdate ... I'll add that to my startup.

There are 2 services in /etc/init.d that are involved in setting up the system time: sysfixtime and sysntpd. The first contains a function that searches the files in /etc for the most recent modification time. That time is compared with the current system time; if the system time is earlier, it is set to the modification time of that most recent file:

sysfixtime
#!/bin/sh /etc/rc.common
# Copyright (C) 2013-2014 OpenWrt.org

START=00
STOP=90

RTC_DEV=/dev/rtc0
HWCLOCK=/sbin/hwclock

boot() {
	start && exit 0

	local maxtime="$(maxtime)"
	local curtime="$(date +%s)"
	[ $curtime -lt $maxtime ] && date -s @$maxtime
}

start() {
	[ -e "$RTC_DEV" ] && [ -e "$HWCLOCK" ] && $HWCLOCK -s -u -f $RTC_DEV
}

stop() {
	[ -e "$RTC_DEV" ] && [ -e "$HWCLOCK" ] && $HWCLOCK -w -u -f $RTC_DEV && \
		logger -t sysfixtime "saved '$(date)' to $RTC_DEV"
}

maxtime() {
	local file newest

	for file in $( find /etc -type f ) ; do
		[ -z "$newest" -o "$newest" -ot "$file" ] && newest=$file
	done
	[ "$newest" ] && date -r "$newest" +%s
}

It's hard to say which file in /etc gets modified on a regular basis that serves as a starting point for the system time at bootup.
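One way to find out which file that is: a minimal sketch that mirrors the maxtime() loop above but prints the file name instead of its timestamp. The function name newest_file is made up for this example, and like the original loop it assumes no file names under /etc contain whitespace.

```shell
#!/bin/sh
# newest_file DIR: print the most recently modified regular file under DIR.
# Hypothetical helper; same "-ot" comparison as maxtime() in sysfixtime.
newest_file() {
    local file newest

    for file in $(find "$1" -type f); do
        if [ -z "$newest" ] || [ "$newest" -ot "$file" ]; then
            newest=$file
        fi
    done
    # Print nothing (and return non-zero) if DIR holds no regular files.
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

Running newest_file /etc and then date -r on the result should show the file and time sysfixtime would pick at the next boot.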

This script is executed early in bootup. At a later point during bootup the script sysntpd is executed. This does some more sanity checks and only continues to run if it is enabled. If time going back and forth really is an issue for system/rpcd/ubus communication, I think the most conclusive way to rule that out is to disable the NTP client (sysntpd service) altogether. That way, time won't jump during bootup from the latest modified file in /etc to whatever an available NTP server reports later on. Log timestamps will be offset, perhaps by quite a bit, so tracing back what happened at a certain point in time will be challenging, but at least from the perspective of the services and kernel that might have trouble with time going back and forth, the system time will be consistent.

But I can't imagine why a short-lived difference in system time would cause random reboots. Those of us who experience random reboots haven't found a common time or action that could be linked to them. It is suspicious, of course, that rpcd is taking multiple seconds of a core to do something...

@Mpilon would you mind sharing your simple script so others can run that too? Perhaps we can find some leads to which process is causing the random reboot? I can save outputs of that script to a USB stick I've got mounted in my R7800. Or perhaps even send it remotely to a different host using netcat.
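In the meantime, here is a minimal sketch of what such a dump script could look like. It is not @Mpilon's actual script. The function names top_n and dump_top and the syslog tag dumptop are made up, and it assumes top prints a column header starting with PID followed by tasks sorted by CPU, as in the busybox output quoted earlier.

```shell
#!/bin/sh
# Hypothetical helpers (top_n, dump_top); not the original poster's script.

# top_n N: read `top -b -n 1` output on stdin and print the first N task rows,
# i.e. the N lines that follow the "PID ..." column header.
top_n() {
    awk -v n="$1" '
        seen && count < n { print; count++ }
        $1 == "PID"       { seen = 1 }
    '
}

# dump_top [N]: log the top N tasks (default 3) plus the first lines of each
# task's /proc/<pid>/status to syslog under the tag "dumptop".
dump_top() {
    top -b -n 1 | top_n "${1:-3}" | while read -r pid rest; do
        logger -t dumptop "$pid $rest"
        if [ -r "/proc/$pid/status" ]; then
            head -n 8 "/proc/$pid/status" | logger -t dumptop
        fi
    done
}
```

Something like "while true; do dump_top 3; sleep 5; done &" would then log the top 3 tasks every 5 seconds to wherever your syslog goes; redirecting to a file on the USB stick instead of calling logger should work just as well.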

Just a thought on the random reboots, but do those having them have CONFIG_PSTORE_CONSOLE enabled in the kernel config? I'm not sure how you can check from router CLI if it's enabled or not, but I believe /sys/fs/pstore/console-ramoops-0 will exist if enabled and that it should be a copy of the kernel log buffer from prior to the reboot. In theory, if the reboots are triggered by a watchdog then it would be logged here...

If it's not enabled then it can be enabled via make kernel_menuconfig, then navigating to File Systems, Miscellaneous filesystems, Log kernel console messages via the menu system
If it's already enabled then just ignore me :slight_smile:
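If the file does show up after a reboot, it could be worth copying it somewhere persistent before it gets lost. A rough sketch, assuming pstore is mounted at /sys/fs/pstore; the function name save_pstore and the default destination /root/pstore are made up for this example, and the destination lives on flash, so point it at a USB stick if you have one.

```shell
#!/bin/sh
# save_pstore [SRC] [DST]: copy any files found in the pstore directory to a
# persistent location, adding a timestamp suffix to each copy.
# Hypothetical helper; SRC defaults to /sys/fs/pstore, DST to /root/pstore.
save_pstore() {
    local src dst f

    src=${1:-/sys/fs/pstore}
    dst=${2:-/root/pstore}
    for f in "$src"/*; do
        # An empty directory leaves the glob unexpanded; skip that case.
        [ -f "$f" ] || continue
        mkdir -p "$dst"
        cp "$f" "$dst/$(basename "$f").$(date +%Y%m%d-%H%M%S)"
    done
}
```

Calling save_pstore from a startup script would keep a copy of every console-ramoops-0 that appears; note the timestamp in the file name is only as good as the system clock at that point in boot.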

On a side note, I came across this GPS guide in the docs last night, the first device looks like a pretty cheap GPS based clock source, I should hopefully have one by the weekend to have a play with...

AFAIK the R7800 should have ramoops enabled since 22.03 was branched off master, or perhaps a little earlier already. I have the /sys/fs/pstore filesystem mounted:

root@ap-R7800:/sys/fs/pstore# mount
/dev/root on /rom type squashfs (ro,relatime)
...
pstore on /sys/fs/pstore type pstore (rw,noatime)

But I don't see any file (yet) in that directory:

root@ap-R7800:~# ls -la /sys/fs/pstore/
drwxr-x---    2 root     root             0 Aug 25 21:12 .
drwxr-xr-x   10 root     root             0 Jan  1  1970 ..

Does this mean that although the pstore filesystem is available and mounted, there is no reservation somewhere for the latest kernel memory? I'm not familiar with enabling ramoops...

Maybe I'm talking nonsense, but what if we create a script that rewrites an empty file in /etc every 5 seconds, for example? It would always be the most recent file, with the most recent time. If the problem is the time lag, that could solve it. How to create that script? I don't know, I guess with crontab and a mk? No idea, sorry.

Remember you can't do that endlessly, because you will wear out the flash, but you could use something similar to:

/etc/init.d/writetime

#!/bin/sh /etc/rc.common
USE_PROCD=1
START=90
STOP=21

boot() {
    echo "placeholder" > /dev/null
}

stop_service() {
    chronyc tracking | grep -q "Normal" && touch /etc/config/last-reboot
}

You need to replace "chronyc tracking | grep -q Normal" with the ntpd equivalent, because I use chronyd.

What the line above does is: if the time is correct, touch the file in /etc at reboot.

Or you can put the same line in a crontab that runs every day; that depends on your requirements.
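For the crontab variant, a small sketch of the same idea wrapped in a function, so the "is the time correct" check can be swapped out. The name stamp_time is made up; the chronyc check is the one from the init script above, and I don't know what the ntpd equivalent would be.

```shell
#!/bin/sh
# stamp_time FILE CHECK...: touch FILE only if the CHECK command succeeds.
# Hypothetical helper; CHECK is whatever tells you the clock is trustworthy.
stamp_time() {
    local file

    file=$1
    shift
    "$@" && touch "$file"
}
```

With chrony that would be something like stamp_time /etc/config/last-reboot sh -c 'chronyc tracking | grep -q Normal', run once a day from cron to keep flash writes low.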