Another crash after I changed 1 thing -- had been up for 24+ hours.
I don't know if this is significant or an example of me being an idiot.
@ACwifidude -- I'd appreciate an overview of how this works; an overview of me being an idiot is optional.
Seeing the nlbwmon/collectd out-of-memory messages in the syslog, I did the sensible (!) thing and disabled nlbwmon.
Soon after, a reboot occurred.
./scripts/getver.sh ---> r20385-e972c6aee5
I don't know how deep the dependencies for nlbwmon go, but syslog showed a couple of errors that have also appeared just before prior reboots.
The logged error: NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
And the relevant bit of syslog (NineNet is the R7800, zorro is its log host):
Sep 2 09:27:17 NineNet dnsmasq[1]: possible DNS-rebind attack detected: tag.1rx.io
Sep 2 09:27:40 NineNet kernel: [ 856.649025] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
Sep 2 09:27:40 NineNet kernel: [ 856.649056] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
Sep 2 09:27:47 NineNet dnsmasq[1]: possible DNS-rebind attack detected: ap.lijit.com
.
.
.
Sep 2 09:29:17 NineNet dnsmasq[1]: possible DNS-rebind attack detected: ap.lijit.com
Sep 2 09:29:17 NineNet dnsmasq[1]: possible DNS-rebind attack detected: htlb.casalemedia.com
Sep 2 09:29:17 NineNet dnsmasq[1]: possible DNS-rebind attack detected: fastlane.rubiconproject.com
Sep 2 09:29:29 NineNet kernel: [ 965.022070] ath10k_pci 0000:01:00.0: wmi command 36967 timeout, restarting hardware
Sep 2 09:29:29 NineNet kernel: [ 965.022121] ath10k_pci 0001:01:00.0: wmi command 36967 timeout, restarting hardware
Sep 2 09:29:30 NineNet kernel: [ 965.852749] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep 2 09:29:30 NineNet kernel: [ 965.852793] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep 2 09:29:30 NineNet kernel: [ 965.858879] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep 2 09:29:30 NineNet kernel: [ 965.866332] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep 2 09:29:30 NineNet kernel: [ 965.873529] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep 2 09:29:30 NineNet kernel: [ 965.880755] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep 2 09:29:30 NineNet kernel: [ 965.888182] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep 2 09:29:30 NineNet kernel: [ 965.895396] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon
Sep 2 09:29:30 NineNet kernel: [ 966.035385] ieee80211 phy0: Hardware restart was requested
Sep 2 09:29:30 NineNet kernel: [ 966.038956] ieee80211 phy1: Hardware restart was requested
Sep 2 09:29:36 NineNet kernel: [ 972.277503] ath10k_warn: 113 callbacks suppressed
Sep 2 09:29:36 NineNet kernel: [ 972.277519] ath10k_pci 0000:01:00.0: Unknown eventid: 36933
Sep 2 09:30:01 zorro CRON[3249555]: (root) CMD ([ -x /etc/init.d/anacron ] && if [ ! -d /run/systemd/system ]; then /usr/sbin/invoke-rc.d anacron start >/dev/null; fi)
Sep 2 09:30:15 zorro kernel: [79884.582395] r8169 0000:05:00.0 enp5s0: Link is Down
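For anyone who wants to check their own remote syslog for the same sequence, here is a rough Python sketch that counts the kernel messages seen above and records the uptime at which each first appears. The category names, the `summarize` function, and the assumption that every line follows the same "date host kernel: [uptime] message" layout as this excerpt are all mine; it is a quick tally under those assumptions, not a diagnosis tool.

```python
import re
from collections import Counter

# Message signatures copied from the log excerpt; the keys are made-up labels.
SIGNATURES = {
    "nohz_softirq": re.compile(r"NOHZ tick-stop error"),
    "wmi_timeout": re.compile(r"wmi command \d+ timeout"),
    "swba_overrun": re.compile(r"SWBA overrun"),
    "hw_restart": re.compile(r"Hardware restart was requested"),
}

# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> kernel: [<uptime>] <message>"
KERNEL_LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>[\d.]+)\]\s+(?P<msg>.*)$"
)


def summarize(lines, host):
    """Count known signatures for one host and note the uptime of the first hit."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = KERNEL_LINE.match(line.strip())
        if not m or m.group("host") != host:
            continue  # skip non-kernel lines and lines from other hosts
        for name, pattern in SIGNATURES.items():
            if pattern.search(m.group("msg")):
                counts[name] += 1
                first_seen.setdefault(name, float(m.group("uptime")))
    return counts, first_seen


# A few lines taken from the excerpt above, including one from the log host.
SAMPLE = [
    "Sep 2 09:27:40 NineNet kernel: [ 856.649025] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!",
    "Sep 2 09:27:40 NineNet kernel: [ 856.649056] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!",
    "Sep 2 09:29:29 NineNet kernel: [ 965.022070] ath10k_pci 0000:01:00.0: wmi command 36967 timeout, restarting hardware",
    "Sep 2 09:29:30 NineNet kernel: [ 965.852749] ath10k_pci 0001:01:00.0: SWBA overrun on vdev 0, skipped old beacon",
    "Sep 2 09:29:30 NineNet kernel: [ 966.035385] ieee80211 phy0: Hardware restart was requested",
    "Sep 2 09:30:15 zorro kernel: [79884.582395] r8169 0000:05:00.0 enp5s0: Link is Down",
]

if __name__ == "__main__":
    counts, first_seen = summarize(SAMPLE, host="NineNet")
    for name in SIGNATURES:
        print(name, counts[name], first_seen.get(name))
```

To run it against a real file, pass the open file object instead of SAMPLE, e.g. `summarize(open("/var/log/syslog"), host="NineNet")`; the path is only an example and depends on where the log host writes.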
I don't know if this gets us closer to understanding the crashes or if this is something I explicitly (if unwittingly) caused. If so, it may be a candidate for being made more idiot-proof, even if I'm the idiot. My 0.000002.
I don't have enough of the big picture to attempt a fix, but I have re-enabled nlbwmon and fixed the buffer size in its config file.
Thoughts?