Using NUT for UPS Monitoring - calling NUT experts!

supersebbo · June 12, 2019, 10:08pm

Hi folks,

I'm trying to get to grips with NUT. The documentation for the UCI-enabled OpenWRT package is quite... rudimentary. And the official NUT documentation makes a lot of assumptions around configuration if you installed from the distro packages (for Debian etc).

I've got as far as configuring the driver and setting up the server and users. Also got the local monitor working and I can query the UPS using upsc. Also have the nut client configured on my other devices (NAS, file-server and security cam recorder) and able to communicate with the NUT server. All good.

My biggest confusion now is how to get this all working in beautiful harmony. The NUT config guide lists nearly all the UCI parameters as 'required', but then shows 'working examples' with them all omitted. There is no description of the 'default' behaviour if it's left like this. Maybe it does nothing, maybe it auto-magically shuts everything down when the power fails. I just don't know.

What I ultimately want to achieve is this, upon a power failure:

1. High power devices (NAS and file-server) are commanded to shut down cleanly immediately. I think this is done by putting an 'EXEC' script in the upsmon clients on these devices to run "/sbin/shutdown now" on the 'ONBATT' notification. Easy enough.
2. Upon a 'low battery' warning from the UPS my security cam recorder is commanded to shut down. This is the most default behaviour, and I think it just needs "/sbin/shutdown now" as the SHUTDOWNCMD for this upsmon client.
3. After the FINALDELAY timeout the OpenWRT router (which is the UPS master) should:
   - Write the current logd ring-buffer to file (logread > /root/power-fail.log)
   - Make the file system safe (go read-only)
   - Finally command the UPS to cut the load power (switching itself off in the process).

I think I understand how to achieve 1 and 2. 3 is a little trickier. The NUT user guide implies some of this is auto-magic ('You don't need to do this if installing from packages') but I'm not sure this applies to the OpenWRT package version. So do I need to write my own shell 'shutdown' script? The step of making the file system 'safe and read-only' is frequently covered in the NUT documentation as being something obvious and easy to achieve, and it frequently pops up in sentences like: "You should configure your system to power down the UPS after the filesystems are remounted read-only." What would this look like on an OpenWRT system? You can't simply call the 'halt' command here because you need to call the 'upsdrvctl shutdown' command to remove the power after the system is made 'safe'.

Now, a further issue I have with this is that if the power is restored between steps 1 and 3, we enter an error state: the NAS and server will be powered off, but power will have been restored to the UPS, so they won't power-cycle because there was no real power interruption. I think I can get around this by scripting Wake-On-LAN magic packets into a script to be called upon the ONLINE event. However, this is going to be susceptible to race conditions.
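One way to soften that race, as a rough sketch only: rather than sending a single magic packet, run a fixed number of rounds and, in each round, wake the host only if it does not answer a probe. A host that is still shutting down during the first round (and therefore still answers) gets its packet in a later round. The names wake_until_up, WAKE_CMD, PROBE_CMD and RETRY_DELAY are made up for illustration, and the sketch assumes wakeonlan and a ping that accepts -c and -W are installed.

```sh
#!/bin/sh
# Sketch: repeat Wake-On-LAN over several rounds instead of sending it once.
# WAKE_CMD, PROBE_CMD and RETRY_DELAY are hypothetical knobs, not NUT settings.
WAKE_CMD="${WAKE_CMD:-wakeonlan}"
PROBE_CMD="${PROBE_CMD:-ping -c 1 -W 2}"
RETRY_DELAY="${RETRY_DELAY:-30}"

# wake_until_up MAC HOST ROUNDS
# Runs all ROUNDS; returns 0 if HOST answers the final probe, non-zero otherwise.
wake_until_up() {
    mac="$1"
    host="$2"
    rounds="$3"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Only send a magic packet when the host does not answer right now.
        if ! $PROBE_CMD "$host" >/dev/null 2>&1; then
            $WAKE_CMD "$mac" >/dev/null 2>&1
        fi
        sleep "$RETRY_DELAY"
        i=$((i + 1))
    done
    # The exit status of the last probe is the function's result.
    $PROBE_CMD "$host" >/dev/null 2>&1
}

# Example (placeholder MAC and address):
# wake_until_up xx:yy:zz:11:22:33 192.168.1.20 5
```

The loop deliberately does not return early on the first successful probe, because a host that is halfway through shutting down can still answer a ping.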

I'd apprecaite any advice or input on this!

jeff · June 12, 2019, 10:24pm

What's wrong with something like

#!/bin/sh
logread > /root/power-fail.log
sync
mount -o remount,ro /
do_kill_power

overlayfs:/overlay on / type overlay (ro,noatime,lowerdir=/,upperdir=/overlay/upper,workdir=/overlay/work)

supersebbo · June 13, 2019, 7:16am

Probably nothing, I just didn't know how to do this!

Is the last line on the overlay FS needed also?

Thanks!

jeff · June 13, 2019, 9:21am

Last line, if I understood your reference, just shows that it marked the overlay as read-only. Dealing with overlays can be tricky.
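Since a remount can fail when something still holds files open for writing, it may be worth checking the result rather than assuming it. A minimal sketch, assuming awk is available and that /proc/mounts has the usual layout (device, mount point, type, options); is_readonly is a made-up helper name:

```sh
#!/bin/sh
# Sketch: succeed only if the given mount point is currently mounted read-only.
# is_readonly MOUNTPOINT [MOUNTS_FILE]   (MOUNTS_FILE defaults to /proc/mounts)
is_readonly() {
    # The last matching entry wins, because later mounts (such as the
    # overlay on /) shadow earlier ones at the same mount point.
    awk -v mp="$1" '
        $2 == mp { opts = $4 }
        END { exit !(opts ~ /(^|,)ro(,|$)/) }
    ' "${2:-/proc/mounts}"
}

# Example:
# mount -o remount,ro / && is_readonly / || logger -s "remount read-only failed"
```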

supersebbo · June 13, 2019, 10:51am

Ah ok, I wasn't sure if that was meant to be part of the shell script.

supersebbo · June 13, 2019, 11:25am

So I think I have this worked out; it took more scripting than I was expecting. NUT does have some default behaviours, which IMHO are not very clearly documented. It will command 'slave' upsmon clients to call their SHUTDOWNCMD when the UPS system goes 'critical' by issuing an 'FSD' event. If you just have one UPS, this is defined as when the UPS reports 'On Battery' and 'Low Battery' simultaneously. However, this only invokes the commands defined in the configs, so if they are not defined, nothing actually happens. The base design of a NUT system is that you run everything until the UPS reports 'low batt', then shut everything down immediately. As I want to do it slightly differently and shut down high-drain devices sooner to allow my security cam devices to run longer, it's all 'custom'.
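To make the 'critical' condition concrete, here is a small sketch that classifies a ups.status string the way it is described above: OB together with LB counts as critical. classify_status is a hypothetical helper, and the UPS name in the usage comment is a placeholder.

```sh
#!/bin/sh
# Sketch: classify a NUT ups.status string such as "OL", "OB DISCHRG" or "OB LB".
# Prints "critical" when both OB and LB are present, "onbatt" when only OB
# is present, and "online" otherwise.
classify_status() {
    ob=0
    lb=0
    # $1 is left unquoted on purpose so the status splits into flags.
    for flag in $1; do
        case "$flag" in
            OB) ob=1 ;;
            LB) lb=1 ;;
        esac
    done
    if [ "$ob" -eq 1 ] && [ "$lb" -eq 1 ]; then
        echo critical
    elif [ "$ob" -eq 1 ]; then
        echo onbatt
    else
        echo online
    fi
}

# Example (assumes a UPS called "myups" served by a local upsd):
# classify_status "$(upsc myups@localhost ups.status)"
```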

So, to achieve what I want, I have a custom notify script on the master, which is fairly standard, except it includes some logic to send wake-on-lan packets to the high-drain devices that might have already powered off.

#!/bin/sh
# Script called by NUT for EXEC notifications. 
# Note: no spaces around '=' in a shell assignment.
EVENT="${NOTIFYTYPE}"
case "$EVENT" in
    "ONLINE")
        logger -s "UPS: [ONLINE] Primary power has been restored."
        logger -s "UPS: [ONLINE] Removing kill-power flag."
        rm -f /etc/killpower
        # Some systems may have already shutdown due to the ONBATT condition.  Send them Wake-On-Lan packets.  
        # First we wait some time to eliminate race conditions as dependent systems may have just been told to shutdown and still in the process of going offline. 
        sleep 120
        logger -s "UPS: [ONLINE] Sending Wake-On-Lan packets to dependent systems."
        wakeonlan xx:yy:zz:11:22:33
        wakeonlan xx:yy:zz:11:22:33
        wakeonlan xx:yy:zz:11:22:33
        ;;
    "ONBATT")
        logger -s "UPS: [ONBATT] Primary power has failed.  System running on battery."
        ;;
    "LOWBATT")
        logger -s "UPS: [LOWBATT] Primary power has failed.  Battery is low."
        ;;
    "FSD")
        logger -s "UPS: [FSD] Primary power has failed.  Battery is critical.  Notifying dependent systems and preparing to kill power."
        ;;
    "COMMOK")
        ;;
    "COMMBAD")
        ;;
    "SHUTDOWN")
        logger -s "UPS: [SHUTDOWN] Requesting system shutdown."
        ;;
    "REPLBATT")
        logger -s "UPS: [REPLBATT] The UPS battery needs replacing."
        ;;
    "NOCOMM")
        ;;
    *)
        logger -s "UPS: Notify script was called but no notification type was found.  Odd."
        ;;
esac

Then my shutdown script on the master (SHUTDOWNCMD in NUT config) is:

#!/bin/sh
# Script called by NUT when shutting down this system (UPS Master). 
logger -s "SHUTDOWN: System is shutting down due to power failure."
logger -s "SHUTDOWN: Writing syslog to file."
logread > /root/power-fail.log
logger -s "SHUTDOWN: Syncing file-systems and re-mounting RO."
sync
mount -o remount,ro /
logger -s "SHUTDOWN: Checking for kill-power flag."
if [ -f /etc/killpower ]
then
    logger -s "SHUTDOWN: Kill-power flag found.  Commanding UPS to disconnect load."
    upsdrvctl shutdown
    # Wait 60 seconds; if the UPS switches the load off, the system dies here.
    sleep 60
    # If power is still on now, primary power was likely restored before the UPS shut the supply off, so reboot cleanly.
    logger -s "SHUTDOWN: UPS load failed to disconnect. Power may have been restored.  Rebooting System."
    reboot
else
    # the kill-power flag was cleared sometime during the execution of this script, power was likely restored, so reboot cleanly. 
    logger -s "SHUTDOWN: No kill-power flag found.  Power may have been restored.  Rebooting System."
    reboot
fi

On the 'high power' clients that I want to turn off immediately, I have this custom notify script. It waits 60 seconds to avoid needless shutdowns on transient power glitches.

#!/bin/sh
# Script called by NUT for EXEC notifications. 
# Note: no spaces around '=' in a shell assignment.
EVENT="${NOTIFYTYPE}"
case "$EVENT" in
    "ONLINE")
        logger -s "UPS: [ONLINE] Primary power has been restored.  Clearing the shutdown flag"
        rm -f /etc/shutdown-flag
        ;;
    "ONBATT")
        logger -s "UPS: [ONBATT] Primary power has failed.  System running on battery."
        logger -s "UPS: [ONBATT] Setting shutdown flag and waiting 60 seconds to see if this is transient."
        touch /etc/shutdown-flag
        sleep 60
        if [ -f /etc/shutdown-flag ]
        then
            logger -s "UPS: [ONBATT] 60 seconds on battery have elapsed.  Shutting down this system."
            rm -f /etc/shutdown-flag
            shutdown now
        else
            logger -s "UPS: [ONBATT] The shutdown flag cleared.  Power was likely restored."
        fi
        ;;
    "LOWBATT")
        logger -s "UPS: [LOWBATT] Primary power has failed.  Battery is low."
        ;;
    "FSD")
        logger -s "UPS: [FSD] Primary power has failed.  Battery is critical."
        ;;
    "COMMOK")
        ;;
    "COMMBAD")
        ;;
    "SHUTDOWN")
        logger -s "UPS: [SHUTDOWN] Requesting system shutdown."
        ;;
    "REPLBATT")
        logger -s "UPS: [REPLBATT] The UPS battery needs replacing."
        ;;
    "NOCOMM")
        ;;
    *)
        logger -s "UPS: Notify script was called but no notification type was found.  Odd."
        ;;
esac
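An alternative to the flag file, sketched here as an assumption rather than something tested in this thread, is to ask the NUT server again after the delay and only shut down if it still reports OB. UPS_ID and STATUS_CMD are placeholders; the default relies on upsc being installed on the client.

```sh
#!/bin/sh
# Sketch: re-check the UPS state after the delay instead of relying on a flag file.
# UPS_ID and STATUS_CMD are hypothetical; point them at your own UPS and server.
UPS_ID="${UPS_ID:-myups@192.168.1.1}"
STATUS_CMD="${STATUS_CMD:-upsc $UPS_ID ups.status}"

# Returns 0 if the status string contains the OB (on battery) flag.
# If the query fails, the output is empty and this returns 1, so a lost
# connection is treated as "not on battery" and does not trigger a shutdown.
on_battery() {
    case " $($STATUS_CMD 2>/dev/null) " in
        *" OB "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example, inside the ONBATT branch:
# sleep 60
# if on_battery; then shutdown now; fi
```

Whether a failed query should count as "keep running" or "shut down" is a judgment call; the sketch picks the less disruptive option.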

The rest of the clients just use the default NUT behaviour, calling 'shutdown now' as their SHUTDOWNCMD when they get the critical FSD event from the NUT server.

The only other thing needed was a startup script on the router to send wake-on-lan packets to the dependent devices in case there is a race condition somewhere between them shutting down and power being restored.

#!/bin/sh
logger -s "STARTUP: System is starting up."
sleep 120
logger -s "STARTUP: Sending Wake-On-LAN packets to dependent systems."
wakeonlan xx:yy:zz:11:22:33
wakeonlan xx:yy:zz:11:22:33
wakeonlan xx:yy:zz:11:22:33

Haven't had a chance to fully test this yet; will update as I do.