Sometimes my main tunnel goes down and it may take some time to get it back reliably.
In that case I would like a solution that starts a prepared backup interface (not a peer, for various reasons - mainly because the configurations may differ) and keeps the connection running through the backup WireGuard tunnel, while periodically starting the main one to check whether it is functional again. If it is, the backup is stopped and traffic continues over the main one.
The main and backup interfaces have to be set up manually so that they can be told apart.
There is @egc's Watchdog, but that iterates upwards indefinitely and never checks the state of the tunnel that failed in order to return to it eventually. So it is not exactly a fit for my needs - but I started from it and iterated, ending up so far with this:
# Configuration
MAIN_INTERFACE="WG"
BACKUP_INTERFACE="WG_failover"
CHECK_INTERVAL=80 # Seconds between checks in normal operation
RETRY_INTERVAL=300 # Seconds between checks when in backup mode
MAX_FAILURES=5 # Consecutive ping failures before switching
PING_ATTEMPTS=2 # Number of ping packets to send
PING_TIMEOUT=6 # Timeout in seconds for each ping attempt
PINGIP=1.1.1.1
# Initialize interface states
current_interface="$MAIN_INTERFACE"
#ifup "$MAIN_INTERFACE" - not necessary if bring on boot enabled for the main one
fail_count=0
sleep 120 # Delay after start for first run to avoid having unprepared interfaces to deal with
# Main monitoring loop
while true; do
    if [ "$current_interface" = "$MAIN_INTERFACE" ]; then
        # Check main interface connectivity
        ping -c "$PING_ATTEMPTS" -W "$PING_TIMEOUT" -I "$MAIN_INTERFACE" "$PINGIP" >/dev/null 2>&1
        if [ $? -eq 0 ]; then
            fail_count=0
        else
            fail_count=$((fail_count + 1))
            # Early warning two failures before the switch threshold
            if [ $fail_count -eq $((MAX_FAILURES - 2)) ]; then
                echo "WG_checker: beware ${MAIN_INTERFACE} in trouble, ${fail_count}/${MAX_FAILURES} consecutive pings failed"
            fi
            if [ $fail_count -ge $MAX_FAILURES ]; then
                # Switch to backup interface
                ifdown "$MAIN_INTERFACE"
                sleep 25
                ifup "$BACKUP_INTERFACE"
                current_interface="$BACKUP_INTERFACE"
                fail_count=0
                echo "WG_checker: tunnel ${MAIN_INTERFACE} is DOWN, starting backup tunnel"
            fi
        fi
        if [ "$current_interface" = "$MAIN_INTERFACE" ]; then
            echo "WG_checker: WG main still up, going to sleep for ${CHECK_INTERVAL}."
            sleep "$CHECK_INTERVAL"
        fi
    else
        # In backup mode - check main interface periodically
        sleep "$RETRY_INTERVAL"
        # Test main interface temporarily
        ifup "$MAIN_INTERFACE"
        sleep 20
        # Must be "$PINGIP" (defined above); an unset name such as "$IPPING" expands to '' and the ping always fails
        ping -c "$PING_ATTEMPTS" -W "$PING_TIMEOUT" -I "$MAIN_INTERFACE" "$PINGIP" >/dev/null 2>&1
        if [ $? -eq 0 ]; then
            # Switch back to main interface
            ifdown "$BACKUP_INTERFACE"
            current_interface="$MAIN_INTERFACE"
            echo "WG_checker: main tunnel ${MAIN_INTERFACE} is UP, ifdown backup now"
        else
            # Keep using backup interface
            ifdown "$MAIN_INTERFACE"
            echo "WG_checker: WG main still down, on ${BACKUP_INTERFACE}, going to sleep for ${RETRY_INTERVAL}."
        fi
    fi
done
)
It is shortened to just the main function.
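Both branches of the loop run the same ping command, once in normal mode and once when probing from backup mode. A small helper would keep the two calls identical. This is only a minimal sketch: `tunnel_alive` is a name made up for illustration, and the values are the ones from the configuration above.

```shell
#!/bin/sh
# Values copied from the configuration section of the script.
PING_ATTEMPTS=2
PING_TIMEOUT=6
PINGIP=1.1.1.1

# Hypothetical helper: returns 0 if the target answers through the given interface.
# Usage: tunnel_alive <interface>
tunnel_alive() {
    ping -c "$PING_ATTEMPTS" -W "$PING_TIMEOUT" -I "$1" "$PINGIP" >/dev/null 2>&1
}
```

In the loop, both checks could then read `if tunnel_alive "$MAIN_INTERFACE"; then ... fi`, so the ping target is spelled in exactly one place.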
What bothers me when testing this can be seen here:
-- below is the switch to failover
22:55:11 WR WG_checker[0]: + '[' 5 -ge 5 ']'
22:55:11 WR WG_checker[0]: + ifdown WG
22:55:11 WR WG_checker[0]: + sleep 25
22:55:36 WR WG_checker[0]: + ifup WG_failover
22:55:36 WR netifd: Interface 'WG_failover' is setting up now
22:55:36 WR WG_checker[0]: + current_interface=WG_failover
22:55:36 WR WG_checker[0]: + fail_count=0
22:55:36 WR WG_checker[0]: + echo 'WG_checker: tunnel WG is DOWN, starting backup tunnel'
22:55:36 WR WG_checker[0]: WG_checker: tunnel WG is DOWN, starting backup tunnel
22:55:36 WR WG_checker[0]: + '[' WG_failover = WG ']'
22:55:36 WR WG_checker[0]: + true
22:55:36 WR WG_checker[0]: + '[' WG_failover = WG ']'
22:55:36 WR WG_checker[0]: + sleep 300
22:55:36 WR netifd: Interface 'WG_failover' is now down
22:56:08 WR pbr: Reloading pbr WG_failover interface routing due to ifdown of WG_failover ()
23:00:36 WR WG_checker[0]: + ifup WG
23:00:36 WR WG_checker[0]: + sleep 20
23:00:56 WR WG_checker[0]: + ping -c 2 -W 6 -I WG ''
23:00:56 WR WG_checker[0]: + '[' 2 -eq 0 ']'
23:00:56 WR WG_checker[0]: + ifdown WG
23:00:56 WR WG_checker[0]: + echo 'WG_checker: WG main still down, on WG_failover, going to sleep for 300.'
23:00:56 WR WG_checker[0]: WG_checker: WG main still down, on WG_failover, going to sleep for 300.
23:00:56 WR WG_checker[0]: + true
23:00:56 WR WG_checker[0]: + '[' WG_failover = WG ']'
23:00:56 WR WG_checker[0]: + sleep 300
23:01:54 WR netifd: Interface 'WG_failover' is setting up now
23:01:54 WR netifd: Interface 'WG_failover' is now up
23:01:54 WR netifd: Network device 'WG_failover' link is up
23:01:57 WR pbr: Setting up routing for 'WG_failover/10.2.0.2' [✓]
23:01:57 WR pbr: service monitoring interfaces: wan WG WG_failover
To point out:
22:55:36 WR netifd: Interface 'WG_failover' is setting up now
22:55:36 WR netifd: Interface 'WG_failover' is now down
- this happens basically immediately, although I do not call any 'ifdown'. It is so fast that the interface has no chance to come up and stay running for the subsequent ping test.
23:01:54 WR netifd: Interface 'WG_failover' is setting up now
23:01:54 WR netifd: Interface 'WG_failover' is now up
23:01:54 WR netifd: Network device 'WG_failover' link is up
23:01:57 WR pbr: Setting up routing for 'WG_failover/10.2.0.2' [✓]
23:01:57 WR pbr: service monitoring interfaces: wan WG WG_failover
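The excerpts above show netifd reporting the backup as "setting up" and then "down" within the same second, while the script just sleeps for a fixed time without looking at the result. One way to see what netifd thinks would be to poll the interface state after `ifup`. The following is a rough sketch, assuming OpenWrt's `ifstatus <interface>` prints JSON that contains `"up": true` or `"up": false`; `iface_is_up` and `wait_for_up` are hypothetical helper names.

```shell
#!/bin/sh
# Assumption: `ifstatus <iface>` prints netifd's status as JSON with an "up" field.

# Returns 0 if netifd reports the interface as up.
iface_is_up() {
    ifstatus "$1" 2>/dev/null | grep -Eq '"up": *true'
}

# Poll until the interface is up or the tries run out.
# Usage: wait_for_up <iface> <max_tries> <delay_seconds>
wait_for_up() {
    tries=0
    while [ "$tries" -lt "$2" ]; do
        if iface_is_up "$1"; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$3"
    done
    return 1
}
```

After `ifup "$BACKUP_INTERFACE"`, a call such as `wait_for_up "$BACKUP_INTERFACE" 10 3 || echo "WG_checker: backup did not come up"` would at least put the failure into the log at the moment it happens.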
I am testing this by basically pulling the WAN cable out, expecting it to stay on WG_failover until the cycle starts the main WG again - which should be possible after connecting the WAN back.
But this is not happening.
To get it working again, 'service network restart' must be performed.
I'm quite inexperienced with all of this.
So far I expect the problem is one of these:
- there is some bug in it - after looking at it for too long it may have become invisible to me.
- pbr steps in and somehow interferes with the switching(?)
- ifup/ifdown is not enough and must be accompanied by some more uci(?)
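For the first suspicion, the trace above contains one concrete hint: the probe in backup mode runs `ping -c 2 -W 6 -I WG ''`, so the target is empty, which is what an unset or misspelled variable expands to. A cheap guard is to let the script refuse to start when a required variable is empty. This is a minimal sketch for a POSIX shell such as busybox ash; `require_vars` is a hypothetical helper, not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper: fail if any of the named variables is unset or empty.
# Usage: require_vars NAME1 NAME2 ...
require_vars() {
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "WG_checker: variable $name is unset or empty" >&2
            return 1
        fi
    done
    return 0
}

PINGIP=1.1.1.1
require_vars PINGIP && echo "WG_checker: configuration looks complete"
```

Putting `set -u` near the top of the script works in a similar spirit: the shell then aborts as soon as an unset variable is expanded, instead of silently passing an empty string to ping.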
If someone can poke me a bit, that would be great.
Or if it's already possible and I'm just re-doing it, let me know.
(Watchcat is great for restarting but that won't solve the hassle at all.)