Bash script for automatic switching between two WireGuard tunnels

Sometimes my main tunnel goes down and it may take some time to get it back reliably.

In that case I would like a solution that starts a prepared backup interface (not a peer, for various reasons - mainly because the configurations may differ) and keeps the connection running through the backup WireGuard tunnel, while periodically bringing up the main one to check whether it is functional again. If it is, the script stops the backup and continues with the main one.

Manual setup of the main and backup interfaces is needed to tell them apart.

There is @egc's Watchdog, but that iterates upwards through the tunnels indefinitely and does not check the state of the tunnel that failed in order to eventually return to it. So it is not exactly a fit for my needs - but I started from it and iterated, ending up with this so far:

# Configuration
MAIN_INTERFACE="WG"
BACKUP_INTERFACE="WG_failover"
CHECK_INTERVAL=80       # Seconds between checks in normal operation
RETRY_INTERVAL=300      # Seconds between checks when in backup mode
MAX_FAILURES=5          # Consecutive ping failures before switching
PING_ATTEMPTS=2         # Number of ping packets to send
PING_TIMEOUT=6          # Timeout in seconds for each ping attempt
PINGIP=1.1.1.1


# Initialize interface states
current_interface="$MAIN_INTERFACE"
#ifup "$MAIN_INTERFACE" - not necessary if bring on boot enabled for the main one
fail_count=0

sleep 120               # Delay after boot so the first run does not hit interfaces that are not ready yet

# Main monitoring loop
while true; do
    if [ "$current_interface" = "$MAIN_INTERFACE" ]; then
        # Check main interface connectivity
        ping -c "$PING_ATTEMPTS" -W "$PING_TIMEOUT" -I "$MAIN_INTERFACE" "$PINGIP" >/dev/null 2>&1

        if [ $? -eq 0 ]; then
            fail_count=0
        else
            fail_count=$((fail_count + 1))
            if [ "$fail_count" -eq $((MAX_FAILURES - 2)) ]; then
                echo "WG_checker: beware, ${MAIN_INTERFACE} in trouble, ${fail_count}/${MAX_FAILURES} consecutive pings failed"
            fi
            if [ $fail_count -ge $MAX_FAILURES ]; then
                # Switch to backup interface
                ifdown "$MAIN_INTERFACE"
                sleep 25
                ifup "$BACKUP_INTERFACE"
                current_interface="$BACKUP_INTERFACE"
                fail_count=0
                echo "WG_checker: tunnel ${MAIN_INTERFACE} is DOWN, starting backup tunnel"
            fi
        fi
        if [ "$current_interface" = "$MAIN_INTERFACE" ]; then
		echo "WG_checker: WG main still up, going to sleep for ${CHECK_INTERVAL}."
        	sleep "$CHECK_INTERVAL"
	fi
    else
        # In backup mode - check main interface periodically
        sleep "$RETRY_INTERVAL"

        # Test main interface temporarily
        ifup "$MAIN_INTERFACE"
        sleep 20
        ping -c "$PING_ATTEMPTS" -W "$PING_TIMEOUT" -I "$MAIN_INTERFACE" "$IPPING" >/dev/null 2>&1

        if [ $? -eq 0 ]; then
            # Switch back to main interface
            ifdown "$BACKUP_INTERFACE"
            current_interface="$MAIN_INTERFACE"
        	echo "WG_checker: main tunnel ${MAIN_INTERFACE} is UP, ifdown backup now"
        else
            # Keep using backup interface
            ifdown "$MAIN_INTERFACE"
		echo "WG_checker: WG main still down, on ${BACKUP_INTERFACE}, going to sleep for ${RETRY_INTERVAL}."
        fi
    fi
done


It's shortened to just the main function.
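
For reference, a minimal procd init script along these lines should be enough to run it at boot (the /root/wg_checker.sh path and the service name are just placeholders, not necessarily how I run it):

#!/bin/sh /etc/rc.common
# /etc/init.d/wg_checker - adjust the path below to wherever the script lives

START=99
USE_PROCD=1

start_service() {
    procd_open_instance
    procd_set_param command /bin/sh /root/wg_checker.sh
    # restart the checker if it ever exits
    procd_set_param respawn
    # forward the echo output to the system log (visible via logread)
    procd_set_param stdout 1
    procd_set_param stderr 1
    procd_close_instance
}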

What bothers me when testing this is to be seen here:

 -- below is the switch to the failover
22:55:11 WR WG_checker[0]: + '[' 5 -ge 5 ']'
22:55:11 WR WG_checker[0]: + ifdown WG
22:55:11 WR WG_checker[0]: + sleep 25
22:55:36 WR WG_checker[0]: + ifup WG_failover
22:55:36 WR netifd: Interface 'WG_failover' is setting up now
22:55:36 WR WG_checker[0]: + current_interface=WG_failover
22:55:36 WR WG_checker[0]: + fail_count=0
22:55:36 WR WG_checker[0]: + echo 'WG_checker: tunnel WG is DOWN, starting backup tunnel'
22:55:36 WR WG_checker[0]: WG_checker: tunnel WG is DOWN, starting backup tunnel
22:55:36 WR WG_checker[0]: + '[' WG_failover = WG ']'
22:55:36 WR WG_checker[0]: + true
22:55:36 WR WG_checker[0]: + '[' WG_failover = WG ']'
22:55:36 WR WG_checker[0]: + sleep 300
22:55:36 WR netifd: Interface 'WG_failover' is now down
22:56:08 WR pbr: Reloading pbr WG_failover interface routing due to ifdown of WG_failover ()
23:00:36 WR WG_checker[0]: + ifup WG
23:00:36 WR WG_checker[0]: + sleep 20
23:00:56 WR WG_checker[0]: + ping -c 2 -W 6 -I WG ''
23:00:56 WR WG_checker[0]: + '[' 2 -eq 0 ']'
23:00:56 WR WG_checker[0]: + ifdown WG
23:00:56 WR WG_checker[0]: + echo 'WG_checker: WG main still down, on WG_failover, going to sleep for 300.'
23:00:56 WR WG_checker[0]: WG_checker: WG main still down, on WG_failover, going to sleep for 300.
23:00:56 WR WG_checker[0]: + true
23:00:56 WR WG_checker[0]: + '[' WG_failover = WG ']'
23:00:56 WR WG_checker[0]: + sleep 300
23:01:54 WR netifd: Interface 'WG_failover' is setting up now
23:01:54 WR netifd: Interface 'WG_failover' is now up
23:01:54 WR netifd: Network device 'WG_failover' link is up
23:01:57 WR pbr: Setting up routing for 'WG_failover/10.2.0.2' [✓]
23:01:57 WR pbr: service monitoring interfaces: wan WG WG_failover 

To point out:

22:55:36 WR netifd: Interface 'WG_failover' is setting up now
22:55:36 WR netifd: Interface 'WG_failover' is now down
  • basically immediately; but I do not call any 'ifdown' - it happens so fast that the interface has no chance to come up and stay running for the subsequent ping test.
23:01:54 WR netifd: Interface 'WG_failover' is setting up now
23:01:54 WR netifd: Interface 'WG_failover' is now up
23:01:54 WR netifd: Network device 'WG_failover' link is up
23:01:57 WR pbr: Setting up routing for 'WG_failover/10.2.0.2' [✓]
23:01:57 WR pbr: service monitoring interfaces: wan WG WG_failover

I test this basically by pulling the WAN cable out, expecting it to stay on WG_failover until the cycle brings the main WG up again - which should be possible after plugging the WAN back in.

But this is not happening.

To get it back working, the 'service network restart' must be performed.

I'm quite inexperienced with all of this.

So far I expect the problem is one of:

  1. I have some bug in it - after staring at it for too long, it may simply be invisible to me.
  2. pbr steps in and somehow interferes with the switching(?)
  3. ifup/ifdown alone is not enough and must be accompanied by some more uci work(?) - see the sketch after this list.
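
For point 3, one thing I may try: instead of the fixed 'sleep 25' after ifup, poll netifd until it actually reports the interface as up. A rough sketch (wait_for_iface is just a name I made up; it relies on ifstatus and jsonfilter, which should be present on a stock OpenWrt build):

# Wait up to ~30 seconds for netifd to report the interface as up.
wait_for_iface() {
    iface="$1"
    i=0
    while [ "$i" -lt 30 ]; do
        # ifstatus wraps 'ubus call network.interface.<name> status'
        if [ "$(ifstatus "$iface" 2>/dev/null | jsonfilter -e '@.up')" = "true" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# usage in the script would be e.g.:
#   ifup "$BACKUP_INTERFACE"
#   wait_for_iface "$BACKUP_INTERFACE" || echo "WG_checker: ${BACKUP_INTERFACE} did not come up"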

If someone can poke me a bit, that would be great.

Or if it's already possible and I'm just re-doing it, let me know.
(Watchcat is great for restarting but that won't solve the hassle at all.)

Small addition: my script iterates upwards until all tunnels have been used, then it starts from the beginning - but I understand what you want.

But do you really need what you want?
Many providers have different servers even in the same place so it does not matter which one you use.
Of course, if you are not connecting to a VPN provider, things may be different.

If I have more time later, I will have a look at your script.


I thought that's what the mwan3 package was for ...


Thanks for the initial reaction. I ended up rewriting the post introduction, since as you suggested it was a bit unclear. And before I could thoroughly check the links you posted, the post itself was deleted.
(I don't know if it's still possible to see it, as the forum won't let me view the edit history - no problem with that, I prefer this option as well.)

Afaik we misunderstood each other, as your script probably does what Watchcat can.

Thank you (also for considering the help).

I tag you mainly so someone can remind me of its existence, and also, just in case you work on it again, because this could be a useful addition - if my code starts working later.

Yes - that's true, I did realize that. Maybe it was initially not well written.

Yes, it's more of a general idea of how to tackle it: staying on the 'main' and going to the backup only temporarily.

Some people have more than one VPN provider (or it can be a one-month payment just to have a backup temporarily), a VPS, or maybe some RPi or router listening with WireGuard as a server.

For my purpose just two are enough; it could easily be extended to N more later. It's just a simple approach that doesn't get me into setting up more complicated solutions.

That's why it's not peer based but rather interface based - so different settings can be started and tried after the initial preparation, just in case.

--
I'm doing this basically because I ended up with an exit really distant from my location, so when that kicks in (metrics take over, as I have it on all the time so far) the noticeable lag forces me to check the IP/jitter. And that eventually pushes me to see whether there is maintenance or some problem with the machine/VPS hosting, the setup, or the route.

But it's a manual intervention, not a pleasant solution, to go back and start the first one (metrics again routing traffic through the long-time preferred location).

The distance serves two purposes - it's not one location, thus possibly avoiding (as you also say about the 'same place') a total outage of that location, though it is also what induces the lag.

Also, the need to have the second interface up and running 24/7 can be avoided, as I see no purpose in keeping an unused connection running all the time.

I know about the existence of this package.

But upon reading the docs, the first point is: 'Outbound WAN traffic load balancing or fail-over with multiple WAN interfaces'.

See, I have only one WAN so far - no need for a backup with two WANs, because I'm not willing to pay for LTE or any other technology as a backup.

Continuing through the guide for this package, there is:

"You will need a minimum of two WAN interfaces for mwan3 to work effectively."

OK, further on there is a small amount of hope given to me in the form of: "While mwan3 is primarily designed for physical and independent WAN connections it can also be used with logical interfaces like OpenVPN or Wireguard." - dealing with DSA is beyond my experience so far.

Also because of: "The simplest way to create more WAN interfaces is to have a VLAN-capable router. This will allow you to convert existing LAN ports into individual ports to become its own separate interface and act as a WAN." - I have all ports occupied already, so this would mean obtaining at least a switch.

And me being at the user level of "I can set up the Wireguard interfaces", I did not come to the conclusion that this is somehow doable with the mwan3 package.

So I have not yet googled/searched here after reading through the mwan3 docs.

Should I, do you think, @frollic?

It doesn't matter how many available WAN interfaces you have, you can always use mwan3. The point is that for there to be any sense in using the app, you need at least two of them - physical, virtual, or a combination of both.

In your case you don't need to use mwan3 to monitor the wan interface - if it doesn't work, nothing works.

However, you can use mwan3 to monitor the wireguard connections.
If the main tunnel goes down, mwan3 will automatically switch all traffic to the backup one.
If / when the main tunnel is restored, traffic will be rerouted through it again.

Since you are obviously not familiar with the package, here is a sample configuration that works. You can fine-tune it after reading the manual carefully.

config globals 'globals'
        option mmx_mask '0x3F00'

config interface 'wg0'
        option enabled '1'
        list track_ip '8.8.8.8'
        list track_ip '1.1.1.1'
        option family 'ipv4'
        option reliability '1'

config interface 'wg1'
        option enabled '1'
        list track_ip '8.8.8.8'
        list track_ip '1.1.1.1'
        option family 'ipv4'
        option reliability '1'

config member 'wg0_m1_w3'
        option interface 'wg0'
        option metric '1'
        option weight '3'

config member 'wg1_m2_w3'
        option interface 'wg1'
        option metric '2'
        option weight '3'

config policy 'wg0_to_wg1'
        list use_member 'wg0_m1_w3'
        list use_member 'wg1_m2_w3'

config rule 'default_rule_v4'
        option dest_ip '0.0.0.0/0'
        option use_policy 'wg0_to_wg1'
        option family 'ipv4'
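
After saving it as /etc/config/mwan3, reload the service and check that the interfaces are being tracked:

# reload mwan3 after editing /etc/config/mwan3 and verify the result
service mwan3 restart
mwan3 status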

Some additional tips in case you decide to install the package.

  1. Set different metric values for wan and the two wireguard interfaces in /etc/config/network (a short uci example follows this list).
  2. Replace iptables(6)-zz-legacy with iptables-nft and reboot the device.
opkg remove iptables-zz-legacy ip6tables-zz-legacy --force-depends
opkg install iptables-nft
  3. Make sure that the interface names used in /etc/config/mwan3 are exactly the same as those defined in /etc/config/network.
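
For point 1, something along these lines from a shell should do it (wan/wg0/wg1 are the names from the sample above - use whatever your /etc/config/network actually calls them; the values are only examples, the ordering is what matters):

# wan keeps the lowest metric, each tunnel gets a higher, unique one
uci set network.wan.metric='10'
uci set network.wg0.metric='20'
uci set network.wg1.metric='30'
uci commit network
service network reload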

Good luck.


And that's great!

I did this according to your defaults.

Yes - looking now in the LuCI UI for mwan3 at the Interface down / Interface up options. Great!

Awesome, thank you very much!
Now I'm a bit more familiar with it; it works so far.

  1. I have the WAN at 1024 and both WG interfaces at 2048. (Because only one-on / one-off at any time is possible anyway, I guess there is no need to distinguish further, since I'm now neither balancing nor trying to prefer one of them - mwan3 does the job.)

This point is a bit unclear to me, as the WAN must have the lowest metric(? - there is nothing about it in the guide) all the time - because that is the main way out.

Edit: in the mwan3 docs there is: "The default (primary) WAN interface should have the lowest metric (e.g. 10) and each additional WAN interface a higher metric (e.g. 20, 30, etc.). Values are not important, but should always be unique."

Now I'm even more confused - if the values are not important, why should they be unique? (My understanding so far is that they are used for the priority of the interface.)

In the mwan3 doc there is: "Member interfaces with lower metrics are used first" - but that is /etc/config/mwan3 - so do I have it right after all?

  2. did that

  3. Yes - below:

Just to note, if anyone goes this way: it unfortunately took me quite a while to figure out the mistake I made with the policy - "Names must be 15 characters or less".

Damn - with my defaults predating the installation, I easily ended up a few characters over and did not spot it in vi while editing the config.

It simply doesn't work until the logs are checked - and there we go.
So don't forget 'logread -e mwan3' and take a look even after this is successfully working.

And the assigned policy may be needed, so I added that as well.

--
A quick reminder for people needing a VPN whose IP really should not leak.

I tested this a dozen times and it takes quite a while until the traffic goes through the WG interface and not straight out the WAN - 30 seconds and more.

I don't have the firewall kill-switch (meaning the WAN is not removed from the lan zone forwarding) because PBR can't work alongside that.

So keep this in mind.

I'm not sure if that is because there is a delay between the WG and WAN ifup (PBR also loading...):

16:48:44 mwan3-hotplug[19427]: mwan3 hotplug on lan not called because interface disabled
16:48:55 mwan3-hotplug[23113]: mwan3 hotplug on loopback not called because interface disabled
16:48:57 mwan3-hotplug[23631]: Execute ifup event on interface WG (WG)
16:48:58 mwan3track[4340]: Detect ifup event on interface WG ()
16:48:58 mwan3track[4340]: Interface WG (WG) is online
16:49:11 mwan3-hotplug[28884]: Execute ifup event on interface wan (eth0)
16:49:11 mwan3track[4339]: Detect ifup event on interface wan ()
16:49:11 mwan3track[4339]: Interface wan (eth0) is online

mwan3 has quite a low start priority among the init scripts (19), yet it shows up later in the log - while things that are supposed to start later (50+) appear sooner.

Where does your tunnel terminate? A VPN provider or your own box?

A dynamic routing protocol like OSPF, or better BGP, is far better suited to making routing decisions than fiddling with static routes based on scripts. But yes, if you have no control over the remote end, then you have no other choice...

I tested this a bit and it does work. This solution, however, is not what I would like to have in general.

The main hassle is that:

restoring the tunnel is not a capability of mwan3.

Watchcat is still needed to reload the main WG if its check fails; otherwise it stays stuck on the backup WG indefinitely.

Also, this requires all the WG_X interfaces considered for future use to be set to 'Bring up on boot' (the uci equivalent is sketched a bit below); otherwise, with a manual load (the Restart option in LuCI), the traffic later won't go through.

(Or at least in my testing:

misconfiguring the main WG + reloading it so there is no handshake, and then starting the backup WG manually (because mwan3 doesn't do that), does not let the traffic go through.

But doing the same when the backup WG is already on with 'Bring up on boot' works all good.)
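
For reference, the 'Bring up on boot' toggle corresponds to the 'auto' option in /etc/config/network (WG_failover is just my interface name here), so the same thing can be done from a shell:

# 'auto' defaults to 1; setting it to 0 would disable bring-up on boot
uci set network.WG_failover.auto='1'
uci commit network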

So basically it doesn't achieve the full scope of what I initially wanted to have.

That is a good question: basically anywhere - it may be my own machine or, most of the time, commercial VPNs.

And that is the biggest hassle, because they take different approaches to the:

IP addresses of the WireGuard interface, or the Interface address, or however it is represented in the UI.

So that doesn't work, because in that case a conflict is detected for the routing table. Only one can be in use.

That is why this must be done (I think) in that case:

Proton does that, generating all configs with only 10.2.0.2/32.

Others not so much (Mullvad and IVPN, iirc, don't).

So, in general, I don't want to care what it is (what combination).

I'm aiming for it to be capable of keeping only one connection alive, with an eventual switch to a different one (which is to be expected for a few dozen hours a year) - whatever is prepared in the router.

So one doesn't need to deal with a complicated setup - whatever is already there can stay, and this just monitors and starts/stops the backup/main WG interfaces.

Also:

I think the majority of users don't need the second route at all, except for the time when the main one is unusable.

If it is such a big problem for you that both tunnels run in parallel, disable the automatic start of the backup wireguard interface and add this to /etc/mwan3.user:

# wg0 main interface
# wg1 backup interface

if [ "${ACTION}" = "disconnected" ] && [ "${INTERFACE}" = "wg0" ]; then
        ifup wg1
fi

if [ "${ACTION}" = "connected" ] && [ "${INTERFACE}" = "wg0" ]; then
        ifdown wg1
fi

Set the correct wireguard interface names and make sure the wan interface has the lowest metric value in /etc/config/network.
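
If you want to confirm the hook actually fires, a logger line at the top of /etc/mwan3.user is enough (BusyBox logger is part of the stock image):

# optional: record every mwan3 hotplug event in the system log (visible via logread)
logger -t mwan3.user "ACTION=${ACTION} INTERFACE=${INTERFACE}"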


I tried this with a few different interfaces/peers/endpoints; so far it even works with providers using the same interface address.

(Tested: revoking the provided key pair - or just unplugging the peer from Ethernet and letting Watchcat and mwan3 switch both on/off.)

Now Watchcat will just bring the main WG back up, checking a few times per day, just in case.

Thanks, I really appreciate it.


Well, today I noticed (I should test it more while checking from the rest of the LAN devices) a lag on a streaming service after this solution switched to a different route.

It turns out that, as I'm on 22.03 with PBR using nft, it is incompatible with mwan3.

(I should have put that info in the first post, but I didn't know it at the time. I had a feeling about the 'opkg install iptables-nft' step, as it's a previously used solution, possibly from PBR's docs, but I didn't fully realize it until today.)

As noted here (both pieces of info are in the same thread):

In that case, possibly only a fresh installation of PBR with iptables could work:

I'm not willing to do that. I'd rather stay with something tested and easily maintainable, and if there is an OpenWrt update, I'd rather be on the newer version with packages that install easily and configs that work out of the box.

But for anyone who doesn't use PBR - this is still a usable solution.

So I must push the script forward.