Improvement to wireguard_watchdog script?

R00 · November 3, 2022, 2:52pm

Previously I got some great help to get my openwrt 'client' configuration for wireguard sorted out - this was discussed in OpenWrt client to linux hosted wireguard. I also blogged about it https://lowtek.ca/roo/2022/openwrt-as-a-wireguard-client

Recently - my home network connection had some weird problem. My main gateway/router needed a reboot. The external IP did not change, but the remote openwrt wireguard "client" got stuck in some way. I was not seeing updated handshakes happen (for hours)

I was pleased to discover that there was a wireguard_watchdog script to use. See these forum threads for setting that up:

I've added this to my configuration to check and 'kick' wireguard if my IP does change or something else happens.

I thought this was going to fix me up, but unfortunately it did not. I did some debugging of the script - it correctly detects the issue and tries to fix it with

wg set ${iface} peer ${public_key} endpoint "${endpoint_host}:${endpoint_port}"

While this may help deal with an IP address change - it didn't tickle the interface enough to get my wireguard connection to start working again.

I then found this Restart WireGuard via cli - #10 by satheras which indicates doing

ifdown ${iface}
ifup ${iface}

This when manually issued appears to have been the magic required.

The reason for this post - is to see what folks more knowledgeable than myself think of adding this to the wireguard_watchdog script. Maybe with a brief sleep before it to allow the wg set.. to take effect?

  logger -t "wireguard_monitor" "${iface} endpoint ${endpoint_host}:${endpoint_port} is not responding for ${idle_seconds} seconds, trying to re-resolve hostname"
  wg set ${iface} peer ${public_key} endpoint "${endpoint_host}:${endpoint_port}"
  sleep 1
  ifdown ${iface}
  ifup ${iface}

I can go find the code and create an issue and pull request, but would like feedback that this makes sense.

takimata · November 3, 2022, 3:43pm

I will make a wild guess and say you are running the watchdog on the "server". It has to be run on the "client", i.e. the peer initiating the connection. Otherwise it cannot work.

Aside from that, ifdown/ifup on the interface is an exceedingly brutal way of reestablishing a single peer connection. It will of course take down all peer connections on the interface, even those that are working fine and may very well be in active use at the time. I would really consider resetting the whole interface a last resort, not a general recommendation.

JFTR: My experience with the wireguard watchdog script is pretty much the opposite of yours. For me, it would occasionally reset/re-resolve healthy peer connections. I believe that the timeout threshold of 150 seconds is just ever so slightly too low: even if persistent_keepalive is set to the recommended 25 seconds or lower, intervals between handshakes may well exceed 150 seconds on a healthy connection. I remember looking up the actual handshake interval, and it seems only loosely connected to the persistent_keepalive setting, but I have since forgotten the details. In my case, I increased the timeout threshold to 240 seconds and haven't had problems since, and I run WireGuard connections on half a dozen machines spread across the world.
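
For anyone who wants to watch the handshake age on their own link before picking a threshold, here is a minimal sketch. It assumes the tab-separated "public key, epoch seconds" lines printed by `wg show <iface> latest-handshakes`; the function name `handshake_idle` and the interface name `wg0` are made up for illustration.

```shell
# Hypothetical helper: read "<public_key><TAB><epoch>" lines on stdin and
# print "<public_key> <idle_seconds>" for each peer.
# The current time is passed as $1 so the function can be tried without
# a live tunnel.
handshake_idle() {
  now=$1
  while read -r key ts; do
    # Skip empty lines.
    [ -n "$key" ] || continue
    if [ "$ts" -eq 0 ]; then
      # A timestamp of 0 is taken to mean "no handshake yet".
      echo "$key never"
    else
      echo "$key $((now - ts))"
    fi
  done
}

# Usage on a router (wg0 is an example interface name):
#   wg show wg0 latest-handshakes | handshake_idle "$(date +%s)"
```

Running this from cron for a day and logging the largest value seen should give a rough idea of how far above 150 seconds a healthy link drifts.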

R00 · November 3, 2022, 4:28pm

Hmm.. I think you are mistaken.

I run a wireguard "server" on a linux host. The wireguard is actually a linuxserver.io container which I map the UDP port to. This works great. My phone, laptop, etc all enjoy wireguard VPN when I need that. It gives me the ability to "be at home when I'm away"

In this post - I'm talking about an OpenWRT installation that is 'remote' from my home. Upstream for this is an LTE connection. As documented in OpenWrt client to linux hosted wireguard I struggled a little bit to figure out how to successfully make this OpenWRT into a 'client' of my existing wireguard server.

The wireguard_watchdog script - is running on this remote OpenWRT.

To describe the scenario again - my home system had some network snafu. Packets stopped flowing from other systems to the internet (ie: ping from a laptop on that network to 1.1.1.1 would fail). Restarting the home gateway/router fixed this internet problem.

The external IP that my home network has did not change. The wireguard container didn't restart because it lives on a linux server inside the home network. However, resetting the gateway clearly caused the OpenWRT 'remote' client wireguard (or at least that connection) to get into a bad state.

I could see by looking at the remote OpenWRT that the last handshake value was climbing and hit multiple hours (8hrs). Similarly, when I probed the server side and looked at wireguard there, it agreed: no handshake for hours.

It was then that I connected to the 'remote' OpenWRT that is running the wireguard client - and did a manual ifdown/ifup to cause the connection to resolve itself and data started to flow again.

Maybe I should clarify that because I'm still testing this setup - the remote system - is physically in the same location so it's easy for me to switch between looking at one or the other. The networking components between home and remote are distinct and not physically connected.

wireguard <-> gateway <-> internet <- LTE <-> OpenWRT 
server                                        wireguard client

I do agree that the timeout in the wireguard_watchdog does seem very aggressive

When I deploy this solution - the remote OpenWRT is going to be ~1hr away, and I expect things like janky power and other issues to happen - so the more of a safety net I can build the better.

I appreciate this advice - I'll consider modifying my 'patch' to the wireguard_watchdog to only do the brutal ifdown/ifup IF the last handshake value climbs to a much larger value. If the gentle wg set iface... works to restore connectivity, great. However, in the unlikely scenario where it's stuck for say 30mins -- I'll just go for the bigger kick.

Of course - the other alternative here is to take the approach of doing a full reboot too. There are plenty of "ping has failed, so time to reboot" type scripts. I generally do not like the "reboot regularly" approach, but sometimes turning it off / on again is the right hammer to use.
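
A minimal sketch of the "ping has failed" style of check mentioned above, assuming a `ping` that accepts `-c` and `-W`. The function name `tunnel_alive`, the `PING_CMD` override and the address `10.0.0.1` are hypothetical and only there for illustration.

```shell
# Hypothetical helper: return 0 if the target answers at least one ping,
# 1 if all attempts fail. PING_CMD can be overridden for testing.
tunnel_alive() {
  target=$1
  tries=${2:-3}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # One echo request with a 2 second timeout per attempt.
    if ${PING_CMD:-ping} -c 1 -W 2 "$target" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (10.0.0.1 stands in for the server's address inside the tunnel):
#   tunnel_alive 10.0.0.1 || { ifdown wg0; ifup wg0; }
```

Pinging an address that is only reachable through the tunnel tests the tunnel itself rather than the upstream LTE link, so a failure here is a narrower signal than a generic internet check.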

takimata · November 3, 2022, 4:45pm

I stand, happily, corrected. I also see that your network setup is quite a bit more sophisticated than usual, which probably also introduces some unusual breaking points.

That seems like a sensible approach in your case. I mean, if your remote client only has that one peer connection it does not make much of a difference. But again, I wouldn't recommend it as a general improvement.

R00 · November 3, 2022, 5:36pm

LoL - yeah sophisticated.. that's what I call it.

More like cobbled together and attempting to be as seamless (but useful) as possible. I don't want to stray too far from the beaten path which is why I'm asking questions about the wireguard_watchdog implementation.

Your comments have meant I've locally modified my copy of the watchdog script to have this

  idle_seconds=$(($(date +%s)-${last_handshake}))
  [ ${idle_seconds} -lt 250 ] && return 0;
  logger -t "wireguard_monitor" "${iface} endpoint ${endpoint_host}:${endpoint_port} is not responding for ${idle_seconds} seconds, trying to re-resolve hostname"
  wg set ${iface} peer ${public_key} endpoint "${endpoint_host}:${endpoint_port}"
  if [ ${idle_seconds} -gt 600 ]; then
    ifdown ${iface}
    ifup ${iface}
  fi
}

Two changes here:

1. Bumped the ignore-it cutoff from 150 to 250 seconds.
2. Only if idle_seconds exceeds 600 seconds will I use the hammer to down/up the interface.
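
The same decision can be pulled out into a small function, so the thresholds can be tried out without touching a live interface. This is only a sketch of the logic described above; the names `RERESOLVE_AFTER`, `RESTART_AFTER` and `watchdog_action` are hypothetical and not part of the shipped script.

```shell
# Thresholds taken from the two changes above; variable names are made up.
RERESOLVE_AFTER=250
RESTART_AFTER=600

# Echo the action for a given handshake age in seconds:
#   none      - handshake is recent enough, do nothing
#   reresolve - run the gentle "wg set ... endpoint ..." only
#   restart   - run "wg set" and then ifdown/ifup as well
watchdog_action() {
  idle=$1
  if [ "$idle" -lt "$RERESOLVE_AFTER" ]; then
    echo none
  elif [ "$idle" -gt "$RESTART_AFTER" ]; then
    echo restart
  else
    echo reresolve
  fi
}
```

With the watchdog running from cron every minute, the gentle re-resolve gets several attempts between 250 and 600 seconds before the interface restart is tried.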

theunreal89 · March 4, 2024, 11:26am

Hello,

I had the same experience as R00: the watchdog simply doesn't work for restoring a wireguard VPN connection where the other side has changed address. Restarting the interface with ifdown && ifup is the way to fix it (with the drawback of killing other healthy wireguard connections running on the same WG interface, which is not my case so it's fine for me).

I think a different solution has to be found for wireguard_watchdog script, because otherwise it's useless for this (very common) site-to-site scenario.

takimata · March 4, 2024, 1:14pm

That's almost certainly not the fault of the watchdog script, but rather a kink in the configuration that prevents it from working. Most commonly that's a missing persistent_keepalive. Without that, WireGuard does not keep proper record of the latest handshake and consequently cannot tell if it happened too long ago.

I personally have several site-to-site WireGuard connections running on dynamic addresses, daily reconnected via wireguard_watchdog, for years now without any problems whatsoever.

Sure, what do you propose?

theunreal89 · March 4, 2024, 1:55pm

Most commonly that's a missing persistent_keepalive.

No, the keepalive is enabled. I know the watchdog is never triggered without it, and my remote wireguard peer (which is the "server") is behind a NAT, so the keepalive would be needed anyway.

This is the configuration on the OpenWRT side:

root@openwrt:~# ip a s CasaStefanoWG
14: CasaStefanoWG: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN qlen 1000
    link/[65534] 
    inet 10.89.5.3/32 brd 255.255.255.255 scope global CasaStefanoWG
       valid_lft forever preferred_lft forever
root@openwrt:~# wg showconf CasaStefanoWG
[Interface]
ListenPort = 34771
PrivateKey = [hidden]

[Peer]
PublicKey = [hidden]
PresharedKey = [hidden]
AllowedIPs = 192.168.89.0/24
Endpoint = 87.9.25.156:51820
PersistentKeepalive = 25
root@openwrt:~# wg
interface: CasaStefanoWG
  public key: [hidden]
  private key: (hidden)
  listening port: 34771

peer: [hidden]
  preshared key: (hidden)
  endpoint: 87.9.25.156:51820
  allowed ips: 192.168.89.0/24
  latest handshake: 1 minute, 15 seconds ago
  transfer: 7.72 MiB received, 18.25 MiB sent
  persistent keepalive: every 25 seconds

while on the remote peer:

root@remoteside:~ # wg showconf wg0
[Interface]
ListenPort = 51820
PrivateKey = [hidden]

[Peer]
PublicKey = [hidden]
PresharedKey = [hidden]
AllowedIPs = 10.89.5.3/32, 192.168.0.0/24
Endpoint = 101.56.58.184:1604       <----- this is not present in configuration file
PersistentKeepalive = 25

root@remoteside:~ # ip a s wg0
4276: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none

root@remoteside:~ # wg
interface: wg0
  public key: [hidden]
  private key: (hidden)
  listening port: 51820

peer: [hidden]
  preshared key: (hidden)
  endpoint: 101.56.58.184:1604
  allowed ips: 10.89.5.3/32, 192.168.0.0/24
  latest handshake: 41 seconds ago
  transfer: 515.93 MiB received, 449.25 MiB sent
  persistent keepalive: every 25 seconds

In the logs I did see that the script is correctly called by cron and correctly updates the running config with the newly resolved IP, so that part is working. After that, it is simply not enough to establish the connection again.
The OpenWRT side has a dynamic IP which I don't track with any DDNS at all; it is automatically updated in the wg configuration on the remote side when OpenWRT connects to it. Maybe the issue is due to that? Of course wireguard_watchdog runs only on the OpenWRT side; the other side is not an OpenWRT router.

Sure, what do you propose?

The only way I have found so far is to restart the interface itself. I don't know if there is another way; if I find one I'll let you know here.

egc · September 6, 2024, 8:49am

I use a more generic approach which also works if the server is down; this script can even start another tunnel in that case, see:

agorgl · December 13, 2024, 2:28pm

I have the same issue here.

Single wireguard 'server' with dynamic dns ip, multiple 'clients' (site to site links). Both server and clients are behind NAT, the server exposes the wireguard listen port with port forwarding, and the clients have persistent keepalive set to 25 and a cron job to run the wireguard_watchdog script every minute. I've checked (with the help of a friend that is one of these clients) and the re-resolve of the new/updated server ip seems to happen normally and the new address is set, as shown in the output of the wg command on the client.

The problem is that the connection seems to be stuck: no new handshakes are performed by the wireguard client even though all the endpoint information seems correct on both sides. The connection seems to heal if I down/up the client interface, or by itself after some random time between a minute and 1-2 days.

KSofen · December 13, 2024, 2:47pm

Most of my WireGuard issues are resolved by installing the chrony package. When wireguard isn't working, the first thing to check is your real-time clock. If it's not exactly correct, wireguard won't work even if everything is set up perfectly. I don't know if this is related to what's happening with your script, but I'm throwing this tidbit in just in case.
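
A rough way to catch a badly wrong clock before blaming the tunnel is sketched below. It only checks that the system time is not earlier than a fixed floor, which would catch a router that booted without time sync but not a clock that is off by minutes. The function name `clock_looks_sane` and the `MIN_EPOCH` floor are assumptions for illustration.

```shell
# Hypothetical check: succeed if the given epoch (default: the current
# system time) is not earlier than MIN_EPOCH.
# The default floor 1704067200 is 2024-01-01 00:00:00 UTC.
clock_looks_sane() {
  now=${1:-$(date +%s)}
  [ "$now" -ge "${MIN_EPOCH:-1704067200}" ]
}

# Usage:
#   clock_looks_sane || logger -t "wireguard_monitor" "system clock looks wrong"
```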

egc · December 13, 2024, 2:54pm

I think the keepalive only works if you are initiating the connection.

In a typical site-to-site setup both sides are set up to initiate the connection, but only the side which actually initiates the connection will use the keepalive.

So if the other side initiates the connection this might not help.
In theory, if the other side changes its endpoint it should signal this and send its new endpoint, but in my testing this does not always work. I do not know why, hence my own watchdog script.