Improvement to wireguard_watchdog script?

Previously I got some great help to get my openwrt 'client' configuration for wireguard sorted out - this was discussed in OpenWrt client to linux hosted wireguard. I also blogged about it https://lowtek.ca/roo/2022/openwrt-as-a-wireguard-client

Recently my home network connection had a weird problem, and my main gateway/router needed a reboot. The external IP did not change, but the remote openwrt wireguard "client" got stuck in some way. I was not seeing updated handshakes for hours.

I was pleased to discover that there was a wireguard_watchdog script to use. See these forum threads for setting that up:

I've added this to my configuration to check and 'kick' wireguard if my IP does change or something else happens.

I thought this was going to fix me up, but unfortunately it did not. I did some debugging on the script, and it is correctly detecting the issue and trying to fix it with

wg set ${iface} peer ${public_key} endpoint "${endpoint_host}:${endpoint_port}"

While this may help deal with an IP address change - it didn't tickle the interface enough to get my wireguard connection to start working again.

I then found the thread "Restart WireGuard via cli" (post #10 by satheras), which suggests running

ifdown ${iface}
ifup ${iface}

This, when issued manually, appears to have been the magic required.

The reason for this post - is to see what folks more knowledgeable than myself think of adding this to the wireguard_watchdog script. Maybe with a brief sleep before it to allow the wg set.. to take effect?

  logger -t "wireguard_monitor" "${iface} endpoint ${endpoint_host}:${endpoint_port} is not responding for ${idle_seconds} seconds, trying to re-resolve hostname"
  wg set ${iface} peer ${public_key} endpoint "${endpoint_host}:${endpoint_port}"
  sleep 1
  ifdown ${iface}
  ifup ${iface}

I can go find the code and create an issue and pull request, but would like feedback that this makes sense.

I will make a wild guess and say you are running the watchdog on the "server". It has to be run on the "client", i.e. the peer initiating the connection. Otherwise it cannot work.

Aside from that, ifdown/ifup on the interface is an exceedingly brutal way of reestablishing a single peer connection. It will of course take down all peer connections on the interface, even those that are working fine and may very well be in active use at the time. I would really consider resetting the whole interface a last resort, not a general recommendation.

JFTR: My experience with the wireguard watchdog script is pretty much the opposite of yours. For me, it would occasionally reset/re-resolve healthy peer connections. I believe that the timeout threshold of 150 seconds is just ever so slightly too low: even if persistent_keepalive is set to the recommended 25 seconds or lower, intervals between handshakes may well exceed 150 seconds on a healthy connection. I remember looking up the actual handshake interval, and it seems only loosely connected to the persistent_keepalive setting, but I have since forgotten the details. In my case, I increased the timeout threshold to 240 seconds and have not had problems since, and I run WireGuard connections on half a dozen machines spread across the world.

Hmm.. I think you are mistaken.

I run a wireguard "server" on a linux host. Wireguard actually runs in a linuxserver.io container, to which I map the UDP port. This works great. My phone, laptop, etc. all enjoy wireguard VPN when I need that. It gives me the ability to "be at home when I'm away".

In this post - I'm talking about an OpenWRT installation that is 'remote' from my home. Upstream for this is an LTE connection. As documented in OpenWrt client to linux hosted wireguard I struggled a little bit to figure out how to successfully make this OpenWRT into a 'client' of my existing wireguard server.

The wireguard_watchdog script - is running on this remote OpenWRT.

To describe the scenario again: my home system had some network snafu. Packets stopped flowing from other systems to the internet (i.e. a ping from a laptop on that network to 1.1.1.1 would fail). Restarting the home gateway/router fixed this internet problem.

The external IP that my home network has did not change. The wireguard container didn't restart because it lives on a linux server inside the home network. However, resetting the gateway clearly caused the OpenWRT 'remote' client wireguard to get into a bad state (or at least that connection).

I could see by looking at the remote OpenWRT that the last handshake value was climbing and had reached multiple hours (8hrs). Similarly, when I probed the server side and looked at wireguard there, it agreed: no handshake for hours.
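
For anyone wanting to watch the same value from the command line, here is a rough sketch rather than anything taken from the watchdog script itself. It assumes the tab-separated "public key, unix timestamp" output of `wg show <iface> latest-handshakes`, and that a timestamp of 0 means no handshake has completed yet. The function name `peer_idle_seconds` and the interface name `wg0` are placeholders of mine.

```shell
#!/bin/sh
# Hypothetical helper: reads lines of "<public_key><TAB><unix_timestamp>"
# on stdin (the format `wg show <iface> latest-handshakes` is assumed to
# emit) and prints "<public_key> <idle_seconds>" for each peer.
# $1 = current time in seconds since the epoch, e.g. "$(date +%s)".
peer_idle_seconds() {
  now="$1"
  while read -r key last_handshake; do
    if [ "${last_handshake}" -eq 0 ]; then
      # Assumption: 0 means the peer has never completed a handshake.
      echo "${key} never"
    else
      echo "${key} $((now - last_handshake))"
    fi
  done
}

# Usage on the router (wg0 is a placeholder interface name):
#   wg show wg0 latest-handshakes | peer_idle_seconds "$(date +%s)"
```

Passing the current time in as an argument keeps the function easy to check with canned input instead of a live tunnel.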

It was then that I connected to the 'remote' OpenWRT that is running the wireguard client - and did a manual ifdown/ifup to cause the connection to resolve itself and data started to flow again.

Maybe I should clarify that because I'm still testing this setup - the remote system - is physically in the same location so it's easy for me to switch between looking at one or the other. The networking components between home and remote are distinct and not physically connected.

wireguard <-> gateway <-> internet <- LTE <-> OpenWRT 
server                                        wireguard client

I do agree that the timeout in the wireguard_watchdog does seem very aggressive.

When I deploy this solution - the remote OpenWRT is going to be ~1hr away, and I expect things like janky power and other issues to happen - so the more of a safety net I can build the better.

I appreciate this advice - I'll consider modifying my 'patch' to the wireguard_watchdog to only do the brutal ifdown/ifup IF the last handshake value climbs to a much larger value. If the gentle wg set iface... works to restore connectivity, great. However, in the unlikely scenario where it's stuck for say 30mins -- I'll just go for the bigger kick.

Of course - the other alternative here is to take the approach of doing a full reboot too. There are plenty of "ping has failed, so time to reboot" type scripts. I generally do not like the "reboot regularly" approach, but sometimes turning it off / on again is the right hammer to use.

I stand, happily, corrected. I also see that your network setup is quite a bit more sophisticated than usual, which probably also introduces some unusual breaking points.

That seems like a sensible approach in your case. I mean, if your remote client only has that one peer connection it does not make much of a difference. But again, I wouldn't recommend it as a general improvement.

LoL - yeah sophisticated.. that's what I call it.

More like cobbled together and attempting to be as seamless (but useful) as possible. I don't want to stray too far from the beaten path which is why I'm asking questions about the wireguard_watchdog implementation.

Based on your comments, I've locally modified my copy of the watchdog script to read:

  idle_seconds=$(($(date +%s)-${last_handshake}))
  [ ${idle_seconds} -lt 250 ] && return 0;
  logger -t "wireguard_monitor" "${iface} endpoint ${endpoint_host}:${endpoint_port} is not responding for ${idle_seconds} seconds, trying to re-resolve hostname"
  wg set ${iface} peer ${public_key} endpoint "${endpoint_host}:${endpoint_port}"
  if [ ${idle_seconds} -gt 600 ]; then
    ifdown ${iface}
    ifup ${iface}
  fi
}

Two changes here.

  1. Bumped the "ignore it" cutoff from 150 to 250 seconds.
  2. Only if idle_seconds exceeds 600 seconds will I use the hammer to down/up the interface.
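
To make the two thresholds easier to reason about (and to test without a real tunnel), the decision can be pulled out into a small function. This is only a sketch of the same logic as the fragment above. The names `watchdog_action`, `SOFT_LIMIT` and `HARD_LIMIT` are hypothetical and not part of the upstream wireguard_watchdog script.

```shell
#!/bin/sh
# Hypothetical thresholds mirroring the local modification above.
SOFT_LIMIT=250   # below this, the peer is treated as healthy
HARD_LIMIT=600   # above this, escalate to ifdown/ifup

# Print the action for a given idle time in seconds:
#   none        - handshake is recent enough, do nothing
#   re-resolve  - run `wg set ... endpoint ...` only
#   restart     - run `wg set ...` and then ifdown/ifup the interface
watchdog_action() {
  idle_seconds="$1"
  if [ "${idle_seconds}" -lt "${SOFT_LIMIT}" ]; then
    echo "none"
  elif [ "${idle_seconds}" -gt "${HARD_LIMIT}" ]; then
    echo "restart"
  else
    echo "re-resolve"
  fi
}

# Example: watchdog_action 300 prints "re-resolve"
```

Note that exactly 250 seconds already triggers the gentle re-resolve, while the restart only kicks in above 600, matching the `-lt` and `-gt` comparisons in my modified script.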