I have two GL.iNet GL-MT6000 routers running identical firmware, OpenWrt 24.10.
A separate VPS (Linux) host also runs WireGuard.
For the purpose of this post I will name the OpenWrt routers A and B and the VPS host H.
H is a WireGuard server for multiple hosts and has a public IP. B is a peer of H. H does not have B's host name configured, so the connection is always initiated by B.
A has one WireGuard VPN configured, with peer B. A connects to B using B's DDNS host name (FQDN).
B has two WireGuard VPNs: one with H, connecting to H's public IP, and one with A, connecting to A's DDNS host name (FQDN). Reconnection between B and H never fails, so it is a useful baseline for comparison in my tests.
H --------- B --------- A
When B is rebooted, it always reconnects with H automatically, but it fails to reconnect with A.
Manually restarting the WireGuard interface on B does not reconnect the VPN (tested several times).
Manually restarting WireGuard on A does reconnect the VPN, although it sometimes takes several attempts before the VPN comes back up.
Having pinned the issue to router A, I tried the following:
Stripped the WireGuard configuration to a minimum, removing all routed networks except the subnet needed to connect A and B. NO CHANGE.
Removed B's endpoint host name and port from A's configuration, forcing B to initiate the connection as in the B-to-H case, because B has a dynamic IP resolved via DDNS. NO CHANGE.
Tried persistent keepalive set to 0, to 25, or a combination of the two on A and B. NO CHANGE.
I have read that WireGuard might start before the router has set the date/time via NTP. However, this does not explain why the connection from B to H never fails, nor why it is A, rather than B, that needs a WireGuard interface down/up cycle.
At this point I am not sure what to try. I could write a script to restart A's WireGuard interface automatically, but this is impractical: I want to configure A as a WireGuard server for several devices, some of which might not reconnect for a long time (mobile phones, for example), and I cannot keep restarting the interface in case one of them reconnects, because that would affect all the other connections.
Dedicating one wireguard interface for each potential client is also not an option.
I was wondering if someone with more wireguard knowledge could help.
H is in GMT whilst A and B are in CET timezone. I checked and A and B have the same time to the second.
I also noticed that the LuCI WireGuard Status page on A still shows peer B's endpoint as the IP address B had before the reboot (B has a dynamic IP). I thought that perhaps A expects the reconnection to come from the same IP, which almost never happens. However, B sometimes re-acquires the same public IP after a reboot (it is rare, but it happened two reboots ago), and in that case it failed to reconnect too.
So, how do I make A forget the peer endpoint address when the peer does not connect? I configure either the DDNS name or nothing, yet A reports the IP of the latest successful connection.
After more tests I believe the issue might be that when B is rebooted (and then acquires a new public IP), A keeps the connection state and still shows the previous endpoint IP:port.
In the Linux server WireGuard implementation I can omit B's endpoint and port, so that only B initiates the connection to H. On OpenWrt, if I omit the endpoint FQDN and port, the connection does not come up.
There must be a way to timeout the connection on A and forget the endpoint IP.
If I understood your problem correctly, I believe that is what wireguard_watchdog (part of the wireguard-tools package) is for. It triggers WireGuard to re-resolve a connection's endpoint after a certain period without a handshake, which is obviously useful if the endpoint is a dyndns host. WireGuard does not do that on its own: once it has resolved an endpoint's host name, it hangs on to the IP permanently.
Note: The timeout of 150 seconds in the script is too tight. It can be, and frequently is, exceeded by a healthy connection, whose handshakes can happen as infrequently as every (180 + persistent_keepalive) seconds.
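To make the timeout arithmetic concrete, here is a minimal Python sketch of the staleness check only. It is not the watchdog script itself. It assumes the input is the text printed by `wg show <interface> latest-handshakes` (one line per peer: public key, then the Unix time of the last handshake), that a timestamp of 0 means no handshake yet, and that the threshold is the (180 + persistent_keepalive) seconds suggested above. The function name and the sample keys are made up for illustration.

```python
import time

# Assumption taken from the note above: a healthy peer may go up to
# 180 s plus its persistent keepalive interval between handshakes.
BASE_TIMEOUT_S = 180


def stale_peers(handshakes_output, keepalive_s=25, now=None):
    """Return the public keys whose last handshake is older than the threshold.

    handshakes_output: text in the format of `wg show <iface> latest-handshakes`.
    keepalive_s: the peer's persistent keepalive in seconds (0 if disabled).
    now: current Unix time; defaults to the system clock.
    """
    now = int(time.time()) if now is None else now
    threshold = BASE_TIMEOUT_S + keepalive_s
    stale = []
    for line in handshakes_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        pubkey, last_handshake = fields[0], int(fields[1])
        # Treat "never handshaken" (0) as stale as well (assumption).
        if last_handshake == 0 or now - last_handshake > threshold:
            stale.append(pubkey)
    return stale


# Hypothetical sample output with two peers.
sample = "PEER_B_PUBKEY\t1700000000\nPEER_C_PUBKEY\t0\n"
print(stale_peers(sample, keepalive_s=25, now=1700000100))
print(stale_peers(sample, keepalive_s=25, now=1700000300))
```

With a keepalive of 25 the threshold is 205 seconds, so in the first call only the peer that never completed a handshake is reported, and in the second call both are. A peer flagged this way is the one whose endpoint host name would need to be resolved again.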