Wireguard dropping randomly (~ once a month)

Hi,

I have a Belkin AX3200 running OpenWrt 22.03-SNAPSHOT r19575-506432a783. I have a simple setup where all my traffic is routed through a WireGuard server. Recently (for the second time in two months), I started getting connection errors to my WireGuard server. Basically, I get the following messages in the WireGuard server log:

[Sat Aug 27 03:40:31 2022] wireguard: wg0: Handshake for peer 3 (xxx:41586) did not complete after 5 seconds, retrying (try 10)
[Sat Aug 27 03:40:31 2022] wireguard: wg0: Sending handshake initiation to peer 3 (xxx:41586)
[Sat Aug 27 03:40:33 2022] wireguard: wg0: Handshake for peer 2 (xxx:50061) did not complete after 5 seconds, retrying (try 5)
[Sat Aug 27 03:40:33 2022] wireguard: wg0: Sending handshake initiation to peer 2 (xxx:50061)
[Sat Aug 27 03:40:34 2022] wireguard: wg0: Invalid handshake initiation from xxx:36159
[Sat Aug 27 03:40:36 2022] wireguard: wg0: Handshake for peer 3 (xxx:41586) did not complete after 5 seconds, retrying (try 11)
[Sat Aug 27 03:40:36 2022] wireguard: wg0: Sending handshake initiation to peer 3 (xxx:41586)

Restarting/resetting OpenWrt does not fix the issue. Neither does changing the WireGuard config to use a different key/setting.
The only workaround that I found was to upload my backed-up settings to the server. (I swear I haven't changed anything since I backed it up.)

Is it possible that the IP address of one peer or the other has changed?

I don't think I changed the IP, but occasionally I get these logs on the WG server:

[2897553.798909] wireguard: wg0: Keypair 23991 created for peer 1
[2897553.808183] wireguard: wg0: Receiving keepalive packet from peer 1 (xxx:44148)
[2897608.586255] wireguard: wg0: Packet has unallowed src IP (10.0.1.135) from peer 1 (xxx:44148)

Have you verified the configs? You could post them here for review. The question would be why peer 1 is suddenly unallowed (did the source IP change?).

network.loopback=interface
network.loopback.device='lo'
network.loopback.proto='static'
network.loopback.ipaddr='127.0.0.1'
network.loopback.netmask='255.0.0.0'
network.globals=globals
network.globals.ula_prefix='xxx'
network.@device[0]=device
network.@device[0].name='br-lan'
network.@device[0].type='bridge'
network.@device[0].ports='lan1' 'lan2' 'lan3' 'lan4'
network.lan=interface
network.lan.device='br-lan'
network.lan.proto='static'
network.lan.ipaddr='192.168.1.1'
network.lan.netmask='255.255.255.0'
network.lan.ip6assign='60'
network.wan=interface
network.wan.device='wan'
network.wan.proto='dhcp'
network.wan6=interface
network.wan6.device='wan'
network.wan6.proto='dhcpv6'
network.wgvpn=interface
network.wgvpn.proto='wireguard'
network.wgvpn.private_key='xxx'
network.wgvpn.addresses='10.7.0.2/32'
network.@wireguard_wgvpn[0]=wireguard_wgvpn
network.@wireguard_wgvpn[0].public_key='xxx'
network.@wireguard_wgvpn[0].preshared_key='xxx'
network.@wireguard_wgvpn[0].allowed_ips='0.0.0.0/0'
network.@wireguard_wgvpn[0].route_allowed_ips='1'
network.@wireguard_wgvpn[0].endpoint_host='xxx'
network.@wireguard_wgvpn[0].endpoint_port='51820'
network.guest_dev=device
network.guest_dev.type='bridge'
network.guest_dev.name='br-guest'
network.guest=interface
network.guest.proto='static'
network.guest.device='br-guest'
network.guest.ipaddr='10.0.1.1'
network.guest.netmask='255.255.255.0'

At first I thought this was the issue:

network.wgvpn.addresses='10.7.0.2/32'

It was originally set to 10.7.0.2, but I tried the same config on a Windows WireGuard client and it works fine.

What does the other side of the config look like?

[Interface]
Address = 10.7.0.1/24
PrivateKey = xxx
ListenPort = 51820

# BEGIN_PEER p1
[Peer]
PublicKey = xxx
PresharedKey = xxx
AllowedIPs = 10.7.0.2/32
# END_PEER p1
# BEGIN_PEER p2
[Peer]
PublicKey = xxx
PresharedKey = xxx
AllowedIPs = 10.7.0.3/32
# END_PEER p2
# BEGIN_PEER p3
[Peer]
PublicKey = xxx
PresharedKey = xxx
AllowedIPs = 10.7.0.4/32
# END_PEER p3

So I don't see anything wrong here. It is possible that the firewall/routing tables are allowing 10.0.1.0/24 traffic to pass into the tunnel, but otherwise I don't see a problem there.
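
To illustrate why the server would log "unallowed src IP" for 10.0.1.135, here is a minimal Python sketch of the AllowedIPs check, using only addresses from the configs above. The helper name `is_allowed` is made up for this example, I'm assuming that "peer 1" in the log corresponds to p1 in the server config, and the real check happens inside WireGuard itself rather than in a script like this.

```python
import ipaddress

def is_allowed(src_ip, allowed_ips):
    """Return True if src_ip falls inside any of the peer's AllowedIPs."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_ips)

# Server-side AllowedIPs for peer p1, copied from the server config.
p1_allowed = ["10.7.0.2/32"]

print(is_allowed("10.7.0.2", p1_allowed))    # router's tunnel address -> True
print(is_allowed("10.0.1.135", p1_allowed))  # guest-network client   -> False
```

In other words, a packet from the guest subnet (10.0.1.0/24) that enters the tunnel without being masqueraded to 10.7.0.2 will be dropped by the server, because only 10.7.0.2/32 is allowed for that peer.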

I suspect that one side or the other is getting a new public IP address when the tunnel goes down. Take a look at wireguard-watchdog for a method to fix this issue.
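
As a rough picture of what a watchdog does, here is a small Python sketch of the decision step only: given the output of `wg show <iface> latest-handshakes` (one line per peer, public key and epoch timestamp separated by a tab), it lists the peers whose last handshake is too old. The function names, the sample keys, and the 150-second threshold are assumptions for illustration, and this is not the actual watchdog script.

```python
def parse_latest_handshakes(text):
    """Parse `wg show <iface> latest-handshakes` output into {pubkey: epoch}."""
    result = {}
    for line in text.strip().splitlines():
        pubkey, epoch = line.split()
        result[pubkey] = int(epoch)
    return result

def stale_peers(handshakes, now, max_idle=150):
    """Return peers whose last handshake is older than max_idle seconds.

    An epoch of 0 means the peer has never completed a handshake,
    so it is always reported as stale.
    """
    return [key for key, ts in handshakes.items() if now - ts > max_idle]

# Placeholder keys and timestamps, not taken from a real router.
sample = "PEER_A_KEY\t1661571631\nPEER_B_KEY\t0\n"
now = 1661571700
print(stale_peers(parse_latest_handshakes(sample), now))  # -> ['PEER_B_KEY']
```

For each stale peer, a watchdog would then resolve the endpoint hostname again and update the peer's endpoint, which is what picks up a changed public IP without a reboot.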

I guess my question is: why does this not fix itself with a reboot? Is reloading the configuration the only fix?

A reboot should fix it... but...

If the WG interface comes up before the router performs an NTP (network time protocol) sync, it may not work at all. This may be the situation you are experiencing upon reboot.

Most OpenWrt devices lack a realtime clock, and therefore most of these systems will rely on the NTP sync for accurate time. When OpenWrt boots up, it will use the timestamp from the most recently written file as the 'current' time and then count up from there (this could be days, weeks, even years in the past). Normally, once the network has been established, the system will be able to perform an NTP sync and get the actual time.
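
The "newest file" idea can be sketched in a few lines of Python. This is only an illustration of the logic described above, not OpenWrt's actual boot code. The function name `newest_mtime` and the demo file names and timestamps are made up, and which directory the router really scans is not something I'm asserting here.

```python
import os
import tempfile

def newest_mtime(root):
    """Return the latest modification time (epoch seconds) of any file under root."""
    newest = 0.0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            newest = max(newest, os.lstat(path).st_mtime)
    return newest

# Demo with a throwaway directory instead of the router's real filesystem.
with tempfile.TemporaryDirectory() as root:
    for name, mtime in (("network", 1_656_000_000), ("firewall", 1_659_000_000)):
        path = os.path.join(root, name)
        open(path, "w").close()
        os.utime(path, (mtime, mtime))
    boot_clock = newest_mtime(root)
    print(boot_clock)  # -> 1659000000.0
```

If no file has been written for a month, the clock after a reboot starts roughly a month in the past and stays there until NTP corrects it.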

When a VPN such as wireguard starts up, it will effectively redirect all of the network traffic through the tunnel (and prevent regular traffic egress via the WAN -- only the tunnel itself traverses the WAN). If the time is not accurate (or at least ballpark), this tunnel will not be able to function, but it will still block egress via WAN, so the result is that NTP cannot sync and the tunnel can never truly be established.
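
As for why the time matters at all: as far as I understand the protocol, each WireGuard handshake initiation carries a timestamp, and the responder discards initiations whose timestamp is not newer than the greatest one it has already seen from that peer (replay protection). The Python sketch below is a simplified model of that rule with plain integer timestamps and hypothetical values; the function name `accept_initiation` is made up for illustration.

```python
def accept_initiation(greatest_seen, timestamp):
    """Simplified replay check for one peer.

    Accept only if timestamp is strictly newer than the greatest
    timestamp seen so far. Returns (accepted, new_greatest_seen).
    """
    if greatest_seen is not None and timestamp <= greatest_seen:
        return False, greatest_seen
    return True, timestamp

# Hypothetical epochs: the router connects once with a correct clock,
# then reboots with a clock about a month in the past.
state = None
ok_before, state = accept_initiation(state, 1_661_500_000)
ok_after, state = accept_initiation(state, 1_661_500_000 - 30 * 86_400)
print(ok_before, ok_after)  # -> True False
```

Under this model, a router whose clock has jumped backwards keeps sending initiations that the server refuses, which would fit the repeated handshake retries in the log.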

If my hunch is correct about the public IP address changing on one or both sides of the tunnel, this would generally be resolved by a reboot (you'd get a fresh DNS resolution). But the reboot would happen and the time would be off -- likely running ~1 month in the past based on your description.

When you reload the configuration, you are also updating the timestamp of the /etc/config/network configuration file. This is possible because the interface is down and thus WAN egress is possible, meaning NTP can sync, allowing the timestamp of the network file to be accurate. A reboot immediately following this reload would still be sufficiently close to accurate time, and thus would still work.

So how to fix this?

First, if my initial thesis is correct, the watchdog will fix the core of this behavior. Beyond that, you can delay the Wireguard interface startup until after an NTP sync has happened -- this will resolve the issue upon reboot. Search the forums for wireguard NTP and you'll see several threads on the topic.
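
The "wait for NTP first" idea boils down to the loop below. It is a Python sketch of the logic only, with a made-up function name and a fake clock for the demo; the solutions you'll find in the forum threads are shell-based and run on the router itself.

```python
import time

def wait_for_sane_clock(min_epoch, clock=time.time, sleep=time.sleep,
                        poll_seconds=5, max_polls=60):
    """Poll until the clock reads at least min_epoch; return True on success."""
    for _ in range(max_polls):
        if clock() >= min_epoch:
            return True
        sleep(poll_seconds)
    return False

# Demo with a fake clock that jumps forward on the third reading,
# standing in for an NTP sync correcting a stale boot-time clock.
readings = iter([1_659_000_000, 1_659_000_005, 1_661_600_000])
synced = wait_for_sane_clock(1_661_000_000,
                             clock=lambda: next(readings),
                             sleep=lambda _seconds: None)
print(synced)  # -> True
```

Only once this returns True would the tunnel be brought up (for example with `ifup wgvpn`), so the first handshake already carries a correct timestamp.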

Thanks for the suggestion, I'll keep an eye on this.
