WireGuard random packet drop

Hi all,

Some very strange behavior came up today. Pinging my router over a WireGuard tunnel drops about 70% of the ping responses. Most importantly: the exact same connection worked before.

The Details
I have an OpenWrt router (Fritz 3370) on version 19.07.3 (self-compiled) with kernel version 4.14.180. This one is connected to the internet. On the other side I have a small server running in a data center. When I ping from the server to the router, it works for a few seconds, stops working for a few seconds, starts working again, and so on...
I am able to run tcpdump on both devices, and I can see that the packets always disappear on the way from the router to the server. The other direction seems to work: ping requests arrive at the router and are answered, but the responses do not reach the server. Ping requests sent from the router do not reach the server either (inside a "ping does not work" window).

I found this post describing a similar issue, but I am not aware of having software flow offloading enabled.

Any hints are appreciated and I happily provide additional insight.

edit:
I do have another router (same hardware, different WAN). I cannot say they are identically configured, but that other one does not show the described behavior.

If you can see the packets in the tcpdump output, and they have the correct headers, it means that the packet was sent to the wire and was not blocked by the firewall or something similar.

uci export firewall

:~# uci export firewall | grep soft | wc -l
0

:~# uci export firewall
package firewall

config defaults
	option syn_flood '1'
	option input 'ACCEPT'
	option output 'ACCEPT'
	option forward 'DROP'

config zone
	option name 'lan'
	option input 'ACCEPT'
	option output 'ACCEPT'
	option forward 'ACCEPT'
	option masq '1'
	option network 'lan'

config zone
	option name 'wan'
	option output 'ACCEPT'
	option masq '1'
	option mtu_fix '1'
	option input 'DROP'
	option forward 'DROP'
	option network 'wan wan6'

config rule
	option name 'Allow-DHCP-Renew'
	option src 'wan'
	option proto 'udp'
	option dest_port '68'
	option target 'ACCEPT'
	option family 'ipv4'

config rule
	option name 'Allow-Ping'
	option src 'wan'
	option proto 'icmp'
	option icmp_type 'echo-request'
	option family 'ipv4'
	option target 'ACCEPT'

config rule
	option name 'Allow-IGMP'
	option src 'wan'
	option proto 'igmp'
	option family 'ipv4'
	option target 'ACCEPT'

config rule
	option name 'Allow-DHCPv6'
	option src 'wan'
	option proto 'udp'
	option src_ip 'fc00::/6'
	option dest_ip 'fc00::/6'
	option dest_port '546'
	option family 'ipv6'
	option target 'ACCEPT'

config rule
	option name 'Allow-MLD'
	option src 'wan'
	option proto 'icmp'
	option src_ip 'fe80::/10'
	list icmp_type '130/0'
	list icmp_type '131/0'
	list icmp_type '132/0'
	list icmp_type '143/0'
	option family 'ipv6'
	option target 'ACCEPT'

config rule
	option name 'Allow-ICMPv6-Input'
	option src 'wan'
	option proto 'icmp'
	list icmp_type 'echo-request'
	list icmp_type 'echo-reply'
	list icmp_type 'destination-unreachable'
	list icmp_type 'packet-too-big'
	list icmp_type 'time-exceeded'
	list icmp_type 'bad-header'
	list icmp_type 'unknown-header-type'
	list icmp_type 'router-solicitation'
	list icmp_type 'neighbour-solicitation'
	list icmp_type 'router-advertisement'
	list icmp_type 'neighbour-advertisement'
	option limit '1000/sec'
	option family 'ipv6'
	option target 'ACCEPT'

config rule
	option name 'Allow-ICMPv6-Forward'
	option src 'wan'
	option dest '*'
	option proto 'icmp'
	list icmp_type 'echo-request'
	list icmp_type 'echo-reply'
	list icmp_type 'destination-unreachable'
	list icmp_type 'packet-too-big'
	list icmp_type 'time-exceeded'
	list icmp_type 'bad-header'
	list icmp_type 'unknown-header-type'
	option limit '1000/sec'
	option family 'ipv6'
	option target 'ACCEPT'

config rule
	option name 'Allow-IPSec-ESP'
	option src 'wan'
	option dest 'lan'
	option proto 'esp'
	option target 'ACCEPT'

config rule
	option name 'Allow-ISAKMP'
	option src 'wan'
	option dest 'lan'
	option dest_port '500'
	option proto 'udp'
	option target 'ACCEPT'

config include
	option path '/etc/firewall.user'

config zone
	option name 'guest'
	option output 'ACCEPT'
	option network 'Guest guest'
	option input 'DROP'
	option forward 'DROP'
	option masq '1'

config forwarding
	option dest 'wan'
	option src 'guest'

config rule 'guest_rule_dns'
	option name 'Allow DNS Queries'
	option src 'guest'
	option dest_port '53'
	option proto 'udp'
	option target 'ACCEPT'

config rule 'guest_rule_dhcp'
	option name 'Allow DHCP request'
	option src 'guest'
	option target 'ACCEPT'
	option proto 'udp'
	option dest_port '67 68'

config zone
	option input 'ACCEPT'
	option output 'ACCEPT'
	option name 'backup'
	option forward 'ACCEPT'
	option network 'backup'
	option masq '1'

config zone
	option input 'ACCEPT'
	option output 'ACCEPT'
	option forward 'ACCEPT'
	option name 'wg'
	option network 'wg0'

config forwarding
	option dest 'wan'
	option src 'backup'

config forwarding
	option dest 'lan'
	option src 'backup'

config forwarding
	option dest 'backup'
	option src 'lan'

config forwarding
	option dest 'wan'
	option src 'lan'

config forwarding
	option dest 'wan'
	option src 'wg'

config forwarding
	option dest 'wg'
	option src 'lan'

config forwarding
	option dest 'lan'
	option src 'wg'

config rule
	option target 'ACCEPT'
	option proto 'tcp udp'
	option dest_port '31234'
	option name 'Wireguard'
	option src '*'

I rebooted everything a few times but nothing has changed. Admittedly, I cannot prove that it is not something on the way to (or in) the data center. But if that were the case, shouldn't the pings to my second router be dropped as well?

edit:
I just tried pinging both routers from the server side by side, and I really can see the pings stopping in one terminal while the other is happily getting its responses.

And do you see the pings arriving at the tcpdump running on the router with the issue?

You are right... I wouldn't believe myself either. So I did some real kick-ass data science + visualization:
(Copy it into a text editor if you want an overview. I do have a LibreOffice Calc spreadsheet file, but the forum seems to dislike non-picture uploads.)

server-vm:                                                              gate2:                                                                  ping: server-v > gate2
20:51:23.862898	metrics.fewo	>	gate2.fewo:	request,	seq	1		20:51:23.876900	metrics.fewo	>	gate2.fewo:	request,	seq	1					
20:51:23.891472	gate2.fewo		>	metrics.fewo:	reply,	seq	1		20:51:23.877212	gate2.fewo		>	metrics.fewo:	reply,	seq	1		from	gate2.fewo	icmp_seq=1	time=28.6
20:51:24.864589	metrics.fewo	>	gate2.fewo:	request,	seq	2		20:51:24.878305	metrics.fewo	>	gate2.fewo:	request,	seq	2					
																		20:51:24.878578	gate2.fewo		>	metrics.fewo:	reply,	seq	2					
20:51:25.869066	metrics.fewo	>	gate2.fewo:	request,	seq	3		20:51:25.883228	metrics.fewo	>	gate2.fewo:	request,	seq	3					
																		20:51:25.883501	gate2.fewo		>	metrics.fewo:	reply,	seq	3					
20:51:26.893068	metrics.fewo	>	gate2.fewo:	request,	seq	4		20:51:26.907163	metrics.fewo	>	gate2.fewo:	request,	seq	4					
																		20:51:26.907438	gate2.fewo		>	metrics.fewo:	reply,	seq	4					
20:51:27.917075	metrics.fewo	>	gate2.fewo:	request,	seq	5		20:51:27.930767	metrics.fewo	>	gate2.fewo:	request,	seq	5					
																		20:51:27.931144	gate2.fewo		>	metrics.fewo:	reply,	seq	5					
20:51:28.941055	metrics.fewo	>	gate2.fewo:	request,	seq	6		20:51:28.954698	metrics.fewo	>	gate2.fewo:	request,	seq	6					
																		20:51:28.955064	gate2.fewo		>	metrics.fewo:	reply,	seq	6					
20:51:29.965064	metrics.fewo	>	gate2.fewo:	request,	seq	7		20:51:29.979189	metrics.fewo	>	gate2.fewo:	request,	seq	7					
																		20:51:29.979461	gate2.fewo		>	metrics.fewo:	reply,	seq	7					
20:51:30.989131	metrics.fewo	>	gate2.fewo:	request,	seq	8		20:51:31.003321	metrics.fewo	>	gate2.fewo:	request,	seq	8					
																		20:51:31.003588	gate2.fewo		>	metrics.fewo:	reply,	seq	8					
20:51:32.013061	metrics.fewo	>	gate2.fewo:	request,	seq	9		20:51:32.027041	metrics.fewo	>	gate2.fewo:	request,	seq	9					
																		20:51:32.027319	gate2.fewo		>	metrics.fewo:	reply,	seq	9					
20:51:33.037066	metrics.fewo	>	gate2.fewo:	request,	seq	10		20:51:33.050890	metrics.fewo	>	gate2.fewo:	request,	seq	10					
																		20:51:33.051235	gate2.fewo		>	metrics.fewo:	reply,	seq	10					
20:51:34.061049	metrics.fewo	>	gate2.fewo:	request,	seq	11		20:51:34.075158	metrics.fewo	>	gate2.fewo:	request,	seq	11					
																		20:51:34.075430	gate2.fewo		>	metrics.fewo:	reply,	seq	11					
20:51:35.085110	metrics.fewo	>	gate2.fewo:	request,	seq	12		20:51:35.099440	metrics.fewo	>	gate2.fewo:	request,	seq	12					
																		20:51:35.099708	gate2.fewo		>	metrics.fewo:	reply,	seq	12					
20:51:36.109078	metrics.fewo	>	gate2.fewo:	request,	seq	13		20:51:36.123233	metrics.fewo	>	gate2.fewo:	request,	seq	13					
																		20:51:36.123506	gate2.fewo		>	metrics.fewo:	reply,	seq	13					
20:51:37.133041	metrics.fewo	>	gate2.fewo:	request,	seq	14		20:51:37.146863	metrics.fewo	>	gate2.fewo:	request,	seq	14					
																		20:51:37.147299	gate2.fewo		>	metrics.fewo:	reply,	seq	14					
20:51:38.157082	metrics.fewo	>	gate2.fewo:	request,	seq	15		20:51:38.171708	metrics.fewo	>	gate2.fewo:	request,	seq	15					
																		20:51:38.171979	gate2.fewo		>	metrics.fewo:	reply,	seq	15					
20:51:39.181101	metrics.fewo	>	gate2.fewo:	request,	seq	16		20:51:39.195436	metrics.fewo	>	gate2.fewo:	request,	seq	16					
																		20:51:39.195711	gate2.fewo		>	metrics.fewo:	reply,	seq	16					
20:51:40.205114	metrics.fewo	>	gate2.fewo:	request,	seq	17		20:51:40.219226	metrics.fewo	>	gate2.fewo:	request,	seq	17					
20:51:40.234044	gate2.fewo		>	metrics.fewo:	reply,	seq	17		20:51:40.219506	gate2.fewo		>	metrics.fewo:	reply,	seq	17		from	gate2.fewo	icmp_seq=17	time=29.0
20:51:41.206281	metrics.fewo	>	gate2.fewo:	request,	seq	18		20:51:41.220318	metrics.fewo	>	gate2.fewo:	request,	seq	18					
20:51:41.234726	gate2.fewo		>	metrics.fewo:	reply,	seq	18		20:51:41.220592	gate2.fewo		>	metrics.fewo:	reply,	seq	18		from	gate2.fewo	icmp_seq=18	time=28.5
20:51:42.207911	metrics.fewo	>	gate2.fewo:	request,	seq	19		20:51:42.222167	metrics.fewo	>	gate2.fewo:	request,	seq	19					
20:51:42.236982	gate2.fewo		>	metrics.fewo:	reply,	seq	19		20:51:42.222462	gate2.fewo		>	metrics.fewo:	reply,	seq	19		from	gate2.fewo	icmp_seq=19	time=29.1
20:51:43.209229	metrics.fewo	>	gate2.fewo:	request,	seq	20		20:51:43.223361	metrics.fewo	>	gate2.fewo:	request,	seq	20					
20:51:43.237624	gate2.fewo		>	metrics.fewo:	reply,	seq	20		20:51:43.223632	gate2.fewo		>	metrics.fewo:	reply,	seq	20		from	gate2.fewo	icmp_seq=20	time=28.4
20:51:44.210812	metrics.fewo	>	gate2.fewo:	request,	seq	21		20:51:44.225199	metrics.fewo	>	gate2.fewo:	request,	seq	21					
20:51:44.240116	gate2.fewo		>	metrics.fewo:	reply,	seq	21		20:51:44.225487	gate2.fewo		>	metrics.fewo:	reply,	seq	21		from	gate2.fewo	icmp_seq=21	time=29.3
20:51:45.212308	metrics.fewo	>	gate2.fewo:	request,	seq	22		20:51:45.226606	metrics.fewo	>	gate2.fewo:	request,	seq	22					
20:51:45.241390	gate2.fewo		>	metrics.fewo:	reply,	seq	22		20:51:45.226876	gate2.fewo		>	metrics.fewo:	reply,	seq	22		from	gate2.fewo	icmp_seq=22	time=29.1
20:51:46.213965	metrics.fewo	>	gate2.fewo:	request,	seq	23		20:51:46.228277	metrics.fewo	>	gate2.fewo:	request,	seq	23					
																		20:51:46.228579	gate2.fewo		>	metrics.fewo:	reply,	seq	23					
20:51:47.245117	metrics.fewo	>	gate2.fewo:	request,	seq	24		20:51:47.259094	metrics.fewo	>	gate2.fewo:	request,	seq	24					
																		20:51:47.259372	gate2.fewo		>	metrics.fewo:	reply,	seq	24					
20:51:48.269063	metrics.fewo	>	gate2.fewo:	request,	seq	25		20:51:48.283266	metrics.fewo	>	gate2.fewo:	request,	seq	25					
																		20:51:48.283533	gate2.fewo		>	metrics.fewo:	reply,	seq	25					
20:51:49.293076	metrics.fewo	>	gate2.fewo:	request,	seq	26		20:51:49.307336	metrics.fewo	>	gate2.fewo:	request,	seq	26					
																		20:51:49.307603	gate2.fewo		>	metrics.fewo:	reply,	seq	26					
20:51:50.317071	metrics.fewo	>	gate2.fewo:	request,	seq	27		20:51:50.331141	metrics.fewo	>	gate2.fewo:	request,	seq	27					
																		20:51:50.331412	gate2.fewo		>	metrics.fewo:	reply,	seq	27					
20:51:51.341108	metrics.fewo	>	gate2.fewo:	request,	seq	28		20:51:51.355054	metrics.fewo	>	gate2.fewo:	request,	seq	28					
																		20:51:51.355331	gate2.fewo		>	metrics.fewo:	reply,	seq	28					
20:51:52.365704	metrics.fewo	>	gate2.fewo:	request,	seq	29		20:51:52.379853	metrics.fewo	>	gate2.fewo:	request,	seq	29					
																		20:51:52.380127	gate2.fewo		>	metrics.fewo:	reply,	seq	29					

Hope the structure is clear.

  • Left: server tcpdump
  • Middle: router tcpdump
  • Right: ping from server to router
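
The manual side-by-side comparison could also be scripted. The following is a rough Python sketch, assuming each capture is saved as its own text file and that every relevant line contains "request" or "reply" followed by "seq N", as in the shortened lines of the table. The function names and sample strings are made up for illustration.

```python
import re

# Only "request"/"reply" and "seq N" are used, so the shortened table lines
# match; full tcpdump ICMP lines should match too (an assumption, not verified
# against every tcpdump version).
ICMP_LINE = re.compile(r"\b(request|reply),.*?\bseq\s+(\d+)")

def seqs(capture: str, kind: str) -> set:
    """Collect the sequence numbers of all echo requests or replies in a capture."""
    result = set()
    for line in capture.splitlines():
        match = ICMP_LINE.search(line)
        if match and match.group(1) == kind:
            result.add(int(match.group(2)))
    return result

def lost_replies(router_capture: str, server_capture: str) -> list:
    """Replies the router sent that never showed up in the server capture."""
    return sorted(seqs(router_capture, "reply") - seqs(server_capture, "reply"))

router = """20:51:23.876900 metrics.fewo > gate2.fewo: request, seq 1
20:51:23.877212 gate2.fewo > metrics.fewo: reply, seq 1
20:51:24.878305 metrics.fewo > gate2.fewo: request, seq 2
20:51:24.878578 gate2.fewo > metrics.fewo: reply, seq 2
"""
server = """20:51:23.862898 metrics.fewo > gate2.fewo: request, seq 1
20:51:23.891472 gate2.fewo > metrics.fewo: reply, seq 1
20:51:24.864589 metrics.fewo > gate2.fewo: request, seq 2
"""

print(lost_replies(router, server))  # [2]
```

Every sequence number in the result is a reply that left the router but was never seen on the server.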

What I see in the pinging terminal:

server-vm:# ping gate2.fewo
PING gate2.fewo(gate2.fewo) 56 data bytes
64 bytes from gate2.fewo: icmp_seq=1 ttl=64 time=28.6 ms
64 bytes from gate2.fewo: icmp_seq=17 ttl=64 time=29.0 ms
64 bytes from gate2.fewo: icmp_seq=18 ttl=64 time=28.5 ms
64 bytes from gate2.fewo: icmp_seq=19 ttl=64 time=29.1 ms
64 bytes from gate2.fewo: icmp_seq=20 ttl=64 time=28.4 ms
64 bytes from gate2.fewo: icmp_seq=21 ttl=64 time=29.3 ms
64 bytes from gate2.fewo: icmp_seq=22 ttl=64 time=29.1 ms
^C
--- gate2.fewo ping statistics ---
29 packets transmitted, 7 received, 75.8621% packet loss, time 28503ms
rtt min/avg/max/mdev = 28.434/28.867/29.344/0.331 ms
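
For longer runs, the loss windows can be pulled out of the ping output automatically. This is a small Python sketch, assuming sequence numbers start at 1 and the number of transmitted packets is taken from the statistics line. The function name is only illustrative.

```python
import re

def loss_windows(ping_output: str, transmitted: int) -> list:
    """Return (first_seq, last_seq) ranges of consecutive unanswered pings."""
    received = {int(n) for n in re.findall(r"icmp_seq=(\d+)", ping_output)}
    windows, start = [], None
    for seq in range(1, transmitted + 1):
        if seq not in received:
            if start is None:
                start = seq                   # a new loss window begins
        elif start is not None:
            windows.append((start, seq - 1))  # window ended on the previous seq
            start = None
    if start is not None:
        windows.append((start, transmitted))  # loss ran until the end of the run
    return windows

output = """64 bytes from gate2.fewo: icmp_seq=1 ttl=64 time=28.6 ms
64 bytes from gate2.fewo: icmp_seq=17 ttl=64 time=29.0 ms
64 bytes from gate2.fewo: icmp_seq=18 ttl=64 time=28.5 ms
64 bytes from gate2.fewo: icmp_seq=19 ttl=64 time=29.1 ms
64 bytes from gate2.fewo: icmp_seq=20 ttl=64 time=28.4 ms
64 bytes from gate2.fewo: icmp_seq=21 ttl=64 time=29.3 ms
64 bytes from gate2.fewo: icmp_seq=22 ttl=64 time=29.1 ms
"""

print(loss_windows(output, 29))  # [(2, 16), (23, 29)]
```

For this run it yields two windows of 15 and 7 lost pings, which shows the loss coming in bursts rather than being spread evenly.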

Yes it was tricky, but at least it is obvious that the fault is not yours. I am not sure if your provider is cooperative enough to help you troubleshoot.

Facts:

  • Germany
  • Big enterprise company for backbone
  • Big enterprise company as reseller (actual ISP)

Conclusion:

  • What a bright new future you are proclaiming here...

But you got me thinking. At least the WireGuard handshake is working, so the tunnel must be up.
I tried executing two ping commands at the same time: one over the tunnel and one to the public address used to establish the tunnel (both from router to server). Interestingly, the ping over the public internet works without any trouble. I also double-checked the CPU load in case that was the cause, but there are 2 cores working at 5% each...

Maybe any other suggestions?

I can only think of throttling on the way. Try changing the WireGuard port to something else.
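
One way to test the throttling idea without touching the tunnel might be to send plain UDP along the same path on a few different ports and compare how many datagrams come back. The following Python sketch is an assumption about how such a probe could look, not something that was run in this thread. It uses a tiny echo server and a client that counts answered datagrams, shown here on the loopback interface only.

```python
import socket
import threading

def echo_server(sock: socket.socket) -> None:
    """Echo every datagram back to its sender until the socket fails or closes."""
    while True:
        try:
            data, addr = sock.recvfrom(2048)
        except OSError:
            return
        sock.sendto(data, addr)

def probe(host: str, port: int, count: int = 20, timeout: float = 0.5) -> int:
    """Send `count` numbered datagrams and return how many echoes came back."""
    answered = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        for seq in range(count):
            payload = str(seq).encode()
            client.sendto(payload, (host, port))
            try:
                data, _ = client.recvfrom(2048)
            except socket.timeout:
                continue                  # no echo in time: count it as lost
            if data == payload:
                answered += 1
    return answered

# Local demonstration only: server and client both on the loopback interface.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))             # port 0 lets the OS pick a free port
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

print(probe("127.0.0.1", server.getsockname()[1], count=5))
```

In a real test the echo part would run on the server and the probe on a host behind the router, on a port the firewall allows. If plain UDP shows the same on/off pattern, the WireGuard port itself is probably not the deciding factor.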

Okay I will try that.
The router in question is not at my place, and I would prefer being near it when I make changes to the VPN. So it may take a bit.

Small update:
Since I will not be able to try out a port change in the next few days, I did something else. Before, I had two routers connected via VPN to the server:

gate2 <--> server
gate1 <--> server

I changed this to route all the traffic over gate1 (which is totally not ideal...):

gate2 <--> gate1 <--> server

This actually works better, which is strange because gate1 and the server use the same ports.

Hi
Some time has passed and I would like to wrap this one up. Since the issue was somewhere beyond the configuration and hardware I administer, I contacted my ISP. That obviously took a while and multiple calls, since I am not their official contract partner...

As it happened, the one time I felt confident about tackling the issue, with everything set up and pings running to demonstrate it, they still were not willing to talk to me in detail. When I approached them again with sufficient permissions (5 days later), the issue was just gone...

Now I have configured everything as it's meant to be, and it works.
Case closed, I guess.
