WireGuard on phone drops on roaming, regardless of 802.11r

I always have a WG connection live on my phone, and when roaming between my two APs the connection will often drop. Sometimes it doesn't come back at all until I manually disconnect and reconnect; sometimes it comes back after a few minutes; only rarely does it continue to work without a noticeable drop. I notice the disruptions because, while I'm roaming, a YouTube podcast is usually playing on NewPipe or music on a streaming app (both over the VPN). For context, I'll describe three things: the WG client apps, my current Wi-Fi setup, and a couple of scenarios where WG didn't drop while roaming.

I've tried three different WireGuard apps, and they behaved identically on roaming:

  • official WireGuard Android app (Go implementation)
  • Cloudflare's app for its Warp service
  • Proton VPN

In the official WG client, I use the configuration files for the Warp and Proton VPN services.

For my Wi-Fi setup, there are two APs with the same SSID and WPA3:

  • Belkin RT3200; also the router, broadcasting 802.11b/g/n and ac/ax/n on the 2.4GHz and 5GHz radios respectively
  • TP-Link Archer C7 v5; only broadcasting 802.11ac/n on 5GHz, 2.4GHz radio disabled

While I'd love to have a 5GHz-only environment, there are devices that respond abysmally at certain locations, whereas others operate just fine. Without 802.11r, roaming on my device seemed to work fine (besides the WG issue), but I don't remember checking how fast it switched or whether it retained the same IP while switching APs.

I've enabled 802.11r on all SSIDs, and it seems to function fine. I don't know of specific apps that could confirm it is enabled, but I've observed in WiFiAnalyzer (from F-Droid) that AP switching does seem to happen fast enough, and the phone retains the same IP while roaming.

Finally, with the WG apps I mentioned, I have roamed on enterprise network deployments elsewhere where the WG connection never dropped. I've either streamed or talked over VoIP on those networks over the active VPN connection. Of those enterprise networks, I know for a fact that one was a Cisco-based deployment (gathered from the captive portal) and the other is Xiaomi-based (I know of the AP installation).

I know this was a lengthy write-up just to describe the context and my experience, but would any of you happen to know how to figure out the issue? It's not really a deal-breaker for me: it's quite annoying but certainly functional, and I'm all for not having proprietary black boxes anywhere. At the same time, if there's a solution, I'd love for it to work flawlessly.

Ultimately, this apparently has nothing to do with WireGuard and OpenWrt, but with the fact that I'm using WPA3, which has been incompatible with fast roaming for almost two years.

I've been to the place with the Xiaomi AP fleet, and noticed that transitioning from AP to AP differed from my experience with my own OpenWrt APs. On the Xiaomi network, besides the signal strength changing, there was no other indication of roaming on my phone. With my OpenWrt devices, the phone disconnects for a full couple of seconds and the Wi-Fi symbol disappears before it reconnects to the next AP. That served as an indicator that fast roaming was nonfunctional.

There are workarounds, but I don't fully understand them. Maybe I'll try implementing those, or maybe I'll simply dial down to WPA2, or just give up on fast roaming altogether. Still pondering my choices.

I may have to reassess if this has to do with something more fundamental than 11r.

I was able to solve roaming on WPA3. It can be configured completely manually, as I did in each AP's /etc/config/wireless file as detailed here, or you can simply set two parameters, option ieee80211r '1' and option ft_psk_generate_local '0' (in LuCI, under the Wireless roaming tab, tick the 802.11r Fast Transition checkbox and untick Generate PMK locally), and OpenWrt will do the rest for you. Either method, whether you spell out all parameters manually or let OpenWrt do its thing, results in roaming working smoothly. For Apple devices you may also need to set option reassociation_deadline '20000' (LuCI: Reassociation Deadline); that is the wisdom of the community and Apple docs, as I have no fruit company products to test myself.
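
As a rough illustration of the shorter route described above, the relevant part of /etc/config/wireless might look like the fragment below. This is only a sketch: the section name, radio name, SSID, key and the 'sae' encryption value are placeholders assumed for illustration, and only the two roaming options (plus the optional Apple-related deadline) come from this post.

```
# Hypothetical section and radio names; yours will differ.
config wifi-iface 'default_radio0'
	option device 'radio0'
	option mode 'ap'
	option network 'lan'
	# Placeholder SSID and passphrase; use the same values on both APs.
	option ssid 'ExampleSSID'
	option encryption 'sae'
	option key 'example-passphrase'
	# 802.11r Fast Transition on.
	option ieee80211r '1'
	# "Generate PMK locally" off.
	option ft_psk_generate_local '0'
	# Reportedly needed for Apple clients (untested in this thread):
	# option reassociation_deadline '20000'
```

The same two roaming options would go on every AP that shares the SSID, and the wireless configuration has to be reloaded afterwards (for example with wifi reload, or a reboot) before the change takes effect.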

Unfortunately, it does not fix the reason for this thread: WireGuard connection still dies upon switching APs.

Now that I've figured out that fixing roaming doesn't fix the WireGuard issue, I've disabled 11r on both my APs for more testing. Anecdotally, I think the VPN hiccup occurs when I roam to my router, the Belkin RT3200. It happens even when switching from 5GHz to 2.4GHz, with both signals coming from the same device. WG keeps working as I move from the RT3200 to the Archer C7, but not in reverse. In fact, going back to the C7 lets the connection resume as if nothing happened.

This is seriously confusing. I'm not treating this thread as a blog post about my own diagnosis of this issue; please help me if you know anything at all.

Try to disable dropping invalid packets and check connection tracking settings:
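
For anyone following along, here is a sketch of what turning off invalid-packet dropping might look like in /etc/config/firewall. It assumes the drop_invalid option in the defaults section, which is the option named in the uci path quoted further down in this thread. The other defaults are omitted here and should be left as they are.

```
config defaults
	# ... existing options left unchanged ...
	# '0' stops the firewall from dropping packets that
	# connection tracking classifies as invalid.
	option drop_invalid '0'
```

The firewall would then need a restart (for example /etc/init.d/firewall restart) for the change to take effect.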

You may have reason to be more confused now than earlier: for the past three days there has been a major bug in netifd, both in main/master and 23.05, which causes some Wi-Fi SSIDs to be left out of the LAN bridge. That killed my roaming, and it can be hard to figure out (as the Wi-Fi client will revert to one AP).

See mailing list discussion http://lists.openwrt.org/pipermail/openwrt-devel/2023-November/041746.html

Is the command to re-enable it, uci set firewall.@defaults[0].drop_invalid="1"? How risky is it to mess with this, security-wise?

That sounds like a bug that affects normal, non-VPN traffic. That's not the case on my phones: I confirmed with a fast.com speed test while roaming, and there were no drops whatsoever. Also, my issue has been around since 22.03, when I installed the current setup.

Small update: I've confirmed the same behavior on a Windows 11 machine running WireGuard (kernel implementation). Roaming from the first AP's (RT3200) ac signal to its n signal, and then to the second AP's (Archer C7) ac signal, the connection stayed alive; when I moved back and returned to the first AP's n connection, WG died.

I'm writing from the same machine now, through the VPN. I didn't disconnect or reconnect WG, but I let the machine sleep hours ago after I tested. Now I've woken it and the connection resumed without issue.

I have an exciting new update! After updating to 23.05.2, instead of WireGuard dying when I move to the main AP and reviving when I return to the second, it's the opposite. Roaming to the second AP kills the connection now, and returning to the main AP revives it.

The few days since my last post have been extremely frustrating.

The invalid packet thing did nothing, except that sometimes I felt the WG connection recovered faster; fundamentally it was the same behavior of the connection dying with no recovery. I'll roll back this change.

As for connection tracking, I've no idea what changes would have what effects, so I left that untouched.

The netifd bug appears to have been fixed in master at least, but the reply asking for it to be backported to 23.05 was not answered. Was it backported? I'm still not sure it affects my WG thing though, since I don't see any issue on any connections except for WG.

Next I'll try to play around with trunking and the DSA configuration, but that's just me grasping at straws; I don't really think it has any bearing on my issue.

So I've completely reset the main AP/router, the Belkin RT3200, due to another entirely separate and more frustrating issue regarding IPv6. While the IPv6 issue itself was not solved, it seems the WG connection dropping issue may have been.

While I had wanted to do this before, it was not something I could do lightly, given that reconfiguring the interfaces, VLANs and accompanying firewall rules takes time, even if there aren't that many.

Anyhow, it seems the endeavor has been partially fruitful—I'll observe the behavior a bit more for good measure before marking this topic as solved.