WireGuard on devices drops on roaming, part deux

Continuing the discussion from WireGuard on phone drops on roaming, regardless of 802.11r:

After a complete reset fixed the problem, it has now come back and has been exhibiting the exact same behavior (as described in the previous topic) for the past week or so.

The only settings I've changed since the reset were encrypted DNS and mounting a flash drive for dnsmasq TFTP for PXE booting. The TFTP thing is what I did most recently (9d ago, according to the last post I made to that thread), but I don't see the correlation, and I don't think I noticed this issue in the immediate aftermath of enabling it. (Not to mention I never messed with TFTP prior to the reset, when I already had this problem.)

I've made this new thread to keep a loose log of my observations regarding this issue. Of course, any constructive input is most welcome. I may reset the router again and see if this issue magically recurs, but it's not easy to time it with lots of people depending on this connection on different VLANs.

Found the time and opportunity to reset the router again, and it's been more than 24 hours without the WireGuard roaming issue.

So far, on a custom build, the DNS over HTTPS and Transmission services are installed. For the network, there are two 802.1q devices, two default WANs and three different LANs. On the firewall side, I added four new rules for DNS and DHCP for the two additional LANs, and one more forwarding to WAN. This is going to be the base, known-good configuration going forward.

I think I have definitively found the culprit behind my WireGuard roaming woes: hardware flow offloading. As I alluded to in another recent post, I thought the RT3200's mediatek/mt7622 platform didn't support hardware offloading and that enabling it anyway caused the problem. But according to the device's wiki, it is supported, so what I've encountered could be a bug in its implementation.

Toggling hardware flow offloading easily reproduces the behavior I described in OP (to summarize: roaming kills any WG connection, which sometimes recovers after a long time but most often doesn't; non-WG connections appear unaffected).

I'm utterly unfamiliar with firewall voodoo. Can I somehow help debug this, maybe something to do with the conntrack things? Or should I report it to the much smarter people developing OpenWrt and hope for the best?
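On the conntrack idea, here is a rough sketch of what one could look at. As far as I know, the kernel marks offloaded entries in /proc/net/nf_conntrack with [OFFLOAD] (software flowtable) or [HW_OFFLOAD] (hardware), so it should be possible to check whether the WireGuard UDP flow is offloaded before and after a roam. The router probably has no Python, so the idea is to copy the file's contents to a PC and parse it there. The sample lines, addresses, the port 51820 and the function names below are all made up for illustration; your VPN provider's endpoint port may differ.

```python
import re

KEY_VALUE = re.compile(r"(\w+)=(\S+)")
FLAG = re.compile(r"\[(\w+)\]")
L4_PROTOCOLS = {"tcp", "udp", "icmp", "icmpv6"}


def parse_entry(line):
    """Extract protocol, original-direction tuple and offload state from one entry."""
    fields = line.split()
    # The protocol name is in the first few columns; position varies by source.
    proto = next((f for f in fields[:3] if f in L4_PROTOCOLS), "unknown")
    original = {}
    for key, value in KEY_VALUE.findall(line):
        # Keys repeat for the reply direction; keep only the first (original) one.
        original.setdefault(key, value)
    flags = set(FLAG.findall(line))
    if "HW_OFFLOAD" in flags:
        offload = "hardware"
    elif "OFFLOAD" in flags:
        offload = "software"
    else:
        offload = "none"
    return {
        "proto": proto,
        "src": original.get("src"),
        "dst": original.get("dst"),
        "dport": original.get("dport"),
        "offload": offload,
    }


def udp_flows_to_port(lines, port):
    """Return parsed UDP entries whose original destination port matches."""
    entries = (parse_entry(line) for line in lines if line.strip())
    return [e for e in entries if e["proto"] == "udp" and e["dport"] == str(port)]


# Synthetic sample input (not captured from a real router).
SAMPLE = """\
ipv4     2 udp      17 src=192.168.1.50 dst=198.51.100.7 sport=40000 dport=51820 packets=120 bytes=18000 src=198.51.100.7 dst=203.0.113.5 sport=51820 dport=40000 packets=110 bytes=16000 [HW_OFFLOAD] mark=0 zone=0 use=3
ipv4     2 udp      17 118 src=192.168.1.51 dst=198.51.100.7 sport=40001 dport=51820 packets=9 bytes=1300 src=198.51.100.7 dst=203.0.113.5 sport=51820 dport=40001 packets=8 bytes=1200 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 7400 ESTABLISHED src=192.168.1.50 dst=198.51.100.9 sport=50000 dport=443 packets=5 bytes=600 src=198.51.100.9 dst=203.0.113.5 sport=443 dport=50000 packets=4 bytes=500 [ASSURED] mark=0 zone=0 use=2
"""

if __name__ == "__main__":
    for flow in udp_flows_to_port(SAMPLE.splitlines(), 51820):
        print(flow["src"], "->", flow["dst"], "offload:", flow["offload"])
```

If the tunnel's entry still shows as hardware-offloaded after the phone has moved to the other AP and the tunnel is dead, that would be a useful detail for a bug report. This is only a guess at where to look, not a confirmed diagnosis.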

I really hope this is the culprit, but I don't think it is. I have been reading your experiments trying to solve this problem across three threads because I'm facing the same problem with WireGuard connections dying.

The reason I don't think hardware flow offloading is the culprit is that I have the same experience even on hardware that doesn't support hardware flow offloading.

My story:

Previously I had an ISP-provided router and an OpenWrt dumb AP connected by ethernet behind the ISP router. The OpenWrt AP is an ASUS RT-AC1300UHP (ipq401x), which doesn't support hardware flow offloading. The WireGuard problem was present in this setup. I thought the reason was that roaming was not seamless: FT is not configurable on the ISP router, so 802.11r wasn't working between the devices in this setup.

So I went and bought another router, an ASUS RT-AX53U (mt7621at), flashed OpenWrt and enabled hardware flow offloading. With this new router and the RT-AC1300UHP as dumb AP, I have 802.11r set up. The WireGuard issue seems to have improved (it happens less frequently, though I'm not sure if that's placebo), but it's not completely eradicated.

I gotta say, I learned a lot from these threads you started/commented on, I'm very grateful for that.

Hey, it's great that you find my threads helpful. I've been posting these both as self-documentation of what I did in response to coming across these issues and for other, more knowledgeable people to possibly be able to figure out and fix them.

I think you may've gotten a little detail about offloading wrong here: it does not affect dumb APs at all. Both software and hardware flow offloading are there to make NAT more efficient and not hit the anemic processors in consumer networking gear as hard. So whether you enable offloading or not, the dumb AP's behavior remains the same so long as it's only doing L2 packet switching, and not L3 packet routing between your LAN and the outside WAN.

My take is that your original ISP router was exhibiting similar behavior to my OWrt router with hardware offloading enabled. See, my RT3200 is based on the mt7622 platform, as opposed to your RT-AX53U's mt7621at. Meaning, the behavior that I observed could be blamed on the specific hardware offloading implementation for mt7622.

What happened to you could probably be blamed entirely on the ISP router, and has nothing to do with OpenWrt at all.

What model was your ISP router, and do you know the SoC/platform it's based on? Maybe it's at least tangentially related to my case?

Yes, I know that. My point was that even without hardware offloading the problem is there. But now that you've reminded me: although the ISP router did not have a switch for hardware offloading, it might be hardcoded to do it. So it's hard to say whether there was hardware offloading in my previous setup. I previously assumed that the ISP router does not support hardware offloading, which I now realize is an assumption based on nothing. So I might be very wrong about that.

I have tried two ISP routers. One is a Kaonmedia AR2140 with quite beefy specs, but I can't find information about the SoC other than 3 ARMv7 cores, 512 MB RAM and 256 MB flash.

The other ISP router is a TP-Link Archer C1200 with a BCM47189, 128 MB RAM and 16 MB flash. The TP-Link has a NAT boost switch in the web UI, but I'm not sure if it's hardware or software offloading.

I'm currently testing whether turning hardware offloading off on the RT-AX53U (which is doing the NAT) helps.

I assume that since your post from 23 days ago you can now confirm that hardware offloading is the culprit in your case for sure? It would be a bummer if hardware offloading is also what's causing the problem for me, since it's such a good feature of this SoC.

Yes, that's what I meant. With OpenWrt acting only as an AP, OWrt's offloading features had no bearing on the results you were observing. That means the problem necessarily originated from your ISP router, which is essentially a black box to us regular users.

AFAIU, SoCs designed for consumer routers generally do have hardware acceleration. OpenWrt can only use it on a limited set of platforms for now, like your mt7621at. I think it's safe to assume that your ISP router was using offloading, and now I wonder if your WireGuard problem would have gone away if you had disabled NAT boost. Do you remember if it was enabled when you faced this issue?

Yes, just toggling hardware offloading triggers this issue. Thankfully, software offloading doesn't cause any problems for me.
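For anyone who wants to repeat the comparison from the command line instead of LuCI, here is a small sketch that prints the uci commands for the three modes (off, software, hardware). The two options, flow_offloading and flow_offloading_hw in the firewall defaults section, are the documented OpenWrt firewall settings; the helper and mode names are just my own, and hardware mode is assumed to need the software option enabled as well.

```python
# Mode name -> (flow_offloading, flow_offloading_hw) values.
MODES = {
    "off": ("0", "0"),
    "software": ("1", "0"),
    "hardware": ("1", "1"),  # assumed: hardware offloading also needs flow_offloading=1
}


def offload_commands(mode):
    """Return the shell commands to run on the router for the given mode."""
    software, hardware = MODES[mode]
    return [
        f"uci set firewall.@defaults[0].flow_offloading='{software}'",
        f"uci set firewall.@defaults[0].flow_offloading_hw='{hardware}'",
        "uci commit firewall",
        "/etc/init.d/firewall restart",
    ]


if __name__ == "__main__":
    for mode in MODES:
        print(f"# {mode}")
        print("\n".join(offload_commands(mode)))
```

Running each set on the router and then roaming between APs with the tunnel up should show which mode breaks WireGuard.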

I tried to look for technical details on the BCM47189, but I found no reference to its NAT accelerator, so that seems to be a dead end.

https://openwrt.org/docs/techref/hardware/soc/soc.broadcom.bcm47xx#ethernet

Broadcom chipsets definitely have some sort of acceleration implemented as a proprietary kernel module. That rules out any connection to the OpenWrt hardware offloading implemented for my chipset...

Yes, on that ISP router I had NAT boost on all the time, since I hadn't realized it could be causing the problem until I found your threads.

Now, after more than 24 hours with hardware offloading off on the mt7621at router, I haven't seen the WireGuard problem resurface. I think it's safe to say that my case is the same as yours: hardware offloading is what's breaking the WireGuard connection.

For documentation, my setup is ISP fibre termination box -ethernet cable, to wan-> ASUS RT-AX53U (mt7621at, PPPoE, NAT) -ethernet cable, lan to lan-> ASUS RT-AC1300UHP (ipq401x, dumb AP).

Wireguard connection breaks when the NAT router (mt7621at) has hardware offloading on and devices roam between both ASUS APs.
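To put numbers on "breaks", one option on a client where the wg tool is available (a Linux laptop, say, not the Android app) is to check handshake ages. `wg show <interface> latest-handshakes` prints one line per peer with the public key and the Unix time of the last handshake. With traffic flowing, WireGuard renegotiates roughly every two minutes, so a handshake older than about 180 seconds while you're actively using the tunnel suggests it has stalled. The sketch below only parses that output; the function name, the threshold default and the sample keys are my own placeholders.

```python
def stale_peers(output, now, max_age=180):
    """Return (peer, age_seconds) for peers whose last handshake is too old.

    `output` is the text printed by `wg show <interface> latest-handshakes`.
    A timestamp of 0 means no handshake has happened yet; age is then None.
    """
    stale = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        peer, timestamp = parts[0], int(parts[1])
        age = None if timestamp == 0 else now - timestamp
        if age is None or age > max_age:
            stale.append((peer, age))
    return stale


if __name__ == "__main__":
    # Placeholder keys and timestamps, not real data.
    sample = "PEERKEYA=\t1000\nPEERKEYB=\t1290\nPEERKEYC=\t0\n"
    for peer, age in stale_peers(sample, now=1300):
        print(peer, "last handshake:", "never" if age is None else f"{age}s ago")
```

Note that an idle tunnel without persistent keepalive can legitimately show an old handshake, so this is only meaningful while something is sending traffic through it.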

Thank you, I think we can now confirm the following:

  • this bug is not specific to my device/config
  • applies to OpenWrt's generic hardware offloading implementation rather than being platform-specific

I think this merits a bug report. Also, while I don't think it's a factor, do you want to list what kind of WG service you're using? Like I mentioned in my first thread, I faced and tested the problem with free tiers of Cloudflare's WARP and Proton VPN services.