GL-iNet, Wireguard, and dnsmasq-full LAN host resolution weirdness

Edit

I first mistakenly posted this to the dd-wrt forum, because honestly until recently I hadn't noticed that they were different projects. That shows how much I know about them. Anyway, I referenced dd-wrt because I didn't clean up the copy/paste; since it's already been noted in a response, I'm going to do strike-through corrections.

So, this is my hail-mary. I understand that this is a difficult configuration to answer questions about; regardless, I'm hoping someone will say: "oh, that problem" and have an answer.

BLUF: 2-3 times a day, LAN host name resolution will stop working. I have to log into the server and run a command, after which everything works for a while. This always happens overnight; it usually happens at some point during the day, as well.

I have a GL-iNet AX1800 running in router mode; it's running whatever customized OpenWRT GL-iNet puts on their routers -- I have not reflashed this. The router claims it's running OpenWrt 21.02-SNAPSHOT r16399+173-c67509efd7. I have it configured to connect to Mullvad via Wireguard; I'm excluding a small group of IPs from the VPN for other work-related VPN connections. Although I know almost nothing about OpenWRT, I've poked around in the shell to add some LAN cnames.

For a long while, I struggled to get named LAN hosts resolved by the router consistently: sometimes dig sting.lan would work, other times not, and some hosts would just never resolve. During all of this, I think I installed the dnsmasq-full package -- replacing the default dnsmasq -- in any case, dnsmasq-full is what's currently installed. The end result is that I got LAN hosts to reliably resolve. Some time (a few weeks? and maybe a firmware update) passed.

With that explained, to my issue: LAN host resolution now sporadically, but reliably, stops working. It stops working overnight, every night, and then 1-3 times during the day. I've tracked it down enough to know a minimum command to run to fix it, but I don't know why it works. I'm also concerned by the number of dnsmasq instances that are running. WAN resolution never stops working.

When LAN resolution starts failing, my work-around is to ssh into the server and run /etc/init.d/vpnpolicy-apply restart. This may be a GL-iNet script, but it's only a few lines long. What it does is:

  • Sets $mode to uci -q get vpnpolicy.route_policy.proxy_mode
  • Based on the value of $mode, runs either /usr/bin/vpn_domain_update.sh, or /usr/bin/route_policy $mode, or both.
  • In my case, I know that vpnpolicy.route_policy.proxy_mode is "3", because this is the only value that causes both scripts to be run, and I know from tracing that both are being executed.

One other thing I've noticed is that I have four (4) dnsmasq instances running at once, which seems suspicious: two pairs, each pair with identical arguments:

 5668 root      2704 S    /usr/sbin/dnsmasq -C /etc/dnsmasq.conf.vpn -x /var/run/dnsmasq/dnsmasq.vpn.pid --server=193.138.219.228 --no-resolv
 5669 root      2676 S    /usr/sbin/dnsmasq -C /etc/dnsmasq.conf.vpn -x /var/run/dnsmasq/dnsmasq.vpn.pid --server=193.138.219.228 --no-resolv
 6002 dnsmasq   2724 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
 6007 root      2692 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid

The configurations for the two pairs are different, although both have the same dhcp-leasefile value, and that file does indeed contain the LAN leases and host names. And -- regardless -- even with all 4 processes running, it works... until it doesn't. I know for a fact that the non-vpn config versions are being run by the /usr/bin/route_policy script; I don't know what's starting the vpn-config versions, although I suspect /usr/bin/vpn_domain_update.sh. I do know that only the vpn process(es) are necessary for all domain resolution to work.
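
For anyone wanting to compare notes, here is a minimal sketch for tallying dnsmasq processes by the config file passed with -C. The function name count_dnsmasq_by_config is made up for illustration, and it assumes ps prints the full command line (on BusyBox, ps w).

```shell
# Count dnsmasq processes per config file (the argument after -C).
# Reads `ps` output on stdin and prints "<count> <config path>" lines.
count_dnsmasq_by_config() {
    awk '
        /dnsmasq/ {
            for (i = 1; i < NF; i++)
                if ($i == "-C") count[$(i + 1)]++
        }
        END { for (c in count) print count[c], c }
    ' | sort -k2
}

# On the router (assumed invocation):
#   ps w | count_dnsmasq_by_config
```

Run against the listing above, this should report two processes for each of the two config files, which makes it easy to spot when a pair goes missing or a third copy appears.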

The issue isn't transient. It works until it doesn't, and then continues not working until I run the service vpnpolicy-apply restart command.

Things I've tried/considered:

  • I've tried uninstalling dnsmasq-full, but that just breaks all client internal and external DNS resolution. I haven't tried uninstalling dnsmasq-full and installing dnsmasq; I don't get the feeling that the problem is in the dnsmasq-full package, and I do have a vague feeling that it was installing -full that caused LAN host resolution to start working.
  • I've tried stopping and disabling the dnsmasq service. Indeed, it kills the non-vpn-config pair, and both LAN and WAN DNS resolution continues to work without them. However, it doesn't prevent the issue occurring, and it just gets started back up by /usr/bin/route_policy when I run service vpnpolicy-apply restart.
  • I've renamed /etc/init.d/dnsmasq. This causes /usr/bin/route_policy to complain, does prevent the second set of dnsmasq instances from running, and it leaves LAN/WAN DNS resolution in a working state -- but it's obviously not a long-term solution nor does it tell me what I'm doing wrong.
  • I've considered just running the damned service vpnpolicy-apply restart command every hour via a cron job, but that's such a horrible OPS-ey solution, I'd really rather figure out what I've got wrong than do that.
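
A slightly less blunt variant of the cron idea would be to probe first and only restart when resolution has actually failed, which also leaves a timestamped trail of when it breaks. The sketch below is an illustration under assumptions, not a GL-iNet tool: the function names and the PROBE_HOST, RESOLVER, CHECK_CMD, and FIX_CMD variables are hypothetical, and it assumes the router's nslookup exits non-zero when a name does not resolve, which is worth verifying by hand first.

```shell
#!/bin/sh
# Hypothetical defaults; override via the environment.
PROBE_HOST="${PROBE_HOST:-sting.lan}"        # a LAN name that should always resolve
RESOLVER="${RESOLVER:-127.0.0.1}"            # ask the router's own dnsmasq
CHECK_CMD="${CHECK_CMD:-nslookup}"           # assumed to exit non-zero on failure
FIX_CMD="${FIX_CMD:-/etc/init.d/vpnpolicy-apply restart}"

# Succeeds if the probe host resolves via the chosen resolver.
lan_dns_ok() {
    "$CHECK_CMD" "$PROBE_HOST" "$RESOLVER" >/dev/null 2>&1
}

# Run the fix only when the probe fails; print what happened.
dns_watchdog() {
    if lan_dns_ok; then
        echo "LAN DNS ok: $PROBE_HOST"
    else
        echo "LAN DNS failed for $PROBE_HOST; running: $FIX_CMD"
        $FIX_CMD
    fi
}

# To use from cron, end the script with a call to: dns_watchdog
```

A crontab entry along the lines of `*/5 * * * * /root/dns_watchdog.sh 2>&1 | logger -t dns_watchdog` (the script path is an assumption) would then put each failure time into the system log, which might help correlate the breakage with lease renewals or VPN reconnects.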

I know that GL-iNet isn't "pure" OpenWRT, and that it's a long shot; does anyone see anything in what I've posted that looks obviously misconfigured, or have any suggestions for what I could try to get DNS LAN resolution consistently working?

Thanks,

It appears you are using firmware that is not from the official OpenWrt project.

When using forks/offshoots/vendor-specific builds that are "based on OpenWrt", there may be many differences compared to the official versions (hosted by OpenWrt.org). Some of these customizations may fundamentally change the way that OpenWrt works. You might need help from people with specific/specialized knowledge about the firmware you are using, so it is possible that advice you get here may not be useful.

You may find that the best options are:

  1. Install an official version of OpenWrt, if your device is supported (see https://firmware-selector.openwrt.org).
  2. Ask for help from the maintainer(s) or user community of the specific firmware that you are using.
  3. Provide the source code for the firmware so that users on this forum can understand how your firmware works (OpenWrt forum users are volunteers, so somebody might look at the code if they have time and are interested in your issue).

If you believe that this specific issue is common to generic/official OpenWrt and/or the maintainers of your build have indicated as such, please feel free to clarify.

In addition, you mentioned DD-WRT a few times - DD-WRT is an entirely different project.

Yeah, the DD-WRT thing is because I at first posted this to DD-WRT, when it's clearly OpenWRT. During the copy/paste, I failed to clean up the references. It's not relevant to the details, since clearly the router itself is claiming to be based on an OpenWRT build -- it's a confusing PEBKAC error for which I apologize.

And, yes. I thought it'd be a long shot asking here; I've read elsewhere that GL-iNet uses a customized version of OpenWRT; while they do allow access to the LuCI GUI with only a couple of clicks and a disclaimer, it's certainly not vanilla.

I think I'm in a hot-potato situation, where I've heated the potato myself. GL-iNet is an unlikely source of help, since I've been tinkering. And it's not vanilla OpenWRT, so y'all don't want to handle the potato, either.

I'm unlikely to flash a vanilla OpenWRT -- no shade to LuCI, but the GL-iNet UI is far more pleasant to work with. I might do a hard-reset back to GL-iNet's original firmware and go through the entire reconfiguration process (VPN, VPN bypass, LAN cnames, etc. etc), but I feel as if I'm just going to end up back where I started, where none of the LAN DNS hosts reliably resolve. Which will lead to tinkering, and probably back here.

Frankly, I was hoping for some general confirmation/denial, like:

  • Oh, no, having several dnsmasq's running at the same time is perfectly normal
  • OpenWRT renews leases every X hours, and this might be triggering a partial reconfiguration of dnsmasq
  • Oh, everybody knows about this problem with dnsmasq, it's so common, all you need to do is X and it'll fix it

Finally, my hopes were largely resting on the suspicion that my issue isn't so much with OpenWRT itself, or the mutant offspring that GL-iNet is using, but rather with dnsmasq. Or, I rather suspect, the interaction between the Wireguard extension/plugin and dnsmasq-full. I think I've created a sort of race condition between the two which I'm not sure how to resolve, and what I don't get is why one of them isn't just outright crashing with a failure to obtain the listening port for DNS. There's nothing suspicious in dmesg, and they're both running -- are they both trying to listen on :5353?
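
One way to answer the port question directly is to list which sockets each dnsmasq PID holds. This is a minimal sketch, assuming the router's netstat supports -p (BusyBox builds often do, but that is an assumption here) and prints the usual column layout with the local address in the fourth column; the function name dnsmasq_listen_ports is hypothetical.

```shell
# Reduce `netstat -tulnp` output (on stdin) to "<proto> <port> <pid/program>"
# for every socket owned by a dnsmasq process.
dnsmasq_listen_ports() {
    awk '
        $NF ~ /dnsmasq/ {
            n = split($4, parts, ":")    # local address, e.g. 127.0.0.1:5353 or :::53
            print $1, parts[n], $NF
        }
    ' | sort -u
}

# On the router (assumed invocation):
#   netstat -tulnp | dnsmasq_listen_ports
```

If the two config pairs show up on different ports or different addresses, that would explain why neither crashes on startup; if a pair shows no listening socket at all, that would be worth mentioning in a follow-up.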

GL.iNet uses a strange hybrid of vanilla OpenWrt + the SoC vendor SDK + self-developed packages (wireguard/dnsmasq etc).
I strongly advise you to flash vanilla OpenWrt, because the GL.iNet fork has a lot, and I mean a lot, of quirks.

The issue is not OpenWrt vs DD-WRT; the issue is OpenWrt vs the modified version of OpenWrt issued by GL.iNET.

Your router is running a piece of software known by GL.iNET and nobody else. Either you move to vanilla OpenWrt, or only they will be able to help you.

Ok. I see the comments about flashing to base OpenWRT, and while I appreciate the perspective, I'm disinclined to give up the excellent mobile app and the much more comprehensible web interface.

That said -- and this is now more in case someone else stumbles on this thread -- I've tracked the work-around further. Eventually, I'll be able to reverse engineer what must be going wrong.

Now I know that the only command needed to "fix" resolution is calling /usr/bin/route_policy 3. This looks to be a GL-iNet script, as the only reference to it I can find is in this GL-iNet forum comment (which is where I'll take this question next). While the thread doesn't appear related to my issue, it's a tenuous lead.

What that script is doing (with the argument "3" -- my route policy mode) is:

  1. Clearing and re-configuring the router firewall (likely irrelevant to my issue)
  2. Flushing four ipsets with this series of commands: (possibly relevant)
    ipset flush via_vpn_domain
    ipset flush bypass_vpn_domain
    ipset flush via_vpn_mac
    ipset flush bypass_vpn_mac
    
  3. Running a fairly long shell function that mucks around with dnsmasq (very likely relevant)
  4. Reloading the firewall (not relevant)
  5. Running a brief script that fusses around with conntrack, which means nothing to me.

My current guess is that what fixes my issue are steps 3 and/or 2. And maybe 5, because conntrack seems to be related to dnsmasq.
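
To narrow that guess down, one option is to run a single candidate step while resolution is broken and then check whether it came back. The sketch below only covers step 2, since those are the only commands known verbatim from the script; steps 3 and 5 live inside /usr/bin/route_policy and would have to be extracted from it. The names resolves, try_step, and flush_ipsets, and the PROBE_HOST variable, are hypothetical, and it assumes nslookup exits non-zero on a failed lookup.

```shell
PROBE_HOST="${PROBE_HOST:-sting.lan}"   # hypothetical: any LAN name that normally resolves

# Succeeds if the probe host resolves via the router's own resolver.
resolves() {
    nslookup "$PROBE_HOST" 127.0.0.1 >/dev/null 2>&1
}

# Run one candidate fix, then report whether LAN resolution is working.
try_step() {
    label=$1
    shift
    "$@"
    if resolves; then
        echo "$label: fixed"
    else
        echo "$label: still broken"
    fi
}

# Step 2 from the list above: flush the four ipsets.
flush_ipsets() {
    for set_name in via_vpn_domain bypass_vpn_domain via_vpn_mac bypass_vpn_mac; do
        ipset flush "$set_name"
    done
}

# While resolution is broken (assumed invocation):
#   try_step "ipset flush" flush_ipsets
```

If the flush alone reports "fixed", the dnsmasq function in step 3 is probably not the culprit; if it reports "still broken", running /usr/bin/route_policy 3 afterwards confirms the usual fix still works.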

Through tracing, I can see this script actually fork dnsmasq twice, so maybe it is indeed on purpose.

Again, if I get an answer from the GL-iNet forum or track down what's causing the failure, I'll close the loop here for future searchers. Regardless of whether it's something specific to GL-iNet or OpenWRT, providing answers is important.