After hours or days, my internet connection is lost, want to see why

Using a WRX-36, I see that after some hours or days, I lose my internet connection.
I first thought WireGuard was to blame, but when I disable WireGuard (and change the firewall to route traffic from wan directly to lan), the internet connection does not immediately come back. I need to reboot.
I checked the log, and the following is reported:

Thu Mar 16 11:10:32 2023 daemon.warn odhcpd[1933]: No default route present, overriding ra_lifetime!
Thu Mar 16 11:13:51 2023 daemon.warn dnsmasq[1]: possible DNS-rebind attack detected: dns.msftncsi.com
Thu Mar 16 11:14:55 2023 daemon.warn odhcpd[1933]: No default route present, overriding ra_lifetime!
Thu Mar 16 11:15:07 2023 daemon.notice netifd: wan (3574): udhcpc: sending renew to server -----------
Thu Mar 16 11:16:10 2023 daemon.warn dnsmasq[1]: Maximum number of concurrent DNS queries reached (max: 150)
Thu Mar 16 11:16:16 2023 daemon.warn dnsmasq[1]: Maximum number of concurrent DNS queries reached (max: 150)
Thu Mar 16 11:16:22 2023 daemon.warn dnsmasq[1]: Maximum number of concurrent DNS queries reached (max: 150)
Thu Mar 16 11:16:29 2023 daemon.warn dnsmasq[1]: Maximum number of concurrent DNS queries reached (max: 150)
Thu Mar 16 11:16:43 2023 daemon.warn dnsmasq[1]: Maximum number of concurrent DNS queries reached (max: 150)
Thu Mar 16 11:16:49 2023 daemon.warn dnsmasq[1]: Maximum number of concurrent DNS queries reached (max: 150)
Thu Mar 16 11:16:58 2023 daemon.warn dnsmasq[1]: Maximum number of concurrent DNS queries reached (max: 150)
Thu Mar 16 11:17:01 2023 daemon.info dnsmasq[1]: reading /tmp/resolv.conf.d/resolv.conf.auto
Thu Mar 16 11:17:01 2023 daemon.info dnsmasq[1]: using nameserver -----------
Thu Mar 16 11:17:01 2023 daemon.info dnsmasq[1]: using nameserver -----------
Thu Mar 16 11:17:01 2023 daemon.info dnsmasq[1]: using only locally-known addresses for test
Thu Mar 16 11:17:01 2023 daemon.info dnsmasq[1]: using only locally-known addresses for onion
Thu Mar 16 11:17:01 2023 daemon.info dnsmasq[1]: using only locally-known addresses for localhost
Thu Mar 16 11:17:01 2023 daemon.info dnsmasq[1]: using only locally-known addresses for local
Thu Mar 16 11:17:01 2023 daemon.info dnsmasq[1]: using only locally-known addresses for invalid
Thu Mar 16 11:17:01 2023 daemon.info dnsmasq[1]: using only locally-known addresses for bind
Thu Mar 16 11:17:01 2023 daemon.info dnsmasq[1]: using only locally-known addresses for lan

(I removed the nameserver IPs.)
(The internet stopped working at approximately 11:14-11:16.)

I am using a WRX-36 as the main router, with two more acting as dumb APs in an 802.11s mesh network, and I am using WireGuard.

Help would be appreciated! If more info is required, feel free to ask!

Give routing table for both cases, when Internet is up, and down.

Can you please explain where I can find that, and what exactly you need?

Output of route command.
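
If it helps, something like the following could be run over SSH in both states so the two outputs can be compared. This is only a sketch: `has_default_route` is a made-up helper name, and it assumes the table is printed by `ip route show` or `route -n`.

```shell
# Hypothetical helper: succeeds if the routing table read from stdin
# contains a default route ("default ..." from `ip route show`,
# or a "0.0.0.0 ..." destination from `route -n`).
has_default_route() {
    grep -Eq '^(default|0\.0\.0\.0)[[:space:]]'
}

# On the router (not run here):
#   ip route show | has_default_route && echo "default route present"
```

If the check fails while the internet is down, the default route is missing, which would match the odhcpd "No default route present" warnings in the log.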

What is your ISP connection?

I have a 500 Mbit/s connection.
I don't know if I should censor the IP addresses this gives, so only the local ones are visible:

This is with internet working.

What is WAN connection of router?
uci export network

In the image below, note that all the surfshark.com.conf connections are no longer valid; I stopped my subscription there. Also note that if the internet stops working and I route traffic directly from wan to lan, it still does not work.

WAN is DHCP-client. It may depend on DHCP-server of ISP. Can you specify static IP?

No, internet providers in the EU mostly do not allow static IPs.
My provider does not, unless I am a business user.

There are multiple potential things happening here, so it's important to separate them out:

  1. In general, if you have a 'kill-switch' configured via the firewall and your wireguard tunnel (or any VPN) goes down, your firewall will not allow traffic to flow. So, make sure that you have the firewall configured appropriately (i.e. normally lan > wan forwarding is allowed unless you have removed that).
  2. When WG is enabled, if you're sending all traffic through the tunnel (i.e. allowed_ips 0.0.0.0/0), it rewrites the default route on the system. When it is disabled, that is not reverted. Therefore, you can add a metric to your outbound interfaces (i.e. wan, wg) so that the routes stay in the table.
  3. If your WG tunnel is going down because one side or the other has changed IP addresses (often due to a DHCP lease renewal), there are some scripts to detect the condition and bounce the WG interface so that it resolves again.
  4. If, by chance, your DHCP lease on your wan fails to renew at all, you'd obviously lose connectivity until you reboot the router and hopefully get a new IP.

I think #2 is the solution to your issue, as described, with the others as additional things you can look into.

Thanks for your thoughts!
If 1 were true, shouldn't I have gotten internet back as soon as I disabled WireGuard AND the firewall rule preventing traffic from flowing outside the WireGuard interface? When the internet fails, I do the following: disable the WireGuard interface and turn off its "enable during startup" option, then change the firewall setting from wan>WG to WAN>LAN. After that I still don't get internet back.

I find point 3 really interesting. Maybe the issue is indeed that I'm being assigned a new IP, and that is not working nicely with the current configuration. What scripts are you referring to? I would love to implement solutions to this.

As for point 2, I don't think I understand it... But I would love to see if I can fix that as well. How would that be done?

Many thanks!

A common misconception is that the firewall and routes are effectively the same. A good way to separate them is to think of the firewall as a permissions structure that allows or prohibits traffic flow, whereas routes are the actual paths that the traffic can take. They must obviously align for traffic to flow, but they are two different things.

To make an example of this, let's consider the simplest home router situation: LAN > WAN traffic. If it is not flowing, we could have the following potential causes:

  • The firewall does not allow it
  • There is an incorrect route (that attempts to send traffic to another gateway, rather than via the gateway on the wan)
  • There is no default route (maybe because the wan itself is actually down).

If the firewall is corrected to allow lan > wan forwarding, but one of the two other issues persists, traffic will still not flow, even though it has appropriate allowances to do so. Conversely, if the other two are fixed but the firewall does not allow lan > wan forwarding, traffic could flow as far as the routing infrastructure is concerned, but it is forbidden from doing so by the firewall.

In #2, I described how the default route gets replaced by WG and is not re-established when WG is disabled. A restart of the WAN interface can correct that (as does a restart of the router as a whole), but adding a metric to the wan will keep that route from being erased, as well.
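
As a rough sketch of what adding a metric could look like from the command line: `set_iface_metric` is a hypothetical helper name, and the value 10 is only an example, not a recommendation. When several routes match, the one with the lowest metric is preferred.

```shell
# Hypothetical helper: set a route metric on an interface via UCI.
# $1 = interface section name in /etc/config/network, $2 = metric value.
set_iface_metric() {
    uci set "network.$1.metric=$2" && uci commit network
}

# On the router (example values only, not run here):
#   set_iface_metric wan 10
#   /etc/init.d/network restart
```

The same result can be had by adding an `option metric` line to the interface section in /etc/config/network by hand.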

Putting it all together, 1 and 2 can both be problematic at the same time. Fixing 1 alone doesn't fix the problem with 2.

Your log should hopefully have something to offer here:

logread -e udhcpc

If you see different IP addresses from one lease to the next, that should provide some confirmation. Further, you can see if the timing lines up based on the DHCP lease times (which are expressed in seconds, so 86400 = 24 hours).
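
A small sketch for reading that output, assuming udhcpc logs its leases as "lease of <ip> obtained" (the pattern may need adjusting if your log looks different). Both function names are made up for illustration.

```shell
# Convert a DHCP lease time in seconds to whole hours.
lease_hours() {
    echo $(( $1 / 3600 ))
}

# Print the leased IP from each "lease of <ip> obtained" line on stdin,
# collapsing consecutive repeats so that only address changes remain.
# Assumes that message format; it is not guaranteed for every build.
lease_ips() {
    sed -n 's/.*lease of \([0-9.][0-9.]*\) obtained.*/\1/p' | uniq
}

# On the router (not run here):
#   logread -e udhcpc | lease_ips
```

If `lease_ips` prints more than one address, the WAN IP changed between leases during the period the log covers.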

See this thread:

Thanks! Not being super knowledgeable, I arrived at the following page through the thread you linked:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=20c4819c7baf6f9b91420849caf30e5137bd75d6
I see no root file in the crontabs directory when I browse my router through SCP.
Should I create a file in there with "echo '* * * * * /usr/bin/wireguard_watchdog'"? Also, what does persistent_keepalive refer to, and where can I set that value, as it is required according to that link? Or is this commit already included in OpenWrt by default, since the file /usr/bin/wireguard_watchdog already exists?

You have now said twice, "Therefore, you can add a metric to your outbound interfaces (i.e. wan, wg) so that the routes stay in the table." But what metric would that be?

EDIT: I have found the keepalive setting in the peer of the WireGuard interface, and I saw that the cron job is automatically added to the crontabs folder as a "root" file. Hopefully this will make things work better. Thanks a lot for your time and assistance, I very much appreciate it! :pray:
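
For anyone doing the cron step by hand, a sketch of how it could look. `add_watchdog_cron` is a made-up helper, and `wg0` in the comment is a placeholder for the real interface name; it only adds the line if it is not already there.

```shell
# Hypothetical helper: append the watchdog cron entry to a crontab file
# (default /etc/crontabs/root) unless that exact line already exists.
add_watchdog_cron() {
    file="${1:-/etc/crontabs/root}"
    line='* * * * * /usr/bin/wireguard_watchdog'
    grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# On the router (not run here; 'wg0' is a placeholder interface name):
#   add_watchdog_cron
#   /etc/init.d/cron restart
#   uci set network.@wireguard_wg0[0].persistent_keepalive='25'
#   uci commit network
```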

Today's internet links quite aggressively to external resources. At least on devices with a sufficient amount of RAM, it makes sense to raise dnsforwardmax (and cachesize) beyond the default of 150, e.g. on my 4 GB RAM device I'm using:

        option dnsforwardmax '1000'
        option cachesize '1000'

(I didn't think about it too hard: 150 wasn't enough, 1000 looked like a sensible improvement, and I have enough RAM to play with.)

This option wasn't present in the config dnsmasq section of /etc/config/dhcp, so I added it. Should that do the trick?
I don't have 4 GB of RAM, so I increased it to 320 for now.
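
One way to apply and check the setting from the command line, as a sketch. `set_dnsmasq_limits` is a hypothetical helper, it assumes the first dnsmasq section in /etc/config/dhcp is the one in use, and 320 is just the value mentioned above.

```shell
# Hypothetical helper: set dnsmasq's concurrent-query limit and cache size
# on the first dnsmasq section of /etc/config/dhcp.
# $1 = dnsforwardmax, $2 = cachesize.
set_dnsmasq_limits() {
    uci set "dhcp.@dnsmasq[0].dnsforwardmax=$1" &&
    uci set "dhcp.@dnsmasq[0].cachesize=$2" &&
    uci commit dhcp
}

# On the router (not run here):
#   set_dnsmasq_limits 320 320
#   /etc/init.d/dnsmasq restart
#   uci get dhcp.@dnsmasq[0].dnsforwardmax
```

After the restart, the "Maximum number of concurrent DNS queries reached" warning in the log should quote the new maximum if the option was picked up.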