I've been a long-time user of mwan3, but as my connectivity has improved, my needs have simplified. I no longer need the complex traffic balancing and policy-based routing, and would prefer a failover solution that works well with other tools like BanIP.
Until a link fails, most of my needs are solved by setting interface metrics in order of priority. However, when a link goes down, the interface itself rarely goes offline, and given the DHCP lease times from my upstream ISPs, recovering from a failure usually requires some manual intervention.
To address the issue, I spent a few hours this last week crafting something between mwan3 and the static setups. The script blends some of the approaches from mwan3, notably monitoring gateways and external IP addresses via reachability tests, with the simplicity of the Linux routing table. When a link fails, it's demoted to a lower priority and connection tracking is reset. Traffic immediately swaps to a new outbound interface.
I have more work to do to package this fully for OpenWRT, but would appreciate some commentary before I go much further. As of today, it runs via cron locally and has survived much of my local testing. Failover to a different link is nearly immediate.
Are there ways to simplify this approach (aside from being better at writing Lua)?
I believe that you have captured what it takes to set up failover for IPv4. Yet, my opinion also is that such probing and demotion of interface metrics, as well as announcing only the "best" IPv6 upstream, should be core features of OpenWrt and not bolted on (and fighting with DHCP lease renewals).
Connectivity checking is a necessary feature for a router and it needs to be done at Layers 1 and 2. For an Ethernet uplink you don't check random internet addresses; you probe the default gateway. You don't have to do this on Layer 3 or 4 using ICMP; looking at ARP is good enough.
If ARP resolution comes back negative, the link is down. ARP traffic is a regular feature of any IPv4-over-Ethernet link, so it's not generating "unnecessary traffic". IPv6 features neighbor discovery for the exact same purpose.
The router software just needs to be aware of it. For a DHCP interface, if the default gateway is gone, it needs to go back into DHCP discovery instead of just being stuck offline until the renewal interval expires.
The output of ip neigh show delivers all necessary information for both IPv4 and IPv6:
192.168.1.1 dev br-lan used 0/0/0 probes 6 FAILED
192.168.16.1 dev eth1 lladdr xx:xx:xx:xx:xx:xx ref 1 used 0/0/0 probes 1 REACHABLE
fe80::xxxx:xxxx:xxxx:xxxx dev eth1 lladdr xx:xx:xx:xx:xx:xx router ref 1 used 0/0/0 probes 1 REACHABLE
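As a rough illustration of how a script could read that output, here is a minimal Python sketch. It assumes each line starts with the neighbor address, names the interface after a `dev` token, and ends with the state. The function names `parse_neigh_line` and `gateway_is_down` are made up for this example and are not part of any OpenWrt tool.

```python
# Neighbor states as printed by iproute2's `ip neigh show`.
NUD_STATES = {"PERMANENT", "NOARP", "REACHABLE", "STALE", "NONE",
              "INCOMPLETE", "DELAY", "PROBE", "FAILED"}

def parse_neigh_line(line):
    """Return (address, device, state) for one line of `ip neigh show`."""
    tokens = line.split()
    if not tokens:
        return None
    address = tokens[0]
    device = None
    if "dev" in tokens:
        idx = tokens.index("dev")
        if idx + 1 < len(tokens):
            device = tokens[idx + 1]
    state = tokens[-1] if tokens[-1] in NUD_STATES else None
    return address, device, state

def gateway_is_down(lines, gateway, device):
    """True only if the gateway's entry on `device` is explicitly FAILED."""
    for line in lines:
        parsed = parse_neigh_line(line)
        if parsed and parsed[0] == gateway and parsed[1] == device:
            return parsed[2] == "FAILED"
    return False  # no entry at all: unknown, so do not declare the link down

sample = [
    "192.168.1.1 dev br-lan used 0/0/0 probes 6 FAILED",
    "192.168.16.1 dev eth1 lladdr xx:xx:xx:xx:xx:xx ref 1 used 0/0/0 probes 1 REACHABLE",
]
print(gateway_is_down(sample, "192.168.1.1", "br-lan"))  # True
print(gateway_is_down(sample, "192.168.16.1", "eth1"))   # False
```

Treating a missing entry as "unknown" rather than "down" is a design choice of this sketch; a real implementation would also have to decide what to do with STALE or INCOMPLETE entries.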
This doesn't work for those who are stuck behind ISP routers that cannot be set to bridge mode for technical or legal reasons. OpenWrt will always see that they are up and needs to look at least one hop further.
That is correct behavior for passive connectivity monitoring without active probing. More is not possible that way. If the ISP router is up, there is no reason for going back to DHCP discover as the lease is still valid even with the upstream router's WAN connection down.
@patrakov: I agree it would be great for first-class support for a feature like this! I'm happy for folks to upstream any/all of this if they so choose.
@jtsn: I agree that in a perfect world, ip neigh would be enough. Unfortunately @patrakov is right, and the typical failure for me is upstream of the next hop. My primary fails when the backbone fiber is cut by vandals. My backup fails when the microwave tower runs out of power. In both cases the next hop route is reachable, but the broader Internet is not. The only way to know it's really down is probing something outside of the network.
Barring any others who have suggestions on how to improve this system, my next step will be to package this, migrate configuration to a better source, and look at parallelizing the health checks so the script can run more often. I'll update this thread when ready!
For monitoring the connection to the next hop it is. It's something that can be implemented into the default image, because it's completely passive.
Monitoring ISP equipment upstream is a connection-specific configuration and is usually prohibited by ToS. Your ISP should send ICMP messages when the "broader Internet" is unreachable. If they just black-hole your packets, you are out of luck for passive monitoring. This is where custom mwan3 rules come in and what they are intended for.
I have not read your script in detail, but my initial comments:
License: Use a standard named license. It looks similar to the GPL, so consider using the GPL.
Consider configuring multiple ping/HTTP targets. All targets should fail before failover. That way, we won't fail over just because a single target stops replying due to an error on the target side.
Should it be possible to configure a primary interface? (E.g. I have fiber and cellular. Fiber must be primary. If fiber fails, fail over to cellular, but as soon as fiber works again, fail back to fiber.)
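The two suggestions about targets and a primary interface can be sketched in a few lines of Python. This is only an illustration of the logic, not the script's actual code. The names `link_is_down` and `pick_active`, and the target addresses in the example, are assumptions made up for this sketch.

```python
def link_is_down(probe_results):
    """probe_results maps target -> True if that target answered.

    The link counts as down only when every configured target failed.
    With no results at all we make no claim and treat the link as up.
    """
    if not probe_results:
        return False
    return not any(probe_results.values())

def pick_active(links):
    """links: list of (name, probe_results), highest priority first.

    Returns the first link that is not down, so traffic moves back to
    the primary as soon as any of its targets answers again.
    """
    for name, results in links:
        if not link_is_down(results):
            return name
    return None  # every link is down

links = [
    ("fiber", {"192.0.2.1": False, "198.51.100.1": True}),
    ("cellular", {"192.0.2.1": True, "198.51.100.1": True}),
]
print(pick_active(links))  # fiber
```

Because the list is re-evaluated in priority order on every run, fail-back to the primary needs no extra state.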
The current version runs three tests. Two ICMP tests are used for reachability, with the higher latency and loss being used to evaluate health. If this is above the permitted loss threshold, the link is considered down. A third global httping is also run to verify TCP reachability, but I've found that OpenWRT doesn't appear to honor the httping -y command to bind to a specific interface, so that test runs once as informational messaging.
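To make the health evaluation concrete, here is a small Python sketch of the "take the worse of two ICMP tests, then compare loss to a threshold" idea. It is not lifted from the script. The `PingResult` type, the function names, and the 20% default threshold are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class PingResult:
    loss_pct: float    # packet loss in percent, 0 to 100
    latency_ms: float  # average round-trip time in milliseconds

def worst_case(a, b):
    """Combine two tests by keeping the higher loss and higher latency."""
    return PingResult(max(a.loss_pct, b.loss_pct),
                      max(a.latency_ms, b.latency_ms))

def link_healthy(a, b, max_loss_pct=20.0):
    """The link is down when the worst-case loss exceeds the threshold."""
    return worst_case(a, b).loss_pct <= max_loss_pct

gateway_test = PingResult(loss_pct=0.0, latency_ms=4.0)
external_test = PingResult(loss_pct=40.0, latency_ms=85.0)
print(link_healthy(gateway_test, external_test))  # False
```

Using the worse of the two results means a clean gateway test cannot mask loss further upstream.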
Link priority is set by the native OpenWRT interface metric, which is also important to ensure that on reboot your router picks the right interface before the script runs. (Linux uses the lowest metric value to determine which route to use if multiple exist.) The benefit is this ensures we aren't managing an extra configuration layer just for failover.
The script handles failover by assigning a penalty weight to a given route (default metric + 1000) before flushing connection tracking. When that happens, traffic moves to the next route with the lowest metric. In the (hopefully) unusual circumstance all of your links are down and never had a route metric set, the script goes in order specified in the command line. The minute links come back up, they will automatically revert to OpenWRT's configured metric.
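For anyone who wants to see the metric arithmetic spelled out, here is a Python sketch of the penalty scheme. It only computes which interface would win; it does not touch the routing table or connection tracking. The interface names and metric values are made up, and the function names are hypothetical.

```python
PENALTY = 1000  # added to the configured metric while a link is unhealthy

def effective_metrics(configured, healthy):
    """configured: iface -> configured metric; healthy: iface -> bool."""
    return {
        iface: metric + (0 if healthy[iface] else PENALTY)
        for iface, metric in configured.items()
    }

def active_interface(configured, healthy):
    """Linux prefers the route with the lowest metric."""
    metrics = effective_metrics(configured, healthy)
    return min(metrics, key=metrics.get)

configured = {"wan": 10, "wanb": 20}

# Primary unhealthy: wan becomes 1010, so wanb (20) carries traffic.
print(active_interface(configured, {"wan": False, "wanb": True}))  # wanb

# Primary healthy again: the penalty disappears and wan (10) wins.
print(active_interface(configured, {"wan": True, "wanb": True}))   # wan
```

Since the penalty is derived from the configured metric on every run, recovery is just the absence of the penalty, which matches the "revert to the configured metric" behavior described above.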
In principle yes, but I have used a single ISP in my entire existence that follows this approach. The approach here is almost identical to what mwan3 does: ping external IPs of your choosing to confirm reachability. I'm open to adding even more checks here if you see fit, just comment on the code!
That is something that is not passive and cannot be implemented into the OpenWrt default installation, because it's something an ISP can ban you for due to ToS violation.
However, OpenWrt currently doesn't do any passive WAN monitoring. If a WAN link is broken, it just doesn't know. Yet there are lots of variables to watch: is the next hop reachable? Is name resolution working? Do tracked connections connect? Without active probing, a router can perfectly well detect when names no longer resolve and TCP connections no longer connect. This is the basic failover setup to aim for in order to set it apart from mwan3.
I just wanted to note that I think "cannot be implemented in the default install [maybe as an optional feature behind a checkbox/boolean, btw] because could be a ToS violation" is a very weak argument against anything that would improve OpenWrt (esp. if it were to be willingly enabled by the user), because with some ISPs, installing/using OpenWrt on or behind their equipment might already be construed as a "ToS violation".
My experience with active probing is that it is prone to false negatives, considering WAN links down because the probe targets no longer respond (especially when ISP-side filtering for DoS protection kicks in).
However passive monitoring should be implemented and enabled by default on OpenWrt, including GUI feedback to the user: "next-hop is down", "name resolution broken" and so on.
The router doesn't have to send TCP SYNs itself; it has a network behind it to do so. All it needs to do is passively monitor whether these connect by evaluating the connection tracking it is doing anyway. Any more sophisticated OEM firmware that is not just a China OpenWrt ripoff implements this to avoid unnecessary support calls.
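One way to picture this kind of passive check is a Python sketch that counts outbound TCP attempts still waiting for a reply in a conntrack listing. It assumes entries in the style of /proc/net/nf_conntrack, where an unanswered attempt shows the SYN_SENT state together with an [UNREPLIED] flag. The sample lines, the threshold, and the function names are illustrative assumptions, not a tested detection rule.

```python
def unanswered_tcp_attempts(lines):
    """Count TCP entries that are still in SYN_SENT and marked [UNREPLIED]."""
    count = 0
    for line in lines:
        tokens = line.split()
        if "tcp" in tokens and "SYN_SENT" in tokens and "[UNREPLIED]" in tokens:
            count += 1
    return count

def wan_looks_broken(lines, threshold=5):
    """Heuristic: many unanswered connection attempts suggest a dead uplink."""
    return unanswered_tcp_attempts(lines) >= threshold

# Illustrative entries only; addresses are from documentation ranges.
sample = [
    "ipv4 2 tcp 6 431999 ESTABLISHED src=192.168.1.10 dst=198.51.100.7 "
    "sport=51234 dport=443 src=198.51.100.7 dst=203.0.113.5 sport=443 "
    "dport=51234 [ASSURED] mark=0 use=2",
    "ipv4 2 tcp 6 118 SYN_SENT src=192.168.1.10 dst=198.51.100.8 "
    "sport=51240 dport=443 [UNREPLIED] src=198.51.100.8 dst=203.0.113.5 "
    "sport=443 dport=51240 mark=0 use=2",
    "ipv4 2 tcp 6 117 SYN_SENT src=192.168.1.11 dst=198.51.100.9 "
    "sport=40112 dport=80 [UNREPLIED] src=198.51.100.9 dst=203.0.113.5 "
    "sport=80 dport=40112 mark=0 use=2",
]
print(unanswered_tcp_attempts(sample))        # 2
print(wan_looks_broken(sample, threshold=2))  # True
```

The sketch deliberately ignores ESTABLISHED entries, since those can linger in the table long after a link has stopped passing traffic. A few unanswered attempts are also normal on a healthy link, so the threshold would need tuning per network.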
To be clear: multi-WAN setups with failover are absolutely an "aftermarket" use case. The point of this feature is something simpler than mwan3 that has fewer compatibility issues.
I am curious here, how would you envision this working? Without fancier routing tricks, after a primary link is demoted, traffic flows over the next failover link. If that link never becomes unhealthy, how do we know when to "fail back" to the primary connection?
Again, asking a curious question here, do you know of any other solutions that do this? From my previous experience/research, MikroTik, Meraki, Ubiquiti and OPNsense all rely on ICMP-based active probing for non-BGP failover scenarios.
Just like any client OS: You actively probe the failed (!) link until it comes back. Then you switch back to passive monitoring.
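That switching between passive monitoring and active probing can be written as a tiny state machine. The Python sketch below is one possible reading of the idea. The `Mode` enum and `next_mode` function are hypothetical names, and how `passive_ok` and `probe_ok` are obtained is left open.

```python
from enum import Enum

class Mode(Enum):
    PASSIVE = "passive"  # link believed healthy, only observe existing traffic
    ACTIVE = "active"    # link believed failed, probe it until it recovers

def next_mode(mode, passive_ok, probe_ok):
    """Return the monitoring mode for the next cycle.

    passive_ok: passive indicators (next hop, DNS, tracked connections) look fine.
    probe_ok:   an active probe over the failed link succeeded.
    Only the input relevant to the current mode is consulted.
    """
    if mode is Mode.PASSIVE:
        return Mode.PASSIVE if passive_ok else Mode.ACTIVE
    return Mode.PASSIVE if probe_ok else Mode.ACTIVE

mode = Mode.PASSIVE
mode = next_mode(mode, passive_ok=False, probe_ok=False)  # link fails
print(mode)  # Mode.ACTIVE
mode = next_mode(mode, passive_ok=False, probe_ok=True)   # probe succeeds
print(mode)  # Mode.PASSIVE
```

The point of the design is that probes are sent only while a link is considered failed, so a healthy link generates no extra traffic.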
For example, Microsoft Windows doesn't connect to msftncsi.com every five to ten seconds. It only does so if it thinks your network connection is broken. And before it does, it checks whether dns.msftncsi.com resolves.
@jtsn I spent a few minutes reading up on NCSI on Windows. As you rightly point out, if your goal is to test health of the primary route, it's a much less intrusive solution.
What NCSI does not solve is how to detect when the preferred link becomes healthy again. After we flip the route away, there is no more traffic flowing on the preferred link, so any firewall or connection-tracking counters for it would read zero.
The only way to tell is to force some traffic on that interface, which is what this script does. While we could do a DNS or HTTP probe, they are still active checks. ICMP is both simple and can be used to test a specific endpoint (e.g. a VPN concentrator that you need) that can also expose networking issues far outside your ISP's systems.
For the time being, I'm going to keep the active probes and will focus the docs on how to use them responsibly.