Fancy Fallback Routing (keepalived) in home LAN?

I'm the kind of guy who "breaks the internet" quite a bit :wink: mostly while doing things like upgrading software or adding new features like my YouTube quota that keeps my kids from vegging out in front of Minecraft videos all day... In the end it's good, but sometimes things break for a while, could even be a day or something until I figure out which thing is causing the problem.

I have AT&T gigabit fiber: the ISP provides fiber to an ONT, then Ethernet to an Arris NVG599. From there, Ethernet runs to a smart switch on a VLAN for WAN, and my main router, several APs, and some wired branches all hang off this switch.

So, in theory my primary AP, which is a WRT1900ACS, could also double as a "backup/fallback" router because it could also output packets on the WAN VLAN. The main router is a PC that also handles NAS, proxy, and other duties.

What I'd like is in the case where my main router/PC goes down, the WRT1900ACS (or some other device, could be an espressobin or jalapeno board etc) falls back and starts taking over all the routing. This way if I break something, I can always just unplug the main router, work on it via console, and the internet comes back up within a few seconds and no one sits around staring at me and tapping their foot :wink:

Any suggestions for technologies that could accomplish this? Note several things:

  1. The main router runs a squid proxy that is required to get out via HTTP/HTTPS so I assume I should have the backup device also run a squid proxy and somehow it should "take-over" the IP address of the main router when it loses a heartbeat.
  2. I don't think this is really a routing issue, since we're talking about basically static routes, there is only one ISP connection, and both the main router and the secondary device will be on the same VLANs (both guest and main LAN should be taken over).
  3. I'm ok with having a heartbeat on a separate VLAN if that makes sense.

Thoughts welcome!


Not sure how OSPF would work in this context. I think you'd need 3 routers for it to work.

  1. All clients point at router A
  2. Router A to Router B (PC running your smart policy stuff, e.g. the YouTube quota) uses OSPF
  3. Router A to Router C (where the handoff to fiber is) uses either a high-cost static route, or OSPF with a higher path cost than A to B

That should allow you to blow up the Router B PC at will :wink:

Yes, I could see this working. The only issue is that the squid proxy failover wouldn't work because although routing would work, client devices would still be trying to reach the squid proxy and it'd be on the "blown up" device.

The big issue I have here is that it's gig fiber, and so even just forwarding packets and proxying is a significant load on anything short of an x86. It's fine if it has to throttle back during failover, but it's not fine if it slows down the network during normal operations. So router A would wind up being basically beefy if it has to run squid all the time.

I think I need some kind of failover solution where the failover device takes over the gateway IP (and IPv6).

This is probably more trouble than it's worth, sigh.

Thought: the routing PC runs Debian. I could run all the routing in a container. Then when I want to break things, I could clone the container, change whatever I want in the container, and swap in the new container... if it doesn't work... I can quickly back out to the old container. This seems like a plausible alternative.

Maybe I misunderstood. Can you sketch a diagram of what devices you have currently?

If you only have two devices, and assuming router C (fibre endpoint) can take on the critical functions of B during an outage, then you might be able to merge routers A and C by using VLANs and still get the three-path triangle approach.

NVG599 from ATT --- VLAN 99 ---> Switch ---VLAN 99, and others-> Current router + proxy
                                    |
                              VLANs for LANS
                                    |
                               LAN devices

I can either add a device, if it makes sense, or use an existing WRT1900ACS which currently is just a dumb AP.

So for example I could point all the LAN devices at the WRT1900ACS and it could

  1. if the "main" router is up, just forward packets there
  2. if the "main" router is down, forward packets direct to VLAN 99

But it also needs to run a squid proxy, at least during the "down" time (it's fine if it just has zero policy and lets everything through then), so it needs to take over the IP address of the main router so that packets destined for the proxy arrive at the fallback router.
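
A rough sketch of that two-case decision in shell, in case it helps to see it written down. This is only an illustration: `pick_next_hop`, `main_is_up` and the `PROBE` override are made-up names, and the addresses in the usage line are documentation placeholders, not anything from my network. It also does nothing about the proxy IP takeover problem described above.

```sh
#!/bin/sh
# Sketch only: choose the next hop for the fallback box's default route.
# All names here are hypothetical; nothing below comes from a real setup.

# Default probe: a single ping with a 1 second timeout.
main_is_up() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# pick_next_hop MAIN_IP WAN_GW
# Prints MAIN_IP while the main router answers the probe,
# otherwise WAN_GW (the gateway reachable on the WAN VLAN).
# PROBE may be set to any command that takes the IP as its argument.
pick_next_hop() {
    main_ip=$1
    wan_gw=$2
    if "${PROBE:-main_is_up}" "$main_ip"; then
        echo "$main_ip"
    else
        echo "$wan_gw"
    fi
}
```

Something like `ip route replace default via "$(pick_next_hop 192.0.2.8 192.0.2.254)"` run from a timer would then flip the default route, at the cost of polling rather than reacting to a heartbeat.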

Also whatever the system is, it has to not slow down the network during everyday usage, and the network is gigabit fiber, so it's a pretty big demand.

Also, the best thing is if there's entirely separate hardware. The container idea doesn't help if I have to open up the main device and replace a hard drive or stuff like that.

Keepalived is available on OpenWrt and seems to at least do the basic IP takeover stuff:

https://raymii.org/s/tutorials/Keepalived-Simple-IP-failover-on-Ubuntu.html

with various useful articles on configuration:
https://raymii.org/s/articles/Adding_IPv6_to_a_keepalived_and_haproxy_cluster.html


Nice!! I would be curious to know how that turns out. It would likely be a much faster and more seamless failover too, in addition to solving the other proxy issue.

Has Linux implemented the equivalent of CARP yet? Basically it allows two interfaces (typically on two devices) to fail over a given IP address, so that the rest of the network has "no idea" that anything happened when a hot spare takes over for the main, or vice versa.

I think that's close to what keepalived is supposed to do.

Correct:

@lleachii any experience with using it? It sounds like the right solution. I don't want to distribute different possible routes around, I just want a device to take over the duty when I have to break my main router during debugging etc :wink:

Also my main router uses a 3-NIC bonded interface, and I wonder how the requirement to adjust the MAC will work with that.

No experience with VRRP directly.

Regarding the interface, wouldn't it be the MAC of the VLAN?

Well, this idea has convinced me to get an espressobin. Since it has an SD card slot and RAM in the gigs, and it needs to replicate my firewall in nftables, I'll probably run Armbian on it, but I will try to configure keepalived and report back how it works.

This will also be great for keeping the internet going during power outages, since the UPS will keep it running much longer than the x86 with RAID NAS.

HSRP was one Cisco protocol that I really liked. It's simple to set up and well designed, and it would be good just to read about its operation even though it's proprietary.

It breaks all established NAT sessions and the like, but in your case that's probably not so critical. From memory, it was pretty snappy on some lil 837's.... give or take 10 secs, and apart from the broken NAT and bandwidth disparity, smooth as.....

These things generally raise another hoop and are rarely a be-all and end-all in themselves, so it helps to have multiple links into a site.... and then decide to set it up, that's for sure.

You can definitely use keepalived to implement VRRP between two OpenWrt routers. I am using it in production in a 400-user setup.

Additionally, conntrackd can keep conntrack states in sync on both routers. And custom scripts can be triggered from a hook in keepalived whenever the state of a device changes from Master to Backup or vice versa.
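
For anyone wanting to wire the two together, here is a minimal sketch of what such a hook might do with conntrackd. It follows the sequence of flags used by the `primary-backup.sh` example that ships with conntrack-tools, as far as I remember it, so check it against the copy in your package before relying on it. `sync_conntrack` and the `CONNTRACKD` variable are names made up for this sketch.

```sh
#!/bin/sh
# Sketch only: conntrackd actions per keepalived state.
# CONNTRACKD lets the binary be swapped out (e.g. for a dry run).
CONNTRACKD="${CONNTRACKD:-conntrackd}"

# sync_conntrack STATE   (STATE is MASTER, BACKUP or FAULT)
sync_conntrack() {
    case "$1" in
        MASTER)
            # commit external cache to the kernel, flush caches,
            # resync with the kernel table, send a bulk update to the peer
            for flag in -c -f -R -B; do
                "$CONNTRACKD" "$flag" || return 1
            done
            ;;
        BACKUP)
            # shorten kernel conntrack timers, then request a resync
            for flag in -t -n; do
                "$CONNTRACKD" "$flag" || return 1
            done
            ;;
        FAULT)
            # shorten kernel conntrack timers to drop stale entries
            "$CONNTRACKD" -t
            ;;
        *)
            echo "unknown state: $1" >&2
            return 2
            ;;
    esac
}
```

A keepalived notify script would call `sync_conntrack "$3"`, since keepalived passes the new state as the third argument.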


Thanks for the confirmation. I will definitely do that. It's fine if conntrack breaks, this isn't so mission critical that a few connections can't be broken, but if I can take down the main router and within seconds the backup router starts up some services and starts routing everything... that's going to be awesome!


Ok, well I can report that my espressobin Tottenham (the hot-spare :slight_smile: ) was able to failover using keepalived between my main router and the tiny SBC in just a few seconds of downtime. It takes over the IP, fires up a squid proxy, fires up freeRADIUS to take over the WiFi authentication, and does a small variety of other things, and within seconds my phone is able to run a speedtest at full WiFi speeds.
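
If you want to confirm which box currently holds the shared IP during a test like this, a small check along these lines can help. It is just a sketch: `holds_vip` is a made-up helper that reads the one-line-per-address output of `ip -o addr show` on stdin and looks for the address.

```sh
#!/bin/sh
# Sketch only: does this machine currently hold the virtual IP?
# holds_vip VIP   (reads `ip -o addr show` output on stdin)
holds_vip() {
    awk -v vip="$1" '
        # field 4 of "ip -o addr" output is ADDRESS/PREFIXLEN
        { split($4, a, "/"); if (a[1] == vip) found = 1 }
        END { exit !found }
    '
}
```

For example, `ip -o addr show dev lan0.1 | holds_vip 10.x.x.10 && echo "I am MASTER"`, with your own interface and shared address filled in. Running it on both devices shows quickly whether both think they own the address.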

I'm not running OpenWrt on it because it has enough RAM and SD card space for a more full-featured distro (Armbian in this case), but OpenWrt does have keepalived and keepalived does work for this kind of thing. If you have an OpenWrt router that you want to fail over, look into it!


Congratulations! Can you share a basic setup so others can benefit when they have the same case?

Sure, the keepalived config looks like this on the spare. It exchanges unicast VRRP packets with the main router, and the main router's config is more or less symmetric. One thing to be careful of is that the "virtual_router_id" needs to be the SAME on both devices, so that they know they are paired with each other. That took me at least a few hours to figure out: both were insisting on being MASTER despite receiving each other's packets, because one had ID 20 and the other had ID 21.

global_defs {

# vrrp_iptables with no chain names: keepalived adds no iptables rules
vrrp_iptables
}

vrrp_instance PICOINT4 {

	state BACKUP
	interface lan0.1
	virtual_router_id 21
	priority 20
	advert_int 1
	unicast_peer {
	10.x.x.8
	}
	virtual_ipaddress {
	10.x.x.10
	}

	notify /etc/keepalived/notifyscript.sh
}


vrrp_instance PICOINT6 {

	state BACKUP
	interface lan0.1
	virtual_router_id 21
	priority 20
	advert_int 1
	unicast_peer {
	fdxx:xxxx:xxxx:1::8
	}
	virtual_ipaddress {
	fdxx:xxxx:xxxx:1::a/64
	}
}
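
Given how long the mismatched ID took to find, a quick comparison of the two configs might save someone the same few hours. This is a sketch under the assumption that you have copied the peer's keepalived.conf to the local machine; `list_vrids` and `same_vrids` are made-up helper names.

```sh
#!/bin/sh
# Sketch only: compare virtual_router_id values in two keepalived configs.

# list_vrids FILE: print every virtual_router_id value, one per line.
list_vrids() {
    awk '$1 == "virtual_router_id" { print $2 }' "$1"
}

# same_vrids FILE_A FILE_B: succeed only if both files declare
# the same set of IDs (order does not matter).
same_vrids() {
    [ "$(list_vrids "$1" | sort)" = "$(list_vrids "$2" | sort)" ]
}
```

Usage would be something like `same_vrids /etc/keepalived/keepalived.conf /tmp/peer-keepalived.conf || echo "virtual_router_id mismatch"`, where the second path is wherever you put the peer's copy.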


The notifyscript.sh just starts up or shuts down the services you want it to handle: whatever should happen when your device becomes MASTER, and the reverse when it stops being MASTER.

#!/bin/sh
# keepalived calls this as: notifyscript.sh TYPE NAME STATE

TYPE=$1
NAME=$2
TRANS=$3

case "$TRANS" in
    "MASTER")
        # Taking over: stop routing via the shared IP, bring up the WAN VLAN, start services
        ip route del default via 10.x.x.10 dev lan0.1
        ifup lan0.99
        systemctl restart squid dnsmasq freeradius
        ;;
    "BACKUP"|"FAULT")
        # Standing by: take the WAN VLAN down and route via the shared IP again
        ifdown --force lan0.99
        ip route add default via 10.x.x.10 dev lan0.1
        systemctl stop squid dnsmasq freeradius
        ;;
esac


I'm fine with some stuff breaking, I just want people to be able to surf within less than a minute of taking the main router down, which does in fact happen.
