Best practice for IPv6 failover

jtsn · April 3, 2025, 1:19pm

I currently run OpenWrt 24.10.0 with two WAN interfaces.

root@wrt:~# ubus call system board
{
        "kernel": "6.6.73",
        "hostname": "wrt",
        "system": "ARMv7 Processor rev 1 (v7l)",
        "model": "Linksys WRT1200AC",
        "board_name": "linksys,wrt1200ac",
        "rootfs_type": "squashfs",
        "release": {
                "distribution": "OpenWrt",
                "version": "24.10.0",
                "revision": "r28427-6df0e3d02a",
                "target": "mvebu/cortexa9",
                "description": "OpenWrt 24.10.0 r28427-6df0e3d02a",
                "builddate": "1738624177"
        }
}

It's a simple setup with a default route metric configured for the second (backup) interface:

network.wan=interface
network.wan.device='wan.7'
network.wan.proto='pppoe'
network.wan.ipv6='1'

network.wwan=interface
network.wwan.proto='dhcp'
network.wwan.device='eth1'
network.wwan.metric='128'

So wwan should never be used unless wan doesn't have a default route.

This works fine for IPv4, however IPv6 is a different story:

network.wan6=interface
network.wan6.device='@wan'
network.wan6.proto='dhcpv6'
network.wan6.reqaddress='try'
network.wan6.reqprefix='auto'
network.wan6.norelease='1'
network.wan6.metric='512'

network.wwan6=interface
network.wwan6.device='eth1'
network.wwan6.proto='dhcpv6'
network.wwan6.reqaddress='try'
network.wwan6.reqprefix='auto'
network.wwan6.norelease='1'
network.wwan6.metric='4096'

The metric is completely ignored unless I hack /lib/netifd/dhcpv6.script:

setup_interface () {
[...]
        for entry in $RA_ROUTES; do
                local duplicate=$NOSOURCEFILTER
                local addr="${entry%%/*}"
                entry="${entry#*/}"
                local mask="${entry%%,*}"
                entry="${entry#*,}"
                local gw="${entry%%,*}"
                entry="${entry#*,}"
                local valid="${entry%%,*}"
                entry="${entry#*,}"
                #local metric="${entry%%,*}"
[...]
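For readers unfamiliar with the parsing in that loop, here is a minimal standalone sketch of how one entry is split. It assumes each RA_ROUTES entry has the form `<addr>/<mask>,<gateway>,<valid>,<metric>`, which is inferred from the loop above; `parse_ra_route` and the sample values are hypothetical and not part of OpenWrt. With the last assignment commented out, as in the hack, the per-entry metric from the RA is presumably no longer picked up, so it cannot override the configured interface metric.

```shell
#!/bin/sh
# Hypothetical helper: split one RA_ROUTES entry of the assumed form
#   <addr>/<mask>,<gateway>,<valid>,<metric>
# using the same parameter expansions as the loop in dhcpv6.script.
parse_ra_route() {
	local entry="$1"
	local addr="${entry%%/*}"    # everything before the first "/"
	entry="${entry#*/}"
	local mask="${entry%%,*}"    # prefix length
	entry="${entry#*,}"
	local gw="${entry%%,*}"      # next hop
	entry="${entry#*,}"
	local valid="${entry%%,*}"   # lifetime in seconds
	entry="${entry#*,}"
	local metric="${entry%%,*}"  # the field the hack stops reading
	echo "addr=$addr mask=$mask gw=$gw valid=$valid metric=$metric"
}

# Made-up sample entry for illustration only.
parse_ra_route "::/0,fe80::1,1800,512"
```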

There is still source routing, of course, so this only works when source routing is disabled and NAT66 is used. If I don't filter prefixes on the LAN, clients tend to choose their own prefix instead of going with the preferred one.

What's the best practice to apply here?

_bernd · April 3, 2025, 2:04pm

Right now the shitty state of affairs is to use only ULA and do NPT at the edge, but in case of failover this cuts most of your connections anyway.
There is an RFC in the making, iirc, that at least admits that this is a huge problem for home and small business owners.
Not everyone gets provider-independent address space and can speak BGP to the ISP, which would make this a no-brainer because then there would be no issue... Sorry for the half rant.

jtsn · April 3, 2025, 2:37pm

This has to be done manually, right?

The expectation is that IPv6 routing behaves like IPv4 routing: once the default route on the lower-metric interface drops, forwarding seamlessly switches over to the next default route on the higher-metric interface.

With IPv6 source routing enabled, it comes down to which delegated IPv6 prefixes are announced on the downstream interfaces. The default is that all prefixes from all upstream interfaces are announced simultaneously, which is not useful.

Instead, only the IPv6 prefix from the upstream interface with the lowest metric should be announced. Once that interface goes down, the prefix from the upstream interface with the next higher metric should take over.
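The selection rule described here can be sketched as a small filter. This is only an illustration of the desired behavior, not something OpenWrt provides: `pick_prefix` is a hypothetical helper, the input format (`<metric> <state> <prefix>` per line) is invented for the example, and the prefixes are documentation addresses.

```shell
#!/bin/sh
# Hypothetical helper: read lines of the form "<metric> <up|down> <prefix>"
# on stdin and print the prefix of the lowest-metric upstream that is up.
# Prints nothing if no upstream is up.
pick_prefix() {
	awk '
		$2 == "up" && (!found || $1 + 0 < best) {
			found = 1
			best = $1 + 0
			prefix = $3
		}
		END { if (found) print prefix }
	'
}

# Primary is up: its prefix is the only one that should be announced.
printf '%s\n' '512 up 2001:db8:a::/56' '4096 up 2001:db8:b::/56' | pick_prefix

# Primary is down: the backup prefix takes over.
printf '%s\n' '512 down 2001:db8:a::/56' '4096 up 2001:db8:b::/56' | pick_prefix
```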

jtsn · April 3, 2025, 3:06pm

Looks like AI hallucinates a /bin/bash that OpenWrt doesn't have by default.

Thing is: I can ask LLMs myself if I wanted computer generated text for an answer.

This question is actually directed at real humans and their opinions on best practices for IPv6 failover running OpenWrt. Hopefully people who already did this on their OpenWrt setup. Thanks for your understanding.

jtsn · April 3, 2025, 3:35pm

It's not your solution. It's Grok's "solution". And Grok's solution is wrong.

trendy · April 3, 2025, 4:22pm

Have you considered mwan3? It should at least help with the dead link detection. I don't think it will solve the problem that you'll still need to do NAT66.

jtsn · April 3, 2025, 5:01pm

I don't need dead link detection, as the PPPoE session has this neatly built in and the backup interface doesn't need it. All addresses and default routes just get removed when the primary link drops and come back when it's connected again. For IPv4 this just works after correctly setting the interface metric.

NAT66 is not a problem, it's enabled with either a simple custom NAT rule per interface or globally and works out of the box.

The question is what I do with OpenWrt's IPv6 source routing. By default it announces multiple delegated WAN prefixes to the LAN and that doesn't look like a best practice. It makes clients pick & choose the IPv6-address (and therefore WAN upstream) by themselves, which is not what you want for an IPv6 failover router.

patrakov · April 3, 2025, 6:52pm

As the main connection is PPPoE, you are welcome to reuse my fail-over script:

jtsn · April 3, 2025, 8:07pm

Great, that's at least a workable configuration. However, I'm afraid that uci commit writes to the NOR flash, so it's going to brick my router when automated.

It's a bummer that OpenWrt doesn't support fail-over, as it is a standard feature in OEM firmware, even in consumer routers. For example, any 4G/5G router with a WAN port can use it in fail-over mode, with either side (wireless or wired) taking priority.

Currently I'm stuck with hacking dhcpv6.script, because otherwise it overwrites the IPv6 default route metric. On top of that, I have to set:

network.wan6.device='@wan'
network.wan6.metric='512'
network.wan6.sourcefilter='0'

network.wwan6.device='eth1'
network.wwan6.metric='4096'
network.wwan6.sourcefilter='0'
network.wwan6.delegate='0'

firewall.@nat[0]=nat
firewall.@nat[0].name='MASQ6'
firewall.@nat[0].family='ipv6'
firewall.@nat[0].proto='all'
firewall.@nat[0].src='wan'
firewall.@nat[0].target='MASQUERADE'
firewall.@nat[0].device='eth1'

ip -6 route

default via fe80::xxxx dev pppoe-wan  metric 512
default via fe80::xxxx dev eth1  metric 4096

Announcing the correct prefix instead of NAT66 would be better of course.
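To check which upstream is currently carrying traffic with a setup like this, the route listing can be parsed. The following is a rough sketch under the assumption that default routes appear in `ip -6 route` output with `dev <name>` and `metric <n>` fields, as in the listing above; `active_default_dev` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -6 route` output on stdin and print the
# device of the default route with the lowest metric. Lines that are not
# default routes, or that lack a dev or metric field, are ignored.
active_default_dev() {
	awk '
		$1 == "default" {
			dev = ""; metric = ""
			for (i = 2; i < NF; i++) {
				if ($i == "dev") dev = $(i + 1)
				if ($i == "metric") metric = $(i + 1)
			}
			if (dev != "" && metric != "" && (!found || metric + 0 < best)) {
				found = 1
				best = metric + 0
				winner = dev
			}
		}
		END { if (found) print winner }
	'
}

# On the router this would be fed from the live table:
#   ip -6 route | active_default_dev
# Here the two routes from the listing above are used as sample input.
printf '%s\n' \
	'default via fe80::xxxx dev pppoe-wan  metric 512' \
	'default via fe80::xxxx dev eth1  metric 4096' | active_default_dev
```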

patrakov · April 4, 2025, 5:37am

How often does the fail-over happen? If less than once per week, I wouldn't worry.

jtsn · April 4, 2025, 6:05am

Never, when things run smoothly. When something breaks, the connection might flip every few minutes. This is one of the reasons why I like to avoid scripted detection like mwan3. It's usually the PPPoE session breaking down intermittently due to various issues at the ISP's end, not "the line itself breaks".

Existing connections staying on the backup is fine, because the priority is on preventing noticeable interruptions.

_bernd · April 4, 2025, 9:21am

Pardon? What is not supported?
There is no OOTB solution because failover on end consumer lines is bad in any case. The best you can get is that you keep connectivity, but most connections will break if your source address changes.
Because every network is slightly different or has other constraints, you need to make decisions and implement it yourself anyway.

jtsn · April 4, 2025, 11:58am

It's a bog-standard feature of the FTTH consumer router delivered by my ISP. Of course there is no way to replace its OEM firmware, which is why I won't buy it.

All it needs to do is route through a different interface when one interface goes down, rewriting the source address and renumbering the LAN if necessary. Rewriting and renumbering are necessary anyway every time the PPPoE connection reconnects, so switching over to the backup is no different, besides the WAN addresses being from a different AS.

@patrakov did it by updating the NOR flash, which I want to avoid.

Phones and consumer Internet lines constantly change their source address, that is a given! Nobody cares.

patrakov · April 4, 2025, 1:48pm

See also this ISP router:

_bernd · April 4, 2025, 4:04pm

Good luck. Most of the time it's not the interface that goes down, so have fun detecting a failure on a line.

I guess you have never touched a network with BGP, right?

Anyway, if you are happy with NAT and renumbering, then do so.

jtsn · April 4, 2025, 4:40pm

A PPPoE interface reliably disappears when the LCP layer fails. The protocol has a built-in line failure detection, because it was designed to charge Ethernet per minute. You can't charge Ethernet per minute, if you don't know if your line is still up.
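As a rough illustration of how quickly that built-in detection reacts: pppd sends LCP echo requests every `lcp-echo-interval` seconds and presumes the peer dead after `lcp-echo-failure` consecutive requests go unanswered, so the worst case is approximately the product of the two. `lcp_detect_seconds` is a hypothetical helper and the values are examples, not the settings used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: approximate time in seconds until pppd declares
# the peer dead, given lcp-echo-interval ($1) and lcp-echo-failure ($2).
lcp_detect_seconds() {
	echo $(( $1 * $2 ))
}

# Example values only: an echo every 5 s, 5 missed replies tolerated.
lcp_detect_seconds 5 5
```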

That is not the scope of OpenWrt.

Have you ever used consumer Internet? Consumer devices like smartphones renumber and deal with NAT all the time. Pretty much all consumer devices run a variant of Android, iOS or Windows, which are able to deal with that kind of connectivity and roam seamlessly across multiple WAN networks (including mobile ones), with changing NAT endpoints and dynamic IPv6 prefixes.

What they don't deal well with is a router announcing multiple preferred IPv6 prefixes with a default route. That is what OpenWrt does by default with multiple WAN interfaces connected, and that is where things break.

jtsn · April 5, 2025, 4:37am

Needless to say, brain-dead IPv6 specifications like RFC 3484/6724 for Default Address Selection, which use a "Common Prefix Length", are the main reason why things break with the default way OpenWrt does IPv6. (Of course nobody ever implemented committee fantasies like static "Mobility Addresses".)

The secondary issue is that nobody considers announcing multiple preferred IPv6 prefixes a bug, while in reality it is one and shouldn't happen.

patrakov · April 6, 2025, 8:41am

The good news is that the correct RFC exists and doesn't need any extra work on the end devices. It specifically talks about not announcing multiple prefixes when fail-over is the desired behavior.

It's a bug that OpenWrt does not implement it natively and does not implement the link liveness check for non-PPP connections, which is a prerequisite.

jtsn · April 6, 2025, 12:27pm

On consumer routers this is usually coupled to the modem state. A DSL router knows if its DSL link is still up, a cable modem knows the state of its DOCSIS link, and a 5G router knows if it still has a signal.

Fail-over is always disruptive for consumer equipment (breaking all existing sessions), but consumer apps don't care. Every smartphone switching from Wi-Fi to mobile and back does this. Resetting the stale connections (TCP reset and appropriate ICMP messages instead of blackholing) is the wanted behavior in this type of environment. Apps that get their connections reset know what to do, instead of having to wait for timeouts. Even browsers are able to do this nowadays.

The goal is that the user doesn't see "The server didn't respond", not that some TCP session survives as long as possible.