This one is an odd one, which I'm assuming will have an easy but not so obvious solution.
I've been using OpenWRT and MWAN3 for over a year now with COX and T-Mobile as my two connections. It's been flawless for the most part. I do have an extra router that's just been collecting dust (same model), and I figured it might be neat to have both in use, with keepalived managing a VIP for the default gateway. Since the T-Mobile box has 2 ethernet ports, I connected the second one to the second router, configured keepalived, and dang, that was easy! The VIP moves back and forth, apps on the client end figure out the change in less than a second, and all is good. I now have full network HA for my home/home office.
Until I reboot the secondary router (or either router, for that matter). Within about 10 seconds, the surviving router will lose all outbound connectivity via the 2 WAN interfaces (both T-Mobile and COX). "ping -I lan2 8.8.8.8" (or lan3) is dead. If I do a network restart, it all comes back to life. If I let it sit for 10 minutes or so, it also comes back to life (the other router is back online at this point).
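One way to pin down exactly when each WAN drops is to loop the same ping test over both ports. This is only a minimal sketch, not a fix: `check_wans` is a hypothetical helper name, and the `-c 3` / `-W 2` values are arbitrary choices.

```shell
# Hypothetical helper: ping a target through each given interface and
# print one OK/FAILED line per interface.
check_wans() {
    target="$1"
    shift
    for ifname in "$@"; do
        # -c 3: send three probes, -W 2: wait up to 2 s for a reply,
        # -I: force the probes out of this interface
        if ping -c 3 -W 2 -I "$ifname" "$target" >/dev/null 2>&1; then
            echo "$ifname: OK"
        else
            echo "$ifname: FAILED"
        fi
    done
}

# Example, run on the surviving router while the other one reboots:
#   while true; do date; check_wans 8.8.8.8 lan2 lan3; sleep 2; done
```

Running that loop next to `mwan3 status` and `ip rule` should show whether the pings die before or after MWAN3 marks the interfaces offline.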
So the mystery here is: why does the reboot of one physical router affect just the outbound connectivity of the other physical router?
The interesting point that I need to make is that I cloned the original router. I took its backup file, restored it to the secondary router, changed its IP addresses and name, and ensured the MAC addresses were the hardware ones read at boot. So I think I'm good there. I might do a full system wipe/reimage and build it from scratch just for the sake of it.
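Since the second router was restored from the first one's backup, a leftover MAC override is worth ruling out. A rough sketch of that check, assuming each router's MACs were dumped to a file with `cat /sys/class/net/*/address` (the helper name `shared_macs` and the file names are made up for illustration):

```shell
# Hypothetical helper: print every MAC address that appears in both dumps.
# Each dump holds one MAC per line.
shared_macs() {
    awk '
        # The loopback device reports an all-zero MAC on every box; skip it.
        $1 == "00:00:00:00:00:00" { next }
        # First file: remember each MAC (lower-cased).
        NR == FNR { seen[tolower($1)] = 1; next }
        # Second file: print MACs already seen in the first file.
        (tolower($1) in seen) { print tolower($1) }
    ' "$1" "$2" | sort -u
}

# Example:
#   on each router:  cat /sys/class/net/*/address > /tmp/macs_<name>.txt
#   then compare:    shared_macs macs_primary.txt macs_secondary.txt
```

No output means the two routers do not share any MAC address; any line printed is an address that exists on both and deserves a closer look.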
Specs/software in use
WRT3200ACM (router only, WIFI disabled)
OpenWRT 22.03.2
MWAN3
keepalived
dnsmasq
odhcpd
Network Topology:
T-Mobile -------> Primary Router, lan3 (DHCP client with local NAT on the t-mobile box)
--------> Secondary router lan3 (DHCP client with local NAT on the t-mobile box)
COX --------> Primary router lan2
(there is no second plug on this box)
lan1 -> plugged into 2 different switches on the core network, which are uplinked to each other.
lan4 unused (backup address for login)
My thought here is that MWAN3 is somehow talking to the other router, and when it loses its connection to the other router, it does a reset of sorts. The part of the network stack that breaks is directly related to what MWAN3 is managing.
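To test that theory, it may help to watch only the relevant log lines on the surviving router while the other one reboots. A small sketch; `filter_ha_log` is a made-up name, and the tag list is an assumption about what shows up in `logread` on this setup.

```shell
# Hypothetical helper: keep only log lines that mention the daemons which
# could be changing WAN state. Adjust the pattern to match the tags that
# actually appear in your log.
filter_ha_log() {
    grep -E -i 'mwan3|keepalived|netifd'
}

# Example, started on the surviving router before rebooting the other one:
#   logread -f | filter_ha_log
```

If MWAN3 really is reacting to the peer going away, its tracking or hotplug messages should show up within the same 10 seconds or so that the pings start failing.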