2 routers: when one reboots, networking breaks on the other

This is an odd one, which I assume will have an easy but not-so-obvious solution. :slight_smile:

I've been using OpenWrt and MWAN3 for over a year now, with COX and T-Mobile as my two connections. It's been flawless for the most part. I had an extra router that's just been collecting dust (same model), and I figured it might be neat to have both in use, with keepalived managing a VIP for the default GW. Since the T-Mobile box has two Ethernet ports, I connected the second one to the second router, configured keepalived, and dang, that was easy! The VIP moves back and forth, apps on the client end notice the change in less than a second, and all is good. I now have full network HA for my home/home office :slight_smile:
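For anyone wondering how keepalived picks the VIP owner and how quickly the backup takes over: keepalived implements VRRP, and the sketch below applies the RFC 5798 (VRRPv3) election rule and timers. The router names, priorities, and addresses are hypothetical, not taken from this setup; it assumes a 1-second advertisement interval (keepalived's `advert_int`) and that preemption is enabled.

```python
from __future__ import annotations

import ipaddress


def skew_time(priority: int, adver_interval: float = 1.0) -> float:
    """RFC 5798: Skew_Time = ((256 - Priority) * Master_Adver_Interval) / 256 seconds."""
    if not 1 <= priority <= 254:
        raise ValueError("a backup's priority must be between 1 and 254")
    return ((256 - priority) * adver_interval) / 256


def master_down_interval(priority: int, adver_interval: float = 1.0) -> float:
    """How long a backup waits without hearing advertisements before claiming the VIP."""
    return 3 * adver_interval + skew_time(priority, adver_interval)


def elect_master(routers: dict[str, tuple[int, str]]) -> str:
    """Steady-state winner with preemption on: highest priority wins,
    and a tie goes to the higher primary IPv4 address."""
    return max(
        routers,
        key=lambda name: (routers[name][0], ipaddress.IPv4Address(routers[name][1])),
    )


# Hypothetical values, not the poster's keepalived config.
pair = {"primary": (150, "192.168.1.2"), "secondary": (100, "192.168.1.3")}
print(elect_master(pair))         # → primary
print(master_down_interval(100))  # → 3.609375 (backup at priority 100, 1 s adverts)
print(skew_time(100))             # → 0.609375 (wait after a priority-0 "leaving" advert)
```

So a backup that simply stops hearing the master waits roughly 3.6 seconds, while a master that shuts down cleanly sends a priority-0 advertisement and the backup takes over after only Skew_Time, which may be why planned moves look nearly instant. VRRPv2 (RFC 3768) defines Skew_Time as (256 - Priority)/256 seconds, which gives the same numbers at a 1-second interval.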

Until I reboot the secondary router (or either router, for that matter). Within about 10 seconds, the surviving router loses all outbound connectivity via both WAN interfaces (T-Mobile and COX): "ping -I lan2 8.8.8.8" (or -I lan3) is dead. If I do a network restart, it all comes back to life. If I let it sit for 10 minutes or so, it also comes back to life (by that point the other router is back online).

So, the mystery here is: why does rebooting one physical router affect only the outbound connectivity of the other physical router?

One point worth mentioning: I cloned the original router. I took its backup file, restored it to the secondary router, changed its IP addresses and name, and made sure the MAC addresses were the hardware ones read at boot. So I think I'm good there. I might do a full wipe/reimage and build it from scratch just for the sake of it.

Specs/software in use:
WRT3200ACM (router only, Wi-Fi disabled)
OpenWrt 22.03.2
MWAN3
keepalived
dnsmasq
odhcpd

Network topology:
T-Mobile --------> Primary router lan3 (DHCP client; NAT is done on the T-Mobile box)
         --------> Secondary router lan3 (DHCP client; NAT is done on the T-Mobile box)
COX      --------> Primary router lan2 (there is no second port on this box)
lan1 (each router) -> plugged into two different switches on the core network, which are uplinked to each other
lan4: unused (backup address for login)

My thought is that MWAN3 is somehow talking to the other router, and when it loses its connection to that router, it does a reset of sorts. The part of the network stack that breaks is exactly what MWAN3 manages.

After some trial and error: this issue only occurs during boot-up (of either router), and only when both routers are physically connected to the T-Mobile box (a hub, essentially).

If I unplug the secondary router from the main switch, it boots just fine and no issues are seen on the primary router.

So, at some point in the boot process, OpenWrt is creating a full loop. Once fully booted, all is good, with no network issues that I can see.

Actually, this is probably not OpenWrt, but rather the "lack" of OpenWrt. Specifically, many bootloaders initialize the built-in hardware switch as a 4- or 5-port "dumb" switch until the firmware has booted far enough to load the desired configuration. Often (but not always), the bootloader isolates the port labeled "wan" on the case from the other ports (assuming they are all on the same hardware switch), because the manufacturer knows that port will likely carry the upstream internet connection and should not be bridged. Not all bootloaders do this, though: the ER-X, for example, used to bridge all 5 ports until a newer bootloader came out around 2019. But the other ports will typically stay bridged together until the firmware (once booted) reconfigures the switch.

You can verify this by looking at the UART serial output during the boot sequence and comparing the switch state during the bootloader portions and then through the actual OpenWrt boot sequence (keep in mind that the switch won't be reconfigured from the bootloader's default state until OpenWrt has booted enough to bring up the switch chip drivers and read the config files).
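To make that explanation concrete, here is a toy union-find model of the layer-2 segments in this thread's topology. It assumes (hypothetically) that the T-Mobile box's two Ethernet ports form one segment, that the two uplinked core switches can be treated as one segment named `core`, and that a router still in its bootloader bridges lan1 through lan4 while leaving the port labeled `wan` isolated. It only checks whether the T-Mobile segment lands in the same broadcast domain as the core LAN; it does not model the exact failure on the surviving router. All port and segment names are made up for illustration.

```python
from __future__ import annotations

# Hypothetical segment names: "tmobile" = the T-Mobile box's two Ethernet ports,
# "core" = the two uplinked core switches, "cox" = the COX modem.
LAN_PORTS = ("lan1", "lan2", "lan3", "lan4")


def find(parent: dict[str, str], node: str) -> str:
    """Return the representative of node's L2 segment (union-find, path halving)."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def link(parent: dict[str, str], a: str, b: str) -> None:
    """Put a and b in the same L2 segment (a cable or an internal bridge)."""
    parent[find(parent, a)] = find(parent, b)


def tmobile_joins_core_lan(booting: str | None, uplink_port: str) -> bool:
    """True if the T-Mobile segment shares a broadcast domain with the core LAN.

    booting: "primary", "secondary", or None (both routers fully booted).
    uplink_port: the router port cabled to the core switches ("lan1" or "wan").
    """
    parent: dict[str, str] = {}
    cables = [
        ("primary.lan3", "tmobile"),
        ("secondary.lan3", "tmobile"),
        ("primary.lan2", "cox"),
        (f"primary.{uplink_port}", "core"),
        (f"secondary.{uplink_port}", "core"),
    ]
    for a, b in cables:
        link(parent, a, b)
    # Booted OpenWrt routes/NATs between LAN and WAN ports instead of bridging them,
    # so only a router still in its bootloader adds internal bridging.
    if booting is not None:
        # Assumed bootloader behavior: lan1..lan4 act as one dumb switch, "wan" isolated.
        for port in LAN_PORTS[1:]:
            link(parent, f"{booting}.lan1", f"{booting}.{port}")
    return find(parent, "tmobile") == find(parent, "core")


for uplink in ("lan1", "wan"):
    for booting in (None, "primary", "secondary"):
        merged = tmobile_joins_core_lan(booting, uplink)
        print(f"uplink={uplink:<4} booting={booting!s:<9} merged={merged}")
```

With the uplink on `lan1` (the original setup), either router sitting in its bootloader merges the T-Mobile segment into the core LAN. With the uplink moved to `wan`, the check stays False in every case, assuming the bootloader isolates `wan` as described above.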


That does explain what's going on, for sure.

I suppose I'll have to reconfigure and test with the WRT3200, but is the WAN port isolated from this bridging? If so, then my solution is to use the WAN port for the uplink to the core switches and use the other ports for MWAN (multiple internet connections).

Today I use port 1 for LAN, and ports 2 and 3 for WAN.

Can you draw a diagram of the physical topology of your network so that I can understand all of the connections? This would go a long way towards helping with suggestions and potential solutions (vs. the simplified description earlier in the thread).

Psherman,

I think I found a solution based on your example of the ER-X's past bootloader behavior. On the secondary router, I just moved its "uplink/core" network connection to the WAN port and rebooted the router. The reboot didn't break the network. It seems the bootloader does in fact isolate WAN from the LAN ports during boot-up.

I'm in the process of moving the configuration over to the WAN port and will report back shortly.

Thank you!

I did order a pair of ER-Xs, as these WRT3200s are old as dirt; might as well upgrade for $125. (I do love their LR-6s; I use them for my APs.)

Problem solved. The WRT3200 does not, in fact, bridge the WAN port with the LAN ports at boot-up!

At this point, my "WAN" port is the main uplink to the core network and reboots no longer break things.

The fun part was that I could make the changes on both routers without impacting the Netflix and gaming going on. I lost just a couple of pings when keepalived moved my GW between routers :smiley:

I have two UPSes and sort of have "A/B" power from the UPS up through the stack. Later this week I'm replacing one of them with a newer one that copes better with my generator (which I've already put 10 hours of run time on this year!). In theory, I should be able to take out an entire UPS without losing any network connectivity for my home end users :slight_smile:

Awesome!! Glad we got to the bottom of this problem!!

If your problem is solved, please consider marking this topic as [Solved]. See How to mark a topic as [Solved] for a short how-to.
Thanks! :slight_smile:
