Do I want VLANs? Home network going down once a month with ARP storm

I have a rather large home network with dozens upon dozens of devices, a router in AP mode, and 5 or 6 unmanaged switches. At least 20 wired devices and 30 wireless devices.

About once a month, the whole wired part of the network goes down (my OpenWrt router's wifi still works; the AP does not).

Plugging Wireshark into one of the ports on the central unmanaged switch, I see almost nothing but MAC-specific-ctrl-proto-01 / MAC CTRL / Pause packets. I suspect I'm getting into an ARP storm and that eventually one of the unmanaged switches hits its limits.
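
For anyone reading a capture like this: those labels are what Wireshark typically shows for Ethernet flow-control (PAUSE) frames, which use EtherType 0x8808 and are a different frame type from ARP (EtherType 0x0806). A rough Python sketch of telling the two apart, assuming raw untagged Ethernet II frames as bytes (the function name and the sample frames are made up for illustration):

```python
import struct

ETHERTYPE_ARP = 0x0806
ETHERTYPE_MAC_CONTROL = 0x8808   # IEEE 802.3 MAC Control (flow control)
OPCODE_PAUSE = 0x0001            # MAC Control opcode for PAUSE


def classify_frame(frame: bytes) -> str:
    """Classify an untagged Ethernet II frame as 'arp', 'pause', 'mac-control' or 'other'.

    Assumption: no 802.1Q tag is present, so the EtherType sits at bytes 12-13.
    Tagged frames (EtherType 0x8100) simply fall through to 'other'.
    """
    if len(frame) < 14:
        return "truncated"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == ETHERTYPE_ARP:
        return "arp"
    if ethertype == ETHERTYPE_MAC_CONTROL:
        if len(frame) >= 16 and struct.unpack("!H", frame[14:16])[0] == OPCODE_PAUSE:
            return "pause"
        return "mac-control"
    return "other"


# Illustrative frames only (the source address is invented).
SRC = bytes.fromhex("020000000001")
PAUSE_FRAME = bytes.fromhex("0180c2000001") + SRC + bytes.fromhex("88080001ffff") + bytes(42)
ARP_FRAME = bytes.fromhex("ffffffffffff") + SRC + bytes.fromhex("0806") + bytes(46)

print(classify_frame(PAUSE_FRAME), classify_frame(ARP_FRAME))
```

Counting frames per class and per source MAC over a capture would show whether the flood is really ARP or flow control, which points at different kinds of culprit.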

I'd like to begin segmenting my network, but I don't have any idea on how to start. I learned about the OSI model in school, but never really dove into it in practice.

Is a VLAN what I would want? I would still like layer 3 communications to work between VLANs, but want to isolate layer 2 traffic into smaller segments.

If a VLAN is what I want, is there a way to do that in OpenWrt 21 with LuCI? I watched some videos on setting up VLANs, but it looks like the settings in OpenWrt have changed since then.

Thanks, any advice appreciated.

There are a lot of things to unpack... but I'll start with this: it depends on your objectives.

VLANs serve two important functions:

  1. VLANs make it possible to segment the network into multiple broadcast domains. Compared with one large network, several smaller networks can improve efficiency, because fewer hosts per network means less 'chattiness', especially broadcast traffic, which tends to be one of the things that can multiply exponentially.
  2. VLANs add security: firewall rules can allow or prohibit (or selectively allow/prohibit) traffic flowing between VLANs. This means that you can, for example, protect trusted networks from untrusted (IoT, guest, kids, etc.) hosts.

Another reason to use VLANs is the logical grouping of hosts... so if you are a business, it might make sense to have all of the computers for one department (maybe marketing) on a VLAN, and a separate VLAN for another team (say finance). Or have a VLAN for your network infrastructure, another for your media devices, and one for your general purpose computers... just as an example. You can couple that with the two things above, of course.

Next thing -- if you need any single physical switch to handle multiple VLANs, you must get a managed switch. Unmanaged switches just can't do this -- they are designed for a single untagged network. You can, however, split your OpenWrt router's ports such that each physical port represents one network, and then connect an unmanaged switch to each port. This can work just fine if your physical topology allows.

Wifi is another issue -- your AP(s) (and any intermediate switches) must support VLANs if you wish to create wifi networks for each of your VLANs.

I would recommend your starting point should be as follows:

  • map out all of your devices and their respective services/dependencies so you can logically group them together (e.g. all your IoT devices might be good on a single VLAN, etc.).
  • figure out the security model you want (wide open vs restricted, etc.)
  • make a physical topology map to understand how these networks get distributed through your home. This will help you understand if you need to buy one or more managed switches or any other hardware (such as APs that can support multiple SSIDs and VLANs), as well as your overall strategy for how you will configure everything once you have appropriate plans around the hardware.
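
As a rough illustration of that planning step, here is a small Python sketch that takes an inventory of devices and reports which switches would end up carrying more than one VLAN (and so would need to be managed). The device names, switch names, and VLAN names are all hypothetical, and the sketch deliberately ignores uplinks: a switch that feeds another switch also carries that switch's VLANs.

```python
from collections import defaultdict

# Hypothetical inventory: device name -> (switch it plugs into, planned VLAN).
PLAN = {
    "nas": ("office-switch", "lan"),
    "desktop": ("office-switch", "lan"),
    "tv": ("livingroom-switch", "media"),
    "thermostat-hub": ("livingroom-switch", "iot"),
    "printer": ("hall-switch", "lan"),
}


def switches_needing_vlan_support(plan):
    """Return {switch: sorted VLAN names} for switches that carry more than one VLAN.

    Simplification: only directly attached devices are counted, not downstream switches.
    """
    vlans_per_switch = defaultdict(set)
    for switch, vlan in plan.values():
        vlans_per_switch[switch].add(vlan)
    return {sw: sorted(vlans) for sw, vlans in vlans_per_switch.items() if len(vlans) > 1}


print(switches_needing_vlan_support(PLAN))
```

Switches that do not show up in the result carry a single VLAN and could stay unmanaged, provided they hang off a port that is untagged for that VLAN.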

Better (L2+) managed switches usually also offer advanced features to detect and stop issues like this, beyond their mere VLAN capability - so they do make sense either way.

20 wired devices aren't really a "rather large […] network", so you should be able to find the culprit and kill it with fire (the same goes for 30 wireless ones).


Yup, all good points! The other thing you gain with managed switches is more memory and larger MAC/ARP table capacity, which makes it less likely that you'll overwhelm the switch. This may not really be the issue, but it lets you rule out hitting any limits on your infrastructure.

Just getting a managed switch, even if VLANs aren’t in use, could help the performance of the network based on these factors and the tools @slh mentioned.


Any advice on how to find the culprit? I've tried unplugging ports and taking Wireshark traces, but I'm kind of in over my head, to be honest.

Yeah, I'm going to replace one of the central switches with a managed one. Was planning on segmenting a whole part of the house as one VLAN, and another as another VLAN.

I should probably create a new SSID for all the smart things that shouldn't get LAN access (just WAN to phone home to the mothership)

First step, wireshark - second step, unplugging leaves of the network while it happens (might not directly help, but it could).


Look for patterns and methodically evaluate what is happening on the network.

The culprit can be things that you wouldn't expect, for example:

  • some USB-C docking hubs with ethernet will take down an entire network when the host computer goes to sleep or is disconnected. This is a bug in the firmware that causes a broadcast storm from the affected hubs, but is entirely unexpected.
  • I've seen Peloton bikes that will also take down a network when they are connected by both wifi and ethernet -- likely due to a bug in how they handle the network connections (probably bridged rather than treated as independent interfaces).
  • Sonos made a change sometime back that caused switching loops (likely a problem with their STP algorithms) which caused major issues with my personal network despite years of use in a given configuration. Removing the Sonos Bridge device and making sure that only a single Sonos was wired into my network fixed that issue, but I thought I had a failing switch because I couldn't believe that the hardware on my network was causing that type of problem.
  • Failing hardware can also cause these issues. I've replaced failed capacitors on a few switches which would just crap out as soon as there was any significant traffic or multiple nodes connected.
  • Insufficient power from the wall or PoE (a bad or undersized power supply, or hitting/exceeding the power budget) can cause devices to get knocked offline or into an unresponsive state.

The list goes on.... run lots of tests, be methodical, and don't make any assumptions.


The first clue would probably be the timing itself, described as once a month. That is repeatable, not very frequent, and roughly a fixed interval. I don't think a setup problem would show up at this interval. It usually means some specific outside interference that happens at that frequency.

But "bringing down a network" can mean a lot of things. If the ISP goes dead, it can look like the network is dead even though the internal networks actually work just fine: the router simply has no data to move, since the internet is down and 99% of all devices only use the internet.

And sometimes I think many just forget to look at the blinking lights on the devices…

They blink fast or slow for a reason, but everyone wants the fancy and complicated solutions first.

I wouldn't say that the suggestions from @slh and myself are fancy or complicated. Yes, the Blinky Blinky lights can be super useful for diagnostics, and by all means that is an easy first step. However, I would argue that my method is more about the exercise of critically examining the situation (patterns, trends, behaviors) than it is about complicated solutions -- unless someone wants to argue that the gray squishy stuff is fancy :stuck_out_tongue:


Wifi on my router is still working just fine when it gets into an unhealthy state. It's not an ISP issue, it's an issue with the local wired network.

I was thinking more about how often people say that hooking up Wireshark was the first thing they did, and that they found nothing. I have only used Wireshark once, and it didn't help much.

Simple logical fault finding and googling solved all the problems faster than hooking up Wireshark.

But isn't the wifi actually on the same network as the wired part that “stopped working”?

The only difference is if you use copper or radio waves for data distribution.

I have a CalDigit TS3 USB docking station for my work MacBook. I rarely unplug it, but I did this afternoon, around 1-2:30pm. Plugged back in around 3pm and network was down at 5:30pm today. I've never liked that thing much, lots of annoying issues (won't wake up from sleep, every power cycle my speaker balance needs to be adjusted, etc). Definitely will unplug that ethernet cable first next time there's an issue.

Almost two decades ago, I had an unmanaged 5-port 100 Mbit/s desktop switch that started reflecting packets when it was powered off... While powered on, it worked just 'fine', but whenever it got powered off, it took down the whole network beyond its own ports as well: not immediately, but as soon as there was slightly elevated network traffic. It was a nightmare to debug, especially as it only happened when I wanted to access my server from the outside (with the switch and most of the remaining network powered off, whereas it was naturally on when I was home using those systems interactively).

Well, that is probably why most managed switches with a little self-respect today have storm detection and limiting turned on by default.

A VLAN-segmented network also helps with this kind of fault finding, since the faulty device should at first only wreak havoc in the VLAN it's connected to.

The router can still be overwhelmed by a storm in one VLAN, but you should see a lot of light activity on that VLAN's ports only, while the other lights “freeze”.

When everything sits in one single VLAN, it is all or nothing if something happens, since every device is connected to every other.


Here's a sample of what I captured from one of my switches during an unhealthy state. I can't find a device that matches the src MAC address, but it appears to be from a Realtek device.

Precisely, so now you need to find out which of your devices have a Realtek network card. I am not that sure you will find the source of the problem that way, though, because the least significant bytes should be more than 0 in a real device.
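
To make that check concrete: the first three bytes of a MAC address are the vendor prefix (OUI) and the last three are the device-specific part. A small Python sketch, with made-up example addresses and hypothetical helper names, that splits an address and flags an all-zero device part:

```python
def split_mac(mac: str):
    """Split a MAC address into (oui, device_part) as lowercase hex strings."""
    digits = mac.replace(":", "").replace("-", "").lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits[:6], digits[6:]


def has_zero_device_part(mac: str) -> bool:
    """True when the last three bytes are all zero, which is unusual for real hardware."""
    _, device_part = split_mac(mac)
    return device_part == "000000"


def is_locally_administered(mac: str) -> bool:
    """True when the locally-administered bit (0x02 of the first byte) is set."""
    oui, _ = split_mac(mac)
    return bool(int(oui[:2], 16) & 0x02)


# Example addresses are invented for illustration, not taken from the capture.
for example in ("00:e0:4c:00:00:00", "00:e0:4c:68:01:a2"):
    print(example, split_mac(example), has_zero_device_part(example))
```

An address that fails these sanity checks may be a default or uninitialised value from a misbehaving adapter, in which case it will not match anything in the router's DHCP lease list.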


Depending on how you gather log data from your devices, you should look at the logs from just before the storm begins to see what happened in the system at time 0, with a focus on devices connecting or disconnecting.
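
A minimal Python sketch of that idea, assuming the logs have already been parsed into (timestamp, message) pairs; the function name, the default window, and the sample entries are all hypothetical:

```python
from datetime import datetime, timedelta


def events_before(entries, storm_start, window=timedelta(minutes=30)):
    """Return the (timestamp, message) pairs logged within `window` before storm_start, oldest first."""
    earliest = storm_start - window
    return sorted((ts, msg) for ts, msg in entries if earliest <= ts <= storm_start)


# Invented sample data: how the parsing is done depends on your log source.
storm = datetime(2022, 1, 1, 17, 30)
sample = [
    (storm - timedelta(hours=3), "port 2 link up"),
    (storm - timedelta(minutes=10), "port 4 link down"),
    (storm - timedelta(minutes=20), "port 4 link up"),
    (storm + timedelta(minutes=1), "port 1 link down"),
]

for ts, msg in events_before(sample, storm):
    print(ts.isoformat(), msg)
```

Widening the window is cheap, and worth trying if the trigger might precede the outage by hours rather than minutes.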

Yes. VLAN is what you want - to firewall all those IoT devices you mentioned phoning home, if for nothing else. VLANs set up for IoT, Guest, and Home LAN, with an SSID for each, and wired port access segregated for wired devices accordingly, are a good plan for today's connected homes. Unless WiFi throughput is REALLY slow (as in 802.11b legacy rates slow), within reason you don't need to worry about overhead from too many SSIDs combined with too many neighbors crowding the same WiFi channels all conspiring to slow down your network. You'll get tired of configuring VLANs and SSIDs before that becomes a noticeable issue :wink:

Yes. If OpenWrt has migrated to DSA from swconfig for your device, the configuration is quite a bit different between the two. This is a good starting point for DSA configuration:
mini-tutorial-for-dsa-network-config
