WDS and mysterious network disruptions (network loop?)

I have an independent DHCP server on my LAN, so I don't need or want DHCP servers running on my router(s).

I'm also trying to get WDS working on two identical atheros-based systems (EA8500), both running LEDE.

Everything works as expected on my LAN (and WAN too, though I realize WAN is not involved in WDS) except for WDS. I have configured my WDS AP router according to https://wiki.openwrt.org/doc/recipes/atheroswds and it works as expected in every other respect. But as soon as I power on my WDS STA router (which I'm fairly certain I've configured correctly according to the same recipe, https://wiki.openwrt.org/doc/recipes/atheroswds ), my network begins to suffer disruptions.

I'm thinking that the DHCP server on the STA router is almost certainly the root cause that is disrupting the ability of multiple hosts on my LAN to find other hosts by their long-established static IP addresses.

I can ping multiple hosts from multiple hosts and get 100% return rates on these pings with no problem. Then I turn on the STA router and within seconds, I'm no longer able to get pings returned from multiple hosts. Then when I power off my STA router, I see pings getting returned again within 1 minute of doing so. I suppose it might be something else (another WiFi network broadcasting with the same SSID?) about the STA router that is disrupting my network, but I'm pretty sure that it's the DHCP servers because even hosts that are connected only via ethernet are getting disrupted.

I've tried doing...

/etc/init.d/odhcpd stop
and
/etc/init.d/odhcpd disable

...and yet I still see...

# ps|grep dhcp
  796 root       804 S    odhcp6c -s /lib/netifd/dhcpv6.script -P0 -t120 eth0.
  800 root      1036 S    udhcpc -p /var/run/udhcpc-eth0.2.pid -s /lib/netifd/
 1438 root      1036 S    grep dhcp
root@dor:~#

and

# netstat -lpn|grep dhcp
udp        0      0 :::546                  :::*                                796/odhcp6c
raw        0      0 ::%3069509888:58        ::%125776:*             58          796/odhcp6c
root@dor:~#

...after doing both

/etc/init.d/odhcpd stop
and
/etc/init.d/odhcpd disable

I think there is a respawn option set somewhere, so even when I kill -9 pid, I still get the dhcp server restarting.

I've also tried setting my /etc/config/dhcp file so that it looks like (NB the option ignore '1' lines):

config dhcp 'lan'
	option interface 'lan'
	option dhcpv6 'disabled'
	option ra 'server'
	option ra_management '1'
	option ignore '1'

config dhcp 'wan'
	option interface 'wan'
	option ignore '1'

config odhcpd 'odhcpd'
	option maindhcp '0'
	option leasefile '/tmp/hosts/odhcpd'
	option leasetrigger '/usr/sbin/odhcpd-update'
	option ignore '1'

And in the Network -> Interfaces section of LuCI, I've cleared the IPv6 ULA-Prefix field under Global network options, left it empty, and saved and applied the change, because I don't need or want IPv6.

Yet despite all these changes, I still see processes running and ports listening associated with dhcp services.

How can I turn off all DHCP servers (IPv4 and IPv6) running on my LEDE router, keep them off, prevent them from starting at boot, and yet preserve the DHCP client process that needs to request an IP address from my ISP?

Any suggestions?

Thank you.

If you've got DNS elsewhere as well, disable dnsmasq and odhcpd either through LuCI, or in /etc/rc.d

(You'll want to check which of the odhcp*, if any, are required for DHCP client.)

Yes, procd is likely respawning them.
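
A minimal sketch of the stop-and-disable step, assuming the stock init scripts under /etc/init.d. The function name `disable_service` and the `INITD_DIR` override are illustrative only and are not part of LEDE; the override exists so the function can be tried somewhere other than the router.

```shell
#!/bin/sh
# Sketch only: stop a service now and keep it from starting at boot.
# INITD_DIR is a hypothetical override for trying this off-router;
# on the router it defaults to /etc/init.d.
disable_service() {
    svc="$1"
    script="${INITD_DIR:-/etc/init.d}/$svc"
    if [ ! -x "$script" ]; then
        echo "skip $svc: no init script at $script"
        return 0
    fi
    "$script" stop      # ask the init system to stop the running instance
    "$script" disable   # remove the boot-time symlink in /etc/rc.d
    echo "disabled $svc"
}

# Example (run on the router, and only for services you do not need):
#   disable_service odhcpd
#   disable_service dnsmasq
```

Stopping through the init script, rather than with kill, matters here because a killed process may simply be respawned. Only disable dnsmasq on a router whose clients get DNS from somewhere else.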

The process lines you quoted are DHCP clients, not servers.
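
One rough way to tell the two apart is by UDP port: DHCP servers listen on 67 (IPv4) and 547 (IPv6), while clients use 68 and 546. The sketch below filters `netstat -lpn` output on that basis. The function name `classify_dhcp_listeners` is made up for illustration, and the field positions assume the netstat layout shown earlier in this thread.

```shell
#!/bin/sh
# Sketch only: label DHCP-related UDP listeners as server or client by port.
# Reads netstat -lpn style lines on stdin.
classify_dhcp_listeners() {
    awk '
        $1 ~ /^udp/ {
            # Local address is field 4; the port follows the last colon.
            n = split($4, parts, ":")
            port = parts[n]
            role = ""
            if (port == "67" || port == "547") role = "SERVER"
            else if (port == "68" || port == "546") role = "client"
            # Last field is PID/program name when netstat is run with -p.
            if (role != "") print role, port, $NF
        }'
}

# Demo using the odhcp6c line quoted above.
printf '%s\n' 'udp        0      0 :::546                  :::*                                796/odhcp6c' |
    classify_dhcp_listeners
# prints: client 546 796/odhcp6c
```

On the router itself this would be run as `netstat -lpn | classify_dhcp_listeners`. No SERVER lines in the output suggests that nothing is answering DHCP requests from that box.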

Oh! I didn't realize that. But I don't think I'll need a DHCP client on the STA router. Or will I?

But that prompts other questions then, too.

  1. So how do I disable these DHCP clients on the STA router?

  2. And so if these are not DHCP servers, what then could be causing these major network disruptions (consistently being unable to ping multiple hosts over my LAN within seconds of turning on my STA router)?

Thanks for the feedback!

I don't have my own DNS server within my LAN, so I think I'll need dnsmasq (to pass along DNS queries by LAN workstations to the DNS servers outside my LAN) on my AP router, won't I? But I guess I should be able to disable dnsmasq on my STA router.

Thanks for the feedback!

Before thinking carefully about this just now (in light of your feedback), I didn't realize that a DHCP client process (pid 796 above) actually listens (like a server process) on multiple ports (apparently udp 546 and raw 58; see output of netstat above). I found this link which mentions RFC 2131 and BOOTP in explaining this.

But this still leaves me trying to figure out exactly what it is in my STA router that is causing my network disruptions. I'll try disabling dnsmasq on the STA router at my next opportunity, but if that's not the cause, any other thoughts on this?

Thanks again!

dnsmasq is the typical way to supply DNS to the LAN and the device itself and works reasonably well for that application. You can disable the DHCP portion of it, certainly through the config files, as well as through LuCI, I'm pretty sure. dnsmasq works well, as long as you don't need to do split-horizon DNS or other "interesting" setups.

Personally, I run kea (DHCP for v4 and v6 both) and unbound (DNS) elsewhere in the network. unbound is available as a package with some kind of LuCI integration, but it looks pretty big for a flash-constrained install (~700-750 kB, uncompressed, I'd guess).

If your network outages are being caused by "bad" DHCP, you should be able to see it on a "failing" client as an address change of that client (or the service).

I'd look into network loops first, then at MAC-address conflicts. wireshark is my tool of choice for that kind of debugging, running on a "desktop". Being able to change the filtering of already-captured data lets me quickly explore various theories. If need be, you can tunnel the output of tcpdump from another host over ssh to wireshark, but it sounds like it is a network-wide problem, so a "sniffer" on your wired network may be sufficient.
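
The tcpdump-over-ssh approach can be sketched roughly as below. The host `root@192.168.1.1` and interface `br-lan` are placeholders, the helper name `remote_capture_cmd` is made up, and tcpdump is assumed to be installed on the remote host. The helper only prints the pipeline so it can be reviewed before being run on a desktop that has wireshark.

```shell
#!/bin/sh
# Sketch only: build (but do not run) a remote-capture pipeline.
#   tcpdump  -U      write each packet as soon as it is captured
#            -s 0    capture whole packets
#            -w -    write the raw capture to stdout
#   'not port 22'    keep the ssh session itself out of the capture
#   wireshark -k -i -   start capturing immediately, reading from stdin
remote_capture_cmd() {
    host="$1"
    iface="$2"
    printf "ssh %s tcpdump -i %s -U -s 0 -w - 'not port 22' | wireshark -k -i -\n" \
        "$host" "$iface"
}

# Placeholder host and interface; substitute your own.
remote_capture_cmd root@192.168.1.1 br-lan
```

Excluding port 22 matters: otherwise every captured packet sent over ssh generates more packets to capture.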

Thanks again @jeff!

I'll try all these suggestions at my next opportunity (weekends) and post a follow-up with what I discover.

By any chance...

...do you have a network loop that involves the on-board switch itself?

I'm seeing something on my network that's suggesting to me that the on-board switch isn't actively using STP (though the AR8327 spec sheet says it is supported in silicon, it looks like it may need the host to actively manage it).

Good question. I'm honestly not sure. I've never connected two routers to the same LAN with the same SSID and the same netmask before. And I'm not sure how to test for a network loop. I'm thinking that you're referring to this, right?

I read this and the description it contains makes it seem like I could indeed have something like this going on.

Because you mentioned wireshark for troubleshooting problems like this, I also watched this video and found it very insightful. What's not clear from the video, however, is the author's network topology and which host or hosts were used to collect the trace files that he's analyzing with wireshark. Do you know: for the purpose of detecting a network loop (layer 2 or layer 3), can I collect trace files with pretty much any host on the LAN? Just turn on my STA router, collect a bunch of packets on a LAN-connected host, save the trace file, then analyze it in wireshark? Or will the packet-collecting host need to occupy a certain place in the network topology (e.g. perhaps the STA router itself should collect the packets?) for me to be able to detect a loop with wireshark the way the video author does?

I'll troubleshoot this issue this weekend when I'll have more time.

Thanks again @jeff.

You'll likely be able to see "problems" on or near any of the impacted hosts.


Rough sketch showing the loop -- two OpenWRT routers, similarly configured, wired connection and L2 tunneled connection (over wireless) between the two bridges. Both bridges have STP enabled in OpenWRT config.

The log indicates that the Ethernet interfaces see their own IPv4 address coming back to them with the wired link in place, even after a long time to let STP settle. The wireless "goes haywire" in this situation, flooding everything, that then apparently degrades the rest of the wired network (also bridged to Bridge A on Router A).

Pull the cable between Switch A and Switch B, and things "return to normal".

I haven't tested it with one of my Cisco switches in between the two -- this came about during testing a new build and I grabbed a port on Router A to get connectivity to the newly flashed Router B as it was convenient at the moment.

It would be interesting to see if two physical cables produced similar results.

           Router A                     
           --------                              
network -- bridge A  (STP enabled)                          
              |    \                      
            eth0    gretap (L2 tunnel)    
              |        :                  
           switch A    :                  
              |        :                  
 (wired link) |        : (wireless link)     
              |        :  
           switch B    :  
              |        :  
             eth0   gretap
              |    /
           bridge B (STP enabled)
           --------
           Router B

Hi Risole,

Did you ever figure this out? I am having the same problem.

Thanks,
Gili

I think I figured it out. When you have 2+ routers and each has 2+ wireless bands (e.g. one is 2.4 GHz, the other is 5 GHz), then the LAN interface -> Physical Settings on all routers must have "Enable STP" checked. I am theorizing that some sort of network loop occurs because there are 2+ paths into each router. If there were a single wireless band, this probably wouldn't be a problem.

Again, this is a guess but ever since I enabled STP on the routers the problem went away.
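
For anyone who prefers the command line, here is a rough equivalent of ticking "Enable STP", assuming a LEDE-era config where the bridge is defined on the `lan` interface section. Newer OpenWrt releases move bridge options to a separate device section, so the option path may differ there. The function name `enable_lan_stp` is just for illustration.

```shell
#!/bin/sh
# Sketch only: turn on STP for the LAN bridge via uci.
# Assumes the bridge options live in network.lan (LEDE-era layout).
enable_lan_stp() {
    if ! command -v uci >/dev/null 2>&1; then
        echo "uci not found; run this on the router"
        return 1
    fi
    uci set network.lan.stp=1   # same setting as "Enable STP" in LuCI
    uci commit network          # write the change to /etc/config/network
}

# On the router:
#   enable_lan_stp && /etc/init.d/network reload
```

Check the result afterwards with `uci get network.lan.stp`, and repeat on every router that takes part in the WDS link.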

Gili