Connections to services on LAN time out, are delayed, or freeze; a desperate plea. : )

Hiya:
I've spent days on this now and I am completely stuck. Any help would be very much appreciated!

Thanks

NOTE:

  • The OpenWRT devices are only set up as Access Points.
  • No VLAN is being used throughout the topology.
  • The behavior is only exhibited by devices on the 192 subnet.
  • Access to the internets is not affected.
  • The behavior is identical regardless of whether I try to connect by IP or hostname.

NETWORK TOPOLOGY:
WAN goes into NIC 01 on my OPNsense box. OPNsense provides DHCP/DNS services and Firewall to all subnets. Subnet 192 is on NIC 02. Subnet 172 is on NIC 03. 172 is blocked from access to any device on 192. 192 is allowed access to any device on 172.

NIC 02 (192) is connected to an unmanaged switch. Several hosts are connected to the switch, one OpenWRT device, and a WDS Bridge. The OpenWRT device provides WiFi access for location A for subnet 192.

NIC 03 (172) is connected to an OpenWRT device. It provides WiFi and Wired access for location A for subnet 172.

At the remote location B, the WDS Bridge (192) is connected to an OpenWRT device. It provides WiFi and Wired access for that location.

ISSUE:
Accessing SSH and SMB services from location A (192) on subnets 172 and 192 works flawlessly except when trying to access the OpenWRT device via SSH at the remote location B (192).
When I try to connect, the connection frequently times out immediately. Eventually, I am able to connect after several tries. However, once I am connected, the cursor may stop blinking at any point and the terminal stops accepting input. This may last 30 seconds or longer, or the connection may freeze entirely. This behavior is consistent and limited to only these devices.

Accessing SSH and VNC services from remote location B (192) at location A on subnet 172 works flawlessly.

Accessing SSH services from remote location B (192) at location A on subnet 192 exhibits the same behavior as described in the first paragraph but affects all devices on that subnet.
The behavior for SMB is similar. When I initially try to connect to a share I get timeout errors. Several tries later I am able to connect, but navigating folders is slow and sometimes the file manager freezes and I am unable to access anything. This may last anywhere from several seconds to several minutes and is consistent.

And now for the fun part:

I can consistently resolve all of the above behavior by starting a continuous ping prior to attempting to connect to an SSH or SMB service.

So for instance I start a continuous ping to my SMB server. I can then connect to SSH or SMB without any problems. It’s fast, no delays, no hangups. This is consistent behavior.
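To put numbers on the difference between "with ping running" and "without", repeated TCP connects to the service port could be timed. This is only a rough sketch: the host and port are passed on the command line, and the attempt count and timeout are arbitrary assumptions, not values from this thread.

```python
import socket
import sys
import time


def probe_tcp(host, port, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the attempt failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers timeouts, refused connections, unreachable hosts
        return None


def summarize(results):
    """Count failed attempts and report the slowest successful connect."""
    ok = [r for r in results if r is not None]
    return {
        "attempts": len(results),
        "failures": len(results) - len(ok),
        "slowest": max(ok) if ok else None,
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical file name): python probe.py <host> <port>
    # Port 22 is SSH and 445 is SMB.
    host, port = sys.argv[1], int(sys.argv[2])
    results = []
    for _ in range(20):  # 20 attempts, one per second (arbitrary choice)
        results.append(probe_tcp(host, port))
        time.sleep(1)
    print(summarize(results))
```

Running it once while the continuous ping is active and once while it is not would show whether the failure count really drops to zero with the ping running.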

I would start by checking the ARP tables on each device, ensure that MAC addresses match to IP addresses, and double-check that you do not have two devices with the same IP or MAC address.
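A small script can make that comparison less tedious. The sketch below assumes Linux-style `arp -a` output (lines such as `? (192.0.2.20) at aa:bb:cc:dd:ee:01 [ether] on eno1`); other systems print a different format and the pattern would need adjusting. The function names are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed line format: "? (192.0.2.20) at aa:bb:cc:dd:ee:01 [ether] on eno1"
ARP_LINE = re.compile(r"\((?P<ip>[\d.]+)\) at (?P<mac>[0-9a-fA-F:]{17}|<incomplete>)")


def parse_arp(text):
    """Return a list of (ip, mac) pairs; mac is None for incomplete entries."""
    entries = []
    for line in text.splitlines():
        m = ARP_LINE.search(line)
        if m:
            mac = m.group("mac").lower()
            entries.append((m.group("ip"), None if mac == "<incomplete>" else mac))
    return entries


def find_conflicts(entries):
    """Return MACs that answer for several IPs, and IPs with no resolved MAC."""
    ips_by_mac = defaultdict(set)
    unresolved = []
    for ip, mac in entries:
        if mac is None:
            unresolved.append(ip)
        else:
            ips_by_mac[mac].add(ip)
    shared = {mac: sorted(ips) for mac, ips in ips_by_mac.items() if len(ips) > 1}
    return shared, unresolved


def mismatches(entries_a, entries_b):
    """Return IPs that two devices resolve to different MAC addresses."""
    a = {ip: mac for ip, mac in entries_a if mac}
    b = {ip: mac for ip, mac in entries_b if mac}
    return {ip: (a[ip], b[ip]) for ip in a.keys() & b.keys() if a[ip] != b[ip]}
```

Feed `parse_arp` the saved output from each device, then compare pairs with `mismatches`. One MAC answering for several IPs is not always a fault (a router or a proxy-ARP device can do that legitimately), so treat the result as a hint rather than a verdict.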


Thanks so much for your reply!

All of my DHCP mappings are static and use static ARP as well on OPNsense. So there's no chance of mismatches or duplicates.

I did some more testing from a device A at location B on 192 that had problems accessing SMB on two other devices B and C at location A on 192.

When I ran arp -a on device A, it showed a complete ARP entry for device B and showed IP ... on eno1 for device C.

Today I had no problems accessing SMB on device B. But I had the same problems accessing device C.

So I created a static ARP entry on device A and on my dummy AP at location B. I still got the same behavior trying to access device C although arp -a now showed a complete arp entry for device C (IP address and MAC address ... PERM on eno1).

So creating a static entry made no difference.

What I previously forgot to mention is that when I use the continuous ping workaround, the ping itself frequently times out at first, for anywhere from a few seconds to a minute, or sometimes not at all. Once it eventually succeeds and I let it keep running, it resolves my issue for that particular device. So I guess it's an inconsistent workaround, and I don't know what to infer from that behavior.

What baffles me completely is that I have ZERO issues accessing devices at location A on 172 from location B on 192. That makes me believe the bridge or the dummy AP at location B is not the issue.

Could it be the switch at location A? I don't think so since I have ZERO issues accessing SMB devices at location A on 192 from location A on 192.

I am scratching my head.

If the ARP tables look right, check that the STP root elected is what you expect/intend.

Should I enable STP at all?

The switch is unmanaged. The bridge allows me to enable STP, and so do my APs. The APs have the checkmark set for enable STP. I also have STP enabled on the bridge. Maybe I shouldn't have enabled it since I don't know enough about STP. Do you think that's the issue, and if so, which device would be considered the STP root device?

It’s the one with lowest priority, as I recall. In a multi-switch environment it may not be your “best” one. As I vaguely recall, in default settings, the lowest MAC address wins (and it wasn’t Cisco).

MAC address is usually the tie-breaker. You also have Priority and System ID extension (taken from the VLAN). However, as there is not much to configure other than enabling it, you can expect the root to be determined by the MAC.
If it was indeed an STP issue, there would be some delay only when the devices booted, until the whole STP process finished. After that, and if there is no topology change, there should not be any delay.
And the topology doesn't sound like it has any loops, so STP doesn't seem to be necessary.
Another thing I can think of is interference in the WDS link or some power saving regime.
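The root election rule mentioned above (lowest priority wins, lowest MAC address breaks ties) can be sketched in a few lines. The device names, priorities and MAC addresses in the usage example are invented for illustration and are not taken from this network.

```python
def bridge_id(priority, mac):
    """Build a comparable bridge ID: priority first, then the MAC as an integer."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the name of the bridge with the numerically lowest bridge ID.

    `bridges` maps a device name to a (priority, mac) tuple.
    """
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# Hypothetical devices: two APs on the common default priority 32768, one bridge on 4096.
example = {
    "ap1": (32768, "00:11:22:33:44:55"),
    "ap2": (32768, "00:11:22:33:44:10"),
    "bridge": (4096, "88:dc:96:00:00:01"),
}
print(elect_root(example))
```

With equal priorities the lower MAC decides, which is why the root can end up on a device nobody picked on purpose.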

Yes, priority is set by multiples of 4096 and the lowest is root. So, just to be clear I have one unmanaged switch, 3 OpenWRT APs, and the bridge (Engenius).
STP is set to enabled on all APs and the bridge is set up as follows:

Hello Time 2 seconds (1-10)
Max Age 6 seconds (6-40)
Forward Delay 4 seconds (4-30)
Priority 8192

Hello Time 2 seconds (1-10)
Max Age 6 seconds (6-40)
Forward Delay 4 seconds (4-30)
Priority 4096

So the settings are identical for each; only the priority differs. I've reversed the priorities to see if it makes a difference, but it doesn't.
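As a sanity check on timers like these, IEEE 802.1D recommends two relationships between them: 2 × (Forward Delay − 1) ≥ Max Age, and Max Age ≥ 2 × (Hello Time + 1). A minimal sketch of that check follows; the function name is made up, and whether a given firmware enforces these relationships is not something this thread confirms.

```python
def check_stp_timers(hello, max_age, forward_delay):
    """Evaluate the two IEEE 802.1D timer relationships (all values in seconds)."""
    return {
        "2*(forward_delay-1) >= max_age": 2 * (forward_delay - 1) >= max_age,
        "max_age >= 2*(hello+1)": max_age >= 2 * (hello + 1),
    }


# Values quoted in the post: Hello Time 2, Max Age 6, Forward Delay 4.
print(check_stp_timers(hello=2, max_age=6, forward_delay=4))
```

The quoted values satisfy both relationships, but only just: each inequality holds with equality. For comparison, the 802.1D defaults are Hello Time 2, Max Age 20 and Forward Delay 15.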

The WDS has no power saver settings. It is scheduled to reboot twice a week at 2 AM. What interference were you thinking of?

EDIT: This is an AC 5 GHz bridge. The distance between the stations is maybe 30 feet. Throughput is excellent.

If it is only 9 meters away, I don't think there will be any interference from other wireless networks.
But for that distance I would try with a cable to rule out this factor.

Sorry, my bad. I don't know what I was thinking; it's more like 40 meters. But again, this is a 5 GHz bridge, so what could possibly interfere? The only networks I see in my neighborhood are some APs at about the same signal strength from either location A or B. BTW, this is in a rural area. Not much around here. Even cell reception is weak. Thanks though. It's a real head scratcher.

Are you by any chance operating in a DFS channel?

Ch 36, 5.180 GHz, it's a default selection option on my bridge. Sigh
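For reference, a quick way to check a 5 GHz channel number against the DFS ranges. The channel list below assumes FCC-style rules (channels 52 to 64 and 100 to 144); other regulatory domains differ, so treat it as a sketch rather than a definitive table.

```python
# Assumed FCC-style DFS channels; other regulatory domains use different sets.
DFS_CHANNELS = set(range(52, 65, 4)) | set(range(100, 145, 4))


def center_mhz(channel):
    """Center frequency of a 5 GHz channel: 5000 MHz plus 5 MHz per channel number."""
    return 5000 + 5 * channel


def is_dfs(channel):
    """True if the channel is in the assumed DFS set."""
    return channel in DFS_CHANNELS


print(center_mhz(36), is_dfs(36))
```

Channel 36 comes out at 5180 MHz, matching the 5.180 GHz quoted above, and is outside the assumed DFS set. A wide channel occupies several neighboring 20 MHz channels, so the check applies to each channel the link actually uses.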

I do think now that it has to be the bridge, because accessing 192 devices at location A from location A always works, and the same goes for location B. Issues only occur randomly for some devices located at the other end of the bridge. So it's consistently inconsistent. But again, I have no problems accessing 172 at location A from location B on 192. Or at least I haven't so far.

I've hooked up my laptop to either end of the bridge and I am getting the same behavior, so the bridge is somehow mucking up the LAN traffic.

Any more thoughts?

Thanks for everybody's feedback

Give us some more details on the WDS bridges.
Vendor, Model, OS version etc

Engenius
ENS500-AC
Firmware Version v3.6.5_c1.9.30

Last evening I logged into the web GUI for the radio on either end of the bridge and ran a ping to the first 192 device on either end. There were no lost packets and the round-trip time was between 2 and 5 ms.
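Ping reply lines carry both a TTL (a hop counter) and a round-trip time in milliseconds, and a small parser keeps the two apart while also counting gaps. The sketch below assumes reply lines in the common `icmp_seq=1 ttl=64 time=2.31 ms` form (or `seq=` as printed by some embedded ping builds); the output format of the radio's GUI is not known here.

```python
import re

# Assumed reply format: "64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=2.31 ms"
REPLY = re.compile(r"(?:icmp_)?seq=(?P<seq>\d+) ttl=(?P<ttl>\d+) time=(?P<ms>[\d.]+) ms")


def parse_replies(text):
    """Return (seq, ttl, rtt_ms) tuples for every reply line found."""
    return [
        (int(m["seq"]), int(m["ttl"]), float(m["ms"]))
        for m in REPLY.finditer(text)
    ]


def summarize_ping(replies, sent):
    """Report loss percentage and min/max round-trip time in milliseconds."""
    rtts = [rtt for _, _, rtt in replies]
    return {
        "loss_pct": 100.0 * (sent - len(replies)) / sent,
        "rtt_min_ms": min(rtts) if rtts else None,
        "rtt_max_ms": max(rtts) if rtts else None,
    }
```

Saving the output of a long ping run across the bridge and passing it through `parse_replies` would show whether the timeouts cluster at the start of a run, as described for the workaround earlier in the thread.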
One more thing: when I log into the GUI of the radio farthest away on either side, I experience long delays logging in but never complete timeouts. Once logged in, navigating the menu and/or applying changes also always involves long delays. Long delays are between 15 and 40 seconds. On the other hand, when navigating menus, etc. on the closest device, I don't experience any such delays.
This weekend I'll try to get a ticket in with the vendor as well, but my experience with them as a personal customer hasn't been stellar.

Thanks everybody

There seems to be a newer firmware, 3.6.5.3:
https://www.engeniustech.com/wp_firmware/ens500extac-all-v3.6.5.3_c1.9.30.bin
But you should definitely contact Engenius.

Thanks, I will apply the latest as soon as possible.