Bridged wifi ap, DHCP Offers only reaching clients after 5 min period

overmyhead · January 11, 2019, 5:16pm

My environment:

kit, liv and bed are Espressobin wireless access points with Atheros QCA9888 chips, each of which has its wan port connected to the upstream switch infrastructure. None of the APs (kit, liv or bed) has dnsmasq or a firewall enabled; on each Espressobin the wifi interface is simply bridged with the uplink ethernet port.

Router 10.0.1.1 (Running DHCP) ---> TP-LINK Switch (SG105E) ---> NETGEARMS510TX |
                                                                                |----> kit (Espressobin Openwrt 18.06.1)
                                                                                |----> liv (Espressobin Openwrt 18.06.1)
                                                                                |----> bed (Espressobin Openwrt 18.06.1)

Problem:
Wireless clients connecting to any of the access points obtain a DHCP lease almost immediately if they have not connected to any of the APs recently. If I roam, that is, connect to the liv wifi after having been connected to the kit wifi network, DHCP fails. I can see the DHCPDISCOVER travel from the espressobin to the router, and the router sending a DHCPOFFER, which travels via unicast through the tp-link and the netgear to the appropriate port on the netgear switch, but then it disappears: it never appears on the ap uplink ethernet interface. I am positive that the unicast DHCPOFFER packet is being sent to the appropriate switch port; I have verified it by running wireshark on many switch ports simultaneously. Running tcpdump on each espressobin, I do not see these DHCPOFFER packets. However, if I wait some time, say 5 minutes, the offer eventually appears and the wifi client gets an address. The packet also magically begins to appear in tcpdump on the espressobin that is acting as the ap for the wifi client at that time.

In short: associate with one ap, fine initially; move to another ap, DHCPOFFERs are dropped by the new ap for 5 minutes, then the client is inexplicably able to obtain an ip and internet access.

I have also noticed that some arp packets are dropped by the espressobin during this process in the same fashion.
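
For anyone wanting to repeat the capture on the AP side, here is a minimal sketch. The function name watch_dhcp_arp is hypothetical, it assumes tcpdump is installed on the AP, and the default interface name wan is an assumption about the uplink port, so pass your own interface name if it differs.

```shell
#!/bin/sh
# Hypothetical helper: print DHCP and ARP frames seen on one interface.
# -e prints the link-level (MAC) header, -n disables name resolution.
# The default interface "wan" is an assumption; pass your uplink name.
watch_dhcp_arp() {
    iface="${1:-wan}"
    tcpdump -e -n -i "$iface" 'arp or (udp and (port 67 or port 68))'
}
```

Running watch_dhcp_arp wan on the AP while a client roams should show whether the DHCPOFFER ever reaches the uplink port; the MAC addresses printed by -e make it easier to match frames against the switch-side capture.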

Solutions tried:
Turning on proxy arp on the ap ethernet interface -> I realize now this will have no effect because there is only a single subnet.
Completely working around the problem: turning each ap into a nat firewall with its own private subnet, where the ethernet port performs ip masquerading. This works fine, but it is not at all what I want.

Does anyone have any idea what could be causing this strange behavior?

Thanks

trendy · January 11, 2019, 6:53pm

It most likely has to do with the mac address table in MS510TX switch.
5 minutes is a typical arp timeout time, so I suspect that upon roaming to another AP, the mac of the client is still forwarded to the port of the old AP.
Have you done any configuration on the switch, considering that it is managed?

overmyhead · January 12, 2019, 1:21am

It could be that the arp mapping in the MS510TX is somehow forwarding it to the wrong place, but the evidence indicates otherwise.

I have experimented with completely bypassing the MS510TX and the problem still occurs. That is, if I plug the APs directly into the SG105E tp-link switch and remove the connection between the SG105E and the MS510TX, the problem still occurs.
I have used wireshark to monitor the MS510TX traffic and I can see that at least the DHCPOFFERs are leaving the appropriate port for the AP that the wifi clients are connected to at the time when they send the DHCPDISCOVER request. If the MS510TX had a different port associated with the client mac address, wouldn't it send the DHCPOFFER to the port of the old ap? It doesn't.

Would it be possible that the espressobins themselves do not think that the mac address of the wifi client currently connected to them is actually in the wifi interface's list of mac addresses? I have monitored the arp tables on the espressobin aps and they appear to be updating appropriately, e.g. when the client roams to a new ap, that ap's arp table entry updates so that it thinks the client's mac address is on the wifi interface and not a neighboring mac address accessible through its wired connection.

Though the arp tables on the aps seem to be updating appropriately, would turning on hairpin mode, that is, allowing the aps to forward arp requests back out of the same interface that they arrived on, permit me to detect that the aps thought the wifi client was still accessible through the wired interface instead of the wireless interface?

Thanks for the response.

trendy · January 12, 2019, 10:44am

Explain something here:

Does that mean that you see the DHCPOFFER leaving the correct port of the MS510TX towards the Espressobin, but you don't see it reaching the Espressobin?

overmyhead · January 12, 2019, 11:48am

Yes, that's right. I see them leave the MS510TX on the appropriate port towards the espressobin ap (say liv), but on the liv ap, where you would expect to see the DHCPOFFER, there is nothing, except after 5 minutes when they inexplicably begin to appear.

trendy · January 12, 2019, 5:23pm

So my understanding is that the espressobin is dropping the packet at some point. Anything in the logs (with debugging enabled)?

overmyhead · January 12, 2019, 7:35pm

Nothing in the logs. I think hostapd is already in debug mode, and I don't see anything in the logs really except for the client successfully associating. To get more information from hostapd or ath10k I would have to download the whole buildroot and recompile with debugging enabled, which I am a bit reluctant to do. I wouldn't be surprised if it were something wrong with either ath10k or hostapd, because the problem only occurs with wireless clients. Taking a wired client and roaming on wired ethernet works fine, for example by unplugging a laptop connected to kit and walking over to liv; it is only the wireless clients that have this issue.

trendy · January 12, 2019, 7:42pm

This happens because when you unplug the cable and the port goes down, all the arp information concerning this port is erased. But in wifi no port goes down, so there can be stale entries. However this is not your case, since it seems that the espressobin is discarding the packet.
The debug I was referring to is the option conloglevel in config/system
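
A minimal sketch of raising that option from the command line, assuming the stock uci tool and a single system section in /etc/config/system. The function name set_conloglevel is hypothetical, and a reboot or service reload may be needed before the new level takes effect.

```shell
#!/bin/sh
# Hypothetical helper: set the console log level in /etc/config/system.
# Level 8 is the most verbose kernel console level.
# Assumes the first (usually only) "system" section is the one to change.
set_conloglevel() {
    level="${1:-8}"
    uci set "system.@system[0].conloglevel=$level"
    uci commit system
}
```

Calling set_conloglevel 8 writes the option; uci show system can be used afterwards to confirm the stored value.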

anon50098793 · January 13, 2019, 7:36am

What wireless security are you running?

Can you alter your topology ( remove dual path to switch )?
Can you try a release just prior to "roaming" ( testing purposes )?
Can you try sending a gratuitous arp when stale from client?
Can you try DHCPrelay on AP ( routed or just point to the dhcp address)?
Can you put mac changer ifup scripts on the clients?
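
For the gratuitous arp suggestion, a minimal sketch for a Linux client, assuming an arping that supports the -U (unsolicited ARP) flag, as the iputils and busybox versions do. The function name send_gratuitous_arp is hypothetical, and the interface and address in the usage line are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: announce our own IP with unsolicited (gratuitous) ARP.
# $1 = client wifi interface, $2 = the client's own IPv4 address.
# Sends 3 announcements so switches and bridges can relearn the MAC's port.
send_gratuitous_arp() {
    iface="$1"
    addr="$2"
    arping -U -c 3 -I "$iface" "$addr"
}
```

Example (placeholder values, run as root right after roaming): send_gratuitous_arp wlan0 10.0.1.50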

overmyhead · January 13, 2019, 10:06am

According to

May/has no effect??? I turned it up anyway to 8 and installed syslog-ng

On kit I see the following message in the logs, which I had actually seen earlier with the previous logging scheme but did not really recognize a correlation.
Jan 13 09:55:14 kit kernel: [ 910.125116] br-wan: port 3(wan) received tcn bpdu
Jan 13 09:55:14 kit kernel: [ 910.130115] br-wan: topology change detected, propagating

br-wan is the bridge which includes the uplink to the MS510Tx as well as the wifi interface on it.

The message appears on kit, and once it appears I am able to obtain a dhcp address on kit if I am roaming to kit, or on liv if I happen to be roaming from kit to liv. The strange thing is that I only see the message on kit, even though I believe both of them are configured essentially the same way.

Thanks for the help so far, trendy. What are the tcn bpdu messages? How do I configure them to update more frequently?

trendy · January 13, 2019, 1:38pm

Are you running STP of any kind on any of the devices?

overmyhead · January 13, 2019, 2:36pm

yes, stp is on for liv, bed, the MS510TX and I just turned it on for the SG105E.

MS510TX is running RSTP, standard STP it appears.

trendy · January 13, 2019, 2:58pm

Do you have double links or loops among the switches? The topology you have presented earlier doesn't need any STP.

A tcn bpdu is created upon topology change to notify the root bridge.
So my guess is that the roaming is creating some sort of loop among the espressobins and the switch, which causes the block of the port until it autorecovers from the error disabled state.
Try to disable STP for a start to verify this is the culprit.
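
A minimal sketch of switching STP off at runtime on one AP for such a test, assuming brctl is available and that the bridge is called br-wan as in the log above. The function name stp_off is hypothetical. The change does not survive a reboot or network restart; on OpenWrt the persistent setting is normally the stp option of the bridge interface in /etc/config/network.

```shell
#!/bin/sh
# Hypothetical helper: turn spanning tree off on a Linux bridge at runtime.
# The default bridge name "br-wan" is taken from the kernel log in this
# thread; pass another name if your bridge is called differently.
stp_off() {
    br="${1:-br-wan}"
    brctl stp "$br" off
}
```

After running stp_off br-wan, reading /sys/class/net/br-wan/bridge/stp_state should return 0 if STP is really disabled on that bridge.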

overmyhead · January 13, 2019, 4:20pm

what is a double link?

Yes, the network is more complicated than the diagram. First, there are three vlans: a management vlan 1, a lan vlan 10, and an extra vlan on the sg105e to connect to a cable modem on that switch. To complicate things further, there is an apple airport.

also, some machines have more than one connection to the network, maybe that is what you mean by a double link.

I will try to simplify everything tonight and post back.

trendy · January 13, 2019, 4:53pm

By double link I meant if two switches have two cables connecting them, or backup links. That, and other situations, could create a bridge loop, which can render the network useless. We use stp to prevent that.
If there are no loops, backup links, double links or anything like that, then enabling STP won't offer anything and will introduce some delays when a port comes up. There is a transition time to verify that there won't be a loop from the new link that came up.
The vlans don't matter so much here; just verify that the physical topology doesn't have any loops.

overmyhead · January 28, 2019, 8:15pm

Well, I wasn't exactly able to determine the root cause of this problem, but I was able to fix it.
By decreasing the bridge ageing time in /sys/class/net/br-wan/bridge/ageing_time from 30000 to 1000 (the value is in hundredths of a second, so from 300 s to 10 s), roaming is now much better: not perfect, but definitely tolerable.
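
A minimal sketch of that change as a reusable function. The name set_bridge_ageing and the SYSFS_NET override are hypothetical; the override exists only so the function can be tried against a scratch directory. The setting is lost on reboot or when the bridge is recreated.

```shell
#!/bin/sh
# Hypothetical helper: set a bridge's forwarding-table ageing time.
# $1 = bridge name, $2 = ageing time in whole seconds.
# The sysfs file takes hundredths of a second, so 10 s is written as 1000.
set_bridge_ageing() {
    br="$1"
    secs="$2"
    f="${SYSFS_NET:-/sys/class/net}/$br/bridge/ageing_time"
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    echo $(( $secs * 100 )) > "$f"
}
```

For example, set_bridge_ageing br-wan 10 corresponds to the 30000 to 1000 change described above.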

Maybe at some point I will fully analyze why I was observing the packet drops that I was seeing. For now I am just happy that the clients are able to roam relatively quickly.

jaryn · April 13, 2020, 6:08pm

If you install the ip-bridge package, the bridge fdb command becomes available. It dumps the forwarding database. You will see that once your device roams between the APs on bridge ports, its MAC will have two records (bridge fdb | grep MAC). Once you delete the now invalid entry with bridge fdb del ..., the device starts working.
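
A minimal sketch of the lookup step, assuming the usual bridge fdb show line layout of "MAC dev PORT ...". The function name fdb_ports_for_mac is hypothetical and the MAC address in the usage line is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: read `bridge fdb show` output on stdin and print the
# port (the word after "dev") of every entry for one MAC address.
# The MAC comparison is case-insensitive.
fdb_ports_for_mac() {
    awk -v mac="$1" 'tolower($1) == tolower(mac) {
        for (i = 2; i < NF; i++) if ($i == "dev") print $(i + 1)
    }'
}
```

Example: bridge fdb show | fdb_ports_for_mac aa:bb:cc:dd:ee:ff. If the client is associated over wifi but a wired port is also listed, that wired entry is the candidate to remove, along the lines of bridge fdb del aa:bb:cc:dd:ee:ff dev wan master (the port name wan is an assumption here).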

Brain2000 · March 5, 2023, 3:09am

I'm also having this issue. I can't believe this topic seems to have been dropped.

psherman · March 5, 2023, 3:20am

@Brain2000 - given that this thread is about 3 years old, it may or may not apply to your situation. The best bet is to start your own new thread and then provide details about your configuration (without those, we cannot help).

Brain2000 · March 5, 2023, 4:04pm

Thank you for posting this, as I spent hours trying to figure out why an external DHCP server could not get packets back to a wifi client. I found the Cisco switch would send DHCP Offer/ACK packets, but they were not passing through to the wifi client.

I googled that issue specifically, along with the "5 minutes" and came across the bridge ageing information that solved your issue.
Lowering the bridge ageing time solved my issue as well.

This is only an issue with an external DHCP server when roaming. I have these same units at home using the internal DHCP server, and roaming is not an issue.

This is also only an issue with newer versions of OpenWRT. I have Chaos Calmer 15.05, and it does not exhibit this problem.

In case this helps anyone else, I also have 8 vlans (8-15) and do 802.1q tagging, so I ended up running the following script on a 1 minute cron timer, which changes the LAN bridge as well as all 8 wifi bridges, to an ageing time of 10 seconds with an STP forward delay of 0.01 (yes, STP is disabled, but I'm paranoid that might also be a problem):

#!/bin/sh
# For the LAN bridge and each VLAN bridge (8-15) that exists, set a minimal
# STP forward delay and a 10 second forwarding-table ageing time.
for br in br-lan br-vlan8 br-vlan9 br-vlan10 br-vlan11 br-vlan12 br-vlan13 br-vlan14 br-vlan15; do
    if [ -d "/sys/class/net/$br" ]; then
        brctl setfd "$br" 0.01
        brctl setageing "$br" 10
    fi
done