OpenWrt Forum Archive

Topic: wifi roaming problem

The content of this topic has been archived on 29 Apr 2018. There are no obvious gaps in this topic, but there may still be some posts missing at the end.

Hi,

(I'm not sure if it's called roaming, but anyway)

I have 2 APs in my house, one on each floor. Both are connected to the same wired network and have the same SSID, channel and password.
All the devices connect to them and switch easily when it's needed, so everything's fine until now.

The only problem I have is that when a device switches from one AP to another, it doesn't disconnect from the old AP and remains listed as associated until it reaches a timeout.
In that period, it cannot communicate with other devices connected to the original AP.

Is there any way to fix that?

Thanks.

Yes, roaming is the correct term.  I stand to be corrected, but IIRC it's not entirely abnormal for the client's association to remain in the way that you describe.  I've not reproduced it but wouldn't expect it to prevent communication between clients.  In theory the AP is bridged to the internal switch, which should have an independent L2 forwarding table (anyone else care to comment on that?)

Which APs are you using, and what software version(s) are you running?

Also, let's walk through a scenario:
- Stations A and B are associated with AP1
- Station C is associated with AP2
- Station D has a wired connection to AP1
- Station A roams to AP2

For the scenario above, can you please test and verify whether Station A can ping/otherwise connect to:
- AP1
- AP2
- Station B
- Station C
- Station D
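The checklist above could be run from Station A with a small helper like the following. This is only a sketch: the function names are made up for illustration, and the labels and addresses fed to it are placeholders, not values from this thread.

```sh
#!/bin/sh
# Hypothetical helper: run on Station A right after it roams to AP2.

# probe ADDR: succeeds if ADDR answers at least one of three pings.
probe() {
    ping -c 3 -W 2 "$1" >/dev/null 2>&1 </dev/null
}

# check_all: reads "label address" pairs from stdin, prints one result per line.
check_all() {
    while read -r label addr; do
        [ -n "$addr" ] || continue
        if probe "$addr"; then
            echo "$label ($addr): reachable"
        else
            echo "$label ($addr): NOT reachable"
        fi
    done
}
```

For example: printf 'AP1 192.168.0.1\nStationB 192.168.0.20\n' | check_all (with your own addresses substituted).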

Finally, you're saying that once that stale association expires, things work again?  Does the association time out eventually, or do you take action to make it expire?

I have an Intel Atom box with AR280 (with Barrier Breaker) and an Archer C7 (with snapshot ~a week old)
Both have the wifi adapter bridged with ethernet.

Regarding your scenario, when Station A is roaming from AP1 to AP2, before AP1 disassociates it due to timeout,
it can connect to: AP1, AP2, Station C and Station D.

Right now I'm thinking that the Atom router (AP1) might cause the problem.
I will try some things and I will get back with the results, if I find any.

The timeout value is controlled by ap_max_inactivity, as far as I know.
But lowering this value too low might affect the battery life of mobile devices.
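For anyone who wants to experiment with it, here is a rough sketch of changing the limit through UCI. It assumes the build maps the wifi-iface option max_inactivity onto hostapd's ap_max_inactivity (please check your own version); the function name, the index 0 and the 60-second value are placeholders. hostapd's own default is 300 seconds.

```sh
#!/bin/sh
# Sketch only: assumes the UCI option "max_inactivity" exists on this build.

# set_inactivity IDX SECONDS: set the station inactivity limit on one wifi-iface.
set_inactivity() {
    uci set "wireless.@wifi-iface[$1].max_inactivity=$2"
    uci commit wireless
}
```

Something like "set_inactivity 0 60 && wifi" would then apply it and reload the wireless configuration.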

Thanks.

(Last edited by load.runner on 20 Apr 2015, 17:46)

load.runner wrote:

Right now I'm thinking that the Atom router (AP1) might cause the problem.

So this is incorrect.
It happens also when roaming from AP2 to AP1.

What seems interesting to me is that, while the device can't communicate with other devices connected to the AP, it can communicate with the AP itself.
So I assume that the traffic between the devices connected to the same AP doesn't go through the bridge.

LE:
After some research, I found this info:

# When IEEE 802.11 is used in managed mode, packets are usually send through
# the AP even if they are from a wireless station to another wireless station.
# This functionality requires that the AP has a bridge functionality that sends
# frames back to the same interface if their destination is another associated
# station. In addition, broadcast/multicast frames from wireless stations will
# be sent both to the host system net stack (e.g., to eventually wired network)
# and back to the wireless interface.
#
# The internal bridge is implemented within the wireless kernel module and it
# bypasses kernel filtering (netfilter/iptables/ebtables).

(Last edited by load.runner on 21 Apr 2015, 00:32)

Interesting.  In addition to knowing whether broadcasts are actually working (given the theory you seem to imply), it would be interesting to know whether traffic is flowing in one direction but not the other.  Are you able to "sniff" on both stations to check whether broadcast frames are making it across?

One approach I can think of:
- start up your favourite packet sniffing tool
- flush the ARP cache on both Station A and B after the roam
- initiate a ping from both stations
- watch for ARP and ICMP frames
- verify whether broadcast frames are making it across
- verify whether directional ICMP frames are making it across
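On Linux stations the steps above might look roughly like this. It is a sketch that assumes iproute2 and tcpdump are installed; the function names are invented for illustration, and the interface name and peer address are placeholders.

```sh
#!/bin/sh
# Sketch for Linux stations; interface name and peer address are placeholders.

# flush_arp IFACE: drop cached neighbour entries so the next ping must ARP again.
flush_arp() {
    ip neigh flush dev "$1"
}

# watch_arp_icmp IFACE PEER_IP: print ARP frames plus ICMP to/from the peer,
# with link-level headers (-e) so the source and destination MACs are visible.
watch_arp_icmp() {
    tcpdump -n -e -i "$1" "arp or (icmp and host $2)"
}
```

Run "flush_arp wlan0" on both stations after the roam, start "watch_arp_icmp wlan0 <peer address>" on each, and then ping from a second terminal. Comparing which ARP requests, ARP replies and ICMP echoes show up on each side shows in which direction frames are being lost.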

I'm still at a loss to explain why it doesn't work though; my knowledge of the innards of the wireless kernel module is mostly limited to knowing the names of the drivers I generally use.

When the roam occurs, try clearing the arp cache of both APs.

First, make sure there is only one DHCP server in the network.  It is simplest to run it on the lan interface of the router that is routing to the internet.  The other router's configuration is called a "dumb AP" because all it does is pass wifi packets over to the ethernet cable to the main router.  Thus your clients should be getting the same IP address (based on their MAC address) even though they roam.

An AP wifi driver is also a bridge between clients that are connected to it wirelessly.  These packets never leave the wifi driver.  The default is for this bridge to be on.  It can be turned off with the "AP Isolation" setting for situations like a coffee shop where you don't want the users to be able to reach each other.

Yes, that internal bridge is the "problem".

As long as a device (basically its MAC, I assume) is listed as connected to a certain AP, all the unicast packets that originated from other clients connected to the same AP, targeted to that device, will never leave the internal bridge.
So if that device roamed to another AP, but did NOT disassociate from the first one, it cannot communicate with the other devices connected to the first AP. (or to be more accurate, the other devices can't communicate with it)
It makes sense, right?
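To make the reasoning concrete, here is a toy model of the forwarding decision. It is not how the driver is actually implemented; the function name and MAC addresses are made up, and it only mirrors the behaviour described above.

```sh
#!/bin/sh
# Toy model only; this is not how the driver is implemented.

# forward_target DST MAC...: DST is the destination MAC of a unicast frame that
# arrived from a wireless client; the remaining arguments are the MACs the AP
# still lists as associated.
forward_target() {
    dst=$1
    shift
    for mac in "$@"; do
        if [ "$mac" = "$dst" ]; then
            echo "wireless"   # sent back out over the air, never reaches br-lan
            return 0
        fi
    done
    echo "bridge"             # handed to the bridge, so it can reach the other AP
}
```

With a stale entry for the roamed client still in the list, a frame for it takes the "wireless" path and is lost. Once the entry is gone, the same frame takes the "bridge" path and can reach the client through the other AP.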

AP isolation does turn off the bridge, but all the frames that were supposed to be bridged are just dropped, so that doesn't help.

@atom: ARP works, but the ARP reply is unicast, so it will only be received if the request is made for the roamed client by other devices, and not by the roamed client itself.
I tested this, and it confirms the theory.

Now I'm thinking 'disassoc_low_ack' option should help. (although it's enabled and it doesn't)
The documentation says "Disassociate stations based on excessive transmission failures or other indications of connection loss."
If a device is listed as connected, but does not respond when other devices are trying to reach it, why doesn't that count as connection failure?


A possible hack/custom fix would be to detect when a client associates with the AP (is that possible?) and call a script which sends a message to the rest of the APs in the network to disassociate that MAC.


LE: disassoc_low_ack does help.
If I flood the target with pings (like ping -i 0.05 192.168.0.X), after a few seconds it disassociates it from the old AP and the connection starts working.
Is there any way to make the "low_ack" condition more sensitive?

(Last edited by load.runner on 22 Apr 2015, 01:31)

Useful thread, thanks - I've learnt something significant!  I don't deal much with situations where clients communicate amongst one another so have never noticed this.

On a practical level to try and resolve your real-world problem: How much overlap is there between your APs?  Specifically, in your normal walking route from one AP to the other, at what point(s) is the signal level from both APs the same?  There are so many conditions that influence this, but you'd probably want to maintain SNR of >= 10, which could mean anything from -76 dBm to -86 dBm...

I'd like to understand the probability that the client goes out of range of AP1 before it can see AP2.  Maybe the client behaviour will change if there is more overlap?  Transmit power on most mobile devices is quite low (typically <16 dBm for 2.4 GHz) so it's also possible that the client does attempt to disassociate but the AP never sees it.  If lack of overlap is the case, adding a 3rd AP en route could be an option.

It might also be interesting to see if roaming behaviour changes if you try to use 802.11r, but client support is somewhat limited at present and documentation on how to set it up on OpenWRT is spotty.  Some of the major vendors also caution of "unexpected results" when non-802.11r clients associate with an 802.11r network.

The APs overlap 100%, that's the weird part. But that only makes things worse.
My laptop, for example, sometimes jumps randomly like crazy from one AP to another, without moving at all, and it (almost) never sends de-auth packets. The signal is between -40 and -55 dBm for both APs; I'm using a BCM4313 PCIe adapter on Ubuntu.
But it's not the only device that does that. I also had this problem with Android phones or tablets.
I just assumed that I can't rely on clients doing that.

If someone is curious, I still need the second AP because I use a Chromecast device, which probably has a lower sensitivity and/or transmit power and can't play 1080p content when connected to the main AP.

I will probably end up with using different network names, and forcing some devices on a certain AP.

I never used 802.11r, but from what I read, it only changes the key negotiation part.

I still think that manual disassociation could help.
I learned that hostapd_cli can do that, I just need to call a script whenever a client is associated.
Is there any way of doing that?
I can even modify hostapd if I really have to, but I wouldn't want to create more bugs.
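One approach that might avoid patching hostapd is the action-file mode of hostapd_cli (the -a option), which runs a script for each control-interface event. The following is a sketch under assumptions: that the hostapd_cli on your build supports -a, that the script is called with the interface name, the event name and the station MAC as arguments, and that the peer address and key path match the setup described in this thread. The script path and function names are hypothetical.

```sh
#!/bin/sh
# Hypothetical action script, e.g. /usr/local/bin/wifi_assoc_action.sh
# Assumed call convention: <script> <ifname> <event> <station MAC>

# Space-separated list of the other APs (placeholder default).
PEER_APS="${PEER_APS:-192.168.0.1}"

# notify_peer AP MAC: ask one peer AP to disassociate the station.
notify_peer() {
    ssh -y -i /root/.ssh/id_rsa "root@$1" "hostapd_cli disassociate $2" </dev/null
}

# handle_event IFNAME EVENT MAC: react only to new associations.
handle_event() {
    case "${2:-}" in
        AP-STA-CONNECTED)
            for ap in $PEER_APS; do
                notify_peer "$ap" "$3"
            done
            ;;
    esac
}

handle_event "$@"
```

It would be started on each AP with something like "hostapd_cli -i wlan0 -a /usr/local/bin/wifi_assoc_action.sh -B", where wlan0 is a placeholder interface name. This is untested here, so treat it as a starting point.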


LE:
I added the "feature" to hostapd.
Added this in ieee802_11.c at line 2432, right after the device is logged as associated.
I didn't bother to add a config option for it, because it will never be included in hostapd. :)

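     /* 17 characters for "xx:xx:xx:xx:xx:xx" plus the terminating NUL */
     char macStr[18];
     int res = os_snprintf(macStr, sizeof(macStr), MACSTR, MAC2STR(sta->addr));
     /* last argument 0 = do not wait for the script to finish */
     if (res != -1)
         os_exec("/usr/local/bin/wifi_assoc.sh", macStr, 0);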

and the script:

#!/bin/sh

# $1 is the MAC of the station that just associated here; ask the other AP to drop it.
ssh -y -i /root/.ssh/id_rsa root@192.168.0.1 "hostapd_cli disassociate ${1}"

Until I find a nicer solution, I guess this will do it. It seems to work.

(Last edited by load.runner on 22 Apr 2015, 22:49)

You could also try IAPP.
Add

option iapp_interface 'br-lan'

to the "config wifi-iface" section for both APs (br-lan should be the interface that has IP addresses in the lan where the APs can communicate with each other).

I once tried this, but it doesn't work well with dual-band devices (if I add it to both radios, it stops with an error), so I gave up experimenting with it. Maybe you could give it a try if your devices are running in only one band.

In theory this protocol sends a broadcast message from an AP to the defined interface when a station associates to it, thus letting other APs know they should disassociate that client (in case it was associated to them in the first place).

Hmm, it sounds like exactly what I need. Thanks.

I will try it for sure, but it doesn't sound very promising.

hostapd documentation wrote:

IEEE 802.11F-2003 was a experimental use specification. It has expired and IEEE has withdrawn it. In other words, it is likely better to look at using some other mechanism for AP-to-AP communication than extending the implementation here.

I mentioned 802.11r as I know there is some interaction between APs which might (or might not) help them become aware of the change, but duvi's IAPP suggestion looks like it's better-placed to do that.  Also 802.11r would require 802.1X authentication.

I've read a bit and it looks like IAPP (802.11f) was never really standardised and is officially "discontinued" as a standard, though it would still work on devices that support it.  From what I can gather, it doesn't really look like it got implemented in commercial products.

It seems IAPP is replaced by a combination of 802.11r (fast roaming) and 802.11k (assisted roaming.)

Thanks for the workaround that executes the disassociation command; I anticipate that doing a remote shell would be quite slow, up to several seconds - have you noticed any side effects?  A fun project might be to somehow forge a disassociation frame and send it on behalf of the client, though that's perhaps a bit extreme :-)

PS - in terms of overlap you might want to try reducing the transmit power of the APs as a possible way of encouraging devices to stick with one AP.  Commercial products generally recommend cutover at -67 dBm, which usually leaves you with 20-25 dB of signal-to-noise ratio (assuming a noise floor of -87 dBm to -92 dBm.)  802.11n with 20 MHz channels reaches maximum performance at ~25 dB SNR, so anything beyond that largely goes to waste.  http://www.revolutionwifi.net/revolutio … pping.html
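The 20-25 dB figure is just the signal level minus the noise floor. A tiny helper to check it for other values (snr_db is a made-up name, and the inputs are the numbers quoted above):

```sh
#!/bin/sh
# snr_db SIGNAL_DBM NOISE_DBM: SNR in dB is signal level minus noise floor.
snr_db() {
    echo $(( ($1) - ($2) ))
}

snr_db -67 -87   # prints 20
snr_db -67 -92   # prints 25
```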

The last param for os_exec is wait_completion, so as long as it's 0, it shouldn't cause any issues. (in theory at least)
Looking into the code for IAPP I found a much better place for calling my script, but before trying to improve it, I will really look into IAPP. It seems perfect. Similar to what I did with the script, just using a better approach and supporting any number of APs.
At first glance it doesn't seem to work: it sends the request, but the other AP doesn't get it. I will try to debug the code, find the source of the problem, and maybe even fix it.

IAPP was dropped probably because nobody (or very few people) care about communicating with other Wifi clients. As long as the internet is working, it's great. But that's not the case for me.

I only read about them on Wikipedia, but 802.11k seems to only exchange information about the available APs and their usage, in order to improve client distribution between them, and 802.11r seems to only cache part of the key, in order to reduce the time it takes to switch from one AP to another, mainly for scenarios where you use VoIP while driving. :)
802.11f was the only one that tried to enforce unique association.
I repeat, this is only from reading Wikipedia. :)

(Last edited by load.runner on 23 Apr 2015, 12:55)

load.runner, did you ever solve your problem? I'm running 4 different SSIDs (2 APs, with two 2.4 GHz radios and two 5 GHz radios, each radio with its own SSID). I'd really like to just make them all share the same SSID, but I'll likely run into the same problem you have. I'd love to know if you figured out IAPP or another solution.

I'd try your ssh script, but I'm using a wrt1900ac right now and the devs/wiki recommend using CC RC3. I'm not sure yet how to ensure I'm getting exactly RC3 when I build, nor am I an OpenWrt-building pro by any means. It's been a year or more since I compiled it myself. I suppose I could check out RC3 from subversion, build, and then compare to the checksum of the downloadable image, but if you found a solution it'd be nice to know.

Thanks!

I'm still using the script I posted.
It works fine and it doesn't seem to cause any problems.

Finally got around to trying your script in my own environment, but ran into a bit of a snag: sometimes - not always - a zombie process (visible through ps) gets left behind after the script is called.  The zombie is left on the router that triggers the script; the SSH session terminates and cleans up successfully on the remote router.

The confusing thing is that the contents of the script - or whether it exists at all - does not seem to matter.  If the script does not exist at all, there is an additional zombie process titled "hostapd" after every few new associations.

The zombie processes cannot be killed using "kill".  Restarting hostapd (e.g. by running "wifi") kills the zombies.

Thoughts on changing the way the script is called to prevent this?

I noticed this a while back and I fixed it by adding "signal(SIGCHLD, SIG_IGN);" to utils/os_unix.c
Details here: https://www.win.tue.nl/~aeb/linux/lk/lk-5.html#ss5.5

This is the complete patch file I use:

--- a/src/ap/ieee802_11.c
+++ b/src/ap/ieee802_11.c
@@ -2429,6 +2429,11 @@
            "associated (aid %d)",
            sta->aid);
 
+    char macStr[18];
+    int res = os_snprintf(macStr, 18, MACSTR, MAC2STR(sta->addr));
+    if (res != -1)
+        os_exec("/usr/local/bin/wifi_assoc.sh", macStr, 0);
+
     if (sta->flags & WLAN_STA_ASSOC)
         new_assoc = 0;
     sta->flags |= WLAN_STA_ASSOC;
--- a/src/utils/os_unix.c    2015-09-22 17:23:42.977119181 +0300
+++ b/src/utils/os_unix.c    2015-09-29 00:24:28.973989299 +0300
@@ -643,6 +643,7 @@
     pid_t pid;
     int pid_status;
 
+    signal(SIGCHLD, SIG_IGN);
     pid = fork();
     if (pid < 0) {
         perror("fork");

Awesome, thanks!  Will try to recompile and test.

Seems to be running stably - thanks very much.  Note to future self: tabs do not transfer across reliably in terminal copy & paste, so diff file above may need to be re-created manually.

Can either of you guys submit the patch to openwrt devs please?

It would probably require a config option for the script name/path.
Right now it's hardcoded.

While this is certainly useful functionality, my view is that it is a hack, and I agree that submitting any of it to trunk would be premature; any attempt would likely be countered with the argument that IAPP (as suggested by duvi), though deprecated, already does this.

I'd anticipate that for most people, using a "repeater"-style implementation would be a much simpler solution to the problem as it would keep the MAC on the wireless side.  With multi-stream being common in consumer routers today, the reduced throughput would likely be acceptable in most cases.

Those of us desperate, stubborn or stupid enough to make it work the "right" way can continue to hack away in the spirit of OSS :-)
