Dawn: a decentralized wireless controller

It seems to me a lot of Android phone companies don't implement wifi roaming well.

I am pretty sure Android can support 802.11k/v/r. But take my case as an example: even with a new Xiaomi flagship phone, I had to tweak its WCNSS wifi driver to make roaming work, and even then it only worked some of the time. Without rooting the phone, those tweaks would not have been possible at all.

(I can share what I have done, but I think the modifications differ by phone brand and wifi chip. PM me if interested!)

I am still running into some issues; see the thread "Strange problem with 802.11k and 802.11v" that I have just posted.

But so far what I can conclude is that at least for my Android phone, I don't even need Dawn to be running. Like the iDevices, it makes its own decisions on which AP has the best connection quality and which one to switch to. But unlike the iPhone, it needs 802.11k/v to be up and running (this is the case for my phone anyway) to feed info to it for making the decisions.

Hi,
I see in your other post that you've kept bss_transition at 0.
Going by your wording, does that mean a bss_transition of 0 is what your one device requires to work, or that all your non-Apple devices still need that value as opposed to '1'?

That '0' value simply means without Dawn or any other BSS transition management tool, there is nothing that will give out BSS_tm_req commands (hostapd function, to tell devices to switch to other APs). 802.11v in this case doesn't have any network-based switching capability.

Be warned this is only my understanding of the situation. I can be totally wrong.... (I hope not)

I’ve just received my new Android phone (from work, privately I use iOS). Setting that new Android phone up and connecting to my WiFi I already experience problems connecting… My old Android phone is connected, all my other devices including iot devices are connected and “just work”. From OpenWRT logging I see no reason why this new Android phone won’t connect. The so called “Smart Switch” app to migrate data from the old to the new phone can utilize either a cable (which I don’t have) or WiFi. The app needs to be running on both Android Phones and connected to the same WiFi network. The app just doesn’t work. It even crossed my mind that the app is dumber than I thought and both devices had to be on the same band: either 2.4GHz or 5GHz… But no, doesn’t work.

I admit I’m a bit biased when it comes to Android, seeing how iOS devices work, but I can’t really tell (yet?) whether there are WiFi settings some Android versions don’t like. I can tell, however, that my iOS and other IoT devices work “just fine”.

So after all that, should bss_transition be 0 or 1? :slight_smile:

An important bugfix was pushed to hostapd:

Please upgrade to the latest snapshot. Without the fix, hostapd can probably crash when DAWN is running.
More Information:

As a consequence for you, running DAWN on stable 21.02 may cause hostapd crashes and so actually make your wireless connections worse. I will have a look at whether we can backport everything to 21.02. Until then, maybe switch to snapshot if you can.

I need some time to answer all of your other questions. So far I was more interested in removing all the crashes in DAWN, and also in hostapd. :slight_smile:

@PolynomialDivision - I may be able to provide a hint regarding the "sticky behavior" some people are experiencing with DAWN.

Note: this was from a DAWN build (using the latest GitHub source) 1-2 months ago. DAWN was seemingly unstable on my hardware, and my family's patience for internet trouble dried up, so I'm not currently running DAWN and can't test anything. Sorry.

That said, when I was using DAWN I wrote a script to group the hearing map output in different ways (available HERE - calling this with argument showHM_groupAP is particularly useful). I noticed that some of the items were wrong for non-local AP's. I forget all the specifics but remember that all non-local AP's were shown as not having HT or VHT support (i.e., in a 2 AP setup, AP1 always scores AP2 as if it had no HT/VHT support for any device, and AP2 always scores AP1 as if it had no HT/VHT support for any device). This always increased the score of the current access point, which I imagine tended to make clients sticky.

I hope this may be of use in debugging things. Good luck!

I see something similar actually on my end with 3 APs. It seems the hearing map on each AP shows False for HT Sup for other APs than itself. However the scores remain the same on all APs I believe because by default ht_support and vht_support are 0 in the config.

AP1

| Client MAC | AP MAC | Frequency | HT Sup | VHT Sup | Signal | RCPI | RSNI | Channel Utilization | Station connect to AP | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIENT1 | AP3 | 2.437 GHz Channel: 6 | False | False | -53 | -1 | -1 | 58.43 % | 0 | 134 |
| | AP1 | 2.437 GHz Channel: 6 | True | False | -73 | -1 | -1 | 37.65 % | 1 | 94 |
| | AP1 | 5.765 GHz Channel: 153 | True | False | -76 | -1 | -1 | 6.67 % | 3 | 113 |
| | AP2 | 2.437 GHz Channel: 6 | False | False | -48 | -1 | -1 | 57.65 % | 0 | 144 |

AP2

| Client MAC | AP MAC | Frequency | HT Sup | VHT Sup | Signal | RCPI | RSNI | Channel Utilization | Station connect to AP | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIENT1 | AP3 | 2.437 GHz Channel: 6 | False | False | -53 | -1 | -1 | 59.22 % | 0 | 134 |
| | AP1 | 2.437 GHz Channel: 6 | False | False | -73 | -1 | -1 | 37.65 % | 1 | 94 |
| | AP1 | 5.765 GHz Channel: 153 | False | False | -76 | -1 | -1 | 6.67 % | 3 | 113 |
| | AP2 | 2.437 GHz Channel: 6 | True | False | -48 | -1 | -1 | 57.65 % | 0 | 144 |

AP3

| Client MAC | AP MAC | Frequency | HT Sup | VHT Sup | Signal | RCPI | RSNI | Channel Utilization | Station connect to AP | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIENT1 | AP3 | 2.437 GHz Channel: 6 | True | False | -53 | -1 | -1 | 59.22 % | 0 | 134 |
| | AP1 | 2.437 GHz Channel: 6 | False | False | -73 | -1 | -1 | 37.65 % | 1 | 94 |
| | AP1 | 5.765 GHz Channel: 153 | False | False | -76 | -1 | -1 | 6.67 % | 3 | 113 |
| | AP2 | 2.437 GHz Channel: 6 | False | False | -48 | -1 | -1 | 57.65 % | 0 | 144 |

> However the scores remain the same on all APs I believe because by default ht_support and vht_support are 0 in the config.

True. I had forgotten that in my setup I had made these non-zero, which made the scores for a given STA on a given band differ from AP to AP. In fact, I think investigating why the scores differed is what led me to discover this DAWN error.

Interesting note: If I remember correctly, the no_ht_support and no_vht_support metrics still were correct for non-local AP's, so if instead of using

option ht_support '100'
option no_ht_support '0'
option initial_score '0'

you used

option ht_support '0'
option no_ht_support '-100'
option initial_score '100'

then the computed scores were the same as if the ht_support=100 case scores were being evaluated correctly for non-local AP's.
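The equivalence can be checked with a little arithmetic. The following is a minimal sketch, not DAWN's actual scoring code: it assumes the HT part of the score is simply initial_score, plus ht_support when the AP is seen as HT-capable, plus no_ht_support when the AP is seen as lacking HT. The bug is modelled as the HT flag being lost for non-local AP's while the no-HT flag stays correct. The function name `ht_score` and its arguments are made up for illustration.

```shell
#!/bin/sh
# Toy model of the HT part of a DAWN score (an assumption, not DAWN's code).
# usage: ht_score INITIAL HT_BONUS NO_HT_BONUS HT_FLAG_SEEN NO_HT_FLAG_SEEN
# HT_FLAG_SEEN / NO_HT_FLAG_SEEN are 1 or 0, as the scoring AP sees them.
ht_score() {
    echo $(( $1 + $2 * $4 + $3 * $5 ))
}

# A remote AP that really supports HT. With the bug described above, the
# scoring AP sees ht=0 (wrong) while no_ht=0 is still correct.

# Config 1: ht_support=100, no_ht_support=0, initial_score=0
ht_score 0 100 0 1 0      # evaluated correctly -> 100
ht_score 0 100 0 0 0      # with the bug        -> 0

# Config 2: ht_support=0, no_ht_support=-100, initial_score=100
ht_score 100 0 -100 0 0   # with the bug        -> 100
```

Under these assumptions the second config gives the remote HT-capable AP the score it should have had under the first one, because it never relies on the flag that is reported wrongly.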

Note: the following is mostly wild speculation on my part.

That said, even if (with the default config) this doesn't change the DAWN metric scores, I could see it still causing sticky behavior IF this (incorrect) info about the [V]HT support of neighboring AP's is somehow getting passed on to the STA's. Given that

  1. the "end goal" of roaming (from the STA's viewpoint at least) is to switch a STA to a new AP so that STA gets faster speeds,

  2. the STA ultimately decides which AP to connect to,

  3. non-[V]HT speeds are considerably lower than [V]HT speeds, and

  4. I would guess that most mobile devices whose wifi speed we actually care about (phones, tablets, laptops, etc.) support [V]HT speeds; nowadays most 802.11a/b/g-only devices probably don't move around much,

I could see a device with a [V]HT connection to their current AP not wanting to switch to another AP that is advertised as not having [V]HT support.

Again, wild speculation, but I could even see this explaining the (deleted?) comment re: setting bss_transition=0 helped make some devices less sticky. Specifically, if those devices determine the capabilities of neighboring AP's by

  • bss_transition=1 --> rely on info sent from their current AP about neighboring AP's

  • bss_transition=0 --> determine capabilities of neighboring AP's independently

Per this comment, there is no need to set bss_transition for Dawn.

Also worth noting is that you don't need to enable 802.11k and 802.11v in the config either.

(In my previous reply I mentioned a different reason for setting bss_transition to 0 - that I wasn't using Dawn anymore. I am now using it once again, but still bss_transition is set to zero regardless... or in my case I am simply not including a bss_transition setting in /etc/config/wireless)

Should be zero, per this comment by the author himself.

Thanks!

The only thing is that the comment is still open to interpretation, I think, no?

i.e., he says “no, you don’t need it, because I set it.”

I think that lines up more with your comment 2 above: I guess it doesn’t matter, because Dawn overrides the value anyway.

Hello @PolynomialDivision, I built a snapshot image to use the latest version of Dawn with latest snapshot of hostapd. Here is my report.

  1. I set "option kicking '1'". It seems to work for my iPhone, but I recall seeing my iPhone roam even on a previous installation of Dawn without this option enabled. (How can I be sure that Dawn is sending out those BSS transition requests to devices like my iPhone to get them to switch APs, assuming this is what 'kicking' does? Is there a logging option that I can enable in places like /etc/init.d/dawn or something like that to allow me to see when Dawn is actually issuing commands to kick devices from one AP to another?)

  2. I also changed "option kicking_threshold" from 20 to 15, assuming that it means the difference in scores based on which Dawn will decide to kick devices out of the current AP with a lower score to the AP with the highest score. It SEEMS to work for my iPhone (again without any log that shows when Dawn is actually kicking devices, I cannot say for certain).

  3. I understand in your own environment, you are setting up your Dawn routers as dumb APs with a wired backbone connecting these APs. But I believe many others use 802.11s and batman to wirelessly create a hub of APs that use Dawn. With batman, I encounter this issue where the neighbor AP list takes a LONG time to populate (as long as one hour or so). This is very similar to the issue reported here quite some time ago.

  4. Also I am seeing these errors in logread for each of my routers:

Mon Nov 22 22:08:08 2021 kern.warn kernel: [ 7175.548272] br-lan: received packet on bat0 with own address as source address (addr:xx:xx:xx:xx:xx:xx, vlan:0)
Mon Nov 22 22:08:18 2021 kern.warn kernel: [ 7185.788576] br-lan: received packet on bat0 with own address as source address (addr:xx:xx:xx:xx:xx:xx, vlan:0)
Mon Nov 22 22:08:28 2021 kern.warn kernel: [ 7196.028277] br-lan: received packet on bat0 with own address as source address (addr:xx:xx:xx:xx:xx:xx, vlan:0)

One error every 10 seconds, and it continues as long as Dawn is running. If you could point me to the details of what command Dawn passes on to umdns once every 10 seconds, I could try filing a bug report on the umdns GitHub to see if the umdns devs will respond. (I think a vast majority of Dawn users will be using a wireless backbone like batman, so fixing this, plus the problem with neighbor list population, would mean a lot to many of us.)

  5. I have an Android phone that just doesn't play well with 802.11v. I set both eval_auth_req and eval_assoc_req to 1 to see if the AP would simply deny access to force my Android to switch to another AP, but nope. Even when the current AP had a score 30 points lower than the adjacent AP, no switching happened. Is it because the kicking option takes precedence (I have kicking set to 1 also) and Dawn continues to steer through BSS_tm_req messages?

  6. On a related note, may I make a request that should make Dawn even more useful: an option to give a list of MAC addresses that use the 'eval_auth_req' and 'eval_assoc_req' methods, with all devices not in this list falling under 'kicking'. Like you said somewhere else, 'eval_auth_req' and 'eval_assoc_req' are primarily there to cater for legacy devices. And in a network environment there will typically be a mix of legacy (bad) devices that do not comply with 802.11v and modern (good) devices that do.

  7. Sorry, this is something that I feel a strong urge to say... the documentation (or lack thereof) is probably driving a lot of people away from an otherwise very powerful and useful tool that, to me, has the potential to become a core part of OpenWrt. This is probably the only place with a bit of official information: https://github.com/berlin-open-wireless-lab/DAWN, but even there the info is seriously outdated. For example, 'kicking' isn't even there :slight_smile: . I also ended up having to read a lot of other posts, with a lot of guesswork, to gain enough confidence about the meaning of the scoring system in general and of many of the options.

Edit: for item 4 above - turns out the problem was not related to umdns / dawn at all. It was the way I configured batman that caused the loops. Fixed it by adding in wifi-iface: option macaddr 'xx:xx:xx:xx:xx:xx' to change the mac address of the mesh interface.

I'm enabling the features here:

That is a very interesting finding. Actually bss_tm_req should kick it because we set some timer?

But I just realized sometimes I use del_client_interface. I need further investigation into it.

If you want to help me, feel free to start a wiki article about your findings and the config options; I am happy to read it and correct anything that is wrong.

Currently, I lack a good testbed, and I have more important coding tasks to do in my limited free time. :confused: If you want to start coding for DAWN, I am happy to help. Or please open issues on GitHub, and then I can rework the DAWN code and add whatever extra debug output you need. My takeaway is that I should add some logging that shows the steering decisions?

Can you create a ring buffer that has the last ~50 decisions that can be read from /proc ?

Sure. Can you point me to some example code I can use as a reference?

Are you in kernel space or userspace only? I think proc is kernel
Kernel space example

Userspace only.
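For a userspace-only daemon, one stand-in for the /proc idea is a small bounded log file that always holds the newest N decisions. The sketch below only illustrates that idea in shell; DAWN itself is written in C and does not provide this. The file path, the helper name `record_decision` and the line format are all made up for illustration.

```shell
#!/bin/sh
# Keep only the newest MAX lines of a decision log (hypothetical helper).
# usage: record_decision FILE MAX "text of the decision"
record_decision() {
    file=$1
    max=$2
    shift 2
    printf '%s\n' "$*" >> "$file"
    # Trim to the newest MAX lines via a temp file, then rename it into place.
    tail -n "$max" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example: 60 decisions go in, only the newest 50 stay.
log="${TMPDIR:-/tmp}/dawn_decisions.log"
: > "$log"
i=1
while [ "$i" -le 60 ]; do
    record_decision "$log" 50 "decision $i: kick xx:xx:xx:xx:xx:xx"
    i=$((i + 1))
done
head -n 1 "$log"   # oldest entry still kept
```

Anything that can read a file (cat, a LuCI page, a ubus wrapper) could then show the recent steering decisions without kernel code.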

To be honest I was gonna give up on Dawn - as it wasn't really working in my network environment, but....

... after I actually went back from snapshot to release build 21.02.1 without dawn, by accident I bumped into this old comment from you: How does rrm work? - #36 by PolynomialDivision

I then went ahead and tried it without any expectations, and very much to my surprise, both my iPhone (an old 6s+) and my Android (Xiaomi Mi11) could be steered at will to the AP of my choice! All I did was

ubus call hostapd.wlan0-1 wnm_disassoc_imminent '{"addr":"xx:xx:xx:xx:xx:xx", "duration": 120, "neighbors":"$sam"}'

where $sam is

sam=$(ssh root@<ip of another AP> ubus call hostapd.wlan0-1 rrm_nr_get_own)

... works like magic. So now I have high hopes for Dawn again all of a sudden. Unfortunately, the last time I did any sort of programming was something like 20 years ago, so it will be difficult for me to determine from your source code whether kicking means that 'wnm_disassoc_imminent' is issued (nor will I be able to start contributing to your GitHub project, even though I am willing to).
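One detail about the command above: inside single quotes the shell does not expand `$sam`, so hostapd receives the literal text `$sam` rather than the neighbor report. The sketch below builds the JSON with the variable expanded. It assumes, without having been checked against this hostapd build, that `neighbors` expects an array of hex-encoded neighbor report strings and that `rrm_nr_get_own` returns something like `{"value": [bssid, ssid, hex]}`. The helper name `tm_payload` and the placeholder hex string are made up for illustration.

```shell
#!/bin/sh
# Build the JSON argument for wnm_disassoc_imminent (hypothetical helper).
# usage: tm_payload CLIENT_MAC DURATION NEIGHBOR_REPORT_HEX
tm_payload() {
    printf '{"addr":"%s","duration":%s,"neighbors":["%s"]}\n' "$1" "$2" "$3"
}

# Placeholder value; on a real AP it would come from something like
#   nr=$(ssh root@<ip of another AP> ubus call hostapd.wlan0-1 rrm_nr_get_own \
#        | jsonfilter -e '@.value[2]')    # assumed output layout
nr="0123456789abcdef"
payload=$(tm_payload "xx:xx:xx:xx:xx:xx" 120 "$nr")
echo "$payload"

# On the AP the client is currently connected to:
#   ubus call hostapd.wlan0-1 wnm_disassoc_imminent "$payload"
```

If the original command worked anyway, one possible explanation is that the phones picked a target on their own once they were told disassociation was imminent, but that is a guess.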

I will soon set up Dawn once again on snapshot OpenWrt, and try to determine whether wnm_disassoc_imminent was somehow never triggered by Dawn, or whether something went wrong during wnm_disassoc_imminent so that the transition never took place for my phones. At least now I know what to look for in logread: any appearance of BSS-TM-RESP xx:xx:xx:xx:xx:xx status_code=0 bss_termination_delay=0 target_bssid=yy:yy:yy:yy:yy:yy, as I can see these entries all over my manual tests.
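Watching for those responses can be done with a small filter over the logread output. This sketch relies only on the log format quoted above (BSS-TM-RESP, then the client MAC, then key=value fields); the helper name `bss_tm_summary` is made up, and the "hostapd: wlan0-1:" prefix in the sample line is an assumption about how the line is tagged.

```shell
#!/bin/sh
# Summarize BSS-TM-RESP lines read from stdin (hypothetical helper).
# Prints: <client mac> status=<code> target=<bssid, or - if absent>
bss_tm_summary() {
    awk '{
        mac = ""; status = ""; target = "-"
        for (i = 1; i <= NF; i++) {
            if ($i == "BSS-TM-RESP")     mac = $(i + 1)
            if ($i ~ /^status_code=/)    status = substr($i, 13)
            if ($i ~ /^target_bssid=/)   target = substr($i, 14)
        }
        if (mac != "") print mac, "status=" status, "target=" target
    }'
}

# On a router:  logread | bss_tm_summary
# Offline example using the line format from the post:
echo 'hostapd: wlan0-1: BSS-TM-RESP xx:xx:xx:xx:xx:xx status_code=0 bss_termination_delay=0 target_bssid=yy:yy:yy:yy:yy:yy' | bss_tm_summary
```

Lines without BSS-TM-RESP are skipped, so a period of Dawn running with no output from this filter would suggest that no client answered a transition request in that time.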

If Dawn ends up working well in my environment, I will be more than happy to contribute to your future wiki (or your README.md in github) :+1:

(so far for Dawn I can see a scoring system that seems to calculate the scores well, plus I know hostapd's wnm_disassoc_imminent is used to steer/kick devices around. What seems to be missing for me is a clear sign of linkage between these two aspects of Dawn. But I am happy there is progress in my testing)