Pagefault error - Possibly DAWN related

Hello,

I've been experiencing issues with my wifi setup since updating to OpenWrt 21.02.2.

My setup is essentially an OPNsense router/firewall and 3 Xiaomi R3Gs as dumb APs. I have several VLANs, each mapped to a wifi network, plus each AP has a hidden wifi management network.

I use DAWN for roaming, which worked just fine out of the box in OpenWrt 21.02.0.

I finally sat down and upgraded all 3 APs to the exact same OpenWrt version (OpenWrt 21.02.2 r16495-bf0c965af0), configured DAWN as recommended here, and started looking at logs.

This happens mostly with Android phones. A BSS-related message pops up, then a page fault happens:

do_page_fault(): sending SIGSEGV to hostapd for invalid read access from 00000005

This causes the entire network to reload for a few seconds: the interfaces are brought down and back up. Sometimes it happens once, but I've seen it happen up to 5 times in quick succession.

Apr 17 20:06:05 ap-1 hostapd: wlan0: STA a2:11:b0:XX:XX:XX WPA: pairwise key handshake completed (RSN)
Apr 17 20:06:05 ap-1 firewall: Reloading firewall due to ifupdate of lan (br-main.10)
Apr 17 20:06:05 ap-1 hostapd: wlan0: BSS-TM-RESP a2:11:b0:XX:XX:XX status_code=6 bss_termination_delay=0
Apr 17 20:06:05 ap-1 kernel: [ 8078.021263] do_page_fault(): sending SIGSEGV to hostapd for invalid read access from 00000005
Apr 17 20:06:05 ap-1 kernel: [ 8078.030106] epc = 55610855 in wpad[55609000+106000]
Apr 17 20:06:05 ap-1 kernel: [ 8078.035160] ra  = 55610855 in wpad[55609000+106000]
Apr 17 20:06:05 ap-1 kernel: [ 8078.044161] br-mgmt: port 2(wlan1-3) entered disabled state
Apr 17 20:06:05 ap-1 netifd: Network device 'wlan1-3' link is down
Apr 17 20:06:05 ap-1 firewall: Reloading firewall due to ifupdate of lan (br-main.10)
Apr 17 20:06:05 ap-1 kernel: [ 8078.503116] br-mgmt: port 2(wlan1-3) entered disabled state
Apr 17 20:06:05 ap-1 kernel: [ 8078.517472] device wlan1-3 left promiscuous mode
Apr 17 20:06:05 ap-1 kernel: [ 8078.522198] br-mgmt: port 2(wlan1-3) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8078.591527] br-main: port 9(wlan1-2) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8078.603469] device wlan1-2 left promiscuous mode
Apr 17 20:06:06 ap-1 kernel: [ 8078.608170] br-main: port 9(wlan1-2) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8078.691327] br-main: port 8(wlan1-1) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8078.706651] device wlan1-1 left promiscuous mode
Apr 17 20:06:06 ap-1 kernel: [ 8078.711356] br-main: port 8(wlan1-1) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8078.783544] br-mgmt: port 1(wlan0-3) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8078.797028] device wlan0-3 left promiscuous mode
Apr 17 20:06:06 ap-1 kernel: [ 8078.801804] br-mgmt: port 1(wlan0-3) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8078.883887] br-main: port 6(wlan0-2) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8078.897795] device wlan0-2 left promiscuous mode
Apr 17 20:06:06 ap-1 kernel: [ 8078.902550] br-main: port 6(wlan0-2) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8078.988144] br-main: port 5(wlan0-1) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8079.003559] device wlan0-1 left promiscuous mode
Apr 17 20:06:06 ap-1 kernel: [ 8079.008316] br-main: port 5(wlan0-1) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8079.077565] br-main: port 7(wlan1) entered disabled state
Apr 17 20:06:06 ap-1 kernel: [ 8079.084956] br-main: port 4(wlan0) entered disabled state
Apr 17 20:06:06 ap-1 netifd: Wireless device 'radio0' setup failed, retry=3
Apr 17 20:06:06 ap-1 netifd: Wireless device 'radio1' setup failed, retry=3
Apr 17 20:06:06 ap-1 netifd: Interface 'mgmt' is now down
Apr 17 20:06:07 ap-1 netifd: Interface 'mgmt' is disabled
Apr 17 20:06:07 ap-1 kernel: [ 8079.611780] br-main: port 4(wlan0) entered disabled state
Apr 17 20:06:07 ap-1 kernel: [ 8079.625358] br-main: port 7(wlan1) entered disabled state
Apr 17 20:06:07 ap-1 kernel: [ 8079.679329] device wlan0 left promiscuous mode

This causes the phones in question to drop out of wifi completely.
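
In case it helps anyone who wants to dig into the fault itself: the epc line in the log gives both the faulting address and the base address where wpad is mapped, so the offset inside the binary is just the difference between the two. A minimal sketch of that arithmetic follows. The function name `fault_offset` is made up for illustration, and turning the offset into a function name would still need a wpad build with debug symbols, which I'm assuming isn't installed on the APs.

```shell
#!/bin/sh
# Hypothetical helper: offset of a faulting address inside a mapped binary.
# $1 = faulting address (hex, no 0x prefix), $2 = mapping base (hex, no 0x prefix)
fault_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

# Values copied from the kernel log line:
#   epc = 55610855 in wpad[55609000+106000]
fault_offset 55610855 55609000   # prints 0x7855
```

The same offset should show up on every crash if the fault always comes from the same code path in wpad.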

I've googled quite a bit and found some issues related to DAWN and hostapd on this OpenWrt version, but nothing that explains the behavior I'm seeing.

Can anyone shed some light on this?

I can't see any immediate relation to DAWN there. I don't think it registers for or uses that BSS-TM-RESP message, for example.

If you disable DAWN do you still see the hostapd crashes?

Hey, thanks for your reply.

Indeed, none of these messages are DAWN related, but as you suggested, when DAWN is stopped the issue stops happening.

I have no idea how the ubus communication is actually handled between DAWN and hostapd, but the message seems pretty consistent with some malformed instruction between the two causing a page fault.

It should be possible to separate DAWN from this, which should help with a resolution.

If you're so inclined, you should have the ubus message for hostapd shown at the end here, which DAWN uses to steer a co-operative device and is probably the source of the device's BSS reply:

root@localhost:~# ubus -v list hostapd.wlan0
'hostapd.wlan0' @c9a5ca08
        "reload":{}
        ...
        "wnm_disassoc_imminent":{"addr":"String","duration":"Integer","neighbors":"Array","abridged":"Boolean"}
root@localhost:~#

There's some notes here on how to use that to steer a device manually: Dawn: a decentralized wireless controller - #61 by seemebreakthis
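
As a rough sketch of what such a manual call could look like, using only the argument names from the listing above: the helper name `build_steer_args`, the MAC address, the duration value and the `NEIGHBOR_REPORT_HEX` string are all placeholders, not values from this thread. I'm assuming the neighbor entry is the hex neighbor report string of the AP you want to steer to, which on builds that expose it should be obtainable with `ubus call hostapd.wlan0 rrm_nr_get_own` on that AP.

```shell
#!/bin/sh
# Hypothetical helper: build the JSON argument for wnm_disassoc_imminent.
# $1 = client MAC, $2 = duration, $3 = neighbor report hex string of the target AP
build_steer_args() {
    printf '{"addr":"%s","duration":%s,"neighbors":["%s"],"abridged":true}\n' \
        "$1" "$2" "$3"
}

# All three values below are placeholders; substitute real ones before use.
build_steer_args aa:bb:cc:dd:ee:ff 20 NEIGHBOR_REPORT_HEX

# On the AP the client is currently associated with (not executed here):
# ubus call hostapd.wlan0 wnm_disassoc_imminent \
#     "$(build_steer_args aa:bb:cc:dd:ee:ff 20 NEIGHBOR_REPORT_HEX)"
```

Running the real call with DAWN stopped, while watching `logread -f`, should show whether a hand-made steering request alone is enough to trigger the crash.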

If you see the same crash when doing so, it may help figure out what is happening. The first thing I spot is that your BSS message has a status code of 6 (some kind of problem?) rather than 0 (OK). That may mean the rest of the message has a form that hostapd is not expecting.
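
For reference, here is a small sketch that pulls the status code out of a log line and maps it to the BSS transition management status codes as I understand them from the 802.11 standard (hostapd defines the same values). The function name `bss_tm_status` is made up and the wording of each meaning is my paraphrase. If the table is right, 6 means the client rejected the request and supplied its own candidate list, so the response would carry extra data after the fixed fields.

```shell
#!/bin/sh
# Hypothetical helper: map a BSS-TM-RESP status code to its meaning.
bss_tm_status() {
    case "$1" in
        0) echo "accept" ;;
        1) echo "reject: unspecified reason" ;;
        2) echo "reject: insufficient beacon or probe response frames" ;;
        3) echo "reject: insufficient available capacity" ;;
        4) echo "reject: BSS termination undesired" ;;
        5) echo "reject: BSS termination delay requested" ;;
        6) echo "reject: STA provided its own BSS transition candidate list" ;;
        7) echo "reject: no suitable BSS transition candidates" ;;
        8) echo "reject: leaving ESS" ;;
        *) echo "unknown or reserved" ;;
    esac
}

# Log line copied from the first post.
line='hostapd: wlan0: BSS-TM-RESP a2:11:b0:XX:XX:XX status_code=6 bss_termination_delay=0'
code=$(printf '%s\n' "$line" | sed -n 's/.*status_code=\([0-9]*\).*/\1/p')
bss_tm_status "$code"
```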

Hey Ian,

I've upgraded my APs to the recent new release. I'm still seeing this behavior with DAWN enabled.

I'm going to try to get to the bottom of this with your suggestion.

I think I may also be suffering from the same or a similar problem (see https://forum.openwrt.org/t/openwrt-21-02-3-third-service-release/125732/55?u=ramon for the log).

Any idea whether this is a known issue in hostapd and/or whether it has already been fixed in a newer version?

Did you try the release candidate to see if it is still there?