Android device disconnects after fast roaming due to lost DHCP replies

I've been trying to diagnose this issue for a while now using tcpdump and logread with wireless.radio1.log_level='1'. The following is my setup:

  1. Main OpenWrt router:
    model: x86_64
    DHCP/DNSmasq server, IPv4 only
    Firewall service enabled
    Has ethernet ports only
    OpenWrt 21.02.0 r16279-5cc0535800
  2. Two dumb OpenWrt WiFi APs:
    model: Netgear WAC104
    Firewall service disabled
    DHCP/DNSmasq service disabled
    Hardware offloading disabled
    802.11r enabled
    FT over the air
    WPA2 PSK
    OpenWrt 21.02.3 r16554-1d4dea6d4f

All of my tests were done using 5GHz WiFi interface.

I start off by connecting my device (Samsung Note 9) to the first dumb AP and checking that it completes the full WPA 4-way handshake, sends a dhcp-request and gets a dhcp-reply (it doesn't send a dhcp-discover since it was previously connected).

Afterwards I run the ping app on my device and move towards the second dumb AP, which causes a fast transition, and the logs show "FT authentication already completed - do not start 4-way handshake".

Lastly I go back towards the first dumb AP, which again causes a fast transition, and the logs again show "FT authentication already completed - do not start 4-way handshake". At this point the device suddenly sends dhcp-requests again. The main router receives the dhcp-requests and sends back dhcp-replies via its ethernet interface; however, when I look at the tcpdump on the first dumb AP, I don't see the dhcp-replies (only the dhcp-requests are shown). The device keeps sending dhcp-requests (since it never gets a dhcp-reply) and then suddenly disconnects from the WiFi.

Below is a tcpdump screenshot of the main router on its physical ethernet port (not br-lan):

Below is a tcpdump screenshot of the dumb AP on its physical ethernet port (not br-lan):

Below is the logread output of the dumb AP showing the first WPA 4-way handshake, the FT and the disconnection:

I've tried:

  • switching ethernet ports between the dumb APs and the main router,
  • repeating the test starting with the second AP, moving towards the first AP and then back,
  • using a different Android device (Nokia 6.1 with Android 10).

All of these ended with the same result: dhcp-replies being dropped somewhere between the AP and the main router.
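A quick way to pin down where the replies vanish is to compare captures taken at each hop. The sketch below is a minimal illustration, assuming the DHCP messages have been transcribed by hand into (time, xid, kind) tuples; tcpdump's verbose mode (-v) prints the transaction ID (xid) of each BOOTP/DHCP packet. The function name and tuple layout are hypothetical. A request that is answered in the router's capture but unanswered in the AP's capture localizes the loss to the link between them.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# (timestamp_s, xid, kind): kind is "request" or "reply".
# Hypothetical layout, hand-transcribed from `tcpdump -v` output.
Event = Tuple[float, int, str]


def unanswered_xids(events: Iterable[Event]) -> List[int]:
    """Return, sorted, the xids that were requested but never answered."""
    counts = defaultdict(lambda: {"request": 0, "reply": 0})
    for _ts, xid, kind in events:
        if kind not in ("request", "reply"):
            raise ValueError(f"unexpected DHCP message kind: {kind!r}")
        counts[xid][kind] += 1
    return sorted(x for x, c in counts.items() if c["request"] and not c["reply"])


if __name__ == "__main__":
    # The same exchange as seen at two hops (values made up for illustration).
    router_eth = [(0.00, 0x1A2B, "request"), (0.01, 0x1A2B, "reply")]
    ap_eth = [(0.00, 0x1A2B, "request")]
    print([hex(x) for x in unanswered_xids(router_eth)])  # []
    print([hex(x) for x in unanswered_xids(ap_eth)])      # ['0x1a2b']
```

Counting per xid rather than per packet means client retransmissions that reuse an xid are treated as one unanswered exchange.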

I have a few questions:

  • If the firewall service is disabled on the dumb APs, could iptables still be dropping packets?
  • If it's not iptables that's causing the drops, then what is?
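For the first question, the rules that are actually loaded say more than whether the service is enabled. Below is a minimal sketch, assuming `iptables-save` is available on the AP and its output has been copied off: it reports non-ACCEPT policies on built-in chains and DROP/REJECT rules in the filter table. The function name is hypothetical. Keep in mind that frames bridged between the wireless and ethernet ports only pass through iptables when the br_netfilter module is loaded with bridge-nf-call-iptables enabled; without it, iptables cannot drop them at all.

```python
import re
from typing import Dict, List, Tuple

POLICY_RE = re.compile(r"^:(\S+)\s+(\S+)")  # ":CHAIN POLICY [pkts:bytes]"
JUMP_RE = re.compile(r"\s-j\s+(\S+)")       # target of an "-A CHAIN ... -j X" rule


def filter_table_blockers(dump: str) -> Tuple[Dict[str, str], List[str]]:
    """Scan `iptables-save` output and return (non-ACCEPT chain policies,
    DROP/REJECT rules), looking at the *filter table only."""
    in_filter = False
    policies: Dict[str, str] = {}
    rules: List[str] = []
    for raw in dump.splitlines():
        line = raw.strip()
        if line.startswith("*"):  # start of a table section
            in_filter = line == "*filter"
            continue
        if not in_filter or not line or line.startswith("#") or line == "COMMIT":
            continue
        policy = POLICY_RE.match(line)
        if policy:
            chain, target = policy.groups()
            if target not in ("ACCEPT", "-"):  # "-" marks user-defined chains
                policies[chain] = target
            continue
        jump = JUMP_RE.search(line)
        if line.startswith("-A") and jump and jump.group(1) in ("DROP", "REJECT"):
            rules.append(line)
    return policies, rules


if __name__ == "__main__":
    sample = """*filter
:INPUT ACCEPT [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
:reject - [0:0]
-A FORWARD -i br-lan -o br-lan -j ACCEPT
-A reject -p tcp -j REJECT --reject-with tcp-reset
COMMIT
"""
    print(filter_table_blockers(sample))
```

Only built-in DROP/REJECT targets are flagged; user chains such as fw3's `reject` are caught through the REJECT rule inside them.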

I've read the following threads for more info, but can't seem to find the real cause:


I'm just troubleshooting an issue that seems similar. I don't understand why your phone is sending DHCP requests though. What I have is:

  1. phone connected to AP#1, running Ubiquiti WifiMan app signal monitor (awesome stuff, btw), phone sends lots of pings to 8.8.8.8 and reports latency, etc. All works great.
  2. walk to AP#2 (the one that shows the problem), eventually phone transitions to that AP, FT works, phone sends a couple of pings but then starts sending ARPs for the gateway, phone never gets any response to anything
  3. after 4 seconds, phone disconnects from AP#2 and reconnects to AP#1 and everything resumes

During phase #2 I can see the ARP requests at the gw as well as its replies. Tcpdump on the ethernet port of AP#2 shows the packets going into the ethernet but there are no replies.

In my case, I have an AP#3 and everything works seamlessly between AP#1 and AP#3. Also, I have other devices that are connected fine to the problem AP#2 but they do show issues when walking away and back.

My current suspicion has to do with VLANs. AP#2 is a MikroTik hAP ac2 running 21.02 and the network in question is on VLAN 5. I suspect the VLAN stuff is borked; it has an internal switch and that uses some VLAN stuff too. I'm planning on testing a simplified config without VLANs this evening...

NB: I also posted at 802.11r Fast Transition how to understand that FT works? - #131 by wrtsurfr

Update: the plot thickens. I had to reflash so used 22.03. Simplest install I can figure out, without VLAN, and it shows the same problem. Turning FT off makes it all work.

Thanks for the reply.

Just a few questions:

  • I read that the device facing the problem is a Pixel 3a, is that right? Which Android version is it running?
  • I'm guessing that the ARP loss and DHCP loss are somewhat similar in the sense the packets are just not appearing back at the AP that the device is connected to. Did you check if by chance the ARP reply packets are appearing on your AP#1 when your device is connected to AP#2?
  • Did you try setting option reassociation_deadline '20000'? (I'll be trying this next and post an update soon)
  • What are the models of AP#1 and AP#3?

I'm so confused by packets not appearing; where are they disappearing to? I understand that after FT the device shouldn't need to send a dhcp-request, but could it be that the switch on the dumb APs is dropping them or something? Generally I feel like I'm missing something.

It just occurred to me that I have the following setup:
Main Router <---> Dumb AP 1 <---> Dumb AP 2
I'm going to try the following setup:
Dumb AP 1 <---> Main Router <---> Dumb AP 2

I'm thinking that if the Main Router has two separate ethernet NICs, positioning it in the middle might have an effect?

I just tried this setup with the same steps as in my first post. The main router does correctly pick which ethernet NIC to send the dhcp-replies out of, so it doesn't seem to be a problem with the main router. My guess is that the ethernet switch on the dumb AP is losing the packet. I need to think of a way to sniff the packets between the dumb AP that's losing them and the main router. I have looked at this post: Mini tutorial for DSA network config - #64 by erdoukki. I'm not familiar with tc but will try to understand the commands and what they do on my dumb AP.
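Without a managed switch, the tc `mirred` action can copy one interface's traffic to another port where a laptop runs tcpdump. The sketch below only builds and prints the commands, assuming the tc utility and the matchall/mirred kernel modules are installed on the AP; the interface names `lan1` and `lan4` are hypothetical placeholders. OpenWrt images rarely include Python, so the idea is to generate the commands elsewhere and paste them into the AP's shell. Only frames that actually pass through the CPU are visible this way; traffic switched entirely inside the switch chip is not.

```python
import shlex
import subprocess
from typing import List

Command = List[str]


def mirror_commands(src: str, dst: str) -> List[Command]:
    """Build tc commands that copy every frame received on and sent from `src`
    to `dst`, using a clsact qdisc, a matchall filter and the mirred action."""
    cmds: List[Command] = [["tc", "qdisc", "add", "dev", src, "clsact"]]
    for direction in ("ingress", "egress"):
        cmds.append([
            "tc", "filter", "add", "dev", src, direction,
            "matchall", "action", "mirred", "egress", "mirror", "dev", dst,
        ])
    return cmds


def run(cmds: List[Command], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical names: the AP's uplink port and a spare port with a laptop on it.
    run(mirror_commands("lan1", "lan4"))
```

Deleting the clsact qdisc afterwards (`tc qdisc del dev lan1 clsact`) removes both filters and stops the mirror.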

Just as a general thing, some devices just don't work well with fast roaming enabled. Do things work as expected if you disable 802.11r?

FWIW, a well tuned system can have excellent roaming without 802.11r at all. Obviously fast roaming should improve the speed and performance of the hand-off, but in many cases it actually proves to be detrimental (as a function of the client device support).

The Pixel 3a is running the latest security patch as of a couple of months ago; it's Android 12.
The missing packets are not appearing at the other AP, I did check that.
I have the reassociation deadline set to the value you mention.
AP#1 is a Gl.Inet AR750 (QCA9531 w/switch)
AP#2 is a MikroTik hAP ac2 (IPQ4018 w/switch)

If you have a smart switch, you can use port mirroring to monitor where the packets are coming out. I now have a set-up where I can see that better. My suspicion is that the packets are disappearing in the switch internal to the APs. I'll know more in a couple of hours...

By "devices" you mean STAs? I'm sure the pixel3a I'm using works well with FT. In fact, I do see it perform FT nicely. The breakage we're observing happens after FT completes.

If I disable FT the network works but the behavior doesn't. During my tests I've been walking back and forth between two rooms to cause the phone to transfer between the APs and I noticed that with FT enabled it transfers sooner than without. So yes, perhaps the transfer is fast enough without FT not to drop calls (the original issue I'm trying to solve) but the problem is that the transfer doesn't get initiated promptly enough 'cause the phone hangs on to the worsening AP for too long.

What do you call a "dummy AP"? A dummy is generally a non-functional prop. Do you mean "dumb AP" as in non-routing AP?

Yes. Even those that technically support the feature don't always actually work well.

This can be improved by carefully tuning your power levels on your APs (and ensuring that you have chosen channels that are non-overlapping and have sufficiently low noise/interference).

Yes, I know. I did that. It makes things better but doesn't make the problems go away. Also, I'm dealing with a large property, and if I make things better when walking in one direction it gets worse walking in the other due to differing distances and obstructions.
I hope you're not trying to imply "it's OK that FT got broken 'cause users should just spend time tuning AP power"?

no, I'm not implying that. But I can tell you that it also doesn't (or at least didn't) work well on Unifi APs with the Unifi firmware. I used to be active on their site, and the recommendation was to disable fast roaming. My dad's house is large with 5 Unifi APs and fast roaming disabled -- after tuning, things work well. But obviously tuning has its limitations as you have described.

OMG :joy: I can't believe I made that mistake, you're right it's dumb AP, gonna edit my posts not to confuse people here!

In my case fast roaming works from the first AP to the second: my ping between the device and another host doesn't lose a single packet! That part works as expected, but going back from the second AP to the first AP is where things get messed up. Even if I start the scenario with the second AP, the same behavior is observed.

I'm very curious about that, so I'm looking forward to your reply (still trying to figure out tc for port mirroring, which is making me wish I had a managed switch).

This stuff is sooo difficult to troubleshoot. I'm moving my phone from one tin box with an AP to the other trying to get stuff to happen, logging syslog and multiple tcpdumps, and then the phone just switches to cellular !@#$%^&*.

I applied some stuff recommended by 802.11r Fast Transition how to understand that FT works? - #134 by 72105 (mostly disabling dhcp and firewall services).

Also a difference from yesterday is that then the APs were in different rooms and I would observe AP#1 -> AP#2 (FT) resulting in no connectivity, then device going back to AP#1 and resuming operation.
Now with the tin cans I have:

  • AP#1 -> AP#2 (FT) resulting in no connectivity
  • STA disconnects and reconnects to AP#2 using 4-way-HS, no connectivity
  • STA repeats disconnect/reconnect, same outcome
  • STA eventually finds AP#1 and resumes normal operation

The syslog on AP#2 is as follows. I put blank lines between the phases above.

Sat Sep 10 10:48:14 2022 daemon.err hostapd: nl80211: kernel reports: key addition failed
Sat Sep 10 10:48:14 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: binding station to interface 'wlan1'
Sat Sep 10 10:48:14 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: authentication OK (FT)
Sat Sep 10 10:48:14 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-AUTHENTICATE.indication(58:cb:52:38:a3:bb, FT)
Sat Sep 10 10:48:14 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: association OK (aid 1)
Sat Sep 10 10:48:14 2022 daemon.info hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: associated (aid 1)
Sat Sep 10 10:48:14 2022 daemon.notice hostapd: wlan1: AP-STA-CONNECTED 58:cb:52:38:a3:bb
Sat Sep 10 10:48:14 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-REASSOCIATE.indication(58:cb:52:38:a3:bb)
Sat Sep 10 10:48:14 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: binding station to interface 'wlan1'
Sat Sep 10 10:48:14 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: event 6 notification
Sat Sep 10 10:48:15 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: FT authentication already completed - do not start 4-way handshake

Sat Sep 10 10:48:18 2022 daemon.notice hostapd: wlan1: AP-STA-DISCONNECTED 58:cb:52:38:a3:bb
Sat Sep 10 10:48:18 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: event 3 notification
Sat Sep 10 10:48:19 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.1X: unauthorizing port
Sat Sep 10 10:48:19 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: deauthenticated
Sat Sep 10 10:48:19 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-DEAUTHENTICATE.indication(58:cb:52:38:a3:bb, 3)
Sat Sep 10 10:48:19 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-DELETEKEYS.request(58:cb:52:38:a3:bb)
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: authentication OK (open system)
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-AUTHENTICATE.indication(58:cb:52:38:a3:bb, OPEN_SYSTEM)
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-DELETEKEYS.request(58:cb:52:38:a3:bb)
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: association OK (aid 1)
Sat Sep 10 10:48:21 2022 daemon.info hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: authenticated
Sat Sep 10 10:48:21 2022 daemon.info hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: associated (aid 1)
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-ASSOCIATE.indication(58:cb:52:38:a3:bb)
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-DELETEKEYS.request(58:cb:52:38:a3:bb)
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: binding station to interface 'wlan1'
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: event 1 notification
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: start authentication
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.1X: unauthorizing port
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: sending 1/4 msg of 4-Way Handshake
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: received EAPOL-Key frame (2/4 Pairwise)
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: sending 3/4 msg of 4-Way Handshake
Sat Sep 10 10:48:21 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: received EAPOL-Key frame (4/4 Pairwise)
Sat Sep 10 10:48:21 2022 daemon.notice hostapd: wlan1: AP-STA-CONNECTED 58:cb:52:38:a3:bb
Sat Sep 10 10:48:22 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.1X: authorizing port
Sat Sep 10 10:48:22 2022 daemon.info hostapd: wlan1: STA 58:cb:52:38:a3:bb RADIUS: starting accounting session 0F21276C4A92770B
Sat Sep 10 10:48:22 2022 daemon.info hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: pairwise key handshake completed (RSN)
Sat Sep 10 10:48:22 2022 daemon.notice hostapd: wlan1: EAPOL-4WAY-HS-COMPLETED 58:cb:52:38:a3:bb

Sat Sep 10 10:48:39 2022 daemon.notice hostapd: wlan1: AP-STA-DISCONNECTED 58:cb:52:38:a3:bb
Sat Sep 10 10:48:39 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: event 3 notification
Sat Sep 10 10:48:40 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.1X: unauthorizing port
Sat Sep 10 10:48:40 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: deauthenticated
Sat Sep 10 10:48:40 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-DEAUTHENTICATE.indication(58:cb:52:38:a3:bb, 3)
Sat Sep 10 10:48:40 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-DELETEKEYS.request(58:cb:52:38:a3:bb)
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: authentication OK (open system)
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-AUTHENTICATE.indication(58:cb:52:38:a3:bb, OPEN_SYSTEM)
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-DELETEKEYS.request(58:cb:52:38:a3:bb)
Sat Sep 10 10:48:42 2022 daemon.info hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: authenticated
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: association OK (aid 1)
Sat Sep 10 10:48:42 2022 daemon.info hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: associated (aid 1)
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-ASSOCIATE.indication(58:cb:52:38:a3:bb)
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-DELETEKEYS.request(58:cb:52:38:a3:bb)
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: binding station to interface 'wlan1'
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: event 1 notification
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: start authentication
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.1X: unauthorizing port
Sat Sep 10 10:48:42 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: sending 1/4 msg of 4-Way Handshake
Sat Sep 10 10:48:43 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: received EAPOL-Key frame (2/4 Pairwise)
Sat Sep 10 10:48:43 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: sending 3/4 msg of 4-Way Handshake
Sat Sep 10 10:48:43 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: received EAPOL-Key frame (4/4 Pairwise)
Sat Sep 10 10:48:43 2022 daemon.notice hostapd: wlan1: AP-STA-CONNECTED 58:cb:52:38:a3:bb
Sat Sep 10 10:48:43 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.1X: authorizing port
Sat Sep 10 10:48:43 2022 daemon.info hostapd: wlan1: STA 58:cb:52:38:a3:bb RADIUS: starting accounting session E11037D83998A4F0
Sat Sep 10 10:48:43 2022 daemon.info hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: pairwise key handshake completed (RSN)
Sat Sep 10 10:48:43 2022 daemon.notice hostapd: wlan1: EAPOL-4WAY-HS-COMPLETED 58:cb:52:38:a3:bb

Sat Sep 10 10:49:01 2022 daemon.notice hostapd: wlan1: AP-STA-DISCONNECTED 58:cb:52:38:a3:bb
Sat Sep 10 10:49:01 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb WPA: event 3 notification
Sat Sep 10 10:49:01 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.1X: unauthorizing port
Sat Sep 10 10:49:01 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: deauthenticated
Sat Sep 10 10:49:01 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-DEAUTHENTICATE.indication(58:cb:52:38:a3:bb, 3)
Sat Sep 10 10:49:01 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb MLME: MLME-DELETEKEYS.request(58:cb:52:38:a3:bb)
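The phases above can be recovered mechanically from the syslog, which helps when comparing many test runs. A minimal sketch, assuming the logread line format shown above: it labels each association of one STA as FT (the "authentication OK (FT)" line) or full 4-way (the "authentication OK (open system)" line) and pairs it with the next AP-STA-DISCONNECTED. The message strings come from the log; the function name and grouping logic are illustrative assumptions.

```python
import re
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

# "Sat Sep 10 10:48:14 2022 daemon.debug hostapd: <message>"
LINE_RE = re.compile(r"^(\w{3} \w{3} +\d+ [\d:]+ \d{4}) \S+ hostapd: (.*)$")
Phase = Tuple[str, datetime, Optional[datetime]]


def connection_phases(lines: Iterable[str], mac: str) -> List[Phase]:
    """Split hostapd syslog lines for one STA into (kind, start, end) tuples.

    kind is "FT" when the association began with FT authentication and
    "4-way" when it began with open-system authentication (followed by a
    full 4-way handshake in this log). end is None if the STA was still
    connected when the log ends.
    """
    phases: List[Phase] = []
    kind: Optional[str] = None
    start: Optional[datetime] = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or mac not in m.group(2):
            continue
        ts = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
        msg = m.group(2)
        if "authentication OK (FT)" in msg:
            kind, start = "FT", ts
        elif "authentication OK (open system)" in msg:
            kind, start = "4-way", ts
        elif "AP-STA-DISCONNECTED" in msg and kind is not None:
            phases.append((kind, start, ts))
            kind, start = None, None
    if kind is not None:
        phases.append((kind, start, None))
    return phases


if __name__ == "__main__":
    sample = [
        "Sat Sep 10 10:48:14 2022 daemon.debug hostapd: wlan1: STA 58:cb:52:38:a3:bb IEEE 802.11: authentication OK (FT)",
        "Sat Sep 10 10:48:18 2022 daemon.notice hostapd: wlan1: AP-STA-DISCONNECTED 58:cb:52:38:a3:bb",
    ]
    for kind, start, end in connection_phases(sample, "58:cb:52:38:a3:bb"):
        print(kind, start.time(), "->", end.time() if end else "still connected")
```

Fed the full log above, this reports one FT association lasting about 4 s followed by two 4-way associations of about 18 s and 19 s.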

Something that has changed from previous tests is that I see nothing from the STA on the AP's ethernet (or switch port). Dunno whether this has to do with disabling DHCP & firewall or what. Now I also need to tcpdump the wireless interface... It just takes a crazy amount of time to get a clean test run.

I had my dumb APs dhcp and firewall disabled from the start, checked the iptables and it seems ok for lan related rules. If you connect to AP#2 first, would there be connectivity?

Doing the same, just hoping to get over with this before my family gets back and questions my sanity :joy:.

Aaaaaaaand..... now it works! :boom: :upside_down_face: :thinking: :roll_eyes:

What changed? Either disabling the firewall, dnsmasq, and odhcpd made the difference, or I had some other config snafu that got corrected.

What works? I can now FT successfully between both APs (FT itself always worked I just lost connectivity). I did it several times and also started at both APs by rebooting the phone and letting it connect to a specific one first.

What's confusing is that "the other AP" still has the firewall and dnsmasq, so either "it's more complicated" or some other config thing was the culprit...

Before I can celebrate, I need to get the VLAN config back... we shall see...

Update: VLAN config is back in and works! Now I have 3 APs between which Fast Transition works seamlessly.
I have a 4th down in the crawlspace that I haven't tested.
And I have an outdoor one that is ~100m/300ft away from the house that doesn't play ball (seems to have the same symptoms I had earlier with "AP#2"). It's a TP-link EAP225 (I have no two same APs :roll_eyes:). So now I get to see which change makes it work... Pain is that it takes a walk to test...
And... I have another AP yet further away that is yet a different model and that I haven't tested yet.
Yeah, I'm a whole IT department...


Do you have option reassociation_deadline '20000' and option max_inactivity '15' set in your config?

Looks like my problem is with the Netgear WAC104 switch. I extended my test as follows (my dumb APs have both 2.4GHz and 5GHz ssids):

  1. FT from the first dumb AP to the second on the 5GHz ssid: FT works and connectivity holds without any significant drops (like in my first post).
  2. FT back from the second dumb AP to the first on the 5GHz ssid: FT appears to work in the logs, but as before dhcp-replies are dropped somewhere and eventually the device disconnects.
  3. On the device, select another 5GHz ssid (wlan1-1) on the first AP: it doesn't connect, dhcp-replies are still being dropped, and eventually the device disconnects.
  4. On the device, select a 2.4GHz ssid (wlan0) on the first AP: it doesn't connect, dhcp-replies are still being dropped, and eventually the device disconnects.
  5. Wait for quite some time (1 to 10 minutes) and then try step 3 or 4 again: finally dhcp-replies are seen on the first AP.

Going to try some other devices tomorrow and will update.
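One hypothesis that would fit step 5 (replies return only after minutes) and the daisy-chained layout Main Router <---> Dumb AP 1 <---> Dumb AP 2 is a stale entry in Dumb AP 1's switch forwarding table. While the phone sits on AP 2, AP 1's switch learns the phone's MAC on the port facing AP 2. If frames the phone later sends through AP 1's own radio do not refresh that entry (for example, if the switch does not learn on its CPU port), unicast replies keep heading toward AP 2 until the entry ages out. This is speculation, not a confirmed diagnosis. The toy model below only illustrates the mechanism; the 300 s ageing time is the Linux bridge default, and the port names and the no-learn CPU port are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

AGEING_S = 300.0  # Linux bridge default; hardware switch ageing times vary


@dataclass
class LearningSwitch:
    """Toy MAC-learning switch. `no_learn_ports` models a hypothetical
    port (e.g. the CPU port) on which source addresses are not learned."""
    no_learn_ports: Set[str] = field(default_factory=set)
    fdb: Dict[str, Tuple[str, float]] = field(default_factory=dict)  # mac -> (port, last seen)

    def receive(self, src_mac: str, port: str, now: float) -> None:
        """A frame from `src_mac` enters on `port`: learn or refresh it."""
        if port not in self.no_learn_ports:
            self.fdb[src_mac] = (port, now)

    def egress_port(self, dst_mac: str, now: float) -> Optional[str]:
        """Port a unicast frame to `dst_mac` is sent out of (None = flood)."""
        entry = self.fdb.get(dst_mac)
        if entry is None or now - entry[1] > AGEING_S:
            self.fdb.pop(dst_mac, None)  # unknown or aged out
            return None
        return entry[0]


if __name__ == "__main__":
    phone = "58:cb:52:38:a3:bb"
    ap1_switch = LearningSwitch(no_learn_ports={"cpu"})
    ap1_switch.receive(phone, "to_ap2", now=0)     # phone on AP 2, traffic transits AP 1
    ap1_switch.receive(phone, "cpu", now=10)       # phone back on AP 1's radio
    print(ap1_switch.egress_port(phone, now=12))   # to_ap2: reply sent the wrong way
    print(ap1_switch.egress_port(phone, now=320))  # None: aged out, reply flooded
```

The same model also explains why the first roam (AP 1 to AP 2) would work: with no learned entry, replies are flooded and still reach the phone. With `no_learn_ports` empty, the entry is refreshed at once and replies go to the CPU port immediately.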

I'll post my configs in a bit. But yes, with 20 instead of 15.
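The two options use different units, which makes the numbers easy to misread. hostapd's reassociation_deadline is given in 802.11 time units (1 TU = 1024 µs, documented range 1000..65535), while OpenWrt's max_inactivity is, as far as I know, passed to hostapd's ap_max_inactivity and is in seconds. A small conversion sketch (the function name is hypothetical):

```python
def reassoc_deadline_seconds(tus: int) -> float:
    """Convert hostapd's reassociation_deadline (802.11 TUs) to seconds.

    1 TU = 1024 microseconds; hostapd documents the range 1000..65535.
    """
    if not 1000 <= tus <= 65535:
        raise ValueError(f"reassociation_deadline out of range: {tus}")
    return tus * 1024 / 1_000_000


if __name__ == "__main__":
    for value in (1000, 20000):
        print(value, "TU =", reassoc_deadline_seconds(value), "s")
```

So '20000' allows roughly 20.5 s for reassociation, versus about 1 s for hostapd's default of 1000.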

I would try 22.03, lots of changes at that level IIUC. Are you using any VLANs?

Edit: also, you have disabled the firewall service and removed /etc/config/firewall, correct?

I'm not exactly ready for fw4; even though this is a dumb AP, I want to understand it before I run 22.03 on anything. And I'm not using VLANs yet.

The firewall service is disabled, but the point is that we ought to be working within the LAN zone only, and by default LAN-to-LAN traffic is surely allowed. Removing /etc/config/firewall sounds a bit extreme (I'm guessing it's similar to flushing iptables), and it isn't even mentioned here: https://openwrt.org/docs/guide-user/network/wifi/dumbap. I'll check what exactly it does so I don't brick my AP.

To be honest, I have 7 dumb APs and only one has the firewall removed, etc. I just did several walk arounds and I have FT all the way! So my suspicion at this point is that the issue I had wasn't the firewall or dnsmasq but some integrated switch + VLAN fubar. The switch + VLAN stuff clearly still has bugs or "features", some possibly due to a combination of firmware issues/limitations and incomplete/poor documentation.

What is your /etc/config/network on the AP that has issues?

I have the same problem. I have three routers. I don't use vlan.