Authentication timed out - Raspberry Pi 4B

OK, I'm in the process of a bit of a headscratcher here. After not updating OpenWRT for a while, I rebuilt it about a week ago and flashed my test device. Since then, it is only able to connect to wifi for 10 seconds or so approximately every 2 minutes. From another machine, here's a snippet of ping results:

64 bytes from <redacted>: icmp_seq=1100 ttl=61 time=6.81 ms
64 bytes from <redacted>: icmp_seq=1101 ttl=61 time=2.46 ms
64 bytes from <redacted>: icmp_seq=1102 ttl=61 time=2.29 ms
64 bytes from <redacted>: icmp_seq=1103 ttl=61 time=2.30 ms
64 bytes from <redacted>: icmp_seq=1104 ttl=61 time=2.53 ms
64 bytes from <redacted>: icmp_seq=1105 ttl=61 time=2.38 ms
64 bytes from <redacted>: icmp_seq=1106 ttl=61 time=2.39 ms
64 bytes from <redacted>: icmp_seq=1107 ttl=61 time=2.87 ms
64 bytes from <redacted>: icmp_seq=1108 ttl=61 time=2.66 ms
64 bytes from <redacted>: icmp_seq=1109 ttl=61 time=2.59 ms
64 bytes from <redacted>: icmp_seq=1231 ttl=61 time=120 ms
64 bytes from <redacted>: icmp_seq=1232 ttl=61 time=2.34 ms
64 bytes from <redacted>: icmp_seq=1233 ttl=61 time=2.85 ms
64 bytes from <redacted>: icmp_seq=1234 ttl=61 time=2.76 ms
64 bytes from <redacted>: icmp_seq=1235 ttl=61 time=2.40 ms
64 bytes from <redacted>: icmp_seq=1236 ttl=61 time=2.39 ms
64 bytes from <redacted>: icmp_seq=1237 ttl=61 time=2.12 ms
64 bytes from <redacted>: icmp_seq=1238 ttl=61 time=2.30 ms
64 bytes from <redacted>: icmp_seq=1239 ttl=61 time=2.43 ms

In system.log, I see repeatedly:

Tue May 16 13:33:01 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: CTRL-EVENT-SSID-REENABLED id=0 ssid="<REDACTED>"
Tue May 16 13:33:01 2023 daemon.notice wpa_supplicant[1409]: BSSID <REDACTED> ignore list count incremented to 3, ignoring for 60 seconds
Tue May 16 13:33:01 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: Trying to associate with SSID 'Automation'
Tue May 16 13:33:04 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: Associated with <REDACTED>
Tue May 16 13:33:04 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: CTRL-EVENT-SUBNET-STATUS-UPDATE status=0
Tue May 16 13:33:14 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: Authentication with <REDACTED> timed out.
Tue May 16 13:33:14 2023 daemon.notice wpa_supplicant[1409]: BSSID bc:26:c7:fe:4f:40 ignore list count incremented to 4, ignoring for 120 seconds
Tue May 16 13:33:14 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: CTRL-EVENT-DISCONNECTED bssid=bc:26:c7:fe:4f:40 reason=3 locally_generated=1
Tue May 16 13:33:14 2023 daemon.notice wpa_supplicant[1409]: BSSID <REDACTED> ignore list count incremented to 5, ignoring for 600 seconds
Tue May 16 13:33:14 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: CTRL-EVENT-SSID-TEMP-DISABLED id=0 ssid="<REDACTED>" auth_failures=12 duration=120 reason=CONN_FAILED
Tue May 16 13:33:14 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: CTRL-EVENT-DSCP-POLICY clear_all

So here's what is puzzling me: If authentication is timing out, how in the heck is this thing passing traffic?

The interval between "Associated" and "Authentication timed out" is 10 seconds - which is almost exactly the duration for which the interface actually passed traffic.

Other OpenWRT devices running an older Pi4 or Pi3 build connect to the same network with no issues whatsoever - this appears to be a regression that crept in sometime within the past month or so. (Sadly, I haven't been updating and rebuilding quite as often as I used to. I also made the rather stupid mistake of not backing up the last "known working" factory and sysupgrade images, although worst case I can probably mangle some together by dding from a working device...)

Has anyone seen a scenario where wpa_supplicant disconnects with an "Authentication timed out" error, but the interface passed traffic prior to the timeout, indicating that it was properly authenticated (otherwise it would not have been able to pass traffic)?

The 2 minute duration is consistent with:
Tue May 16 13:33:14 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: CTRL-EVENT-SSID-TEMP-DISABLED id=0 ssid="<REDACTED>" auth_failures=12 duration=120 reason=CONN_FAILED

What version were you running previously? What is the upgraded version?
Did you preserve settings or did you start fresh from a default configuration and configure from scratch?

Previously - git master as of approximately 1-2 months ago. I might be able to get an approximation from the version string of a "working" device, but keep in mind that there are about 10-12 patches on top of the OpenWRT base so the version string won't be exactly correct.

Currently my base is git revision https://github.com/openwrt/openwrt/tree/e882af2850ff7077f610eaf0ecee18c371ddf98f - again, 10ish patches on top of it.

Settings were preserved, but the first thing I did when seeing issues was delete and recreate the configuration for connecting to the network. I've triple checked that the WPA2 passphrase is correct - and also tested and verified that when the WPA2 passphrase is incorrect I get different results:

Tue May 16 13:54:31 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: Trying to associate with SSID 'Automation'
Tue May 16 13:54:36 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: CTRL-EVENT-ASSOC-REJECT bssid=bc:26:c7:fe:4f:40 status_code=16
Tue May 16 13:54:40 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: Trying to associate with SSID 'Automation'
Tue May 16 13:54:47 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: CTRL-EVENT-ASSOC-REJECT bssid=bc:26:c7:fe:4f:40 status_code=16
Tue May 16 13:54:47 2023 daemon.notice wpa_supplicant[1409]: BSSID bc:26:c7:fe:4f:40 ignore list count incremented to 2, ignoring for 10 seconds
Tue May 16 13:54:52 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: Trying to associate with SSID 'Automation'
Tue May 16 13:54:59 2023 daemon.notice wpa_supplicant[1409]: phy0-sta0: CTRL-EVENT-ASSOC-REJECT bssid=bc:26:c7:fe:4f:40 status_code=16
Tue May 16 13:54:59 2023 daemon.notice wpa_supplicant[1409]: BSSID bc:26:c7:fe:4f:40 ignore list count incremented to 3, ignoring for 60 seconds

Do you specifically need/want to run Master, as compared to running a standard release build of 22.03?

Per my post, I'm maintaining around 10-12 local patches (including https://github.com/openwrt/openwrt/pull/9843 since no one has bothered to review that for an entire year), and I don't want those patches to age out. It takes a HELL of a lot longer to figure out what changes need to be made when rebasing a patchset after not touching it for a year or two. I can definitely NOT run anything from the buildbox - it won't work in the use case for these units. I've been rebasing these patchsets every month or more for the past 2-3 years with minimal issues, and giving up and sticking my head in the sand instead of fixing the problem is not an option.

I don't want to be stuck on an old version forever if this issue persists to the next release, so I want to figure out where to look to resolve it sooner rather than later.

Working on getting some wpa_supplicant DEBUG logs, gotta bump up the log size in /etc/config/system and reboot

Attempting to downgrade to https://github.com/openwrt/openwrt/commit/e11d00d44c66b1534fbc399fda55951cd0a2168a which is just before https://github.com/openwrt/openwrt/commit/e722b667c5a59bbd5aff787002f7f5457d0bf721 (likely candidate for wifi regressions), hopefully the new widespread dependency of feeds/packages on nonexistent libxml2 doesn't become a problem, I don't think I've selected any of the affected packages

Edit: e11d00d44c66b1534fbc399fda55951cd0a2168a works

Edit 2: e722b667c5a59bbd5aff787002f7f5457d0bf721 also works, so... not the usual suspect here

So the problem was introduced sometime between April 18 and May 9

Edit 3: Next suspect is https://github.com/openwrt/openwrt/commit/304423a4ff841129f1b6b7504c83dd11c1f71c84, testing https://github.com/openwrt/openwrt/commit/e1242fc3390aa8b50d7685ce49f5b35f15d5b424 now

Edit 4:
e1242fc3390aa8b50d7685ce49f5b35f15d5b424 is good

Edit 5:
Winner winner, chicken dinner - 304423a4ff841129f1b6b7504c83dd11c1f71c84 (hostapd/wpa_supplicant upgrade) is the culprit. Not sure WHAT within that changeset is the root cause, but given the logs I'm having indicating some sort of weird 802.1X EAPOL failure (despite this being a WPA2-PSK network that doesn't use 802.1X) and the patch name "170-wpa_supplicant-fix-compiling-without-IEEE8021X_EAPOL.patch" I suspect it's something near what that patch touched... Done for the evening, will be severely limited as far as further time I can spend on this over the next few days.