[21.02] EAP235-Wall: hostapd 'Out of memory' after a few days uptime

I have a cronjob that restarts WiFi at night, because once the radio gets kicked off a DFS channel it won't try to switch back. The AP is a TP-Link EAP235-Wall v1 with a MediaTek MT7613BE 5 GHz 802.11ac Wave 2 radio, running 21.02 HEAD.

After a few days uptime, I'm seeing this (radio disappears):

Sun Mar 21 06:00:01 2021 daemon.notice hostapd: Configuration file: /var/run/hostapd-phy1.conf (phy wlan1) --> new PHY
Sun Mar 21 06:00:01 2021 daemon.err hostapd: Could not set interface wlan1 flags (UP): Out of memory
Sun Mar 21 06:00:01 2021 daemon.err hostapd: nl80211: Could not set interface 'wlan1' UP
Sun Mar 21 06:00:01 2021 daemon.notice hostapd: nl80211: deinit ifname=wlan1 disabled_11b_rates=0
Sun Mar 21 06:00:01 2021 daemon.err hostapd: nl80211 driver initialization failed.
Sun Mar 21 06:00:01 2021 daemon.notice hostapd: wlan1: CTRL-EVENT-TERMINATING
Sun Mar 21 06:00:01 2021 daemon.err hostapd: hostapd_free_hapd_data: Interface wlan1 wasn't started
Sun Mar 21 06:00:01 2021 daemon.notice netifd: radio1 (10421): Command failed: Invalid argument
Sun Mar 21 06:00:01 2021 daemon.notice netifd: radio1 (10421): Device setup failed: HOSTAPD_START_FAILED
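
The failure signature in the log above is easy to match on, which could be useful for a watchdog until the root cause is fixed. The following is only a sketch: the function name is made up for illustration, and reacting with a reboot is an assumption based on the observation that nothing short of a reboot brings the radio back.

```shell
# Succeeds (exit status 0) if the log text on stdin contains one of the
# failure lines from the report above; fails otherwise.
# "radio_start_failed" is a hypothetical helper name.
radio_start_failed() {
    grep -q -e 'flags (UP): Out of memory' -e 'HOSTAPD_START_FAILED'
}

# Hypothetical use on the AP, e.g. from a cronjob shortly after the restart:
#   logread | radio_start_failed && reboot
```

Since the system log lives in RAM on a default install, a reboot clears the matched lines, so a check like this should not loop.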

Killing hostapd doesn't help, and neither does wifi up (which is what the cronjob runs anyway). Memory looks fine: no OOM killer in the logs, and plenty of RAM still available from what I can tell.

# free -m
              total        used        free      shared  buff/cache   available
Mem:         122512       27368       73848         524       21296       59660
Swap:             0           0           0
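
Note that the figures above look like KiB rather than MiB (a total of 122512 is about 120 MiB), presumably because this BusyBox build ignores -m; that is an assumption, not something verified on the device. A minimal sketch for reading the same numbers in MiB straight from /proc/meminfo, with a made-up helper name:

```shell
# Print MemTotal, MemFree and MemAvailable in whole MiB (rounded down).
# Expects /proc/meminfo-style input on stdin, where values are in kB.
# "meminfo_mib" is a hypothetical helper name.
meminfo_mib() {
    awk '$1 == "MemTotal:" || $1 == "MemFree:" || $1 == "MemAvailable:" {
        printf "%s %d MiB\n", $1, $2 / 1024
    }'
}

# On the device:
#   meminfo_mib < /proc/meminfo
```

Either way, roughly 58 MiB available is far from exhausted, so the "Out of memory" error more likely comes from an allocation inside the driver than from the system running out of RAM.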

The only solution is a reboot of the device (EAP235-Wall v1). This might be related to the radio dying; the following also pops up once in a while:

Fri Mar 19 11:49:49 2021 kern.err kernel: [79757.928007] mt7615e 0000:02:00.0: Message 73 (seq 12) timeout
Fri Mar 19 11:50:16 2021 kern.err kernel: [79784.551749] mt7615e 0000:02:00.0: Message 73 (seq 13) timeout
Fri Mar 19 11:50:42 2021 kern.err kernel: [79811.175491] mt7615e 0000:02:00.0: Message 73 (seq 14) timeout
Fri Mar 19 11:51:09 2021 kern.err kernel: [79837.799231] mt7615e 0000:02:00.0: Message 73 (seq 15) timeout
Fri Mar 19 11:51:36 2021 kern.err kernel: [79864.422975] mt7615e 0000:02:00.0: Message 73 (seq 1) timeout
Fri Mar 19 11:52:02 2021 kern.err kernel: [79891.046714] mt7615e 0000:02:00.0: Message 73 (seq 2) timeout
Fri Mar 19 11:52:29 2021 kern.err kernel: [79917.670470] mt7615e 0000:02:00.0: Message 73 (seq 3) timeout
Fri Mar 19 11:52:55 2021 kern.err kernel: [79944.294194] mt7615e 0000:02:00.0: Message 73 (seq 4) timeout
Fri Mar 19 11:53:22 2021 kern.err kernel: [79970.917940] mt7615e 0000:02:00.0: Message 73 (seq 5) timeout
Fri Mar 19 11:53:49 2021 kern.err kernel: [79997.541683] mt7615e 0000:02:00.0: Message 73 (seq 6) timeout
Fri Mar 19 11:54:15 2021 kern.err kernel: [80024.165426] mt7615e 0000:02:00.0: Message 73 (seq 7) timeout
Fri Mar 19 11:54:42 2021 kern.err kernel: [80050.789172] mt7615e 0000:02:00.0: Message 73 (seq 8) timeout
Fri Mar 19 11:55:09 2021 kern.err kernel: [80077.412913] mt7615e 0000:02:00.0: Message 73 (seq 9) timeout
Fri Mar 19 11:55:35 2021 kern.err kernel: [80104.036656] mt7615e 0000:02:00.0: Message 73 (seq 10) timeout
Fri Mar 19 11:56:02 2021 kern.err kernel: [80130.660401] mt7615e 0000:02:00.0: Message 73 (seq 11) timeout
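
The kernel timestamps above show the timeouts recurring at a steady interval of about 26.6 seconds. A small sketch for extracting those gaps from a log, in case the interval changes between driver versions; the function name is hypothetical, and the pattern assumes the message format shown above:

```shell
# Print the gap in seconds between consecutive mt7615e timeout messages.
# Reads kernel log lines on stdin (e.g. from logread or dmesg).
# "timeout_gaps" is a hypothetical helper name.
timeout_gaps() {
    awk '/mt7615e .* timeout/ {
        # The kernel timestamp is the bracketed number, e.g. [79757.928007]
        if (match($0, /\[ *[0-9]+\.[0-9]+\]/)) {
            ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (seen) printf "%.1f\n", ts - prev
            prev = ts
            seen = 1
        }
    }'
}

# On the device:
#   logread | timeout_gaps
```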

Try the latest mt76 to see if 'Message 73 (seq 11) timeout' has been solved.

They're already running on all three, thanks. I'm keeping an eye on them.

@ryderlee1110 I'm on the 2021-03-26 mt76 version and haven't seen any of the timeout messages so far. Both MT7613BE radios have seen intermittent use.

I did notice, however, that I had to restart the wifi four or five times within a few hours because my Windows laptop would suddenly complain that it was connected to the AP but had no internet connectivity. I couldn't see any pointers in the logs, and Windows 10 wouldn't reconnect until I had restarted the OpenWrt wireless interface.

Reporting back here. Still on 2021-03-26, and no timeout messages so far, radios still up, access points still usable. Already pinged nbd about pushing it to 21.02.

@ryderlee1110 Unfortunately, I'm seeing this "Message 73 timeout" pop up again. This time it took about a week to appear. That's with mt76 2021-03-26. I will try the latest 21.02 (which for now seems on par with OpenWrt master).

Is there any channel on which it works reliably or is the issue unrelated to channels / channel width?

This is a transmit queue issue as documented here:

I've made updated driver packages for 21.02 available here, can you test them and report back? They include a hack that forcibly empties the buffers.