5 GHz Interface Disabled Randomly / Kernel Crash

Background: I'm using a TP-Link TL-WR902AC (AC750) v1 running OpenWrt 19.07.7. I'm using the 5 GHz radio to connect to xfinitywifi as WAN, and both the 5 GHz and 2.4 GHz radios for LAN. I use stubby for DoT name resolution.

For the past few months, my 5 GHz radio has stopped working every week or two (while the 2.4 GHz radio continues working), and I haven't been able to determine what triggers it or how to consistently fix it.

When this happens, the Wireless section in luci appears as follows:

I have tried running "/etc/init.d/network restart", restarting the 5 GHz radio in luci, rebooting the router through luci, unplugging it for a minute and plugging it back in, and removing and recreating the 5 GHz networks. It's always some combination of those things plus 10 minutes to 1 hour, and then it starts working again.

When this happens, I can't catch in the system logs what happened initially, since it dumps kernel warnings, pushing information prior to the event out of the backlog. But here's what I could capture:

System Log: https://pastebin.com/tzHgvBPa

Kernel Log: https://pastebin.com/zshzpwp9

I'd be happy to provide any other information that might be needed to help diagnose this issue. Thanks in advance for any assistance!

@Ryan-Goldstein, welcome to the community!

I'd surmise your xfinitywifi 5 GHz channel is changing. Do you control that access point, so that you could set a 5 GHz channel that's not DFS (I'm guessing no)?

Thanks for the response. No, I don't control the xfinitywifi access point (it's a public network to which anyone with a Comcast account can connect and access the internet after entering their username/password, which gets associated with the client's MAC address), but I don't think the channel changing is the issue.

When this issue occurs, the TP-LINK_6303_5G network itself stops broadcasting; all devices connected to it disconnect, and it no longer appears in the list of available networks on any clients. To troubleshoot further when that happens, I need to connect to the TP-LINK_6303 network, broadcast using the 2.4 GHz radio, which is the only one that remains functional.

From my rudimentary understanding of the system logs, it appears that something at a low level is crashing, hence the many kernel lines, including stack dumps, call traces, and messages like:

kern.err kernel: [138796.502131] wlan0-1: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-5)

kern.err kernel: [138796.845403] wlan0-1: failed to remove key (2, ff:ff:ff:ff:ff:ff) from hardware (-5)

kern.info kernel: [138797.176869] br-lan: port 3(wlan0-1) entered disabled state

kern.warn kernel: [138797.517100] ath10k_pci 0000:00:00.0: could not suspend target (-70)

daemon.err hostapd: Failed to set beacon parameters

daemon.notice hostapd: wlan0-1: INTERFACE-DISABLED
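To catch the first occurrence of messages like these before the backlog rolls over, one option is to stream the log through a small filter and append the matches to a file. This is only a sketch: the function name `wifi_fault_lines` and the output path are made up for illustration, and the patterns are simply copied from the lines quoted above.

```sh
#!/bin/sh
# Sketch: pass through only log lines that look related to the 5 GHz fault.
# Patterns are taken from the messages quoted in this thread; adjust as needed.
wifi_fault_lines() {
    while IFS= read -r line; do
        case "$line" in
            *ath10k_pci*|*"Failed to set beacon parameters"*|*INTERFACE-DISABLED*|*"failed to remove key"*)
                # printf writes each match immediately, so nothing sits in a buffer
                printf '%s\n' "$line"
                ;;
        esac
    done
}

# Hypothetical usage on the router (not executed here):
#   logread -f | wifi_fault_lines >> /tmp/wifi-fault.log &
```

Note that /tmp is RAM-backed on OpenWrt, so the file would need to be copied off before a reboot or power cycle, and it will keep growing for as long as the messages repeat.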

Correct, this is what happens when the upstream AP changes channels: the downstream has to disconnect as well.

:thinking: This is different though...

This is a Kernel regression bug from my reading...but...

:smiley: It occurs at normal disconnects.

The call traces are odd though...

I wonder if others have some ideas...

I really appreciate your time helping with this!

Maybe showing the Interfaces will help clarify:

LAN is a bridged connection between the router's Ethernet port, the 2.4 GHz network (TP-LINK_6303), and the 5 GHz network (TP-LINK_6303_5G), whereas WWAN is the 5 GHz connection to xfinitywifi.

There have certainly been instances when the xfinitywifi network went down for a few hours, but my local network remained functional, including both radios; only the upstream connection to the Internet stopped working.

The issue that seems to happen every 1-2 weeks is different, in that the 5 GHz radio seems to stop working entirely: the TP-LINK_6303_5G network is no longer visible, and the connection to xfinitywifi is not working either. I think the 5 GHz radio stops broadcasting entirely when this happens, and only some combination of "/etc/init.d/network restart", rebooting the router, removing power from the router, and waiting from 10 minutes to an hour results in it broadcasting again, after which everything reconnects as expected.

Again, I appreciate your help, and anyone else's input or additional troubleshooting steps would be appreciated as well! Since I haven't been able to nail down what triggers this problem, or precisely what fixes it, it'll be difficult to conclusively troubleshoot, but again, any ideas are welcome.

The same thing happened again overnight last night. I was able to fix it this time by connecting to the 2.4 GHz network (since the 5 GHz network disappeared) and rebooting the router through luci. Again, I wasn't able to catch it in the system or kernel logs when this happened, because more recent messages cleared out the backlog. But here are the logs I could grab before rebooting the router:

System Log: https://pastebin.com/TqDWJdWy

Kernel Log: https://pastebin.com/pay0HkKn

The notable items in the system log are repeated "daemon.err hostapd: Failed to set beacon parameters" and "ath10k_pci 0000:00:00.0: failed to send pdev bss chan info request: -143" messages, and a large number of (expected) stubby errors due to not being connected to the xfinitywifi upstream: "daemon.err stubby[27908]: Could not schedule query: None of the configured upstreams could be used to send queries on the specified transports"

The kernel log had nothing but "ath10k_pci 0000:00:00.0: failed to send pdev bss chan info request: -143" messages every few seconds.
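Since that message repeats every few seconds once the radio is in this state, a periodic check could at least record roughly when it starts. The following is a minimal sketch, not a tested fix: `radio_looks_wedged` is a hypothetical helper, and the default threshold of 5 is an arbitrary guess.

```sh
#!/bin/sh
# Sketch: succeed (exit status 0) when stdin contains at least $1 lines
# (default 5) of the repeating ath10k message quoted above.
radio_looks_wedged() {
    threshold="${1:-5}"
    # grep -c prints 0 and exits non-zero when nothing matches; ignore that status
    count=$(grep -c 'failed to send pdev bss chan info request' || true)
    [ "$count" -ge "$threshold" ]
}

# Hypothetical usage from a cron job on the router (not executed here):
#   if logread | tail -n 50 | radio_looks_wedged 5; then
#       logger -t wifi-watch "5 GHz radio appears stuck"
#       wifi    # reloads the wireless configuration
#   fi
```

Given that restarting the network has not reliably brought the radio back so far, the `wifi` call may well not help; the logged timestamp is the more useful part.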

This certainly seems like an OpenWrt bug to me. If anyone has any ideas about what else to try or look for, I can do so the next time my router gets into this state.

Looking into this further, it appears that other people have experienced this over the years, and it may be an issue with the ath10k driver, which is purportedly resolved by using ath10k-ct. Sources: FS#333, "ath10k_pci: crash after ~10d uptime", and the upstream kernel bug "Kernel crashes with ath10k radio in AP mode and Nexus 5X as client".

I'm currently using the ar71xx/generic platform (19.07.7). I see in the firmware selector that there's a 21.02 snapshot using the ath79/generic platform, purportedly for my router, which was built about 12 hours ago. It's not clear to me whether that new platform uses ath10k-ct or would otherwise fix this bug, and I don't want to risk bricking my router or causing other issues. The wiki page for the router shows the current release as 19.07.7.

Is it worth trying to update to the 21.02 snapshot?
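One way to see which ath10k driver an installed image actually uses is to look at the package list. The sketch below assumes the kernel module packages are named `kmod-ath10k` and `kmod-ath10k-ct` (anything starting with `kmod-ath10k-ct` is treated as the -ct driver) and that `opkg list-installed` prints one `name - version` entry per line; `ath10k_variant` is a made-up helper name.

```sh
#!/bin/sh
# Sketch: read `opkg list-installed` output on stdin and print which ath10k
# kernel module package is present: "ath10k-ct", "ath10k", or "none".
# The package names are assumptions; check them against your own package list.
ath10k_variant() {
    variant="none"
    while IFS= read -r line; do
        case "$line" in
            kmod-ath10k-ct*)
                variant="ath10k-ct"
                ;;
            "kmod-ath10k "*)
                # only report plain ath10k if no -ct package has been seen
                if [ "$variant" = "none" ]; then
                    variant="ath10k"
                fi
                ;;
        esac
    done
    printf '%s\n' "$variant"
}

# Hypothetical usage on the router (not executed here):
#   opkg list-installed | ath10k_variant
```

Running this on the current 19.07.7 install would show which driver is in use today, and so whether moving to ath10k-ct would be a change at all.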