Zyxel NBG6817 - Unstable on 5.15 kernels

lss4 · November 9, 2022, 12:26pm

I'm not sure what's causing this, but with an OpenWrt snapshot I built last week the router randomly stops working. Sometimes it recovers on its own, but other times I have to power-cycle it manually.

The router stopped working twice just now. After the first time, I reopened the UDP log output to a PC in order to capture it, and a few minutes later the router stopped working again. This time I captured the messages below, which somewhat resemble those mentioned here, though that mail was about ipq40xx and the issue there appeared to have been resolved by an update.

ath10k_pci 0001:01:00.0: bss channel survey timed out
ath10k_pci 0001:01:00.0: Cannot communicate with firmware, previous wmi cmds: 36892:369240 36892:369240 36892:369240 36872:369240, jiffies: 369544, attempting restart restart firmware, dev-flags: 0 x142
ath10k_pci 0001:01:00.0: failed to recalculate rts/cts prot for vdev 1: -11
ath10k_pci 0001:01:00.0: failed to set erp slot for vdev 1: -108
ath10k_pci 0001:01:00.0: failed to set preamble for vdev 1: -108
ath10k_pci 0001:01:00.0: failed to set mgmt tx rate -108
ath10k_pci 0000:01:00.0: bss channel survey timed out
ath10k_pci 0000:01:00.0: Cannot communicate with firmware, previous wmi cmds: 36892:369856 36892:369856 36892:369856 36872:369856, jiffies: 370160, attempting restart restart firmware, dev-flags: 0 x142
ath10k_pci 0000:01:00.0: failed to recalculate rts/cts prot for vdev 0: -11
ath10k_pci 0000:01:00.0: failed to set preamble for vdev 0: -108
ath10k_pci 0000:01:00.0: failed to set mgmt tx rate -108
ath10k_pci 0001:01:00.0: failed to set beacon mode for vdev 0: -108
ath10k_pci 0001:01:00.0: failed to set dtim period for vdev 0: -108
ath10k_pci 0001:01:00.0: failed to set cts protection for vdev 0: -108
ath10k_pci 0001:01:00.0: failed to recalculate rts/cts prot for vdev 0: -108
ath10k_pci 0001:01:00.0: failed to set erp slot for vdev 0: -108
ath10k_pci 0001:01:00.0: failed to set preamble for vdev 0: -108
ath10k_pci 0001:01:00.0: failed to set mgmt tx rate -108
ath10k_pci 0001:01:00.0: failed to send pdev bss chan info request: -108
ath10k_pci 0001:01:00.0: failed to set beacon mode for vdev 1: -108
ath10k_pci 0001:01:00.0: failed to set dtim period for vdev 1: -108
ath10k_pci 0001:01:00.0: failed to set cts protection for vdev 1: -108
ath10k_pci 0001:01:00.0: failed to recalculate rts/cts prot for vdev 1: -108
ath10k_pci 0001:01:00.0: failed to set erp slot for vdev 1: -108
ath10k_pci 0001:01:00.0: failed to set preamble for vdev 1: -108
ath10k_pci 0001:01:00.0: failed to set mgmt tx rate -108
ath10k_pci 0000:01:00.0: failed to send pdev bss chan info request: -108
ath10k_pci 0000:01:00.0: failed to send pdev bss chan info request: -108
ath10k_pci 0001:01:00.0: failed to send pdev bss chan info request: -108
ath10k_pci 0001:01:00.0: failed to send pdev bss chan info request: -108
ath10k_pci 0001:01:00.0: failed to send pdev bss chan info request: -108
ath10k_pci 0001:01:00.0: failed to send pdev bss chan info request: -108
ath10k_pci 0000:01:00.0: failed to send pdev bss chan info request: -108
ath10k_pci 0000:01:00.0: failed to set beacon mode for vdev 0: -108
ath10k_pci 0000:01:00.0: failed to set dtim period for vdev 0: -108
ath10k_pci 0000:01:00.0: failed to set cts protection for vdev 0: -108
ath10k_pci 0000:01:00.0: failed to recalculate rts/cts prot for vdev 0: -108
ath10k_pci 0000:01:00.0: failed to set preamble for vdev 0: -108
ath10k_pci 0000:01:00.0: failed to set mgmt tx rate -108
ath10k_pci 0001:01:00.0: failed to send pdev bss chan info request: -108

Note that "bss channel survey timed out" is the first line of all the kernel error messages I captured via UDP when the router stopped working. When that happens, the browser errors out immediately when I try to access LuCI, and ping gives me destination route unreachable errors. The UDP log output also gets flooded with these messages, so the captured log file grows faster than usual.
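In case it helps anyone reading the log: as far as I can tell, the trailing numbers are negated Linux errno values, so -11 would be EAGAIN and -108 would be ESHUTDOWN, which fits a driver that can no longer talk to its firmware. Below is a rough Python sketch for tallying a captured log per PCI device and error code. The `summarize` helper and the regex are my own, not part of any OpenWrt tool, and the errno table only covers the two codes seen here.

```python
import re
from collections import Counter

# Assumption: these are the generic Linux errno names for the two codes in the log.
ERRNO_NAMES = {11: "EAGAIN", 108: "ESHUTDOWN"}

# Matches lines such as
#   ath10k_pci 0001:01:00.0: failed to set preamble for vdev 1: -108
#   ath10k_pci 0001:01:00.0: failed to set mgmt tx rate -108
# and also lines without a trailing error code.
LINE_RE = re.compile(
    r"^ath10k_pci (?P<dev>[0-9a-f:.]+): (?P<msg>.*?)(?::?\s(?P<code>-\d+))?$"
)


def summarize(lines):
    """Count ath10k messages per (PCI device, error name)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an ath10k_pci line
        code = match.group("code")
        if code is None:
            name = "no code"
        else:
            # Unknown codes are kept as the raw number string.
            name = ERRNO_NAMES.get(-int(code), code)
        counts[(match.group("dev"), name)] += 1
    return counts
```

Feeding it the lines of the captured file (for example `summarize(open(path))`, with `path` being wherever the capture was saved) gives a quick view of which radio produced how many failures of each kind.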

I should note that I don't recall seeing this issue with a snapshot built last month (Oct 5 to be precise), so there may be some kind of regression. I searched the forum but couldn't find anything relevant. I can't be really sure about the older snapshot, though, as I only turned on UDP logging just now, after the issue had become apparent. Even though I already had the router writing logs to a file, it was almost impossible to capture the error messages when the router crashes.

I'm going to build another image and see if the particular issue persists. It's been about an hour since the last crash and the router is still working okay just as I write this post...

lss4 · November 11, 2022, 12:36am

Bump on this. The issue still persists on current snapshot (files dated Nov 8), kernel 5.15.77. The router is still rebooting on its own at some point.

I couldn't capture any useful logs this time, as it looks like capturing the very moment the router crashed/restarted is harder than I thought. Quite often the router fails to actually transmit the log contents over UDP to the PC I use to capture.

The PC I use to capture the logs uses a static IP, not DHCP, so normally this shouldn't fail, but when I check the router's System Log I still see messages saying it failed to transmit logs over UDP.

I'm using nc on the target PC to capture UDP traffic and dump the output to log files for analysis. Even though I never closed the process/port, the router may still fail to transmit log output, so nothing gets written to the output file.
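For reference, a small Python listener could stand in for nc here. This is only a sketch of the idea, not what I actually ran: the `capture` function is my own naming, and the port and file name in the usage note are placeholders. Its one advantage over a plain nc dump is that each datagram gets a local timestamp, which makes it easier to see when the last message arrived before the router went silent.

```python
import socket
from datetime import datetime


def capture(sock, out, max_packets=None):
    """Write each UDP datagram received on `sock` to `out` as one timestamped line.

    Runs until `max_packets` datagrams have been handled, or forever if it is None.
    Returns the number of datagrams written.
    """
    seen = 0
    while max_packets is None or seen < max_packets:
        data, (host, _port) = sock.recvfrom(65535)
        stamp = datetime.now().isoformat(timespec="seconds")
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        out.write(f"{stamp} {host} {text}\n")
        out.flush()  # keep the file current even if the capture is interrupted
        seen += 1
    return seen
```

To use it, create a `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)`, bind it to `("0.0.0.0", port)` with whatever port the router is configured to send to, open the output file in append mode, and call `capture(sock, out)`. Binding to a port below 1024 on Linux normally needs elevated privileges.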

Will see if I can eventually capture anything useful to determine whether the issue is still the same or has changed into something different.

slh · November 11, 2022, 12:43am

That is unlikely to be an ath10k issue, nor nbg6817 specific at all - there is an open issue with (likely) core scaling on ipq806x (common to all devices), see e.g. https://github.com/openwrt/openwrt/pull/11173 (but there is no real solution, yet).

lss4 · November 11, 2022, 1:04am

Not sure about ath10k, but this may not be ipq806x specific. The mail-archive link I referred to in the OP is from a few months ago and was about ipq40xx, yet some of the error messages were quite similar. In that case the issue more or less went away on its own when the reporter updated the router.

As for the PR you're referring to, I don't recall seeing logs that have a lot in common with what they were reporting there... so not sure.

I don't recall seeing this issue with an image I built last month (Oct 5), so I'm not sure what's really going on here. The router was running fine and, going by the recorded uptime, I don't recall any reboots. Now I can't even get the router to last 12h of uptime.

EDIT: Okay so there's an open checklist about 5.15 kernel in general, across all devices...

slh · November 11, 2022, 1:07am

Well, it's a bit unclear what you mean by "randomly stop working" then. If you refer to 'just' the wireless going down, you may be right - I'm referring to your "The router is still rebooting on its own at some point" issue, and that is a known issue with ipq806x.

The latter can be confirmed quite easily: remove the ath10k packages, and the router is likely to keep rebooting quite frequently.

lss4 · November 11, 2022, 1:18am

To be precise, the one time I managed to capture the log output when ath10k went down, the router became completely inaccessible. All access to it (LuCI, or even pings) failed immediately with an error rather than timing out.

This is the case every time the router "randomly stopped working". The router may reboot on its own, but there are times it fails to reboot in time, requiring me to power-cycle it manually to bring it back.

With current kernels I could hardly keep it up for even just 12 consecutive hours. I did not have reboots with the build I made last month, according to the router's uptime back then.

I still have a backup of the ImageBuilder I used at that time, but I'm not sure if I could still produce images based on these old, known good, kernel versions right now. Will give that a try, though I'm afraid it might be unlikely to succeed...

lss4 · November 11, 2022, 10:45am

An update.

Maybe it's indeed somewhat related to the changes/PRs that are currently going on. The time when the crash/reboot happens doesn't seem completely random: so far most, if not all, of the occurrences I've observed happened within particular time ranges.
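A simple way to check that impression would be to note down the time of each crash and count them per hour of day. The helper below is a hypothetical sketch (the function name is mine, and it assumes the crash times were recorded as ISO 8601 strings); I haven't collected enough data points to draw conclusions from it.

```python
from collections import Counter
from datetime import datetime


def crashes_by_hour(timestamps):
    """Return {hour_of_day: count} for a list of ISO 8601 timestamp strings."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    # Sort by hour so clusters are easy to spot when printed.
    return {hour: hours[hour] for hour in sorted(hours)}

# Usage: crashes_by_hour(["<ISO timestamp of crash 1>", "<ISO timestamp of crash 2>", ...])
```

If most of the counts land in one or two hours, that would point towards something periodic (a scheduled job or a recurring load pattern) rather than purely random failures.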

Maybe it does have something to do with CPU load. I'll see if I could get any ramoops, but most likely I'd be doing a downgrade (if possible) to see if that would make any difference. I'm afraid I won't be able to extract ramoops in the event that I have to manually do a power cycle.

PS: Just checked my old build. It was on the 5.10.146 kernel. Guess it's indeed a problem with 5.15... Not sure what to do now, as I don't think it's trivial to downgrade to stable. I actually tried downgrading to stable before, and that ended badly: settings were reset and, what's worse, anything I changed would not survive a reboot. I can't just build another image with the old ImageBuilder anymore either, as it's now pulling incompatible kmods.

EDIT: Nope... I don't have ramoops. Maybe it needs to be enabled manually somehow. I'm considering building 22.03.2... I just hope I won't find myself in the same situation as the last time I downgraded to stable (it was on another device actually, though I do fear it was a common issue)...

lss4 · November 11, 2022, 11:40am

Okay... this time I managed to get the router downgraded to 22.03.2 stable successfully. Looks like all the services are running fine as well.

Glad that nothing went wrong during the downgrade. The issue I had during a downgrade was a long time ago, and I hadn't really tried downgrading to stable since then.

Will see if the router still crashes/reboots on the stable build, but given that I never had such issues on a snapshot with the 5.10.146 kernel, I think it should be okay.
