Apologies if this is the wrong place to post this.
So this is an odd one. I've been running OpenWrt on a Ubiquiti UniFi AC LR router for a few years now, and it has been brilliant. I recently upgraded from an old master build to the latest 21.02 branch and my HP printer was no longer able to establish a connection. It's able to find the network and attach to it -- I can see it sending out DHCP requests and I can see the responses going back -- but the printer clearly doesn't receive them and sends out another DHCP request after a suitable pause. Manually setting an IP address doesn't help.
So I git bisected to find the commit where things stopped working and came up with this one:
commit 37752336bdfb361d597b316cd5bb9d8dc6ac1762
Author: Felix Fietkau <nbd@nbd.name>
Date: Sat Jan 23 00:17:31 2021 +0100
mac80211: add significant minstrel_ht performance improvements
Completely redesign the rate sampling approach
Signed-off-by: Felix Fietkau <nbd@nbd.name>
That contains six kernel patches. The first two appear to be fine. The third causes things to break:
Good suggestion, but the printer was already running the latest firmware.
Another data point: this issue (or something that produces the same symptoms) has always been present on the 5 GHz network, but the 2.4 GHz network was fine. I had previously blamed the 5 GHz issue on HP not properly supporting that band, but now I'm wondering if the issue is actually something to do with OpenWrt. I'll try to track down a non-OpenWrt access point to test that theory.
Since the change that broke things for me reimplemented the way the best rate is found, I thought it might be useful to look at what's in the rc_rates file before and after the change. When things are working, the output looks like this:
I've worked out that A, B, C and D mark the four best rates, and that 'P' marks the maximum probability rate. I'm less confident about 'S'. I know that flag appears if minstrel_ht_is_sample_rate finds the rate in the cur_sample_rates table, so I'm guessing it indicates a rate that the algorithm is currently considering for sampling. I've noticed the 'S' flags bounce around over time, but the ABCDP flags remain stuck on the first line. I assume that's a bad sign, as is the absence of any successes on any row.
I finally noticed the knobs under /sys/kernel/debug/ieee80211/phy1/rc, including fixed_rate_idx. Forcing that to a known good index was sufficient to allow things to get started. However, even clearing fixed_rate_idx to 4294967295 afterwards didn't cause the algorithm to probe any other rates. Most puzzling.
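In case the magic number looks odd: 4294967295 is just the largest 32-bit unsigned value, i.e. what (u32)-1 evaluates to. Here is a minimal sketch of that, assuming the knob is a 32-bit unsigned integer and that the all-ones value means "no fixed rate" (that matches what I saw, but I haven't checked it against other kernel versions). The macro and helper names are made up for illustration and are not mac80211 identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sentinel for "no fixed rate": all bits set, i.e. (u32)-1. */
#define FIXED_RATE_NONE ((uint32_t)-1)

/* The value written to fixed_rate_idx above is the same number. */
static_assert(FIXED_RATE_NONE == UINT32_MAX, "(u32)-1 is 4294967295");

/* Hypothetical helper, not part of mac80211. */
static inline bool fixed_rate_is_cleared(uint32_t fixed_rate_idx)
{
	return fixed_rate_idx == FIXED_RATE_NONE;
}
```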
I've updated to the latest openwrt-21.02 build and used the fixed_rate_idx trick to get things going. I did see the rate shifting around there, though the reported lookaround remained at zero until about 800 packets had been exchanged. After that, I saw some probe packets sent out and the algorithm started sending packets over both long and short GI entries. In other words, it seems to be working as expected.
My hunch is that the first listed rate (MCS0, LGI) not being able to successfully transmit is somehow causing the algorithm to get stuck in a bad state. All of my other 2.4 GHz devices are able to communicate on their first listed rate, and none of them have this problem.
I guess I'll hack some debug info into the file to see if I can refute this theory.
I've found the problem. In the middle of minstrel_ht_get_rate is this line:
    if (time_is_before_jiffies(mi->sample_time))
        return;
I assume that this is supposed to cause the function to return if it's called too early. If so, the logic is backwards; it'll return if time (i.e. mi->sample_time) is before jiffies (the current time). In other words, it'll return early only if the sample_time has passed.
Replacing that with time_is_after_jiffies makes everything much better -- I see each station select a rate other than the first in the list right away.
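To convince myself about the direction of the two macros, here is a small userspace model. It is only a sketch: the jiffies variable is a plain stand-in for the kernel's tick counter, the macros are modelled on the ones in include/linux/jiffies.h with the type checks dropped, and the helper function names are hypothetical. It assumes the intent of the check is to skip sampling until sample_time has arrived.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's tick counter (assumption of this sketch). */
static unsigned long jiffies;

/* Wrap-safe comparisons modelled on include/linux/jiffies.h, minus the type checks. */
#define time_after(a, b)          ((long)((b) - (a)) < 0)
#define time_before(a, b)         time_after(b, a)
#define time_is_before_jiffies(a) time_after(jiffies, a)
#define time_is_after_jiffies(a)  time_before(jiffies, a)

/* Hypothetical helper: does the check as originally written bail out? */
static inline bool original_check_returns(unsigned long sample_time)
{
	return time_is_before_jiffies(sample_time);
}

/* Hypothetical helper: does the check with the macro swapped bail out? */
static inline bool swapped_check_returns(unsigned long sample_time)
{
	return time_is_after_jiffies(sample_time);
}

/* Walk through both situations with "now" fixed at tick 1000. */
static inline void check_both_cases(void)
{
	jiffies = 1000;

	/* sample_time is still in the future: too early to sample */
	assert(!original_check_returns(1100)); /* original carries on anyway */
	assert(swapped_check_returns(1100));   /* swapped version returns */

	/* sample_time has already passed: sampling is due */
	assert(original_check_returns(900));   /* original returns */
	assert(!swapped_check_returns(900));   /* swapped version carries on */
}
```

Calling check_both_cases() passes all four assertions, which matches what I described: as written, the check returns only once sample_time has passed, and swapping the macro inverts that.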
Sorry for the delayed response. What you found makes perfect sense, and I will push a fix to OpenWrt (and send it upstream) right away.
Thanks for tracking this one down. I have no idea why it works so well in all of my tests without this fix.