What I find extremely curious is that, while my case looks and feels to be the same, my AP and the 22.03 device are separate devices on the same network, and the wifi connection was established through a 21.02 device. My 22.03-rc4 device that showed the issue does not even have wifi.
This means that it is somehow not related to wifi on the device itself, but to an SSH connection that at some point went through an (OpenWrt) wifi link.
Honestly, I don't know what to make of it, and how this is possible. But here's hoping this will help to narrow down the problem.
Is it worth going back to 21.02.3 to test if it fails there?
More about the test case. All the Wi-Fi tests have been done with my laptop on my dining room table (much to the chagrin of my wife), which is about 25 feet from the router. (So it's not likely to be a "weak Wi-Fi signal causing disconnects" problem.)
Maybe. However, I have several devices running 21.02.3, most notably the AP (MT7621AT) and my main router (X86-64), none of which exhibit the problem. I also think if it were a problem with 21.02, we would have heard more reports by now.
I will set up some other device on a different target with a 22.03-rc and put it on the network to see if the problem is reproducible. Let's see what I have lying around.
Edit: an R6220, that will do.
Edit 2: 30 minutes in and SSH is still up and responsive. So that's a "fail", the (or at least: my) problem doesn't seem to be universal to the 22.03-rcs. Next up I will test again with the MBL that gave me the issue first and try to reproduce it.
Edit 3: "success" I guess? 22.03.0-rc6 on the MBL, as before, fresh install from official download, disabled firewall (no dnsmasq or odhcpd present), no other software installed. SSH through wired clients on the network works fine. SSH through wifi (again, external AP!) prompts an established connection syslog entry after a loooong wait, never makes it to the login prompt before timing out.
Edit 4: The fact that LuCI works fine even when SSH fails makes me think it is some kind of dropbear issue.
Edit 5: Well, that is getting weirder and weirder. After some 10 minutes, somehow it ... "recovered"? SSH is now again possible as if there was never any issue. After googling the issue with dropbear, some (older) posts suggest it is a 5GHz issue, so I tried 2.4GHz in between, with no different outcome, SSH was still timing out. But then all of a sudden ... it recovered.
Edit 6: ... aaaand it stopped responding to SSH again, some 5 to 10 minutes later.
This is somehow the behaviour we're seeing. The SSH tunnel does not open and times out; wait 10 min. and it works again -just because-. Then the SSH tunnel becomes non-responsive and the connection drops, with no way to reconnect because of more timeouts. Wait for a few minutes, or reboot the computer, or reboot the access point, and lo and behold we are back in business.
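To put timestamps on the "works, stalls, recovers" cycle described above, a small probe can try to read the SSH identification banner at regular intervals. This is only a minimal sketch: `probe_ssh` and `watch` are hypothetical helper names, and it is an assumption that a bare banner read is affected in the same way as a full login. It only shows whether the server answers at all within the timeout.

```python
import socket
import time


def probe_ssh(host, port=22, timeout=10.0):
    """Return (ok, seconds, detail); ok is True if an SSH banner arrived in time."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(256)  # an SSH server sends "SSH-2.0-..." first
    except OSError as exc:  # covers timeouts and refused connections
        return False, time.monotonic() - start, type(exc).__name__
    elapsed = time.monotonic() - start
    if banner.startswith(b"SSH-"):
        return True, elapsed, banner.decode("ascii", "replace").strip()
    return False, elapsed, "no SSH banner"


def watch(host, interval=30.0, rounds=3):
    """Probe repeatedly and print one timestamped line per attempt."""
    for i in range(rounds):
        ok, secs, detail = probe_ssh(host)
        stamp = time.strftime("%H:%M:%S")
        status = "OK" if ok else "FAIL"
        print(f"{stamp} {status:4} {secs:5.1f}s {detail}")
        if i < rounds - 1:
            time.sleep(interval)
```

Calling something like `watch("192.168.1.1", interval=30, rounds=120)` from the wireless client (adjust the address) would log roughly an hour of attempts, which can then be lined up against the times the interactive session froze.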
Hey, on a positive note, it seems to me that there is some consistency.
Someone CMIIW, but if it was an entropy issue, it would hit all clients, wired or wireless. Which it does not, it only affects clients coming through wifi, and most confusingly, even if that wifi connection is external to the device.
So what makes packets that at some point passed through a wifi connection so particular that they trigger this bug?
JFTR: I had an extended chat with @jow on #openwrt-devel about the issue. The current working theory is that the packets get mangled somehow on their way through the MT76 wifi, possibly as a result of some flow optimization, and dropbear on 22.03 seems to be sensitive to that.
It may be related to this issue which has potentially been fixed in this commit. That may explain my issues (my connection goes through a 21.02 MT76 OpenWrt AP before it passes on to the OpenWrt 22.03-rc). However, that fix has been backported to rc5 and as such it should have fixed the issue if the MT76 wifi is on the same 22.03-rc6 device that runs dropbear.
Another theory is that it might be related to packet sizes.
The next step is to tcpdump capture the stalling/failing SSH connection attempt both from the client and the server.
(And of course now that I'm watching intently, for some inexplicable reason my SSH connection works just fine, possibly because while testing I restarted the LAN on the AP. I will make another attempt to reproduce the issue tomorrow after the AP had a bit of a workout.)
Do you think my issue with seeing very slow ssh that results in timeouts connecting only over 2.4 is related? I recently set up three RT3200's for a neighbour and saw exactly the same thing happen there.
As @richb-hanover-priv linked above, I experienced the same problem with an unstable connection to ssh (and some proprietary closed source management software). But after restarting the lan interface it seems stable, 12h after the interface restart.
And while that's "good" as a band-aid, it is really bad for actually finding the bug. The issue is clearly caused by some sort of interaction between a MT76 AP and dropbear (even if they are not on the same device), and after restarting my AP's LAN I haven't been able to reproduce it anymore. And that is immensely frustrating if your plan is to capture a failed attempt.
@richb-hanover-priv ... can you still reproduce it? If so, can you tcpdump-capture the isolated interaction between the SSH client and the device that stops answering? Because right now I ... just can't.
Not a tcpdump specialist at all, used it for the first time yesterday actually, but this should work. Assuming 192.168.1.1 for the RT3200 (server) and 192.168.1.100 for the wireless client (laptop):
on the server: tcpdump -nn -s 0 -w /tmp/capture.pcap host 192.168.1.100 and tcp port 22
and on the laptop: tcpdump -nn -s 0 -w /tmp/capture.pcap host 192.168.1.1 and tcp port 22
Adjust your IPs of course. A capture of a few seconds during which the connection attempt is made and fails should be enough. Press Ctrl+C to end the capture.
(If you leave out the -s 0 -w /tmp/capture.pcap part it will not output to the file, but abridged capture data to the console so you can check if it works, and also if it only captures the connection attempt and not some other traffic.)
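Once both capture files exist, it helps to compare what left the client with what reached the server, since the working theory is that packets get mangled in transit. The sketch below is a minimal reader under stated assumptions: the file is in classic pcap format (not pcapng), the link type is Ethernet, and only IPv4/TCP packets matter. `read_pcap` is a hypothetical helper name, not part of tcpdump.

```python
import socket
import struct


def read_pcap(data):
    """Yield (timestamp, src, dst, dscp, tcp_flags) for each IPv4/TCP packet."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # written on a big-endian machine
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    linktype = struct.unpack(endian + "I", data[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled in this sketch")
    offset = 24  # size of the pcap global header
    while offset + 16 <= len(data):
        sec, usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        frame = data[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue  # too short, or EtherType is not IPv4
        ip = frame[14:]
        ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes
        if ip[9] != 6 or len(ip) < ihl + 14:
            continue  # not TCP, or TCP header truncated
        dscp = ip[1] >> 2  # upper six bits of the TOS byte
        src = socket.inet_ntoa(ip[12:16])
        dst = socket.inet_ntoa(ip[16:20])
        flags = ip[ihl + 13]  # TCP flag byte (SYN is 0x02, ACK is 0x10)
        yield sec + usec / 1e6, src, dst, dscp, flags
```

Feeding it the bytes of each file, for example `read_pcap(open("/tmp/capture.pcap", "rb").read())`, and printing the tuples side by side should show whether repeated SYNs appear on one side only, and whether any header field differs between the two captures.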
I decided to try a couple more devices running htop, so I fired up my old MacBook Pro ("oMBP") and an old Win10 laptop. My primary machine is a newer MBP. All three were connected via Wi-Fi and running htop successfully.
After ~1h 15 minutes, I got tired of waiting, so I stopped htop and exited the SSH sessions on oMBP and the Win10 machine.
Within 5 minutes, htop froze on the third computer (MBP). I could not re-establish a new SSH connection. (oMBP and Win10 were still disconnected from SSH at that time.)
While the new MBP was in that bad state, I was able to ssh in to the router and run htop on oMBP and Win10. After checking that htop worked, I stopped it and exited the SSH session on those machines.
I turned off Wi-Fi on the new MBP, then turned it back on, and was immediately able to reconnect to the router.
My summary of the evidence:
Something is interfering with SSH & Wi-Fi. Running htop over Wi-Fi freezes within 5-20 minutes, and that computer cannot re-connect to SSH.
If multiple computers were connected via Wi-Fi and running htop, no freeze was observed (I waited 1h 15m, when a freeze normally occurs within 20 minutes)
(I didn't try it in this round of experiments, but...) SSH over Ethernet seems always to work
When one computer is in the "frozen-Wi-Fi" state, another computer can SSH in via Wi-Fi
When I turned Wi-Fi off and back on for the affected computer, it could immediately SSH back in.
In the meantime I managed to get meaningful tcpdumps, and with them I was able to Ralph-Wiggum-style help the devs to at least get some general idea where the issue is.
As far as I can wrap my head around the subject matter, the culprit is the MT76 wifi driver getting confused about DSCP markings, specifically the "af21" marking dropbear started setting in the version 2022.82 used on 22.03-rc*. That might even affect other software or devices setting that DSCP marking.
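For anyone who wants to check whether other traffic with the same marking is affected, here is a minimal sketch of how a DSCP name maps to the byte seen in a capture, and how a test client could set it on its own socket. The AF code points follow RFC 2597 (AFxy is 8*x + 2*y); `dscp_to_tos` and `mark_socket` are hypothetical helper names, and `socket.IP_TOS` is assumed to be available, as it is on Linux.

```python
import socket

# A few DSCP code points; af21 is the one discussed in this thread.
DSCP = {
    "cs0": 0,    # default, i.e. no marking
    "af11": 10,  # 8*1 + 2*1
    "af21": 18,  # 8*2 + 2*1
    "af41": 34,  # 8*4 + 2*1
}


def dscp_to_tos(dscp):
    """DSCP sits in the upper six bits of the IPv4 TOS byte."""
    return dscp << 2


def mark_socket(sock, name):
    """Ask the kernel to mark this socket's outgoing IPv4 packets."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(DSCP[name]))
```

So af21 shows up as TOS 0x48 in a packet dump. Marking a plain TCP test connection with `mark_socket(sock, "af21")` and sending data through the MT76 AP would be one way to see whether the stall needs dropbear at all, or only the marking.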
This now leads to the question about what to actually fix. As far as I understand it, ideally the MT76 driver should be fixed, but of course that wouldn't help with the existing install base (MT76 is affected at least back to 21.02, possibly even earlier). The nftables rule above is a band-aid, it clobbers the DSCP markings on all outgoing packets, but that doesn't help with devices not running nftables in the first place (like my MBL, I had to aftermarket-install nft to confirm the rules working). And another possible hotfix would be patching dropbear for 22.03 to not set the DSCP markings, or set ones that don't trip up the MT76 driver.
This is now in the hands of some really capable devs making some very intricate decisions, I'm afraid we can't really do much anymore.
(Apologies if I'm misstating some of the details. I'm really way out of my depth here.)
Do you observe the ssh session "freeze" only on the apple devices? Same band (i.e. both ssh client and server on 5 GHz)?
I've observed ssh sessions freezing when connecting via wifi to a 2019 MacBook Air (connected to a recent build of openwrt master branch for a r7500v2 configured as an AP only) from ubuntu clients also connected via wifi on the same r7500v2. I can't remember if this was on the same band - I'll try to reproduce and report back next week.
I had hoped the recent changes to AQL might have fixed that, but I have not tried to reproduce it recently (I've been traveling and busy with other interests for several months). This typically happens when I'm doing netperf/flent testing with the MacBook (i.e. its wifi network connection is busy).
I do not observe similar ssh freezing from "busy" ubuntu wifi clients. I typically don't do much with Win10 to make its wifi "busy" when I'm on ssh so I can't comment on that.