[SOLVED] SSH over wifi stops working on RT3200/E8450 with 22.03.0-rc6

No - nothing special installed. I installed RC6 and then added the packages below.

bash
htop
iputils-ping
luci-app-sqm
luci-app-wireguard
nano

NB: CAKE-autorate service is disabled and not started.

For more info, look at https://pastebin.com/CqjLZ8C7 - the output of OpenWrtScripts getstats.sh: https://github.com/richb-hanover/OpenWrtScripts/blob/main/getstats.sh

It's inaccessible. I walk away from htop and a while later it's frozen. Attempts to begin a new SSH session time out after 30-60 seconds.

I would like to share that our SecureLink (ssh tunnels) issues have disappeared with this commit:

Why? I don't properly understand it yet, but give it a go.

Are you experiencing the issue via WiFi? Because my tests (in which I experienced no instability) were all done in a wired setup.

Yes - I have only tested for these lockups on Wi-Fi. I now have these tests in my queue:

  1. Test on a wired connection
    • tested with Ethernet - ran successfully for > 1 hour
    • tested with wi-fi again, failed after 9 minutes
  2. Re-flash with rc6, with no additional packages except htop and nano
    • reconnected over wi-fi, froze after 10 minutes
    • immediately (within 5 minutes) connected to Ethernet - ssh connection works
  3. Flash with a snapshot that contains the 80211 uninitialized lock fix
    • flashing now...

It'll take a couple days to report back...

This sounds suspiciously like the issue I had with -rc4 on an MBL, except I'm going through an external AP (running OpenWrt, too). Perhaps there's a relation?

(JFTR, I have not tried with another rc or snapshot yet.)

How is this going? For the record, no complaints from the missus here after 2 days.

Larger bug report: Here's the testing I have done (#1 & #2 below were reported earlier). See the OP for the initial description of the problem, but the short story is that htop freezes after a while when I'm ssh'd in over wi-fi.

  1. Test on a wired connection
    • tested with Ethernet - ran successfully for > 1 hour
    • tested with wi-fi again, failed after 9 minutes
  2. Re-flash with rc6, installing only htop and nano
    • Connected over wi-fi, froze after 10 minutes
    • Immediately connected using Ethernet (no reboot) - ssh connection works
  3. Flash with a snapshot that contains the 80211 uninitialized lock fix
    • Connection over wi-fi failed after ~1h 14 minutes
    • Reconnected over Ethernet (no reboot) - ssh works
    • NB: All the while wi-fi works to connect to the router and browse the internet
    • I then used LuCI to restart the LAN interface. I was able to connect to ssh over wi-fi.
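
For anyone who wants to time these failures without watching htop, here is a minimal sketch of a probe. It assumes that reading the SSH identification line from a fresh connection is a fair stand-in for "ssh works", and that the server sends that line first, without any preamble. The function names, the 30-second interval and the timeouts are all made up for illustration; nothing here comes from the posters' own tooling.

```python
import socket
import time


def read_ssh_banner(host, port=22, timeout=10.0):
    """Return the server's SSH identification line, or None on failure/timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = b""
            # Assumption: the server sends a line such as "SSH-2.0-..." first.
            while b"\n" not in data and len(data) < 255:
                chunk = sock.recv(255 - len(data))
                if not chunk:
                    break
                data += chunk
    except OSError:  # covers refused connections and timeouts
        return None
    line = data.split(b"\n", 1)[0].strip().decode("ascii", "replace")
    return line if line.startswith("SSH-") else None


def seconds_until_failure(probe, interval=30.0, max_checks=1000,
                          clock=time.monotonic, sleep=time.sleep):
    """Call probe() every `interval` seconds; return the seconds elapsed at the
    first failed probe, or None if all `max_checks` probes succeed."""
    start = clock()
    for _ in range(max_checks):
        if not probe():
            return clock() - start
        sleep(interval)
    return None
```

Run from the wi-fi client, something like `seconds_until_failure(lambda: read_ssh_banner("192.168.249.1") is not None)` would report roughly how long the router kept answering (the address is the LAN address used in the notes; adjust as needed). It only exercises new connections, so it will not notice an already-open session freezing.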

The remainder of this post is the notes I took while doing these steps:

Belkin RT3200

- Initially flashed with dangowrt UBI instructions

On Ethernet: htop - 1d 4:33:50 or so
Still working at 1d 05:36:51 (~1 hour later)

Switch to Wi-Fi: htop 1d 5:36:51
Froze at 1d 05:45:33 (~9 minutes later)

==================
Flash with RC6 again

- Don't keep settings (none of the three checkboxes checked)
- Set password
- Update packages
- install htop
- install nano (not nano-full or nano-plus)
- Configure Wi-Fi - open
- Change LAN subnet to 192.168.249.1/24
- Start htop
- Uptime: 00:11:16
- Froze at 00:21:11

========================
Flash with snapshot from 9Aug2022, 18:12 EDT

Powered by LuCI Master (git-22.213.35850-abd9125) / OpenWrt SNAPSHOT r20265-e6e4f97999

- Don't keep settings (none of the three checkboxes checked)
- ssh in
- opkg update; opkg install luci
- (from LuCI...)
- Set password
- Update packages
- install htop
- install nano (not nano-full or nano-plus)
- Change System Name to Belkin-RT3200
- Configure Wi-Fi - open
- Change LAN subnet to 192.168.249.1/24
- Start htop
- Uptime: 00:13:10
- Disconnect with htop showing 01:27:21

Restart LAN interface from LuCI

Start htop on wifi at - 08:57:34
Stopped at: 09:39:06 (~42 minutes)
Restarted LAN interface using LuCI - ssh access over Wi-Fi works again
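
The "minutes later" figures in these notes can be checked from the uptime readings. This is a small sketch that assumes the readings are written as in the notes above (an optional `Nd` day prefix, then `H:MM:SS`). The helper names are hypothetical.

```python
import re
from datetime import timedelta

# Optional "<days>d " prefix followed by H:MM:SS, as written in the notes.
_UPTIME = re.compile(r"^(?:(\d+)d\s+)?(\d{1,2}):(\d{2}):(\d{2})$")


def parse_uptime(text):
    """Convert a reading such as '1d 05:36:51' or '00:21:11' to a timedelta."""
    match = _UPTIME.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised uptime: {text!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def elapsed(start, end):
    """Time between two uptime readings taken on the same boot."""
    return parse_uptime(end) - parse_uptime(start)


print(elapsed("1d 5:36:51", "1d 05:45:33"))  # prints 0:08:42
```

The first wi-fi run therefore lasted just under 9 minutes, and `elapsed("00:11:16", "00:21:11")` gives 0:09:55 for the rc6 re-flash run.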

Another report: DIR-2660 admin ssh unstable through wifi (22.03 rc6)

I edited the title to contain "wifi".

This is likely the wifi link breaking the connection, which in turn breaks the long-lived TCP session (?) that SSH relies on and freezes the SSH terminal.

What I find extremely curious is that, while my case looks and feels to be the same, my AP and the 22.03 device are separate devices on the same network, and the wifi connection was established through a 21.02 device. My 22.03-rc4 device that showed the issue does not even have wifi.

This means that it's, somehow, not related to wifi on the device itself, but to an SSH connection that at some point went through an (OpenWrt) wifi.

Honestly, I don't know what to make of it, and how this is possible. But here's hoping this will help to narrow down the problem.

Thanks for fixing the title of the post.

Is it worth going back to 21.02.3 to test if it fails there?

More about the test case. All the Wi-Fi tests have been done with my laptop on my dining room table (much to the chagrin of my wife :) ), which is about 25 feet from the router. (So it's not likely to be a "weak Wi-Fi signal causing disconnects" problem.)

Maybe. However, I have several devices running 21.02.3, most notably the AP (MT7621AT) and my main router (X86-64), none of which exhibit the problem. I also think if it were a problem with 21.02, we would have heard more reports by now.

I will set up some other device on a different target with a 22.03-rc and put it on the network to see if the problem is reproducible. Let's see what I have lying around.

Edit: an R6220, that will do.

Edit 2: 30 minutes in and SSH is still up and responsive. So that's a "fail", the (or at least: my) problem doesn't seem to be universal to the 22.03-rcs. Next up I will test again with the MBL that gave me the issue first and try to reproduce it.

Edit 3: "success" I guess? 22.03.0-rc6 on the MBL, as before, fresh install from official download, disabled firewall (no dnsmasq or odhcpd present), no other software installed. SSH through wired clients on the network works fine. SSH through wifi (again, external AP!) prompts an established connection syslog entry after a loooong wait, never makes it to the login prompt before timing out.

Edit 4: The fact that LuCI works fine even when SSH fails makes me think it is some kind of dropbear issue.

Edit 5: Well, that is getting weirder and weirder. After some 10 minutes, somehow it ... "recovered"? SSH is now again possible as if there was never any issue. After googling the issue with dropbear, some (older) posts suggest it is a 5GHz issue, so I tried 2.4GHz in between, with no different outcome, SSH was still timing out. But then all of a sudden ... it recovered.

Edit 6: ... aaaand it stopped responding to SSH again, some 5 to 10 minutes later.

This is more or less the behaviour we're seeing. The SSH tunnel does not open and times out; wait 10 minutes and it works again -just because-. Then the SSH tunnel becomes non-responsive and the connection drops, with no way to reconnect because of more timeouts. Wait a few minutes, or reboot the computer, or reboot the access point, and lo and behold, we are back in business.

Hey, on a positive note, it seems to me that there is some consistency.

Could this be a sign of insufficient entropy? Can you reproduce the hang with haveged or urngd running in the background?

Not any more.
The kernel was changed a while ago regarding entropy collection.

Always at 256, no matter what. Which is more than enough.
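
For anyone who wants to repeat that check, here is a minimal sketch. It assumes the figure above is what the kernel reports in `/proc/sys/kernel/random/entropy_avail`, and that Python is available on the box being checked (on a stock router, a plain `cat` of the same file gives the same number). The function names and the 256 threshold are illustrative only.

```python
from pathlib import Path

# Standard Linux procfs location of the kernel's entropy estimate (in bits).
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy_bits(path=ENTROPY_AVAIL):
    """Return the entropy estimate stored in `path` as an integer."""
    return int(Path(path).read_text().strip())


def entropy_looks_sufficient(bits, threshold=256):
    """Hypothetical check: treat the 256 figure quoted in the thread as 'enough'."""
    return bits >= threshold
```

Sampling `read_entropy_bits()` once while SSH works and once while it hangs would show whether the value really stays constant across the failure.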

I know. In fact, I was CC'ed on that change. But it works only if the device has a working high-resolution clock - and I don't know if it exists on routers. That's why the double-check.

Someone CMIIW, but if it was an entropy issue, it would hit all clients, wired or wireless. Which it does not, it only affects clients coming through wifi, and most confusingly, even if that wifi connection is external to the device.

So what makes packets that at some point passed through a wifi connection so particular that they trigger this bug?

This and the point about entropy above seem interesting.

In my case, SSH over WDS on 2.4 GHz has become very slow; it will time out and so on. But LuCI and data transfers are fine. No issue on 5 GHz.

Isn't there something about entropy being WiFi generated or something like that?