[SOLVED] SSH over wifi stops working on RT3200/E8450 with 22.03.0-rc6

As @richb-hanover-priv linked above, I experienced the same problem: unstable connections to ssh (and to some proprietary closed-source management software). But after restarting the LAN interface, it still seems stable 12h later.

And while that's "good" as a band-aid, it is really bad for actually finding the bug. The issue is clearly caused by some sort of interaction between a MT76 AP and dropbear (even if they are not on the same device), and after restarting my AP's LAN I haven't been able to reproduce it anymore. And that is immensely frustrating if your plan is to capture a failed attempt.

@richb-hanover-priv ... can you still reproduce it? If so, can you tcpdump-capture the isolated interaction between the SSH client and the device that stops answering? Because right now I ... just can't.

Update: I was able to reproduce this behavior using RC1. I am unable to try 21.02 since the Belkin RT3200 did not work with that firmware.

I was thinking the same thing. I can use two laptops: One to test via Wi-Fi, the second on Ethernet.

Since Ethernet still works, what tcpdump command should I use to get more information on the Wi-Fi sessions? Thanks.

Not a tcpdump specialist at all, used it for the first time yesterday actually, but this should work. Assuming 192.168.1.1 for the RT3200 (server) and 192.168.1.100 for the wireless client (laptop):

on the server:
tcpdump -nn -s 0 -w /tmp/capture.pcap host 192.168.1.100 and tcp port 22

and on the laptop:
tcpdump -nn -s 0 -w /tmp/capture.pcap host 192.168.1.1 and tcp port 22

Adjust your IPs of course. A capture of a few seconds during which the connection attempt is made and fails should be enough. Ctrl+C to end the capture.

(If you leave out the -s 0 -w /tmp/capture.pcap part, it will not write to the file but will print abridged capture data to the console instead, so you can check that it works and that it only captures the connection attempt and not some other traffic.)

Tons of new evidence (no tcpdump yet):

  • I'm still running RC1 on the Belkin RT3200
  • I decided to try a couple more devices running htop, so I fired up my old MacBook Pro ("oMBP") and an old Win10 laptop. My primary machine is a newer MBP. All three were connected via Wi-Fi and running htop successfully.
  • After ~1h 15 minutes, I got tired of waiting, so I stopped htop and exited the SSH sessions on oMBP and the Win10 machine.
  • Within 5 minutes, htop froze on the third computer (MBP). I could not re-establish a new SSH connection. (oMBP and Win10 were still disconnected from SSH at that time.)
  • While the new MBP was in that bad state, I was able to ssh in to the router and run htop on oMBP and Win10. After checking that htop worked, I stopped it and exited the SSH session on those machines.
  • I turned off Wi-Fi on the new MBP, then turned it back on, and was immediately able to reconnect to the router.

My summary of the evidence:

  • Something is interfering with SSH & Wi-Fi. Running htop over Wi-Fi freezes within 5-20 minutes, and that computer cannot re-connect to SSH.
  • If multiple computers were connected via Wi-Fi and running htop, no freeze was observed (I waited 1h 15m, when a freeze normally occurs within 20 minutes)
  • (I didn't try it in this round of experiments, but...) SSH over Ethernet seems always to work
  • When one computer is in the "frozen-Wi-Fi" state, another computer can SSH in via Wi-Fi
  • When I turned Wi-Fi off and back on for the affected computer, it could immediately SSH back in.

What's the next experiment? tcpdump? Thanks


Please test whether the broken SSH state still occurs after these commands on the 22.03 device running the SSH server:

nft 'add chain inet fw4 changetos { type route hook output priority mangle ; }'
nft add rule inet fw4 changetos ip dscp set 0 counter
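
In case it helps with testing: a minimal sketch for checking that the rule is actually matching traffic, and for backing it out afterwards. It assumes the chain name changetos from the commands above. The run helper and the DRY_RUN variable are made up for this example and only exist so the commands can be printed for review instead of executed.

```shell
# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
# (Hypothetical helper, not part of nft or OpenWrt.)
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List the chain; the "counter packets N bytes M" part of the rule
# shows how much traffic has been re-marked so far.
show_changetos() {
    run nft list chain inet fw4 changetos
}

# Back the change out. Flushing first is harmless and avoids errors
# on setups that refuse to delete a chain that still holds rules.
remove_changetos() {
    run nft flush chain inet fw4 changetos
    run nft delete chain inet fw4 changetos
}
```

Call show_changetos while an SSH session is active; if the counter stays at zero, the rule is probably not seeing the traffic. Rules added with nft like this are runtime-only, so a reboot also removes them.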

In the meantime I managed to get meaningful tcpdumps, and I was able to Ralph-Wiggum-style help the devs to at least get some general idea where the issue is.

As far as I can wrap my head around the subject matter, the culprit is the MT76 wifi driver getting confused by DSCP markings, specifically the "af21" marking that dropbear started setting in version 2022.82, which is used on 22.03-rc*. That might even affect other software or devices setting that DSCP marking.
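
For anyone checking their own captures, here is a small sketch for spotting the marking. AF21 is DSCP 18, and DSCP sits in the upper six bits of the IPv4 TOS byte, so tcpdump -v should show such packets as tos 0x48. The function names are made up for this example, it only covers IPv4, and the capture path is the one suggested earlier in the thread.

```shell
# dscp_to_tos DSCP: DSCP occupies the upper six bits of the IPv4 TOS
# byte, so shifting left by two gives the byte value seen on the wire.
dscp_to_tos() {
    echo $(( $1 << 2 ))
}

# dscp_filter DSCP: build a pcap filter that matches IPv4 packets
# carrying that DSCP. 0xfc masks off the two ECN bits.
dscp_filter() {
    printf 'ip[1] & 0xfc == 0x%02x\n' "$(dscp_to_tos "$1")"
}

# Example (not run here): show only AF21-marked packets in a capture.
#   tcpdump -nn -v -r /tmp/capture.pcap "$(dscp_filter 18)"
```

If that filter matches the SSH packets in a capture but not, say, LuCI traffic, that fits the AF21 explanation.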

This now leads to the question about what to actually fix. As far as I understand it, ideally the MT76 driver should be fixed, but of course that wouldn't help with the existing install base (MT76 is affected at least back to 21.02, possibly even earlier). The nftables rule above is a band-aid, it clobbers the DSCP markings on all outgoing packets, but that doesn't help with devices not running nftables in the first place (like my MBL, I had to aftermarket-install nft to confirm the rules working). And another possible hotfix would be patching dropbear for 22.03 to not set the DSCP markings, or set ones that don't trip up the MT76 driver.

This is now in the hands of some really capable devs making some very intricate decisions, I'm afraid we can't really do much anymore.

(Apologies if I'm misstating some of the details. I'm really way out of my depth here.)

Do you observe the ssh session "freeze" only on the Apple devices? On the same band (i.e. both ssh client and server on 5 GHz)?

I've observed ssh sessions freezing when connecting via wifi to a 2019 mac book air (connected to a recent build of openwrt master branch for a r7500v2 configured as an AP only) from ubuntu clients also connected via wifi on the same r7500v2. I can't remember if this was on the same band - I'll try to reproduce and report back next week.

I had hoped the recent changes to AQL might have fixed that, but I have not tried to reproduce it recently (I've been traveling and busy with other interests for several months). This typically happens when I'm doing netperf/flent testing with the mac book (i.e. its wifi network connection is busy).

I do not observe similar ssh freezing from "busy" ubuntu wifi clients. I typically don't do much with Win10 to make its wifi "busy" when I'm on ssh so I can't comment on that.

@takimata - thanks for the further info

@anon98444528 - Good questions

  • I've only seen this on a Mac laptop
  • I am testing my Win10 laptop now
  • I have only tried 2.4GHz; I did not even enable 5GHz on the RT3200
  • I don't have a Linux laptop handy

I note that I am unfond of most usages of the non-default wifi queues. They don't do what most of our layer 3 protocols expect. I would be perfectly happy if openwrt shipped with a flat qos-map into BE and/or ripped out support for the other queues entirely.

It's kind of my hope that the ssh thing is randomness, tcp itself, or related to something other than what's discussed in "AQL and the ath10k is *lovely*" (#831 by dtaht).

That said, if this is really a dscp related problem, I can think of one potentially related ATF flaw: packets coming from a station on one queue and going out another might act up (somehow). I'd like to rule that out. Try a flat qos-map in hostapd? That avoids having to remark packets via nft.

When your SSH problem occurs, try to restart dropbear (/etc/init.d/dropbear restart) on your router via ssh from another machine and check whether you can ssh to the router from the device having the SSH over WIFI problem.

If restarting dropbear fixes your SSH problem, and you can still access LuCI from the machine with the SSH connection problem, then the problem may not be with Wi-Fi.

It is. More specifically it is related to the AF21 "interactive" marker if an MT76 wifi is somewhere in the network path. dropbear started to set it with version 2022.82:

Priority (tty) traffic is now set to AF21 "interactive".

If we apply this AF21 marker to other traffic using nft, it has the same effect. E.g. after marking all port 80 traffic with AF21, LuCI "stops working" through an MT76 wifi (OpenWrt 21.02 AP).

Some products today also mark the tcp syn packet as af21.

On other tests on the bug I referenced we were testing a variety of other dscps and not seeing a problem. Anyone have aircaps?

Cool - Just in time! (The SSH connection had hung about 5 minutes ago.)

Neither restarting dropbear from LuCI nor /etc/init.d/dropbear restart restored SSH access from the affected machine. My other laptop could SSH in via Wi-Fi, though.

I think the culprit is in takimata's post regarding AF21 change in the latest dropbear v2022.82 not playing well with MT76 WIFI.

I did the same SSH test connecting to my R7800 (ath10k, 22.03-RC6) and running htop for hours. I did not encounter your SSH over WIFI problem.

You can try to run htop with a faster refresh to see if you can reproduce the problem faster (or it may just disappear by itself):

htop -d 10

or even faster:

htop -d 1

As per @dtaht's quick fix, can you try this iw_qos_map_set under your wifi-iface configuration? It maps EF markings into the AC_VO queue; the rest goes to AC_BE, no matter the DSCP marking.

config wifi-iface 'wifinet2'
	option device 'radio0'
	option mode 'ap'
	option ssid 'test'
	option encryption 'psk2+ccmp'
	option key 'test.test.test.test'
	option network 'lan'
	option disassoc_low_ack '0'
	option dtim_period '1'
	option iw_qos_map_set '0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,46,0,0'

Update: apologies, wrong mapping of EF, this should work as expected: option iw_qos_map_set '46,6,0,63,255,255,255,255,255,255,255,255,255,255,255,255,255,255'
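
To sanity-check a map before putting it into the config, here is a rough sketch of how such a string is read. It assumes the hostapd qos_map_set layout: optional (DSCP, UP) exception pairs first, then eight (low, high) DSCP ranges for user priorities 0 to 7, with 255,255 meaning "range unused". The function name qos_map_up is made up for this example, and this is not hostapd's actual parser.

```shell
# qos_map_up MAP DSCP: print the 802.11 user priority (0-7) that a
# qos_map_set style string assigns to a DSCP value.
# Returns non-zero if no exception or range covers the DSCP.
qos_map_up() {
    map=$1
    dscp=$2

    # Split the comma-separated map into positional parameters.
    old_ifs=$IFS
    IFS=,
    set -- $map
    IFS=$old_ifs

    # Everything before the last 16 values is (DSCP, UP) exceptions.
    while [ "$#" -gt 16 ]; do
        if [ "$1" -eq "$dscp" ]; then
            echo "$2"
            return 0
        fi
        shift 2
    done

    # The remaining eight (low, high) pairs are the ranges for UP 0..7.
    up=0
    while [ "$#" -ge 2 ]; do
        if [ "$1" -ne 255 ] && [ "$dscp" -ge "$1" ] && [ "$dscp" -le "$2" ]; then
            echo "$up"
            return 0
        fi
        shift 2
        up=$((up + 1))
    done
    return 1
}

# The corrected map from the update above.
MAP='46,6,0,63,255,255,255,255,255,255,255,255,255,255,255,255,255,255'
```

With the corrected map, qos_map_up "$MAP" 46 (EF) comes out as user priority 6, which lands in AC_VO, and qos_map_up "$MAP" 18 (AF21) comes out as 0, which lands in AC_BE.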


Just to confirm what you're recommending: I should change my /etc/config/wireless file, which has the current contents below.

Should I add that config to the file? Or should I modify the second stanza (starting with config wifi-iface 'default_radio0') to match the lines above? (I think it's the latter, but I'm checking first...) Thanks.

# CURRENT
root@Belkin-HBTL:/etc/config# cat /etc/config/wireless

config wifi-device 'radio0'
	option type 'mac80211'
	option path 'platform/18000000.wmac'
	option channel '1'
	option band '2g'
	option htmode 'HT20'
	option cell_density '0'

config wifi-iface 'default_radio0'
	option device 'radio0'
	option network 'lan'
	option mode 'ap'
	option encryption 'none'
	option ssid 'Belkin-HBTL'

config wifi-device 'radio1'
	option type 'mac80211'
	option path '1a143000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0'
	option channel '36'
	option band '5g'
	option htmode 'HE80'
	option disabled '1'

config wifi-iface 'default_radio1'
	option device 'radio1'
	option network 'lan'
	option mode 'ap'
	option ssid 'OpenWrt'
	option encryption 'none'

If the issues are in your 2.4 GHz radio modify the second stanza default_radio1, if your issues are with your 5 GHz radio modify the first one default_radio0, or just go and do both as follows:

config wifi-iface 'default_radio0'
	option device 'radio0'
	option network 'lan'
	option mode 'ap'
	option encryption 'none'
	option ssid 'Belkin-HBTL'
	option iw_qos_map_set '0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,46,0,0'

config wifi-iface 'default_radio1'
	option device 'radio1'
	option network 'lan'
	option mode 'ap'
	option ssid 'OpenWrt'
	option encryption 'none'
	option iw_qos_map_set '0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,46,0,0'

Update: apologies, wrong mapping of EF, this should work as expected: option iw_qos_map_set '46,6,0,63,255,255,255,255,255,255,255,255,255,255,255,255,255,255'
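
If you'd rather not edit the file by hand, the same option can probably be set with uci. A minimal sketch, assuming the section names default_radio0 and default_radio1 from the config posted above; the run helper, DRY_RUN and set_flat_qos_map are made up for this example.

```shell
# The corrected map from the update above.
QOS_MAP='46,6,0,63,255,255,255,255,255,255,255,255,255,255,255,255,255,255'

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
# (Hypothetical helper, not part of uci or OpenWrt.)
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_flat_qos_map IFACE...: set the map on each named wifi-iface
# section, save the wireless config, and reload wifi to apply it.
set_flat_qos_map() {
    for iface in "$@"; do
        run uci set "wireless.$iface.iw_qos_map_set=$QOS_MAP"
    done
    run uci commit wireless
    run wifi reload
}

# Example (not run here):
#   DRY_RUN=1; set_flat_qos_map default_radio0 default_radio1
```

Running it with DRY_RUN=1 first prints the commands so you can check the section names before anything is changed. Reloading wifi will briefly drop wireless clients, so do this from a wired connection if possible.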

Update: I made the change to /etc/config/wireless from @amteza and rebooted. The SSH session with htop has been running for over 1h 20 minutes. I'm going to let the computer run overnight.
NB: I think @amteza typed it backwards in the note above: radio0 in my config is the 2.4GHz radio; radio1 is the 5GHz

Whether that's successful or not, I will back that change out, and try @jow's change from this post to see if that makes a difference.

THANKS ALL FOR ALL THIS SUPPORT!