[SOLVED] SSH over wifi stops working on RT3200/E8450 with 22.03.0-rc6

richb-hanover-priv · August 11, 2022, 3:04pm

@takimata - thanks for the further info

I've only seen this on a Mac laptop
I am testing my Win10 laptop now
I have only tried 2.4GHz; I did not even enable 5GHz on the RT3200
I don't have a Linux laptop handy

dtaht · August 11, 2022, 3:51pm

I note that I am unfond of most usages of the non-default wifi queues. They don't do what most of our layer 3 protocols expect. I would be perfectly happy if openwrt shipped with a flat qos-map into BE and/or ripped out support for the other queues entirely.

It's kind of my hope the ssh thing is randomness, tcp itself, or related to something other than what is discussed in "AQL and the ath10k is *lovely*" (#831 by dtaht).

That said, if this is really a dscp related problem, I can think of one potentially related ATF flaw: packets coming from a station on one queue and going out another might act up (somehow). I'd like to rule that out. Try a flat qos-map in hostapd? That avoids having to remark packets via nft.

vochong · August 11, 2022, 7:00pm

When your SSH problem occurs, try to restart dropbear (/etc/init.d/dropbear restart) on your router via ssh from another machine and check whether you can ssh to the router from the device having the SSH over WIFI problem.

If restarting dropbear fixes your SSH problem, and you can still access LuCI from the machine that has the SSH connection problem, then it may not be a problem with WIFI.

takimata · August 11, 2022, 7:00pm

It is. More specifically it is related to the AF21 "interactive" marker if an MT76 wifi is somewhere in the network path. dropbear started to set it with version 2022.82:

Priority (tty) traffic is now set to AF21 "interactive".

If we apply this AF21 marker to other traffic using nft, it has the same effect. E.g. after marking all port 80 traffic with AF21, LuCI "stops working" through an MT76 wifi (OpenWrt 21.02 AP).
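To make the AF21 marker concrete, here is a minimal sketch of the arithmetic behind it. AF21 is DSCP 18 (RFC 2597), which is 0x48 in the TOS byte. The helper names are hypothetical, and the "default" DSCP-to-queue mapping shown (user priority = top three DSCP bits when no QoS map is configured) is an assumption about the stock Linux wifi stack, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helpers; plain POSIX sh, nothing router-specific.

# AFxy code point: DSCP = 8*class + 2*drop_precedence (RFC 2597).
af_to_dscp() {
    echo $(( 8 * $1 + 2 * $2 ))
}

# DSCP occupies the upper six bits of the IPv4 TOS / IPv6 Traffic Class byte.
dscp_to_tos() {
    printf '0x%02x\n' $(( $1 << 2 ))
}

# ASSUMPTION: with no QoS map configured, the 802.11 user priority (UP)
# is taken from the top three bits of the DSCP value.
default_up() {
    echo $(( $1 >> 3 ))
}

# 802.11 user priority -> WMM access category.
up_to_ac() {
    case "$1" in
        1|2) echo AC_BK ;;
        0|3) echo AC_BE ;;
        4|5) echo AC_VI ;;
        6|7) echo AC_VO ;;
        *)   echo unknown ;;
    esac
}
```

Under that assumption, `up_to_ac "$(default_up "$(af_to_dscp 2 1)")"` puts AF21 in the background queue (AC_BK) rather than best effort, which would at least explain why AF21 traffic is treated differently from unmarked traffic.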

dtaht · August 11, 2022, 7:04pm

some products today are marking the tcp syn packet at af21 also.

On other tests on the bug I referenced we were testing a variety of other dscps and not seeing a problem. Anyone have aircaps?

richb-hanover-priv · August 11, 2022, 7:16pm

Cool - Just in time! (The SSH connection had hung about 5 minutes ago.)

Neither restarting dropbear from LuCI nor /etc/init.d/dropbear restart restored SSH access from the affected machine. My other laptop could SSH in via Wi-Fi, though.

vochong · August 11, 2022, 7:24pm

I think the culprit is in takimata's post regarding AF21 change in the latest dropbear v2022.82 not playing well with MT76 WIFI.

I did the same SSH test connecting to my R7800 (ath10k, 22.03-RC6) and running htop for hours. I did not encounter your SSH over WIFI problem.

vochong · August 11, 2022, 7:30pm

You can try running htop with a faster refresh to see if you can reproduce the problem faster (or it may just disappear by itself):

htop -d 10

or even faster:

htop -d 1

amteza · August 11, 2022, 7:58pm

As per @dtaht's quick fix, can you try this iw_qos_map_set under your wifi-iface configuration? It maps EF markings into the AC_VO queue; everything else goes to AC_BE, no matter the DSCP marking.

config wifi-iface 'wifinet2'
	option device 'radio0'
	option mode 'ap'
	option ssid 'test'
	option encryption 'psk2+ccmp'
	option key 'test.test.test.test'
	option network 'lan'
	option disassoc_low_ack '0'
	option dtim_period '1'
	option iw_qos_map_set '0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,46,0,0'

Update: apologies, wrong mapping of EF, this should work as expected: option iw_qos_map_set '46,6,0,63,255,255,255,255,255,255,255,255,255,255,255,255,255,255'
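For anyone wanting to check what a given map string does before applying it, here is a rough sketch of a decoder. It assumes the hostapd qos_map_set layout (optional DSCP,UP exception pairs, followed by eight low,high DSCP ranges for UP 0 through 7, with 255,255 meaning "unused"), and it assumes an unmatched DSCP falls back to its top three bits. The function name qos_map_up is hypothetical.

```shell
#!/bin/sh
# qos_map_up MAP DSCP -> prints the user priority (UP) chosen for DSCP.
# ASSUMED layout: [DSCP,UP exception pairs,] then 8 x "low,high" ranges
# for UP 0..7; a range of 255,255 is unused.
qos_map_up() {
    map="$1"; dscp="$2"

    # Split the comma-separated map into positional parameters.
    old_ifs="$IFS"; IFS=','
    set -- $map
    IFS="$old_ifs"

    # Need at least the 16 range values, and an even count overall.
    if [ "$#" -lt 16 ] || [ $(( $# % 2 )) -ne 0 ]; then
        echo "invalid qos map" >&2
        return 1
    fi

    # Everything before the last 16 values is exception pairs.
    n_exc=$(( ($# - 16) / 2 ))
    i=0
    while [ "$i" -lt "$n_exc" ]; do
        if [ "$1" -eq "$dscp" ]; then echo "$2"; return 0; fi
        shift 2
        i=$(( i + 1 ))
    done

    # Walk the eight UP ranges in order.
    up=0
    while [ "$up" -lt 8 ]; do
        if [ "$1" -ne 255 ] && [ "$dscp" -ge "$1" ] && [ "$dscp" -le "$2" ]; then
            echo "$up"; return 0
        fi
        shift 2
        up=$(( up + 1 ))
    done

    # ASSUMPTION: no match falls back to the top three DSCP bits.
    echo $(( dscp >> 3 ))
}
```

Fed the corrected map from the update above, this reports UP 6 for DSCP 46 (EF) and UP 0 for DSCP 18 (AF21), i.e. EF goes to voice and AF21 lands in best effort along with everything else.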

richb-hanover-priv · August 11, 2022, 8:31pm

Just to confirm what you're recommending: I should change my /etc/config/wireless file with current contents (below).

Should I add that config to the file? Or should I modify the second stanza (starting with config wifi-iface 'default_radio0') to match the lines above? (I think it's the latter, but I'm checking first...) Thanks.

# CURRENT
root@Belkin-HBTL:/etc/config# cat /etc/config/wireless

config wifi-device 'radio0'
	option type 'mac80211'
	option path 'platform/18000000.wmac'
	option channel '1'
	option band '2g'
	option htmode 'HT20'
	option cell_density '0'

config wifi-iface 'default_radio0'
	option device 'radio0'
	option network 'lan'
	option mode 'ap'
	option encryption 'none'
	option ssid 'Belkin-HBTL'

config wifi-device 'radio1'
	option type 'mac80211'
	option path '1a143000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0'
	option channel '36'
	option band '5g'
	option htmode 'HE80'
	option disabled '1'

config wifi-iface 'default_radio1'
	option device 'radio1'
	option network 'lan'
	option mode 'ap'
	option ssid 'OpenWrt'
	option encryption 'none'

amteza · August 11, 2022, 8:44pm

If the issues are in your 2.4 GHz radio modify the second stanza default_radio1, if your issues are with your 5 GHz radio modify the first one default_radio0, or just go and do both as follows:

config wifi-iface 'default_radio0'
	option device 'radio0'
	option network 'lan'
	option mode 'ap'
	option encryption 'none'
	option ssid 'Belkin-HBTL'
	option iw_qos_map_set '0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,46,0,0'

config wifi-iface 'default_radio1'
	option device 'radio1'
	option network 'lan'
	option mode 'ap'
	option ssid 'OpenWrt'
	option encryption 'none'
	option iw_qos_map_set '0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,46,0,0'

Update: apologies, wrong mapping of EF, this should work as expected: option iw_qos_map_set '46,6,0,63,255,255,255,255,255,255,255,255,255,255,255,255,255,255'
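If you would rather not edit the file by hand, the same change can probably be made with uci. This is only a sketch: print_qos_map_cmds is a hypothetical helper that prints the commands for review instead of running them, and the section names are the ones from the config posted above.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the UCI commands that set one QoS map
# on each named wifi-iface section, so they can be reviewed first.
# Usage: print_qos_map_cmds MAP SECTION [SECTION...]
print_qos_map_cmds() {
    map="$1"; shift
    for section in "$@"; do
        printf "uci set wireless.%s.iw_qos_map_set='%s'\n" "$section" "$map"
    done
    # Persist the change and reload the wifi configuration.
    printf 'uci commit wireless\n'
    printf 'wifi reload\n'
}
```

On the router, something like `print_qos_map_cmds "$MAP" default_radio0 default_radio1` (with MAP set to the corrected map string) shows the commands; once they look right, pipe the output to sh.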

richb-hanover-priv · August 12, 2022, 2:09am

Update: I made the change to /etc/config/wireless from @amteza and rebooted. The SSH session with htop has been running for over 1h 20 minutes. I'm going to let the computer run overnight.
NB: I think @amteza typed it backwards in the note above: radio0 in my config is the 2.4GHz radio; radio1 is the 5GHz

Whether that's successful or not, I will back that change out, and try @jow's change from this post to see if that makes a difference.

THANKS ALL FOR ALL THIS SUPPORT!

amteza · August 12, 2022, 2:26am

That will be a good test, outcome should be the same if this is a mt76 driver bug.

dtaht · August 12, 2022, 12:02pm

Does the problem also occur over ipv6?

richb-hanover-priv · August 12, 2022, 12:51pm

Update: The htop session ran all night with @amteza's config. That's good news.

Bad news: I then reverted the config to the original OpenWrt /etc/config/wireless, rebooted, and re-started the htop test. It has been running without freezing for 1h 45min.

I'm not sure how to interpret this. I may re-flash to RC6 (I'm currently running RC1) to be sure everything's the same.

@dtaht - I have not been testing via IPv6

takimata · August 12, 2022, 1:18pm

The bug clearly also has a random, intermittent weirdness to itself, sometimes it does not reappear for hours, especially after something was previously done to "fix" it.

richb-hanover-priv · August 12, 2022, 2:43pm

The bug clearly also has a random, intermittent weirdness to itself,

Fortunately the SSH session froze after 1h 50 minutes. (WHEW!) My next step will be to re-flash with RC6, then re-test after using @jow's nftables commands.

dtaht · August 12, 2022, 4:04pm

what I suspect is we have some off-by bug in header encoding or decoding at some layer in the stack, which is why I suggested ipv6. A crc error, even.

takimata · August 12, 2022, 5:22pm

Just tried it, yes it does.

richb-hanover-priv · August 12, 2022, 8:06pm

Update: I re-flashed to RC6, ran the two nft commands from @jow on the router and have re-started the htop experiment. Time will tell if that also fixes it. I'll report back tomorrow morning (~16 hours from now)