Archer C7 2.4 GHz wireless dies in 24~48 hours

sammo · May 25, 2021, 10:36am

Try this site; I'm not sure how old the code is.

https://insidelinuxdev.net/article/a0aby8.html

peca89 · May 25, 2021, 12:57pm

@Catfriend1 Yes, I do. I have the v2 version, running 19.07.0 with the CT drivers (I didn't know others existed). The WiFi drops started once I installed the WireGuard server there, so they seem related to high CPU usage or a high amount of traffic. I have an ESP32 connected to 2.4 GHz there, so I can see quite precisely when it stopped reporting to Home Assistant. I might even make an automation there to trigger the scan once it becomes unavailable.

uweklatt · May 25, 2021, 6:06pm

Hi,

at the moment I use my script from Archer C7 2.4 GHz wireless dies in 24~48 hours - #53 by uweklatt with ct-firmware on Archer C7v2 with OpenWrt 19.07.6.

This /root/watchdog.sh script is called every two minutes from crontab (*/2 * * * * /root/watchdog.sh) and checks how many devices are connected to 2.4 GHz. A few devices are always connected here. If the count drops to "0", the command "/sbin/wifi" revives the connections.
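The script from post #53 is not reproduced here. A minimal sketch of the idea described above might look like the following. The interface name wlan1, the function names, and counting clients via "iw dev ... station dump" are assumptions for illustration, not necessarily what the original script does.

```shell
#!/bin/sh
# Hypothetical watchdog sketch. Assumptions: the 2.4 GHz AP interface is
# "wlan1" and clients are counted from "iw dev <if> station dump".

IFACE="${IFACE:-wlan1}"

# Count associated stations from "iw ... station dump" output on stdin.
count_stations() {
    # Each client starts with a line like "Station aa:bb:cc:dd:ee:ff (on wlan1)".
    # grep -c prints 0 but returns 1 when nothing matches, hence "|| true".
    grep -c '^Station ' || true
}

# Restart wifi when no stations are connected; meant to be called from cron.
# Note: if iw itself fails, the count is 0 and wifi is restarted as well.
watchdog_run() {
    n="$(iw dev "$IFACE" station dump | count_stations)"
    if [ "$n" -eq 0 ]; then
        logger -t wifi-watchdog "no stations on $IFACE, running /sbin/wifi"
        /sbin/wifi
    fi
}
```

To use it as /root/watchdog.sh, add a final line calling watchdog_run and keep the crontab entry from the post.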

At the moment I have 51 days of uptime without noticeable interruptions.

Uwe

sammo · May 25, 2021, 8:19pm

The scan commands prevent devices from disconnecting, so there is no interruption to services such as conference calls or video streaming.

Your watchdog could be improved. Instead of

wifi down
wifi

which takes down both 2.4 GHz and 5 GHz, you can run

wifi down [radio0|radio1]
wifi up [radio0|radio1]

for the 2.4 GHz radio only. That leaves 5 GHz untouched, so devices stay connected on 5 GHz.
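As a small sketch of the per-radio restart, the two commands can be wrapped in a function. The name restart_radio is hypothetical, and which of radio0/radio1 is the 2.4 GHz radio depends on the device's /etc/config/wireless.

```shell
#!/bin/sh
# Sketch: restart a single radio and leave the other one untouched.
# "radio0"/"radio1" are the names used in the post above.

restart_radio() {   # usage: restart_radio radioN
    case "$1" in
        radio[0-9]*) ;;   # accept radio0, radio1, ...
        *) echo "usage: restart_radio radioN" >&2; return 2 ;;
    esac
    wifi down "$1" && wifi up "$1"
}
```

A watchdog would then call, for example, restart_radio radio1 instead of plain /sbin/wifi.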

uweklatt · May 25, 2021, 8:36pm

Hi sammo,

yes, this is a good point. I don't use 5GHz because I have an additional AccessPoint for Wifi6...

Uwe

anon2180415 · May 27, 2021, 9:03am

MEMORY STRESS TEST:

btw: I've found a way to trigger a radio crash by creating an out-of-memory situation on the AP.

dd if=/dev/zero of=/tmp/test1 bs=10M
dd if=/dev/zero of=/tmp/test2 bs=500K

After a few of these commands, my shell gets very laggy, RAM is full and htop shows 100% CPU. Then I ran a speedtest (to generate heavy traffic) on the 2.4 GHz radio; my SSH sessions closed "by themselves" and the Web UI has been unreachable since. My mobile phone running the speedtest roamed to another AP and can no longer connect to the AP I've "OOM'ed".

I will now monitor "iw event -t -f" between "memory has been filled up" and "SSH/web server crash" to see if anything significant shows up before I lose access.

Here's the timeline of the test situation:

fill up RAM by filling /tmp to the max
cpu 100% occurs, SSH laggy, web UI responsive
generating 2.4 GHz wifi traffic with speedtest
SSH gets disconnected, Web UI unresponsive (services on the device get killed? crash?)
about 5 minutes later: The device reboots itself and is reachable via SSH/Web UI again.
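To put numbers on the timeline above, available memory could be logged once per second while the test runs. This is only a sketch: it assumes the kernel exposes MemAvailable in /proc/meminfo, and the log path /root/mem.log is a hypothetical choice (not /tmp, since /tmp is what the test fills up).

```shell
#!/bin/sh
# Sketch: log available memory while the stress test runs.
# Assumes /proc/meminfo contains a "MemAvailable:" line.

# Print MemAvailable in kB from a meminfo-style file (default /proc/meminfo).
mem_available_kb() {
    awk '$1 == "MemAvailable:" { print $2 }' "${1:-/proc/meminfo}"
}

# Append "<time> <kB>" lines to the given log file until interrupted.
log_memory() {
    while :; do
        echo "$(date +%H:%M:%S) $(mem_available_kb)" >> "${1:-/root/mem.log}"
        sleep 1
    done
}
```

Running log_memory in a second shell before starting the dd commands would show how quickly memory runs out before SSH drops.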

I've done more tests. To sum them up: In OOM situations,

a) the AP does NOT emit "iw events" pointing at the problem (only the iw event 60 lines indicating normal operation appear).

b) "logread -f" outputs these lines when heavy WiFi traffic hits the AP while in OOM condition:

Thu May 27 11:07:56 2021 daemon.notice hostapd: nl80211:

Thu May 27 11:07:56 2021 daemon.notice hostapd: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5
Thu May 27 11:09:30 2021 daemon.notice hostapd: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5
Thu May 27 11:09:30 2021 daemon.notice hostapd: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5

Shortly after, the SSH session was closed and the Web UI became unresponsive.

Is anyone else seeing "nl_recvmsgs failed: -5" under normal operation after a while, "when the wireless slowly begins to die"?
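For anyone who wants to check their own router, a quick sketch that counts these hostapd lines in the system log. The function name is hypothetical; the match string is taken from the log lines above.

```shell
#!/bin/sh
# Count hostapd "nl_recvmsgs failed" lines in syslog text read from stdin.
count_nl_recv_failures() {
    # grep -c prints 0 but returns 1 when nothing matches, hence "|| true".
    grep -c 'nl80211_recv_beacons->nl_recvmsgs failed' || true
}
```

On the router this would be used as: logread | count_nl_recv_failures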

anon2180415 · May 27, 2021, 9:23am

1621413448.118700: wlan1 (phy #0): unknown event 60

@sammo This means "frame_tx_status" and I suspect it's related to management frame protection ( http://w1.fi/hostapd/devel/ieee802__11_8h.html )
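A small sketch for making such lines easier to read: it splits an "iw event -t" line into timestamp, interface and event number, and adds a name for the two event numbers identified in this thread (60 and 70). The function name is hypothetical; other numbers are printed as "unknown".

```shell
#!/bin/sh
# Parse lines like "<epoch>.<usec>: <iface> (phy #N): unknown event <num>"
# and print "<epoch> <iface> <num> <name>".
parse_iw_event() {
    awk '
        / unknown event [0-9]+$/ {
            ts = $1
            sub(/\..*$/, "", ts)   # drop fractional seconds and trailing colon
            num = $NF
            name = "unknown"
            if (num == 60) name = "frame_tx_status"
            if (num == 70) name = "unprot_deauthenticate"
            print ts, $2, num, name
        }
    '
}
```

Usage on the router: iw event -t -f | parse_iw_event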

anon2180415 · May 27, 2021, 9:28am

CPU STRESS TEST:

Shell 1:

yes > /dev/null &

Shell 2:

htop

Result:

Even under permanent 100% CPU load, I cannot get my 2.4 GHz WiFi to crash when running the speedtest (to generate load again).

I've found a possible explanation for the sporadic 2.4 GHz "dying". Tests 1 and 2 use the http://speedtest-fra1.digitalocean.com speed test, with free RAM monitored via htop and a single device connected to the AP.

My guess: if more connected devices hammer the AP's bandwidth simultaneously, it may run out of memory. That would explain why the failure looks "random": it depends on when many devices generate heavy traffic at the same time.

TEST1 - 2.4 GHz radio (ath9k):

Before starting speed test: free RAM 36.2 M
While running the speed test: free RAM spikes to 61 M
After the speed test, it takes 10-20 seconds for the free RAM to drop back to 36.9 M

TEST2 - 5 GHz radio (ath10k-non-ct):

Before starting speed test: free RAM 35.3 M
While running the speed test: free RAM spikes to 56 M
After the speed test, it takes 10-20 seconds for the free RAM to drop back to 35.0 M

TEST3 - 2.4 GHz only, this time with 2 (instead of 1) devices doing the speedtest simultaneously:

Before starting speed test: free RAM 35 M
While running the speed test: free RAM spikes to 62 M
After the speed test, it takes 10-20 seconds for the free RAM to drop back to 39.4 M

More devices seem to make the RAM situation worse (?).

csantz · May 27, 2021, 11:54am

I believe it is related to OOM as well.

When I change vdevs to 64, it crashes my 5 GHz device and phy0 goes missing. As stated in http://www.candelatech.com/ath10k-10.4.php#config, this is caused by an OOM.

When I change vdevs to 32, WiFi speed gets better, but it stops working a few minutes later, and the C7's memory usage climbs very high until WiFi stops working.

I will run some tests today with a smaller vdevs value, but it seems that memory management in this firmware is somehow buggy.

I found an open bug report about this for the Candela firmware, with no response yet: https://github.com/greearb/ath10k-ct/issues/183.
It would be nice to get some advice on how to provide more info about the OOM crash to help get it solved.

anon2180415 · May 27, 2021, 12:04pm

@csantz Just to avoid misunderstandings: I've used ONLY ath10k-non-ct drivers. But maybe OOM issues exist there as well in the closed source driver?!

csantz · May 27, 2021, 12:10pm

@anon2180415 I'm using the CT HTT firmware, open-source version. With the non-CT firmware I did not experience any problems.

anon2180415 · May 27, 2021, 12:22pm

Ditto. Non-CT firmware on my 5 Archers takes a "long time" until problems like bad WiFi throughput, high pings, or sudden disconnects and reconnects start to occur. It works well for me most of the time, but I still want to find the root cause of an error when one pops up.

I've now repeated the tests mentioned above and changed the distance of my mobile phones, which generate lots of 2.4 GHz WiFi load, from 5 meters to 20 meters. I re-ran "iw event -f -t" and converted the Unix timestamps to local timestamps to analyze the frequency of event 60.

I've read in earlier posts in other topics about the Archer C7's WiFi problems that a larger distance between AP and mobile device may also make connection quality problems more likely.
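The frequency analysis can be sketched as a small awk filter that computes the mean gap between consecutive event 60 lines. The function name and the default interface wlan1 are assumptions for illustration.

```shell
#!/bin/sh
# Read "iw event -t -f" lines on stdin and print the mean gap in seconds
# between consecutive "unknown event 60" lines for the given interface.
mean_event60_gap() {   # usage: mean_event60_gap [iface]
    awk -v ifname="${1:-wlan1}" '
        $2 == ifname && / unknown event 60$/ {
            ts = $1; sub(/:$/, "", ts); ts += 0   # "1621413448.118700:" -> number
            if (n > 0) total += ts - prev
            prev = ts; n++
        }
        END { if (n > 1) printf "%.1f\n", total / (n - 1) }
    '
}
```

With a saved capture this would be run as: mean_event60_gap wlan1 < events.log (events.log being a hypothetical file name). A single timestamp can be converted with date -d @1621413448, assuming the date implementation on the router accepts the @epoch form.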

anon2180415 · May 27, 2021, 12:44pm

Currently trying the workaround above to see whether it helps against the WiFi problems. I have no proof that it does, so I want to try. Whenever event 60 (frame_tx_status) occurs on the 2.4 GHz interface "wlan1", a scan is triggered as a workaround.

iw event -t -f | grep "wlan1.*: unknown event 60" | while read line; do echo $(date +%Y-%m-%d_%H-%M-%S): scan...; iw dev wlan1 scan trigger freq 2447 flush; done;

output:

2021-41-27_14-05-52: scan...
2021-44-27_14-05-53: scan...
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)

There seems to be a burst of event 60 log lines shortly after a scan was triggered. It's "good" that the scans do not queue up; instead, one scan finishes before another one can be launched.
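The "Resource busy (-16)" flood could probably be avoided by rate-limiting the trigger. The sketch below is not part of the tested workaround: the function names and the default minimum gap of 5 seconds are assumptions, while the interface wlan1 and freq 2447 are copied from the one-liner above.

```shell
#!/bin/sh
# Return success if at least <min_gap> seconds have passed since <last_scan>.
scan_allowed() {   # usage: scan_allowed <now> <last_scan> <min_gap>
    [ $(( $1 - $2 )) -ge "$3" ]
}

# Read event lines on stdin; trigger at most one scan per MIN_GAP seconds.
throttled_scan_loop() {
    last=0
    while read -r line; do
        now="$(date +%s)"
        if scan_allowed "$now" "$last" "${MIN_GAP:-5}"; then
            echo "$(date +%Y-%m-%d_%H-%M-%S): scan..."
            iw dev wlan1 scan trigger freq 2447 flush
            last="$now"
        fi
    done
}
```

Usage: iw event -t -f | grep "wlan1.*: unknown event 60" | throttled_scan_loop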

@sammo made a good guess: 30 seconds was the average delay between the event 60 log lines in my test setup.

My 2.4 GHz phone (without running anything) sometimes spikes to a ping of around 600 ms. That's exactly when event 60 comes in and the "wifi scan" is triggered. The scan on the 2.4 GHz radio seems to "lower my ping immediately".

Notes for me:

https://insidelinuxdev.net/article/a0aby8.html
iw event 70 - unprot_deauthenticate
had this once in my wlan1 2.4 GHz logs, explanation: https://praneethwifi.in/2020/03/07/protected-management-frames-in-wpa2-802-11w-wpa3-owe/

Could the 2.4 GHz drop also be related to management frame protection, which I have set to "optional"?

anon2180415 · May 27, 2021, 1:32pm

With scans triggered by "event 60", I get about 6 times faster down/up speed and pings that never exceed 500 ms at 20 meters from the AP.

Without scans triggered by "event 60", I get pings between 100 ms and 1000 ms plus lower down/up speed at 20 meters from the AP.

I have absolutely no clue why the "iw scan trigger" workaround does its job, but the connection somehow feels better with it.

iw event -t -f | grep "wlan1.*: unknown event 60" | while read line; do echo $(date +%Y-%m-%d_%H-%M-%S): scan...; iw dev wlan1 scan trigger freq 2447 flush; done;

sammo · May 27, 2021, 1:39pm

@anon2180415 Your investigation is much appreciated. Is this the workaround?

sammo · May 27, 2021, 3:20pm

You are hardcoding wlan1 even though you already have the interface name in the variable RADIO_ATH9K:

anon2180415:

	RADIO_ATH9K="$(iw dev|grep "Interface\|channel\|type"|grep -B 2 'channel.*24..'|grep -m 1 -B 1 'AP'|awk '/Interface/ {print $2}')"
	logAdd "[INFO] BEGIN iw_event_scan_trigger ON interface [${RADIO_ATH9K}]"
	iw event -t -f | while read line; do
		if $(echo -n "${line}" | grep -q "wlan1.*: unknown event 60"); then
			# logAdd "$(date +%Y-%m-%d_%H-%M-%S): Scan ..."
			iw dev wlan1 scan trigger freq 2447 flush >/dev/null 2>&1
		fi
	done

anon2180415 · May 27, 2021, 6:28pm

... thanks! My mistake, updated the script.
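The updated script itself is not shown in this post. A sketch of the fix sammo describes, matching on the detected interface instead of the literal wlan1, could look like this. The function names are hypothetical and freq 2447 is copied from the quoted script.

```shell
#!/bin/sh
# Trigger a scan if the line is an "unknown event 60" for exactly this
# interface (so "wlan1" does not also match "wlan1-3").
handle_event_line() {   # usage: handle_event_line <iface> <line>
    case "$2" in
        *" $1 (phy "*": unknown event 60")
            iw dev "$1" scan trigger freq 2447 flush >/dev/null 2>&1
            ;;
    esac
    return 0
}

# Follow "iw event" output and feed each line to the handler.
iw_event_scan_trigger() {   # usage: iw_event_scan_trigger "${RADIO_ATH9K}"
    iw event -t -f | while read -r line; do
        handle_event_line "$1" "$line"
    done
}
```

Compared with the grep pattern "wlan1.*: unknown event 60", the case pattern also avoids matching other event numbers that merely start with 60.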

sammo · May 27, 2021, 8:38pm

Uwe,

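Not nitpicking, but if you also have a guest AP on 2.4 GHz, you can sum the stations across all interfaces of that radio:

[ "$(awk '{ SUM += $1 } END { print SUM + 0 }' /sys/kernel/debug/ieee80211/phy1/netdev:*/num_mcast_sta)" -eq 0 ] && /sbin/wifi   # or whatever your watchdog does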
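As a slightly more defensive sketch of the same check, the summing can be put into a function that also copes with missing files, for example when debugfs is not mounted or the glob matches nothing. The function name is hypothetical, and whether phy1 is the 2.4 GHz radio depends on the device.

```shell
#!/bin/sh
# Sum the station counts found in the given num_mcast_sta files.
# Prints 0 when no file is given or none of them can be read.
sum_station_counts() {
    [ "$#" -gt 0 ] || { echo 0; return 0; }
    cat "$@" 2>/dev/null | awk '{ sum += $1 } END { print sum + 0 }'
}
```

Usage in the watchdog: [ "$(sum_station_counts /sys/kernel/debug/ieee80211/phy1/netdev:*/num_mcast_sta)" -eq 0 ] && /sbin/wifi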

anon2180415 · May 28, 2021, 9:19am

Fixed radio detection in script. Before, it said "wlan1-3", now it says "wlan1".

Script updated - see Archer C7 2.4 GHz wireless dies in 24~48 hours - #199
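The updated detection is in the linked post and not repeated here. One possible way to get "wlan1" rather than "wlan1-3" is sketched below; it parses "iw dev" output with awk and prefers an interface name without a "-" suffix. This is an assumption about how such a fix could work, not the actual change, and the function name is hypothetical.

```shell
#!/bin/sh
# Read "iw dev" output on stdin and print the 2.4 GHz AP interface,
# preferring a base interface (no "-N" suffix) over additional VAPs.
find_24ghz_ap() {
    awk '
        $1 == "Interface" { ifname = $2; type = "" }
        $1 == "type"      { type = $2 }
        $1 == "channel" && type == "AP" && $0 ~ /\(24[0-9][0-9] MHz\)/ {
            if (first == "") first = ifname
            if (ifname !~ /-/ && base == "") base = ifname
        }
        END { if (base != "") print base; else if (first != "") print first }
    '
}
```

Usage: RADIO_ATH9K="$(iw dev | find_24ghz_ap)"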

Catfriend1 · May 28, 2021, 3:06pm

Tried the watchdog script above and my WiFi works fine under simulated iperf load. Thank you for all the work on this.