try this site, not sure how old the code is
https://insidelinuxdev.net/article/a0aby8.html
@Catfriend1 Yes, I do. I have the v2 version, running 19.07.0 and CT drivers (didn't know others existed). The WiFi drops started once I installed the WireGuard server there; they seem related to high CPU usage or a high amount of traffic. I have an ESP32 connected to 2.4 GHz there, so I can see quite precisely when it stopped reporting to Home Assistant. I might even make an automation there to trigger the scan once it becomes unavailable.
Hi,
at the moment I use my script from "Archer C7 2.4 GHz wireless dies in 24~48 hours - #53 by uweklatt" with ct-firmware on an Archer C7 v2 with OpenWrt 19.07.6.
This /root/watchdog.sh script is called every two minutes from crontab (*/2 * * * * /root/watchdog.sh) and checks how many devices are connected to 2.4 GHz. I always have a few devices connected. If the count goes down to "0", the command "/sbin/wifi" revives the connections.
At the moment I have 51 days of uptime without noticeable interruptions.
Uwe
The scan commands keep devices from disconnecting, so there is no interruption to services such as conference calls or video streaming.
Your watchdog could be improved. Instead of
wifi down
wifi
which takes down both 2.4 GHz and 5 GHz,
you can run
wifi down [radio0|radio1]
wifi up [radio0|radio1]
That leaves 5 GHz untouched, and devices stay connected on 5 GHz.
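A minimal sketch of what such a per-radio watchdog could look like. This is not the script from the linked post; the interface name `wlan1`, the radio name `radio1`, and the function names are assumptions, so check `iw dev` and /etc/config/wireless on your own device first.

```shell
#!/bin/sh
# Hypothetical per-radio watchdog sketch (not the original /root/watchdog.sh).
# IFACE and RADIO are assumptions; adjust them to your 2.4 GHz interface/radio.
IFACE="${IFACE:-wlan1}"
RADIO="${RADIO:-radio1}"

# Count associated stations by counting the "Station <mac> (on <iface>)"
# header lines of `iw dev <iface> station dump` output read from stdin.
count_stations() {
    grep -c '^Station '
}

# Restart only the 2.4 GHz radio when no station is associated.
check_radio() {
    n="$(iw dev "$IFACE" station dump | count_stations)"
    if [ "$n" -eq 0 ]; then
        logger -t wifi-watchdog "no stations on $IFACE, restarting $RADIO"
        wifi down "$RADIO"
        wifi up "$RADIO"
    fi
}

# Only act where the tools exist (i.e. on the router itself).
if command -v iw >/dev/null 2>&1 && command -v wifi >/dev/null 2>&1; then
    check_radio
fi
```

Called from the same crontab entry as before, this would leave the 5 GHz radio alone. Like the original approach, it only makes sense if at least one device is normally associated on 2.4 GHz.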
Hi sammo,
yes, this is a good point. I don't use 5GHz because I have an additional AccessPoint for Wifi6...
Uwe
MEMORY STRESS TEST:
btw: I've found a way to trigger a radio crash by creating an out-of-memory situation on the AP.
dd if=/dev/zero of=/tmp/test1 bs=10M
dd if=/dev/zero of=/tmp/test2 bs=500K
After some of these commands, my shell gets very laggy, RAM is full and htop shows 100% CPU. Then I ran a speedtest (to generate heavy traffic) on the 2.4 GHz radio, and my SSH sessions closed "by themselves"; the Web UI has been unreachable since then. My mobile phone running the speedtest roams to another AP and can no longer connect to the AP I've "OOM'ed".
I will now try to monitor "iw event -t -f" between "my memory has been filled up" and "ssh/web server crash" to see if anything significant shows up before I lose access.
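To line up the memory state with the "iw event" output, a small logger like the following could run in a second shell during the test. It is only a sketch: the function names are made up for illustration, and it assumes the kernel exposes `MemAvailable` in /proc/meminfo.

```shell
#!/bin/sh
# Sketch: print timestamped MemAvailable samples during the stress test.

# Print the MemAvailable value (in kB) from a meminfo-style file.
# The optional path argument exists only to make the function testable.
mem_available_kb() {
    awk '$1 == "MemAvailable:" { print $2; exit }' "${1:-/proc/meminfo}"
}

# Print one sample every <interval> seconds (default 5); stop with Ctrl-C.
log_memory() {
    interval="${1:-5}"
    while :; do
        echo "$(date +%Y-%m-%d_%H-%M-%S) MemAvailable=$(mem_available_kb) kB"
        sleep "$interval"
    done
}
```

Running `log_memory 2` next to the dd commands gives a rough timeline of how fast free memory disappears before the session dies.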
Here's the timeline of the test situation:
I've done more tests. To sum them up: In OOM situations,
a) the AP does NOT emit "iw events" pointing at the problem (only the iw event 60 entries indicating normal operation appear).
b) logread -f outputs the following lines when heavy WiFi traffic hits the AP while it is in an OOM condition:
Thu May 27 11:07:56 2021 daemon.notice hostapd: nl80211:
Thu May 27 11:07:56 2021 daemon.notice hostapd: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5
Thu May 27 11:09:30 2021 daemon.notice hostapd: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5
Thu May 27 11:09:30 2021 daemon.notice hostapd: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5
Shortly after, the SSH session was closed and the Web UI became unresponsive.
Anyone else maybe experiencing the "nl_recvmsgs failed: -5" under normal operation after a time "when the wireless slowly begins to die"?
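For anyone who wants to check their own log for this message, a sketch like the one below counts the matching hostapd lines. The function name is hypothetical; the search string is taken verbatim from the log lines above.

```shell
#!/bin/sh
# Sketch: count hostapd "nl_recvmsgs failed: -5" lines in log text on stdin.
count_nl_recv_errors() {
    grep -c -F -e 'nl_recvmsgs failed: -5'
}
```

On the router this would be used as `logread | count_nl_recv_errors`. A non-zero count during normal operation would suggest the same symptom also shows up outside the artificial OOM test.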
1621413448.118700: wlan1 (phy #0): unknown event 60
@sammo This means "frame_tx_status" and I suspect it's related to management frame protection ( http://w1.fi/hostapd/devel/ieee802__11_8h.html )
CPU STRESS TEST:
Shell 1:
yes > /dev/null &
Shell 2:
htop
Result:
I've found a possible explanation for the sporadic 2.4 GHz "dying". Both tests run the http://speedtest-fra1.digitalocean.com speed test while I monitor free RAM via htop, with a single device connected to the AP.
I guess that if more connected devices hammer the AP's bandwidth simultaneously, it may run out of memory. That would explain why it looks "random": it depends on when many devices happen to generate heavy traffic at the same time.
TEST1 - 2.4 GHz radio (ath9k):
TEST2 - 5 GHz radio (ath10k-non-ct):
TEST3 - 2.4 GHz only, this time with 2 (instead of 1) devices doing the speedtest simultaneously:
More devices seem to make the RAM situation worse (?).
I believe it is related to OOM as well.
When I change vdevs to 64, it crashes my 5 GHz device and phy0 goes missing, which is caused by OOM as stated in http://www.candelatech.com/ath10k-10.4.php#config
When I change vdevs to 32, WiFi speed gets better, but it stops working a few minutes later, and the C7's memory usage climbs very high until WiFi stops working.
I will run some tests today with a small vdevs value, but it seems that memory management in this firmware is somehow buggy.
I found a bug report about this against the Candela firmware, without a response so far: https://github.com/greearb/ath10k-ct/issues/183
It would be nice to get some advice on how to provide more info about the OOM crash to help get it solved.
@csantz Just to avoid misunderstandings: I've used ONLY ath10k-non-ct drivers. But maybe OOM issues exist there as well in the closed source driver?!
@anon2180415 I'm using the CT HTT firmware, open-source version; with non-CT firmware I did not experience any problems.
Ditto, non-ct firmware on my 5 Archers takes a "long time" until problems like bad WiFi throughput / high pings / sudden disconnects and reconnects start to occur. It works well for me most of the time, but I'm still trying to find the root cause of an error when it pops up.
I've now repeated my above mentioned tests and changed the distance from my mobile phones generating lots of 2.4 GHz WiFi load from 5 meters to 20 meters. I re-ran "iw event -f -t " and converted the unix timestamps to local time stamps to analyze the frequency of event 60.
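The interval analysis can be sketched with awk instead of converting timestamps by hand. This assumes the line format shown earlier ("1621413448.118700: wlan1 (phy #0): unknown event 60"); the function name and the default interface are made up for illustration, and conversion to local time is left out.

```shell
#!/bin/sh
# Sketch: print the seconds between consecutive "unknown event 60" lines
# for one interface, read from `iw event -t -f` output on stdin.
event60_intervals() {
    awk -v iface="${1:-wlan1}" '
        $2 == iface && /unknown event 60/ {
            ts = $1 + 0              # "1621413448.118700:" -> number
            if (seen) {
                d = ts - prev
                sum += d
                n++
                printf "%.1f\n", d
            }
            prev = ts
            seen = 1
        }
        END {
            if (n) printf "avg %.1f s over %d intervals\n", sum / n, n
        }
    '
}
```

Feeding it a saved capture, for example `event60_intervals wlan1 < events.log` (the file name is an example), prints one interval per line plus the average.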
I've read previous posts in other topics on the Archer C7's WiFi problems saying that increasing the distance between AP and mobile device may also make connection quality problems more likely.
I'm currently trying the above workaround to see if it helps against the WiFi problems. I have no proof yet, so I want to try it. Whenever event 60 (frame_tx_status) occurs on 2.4 GHz "wlan1", a scan is triggered "as workaround".
iw event -t -f | grep "wlan1.*: unknown event 60" | while read line; do echo $(date +%Y-%m-%d_%H-%M-%S): scan...; iw dev wlan1 scan trigger freq 2447 flush; done;
output:
2021-41-27_14-05-52: scan...
2021-44-27_14-05-53: scan...
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
2021-44-27_14-05-53: scan...
command failed: Resource busy (-16)
There seems to be a high frequency of those event 60 log lines per second shortly after a scan was triggered. It's "good" that the scans do not queue up but instead one scan is finished before another one could be launched.
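The "Resource busy (-16)" lines could probably be avoided by rate-limiting the trigger. Below is a sketch of that idea under assumptions: the 5-second minimum interval is an arbitrary value, the function names are hypothetical, and `wlan1` and `freq 2447` are taken from the one-liner above.

```shell
#!/bin/sh
# Sketch: same event-60 workaround, but with a minimum gap between scans.
IFACE="${IFACE:-wlan1}"
MIN_INTERVAL="${MIN_INTERVAL:-5}"   # seconds; arbitrary assumption

# Succeed if at least $3 seconds passed between $2 (last scan) and $1 (now).
scan_due() {
    [ $(( $1 - $2 )) -ge "$3" ]
}

# Watch for event 60 on IFACE and trigger a scan, at most once per interval.
watch_and_scan() {
    last=0
    iw event -t -f | grep "${IFACE}.*: unknown event 60" | while read -r _line; do
        now="$(date +%s)"
        if scan_due "$now" "$last" "$MIN_INTERVAL"; then
            echo "$(date +%Y-%m-%d_%H-%M-%S): scan..."
            iw dev "$IFACE" scan trigger freq 2447 flush
            last="$now"
        fi
    done
}
```

Calling `watch_and_scan` replaces the one-liner. Events that arrive while a scan was triggered less than `MIN_INTERVAL` seconds ago are simply ignored.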
@sammo made a good guess: 30 seconds was the average delay between the event 60 log lines in my test setup.
My 2.4 GHz phone (without running anything) sometimes spikes up in ping around 600 ms. That's exactly when event 60 comes in and the "wifi scan" is triggered. It seems to "lower my ping immediately" when the scan takes place on the 2.4 GHz radio.
Notes for me:
Does the 2.4 GHz drop maybe also relate somehow with management frame protection which I set to "optional"?
With scans triggered by "event 60" I've got about 6 times faster down/up speed and pings never going up higher than 500 ms in 20 meters distance from the AP.
Without scans triggered by "event 60" I've got pings between 100 ms and 1000 ms plus lower down/up speed in 20 meters distance from the AP.
I have absolutely no clue why the "iw scan trigger" workaround does its job, but the connection somehow feels better with it.
iw event -t -f | grep "wlan1.*: unknown event 60" | while read line; do echo $(date +%Y-%m-%d_%H-%M-%S): scan...; iw dev wlan1 scan trigger freq 2447 flush; done;
You are hardcoding wlan1 even though you already have the interface in the variable RADIO_ATH9K.
... thanks! My mistake, updated the script.
Uwe,
Not nitpicking, but if you also have a guest AP on 2.4 GHz:
[ "$(cat /sys/kernel/debug/ieee80211/phy1/netdev:*/num_mcast_sta | awk '{ SUM += $1 } END { print SUM + 0 }')" -eq 0 ] && { do_stuff; }
Fixed radio detection in script. Before, it said "wlan1-3", now it says "wlan1".
Script updated - see Archer C7 2.4 GHz wireless dies in 24~48 hours - #199
Tried the above watchdog script and my WiFi works fine under simulated iperf load. Thank you for all the work on this.