Archer C7 2.4 GHz wireless dies in 24~48 hours

@zen1932, thanks for confirming that it works on the C7.
You can always set a cron job to run the scan every minute.
A periodic scan should prevent the radio from locking up. Scanning a single channel is quicker and blocks the WiFi for less time; we are talking milliseconds.

@zen1932 Great observation! Could you please try

iw event -t -f > /tmp/iwevent.log

and upload a copy that shows what it looks like on your Archer when the 2.4 GHz WiFi locks up?

The same problem still occurs on the newest firmware.

#Luci

@Catfriend1, I am monitoring only events 84 and 64; otherwise the log is rather verbose. But since the 2.4 GHz radio freezes randomly, I may have to wait a long time before getting results.

@sammo, I had already set up a cron job, but with a full scan. Thanks to your finding, I will now replace it with your quicker scan.

Shortened a few commands into one (note the -E, so that grep treats | as alternation):

iw dev $(iw dev|grep -E "Interface|channel|type"|grep -B 2 'channel.*24..'|grep -m 1 -B 1 'AP'|awk '/Interface/ {print $2}') scan trigger freq 2447 flush >/dev/null 2>&1

Nice, but would it not use less CPU (given that this runs every 30 to 60 seconds) to work out the command once and just hard-code the values that work for your deployment?

So like this:

# 5Ghz – ch 161

iw dev wlan1 scan trigger freq 5805 flush 

# 2.4GHz -  ch 8

iw dev wlan0 scan trigger freq 2447 flush
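
The freq values in the two commands above follow from the channel numbers (channel 8 is 2447 MHz, channel 161 is 5805 MHz). A minimal sketch of that arithmetic, assuming the standard channel numbering; the function name chan_to_freq is made up for illustration and is not part of iw:

```shell
#!/bin/sh
# chan_to_freq is a hypothetical helper, not an iw feature.
# It converts a WiFi channel number to its centre frequency in MHz.
chan_to_freq() {
    ch="$1"
    if [ "$ch" -ge 1 ] && [ "$ch" -le 13 ]; then
        # 2.4 GHz band: channels are 5 MHz apart, starting at 2412 MHz
        echo $((2407 + 5 * ch))
    elif [ "$ch" -eq 14 ]; then
        # Channel 14 is the one exception in the 2.4 GHz band
        echo 2484
    elif [ "$ch" -ge 32 ] && [ "$ch" -le 177 ]; then
        # 5 GHz band
        echo $((5000 + 5 * ch))
    else
        echo "unsupported channel: $ch" >&2
        return 1
    fi
}

# Example (left as a comment so nothing is scanned when this file is sourced):
#   iw dev wlan0 scan trigger freq "$(chan_to_freq 8)" flush
```

This only helps to pick the number once; for the cron job itself the hard-coded frequency is still the cheapest option.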

BTW, this works fine on the MT7621 / 7603 as well as the C7 (I have both). So this workaround seems to point the finger at something common to several (all?) platforms.

Here is a NetSpot signal scan for the past 30 minutes. It shows that, even in a crowded WiFi environment that causes the OpenWrt APs to lower their 2.4 GHz radio signals, the signal is maintained at a high level with a scan every 60 seconds. The top trace (brown) is the 2.4 GHz radio; the next one down (blue) is the 5 GHz radio. The AP is 3' from my MacBook Pro running NetSpot.

I think the commonly reported problem is on 2.4 GHz, which uses the ath9k driver, whereas 5 GHz uses ath10k.

I scan a single channel because when I do a full scan I notice a blip in my iperf results. I think the scan 'blocks' data being sent/received.

Nice graph!!

OK, I see what you mean about finding the wlanXXXXX; yes, of course, unless someone did a uci set xxxx.
It would be great if someone fixed the underlying code rather than relying on a workaround...

Yeah, do NOT do that regularly (at least not on the MT7621) as it drops the link for a second.

Maybe only do it when there is some other clear signal that there is an issue, and this might clear it up.

No drops on the 2.4 GHz radio when running these scans.

[edit: Note that the wlan1 shown here is the 5 GHz radio on the MT76; on the C7, wlan1 is the 2.4 GHz radio. I keep mixing up which is which (since I run both), but this example and the one in my prior post were from a test MT76, so use the right wlan ID for your device. The C7 2.4 GHz radio seems to do well with the probe every 60 seconds.]

It would be great to monitor when the problem happens. You can monitor WiFi events via

iw event -f -t > /tmp/event.txt

FYI, none of these workarounds works, but they could be useful for other purposes/vulnerabilities:

disable ipv6
option disassoc_low_ack '0'
option wpa_group_rekey '43200'
option wpa_strict_rekey '1'
option wpa_disable_eapol_key_retries 1
option txantenna 1
option rxantenna 1

Thanks to option ifname 'wlan0' in config wifi-iface (/etc/config/wireless), one can assign a fixed wlan name, so there is no need to recalculate it every n seconds.

I'm also seeing the following dmesg errors after the WiFi locks up. A short iw scan does indeed solve the issue temporarily. Can we use these errors as a trigger for a short scan instead of scanning every 30 seconds?

[760751.300380] ath: phy1: Unable to reset channel, reset status -5
[761414.248012] ath: phy1: Unable to reset channel, reset status -5
[761582.062141] ath: phy1: Unable to reset channel, reset status -5
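
One way the idea above might be sketched: count the "Unable to reset channel" lines in dmesg on every cron run and trigger the short scan only when the count has gone up since the last run. This is an untested sketch, not a confirmed fix. The interface name (wlan1), the frequency (2447) and the state file path are assumptions to adapt, and later posts in this thread report that not every affected device logs this line.

```shell
#!/bin/sh
# Sketch only: IFACE, FREQ and STATE are assumptions, adjust for your device.
IFACE="${IFACE:-wlan1}"
FREQ="${FREQ:-2447}"
STATE="${STATE:-/tmp/reset_errors.count}"

# Count "Unable to reset channel" lines in kernel log text read from stdin.
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
count_reset_errors() {
    grep -c 'Unable to reset channel' || true
}

# Succeeds when the current count ($1) is higher than the previous one ($2).
should_scan() {
    [ "$1" -gt "$2" ]
}

reset_watch() {
    prev=$(cat "$STATE" 2>/dev/null || echo 0)
    prev="${prev:-0}"
    now=$(dmesg | count_reset_errors)
    echo "$now" > "$STATE"
    # If the kernel ring buffer wrapped, "now" can drop below "prev";
    # in that case nothing is triggered until a new error shows up.
    if should_scan "$now" "$prev"; then
        iw dev "$IFACE" scan trigger freq "$FREQ" flush >/dev/null 2>&1
    fi
}

# Only act when called with --run, e.g. from cron:
#   * * * * * /root/reset_watch.sh --run
if [ "${1:-}" = "--run" ]; then
    reset_watch
fi
```

Whether the error shows up early enough to avoid client disconnects is not established here; it would need to be checked against the iw event log.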

@peca89 Do you also have a TP-Link Archer C7? I'll check my dmesg and, if it reproduces those log lines, I will adjust my watchdog script.

I've checked my 5 production APs (they're all Archer C7s running non-CT ath10k drivers, if that matters with regard to the 2.4 GHz ath9k driver problems). After about one week of runtime and a lot of load in our business environment, none of them shows this line in "dmesg | grep ath":

ath: phy1: Unable to reset channel, reset status -5

Instead, every device has a log output similar to this (captured from AP 1 after 7 days of uptime).

[   15.682419] ath10k_pci 0000:00:00.0: enabling device (0000 -> 0002)
[   15.688963] ath10k_pci 0000:00:00.0: pci irq legacy oper_irq_mode 1 irq_mode 0 reset_mode 0
[   16.956848] ath10k_pci 0000:00:00.0: qca988x hw2.0 target 0x4100016c chip_id 0x043202ff sub 0000:0000
[   16.966260] ath10k_pci 0000:00:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 0
[   16.979337] ath10k_pci 0000:00:00.0: firmware ver 10.2.4-1.0-00047 api 5 features no-p2p,raw-mode,mfp,allows-mesh-bcast crc32 35bd9258
[   17.549366] ath10k_pci 0000:00:00.0: board_file api 1 bmi_id N/A crc32 bebc7c08
[   18.737064] ath10k_pci 0000:00:00.0: htt-ver 2.1 wmi-op 5 htt-op 2 cal file max-sta 128 raw 0 hwcrypto 1
[   18.849005] ath: EEPROM regdomain sanitized
[   18.849014] ath: EEPROM regdomain: 0x64
[   18.849018] ath: EEPROM indicates we should expect a direct regpair map
[   18.849036] ath: Country alpha2 being used: 00
[   18.849039] ath: Regpair used: 0x64
[   18.978956] ath: EEPROM regdomain sanitized
[   18.978965] ath: EEPROM regdomain: 0x64
[   18.978968] ath: EEPROM indicates we should expect a direct regpair map
[   18.978987] ath: Country alpha2 being used: 00
[   18.978990] ath: Regpair used: 0x64
[   44.444997] ath: EEPROM regdomain: 0x8114
[   44.449108] ath: EEPROM indicates we should expect a country code
[   44.455297] ath: doing EEPROM country->regdmn map search
[   44.460688] ath: country maps to regdmn code: 0x37
[   44.465546] ath: Country alpha2 being used: DE
[   44.470051] ath: Regpair used: 0x37
[   44.473582] ath: regdomain 0x8114 dynamically updated by user
[   44.479487] ath: EEPROM regdomain: 0x8114
[   44.483563] ath: EEPROM indicates we should expect a country code
[   44.489746] ath: doing EEPROM country->regdmn map search
[   44.495131] ath: country maps to regdmn code: 0x37
[   44.499988] ath: Country alpha2 being used: DE
[   44.504493] ath: Regpair used: 0x37
[   44.508032] ath: regdomain 0x8114 dynamically updated by user
[   53.212353] ath10k_pci 0000:00:00.0: pdev param 0 not supported by firmware

@sammo It's just a gut feeling, but event ID 84 (probe_client) does not seem to indicate a problem with the 2.4 GHz radio. I'm constantly monitoring the iw event log and walked around with my phone. When I walked out of range of AP 1, it reported the following at the same moment:

1621928540.803986: wlan0-3 (phy #0): unknown event 84
1621928540.805934: wlan0-3: del station 48:2c:a0:xx:xx:xx

With another 2.4 GHz device, the same could be reproduced:

1621928678.485789: wlan1-1: del station 0e:9a:62:xx:xx:xx
1621928678.486927: wlan1-1 (phy #1): unknown event 84

So I'll still watch out for "iw event 64" and the "ath: phy1: Unable to reset channel, reset status -5" log lines; so far neither has been found.
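
Since the full iw event output is verbose, the log could be cut down to the event numbers discussed in this thread. A minimal sketch, assuming the "unknown event N" line format shown above; filter_events is a made-up helper name and the log path is only an example:

```shell
#!/bin/sh
# filter_events is a hypothetical helper: it keeps only the lines that
# report "unknown event 64" or "unknown event 84" and drops everything else.
# The trailing ([^0-9]|$) stops e.g. "event 840" from matching.
filter_events() {
    grep -E 'unknown event (64|84)([^0-9]|$)' || true
}

# Example (left as a comment because iw event runs until interrupted):
#   iw event -t -f | filter_events >> /tmp/iwevent.log
```

The "del station" lines are dropped too; widen the pattern if you want to correlate them with the event 84 entries.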

I think event 84 is a probe_client followed by a del station; by then it is too late to trigger on, as the client has already disconnected.
Event 64 could be a by-product of my hammering the AP with iperf at the time.

We need to trigger on something before any clients get disconnected; that is my preferred solution if the underlying code cannot be fixed.

I don't see the reset channel message on my device, a GL-AR300M.

Halfway house: trigger on client connect/disconnect?

@sammo I'm occasionally getting "unknown event 128" here, but I cannot prove that we had problems with the 2.4 GHz WiFi at that time. Where can I look up what those event codes mean?

Try this site; I'm not sure how old the code is:

https://insidelinuxdev.net/article/a0aby8.html

@Catfriend1 Yes, I do. I have the v2 version, running 19.07.0 and the CT drivers (I didn't know others existed :) ). The WiFi drops started once I installed the WireGuard server there; it seems related to high CPU usage or a high amount of traffic. I have an ESP32 connected to 2.4 GHz there, so I can see quite precisely when it stops reporting to Home Assistant. I might even make an automation there to trigger the scan once it becomes unavailable :)

Hi,

At the moment I use my script from post #53 of this thread (Archer C7 2.4 GHz wireless dies in 24~48 hours - #53 by uweklatt) with ct-firmware on an Archer C7v2 running OpenWrt 19.07.6.

This /root/watchdog.sh script is called every two minutes from crontab (*/2 * * * * /root/watchdog.sh) and checks how many devices are connected to 2.4 GHz. I always have a few devices connected. If the count goes down to "0", the command "/sbin/wifi" revives the connections.

At the moment I have 51 days of uptime without noticeable interruptions.
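
For readers who do not want to dig up post #53: the following is not that script, only a rough sketch of the same idea. It assumes the 2.4 GHz interface is wlan1 (adjust to your setup) and counts the "Station" lines in the iw station dump output; the --run argument is an addition of this sketch.

```shell
#!/bin/sh
# Sketch of a station-count watchdog; IFACE is an assumption.
IFACE="${IFACE:-wlan1}"

# Count associated clients in "iw dev <iface> station dump" output (stdin).
# Each client starts with a line such as "Station aa:bb:cc:dd:ee:ff (on wlan1)".
count_stations() {
    grep -c '^Station ' || true
}

watchdog() {
    n=$(iw dev "$IFACE" station dump | count_stations)
    if [ "$n" -eq 0 ]; then
        logger -t watchdog "no stations on $IFACE, restarting wifi"
        /sbin/wifi
    fi
}

# Only act when called with --run, e.g. from cron:
#   */2 * * * * /root/watchdog.sh --run
if [ "${1:-}" = "--run" ]; then
    watchdog
fi
```

This only makes sense if some devices are always expected on 2.4 GHz; on a network that is legitimately empty at times it would restart the WiFi every two minutes.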

Uwe

The scan commands prevent any devices from disconnecting, hence no interruption to services, conference calls, or video streaming.

Your watchdog could be improved. Instead of

wifi down
wifi

which will take down both 2.4 GHz and 5 GHz, you can run

wifi down [radio0|radio1]
wifi up [radio0|radio1]

This leaves 5 GHz untouched, so devices stay connected on 5 GHz.
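
As a small sketch, the per-radio restart could be wrapped in a function so that a watchdog only has to name the radio. Which of radio0/radio1 is the 2.4 GHz one depends on the device, so the radio name in the example is an assumption; WIFI_CMD is a made-up variable that allows a dry run with echo.

```shell
#!/bin/sh
# restart_radio is a hypothetical helper around OpenWrt's wifi script.
# Set WIFI_CMD=echo to print the commands instead of running them.
restart_radio() {
    radio="$1"
    wifi_cmd="${WIFI_CMD:-wifi}"
    # Take down and bring up only the named radio; the other one keeps running.
    $wifi_cmd down "$radio"
    $wifi_cmd up "$radio"
}

# Example (as a comment; check /etc/config/wireless for the 2.4 GHz radio name):
#   restart_radio radio1
```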

Hi sammo,

yes, this is a good point. I don't use 5GHz because I have an additional AccessPoint for Wifi6...

Uwe