Archer C7 2.4 GHz wireless dies in 24~48 hours

Put this in /etc/rc.local; it runs the scan loop in the background:

while sleep 30; do iw dev $(iw dev|grep "Interface\|channel\|type"|grep -B 2 'channel.*24..'|grep -B 1 'AP'|tail -n 2|grep Interface|awk '{print $2}') scan trigger freq 2447 flush >/dev/null 2>&1; done &

@Catfriend1

Have you managed to crash the 2.4 GHz wifi with the workaround in place? I don't have a TP-Link C7, but I do have an ath9k f/w device.

I haven't used the workaround yet, as I first wanted to see what iw event shows in case of a "wifi crash". Currently wifi is rock solid, so I'm still waiting for a normal spontaneous crash to occur.

Maybe run the scan loop and hammer it with iperf. There is no point going further if the scan does not prevent a crash.

@sammo, with the command top, you will see rc.local will never exit because its sub-process 'while sleep 30...' will never exit.

@Catfriend1, I can confirm that when 2.4 GHz freezes, iw dev wlan1 scan revives the clients. I did not know sammo's quicker (less CPU) trick.

@zen1932, thanks for confirming that it works on the C7.
You can always set a cron job to run the scan every minute.
A periodic scan should prevent it from locking up. Scanning one channel is quicker and blocks the wifi for less time; we are talking milliseconds.
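
A minimal sketch of the cron idea, not a tested setup. The function name scan_cron_line is made up for illustration, and the interface (wlan0) and frequency (2447 MHz, channel 8) are assumptions you need to adapt to your own AP. It only prints the crontab line, so you can check it before installing it.

```shell
# Hypothetical helper: print a crontab line that triggers a one-channel scan
# every minute (cron's finest granularity).
# $1 = interface name (assumption, e.g. wlan0)
# $2 = frequency in MHz of the channel the AP is on (assumption, e.g. 2447)
scan_cron_line() {
    iface="$1"
    freq="$2"
    printf '* * * * * iw dev %s scan trigger freq %s flush >/dev/null 2>&1\n' "$iface" "$freq"
}
```

On OpenWrt you would typically append the output to /etc/crontabs/root (for example `scan_cron_line wlan0 2447 >> /etc/crontabs/root`) and then restart cron with `/etc/init.d/cron restart`.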

@zen1932 Great observation! Could you please try

iw event -t -f > /tmp/iwevent.log

and upload a copy that shows what it looks like on your Archer when the 2.4 GHz WiFi locks up?

The same problem on the newest firmware for now.

#Luci

@Catfriend1, I am monitoring only events 84 and 64; otherwise the output is rather verbose. But since the 2.4 GHz radio freezes randomly, I may have to wait a long time before getting results.
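
One way to cut the iw event output down to just those two events could look like the sketch below. The function name filter_iw_events is made up, and the match assumes the "unknown event NN" wording that iw prints for these events in the log excerpts quoted in this thread; if your iw version names the events differently, the patterns need adjusting.

```shell
# Hypothetical filter: pass through only "unknown event 64" and
# "unknown event 84" lines from iw event output read on stdin.
# Reading line by line avoids waiting for a full output buffer.
filter_iw_events() {
    while IFS= read -r line; do
        case "$line" in
            *"unknown event 64"|*"unknown event 84")
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

It would be used as `iw event -t -f | filter_iw_events >> /tmp/iwevent.log`.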

@sammo, I had already set up a cron job, but with a full scan. Thanks to your finding, I will replace it with your quicker scan.

Shortened a few commands:

iw dev $(iw dev|grep "Interface\|channel\|type"|grep -B 2 'channel.*24..'|grep -m 1 -B 1 'AP'|awk '/Interface/ {print $2}') scan trigger freq 2447 flush >/dev/null 2>&1

Nice, but would it not be less CPU (given this is running every 30 to 60 seconds) to pre-compute the command once and just use the values that work for your deployment?

So like this:

# 5Ghz – ch 161

iw dev wlan1 scan trigger freq 5805 flush 

# 2.4GHz -  ch 8

iw dev wlan0 scan trigger freq 2447 flush

BTW- this works fine on MT7621 / 7603 as well as the C7 (I have both). So this workaround seems to point the finger at something common to several (all?) platforms.

Here is a NetSpot signal scan for the past 30 minutes. It shows that even in a crowded WiFi environment, which causes the OpenWrt APs to lower their 2.4 GHz radio signals, the signal is maintained at a high level with a scan every 60 seconds. The top trace (brown) is the 2.4 GHz radio; the next one down (blue) is the 5 GHz radio. The AP is 3' from my MacBook Pro running NetSpot.

I think the common problem reported is 2.4Ghz which uses the ath9k driver whereas 5Ghz uses ath10k.

I scan only one channel because when I do a full scan I notice a blip on my iperf. I think the scan 'blocks' data being sent/received.

Nice graph!!

OK, I see what you mean: find the wlanXXXXX once. Yes, of course, unless someone did a uci set xxxx.
It would be great if someone fixed the underlying code rather than relying on a workaround...

Yeah, do NOT do that regularly (at least not on the MT7621) as it drops the link for a second.

Maybe only do it when there is some other clear signal that there is an issue, and this might clear it up.

No drops on the 2.4 GHz radio when running these scans.

[edit: Note that the wlan1 shown here is a 5Ghz on the MT76, on the C7 the 2.4 is wlan1, I keep mixing up which is which (since I run both), but this example and in my prior post were from a test MT76, so use the right wlan ID. The C7 2.4 seems to do well with the probe every 60 seconds.]

It would be great to monitor when the problem happens. You can monitor wifi events via

iw event -f -t > /tmp/event.txt

FYI: none of these workarounds works, but they could be useful for other purposes/vulnerabilities.

disable ipv6
option disassoc_low_ack '0'
option wpa_group_rekey '43200'
option wpa_strict_rekey '1'
option wpa_disable_eapol_key_retries 1
option txantenna 1
option rxantenna 1

Thanks to option ifname 'wlan0' in config wifi-iface (/etc/config/wireless), one can assign a fixed wlan name, so there is no need to recalculate it every n seconds.

I'm also seeing the following dmesg errors after the wifi locks up. Short iw scan does indeed solve the issue temporarily. Can we use this as a trigger for a short scan instead of scanning every 30 seconds?

[760751.300380] ath: phy1: Unable to reset channel, reset status -5
[761414.248012] ath: phy1: Unable to reset channel, reset status -5
[761582.062141] ath: phy1: Unable to reset channel, reset status -5
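
An untested sketch of what such a dmesg-based trigger could look like. The function names, the 30-second poll interval, and the interface/frequency arguments are all assumptions for illustration. It counts the "Unable to reset channel" lines and scans only when the count differs from the previous poll.

```shell
# Count "Unable to reset channel" lines in kernel log text read from stdin.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_reset_errors() {
    grep -c 'Unable to reset channel, reset status' || true
}

# Hypothetical watchdog loop: poll dmesg every 30 s and trigger a short
# one-channel scan only when new reset errors have appeared.
# $1 = interface (assumption, e.g. wlan1), $2 = frequency in MHz (e.g. 2447)
watch_reset_errors() {
    iface="$1"
    freq="$2"
    last=0
    while sleep 30; do
        now="$(dmesg | count_reset_errors)"
        if [ "$now" -gt "$last" ]; then
            iw dev "$iface" scan trigger freq "$freq" flush >/dev/null 2>&1
        fi
        # Also track decreases, which happen when the ring buffer wraps.
        last="$now"
    done
}
```

Started from /etc/rc.local it would be something like `watch_reset_errors wlan1 2447 &`. Two caveats: the first poll scans once if old errors are already in the log, and this only helps on devices that actually log these lines when the radio locks up.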

@peca89 Do you also have a TP-Link Archer C7? I'll check my dmesg, and if it shows those log lines I will adjust my watchdog script.

I've checked my 5 production APs (they're all Archer C7s running non-ct ath10k drivers, in case that matters for the 2.4 GHz ath9k driver problems). None of them, after about one week of runtime and a lot of load in our business environment, shows those lines in "dmesg | grep ath".

ath: phy1: Unable to reset channel, reset status -5

Instead, every device has a log output similar to this (captured from AP 1 after 7 days of uptime).

[   15.682419] ath10k_pci 0000:00:00.0: enabling device (0000 -> 0002)
[   15.688963] ath10k_pci 0000:00:00.0: pci irq legacy oper_irq_mode 1 irq_mode 0 reset_mode 0
[   16.956848] ath10k_pci 0000:00:00.0: qca988x hw2.0 target 0x4100016c chip_id 0x043202ff sub 0000:0000
[   16.966260] ath10k_pci 0000:00:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 0
[   16.979337] ath10k_pci 0000:00:00.0: firmware ver 10.2.4-1.0-00047 api 5 features no-p2p,raw-mode,mfp,allows-mesh-bcast crc32 35bd9258
[   17.549366] ath10k_pci 0000:00:00.0: board_file api 1 bmi_id N/A crc32 bebc7c08
[   18.737064] ath10k_pci 0000:00:00.0: htt-ver 2.1 wmi-op 5 htt-op 2 cal file max-sta 128 raw 0 hwcrypto 1
[   18.849005] ath: EEPROM regdomain sanitized
[   18.849014] ath: EEPROM regdomain: 0x64
[   18.849018] ath: EEPROM indicates we should expect a direct regpair map
[   18.849036] ath: Country alpha2 being used: 00
[   18.849039] ath: Regpair used: 0x64
[   18.978956] ath: EEPROM regdomain sanitized
[   18.978965] ath: EEPROM regdomain: 0x64
[   18.978968] ath: EEPROM indicates we should expect a direct regpair map
[   18.978987] ath: Country alpha2 being used: 00
[   18.978990] ath: Regpair used: 0x64
[   44.444997] ath: EEPROM regdomain: 0x8114
[   44.449108] ath: EEPROM indicates we should expect a country code
[   44.455297] ath: doing EEPROM country->regdmn map search
[   44.460688] ath: country maps to regdmn code: 0x37
[   44.465546] ath: Country alpha2 being used: DE
[   44.470051] ath: Regpair used: 0x37
[   44.473582] ath: regdomain 0x8114 dynamically updated by user
[   44.479487] ath: EEPROM regdomain: 0x8114
[   44.483563] ath: EEPROM indicates we should expect a country code
[   44.489746] ath: doing EEPROM country->regdmn map search
[   44.495131] ath: country maps to regdmn code: 0x37
[   44.499988] ath: Country alpha2 being used: DE
[   44.504493] ath: Regpair used: 0x37
[   44.508032] ath: regdomain 0x8114 dynamically updated by user
[   53.212353] ath10k_pci 0000:00:00.0: pdev param 0 not supported by firmware

@sammo It's just a gut feeling, but event ID 84 (probe_client) does not seem to indicate a problem with the 2.4 GHz radio. I'm constantly monitoring the iw event log and walked around with my phone. When I walked out of range of AP 1, it reported at the same moment:

1621928540.803986: wlan0-3 (phy #0): unknown event 84
1621928540.805934: wlan0-3: del station 48:2c:a0:xx:xx:xx

Taking another 2.4 ghz device, the same could be reproduced.

1621928678.485789: wlan1-1: del station 0e:9a:62:xx:xx:xx
1621928678.486927: wlan1-1 (phy #1): unknown event 84

So I'll keep watching for "iw event 64" and the "ath: phy1: Unable to reset channel, reset status -5" log lines. So far neither has shown up.

I think event 84 is a probe_client followed by a del station, so it's too late to trigger on it: by then the clients have already disconnected.
Event 64 could simply be a by-product of me hammering the radio with iperf at the time.

We need to trigger on 'something' before any clients get disconnected. That is my preferred solution if the underlying code cannot be fixed.

I don't see the reset channel message on my device, a GL ar300M.

Halfway house: trigger on client connect/disconnect?
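
A rough sketch of that halfway house, untested on a locked-up radio. The function name station_event_ifaces is made up, and the parsing assumes the timestamped line format quoted earlier in this thread ("<time>: <iface>: del station <mac>", as produced with -t); without timestamps the interface name would be extracted wrongly.

```shell
# Hypothetical parser: for every "new station" / "del station" line on stdin,
# print the name of the interface the event happened on.
station_event_ifaces() {
    while IFS= read -r line; do
        case "$line" in
            *": new station "*|*": del station "*)
                rest="${line#*: }"            # drop the leading timestamp
                printf '%s\n' "${rest%%:*}"   # keep text up to the next colon
                ;;
        esac
    done
}
```

It could be wired up as `iw event -t | station_event_ifaces | while read -r ifc; do iw dev "$ifc" scan trigger freq 2447 flush >/dev/null 2>&1; done`, where 2447 is again only an example frequency and must match the band of the interface that reported the event. As noted above, a del station event may already be too late for the clients that dropped.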

@sammo I rarely get "unknown event 128" here, but I cannot prove that we had problems with the 2.4 GHz wifi at that time. Where can I look up what those codes mean?