I haven't used the workaround yet as I first wanted to see what iw event shows in case of "wifi crash". Currently wifi is rock solid so still waiting for a normal spontaneous crash scenario to come.
@zen1932, thanks for confirming that it works on the C7.
You can always set up a cron job to run the scan every minute.
A periodic scan should prevent the radio from locking up. Scanning a single channel is quicker and blocks the wifi for less time; we are talking milliseconds.
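For anyone who wants to try the periodic single-channel scan described above, here is a minimal sketch. The interface name `wlan1` and the frequency `2412` (channel 1) are assumptions, so substitute the values for your own 2.4 GHz radio. `short_scan` is a hypothetical helper name; setting `DRY_RUN` makes it print the `iw` command instead of running it, so you can check it before it touches the radio.

```shell
#!/bin/sh
# Sketch: scan a single frequency on one interface.
# The defaults (wlan1, 2412 MHz) are assumptions, not values confirmed
# for any particular device.

# short_scan [IFACE] [FREQ_MHZ]
short_scan() {
    iface="${1:-wlan1}"
    freq="${2:-2412}"
    if [ -n "${DRY_RUN:-}" ]; then
        # Only show what would be executed.
        echo "iw dev $iface scan freq $freq"
    else
        # The scan results are discarded; only the side effect matters.
        iw dev "$iface" scan freq "$freq" >/dev/null 2>&1
    fi
}
```

If this were saved as a script (say, a hypothetical `/usr/bin/short-scan.sh` ending with a `short_scan wlan1 2412` call), a crontab entry such as `* * * * * /usr/bin/short-scan.sh` would run it once a minute. Cron cannot go below one minute, so a 30-second interval would need a loop with `sleep` instead.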
@Catfriend1, I am monitoring only events 84 and 64, since the full output is rather verbose. But since the 2.4 GHz radio freezes at random, I may have to wait a long time before getting results.
@sammo, I had already set up a cron job, but with a full scan. Thanks to your finding, I will now replace it with your quicker scan.
Nice, but would it not be less CPU (given this is running every 30 to 60 seconds) to pre-compute the command once and just use the values that work for your deployment?
BTW- this works fine on MT7621 / 7603 as well as the C7 (I have both). So this workaround seems to point the finger at something common to several (all?) platforms.
Here is a NetSpot signal scan for the past 30 minutes. It shows that even in a crowded WiFi environment, which causes the OpenWRT APs to lower their 2.4 GHz radio signals, the signal is maintained at a high level with a scan every 60 seconds. The top trace (brown) is the 2.4 GHz radio; the next one down (blue) is the 5 GHz radio. The AP is 3' from my MacBook Pro running NetSpot.
I think the common problem reported is on 2.4 GHz, which uses the ath9k driver, whereas 5 GHz uses ath10k.
I scanned a single channel because when I do a full scan I notice a blip in my iperf results. I think the scan 'blocks' data being sent/received.
Nice graph!!
OK, I see what you mean: find the wlanXXXXX. Yes, of course, unless someone did a uci set xxxx.
It would be great if someone fixed the underlying code rather than relying on a workaround.
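On finding the wlanXXXXX once and reusing the values: a rough sketch of how the 2.4 GHz interface and its frequency could be pulled out of `iw dev` output. It assumes the usual `Interface <name>` and `channel <n> (<freq> MHz), ...` lines; the exact layout may differ between `iw` versions, and `find_24ghz` is a hypothetical helper name.

```shell
#!/bin/sh
# find_24ghz: read `iw dev` output on stdin and print "IFACE FREQ" for every
# interface whose current channel is in the 2.4 GHz band (2400-2500 MHz).
find_24ghz() {
    awk '
        $1 == "Interface" { iface = $2 }
        $1 == "channel" {
            # Field 3 looks like "(2412"; drop the "(" and force a number.
            freq = substr($3, 2) + 0
            if (freq >= 2400 && freq <= 2500) print iface, freq
        }
    '
}
```

Running `iw dev | find_24ghz` once and hard-coding the result in the cron command avoids repeating the lookup on every run. Several virtual APs on the same radio each produce their own line, so pick one of them.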
Yeah, do NOT do that regularly (at least not on the MT7621) as it drops the link for a second.
Maybe only do it when there is some other clear signal that there is an issue, and this might clear it up.
No drops on the 2.4 GHz radio when running these scans.
[edit: Note that the wlan1 shown here is the 5 GHz radio on the MT76; on the C7 the 2.4 GHz radio is wlan1. I keep mixing up which is which (since I run both), but this example and the one in my prior post were from a test MT76, so use the right wlan ID. The C7 2.4 GHz radio seems to do well with the probe every 60 seconds.]
I'm also seeing the following dmesg errors after the wifi locks up. A short iw scan does indeed solve the issue temporarily. Can we use this as a trigger for a short scan instead of scanning every 30 seconds?
[760751.300380] ath: phy1: Unable to reset channel, reset status -5
[761414.248012] ath: phy1: Unable to reset channel, reset status -5
[761582.062141] ath: phy1: Unable to reset channel, reset status -5
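A possible way to use these lines as a trigger, as a sketch only: count the "Unable to reset channel" lines in `dmesg` and scan only when the count has grown since the last check. The helper names, the state file path and the `wlan1` / `2412` values in the comment are all assumptions. Whether the error appears early enough to be a useful trigger is not established here.

```shell
#!/bin/sh
# count_reset_errors: read kernel log text on stdin and print the number of
# ath "Unable to reset channel" lines it contains.
count_reset_errors() {
    # grep -c prints 0 but exits non-zero when nothing matches.
    grep -c 'ath: phy[0-9]*: Unable to reset channel' || true
}

# needs_scan STATE_FILE CURRENT_COUNT
# Prints "scan" if CURRENT_COUNT is higher than the count saved in STATE_FILE,
# otherwise "ok". The new count is always saved for the next run.
needs_scan() {
    state_file="$1"
    current="$2"
    previous=0
    [ -f "$state_file" ] && previous="$(cat "$state_file")"
    echo "$current" > "$state_file"
    if [ "$current" -gt "$previous" ]; then
        echo scan
    else
        echo ok
    fi
}

# Example wiring (interface, frequency and path are assumptions):
#   n="$(dmesg | count_reset_errors)"
#   [ "$(needs_scan /tmp/ath_reset.count "$n")" = scan ] &&
#       iw dev wlan1 scan freq 2412 >/dev/null 2>&1
```

Because the kernel log is a ring buffer, old lines eventually drop out and the count can go down; in that case the sketch simply stores the lower count and waits for the next increase.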
@peca89 Do you also have a TP-Link Archer C7? I'll check my dmesg, and if it reproduces those log lines I will adjust my watchdog script.
I've checked my 5 production APs (they're all Archer C7's running non-ct ath10k drivers - if that does matter in regards to 2.4 GHz ath9k driver problems). None of them after ~ one week of runtime and a lot of load in our business environment shows those lines in "dmesg | grep ath".
ath: phy1: Unable to reset channel, reset status -5
Instead, every device has a log output similar to this (captured from AP 1 after 7 days of uptime).
[ 15.682419] ath10k_pci 0000:00:00.0: enabling device (0000 -> 0002)
[ 15.688963] ath10k_pci 0000:00:00.0: pci irq legacy oper_irq_mode 1 irq_mode 0 reset_mode 0
[ 16.956848] ath10k_pci 0000:00:00.0: qca988x hw2.0 target 0x4100016c chip_id 0x043202ff sub 0000:0000
[ 16.966260] ath10k_pci 0000:00:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 0
[ 16.979337] ath10k_pci 0000:00:00.0: firmware ver 10.2.4-1.0-00047 api 5 features no-p2p,raw-mode,mfp,allows-mesh-bcast crc32 35bd9258
[ 17.549366] ath10k_pci 0000:00:00.0: board_file api 1 bmi_id N/A crc32 bebc7c08
[ 18.737064] ath10k_pci 0000:00:00.0: htt-ver 2.1 wmi-op 5 htt-op 2 cal file max-sta 128 raw 0 hwcrypto 1
[ 18.849005] ath: EEPROM regdomain sanitized
[ 18.849014] ath: EEPROM regdomain: 0x64
[ 18.849018] ath: EEPROM indicates we should expect a direct regpair map
[ 18.849036] ath: Country alpha2 being used: 00
[ 18.849039] ath: Regpair used: 0x64
[ 18.978956] ath: EEPROM regdomain sanitized
[ 18.978965] ath: EEPROM regdomain: 0x64
[ 18.978968] ath: EEPROM indicates we should expect a direct regpair map
[ 18.978987] ath: Country alpha2 being used: 00
[ 18.978990] ath: Regpair used: 0x64
[ 44.444997] ath: EEPROM regdomain: 0x8114
[ 44.449108] ath: EEPROM indicates we should expect a country code
[ 44.455297] ath: doing EEPROM country->regdmn map search
[ 44.460688] ath: country maps to regdmn code: 0x37
[ 44.465546] ath: Country alpha2 being used: DE
[ 44.470051] ath: Regpair used: 0x37
[ 44.473582] ath: regdomain 0x8114 dynamically updated by user
[ 44.479487] ath: EEPROM regdomain: 0x8114
[ 44.483563] ath: EEPROM indicates we should expect a country code
[ 44.489746] ath: doing EEPROM country->regdmn map search
[ 44.495131] ath: country maps to regdmn code: 0x37
[ 44.499988] ath: Country alpha2 being used: DE
[ 44.504493] ath: Regpair used: 0x37
[ 44.508032] ath: regdomain 0x8114 dynamically updated by user
[ 53.212353] ath10k_pci 0000:00:00.0: pdev param 0 not supported by firmware
@sammo It's just a gut feeling, but event ID 84 (probe_client) does not seem to indicate a problem with the 2.4 GHz radio. I'm constantly monitoring the iw event log and walked around with my phone. When I walked out of range of AP 1, it reported at the same moment:
1621928540.803986: wlan0-3 (phy #0): unknown event 84
1621928540.805934: wlan0-3: del station 48:2c:a0:xx:xx:xx
With another 2.4 GHz device, the same could be reproduced.
1621928678.485789: wlan1-1: del station 0e:9a:62:xx:xx:xx
1621928678.486927: wlan1-1 (phy #1): unknown event 84
So I'll keep watching for "iw event 64" and the "ath: phy1: Unable to reset channel, reset status -5" log lines. So far, neither has appeared.
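For watching only specific event numbers without the rest of the verbose output, something along these lines might do. It is a sketch that assumes the `unknown event N` line format shown in the captures above; `filter_events` is a hypothetical helper name.

```shell
#!/bin/sh
# filter_events "64 84": read `iw event -t` lines on stdin and print only the
# "unknown event N" lines whose N is in the space-separated watch list.
filter_events() {
    awk -v watch="$1" '
        BEGIN {
            n = split(watch, ids, " ")
            for (i = 1; i <= n; i++) want[ids[i]] = 1
        }
        # The event number is the last field on the line.
        /unknown event [0-9]+$/ {
            if ($NF in want) print
        }
    '
}
```

A typical use would be `iw event -t | filter_events "64 84" >> /tmp/iw-events.log` (the log path is just an example). Output written to a file through a pipe may be buffered by awk, so lines can show up with some delay.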
I think event 84 is a probe_client followed by a del station; it's too late to trigger by then, as the clients have already disconnected.
Event 64 could be a by-product of my hammering the AP with iperf at the time.
We need to trigger on 'something' before any clients get disconnected; that is my preferred solution if the underlying code cannot be fixed.
I don't see the reset channel error on my device, a GL AR300M.
Halfway house: trigger on client connect/disconnect?
@sammo I rarely get "unknown event 128" here, but I cannot prove that we had problems with the 2.4 GHz wifi at those times. Where can I look up what those codes mean?