Archer C7 2.4 GHz wireless dies in 24~48 hours

Just a thought: if you have 30 clients connected and they all raise the event at the same time, you will spawn 30 scans, which could overwhelm the CPU. The script needs to prevent multiple scans being spawned at once.

I tested this: when a scan has been triggered but has not completed by the time the next scan command fires, iw throws a message like "device busy - code XX".

What would be smarter: if the last scan was less than 10 seconds ago, sleep 10 seconds and then trigger a new scan. This would lower the total scan count and guarantee that every iw event 60 is followed by a scan. I'm not sure how to cancel the pending buffer of iw event 60 lines in a script once a scan has just launched; without that, the described mechanism would probably queue up scans faster than the 10 s plus scan delay passes.

if mkdir /tmp/scan.txt; then
       iw scan......
       (sleep 30 && rm -rf /tmp/scan.txt)&
fi

Quick and nasty


My bad, I just moved the goalposts and replaced scan with mkdir, which is still an external command call. Each event has an epoch time in the first field, so the check can be done with shell built-ins:

dt=0
iw event -f -t| while read line; do

if [ ${line%%.*} -gt $dt ] ; then
      scan
      dt=$(($(date +%s)+10))
fi

done
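
For anyone who wants to try the throttle logic without a router at hand, here is a minimal sketch of the same timestamp check wrapped in a function. The names should_scan, COOLDOWN and next_allowed are made up for illustration. Unlike the loop above, it derives the cooldown from the event timestamp instead of calling date, so it can be fed recorded lines.

```shell
#!/bin/sh
# Sketch only: should_scan, COOLDOWN and next_allowed are illustrative names,
# not part of the watchdog script posted in this thread.
COOLDOWN=10        # seconds to wait after a scan before allowing the next one
next_allowed=0     # epoch second after which the next scan may fire

should_scan() {
    # $1 is one line of `iw event -t` output, for example:
    # "1622199174.822862: wlan1-2 (phy #1): unknown event 60"
    ts=${1%%.*}                      # whole seconds before the first dot
    case $ts in
        ''|*[!0-9]*) return 1 ;;     # no numeric timestamp, ignore the line
    esac
    if [ "$ts" -gt "$next_allowed" ]; then
        next_allowed=$((ts + COOLDOWN))
        return 0
    fi
    return 1
}

# Offline demo: two events in the same second, then one 16 seconds later.
for l in \
    "1622199174.822862: wlan1-2 (phy #1): unknown event 60" \
    "1622199174.823196: wlan1-2 (phy #1): unknown event 60" \
    "1622199190.000001: wlan1-2 (phy #1): unknown event 60"
do
    if should_scan "$l"; then echo "scan"; else echo "skip"; fi
done
# prints: scan, skip, scan (one per line)
```

Only shell built-ins are used for the check, so no extra process is spawned per event line.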


Now creating a watchdog shell script.

  • /etc/rc.local
# Put your custom commands here that should be executed once
# the system init finished. By default this file does nothing.

/bin/bash /root/ath9k-watchdog.sh start &

exit 0
  • /root/ath9k-watchdog.sh
#!/bin/bash
#
# Purpose:
# 	Watch ath9k 2.4 GHz radio event 60 and if the event occurs, trigger a WiFi scan as a workaround to avoid issues
# 	slowly creeping up like higher WiFi latency or unexpected WiFi client disconnects.
#
# Installation:
#	bash "/root/ath9k-watchdog.sh" install
#
# Command line:
# 	bash "/root/ath9k-watchdog.sh" livelog
# 	bash "/root/ath9k-watchdog.sh" start &
# 	bash "/root/ath9k-watchdog.sh" stop
#
# For testing purposes only:
# 	bash "/root/ath9k-watchdog.sh" start
#
# Prerequisites:
#	BASH
#
# Script Configuration.
PATH=/usr/bin:/usr/sbin:/sbin:/bin
SCRIPT_FULLFN="$(basename -- "${0}")"
SCRIPT_NAME="${SCRIPT_FULLFN%.*}"
LOGFILE="/tmp/${SCRIPT_NAME}.log"
LOG_MAX_LINES="1000"
#
#
# -----------------------------------------------------
# -------------- START OF FUNCTION BLOCK --------------
# -----------------------------------------------------
logAdd ()
{
	TMP_DATETIME="$(date '+%Y-%m-%d [%H-%M-%S]')"
	TMP_LOGSTREAM="$(tail -n ${LOG_MAX_LINES} ${LOGFILE} 2>/dev/null)"
	echo "${TMP_LOGSTREAM}" > "$LOGFILE"
	echo "${TMP_DATETIME} $*" | tee -a "${LOGFILE}"
	return
}


serviceMain ()
{
	#
	# Usage:		serviceMain
	# Called By:	MAIN
	#
	logAdd "[INFO] === SERVICE START ==="
	#
	logAdd "[INFO] Waiting to discover 2.4 GHz radio interface ..."
	while (true); do
		RADIO_ATH9K="$(iw dev|grep "Interface\|channel\|type"|grep -B 2 'channel.*24..'|grep -B 1 'AP'|tail -n 2|grep Interface|awk '{print $2}')"
		if [ ! -z "${RADIO_ATH9K}" ]; then
			break
		fi
		sleep 10
	done
	#
	logAdd "[INFO] Setup iw_event_scan_trigger on interface [${RADIO_ATH9K}]"
	#
	dt=0
	iw event -t -f | while read line; do
		if $(echo -n "${line}" | grep -q "${RADIO_ATH9K}.*: unknown event 60"); then
			#
			# Check if the last scan was more than 10 seconds ago.
			if [ ${line%%.*} -gt ${dt} ] ; then
				echo "$(date +%Y-%m-%d_%H-%M-%S): ${dt} ${RADIO_ATH9K} scan ..."
				iw dev ${RADIO_ATH9K} scan trigger freq 2447 flush >/dev/null 2>&1
				dt=$(($(date +%s)+10))
			fi
		fi
	done
	#
	return 0
}
# ---------------------------------------------------
# -------------- END OF FUNCTION BLOCK --------------
# ---------------------------------------------------
#
# Check shell
if [ ! -n "${BASH_VERSION}" ]; then
	logAdd "[ERROR] Wrong shell environment, please run with bash."
	exit 99
fi
#
trap "" SIGHUP
trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT
#
if [ "${1}" = "install" ]; then
	if ( grep -q "$(which bash) $(readlink -f "${0}") start &" "/etc/rc.local"); then
		echo "[INFO] Script already present in startup."
		exit 0
	fi
	sed -i "\~^exit 0~i $(which bash) $(readlink -f "${0}") start &\n" "/etc/rc.local"
	echo "[INFO] Script successfully added to startup."
	exit 0
elif [ "${1}" = "livelog" ]; then
	tail -f "${LOGFILE}"
	exit 0
elif [ "${1}" = "start" ]; then
	serviceMain &
	#
	# Wait for kill -INT.
	wait
	exit 0
elif [ "${1}" = "stop" ]; then
	ps w | grep -v grep | grep "$(basename -- $(which bash)) .*$(basename -- ${0}) start" | sed 's/ \+/|/g' | sed 's/^|//' | cut -d '|' -f 1 | grep -v "^$$" | while read pidhandle; do
		echo "[INFO] Terminating old service instance [${pidhandle}] ..."
		kill -INT "${pidhandle}" 2>/dev/null
		kill "${pidhandle}" 2>/dev/null
	done
	#
	# Check if parts of the service are still running.
	if [ "$(ps w | grep -v grep | grep "$(basename -- $(which bash)) .*$(basename -- ${0}) start" | sed 's/ \+/|/g' | sed 's/^|//' | cut -d '|' -f 1 | grep -v "^$$" | wc -l)" -gt 0 ]; then
		logAdd "[ERROR] === SERVICE FAILED TO STOP ==="
		ps w | grep "iw event\|${SCRIPT_NAME}" | grep -v grep
		exit 99
	fi
	#
	killall iw 2>/dev/null
	#
	logAdd "[INFO] === SERVICE STOPPED ==="
	exit 0
fi
#
logAdd "[ERROR] Parameter #1 missing."
logAdd "[INFO] Usage: bash ${SCRIPT_FULLFN} {install|livelog|start|stop}"
exit 99

To see whether the watchdog triggers the scans appropriately, run "iw event -t -f" in a second SSH session in parallel. If the watchdog script triggered the scan, the "unknown event 60" lines should be followed by "unknown event 33/34" lines.

Log should look like this:

1622199174.822862: wlan1-2 (phy #1): unknown event 60
1622199174.823196: wlan1-2 (phy #1): unknown event 60
1622199174.866459: wlan1 (phy #1): unknown event 33
1622199174.941817: wlan1 (phy #1): unknown event 34
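
A quick way to check a captured log without eyeballing it is to count the event types. This is only a sketch: count_events is a hypothetical helper name, and it assumes the log was saved from "iw event -t -f" in the format shown above.

```shell
#!/bin/sh
# Sketch only: count_events is an illustrative helper, not part of the
# watchdog script. It reads `iw event -t -f` output on stdin.
count_events() {
    awk '
        /: unknown event 60[[:space:]]*$/ { e60++ }
        /: unknown event 33[[:space:]]*$/ { e33++ }
        /: unknown event 34[[:space:]]*$/ { e34++ }
        END { printf "event60=%d event33=%d event34=%d\n", e60, e33, e34 }
    '
}

# Demo with the four log lines from the post above.
count_events <<'EOF'
1622199174.822862: wlan1-2 (phy #1): unknown event 60
1622199174.823196: wlan1-2 (phy #1): unknown event 60
1622199174.866459: wlan1 (phy #1): unknown event 33
1622199174.941817: wlan1 (phy #1): unknown event 34
EOF
# prints: event60=2 event33=1 event34=1
```

If event60 keeps growing while event33 stays at 0, the watchdog is probably not triggering scans. Fewer event 33 lines than event 60 lines is expected, because of the 10 second throttle.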

@pgn-1111 Thanks for contacting me to update my original ath10k-ct-watchdog script and to put both into my OpenWrt "useful scripts" repo for the TP-Link Archer C7v2|5.

@sammo Perfect, I've integrated your snippet into the script and it works well. It now throttles scans if the last scan was less than 10 seconds ago.

We've now agreed to put the recent version of the script here:


I would swap these two lines around. echo and grep are external command calls that use up resources, so doing the built-in timestamp check first means they only run when a scan is actually allowed.

	if $(echo -n "${line}" | grep -q "${RADIO_ATH9K}.*: unknown event 60"); then
			#
			# Check if the last scan was more than 10 seconds ago.
			if [ ${line%%.*} -gt ${dt} ] ; then

I found a Telegram script in your GitHub. What is that?

It's a little "library" that is easy to integrate into your own bash scripts to send/edit Telegram notifications via a Telegram bot to your phone.

I use it for different things, lately to notify myself of unexpected router reboots.


Can you tell me how to make it work? For example, to check the connection, or to have the bot send a notification when a client disconnects. Thanks.


You just need to fill in the API key of your Telegram bot and the chatId from Telegram in https://github.com/Catfriend1/openwrt-presence/blob/master/scripts/telegram/lib_telegram_cfg_bot.sh .

You are then ready to execute the test/demo script: https://github.com/Catfriend1/openwrt-presence/blob/master/scripts/telegram/test_lib_telegram_messenger.sh

Hi, I'm trying to make ath9k-watchdog.sh work on my openwrt-sfe-flowoffload-ath79 image from https://github.com/gwlim/openwrt-sfe-flowoffload-ath79/tree/master/JUL-2020/openwrt-sfe-flowoffload-normal/mips74k/TP-Link%20Archer%20C7v2-mips74k-ath10k
For that I'd have to install bash, if I understood everything correctly. Unfortunately I cannot install bash the normal way ("/usr/sbin/opkg-key: line 22: usign: not found" when updating package lists), I assume because of this custom image. I can't find a compatible ipk for a manual install. Can you help me out?

I've built a Telegram notifier that pushes me a message whenever one of my Archers writes the syslog line that indicates an unexpected reboot (procd -- init complete --). Over the past days I've discovered something interesting:

  • all Archers ran fine and stable for 9 days without sudden reboots
  • one unit (configured very similarly to the others) often has sudden reboots; after 8 days it rebooted and my script reported this immediately (from the syslog server to Telegram).
  • the "problematic" unit rebooted as soon as I turned on my old TV (in the same room, a short 2.4 GHz WiFi distance away) and used it to view video from a miniDLNA server.

So, I wonder if those "sudden reboots" have something to do with multicast/broadcast announcements of UPnP?!

btw /sys/kernel/debug/crashlog was empty. ( Source: Crashlog retrieval (MIPS) - #3 by hailfinger )

I've now hit a different log line for a PC associated with the AP:

[934478.437766] Rekeying PTK for STA ac:7b:a1:xx:yy:zz but driver can't safely do that.

What does that mean? Are any negative consequences to be expected after it occurs?

Hi there,

I put the workaround in my rc.local:
iw dev wlan_2g scan trigger freq 2437 flush >/dev/null 2>&1

It kind of works: my WiFi connection doesn't stop anymore, and YouTube, downloads and browsing work fine. But when I'm using a VPN or other specific apps, they crash or disconnect every time, probably because of the iw scan.

Has anyone experienced that?

No I don't experience this problem. How is your wan connected? Ethernet or sta/ap?


I'm connected via Ethernet on FIBER.


I think I have identified the cause of the ath9k slowness. I have a GL AR300m running OpenWrt 21.02, which only has a single 2.4 GHz radio. I run iperf, and in AP mode the slowness shows up in under 3 hours. It shows up faster in repeater mode, where the repeater loses beacons and the WiFi stack reconnects.

I think the problem is the txq buffer size.

Check which physical radio is your 2.4 GHz one; it's either phy0 or phy1.

To determine your 2.4 GHz phy:
iwinfo|grep -m 1 -A 10 '.*2\.4'|grep -Eo 'phy\d+$'

Check the memory limit and double it. Mine was originally set to 4194304.

iw phy phy0 get txq

iw phy phy0 set txq memory_limit 8388608
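
If you would rather not copy the number by hand, the doubling can be sketched like this. double_txq_memory_limit is a hypothetical helper name. It only parses a line of the form "Memory limit: <N> bytes" from the "iw phy <PHY> get txq" output, so treat it as an assumption that your iw prints that format.

```shell
#!/bin/sh
# Sketch only: double_txq_memory_limit is an illustrative helper name.
# It reads `iw phy <PHY> get txq` output on stdin and prints twice the
# value found on the "Memory limit:" line.
double_txq_memory_limit() {
    awk -F: '/^Memory limit:/ {
        split($2, f, " ")     # f[1] is the number, f[2] is "bytes"
        print f[1] * 2
        exit
    }'
}

# Offline demo with a value quoted in this thread.
printf 'Packet limit: 8192 pkts\nMemory limit: 4194304 bytes\n' |
    double_txq_memory_limit
# prints: 8388608

# On the router this could be combined as (not executed here):
#   iw phy phy0 set txq memory_limit "$(iw phy phy0 get txq | double_txq_memory_limit)"
```

Note that running the combined command repeatedly doubles the limit each time, so it is better suited to a one-off test than to a startup script.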

Don't forget to disable the scan workaround if you have it running.

I can confirm that the setting below, on a GL AR300 with 128M memory, has resolved my issues.

root@repeater:~# iw phy phy0 get txq
Packet limit: 8192 pkts
Memory limit: 8388608 bytes
Quantum: 300 bytes
Number of queues: 4096
Backlog: 0 pkts
Memory usage: 0 bytes
Packet limit overflows: 0
Memory limit overflows: 0
Hash collisions: 3116

The above setting was tested for AP/STA. When testing as a dumb AP, it dies a lot quicker, so we know txq is part of the equation. I've also doubled the packet limit to 16384 and am currently testing:

iw phy phy0 set txq limit 16384


OpenWrt 21.02.0-rc.3 on TP-Link Archer C7v2 here:

iw phy phy0 get txq

Packet limit:           8192 pkts
Memory limit:           16777216 bytes
Quantum:                300 bytes
Number of queues:       4096
Backlog:                0 pkts
Memory usage:           0 bytes
Packet limit overflows: 0
Memory limit overflows: 0
Hash collisions:        0

iw phy phy1 get txq

Packet limit:           8192 pkts
Memory limit:           4194304 bytes
Quantum:                300 bytes
Number of queues:       4096
Backlog:                0 pkts
Memory usage:           0 bytes
Packet limit overflows: 0
Memory limit overflows: 0
Hash collisions:        4546

@sammo Which value do you suggest optimizing here to stabilize the Archer C7v2 ath9k 2.4 GHz WiFi without using the ath9k-watchdog.sh script? Ahhh, silly me: phy1 is the 2.4 GHz radio, and yes, I'll give the 8 M memory limit a try now :-). Thank you.

ToDo for me:

iw phy phy1 set txq memory_limit 8388608

and I'll test if it is enough to set it once after startup or if it must be regularly refreshed e.g. when the adapter restarts.

To see how long the change sticks:

while(true);do clear; iw phy phy1 get txq; sleep 1; done;

UPDATE: To make the setting stick, put it in /etc/rc.local; at the top, for example, is fine.

# Put your custom commands here that should be executed once
# the system init finished. By default this file does nothing.

/usr/sbin/iw phy phy1 set txq memory_limit 8388608

exit 0

I can confirm these two txq settings make a difference to the wireless dying.
I don't know the optimum values or understand how the algorithm works, but these two settings make a big difference. At least this pinpoints where the problem is, so a developer can investigate further.

iw phy <PHY> set txq memory_limit 8388608
iw phy <PHY> set txq limit 16384
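
For rc.local, the two commands could be wrapped in a small helper so the phy name is only written once. This is a sketch under assumptions: apply_txq_limits and the IW override variable are made-up names for illustration, and the defaults are simply the two values quoted above, not known-optimal ones.

```shell
#!/bin/sh
# Sketch only: apply_txq_limits and IW are illustrative names.
# Defaults are the values reported in this thread.
apply_txq_limits() {
    phy=$1                 # e.g. phy0 or phy1, whichever is your 2.4 GHz radio
    mem=${2:-8388608}      # txq memory limit in bytes
    pkts=${3:-16384}       # txq packet limit
    # IW can be set to "echo" for a dry run that only prints the arguments.
    ${IW:-iw} phy "$phy" set txq memory_limit "$mem" &&
    ${IW:-iw} phy "$phy" set txq limit "$pkts"
}

# Dry run in a subshell so the IW override does not leak out.
( IW=echo; apply_txq_limits phy1 )
# prints:
# phy phy1 set txq memory_limit 8388608
# phy phy1 set txq limit 16384
```

On the router, calling "apply_txq_limits phy1" without the IW override would run the real iw commands.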