19.07.xx: Strange kernel jffs2 bug

Hi,

I'm currently writing a script that upgrades OpenWrt devices offline from a Debian remote management server. The scenario is as follows:

MANUAL ONE-TIME PREPARATION:

  • I've put a line in /etc/rc.local that triggers "/bin/sh /root/post-upgrade.sh &" on every boot. post-upgrade.sh looks for a marker that OpenWrt removes on sysupgrade (/etc/_OPKG_INSTALL_DONE). If the marker is missing, it runs /root/opkg_reinstall.sh once (which can reinstall packages offline from IPK files uploaded to /root/packages/ before the sysupgrade started). After a successful opkg reinstallation, it creates the marker (/etc/_OPKG_INSTALL_DONE).
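
The marker logic described above can be sketched roughly as follows. This is only a minimal illustration, not the actual /root/post-upgrade.sh (which is not shown in this thread). The function name run_post_upgrade and the overridable MARKER and REINSTALL variables are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch of /root/post-upgrade.sh; the real script is not shown in this post.
# MARKER and REINSTALL default to the paths named above and can be overridden for testing.
MARKER="${MARKER:-/etc/_OPKG_INSTALL_DONE}"
REINSTALL="${REINSTALL:-/root/opkg_reinstall.sh}"

run_post_upgrade () {
	# Marker present: packages were already reinstalled after the last sysupgrade.
	if [ -f "${MARKER}" ]; then
		echo "[INFO] ${MARKER} exists, nothing to do."
		return 0
	fi
	# Marker missing: run the reinstall script once and record success.
	if /bin/sh "${REINSTALL}"; then
		touch "${MARKER}"
		return 0
	fi
	echo "[ERROR] ${REINSTALL} failed, marker not created."
	return 1
}
```

Calling run_post_upgrade at the end of such a script would perform the check on every boot. Because the marker is only created after a successful run, a failed reinstallation would be retried on the next boot.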

SCRIPT RUN:

  • Management server downloads openwrt_release and board.json via ssh from the router we'd like to upgrade with a new (official stable) firmware image
  • It reaches out to downloads.openwrt.org to fetch the specified firmware version and the opkg package lists, which consist of multiple "Packages.gz" files.
  • It extracts the package lists and looks through them for mission-critical packages that are required for a mesh AP to come online again via wireless (batman-adv, wpad-mesh-openssl, kmod-ath10k, non-ct firmware) because the stations I need to upgrade don't have LAN access.
  • It downloads the mission-critical IPK files
  • It uploads the new firmware image to /tmp/sysupgrade.bin of the AP and then uploads the IPK package files to /root/packages/ of the AP.
  • It remotely triggers "sysupgrade -k -v /tmp/sysupgrade.bin" to commence the firmware upgrade.
  • The router itself then does a reboot after flash.
  • The router automatically fires /root/post-upgrade.sh (and subsequently /root/opkg_reinstall.sh) after a 30-second sleep.
  • I'm logging what happens during opkg_reinstall.sh to /root/opkg_reinstall.log so I can later check that everything went fine.
  • The scripts trigger "reboot" after they complete, because batman-adv, kmod-ath10k, wpad-mesh-openssl and the ath10k-firmware need a reboot to bring up the wifi and get the device online via mesh again.

MY PROBLEM:

  • If I initiate the sysupgrade via the script on the management server, sysupgrade is carried out successfully and the OpenWrt router comes back to life shortly after. It has preserved ALL settings and its IP address, so after the flash (first) reboot everything is fine.
  • The scripts complete and reboot a second time. According to the logs, every IPK was installed successfully. After the second reboot, my /etc/config has been reverted to stock defaults and I don't know why that (reproducibly) happens. The device is then reachable at 192.168.1.1/24 instead of the IP I expect it to have. The contents of /etc/config/ are defaults now; other files directly in /etc/ are still there (no revert occurred) and /root/ is fully there (no revert occurred). I can also see my logs in /root/.

WHAT I TRIED TO REMEDIATE THE PROBLEM:

  1. I've tried removing the "reboot" command at the end of /root/post-upgrade.sh. When I log on via SSH, all IPKs have been installed successfully and there is only a single error indication in dmesg:

[after 74s kernel boot time] jffs2: Newly-erased block contained word xxxxxxx at offset xxxxxxx

I think that's the reason for the "/etc/config revert on next reboot" phenomenon. I've checked: it does not matter whether the script issues "reboot" or I do it manually from SSH. On the next reboot, my /etc/config is gone and dmesg says "jffs2 whiteout" because of errors detected.

  2. I've verified the flash is fine; I don't think this is a flash failure. I can manually put /etc/config back from my documentation, power cycle or reboot the router, and it will start operating correctly with all settings in place.

  3. I've verified my overlay doesn't run out of space. I've got around 9 MByte overlay free. With my 11 IPK files (temporarily) stored in /root/packages, I'm at around 47% space used. So no trouble is expected from this.

  4. I've removed the "/bin/sh /root/post-upgrade.sh &" line from "/etc/rc.local" and started all over, letting the sysupgrade flash go through (again triggered by the management server). Then I logged on via SSH and manually issued "/bin/sh /root/post-upgrade.sh" from the command line. SURPRISINGLY, this always works (tried multiple times). The jffs2 doesn't "corrupt", a second reboot follows and the router has all settings (preserved and effective). Normal operation.

These findings led me to report the issue here. Please let me know if I should file the bug elsewhere if it fits better there. I suspect it could be a kernel/jffs2 bug.

The behaviour changes when I edit my "opkg_reinstall.sh" file and remove the parts "opkg install|remove .... 2>&1 | tee -a /root/opkg_reinstall.log". If I remove them, I can also run the script from /etc/rc.local and the corruption does NOT occur.

Observing dmesg/logread -f while opkg_reinstall.sh runs, I see kmodloader lines loading the new kernel modules (e.g. BATMAN-ADV).

Could it be that the kernel doesn't allow writing to the overlay (for the log) for just a few microseconds? It's really reproducible. Adding the "tee -a" log lines and running from "/etc/rc.local" kills my partition, even if I wait for minutes (sleep command) before the second reboot. If I remove the tee log to /root/... or run the script manually from SSH, no corruption occurs.

Thanks for your support on this.

Kind regards,
Catfriend1

For reference, here is a list of the files I'm using, viewed on the management server:

  • The management server runs this script to prepare offline sysupgrade and then upgrade remotely

Script name: /root/openwrt/remote_update.sh
Last updated: 2020-10-29
Verified working: 2020-10-29 on TP-Link Archer C7 v5 snapshot

#!/bin/sh
#
# Command line
# 	sh "/root/openwrt/remote_update.sh" "IP" "19.07.4"
# 	sh "/root/openwrt/remote_update.sh" "IP" "19.07.4" dry
# 	ssh -oStrictHostKeyChecking=no -i "/etc/ssh/ssh_host_rsa_key" "root@IP" "/bin/sh /root/opkg_reinstall.sh; reboot"
#
# Script configuration.
DRY_RUN=0
REMOTE_PACKAGE_CACHE_PATH="/root/packages"
SSH_PRIVATE_KEY="/etc/ssh/ssh_host_rsa_key"
UPLOAD_PACKAGE_CACHE=0
#
#########################
# FUNCTIONS BLOCK START	#
#########################
fetchPackageList () {
	# Syntax:
	#	fetchPackageList "[WEBSERVER_DIRECTORY]"
	#
	# Global vars.
	# 	[IN] FP_PACKAGES
	#
	# Called By:
	# 	fetchPackageLists
	#
	# Variables.
	TMP_FPL_URL="${1}"
	#
	echo "[INFO] Downloading ${TMP_FPL_URL}/Packages.gz"
	TMP_FPL_RESULT="$(wget -q -O - -c "${TMP_FPL_URL}/Packages.gz" | gzip -d 2> /dev/null)"
	if [ -z "${TMP_FPL_RESULT}" ]; then
		echo "[ERROR] fetchPackageList: Failed to download and unpack ${TMP_FPL_URL##*/}/Packages.gz"
		return 1
	fi
	echo "${TMP_FPL_RESULT}" | egrep "^Package:|^Filename:" | sed -E "s^\^Filename:\s+^URL:"${TMP_FPL_URL}"/^" >> "${FP_PACKAGES}"
	return 0
}

fetchPackageLists () {
	# Syntax:
	# 	fetchPackageLists
	#
	# Global vars.
	# 	[IN] ARCH
	# 	[IN] FP_PACKAGES
	# 	[IN] TARGET
	# 	[IN] UPDATE_TO_VERSION
	#
	# Called By:
	# 	fetchPackages
	#
	# Check prerequisites.
	if [ -z "${FP_PACKAGES}" ]; then
		return 1
	fi
	#
	echo "[INFO] fetchPackageLists"
	#
	# Get dir listing and extract kmods/[KMOD_DIR_NAME], e.g. "4.14.195-1-b84a5a29b1d5ae1dc33ccf9ba292ca1d"
	# This information is also available from "/etc/opkg/distfeeds.conf" if we would unpack the new firmware image. 
	KMOD_DIR_NAME="$(wget -q -O - "http://downloads.openwrt.org/releases/${UPDATE_TO_VERSION}/targets/${TARGET}/generic/kmods" | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d' | grep -v "^/" | sed -e "s/\/$//")"
	if [ -z "${KMOD_DIR_NAME}" ]; then
		echo "[ERROR] fetchPackageLists: Failed to determine kmods dir name on server."
		return 1
	fi
	#
	rm -f "${FP_PACKAGES}"
	PACKAGE_LISTS_REL_PATHS="targets/${TARGET}/generic/packages targets/${TARGET}/generic/kmods/${KMOD_DIR_NAME} packages/${ARCH}/base packages/${ARCH}/luci packages/${ARCH}/packages packages/${ARCH}/routing packages/${ARCH}/telephony"
	for packagelistrelpath in ${PACKAGE_LISTS_REL_PATHS}; do
		fetchPackageList "http://downloads.openwrt.org/releases/${UPDATE_TO_VERSION}/${packagelistrelpath}"
		if [ ! "$?" = 0 ]; then
			return 1
		fi
	done
	#
	return 0
}

fetchPackage () {
	# Syntax:
	# 	fetchPackage "[PACKAGE_NAME]"
	#
	# Global vars.
	#	[IN] DEVICE_IP
	# 	[IN] DOWNLOAD_DIR
	# 	[IN] FP_PACKAGES
	#
	# Called By:
	# 	fetchPackages
	#
	# Variables.
	TMP_FP_PACKAGE_NAME="${1}"
	#
	# Check prerequisites.
	if [ -z "${FP_PACKAGES}" ] || [ -z "${TMP_FP_PACKAGE_NAME}" ] ; then
		return 1
	fi
	#
	TMP_FP_URL="$(cat "${FP_PACKAGES}" 2> /dev/null | grep "^URL:" | grep "/"${TMP_FP_PACKAGE_NAME}"_.*\.ipk$" | sed -e "s/^URL://" | head -n 1)"
	if [ -z "${TMP_FP_URL}" ]; then
		echo "[ERROR] fetchPackage: Package [${TMP_FP_PACKAGE_NAME}] not found."
		return 1
	fi
	#
	echo "[INFO] Downloading ${TMP_FP_URL}"
	wget -q -O "${DOWNLOAD_DIR}/${TMP_FP_URL##*/}" "${TMP_FP_URL}"
	if [ ! "$?" = "0" ]; then
		echo "[ERROR] fetchPackage: Failed to download ${TMP_FP_URL##*/}"
		return 1
	fi
	#
	return 0
}

fetchPackages () {
	# Syntax:
	# 	fetchPackages
	#
	# Global vars.
	# 	[IN] ARCH
	# 	[IN] DOWNLOAD_DIR
	# 	[IN] TARGET
	# 	[IN] UPDATE_TO_VERSION
	#
	# Called By:
	# 	MAIN
	#
	# Variables.
	FP_PACKAGES="${DOWNLOAD_DIR}/packages.txt"
	#
	echo "[INFO] fetchPackages"
	fetchPackageLists
	if [ ! "$?" = 0 ]; then
		return 1
	fi
	#
	echo "[INFO] fetchPackages: Downloading IPK files for offline update"
	PACKAGE_NAMES="ath10k-firmware-qca988x batctl-full kmod-ath10k kmod-batman-adv kmod-crypto-crc32c kmod-crypto-hash kmod-lib-crc16 kmod-lib-crc32c libopenssl1.1 librt wpad-mesh-openssl"
	for package in ${PACKAGE_NAMES}; do
		fetchPackage "${package}"
		if [ ! "$?" = 0 ]; then
			return 1
		fi
	done
	#
	return 0
}

scpDownloadFile () {
	# Syntax:
	# 	scpDownloadFile "[DEVICE_IP]" "[SSH_PRIVATE_KEY]" "[REMOTE_FILE]" "[LOCAL_FILE]"
	#
	# Called By:
	# 	MAIN
	#
	# Variables.
	TMP_SDF_DEVICE_IP="${1}"
	TMP_SDF_SSH_PRIVATE_KEY="${2}"
	TMP_SDF_REMOTE_FILE="${3}"
	TMP_SDF_LOCAL_FILE="${4}"
	#
	# Create local parent directory.
	mkdir -p "${TMP_SDF_LOCAL_FILE%/*}"
	#
	# Download file.
	scp -oStrictHostKeyChecking=no -i "${TMP_SDF_SSH_PRIVATE_KEY}" "root@${TMP_SDF_DEVICE_IP}:${TMP_SDF_REMOTE_FILE}" "${TMP_SDF_LOCAL_FILE}"
	#
	return $?
}

scpUploadFile () {
	# Syntax:
	# 	scpUploadFile "[DEVICE_IP]" "[SSH_PRIVATE_KEY]" "[LOCAL_FILE]" "[REMOTE_FILE]"
	#
	# Called By:
	# 	MAIN
	#
	# Variables.
	TMP_SDF_DEVICE_IP="${1}"
	TMP_SDF_SSH_PRIVATE_KEY="${2}"
	TMP_SDF_LOCAL_FILE="${3}"
	TMP_SDF_REMOTE_FILE="${4}"
	#
	# Upload file.
	scp -oStrictHostKeyChecking=no -i "${TMP_SDF_SSH_PRIVATE_KEY}" "${TMP_SDF_LOCAL_FILE}" "root@${TMP_SDF_DEVICE_IP}:${TMP_SDF_REMOTE_FILE}"
	#
	return $?
}
#########################
# FUNCTIONS BLOCK END	#
#########################
#
# Verify SSH prerequisites.
if [ ! -f "${SSH_PRIVATE_KEY}" ]; then
	echo "[ERROR] SSH_PRIVATE_KEY=[${SSH_PRIVATE_KEY}] is missing."
	exit 99
fi
chmod 0600 "${SSH_PRIVATE_KEY}"
#
# Get command line params.
DEVICE_IP="${1}"
UPDATE_TO_VERSION="${2}"
if ( echo "${*}" | grep -q "dry" ); then
	DRY_RUN="1"
fi
if [ "${DRY_RUN}" = "1" ]; then
	echo "[INFO] DRY_RUN set."
fi
#
# Verify command line params.
if [ -z "${DEVICE_IP}" ]; then
	echo "[ERROR] Param #1 DEVICE_IP missing."
	exit 99
fi
## echo "[INFO] DEVICE_IP=[${DEVICE_IP}]"
#
if [ -z "${UPDATE_TO_VERSION}" ]; then
	echo "[ERROR] Param #2 UPDATE_TO_VERSION missing."
	exit 99
fi
## echo "[INFO] UPDATE_TO_VERSION=[${UPDATE_TO_VERSION}]"
#
# Firmware download dir
DOWNLOAD_DIR="/tmp/device_${DEVICE_IP}_${UPDATE_TO_VERSION}"
#
echo "[INFO] Getting device info from [${DEVICE_IP}] ..."
BOARD_JSON="${DOWNLOAD_DIR}/board.json"
scpDownloadFile "${DEVICE_IP}" "${SSH_PRIVATE_KEY}" "/etc/board.json" "${BOARD_JSON}"
#
OPENWRT_RELEASE="${DOWNLOAD_DIR}/openwrt_release"
scpDownloadFile "${DEVICE_IP}" "${SSH_PRIVATE_KEY}" "/etc/openwrt_release" "${OPENWRT_RELEASE}"
#
# Generate firmware download URL.
#
## Example: MANUFACTURER="tplink"
MANUFACTURER="$(cat "${BOARD_JSON}" | grep "\"id\"" | cut -d '"' -f 4 | cut -d "," -f 1)"
echo "[INFO] MANUFACTURER=[${MANUFACTURER}]"
#
## Example: MODEL="archer-c7-v5"
MODEL="$(cat "${BOARD_JSON}" | grep "\"id\"" | cut -d '"' -f 4 | cut -d "," -f 2)"
echo "[INFO] MODEL=[${MODEL}]"
#
## Example: ARCH="mips_24kc"
ARCH="$(cat "${OPENWRT_RELEASE}" | grep "^DISTRIB_ARCH=" | cut -d "'" -f 2 | cut -d "/" -f 1)"
echo "[INFO] ARCH=[${ARCH}]"
#
## Example: TARGET="ath79"
TARGET="$(cat "${OPENWRT_RELEASE}" | grep "^DISTRIB_TARGET=" | cut -d "'" -f 2 | cut -d "/" -f 1)"
echo "[INFO] TARGET=[${TARGET}]"
#
## Example: http://downloads.openwrt.org/releases/19.07.3/targets/ath79/generic/openwrt-19.07.3-ath79-generic-tplink_archer-c7-v5-squashfs-sysupgrade.bin
FIRMWARE_DOWNLOAD_URL="http://downloads.openwrt.org/releases/${UPDATE_TO_VERSION}/targets/${TARGET}/generic/openwrt-${UPDATE_TO_VERSION}-${TARGET}-generic-${MANUFACTURER}_${MODEL}-squashfs-sysupgrade.bin"
SYSUPGRADE_IMAGE="${DOWNLOAD_DIR}/openwrt-${UPDATE_TO_VERSION}-${TARGET}-generic-${MANUFACTURER}_${MODEL}-squashfs-sysupgrade.bin"
#
SHASUMS_DOWNLOAD_URL="http://downloads.openwrt.org/releases/${UPDATE_TO_VERSION}/targets/${TARGET}/generic/sha256sums"
SHASUMS="${DOWNLOAD_DIR}/sha256sums"
if [ ! -f "${SHASUMS}" ]; then
	echo "[INFO] Downloading [${SHASUMS_DOWNLOAD_URL}] to [${SHASUMS}]..."
	wget -qq -O "${SHASUMS}" "${SHASUMS_DOWNLOAD_URL}"
fi
#
echo "[INFO] Checking for existing firmware image ..."
CHECKSUM_RESULT="$(cd "${SYSUPGRADE_IMAGE%/*}"; cat "${SHASUMS}" | grep "${SYSUPGRADE_IMAGE##*/}" | sha256sum -c 2> /dev/null)"
if ( ! echo "${CHECKSUM_RESULT}" | grep -q "OK" ); then
	if [ -f "${SYSUPGRADE_IMAGE}" ]; then
		echo "[WARN] Firmware image FAILED checksum test."
	fi
	echo "[INFO] Downloading firmware image from [${FIRMWARE_DOWNLOAD_URL}] to [${SYSUPGRADE_IMAGE}] ..."
	wget -qq -O "${SYSUPGRADE_IMAGE}" "${FIRMWARE_DOWNLOAD_URL}"
	CHECKSUM_RESULT="$(cd "${SYSUPGRADE_IMAGE%/*}"; cat "${SHASUMS}" | grep "${SYSUPGRADE_IMAGE##*/}" | sha256sum -c 2> /dev/null)"
fi
#
if ( ! echo "${CHECKSUM_RESULT}" | grep -q "OK" ); then
	echo "[ERROR] Firmware image FAILED checksum test."
	exit 99
fi
echo "[INFO] Firmware image PASSED checksum test."
#
rm -f ${DOWNLOAD_DIR}/*.ipk
fetchPackages
if [ ! "$?" = 0 ]; then
	echo "[ERROR] fetchPackages FAILED."
	exit 99
fi
#
if [ "${DRY_RUN}" = "0" ] && [ "${UPLOAD_PACKAGE_CACHE}" = "1" ]; then
	ssh -oStrictHostKeyChecking=no -i "${SSH_PRIVATE_KEY}" "root@${DEVICE_IP}" "rm -rf ${REMOTE_PACKAGE_CACHE_PATH}; mkdir -p ${REMOTE_PACKAGE_CACHE_PATH}"
	for ipk in ${DOWNLOAD_DIR}/*.ipk; do
		echo "[INFO] Uploading ${ipk##*/}"
		scpUploadFile "${DEVICE_IP}" "${SSH_PRIVATE_KEY}" "${ipk}" "${REMOTE_PACKAGE_CACHE_PATH}/${ipk##*/}"
		if [ ! "$?" = "0" ]; then
			echo "[ERROR] Upload FAILED."
			exit 99
		fi
	done
else
	echo "[INFO] DRY_RUN flag set or UPLOAD_PACKAGE_CACHE disabled. Skipping package upload."
fi
#
if [ "${DRY_RUN}" = "0" ]; then
	echo "[INFO] Uploading firmware image to device ..."
	scpUploadFile "${DEVICE_IP}" "${SSH_PRIVATE_KEY}" "${SYSUPGRADE_IMAGE}" "/tmp/firmware.bin"
	if [ ! "$?" = "0" ]; then
		echo "[ERROR] Upload FAILED."
		exit 99
	fi
else
	echo "[INFO] DRY_RUN flag set. Skipping image upload."
fi
#
if [ "${DRY_RUN}" = "0" ]; then
	echo "[INFO] Flashing image ..."
	ssh -oStrictHostKeyChecking=no -i "${SSH_PRIVATE_KEY}" "root@${DEVICE_IP}" "sync; echo 3 > /proc/sys/vm/drop_caches; sysupgrade -v /tmp/firmware.bin"
else
	echo "[INFO] DRY_RUN flag set. Skipping flash."
fi 
#
echo "[INFO] Done."
#
exit 0

The OpenWrt device (TP-Link Archer C7 v2 / v5) runs the following:


/etc/rc.local

# Put your custom commands here that should be executed once
# the system init finished. By default this file does nothing.

( /bin/sleep 30; /bin/sh /root/opkg_reinstall.sh ) &

exit 0

/etc/sysupgrade.conf

/root

Script name: /root/opkg_reinstall.sh
Last update: 2020-10-29
Verified working: 2020-10-29 on TP-Link Archer c7 v2/v5

#!/bin/sh
trap "" SIGHUP
#
#
# Command line
## sh /root/opkg_reinstall.sh force
#
# Notes
## opkg remove ath10k-firmware-qca988x kmod-ath10k kmod-batman-adv wpad-mesh-openssl; opkg remove batctl-full;
## opkg install ath10k-firmware-qca988x-ct kmod-ath10k-ct
#
# Consts
PATH=/usr/bin:/usr/sbin:/sbin:/bin
#
# Note: If we log to "/root/" during "insmod", "opkg", "rmmod" commands affecting kernel modules, the kernel will panic and the device will reboot.
LOGFILE="/root/opkg_reinstall.log"
LOG_MAX_LINES="1000"
#
LOG_COLLECTOR_HOSTNAME="HST-WifiAP-01"
INSTALL_BATMANADV="1"
INSTALL_OPENVPN="1"
INSTALL_RELAYD="0"
INSTALL_WIFI_NON_CT_DRIVERS="1"
PACKAGE_CACHE="/root/packages"
#
# Runtime vars
INSTALL_ONLINE="1"
REBOOT_REQUIRED="0"
#
#
# -----------------------------------------------------
# -------------- START OF FUNCTION BLOCK --------------
# -----------------------------------------------------
logAdd ()
{
	TMP_DATETIME="$(date '+%Y-%m-%d [%H-%M-%S]')"
	TMP_LOGSTREAM="$(tail -n ${LOG_MAX_LINES} ${LOGFILE} 2>/dev/null)"
	echo "${TMP_LOGSTREAM}" > "$LOGFILE"
	echo "${TMP_DATETIME} $*" | tee -a "${LOGFILE}"
	return
}


opkgInstall () {
	# Syntax:
	# 	opkgInstall "[PACKAGE_NAME]"
	#
	# Online mode.
	if [ "${INSTALL_ONLINE}" = "1" ]; then
		RESULT="$(eval "opkg install ${@}" 2>&1)"
		logAdd "[INFO] opkgInstall: - ${RESULT}"
		if ( echo "${RESULT}" | grep -q "Collected errors:"); then
			return 1
		fi
		return 0
	fi
	#
	# Offline mode
	IPKG_FULLFN=""
	for package_name in $@; do
		if ( ! ls -1 ${PACKAGE_CACHE} | grep -q "^${package_name}_" ); then
			logAdd "[ERROR] opkgInstall: Package missing in cache - [${package_name}]"
			continue
		fi
		IPKG_FULLFN="${IPKG_FULLFN} ${PACKAGE_CACHE}/$(ls -1 ${PACKAGE_CACHE} | grep "^${package_name}_")"
	done
	if [ -z "${IPKG_FULLFN}" ]; then
		return 0
	fi
	RESULT="$(eval "opkg --cache ${PACKAGE_CACHE} install ${IPKG_FULLFN}" 2>&1)"
	if ( echo "${*}" | grep -q "kmod-" ); then
		REBOOT_REQUIRED="1"
		# logAdd "[INFO] opkgInstall: Kernel module added. Will sleep a bit to avoid crash."
		sleep 10
	fi
	logAdd "[INFO] opkgInstall: - ${RESULT}"
	if ( echo "${RESULT}" | grep -q "Collected errors:"); then
		return 1
	fi
	return 0
}


opkgRemove () {
	# Syntax:
	# 	opkgRemove "[PACKAGE_NAME]"
	#
	for package_name in $@; do
		logAdd "[INFO] opkgRemove: Removing ${package_name}"
		RESULT="$(eval "opkg remove ${package_name}" 2>&1)"
		if ( echo "${package_name}" | grep -q "^kmod-" ); then
			REBOOT_REQUIRED="1"
			# logAdd "[INFO] opkgRemove: Kernel module removed. Will sleep a bit to avoid crash."
			sleep 10
		fi
		logAdd "[INFO] opkgRemove: - ${RESULT}"
		if ( echo "${RESULT}" | grep -q "Collected errors:"); then
			return 1
		fi
	done
	return 0
}


runInstall () {
	# Syntax:
	#	runInstall
	#
	# Global vars
	# 	[IN] INSTALL_ONLINE
	#
	# Variables
	PKG_TO_INSTALL=""
	PKG_TO_REMOVE=""
	#
	# If we are offline, check if we have a package cache available.
	if [ "${INSTALL_ONLINE}" = "0" ] && [ ! -d "${PACKAGE_CACHE}" ]; then
		logAdd "[ERROR] runInstall: We are offline but don't have a package cache available at ${PACKAGE_CACHE}"
		return 1
	fi
	#
	if [ "${INSTALL_ONLINE}" = "1" ]; then
		logAdd "[INFO] Downloading package information ..."
		RESULT="$(opkg update | grep "http://" 2>&1)"
		logAdd "[INFO] opkg update: - ${RESULT}"
	fi
	#
	# Add packages to the install and remove queue.
	PKG_TO_REMOVE="${PKG_TO_REMOVE} wpad wpad-basic wpad-basic-wolfssl"
	PKG_TO_INSTALL="${PKG_TO_INSTALL} libopenssl1.1 wpad-mesh-openssl"
	#
	# Replace ct with non-ct drivers
	if [ "${INSTALL_WIFI_NON_CT_DRIVERS}" = "1" ]; then
		# Remove order is important
		PKG_TO_REMOVE="${PKG_TO_REMOVE} kmod-ath10k-ct ath10k-firmware-qca988x-ct"
		PKG_TO_INSTALL="${PKG_TO_INSTALL} ath10k-firmware-qca988x kmod-ath10k"
	fi
	#
	# batman-adv
	if [ "${INSTALL_BATMANADV}" = "1" ]; then
		PKG_TO_INSTALL="${PKG_TO_INSTALL} batctl-full kmod-batman-adv kmod-crypto-crc32c kmod-lib-crc16 kmod-lib-crc32c kmod-crypto-hash librt"
	fi
	#
	# Only install these packages if we are online
	if [ "${INSTALL_ONLINE}" = "1" ]; then
		#
		# Base system
		PKG_TO_INSTALL="${PKG_TO_INSTALL} bash curl htop lua luafilesystem mailsend nano terminfo tcpdump wget"
		#
		# FTP service
		PKG_TO_INSTALL="${PKG_TO_INSTALL} vsftpd"
		#
		# Relayd
		if [ "${INSTALL_RELAYD}" = "1" ]; then
			PKG_TO_INSTALL="${PKG_TO_INSTALL} luci-proto-relay relayd"
		fi
		#
		# OpenVPN
		if [ "${INSTALL_OPENVPN}" = "1" ]; then
			PKG_TO_INSTALL="${PKG_TO_INSTALL} luci-app-openvpn openvpn-easy-rsa openvpn-openssl"
		fi
		#
		# Syslog-ng, logd
		if ( echo "${HOSTNAME}" | grep -q "^${LOG_COLLECTOR_HOSTNAME}" ); then
			PKG_TO_REMOVE="${PKG_TO_REMOVE} logd"
			PKG_TO_INSTALL="${PKG_TO_INSTALL} syslog-ng"
		else
			PKG_TO_REMOVE="${PKG_TO_REMOVE} syslog-ng"
			PKG_TO_INSTALL="${PKG_TO_INSTALL} logd"
		fi
		#
		# USB storage drivers
		PKG_TO_INSTALL="${PKG_TO_INSTALL} block-mount e2fsprogs kmod-fs-ext4 kmod-fs-msdos kmod-scsi-core kmod-usb-storage libncurses libpcre"
	fi
	#
	logAdd "[INFO] runInstall: Remove packages"
	opkgRemove "${PKG_TO_REMOVE}"
	if [ ! "$?" = "0" ]; then
		# "$?" now holds the status of the test above, so return an explicit error code.
		return 1
	fi
	#
	logAdd "[INFO] runInstall: Install packages"
	opkgInstall "${PKG_TO_INSTALL}"
	if [ ! "$?" = "0" ]; then
		# "$?" now holds the status of the test above, so return an explicit error code.
		return 1
	fi
	#
	return 0
}


waitForInternetConnection () {
	# Syntax:
	# 	waitForInternetConnection
	#
	# Assume we can wait for internet connection if hostapd started fine.
	# If it did not start, we maybe miss internet connectivity via mesh.
	#
	if [ -z "$(iw dev)" ]; then
		logAdd "[INFO] WiFi is not initialized"
		return 1
	fi
	#
	# Give WiFi mesh interface time to connect and get internet connectivity.
	logAdd "[INFO] Waiting for WiFi to initialize"
	UPTIME_IN_SECONDS="$(cat /proc/uptime | cut -d "." -f 1)"
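	SECONDS_TO_SLEEP="$((120-${UPTIME_IN_SECONDS}))"
	# Skip the sleep if the device has already been up for 120 seconds or more.
	if [ "${SECONDS_TO_SLEEP}" -gt 0 ]; then
		sleep "${SECONDS_TO_SLEEP}"
	fi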
	#
	return 0
}
# ---------------------------------------------------
# -------------- END OF FUNCTION BLOCK --------------
# ---------------------------------------------------
#
#
# Check command line.
if ( echo "${*}" | grep -q "force" ); then
	runInstall
	exit 0
fi
#
# Check if the script should run on boot.
if [ -f "/etc/_OPKG_INSTALL_COMPLETE" ]; then
	echo "[INFO] /etc/_OPKG_INSTALL_COMPLETE exists."
	exit 99
fi
#
waitForInternetConnection
# 
logAdd "[INFO] Checking internet connection ..."
if ( ! echo -e "GET / HTTP/1.1\r\nHost: downloads.openwrt.org\r\n" | nc downloads.openwrt.org 80 > /dev/null ); then
	logAdd "[INFO] No internet connection. Switching to offline mode."
	INSTALL_ONLINE="0"
fi
#
runInstall
if [ ! "$?" = "0" ]; then
	logAdd "[ERROR] One or more packages FAILED to install"
else
	logAdd "[INFO] All packages installed successfully"
	touch "/etc/_OPKG_INSTALL_COMPLETE"
fi
# 
logAdd "[INFO] Cleanup"
if [ ! -z "${PACKAGE_CACHE}" ]; then
	rm -rf "${PACKAGE_CACHE}"
fi
#
if [ "${REBOOT_REQUIRED}" = "1" ]; then
	logAdd "[INFO] Rebooting device"
	reboot -d 3
	exit 0
fi
#
logAdd "[INFO] Done."
exit 0

see above......

The "Newly-erased block contained word ..." message is likely a hardware issue: the block was erased but then found not to be blank. Testing on different hardware should be considered.

Another possibility is the jffs is being overfilled, which causes a lot of strange things to happen. Don't let the jffs get overfilled.


@mk24 It's 47% filled at most. Really, I thought the same as you at first, but I have 4 to 5 MBytes free there. I can also reproduce it on another Archer C7 v2. One device is months old, the other was bought back in 2016.

I thought again about why it works when executing the script via SSH and why the autorun way via /etc/rc.local triggers that kernel error. With " | tee -a ...log" in place after opkg, SSH may slow the commands down a little because they wait for the terminal output pipe; a session over SSH is not as fast as local execution without a monitored terminal. I had a similar issue on a third TP-Link Archer C7 last year when I wrote a watchdog script that monitors ath10k_pci failures via logread and then removes and re-adds the kmod without a reboot to get WiFi working again. The watchdog still does its job well today, but in my first attempt I also had " > /root/mylogfile.log" behind the "rmmod" and "insmod" commands, and I suffered kernel crashes then.

It all looks like the kernel is unable to serve the jffs2 filesystem for a few microseconds, and writing to it in exactly that time frame (because of the log pipe to /root/...) causes the above failure. I'm sorry I cannot do any kernel debugging; I can only observe closely and tell you there might be a bug.

So to sum up: Avoid piping "rmmod", "insmod", or "opkg" output to a log placed on /root/... and the jffs2 "corruption" does not take place. I can read and write heavily on the whole partition if I don't do these things.
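
One way to follow that advice while still keeping a log could look like the sketch below: log lines go to /tmp (RAM-backed tmpfs on OpenWrt) while opkg, rmmod or insmod run, and are appended to the flash-backed log in one step afterwards. This is an assumption-laden illustration, not a confirmed fix. The names log_ram, flush_log_to_flash, TMP_LOG and FLASH_LOG are made up for the example, and later posts in this thread suggest the jffs2 message can also appear without any scripts involved.

```shell
#!/bin/sh
# Sketch: buffer log lines in RAM (/tmp is tmpfs on OpenWrt) and write them to flash once.
# TMP_LOG and FLASH_LOG are illustrative names, not taken from the scripts above.
TMP_LOG="${TMP_LOG:-/tmp/opkg_reinstall.log}"
FLASH_LOG="${FLASH_LOG:-/root/opkg_reinstall.log}"

log_ram () {
	# Append a timestamped line to the RAM-backed log only; no flash write happens here.
	echo "$(date '+%Y-%m-%d [%H-%M-%S]') $*" >> "${TMP_LOG}"
}

flush_log_to_flash () {
	# Single append to the overlay after all kernel module operations have finished.
	[ -f "${TMP_LOG}" ] || return 0
	cat "${TMP_LOG}" >> "${FLASH_LOG}" && rm -f "${TMP_LOG}" && sync
}
```

In a script like opkg_reinstall.sh, log_ram would replace the direct writes to /root/ during package operations, and flush_log_to_flash would be called once just before the final reboot.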


After doing more tests, I think this is a multicausal problem. I reverted /etc/rc.local so that it does NOT contain anything; the router flashes and comes up with the settings preserved, but when I look at dmesg (in this case tested on 19.07.2) I always find the message

jffs2: Newly-erased block contained word 0x19852003 at offset 0x00000000

I bet this is NOT hardware failure. This is also exactly the same message (0x19852003) as in:

To nail this down to the root cause and report back to you, I need to know exactly which OpenWrt stable 19.07.xx release contains the fix mentioned in the other thread ( ref: https://github.com/gl-inet/openwrt/commit/4b1f073f843dc4be655a868f6a6e31f74baa727c ), if it has been released yet. @mk24 Can you please shed some light?


Did you notice that that code is not even from an OpenWrt repo?
It is from gl-inet's repo of their derivative firmware for their devices based on an old kernel version...
(Both the ar71xx target and also the kernel 4.9 have already been deprecated in OpenWrt)

@hnyman I'm just a user, so I just deploy and upgrade devices when needed. I'm not familiar with what a commit or repo means. Can you offer a solution? It seems to be exactly the problem I'm suffering from.

  • I've tested on 19.07.2 upgrading to 19.07.2 -> jffs error occurs.
  • I've tested on 19.07.2 upgrading to 19.07.3 -> jffs error occurs.
  • I've tested on 19.07.3 upgrading to 19.07.3 -> jffs error occurs.
  • I've tested on 19.07.3 upgrading to 19.07.4 -> jffs error occurs.
  • I've tested on 19.07.4 upgrading to 19.07.4 -> jffs error occurs.

So it seems the SPI write flash issue is indeed a software issue. Testing around, I assume that during a sysupgrade the old kernel still boots one time, flashes the new image and then directly fires up everything without rebooting. So that might be why I'm experiencing the config loss on next reboot (no matter if my scripts do it or if I do it manually). If this is true, that explains why going from an unaffected firmware to another one works without the config loss, e.g. 19.07.4 to 19.07.4. That's why I asked for confirmation if OpenWrt did solve this issue in the meantime. Can you answer it? If OpenWrt didn't do anything about it, maybe it will return some day and all of us should be aware of this.

Just another theory: It may be that no one experiences the issue if only a "small" amount of overlay data is persisted during sysupgrade. I've got 9 MB overlay capacity and use only 3 MB of it to persist, so that is definitely well within limits and is expected to work. If not, OpenWrt (or the kernel) has a bug and it should be investigated and fixed.

This is one of the drawbacks of MIPS memory-mapping. If you use memory-mapping to read data smaller than 32 bytes from flash, then erase the flash, and then use memory-mapping to read data smaller than 32 bytes from the same location, the data will be wrong.
As far as I know, the latest OpenWrt does not fix this problem.


@luochongjun Can I avoid this issue somehow? I'm asking because occasionally logging important things to /root/ from my 24/7 running bash scripts didn't cause any "config loss on reboot" issues over the years. What's also unclear to me is why only /etc/config/* is lost while the rest, for example /etc/passwd, /root/* and so on, is still there. Before the second reboot after flash, everything, including /etc/config/, is correctly in place. What could cause OpenWrt to just drop a valid /etc/config? If I connect to 192.168.1.1 after the config loss, re-upload the config from documentation and reboot, the config runs well (so there are no errors in it) and is persisted across future reboots.

I think I'm having the same problem as described in "[ SOLVED ] Getting error failed to sync jffs2 overlay", where jeff posted that it's "bad timing handling the erasure mark on the tplink archer c7v5". What is different in my case is that the only jffs2 error in dmesg is the above-mentioned "word at offset" message. I do not get the "failed to sync overlay" message.

It can be reproduced as follows: upload ca. 4 MBytes via scp to /root on the TP-Link Archer C7 v5. Set sysupgrade.conf to persist /root. Upload the openwrt-sysupgrade.bin image to /tmp and run sysupgrade -k via ssh. Checking after the first reboot following the flash (as @jeff described it in that topic), I had 28 failures and 3 successes repeating the same procedure. (This can be reproduced without using the offline installer scripts from my first post.) Failure means the jffs2 erase error message was present.
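
To classify each reproduction attempt as a failure or a success, the kernel log can be checked for the message quoted earlier. The following is a small sketch under that assumption; the helper name count_jffs2_erase_errors is made up for illustration, and it only counts matching lines in whatever log text it receives on stdin.

```shell
#!/bin/sh
# Sketch: count occurrences of the jffs2 erase message in kernel log text read from stdin.
count_jffs2_erase_errors () {
	# grep -c prints the number of matching lines; "|| true" keeps the exit status at 0
	# when there are no matches, so the function can be used in scripts that check "$?".
	grep -c "jffs2: Newly-erased block contained word" || true
}
```

On the router this could be used as "dmesg | count_jffs2_erase_errors" after the first boot following the flash; any result other than 0 would count as a failure in the sense used above.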


Don't you think if someone took time to mention it, that it may be important?

  • Commit (or 4b1f073f843dc4be655a868f6a6e31f74baa727c) is the serial number the versioning system gave to the specific change/addition/removal of code to a repository (or code repo)
  • Repo (or code repository) is where you're downloading software from (it's kind of a security issue that you don't know where the code is from) - It means your script is not from OpenWrt, just look at the link. Have you asked GL-Inet?

Then I'd suggest not developing a script to upgrade it. Simply use the normal sysupgrade method.

Sure. I even manually loaded the files I need persisted to /root, and no matter what they are, when I go to the web UI and flash the upgrade, the jffs2 error sometimes comes up and sometimes does not. It has nothing to do with the scripts. If it worked fine the manual way in all cases, the script would also be okay.

Having kilobytes to persist -- always works
Having megabytes to persist (part free over 70% still) -- often fails.
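
To put a number on how much data is being persisted, the sizes of the preserved files can be summed. The sketch below assumes a list of file paths on stdin, one per line; the helper name sum_file_sizes is hypothetical. On OpenWrt, "sysupgrade -l" should be able to supply such a list of files that would be backed up, but treat that pairing as an assumption rather than a tested recipe.

```shell
#!/bin/sh
# Sketch: sum the sizes in bytes of regular files whose paths arrive on stdin, one per line.
sum_file_sizes () {
	total=0
	while IFS= read -r path; do
		# Ignore directories, missing files and anything else that is not a regular file.
		[ -f "${path}" ] || continue
		size="$(wc -c < "${path}")"
		total=$((total + size))
	done
	echo "${total}"
}
```

For example, "sysupgrade -l | sum_file_sizes" would print the approximate number of bytes kept across the upgrade, which can then be compared against the kilobytes versus megabytes observation above.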

@lleachii please help correcting the bug

is this not a sysupgrade data size thing?


I'm not sure if you're purposely ignoring me; or didn't understand again...

...or perhaps the information you offered prior to tagging me wasn't in response to my post. I asked why haven't you inquired with the developers of these scripts (it isn't OpenWrt):


Are you sure you're calculating this correctly...because the sysupgrade image itself is bigger than the 3 MB you just quoted:

I'm estimating you have less than a few hundred KBs...before the sysupgrade bundles all persistent data.

I'm quite confused about how your logic always arrives at a bug. Also, please be mindful: even if you use the approach of "screaming" that there's a bug to get assistance, you still have to clearly identify it. Lastly, why would OpenWrt investigate a problem in code written by another firmware manufacturer for their firmware? :confused:

Without your cooperation (i.e. mentioning things should work manually, but not informing us if things do work manually now) and gleaning information from the OpenWrt Wiki and this post, I'm simply starting to agree with @mk24 that you may have overfilled your device.

Can you actually provide the real free and file totals; and other information previously requested?

I think you'll find it's over 9 MB.

@lleachii I'm NOT using this gl-inet stuff. I use official OpenWrt stable builds, 19.07.4 for the TP-Link Archer C7 v5. So why would I get a foreign image or discuss foreign code? It's just that the problem described at gl-inet seems to be the same.

If I upgrade via web ui - problem. Via ssh sysupgrade, too.

Via ssh, running df -h, find /overlay and du -sh /overlay made sure it's not overfilled. About 2.2 MByte is filled before running sysupgrade; while it executes, usage goes up to 4.4 MByte, I guess because it backs things up for later.
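
The overlay check mentioned above can be condensed into a single number. This is a minimal sketch; the helper name fs_used_percent is an assumption, and it relies on the POSIX output format of "df -P", where the fifth column of the data line holds the usage percentage.

```shell
#!/bin/sh
# Sketch: print the used percentage (without the % sign) of the filesystem holding a path.
fs_used_percent () {
	# df -P prints a header line plus one data line; field 5 of the data line is the usage.
	df -P "${1}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

On the router, "fs_used_percent /overlay" could be run before and during sysupgrade to capture the kind of figures quoted above.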

The sysupgrade image is bigger, but that is not relevant for the used space in the overlay. Both the web UI and I (when uploading via scp) place the image in /tmp/firmware.bin. There are over 50 MBytes free there.

@lleachii do you need screenshots? Or is it enough if an experienced IT expert says he has tested for 2 days and is sure there is something wrong and that he has observed it correctly?

@Catfriend1, I'm lost at how you think someone has tested two days.

And if you're referring to yourself, please provide the solution in a code report, developer list, etc.. To reiterate, just saying "there's a bug" doesn't identify where it is; nor get it fixed. OK, [hypothetically], there's a bug...now where is it and what's the fix?

This is community-based support and you provide little and conflicting information. I've had the following draft post saved for those 2 days:


:man_facepalming:

The script is from GL-INET! Look at your link:

:arrow_down:

GL-Inet does-not-equal OpenWrt. Are you identifying that code as the issue?

If so, locate the code in OpenWrt and note the bug there.

:open_mouth: /tmp and overlay are not the same...I'm not even sure how you honestly presented a number greater than 3 MB; but greater than your installed flash chip...hummm.

Yea, I think you may not understand.

@lleachii your comments come across to me as sidetracking my facts by misinterpreting my posts and not being willing to help. I'm not a coder but an admin and user. I know a lot about OpenWrt from those perspectives, and I do know how to make a good bug report, investigate, and clearly sum up what has proved to be wrong. I need help from a developer who is willing to take the report seriously and look at the code with his expertise. I don't need comments that are far from doing so and that accuse me of not communicating like a dev or not doing the bug hunting in code myself. I have done hours of flashing, resetting, and starting afresh to prove there is a problem when OpenWrt handles the SPI reads concurrently while restoring a 2 MByte jffs overlay after sysupgrade. Period. Don't troll me. Let the others read, understand and help. You jumped roughly over what I wrote in an offensive-sounding way. I did not use the gl-inet firmware, so there is no need to check whether it is their bug too. But what is written there is a good pointer to what could be the cause of this verified symptom in OpenWrt. If it's not useful, okay, then an expert needs to come up with theories, but the problem is real and will stay.

Please explain the Gl-INET link then (because I know I'm not the only one confused)...then you said:

So please explain what the link means???

Maybe this will help.

:laughing:

(A developer, or at least a Core Team Member, responded to you already! - See: Post No. 7 - You refused to explain why there's a GL-INET link.)