I think there may be a bug causing intermittent sysupgrade flash failures on the TP-Link TL-WR710N and similar models. I recently lost four devices to random sysupgrade failures and I'm reading a number of similar reports from others in recent months. I don't know if there is a real issue here or not, but I'm at least going to report it.
I have a small fleet of TP-Link TL-WR710N hardware version 1.0 devices. I build my own custom images with the packages I need and then send out the image to my fleet with some scripts that automatically manage the process.
Several weeks ago I upgraded six devices and two of them failed for unknown reasons. They simply didn't recover after the sysupgrade reboot. All six devices are configured almost identically and use the same image. The build timestamp was 20171008020100, based on v17.01.3 / head df54a8f583a9afad356fb99a575d75b69c8c0dd4. Also note that this build was tested on a test unit where I flashed it at least twice without any problem.
I sent out two spares to the affected locations and the failed units were sent back to me. The spares were upgraded to my current build at this time without any trouble.
I thought that maybe the image format didn't include a hash and that the image may have been corrupted in transit, so I added a step to my script that runs sha256sum on the file before the sysupgrade happens. I thoroughly tested this verification system and it worked perfectly. My test device was sysupgrade-flashed at least three times during this testing.
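A minimal sketch of that kind of pre-flash check, assuming it runs somewhere Python is available (on a stock device image the sha256sum utility would more likely be called directly). The function names and the way the expected digest reaches the script are illustrative assumptions, not taken from the actual scripts described here.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_intact(path, expected_hex):
    """Compare the file's digest with the expected one (hypothetical helper).

    expected_hex is assumed to be the digest computed on the build machine
    and delivered separately from the image.
    """
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

A wrapper script would call image_is_intact() on the transferred image and only start the sysupgrade if it returns True. Note that a check like this only rules out corruption in transit; it says nothing about whether the write to flash itself succeeded.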
Two days ago I decided to push out another upgrade. This one had a build timestamp of 20171103195944, v17.01.4 / head 444add156f2a6d92fc15005c5ade2208a978966c. Another two out of six systems failed their sysupgrade. I don't have these units in my possession yet, but I assume they failed in the same way. One of the failed units was a replacement which I had sent out, and the other was one of the four which had successfully sysupgraded the first time.
So that's a total of eight devices, flashed in two events, and four failures total.
Just to be clear, my build doesn't have any patches or modifications to the system which I think could be causing this. I've added a few files and I customize the packages on the system, but that's about it -- nothing crazy going on there. The same image is being used on all of the devices.
Also, all of these devices have been upgraded many many times in previous years; at least 20 times each. No issues before this recent trouble.
I have attached a boot log from one of the failed devices below.
Similar reports I found:
https://bugs.lede-project.org/index.php?do=details&task_id=478
This one is a different model but shows very similar behavior (intermittent failure) and has a very similar or identical SoC, although the reporter has a hardware-modified flash chip:
https://bugs.lede-project.org/index.php?do=details&task_id=48
This looks similar. Has an ar9xxx series SoC too. Mentions gargoyle so I'm not sure this is actually related:
https://eko.one.pl/forum/viewtopic.php?id=15628
In this thread someone mentions an unexpected softbrick on a WR1043ND and his log shows similar errors:
Finally, WTF? No text file or log uploads? "Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif)."
Log below since I can't attach it.