Unexplained hangs, freezes and reboots with Archer C2600

ergamus · January 13, 2021, 4:24pm

Uptime with scaling governor workaround: 35d 1h 23m 59s.

What changed for me was getting rid of my WDS link and instead moving over to Powerline to bridge my network.

I could go on about what I feel or think might be the cause, but I'm just going to jot down my numbers:

Wireless bridge (WDS) + normal scaling governor settings: anywhere from 10 minutes to 7 days until a crash, at random. The higher the wireless utilization, the higher the chance of the device crashing.
Wireless bridge (WDS) + performance scaling governor: around 14 days on average until a crash; the maximum was 16 days.
Powerline + performance governor: not a single crash since the change. The wireless interfaces on both 2.4 GHz and 5 GHz are under constant 24/7 load, with a minor (10%) reduction in overall traffic.

I'm also using the following firmware for ath10k-ct: ath10k-10.4b-ct-htt-9980-fH-13-6d73a309a. I tried the beta ath10k-ct firmware from CandelaTech in the hope of getting better stability. Didn't notice any change in that regard.

With my faster connection, I usually slam Core1 to 100% utilization when downloading anything thanks to SQM.

edit: Restarted at 48d to apply the 19.07.6 update.

frollic · March 27, 2021, 8:27pm

The one unit I have with /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor set to performance is now at ~140 days, but it's still on 19.07.3.

The other two run 19.07.6, and are at 21 and 47 days.
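For reference, a minimal sketch of the governor workaround discussed above, using the sysfs path quoted in this post. The function name `set_governor` and its optional second argument are hypothetical; the second argument exists only so the function can be tried against a scratch directory instead of the real sysfs tree.

```shell
#!/bin/sh
# Hypothetical helper: write a governor name to every CPU's scaling_governor.
# The optional second argument (sysfs root) is an assumption added for dry
# runs; on the router, leave it at its default.
set_governor() {
    governor="$1"
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root"/cpu*/cpufreq/scaling_governor; do
        # Skip the unexpanded glob if no CPU exposes cpufreq
        [ -e "$f" ] || continue
        echo "$governor" > "$f"
    done
}

# On the router, e.g. from /etc/rc.local (before "exit 0"):
# set_governor performance
```

Settings written this way do not persist across reboots, which is why they are usually placed in a startup file such as rc.local.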

stripwax · April 18, 2021, 10:22am

I have an Archer VR2600 (non-V variant, i.e. VR2600 not VR2600V) with the same CPU (IPQ806x), and it is experiencing occasional random hangs (followed by an automatic reboot) on 19.07. I'm fairly new to OpenWrt, so I haven't used older builds. It feels like I'm getting these more often with 19.07.6 and 19.07.7 than with earlier releases, but that's a non-scientific hunch. I'm reading this thread with interest and will post my findings (if any). For now, I'll start with the governor change. Quick question: other than serial, is there any way to capture the kernel oops? There's no persistence after reboot, correct?

frollic · April 18, 2021, 10:31am

ergamus · May 27, 2021, 3:42pm

There is an interesting commit in the master / 21.02 branch:

ipq806x: improve system latency

Various report and data show that the freq 384000 is too low and cause some
extra latency to the entire system. OEM qsdk code also set the min frequency
for this target to 800 mhz.
Also some user notice some instability with this idle frequency, solved by
setting the min frequency to 600mhz. Fix all these kind of problem by
introducing a boot init.d script that set the min frequency to 600mhz and set
the ondemand governor to be more aggressive. The script set these value only if
the ondemand governor is detected. 384 mhz freq is still available and user can
decide to restore the old behavior by disabling this script.

Signed-off-by: Ansuel Smith ansuelsmth@gmail.com
(cherry picked from commit 861b82d36ae43efec8d16e61b82482e38996af92)

Based on our experiences here, this might have been the cause of the crashes related to using the ondemand governor. I'm going to build a new 21.02 image and set the governor back to ondemand, hoping this is it!
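The behaviour the commit message describes (raise the minimum frequency and tune ondemand, but only when ondemand is the active governor) can be sketched roughly as follows. This is not the script from the commit. The function name `apply_ondemand_tuning` and its parameters are hypothetical, the sysfs paths are the ones quoted elsewhere in this thread, and the defaults of 10 and 50 are the values reported as defaults in this thread.

```shell
#!/bin/sh
# Sketch only, not the actual init script from the commit.
# Usage: apply_ondemand_tuning [ROOT] [MIN_FREQ_KHZ] [DOWN_FACTOR] [UP_THRESHOLD]
# ROOT is an assumption added so the function can be dry-run on a scratch dir.
apply_ondemand_tuning() {
    root="${1:-/sys/devices/system/cpu/cpufreq}"
    min_freq="${2:-600000}"
    down_factor="${3:-10}"
    up_threshold="${4:-50}"

    # Do nothing unless ondemand is the active governor
    governor="$(cat "$root/policy0/scaling_governor" 2>/dev/null)"
    [ "$governor" = "ondemand" ] || return 0

    # Raise the minimum frequency on every policy that exposes it
    for p in "$root"/policy*; do
        [ -e "$p/scaling_min_freq" ] || continue
        echo "$min_freq" > "$p/scaling_min_freq"
    done

    # Tune how eagerly ondemand ramps up and how long it stays up
    echo "$down_factor" > "$root/ondemand/sampling_down_factor"
    echo "$up_threshold" > "$root/ondemand/up_threshold"
}
```

Called with no arguments, it applies a 600 MHz minimum; passing a different minimum frequency or tunables as arguments allows experimenting without editing the function.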

dogtopus · May 27, 2021, 8:38pm

Nice find. I also removed my cpufreq governor override and changed it to do what the above script would do. Let's see if it crashes or not after a few days.

ergamus · May 29, 2021, 9:26am

Mine crashed after less than 20 hours of uptime. I'm going to try tweaking the values in the script to be even more aggressive. One really wonders about the quality of the silicon in these devices.

For now I'll up the base frequency from 600->650 but leave sampling_down_factor and up_threshold at default values. And I'll continue bumping the frequency after crashes up to the 800MHz min frequency mentioned in the commit message.

farr4744 · May 29, 2021, 10:29am

I changed the CPU governor to performance as has been commented. So far I am at 9+ days uptime. We'll see how long it lasts...

stripwax · May 29, 2021, 2:12pm

There's almost certainly nothing wrong with the silicon. The issue is almost entirely related to the fact that there is no documentation available on how to correctly write software for this hardware. Sadly with most consumer devices, this is the way. The hardware is more than capable if you know how to use it properly, but that's a big if, and finding out by experimentation is often the only way, in the absence of any actual official documentation.

otnert · May 30, 2021, 1:26am

My last reboot was an unexpected power outage, been up for 33d 15h 30m 28s.

Not sure if this helps other users, but I have the settings from "TP-Link Archer C2600 performance with 17.01.4 - #43 by otnert" in my rc.local file, and I'm running irqbalance.

ergamus · May 30, 2021, 5:36pm

Your post makes sense to me, binning and all. But here is what I don't understand: if the devs that worked with Qualcomm decided to stick with an 800MHz minimum for stability's sake, why do we go lower? It's not like these devices are overheating out of the box: they have a nice aluminum heatsink, and the PSU itself seems pretty damn good.

My uptime is higher now @ 650 versus 600, but based on what you said trying to change the min frequency higher is not worth it then. If all are binned equally, then I shouldn't be crashing @ 600.

Previously when running at max frequency 24/7, from the day I flashed the firmware to the day I flashed the next build it was 100% stable. I always assumed something went wrong during frequency changes, based on the above experience. However I also might have mistakenly assumed all processors would be more stable if you underclocked them, assuming the VID table or your undervolt wasn't too aggressive.

I'll skip 700-750 and go back to the default 600MHz minimum if it crashes the next time, and tweak the scheduler values little by little (+/-) and see if anything changes for the better. And if not, try 800MHz with stock scheduler values. And if that fails, back to aggressive.

edit: A bit over 2 days uptime with 650MHz. Now back to 600MHz and tweaking scheduler values.
edit2: I feel a bit stupid, but these are the valid clock frequency steps (in kHz), on the C2600 at least: 384000, 600000, 800000, 1000000, 1200000, 1400000. So raising the base from 600 to 650 had no effect in reality.
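A quick way to avoid picking a frequency that is not a real step is to check it against the list the driver reports. The sketch below assumes the driver exposes `scaling_available_frequencies` under the policy directory (cpufreq drivers with a frequency table normally do); the function name `is_valid_freq` and its optional policy-directory argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FREQ_KHZ is one of the steps listed in
# scaling_available_frequencies for the given policy.
# Usage: is_valid_freq FREQ_KHZ [POLICY_DIR]
is_valid_freq() {
    want="$1"
    policy="${2:-/sys/devices/system/cpu/cpufreq/policy0}"
    # The file holds a single space-separated line of frequencies in kHz
    for f in $(cat "$policy/scaling_available_frequencies"); do
        [ "$f" = "$want" ] && return 0
    done
    return 1
}

# Example on the router:
# is_valid_freq 650000 || echo "650000 is not an available step"
```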

Right now I'm trying the following settings. Under my usual load, one core stays at 1.4 GHz and the other moves between idle and max frequency. The frequency selection is more sustained and less dynamic. I'm getting the current frequency data from: /sys/devices/system/cpu/cpufreq/policy*/cpuinfo_cur_freq

    echo 600000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
    echo 600000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq
    echo 20 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor  # def: 10
    echo 30 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold          # def: 50
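To watch what the governor actually does with settings like these, the per-policy `cpuinfo_cur_freq` files mentioned above can be read in a loop. A small sketch; the function name `print_cur_freq` and its optional root argument are hypothetical, and reading these files typically requires root.

```shell
#!/bin/sh
# Hypothetical helper: print "<policy> <current frequency in kHz>" per policy.
# Usage: print_cur_freq [ROOT]
print_cur_freq() {
    root="${1:-/sys/devices/system/cpu/cpufreq}"
    for p in "$root"/policy*; do
        # Skip policies that do not expose a readable cpuinfo_cur_freq
        [ -r "$p/cpuinfo_cur_freq" ] || continue
        echo "$(basename "$p") $(cat "$p/cpuinfo_cur_freq")"
    done
}

# Example on the router: sample once per second
# while true; do print_cur_freq; sleep 1; done
```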

edit3: Currently at almost 8 days uptime with the modified ondemand governor values at stock 600MHz frequency. Better than the previous two crashes (a couple of hours & less than 2 days at default values).

edit4: Crashed after 10 days roughly. I've decreased up_threshold to 20. This will probably be my final adjustment to the scheduler on 600MHz, after that I'll try default OpenWRT values but with 800MHz set as the minimum frequency.

anon72830772 · June 10, 2021, 11:02pm

Just a heads up regarding irqbalance. The following was posted in another topic:

After installing irqbalance, users only need to enable it in

/etc/config/irqbalance

dogtopus · June 12, 2021, 2:29am

Mine crashed recently too. It was running continuously for about 2 weeks.

ergamus · June 14, 2021, 11:15pm

I've had irqbalance running for several months on the C2600 with no effect on stability that I could detect. Might be more effective now in combination with the tweaks to minimum frequency and scheduler values, but it still crashes regardless.

dogtopus · June 19, 2021, 1:40am

OK, it crashed again today with the 800MHz min frequency, after a little more than 1 week of uptime. I think it's back to behaving just like before, when I hadn't disabled the ondemand governor.

ergamus · June 27, 2021, 12:33pm

I'm creeping up on two weeks of uptime with the following settings:

    echo 600000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
    echo 600000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq
    echo 20 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor
    echo 20 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold

Which pretty much means I never sit on min freq unless the device is idle. I suspect if I decided to loop a command to disable WAN, enable, disable to force load/idle/load it would crash a lot faster.

edit: Nevermind, crashed. I give up, it's a waste of my time to try and tweak around with something that is obviously broken. Back to performance.

otnert · July 5, 2021, 4:44am

Since the last power outage...

 OpenWrt 19.07.7, r11306-c4a6851c72
 -----------------------------------------------------
root@OpenWrt:~# uptime
 14:08:32 up 69 days, 18:47,  load average: 0.00, 0.04, 0.06

I have been using this.....

since chasing OpenVPN throughput back then. I run WireGuard now, but have left the same settings in rc.local.

If you're after any other settings, let me know.

EDIT Ahh! my router also succumbed to a spontaneous reboot - all up 72 days!

ergamus · July 8, 2021, 1:51pm

I'd guess a drop in traffic caused the governor to decrease the frequency, which triggered some firmware/hardware bug. We all have varying success with the various fixes, but I think hardware revision differences also play a role. My C2600 is one of the earlier revisions, and I'm assuming that's why mine crashes often with ondemand, while others are reporting their problems fixed outright by the new governor script. Either way, it's been a problem for years now, and I've lost all hope of ever fixing it.

dogtopus · August 13, 2021, 2:05am

An update: I've been running with an 800MHz minimum frequency and the ondemand governor for 55 days now, without a single reboot in that time. So 800MHz seems to be the safe lower limit.

(I think my previous post was not accurate since I forgot to uncomment the lines that set the min freq to 800MHz, so it ran on the default low and therefore still crashed after a few weeks.)

dogtopus · August 13, 2021, 2:14am

Maybe you should just use performance then. It could actually be bad silicon if an 800MHz min freq doesn't resolve it, or at least make the reboots a lot less frequent.