Unexplained hangs, freezes and reboots with Archer C2600

I spoke too early. The device crashed with 'performance' governor around the 14 day mark.

OK, mod installed, and I rolled back my firmware to stable with the ath10k-ct-htt firmware. If it ever crashes again I'll log the serial output.

Yep, you're right about that. My router just rebooted 3 hours ago, reaching ~14d uptime as well. Impossible for me to tell if it crashed just before or after reaching the 14d mark though.
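One way to narrow down when a reboot happened is to sample the uptime periodically and notice when it goes backwards. Below is a minimal sketch of that idea. The function names `format_uptime` and `record_uptime` and the marker file are hypothetical, and it assumes the marker lives on storage that survives a reboot.

```shell
#!/bin/sh
# Hypothetical helper: format a number of seconds as "Nd Nh Nm Ns".
format_uptime() {
    secs=${1%%.*}                      # drop the fractional part of /proc/uptime values
    days=$((secs / 86400))
    hours=$((secs % 86400 / 3600))
    mins=$((secs % 3600 / 60))
    printf '%dd %dh %dm %ds\n' "$days" "$hours" "$mins" $((secs % 60))
}

# Hypothetical helper: store the current uptime in a marker file.
# $1 = marker file (assumed to be on persistent storage)
# $2 = uptime source, defaults to /proc/uptime
record_uptime() {
    marker=$1
    src=${2:-/proc/uptime}
    read -r up rest < "$src"
    now=${up%%.*}
    if [ -f "$marker" ]; then
        read -r prev < "$marker"
        # Uptime going backwards means the box restarted since the last sample.
        if [ "$prev" -gt "$now" ]; then
            echo "rebooted after at least $(format_uptime "$prev")" >> "$marker.history"
        fi
    fi
    echo "$now" > "$marker"
}
```

Called from cron every few minutes, `record_uptime /root/last_uptime` would leave a line in `/root/last_uptime.history` after each restart. The resolution is only as good as the sampling interval, and frequent writes to flash are a trade-off to keep in mind.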

I have three of these, running as APs with 802.11r enabled, and noticed the same behaviour.

Two of them were restarted when we had the power meter replaced, so their uptime is very much in sync.
Their reboots appear to be in sync too, with just 3 hrs difference in uptime.

All three run OpenWrt 19.07.3 and use ath10k-firmware-qca99x0-ct v2019-10-03-d622d160-1.

I will set one to use the non -ct packages (242),
one to /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor performance (241),
and leave one as it is (240)

The (24x) are just my references so I know what's been done to which unit :wink:
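For anyone wanting to try the governor change, here is a minimal sketch of writing it to every core via the sysfs path mentioned above. The function names `set_governor` and `show_governors` are hypothetical, and the optional base-directory argument is only there so the functions can be tried against a scratch directory first.

```shell
#!/bin/sh
# Hypothetical helper: write a governor name to every core's scaling_governor.
# $1 = governor name, e.g. performance
# $2 = sysfs CPU directory, defaults to /sys/devices/system/cpu
set_governor() {
    gov=$1
    base=${2:-/sys/devices/system/cpu}
    for f in "$base"/cpu*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue        # glob did not match, or no cpufreq for this entry
        echo "$gov" > "$f"
    done
}

# Hypothetical helper: print the governor currently set on each core.
show_governors() {
    base=${1:-/sys/devices/system/cpu}
    for f in "$base"/cpu*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}
```

Running `set_governor performance` followed by `show_governors` should confirm the change. The setting is lost on reboot, so it would have to be reapplied at boot, for example from `/etc/rc.local` (an assumption about how one might persist it, not something tested in this thread).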

Not sure 802.11r will work across -ct and non-ct packages.

I had a planned reboot 2 days ago, so my uptime was cleared. However, I haven't had any unplanned reboots yet. Waiting for the 15 day mark to pass...

Well, the one with the swapped ath10k packages froze, so now it'll be a clean reboot and a new countdown.
I'd say it was spot on 14 days, but since I have three, I seldom notice if just one (or even two) go down/reboot, because of the redundancy :wink:

It might be related to the package swap, so the next two weeks will be interesting.

Reading this thread with great interest. I also have a C2600, running in wireless bridge mode. Unfortunately I'm also sometimes seeing a lost connection on the 5 GHz bridge, as well as kernel crashes... Replacing the drivers did not make things more stable. Any tips or useful commands to find the cause of this problem would be highly appreciated!

System log output from when the unexpected behaviour occurred:

[47068.768203] ------------[ cut here ]------------
[47068.768235] WARNING: CPU: 0 PID: 0 at /builder/shared-workdir/build/build_dir/target-arm_cortex-15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/ath10k-ct-regular/ath10k-ct-2019-09-09-e8cd86f/ath10k-4.19/htt_rx.c:1206 0xbf2d7be0 [ath10k_core@bf2ba000+0x57000]
[47068.772303] Modules linked in: pppoe ppp_async ath10k_pci ath10k_core ath pppox ppp_generic nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat ledtrig_usbport nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_of_simple ohci_platform ohci_hcd phy_qcom_dwc3 ahci ehci_platform
[47068.843139]  sd_mod ahci_platform libahci_platform libahci libata scsi_mod ehci_hcd gpio_button_hotplug
[47068.865364] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.195 #0
[47068.874484] Hardware name: Generic DT based system
[47068.880738] Function entered at [<c030f1c4>] from [<c030b390>]
[47068.885423] Function entered at [<c030b390>] from [<c07c1064>]
[47068.891238] Function entered at [<c07c1064>] from [<c031f878>]
[47068.897054] Function entered at [<c031f878>] from [<c031f964>]
[47068.902869] Function entered at [<c031f964>] from [<bf2d7be0>]
[47068.908709] Function entered at [<bf2d7be0>] from [<bf2d91e0>]
[47068.914502] Function entered at [<bf2d91e0>] from [<bf2d99d0>]
[47068.920317] Function entered at [<bf2d99d0>] from [<bf315758>]
[47068.926140] Function entered at [<bf315758>] from [<c06a9d8c>]
[47068.931950] Function entered at [<c06a9d8c>] from [<c03015c8>]
[47068.937766] Function entered at [<c03015c8>] from [<c0323e18>]
[47068.943583] Function entered at [<c0323e18>] from [<c03629c8>]
[47068.949398] Function entered at [<c03629c8>] from [<c0301488>]
[47068.955213] Function entered at [<c0301488>] from [<c030bf8c>]
[47068.961030] Exception stack(0xc0a01f48 to 0xc0a01f90)
[47068.966853] 1f40:                   00000001 00000000 00000000 c0315100 ffffe000 c0a03cb8
[47068.971981] 1f60: c0a03c6c 00000000 00000000 c092ea28 00000000 00000000 c0a01f90 c0a01f98
[47068.980136] 1f80: c030854c c0308550 60000013 ffffffff
[47068.988287] Function entered at [<c030bf8c>] from [<c0308550>]
[47068.993321] Function entered at [<c0308550>] from [<c0358828>]
[47068.999051] Function entered at [<c0358828>] from [<c0358b70>]
[47069.004865] Function entered at [<c0358b70>] from [<c0900c58>]
[47069.010763] ---[ end trace 32dc9c995eb0b581 ]---
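The trace above is a kernel WARNING raised from htt_rx.c:1206 in ath10k-ct, and the stack frames are unresolved addresses, so the source location is the most useful part. To pull blocks like this out of a longer log and see which location fires most often, something along these lines might help. It is only a sketch: the function names `extract_warnings` and `count_warning_sites` are hypothetical, and it assumes the log keeps the "cut here" and "end trace" markers shown above.

```shell
#!/bin/sh
# Hypothetical helper: print every kernel warning block read from stdin,
# from the "cut here" marker through the matching "end trace" line.
extract_warnings() {
    awk '/\[ cut here \]/ { inblock = 1 }
         inblock          { print }
         /end trace/      { inblock = 0 }'
}

# Hypothetical helper: count warnings per source location,
# i.e. the "at <file>:<line>" part of each WARNING line.
count_warning_sites() {
    grep 'WARNING: CPU:' | sed -n 's/.* at \([^ ]*\) .*/\1/p' | sort | uniq -c
}
```

For example, `dmesg | extract_warnings` prints only the warning blocks, and `dmesg | count_warning_sites` shows how many times each file and line triggered. Saving that output somewhere persistent before a crash is a separate problem, since the kernel ring buffer is lost on reboot.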

Well so far my tests are inconclusive.

The untouched router has passed the 15 days mark, and so has the one with
scaling_governor performance.

The one with the non -ct packages hasn't reached 14 days yet.

OK, it's been 16 days now since the last planned reboot. Still no unplanned reboots.

So, the untouched one and the one with scaling_governor performance are now at 23 days.
The non -ct one is at 17 days.

Still inconclusive :frowning:

Got above 16 days uptime before I had to reboot. I've now gone from a 15/15 to a 150/15 connection. I wonder if the speed difference will have any effect on stability.

Almost a month and still no reboot. Could it be that the reboot issues are actually caused by multiple reasons?

Anyway, the DVFS/cpufreq driver glitch is definitely one of the causes, so at least we made it more stable (when it actually got fixed properly).

The one with the non-ct package rebooted itself after roughly 29 days, the other two have passed 34 days.

The non -ct one rebooted itself again after 3 days; the other two are at 40.

I'm considering using another power supply, just to try something else.

I have a C2600 that is probably bricked, and I'm trying to get a serial connection to it for recovery. It had a similar lock-up problem. I think it got bricked during the firmware update to 19.07.5.

I'm now using a Linksys EA8500, also with an IPQ8064, running 19.07.5, and it's locking up as well.

Someone mentioned this might be the issue?

It doesn't seem very stable to me.

U-Boot is pretty hard to kill; TFTP recovery works in most cases.

Regarding stability, the one unit I have with /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor performance is at 65 days uptime.

The other two are at 26 days (non -ct package) and 16 days (untouched, with -ct package).

The C2600 pulled the image via TFTP but never finished. Can I solder onto the pads on JP4 and use TTL for serial?

What's the non -ct package?

You should be able to rerun the TFTP.

No idea about the serial, I never had to use it, but if it's documented, go for it.
Note that it uses 1.8V, not 3.3V.

The -ct package is one variant of the wifi drivers; read the earlier posts in this thread.

Uptime with scaling governor workaround: 35d 1h 23m 59s.

What changed for me was getting rid of my WDS link and instead moving over to Powerline to bridge my network.

I could go on about what I feel or think might be the cause, but I'm just going to jot down my numbers:

  • Wireless bridge (WDS) + normal scaling governor settings: 10 min to 7 days until a crash, at random. The higher the wireless utilization, the greater the chance of the device crashing.
  • Wireless bridge (WDS) + performance scaling governor: Average around 14 days until crash, max was 16 days.
  • Powerline + performance governor: Not a single crash since the change. The wireless interfaces on both 2.4 GHz and 5 GHz are under constant 24/7 load, with a minor (10%) reduction in overall traffic.

I'm also using the following firmware for ath10k-ct: ath10k-10.4b-ct-htt-9980-fH-13-6d73a309a. I tried the beta ath10k-ct firmware from CandelaTech in the hope of getting better stability. Didn't notice any change in that regard.

With my faster connection, I usually slam Core1 to 100% utilization when downloading anything thanks to SQM.

edit: Restarted at 48d to apply the 19.07.6 update.

The one unit I have with /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor performance is now at ~140 days, but it's still on 19.07.3.

The other two run 19.07.6, and are at 21 and 47 days.
