IPQ4018 Linksys EA6350v3 WiFi Dead After ~24-48 hrs

Looking at my IPQ4019-based EA8300s that I run 802.11s and batman-adv on, notable config changes include

# CONFIG_PACKAGE_ath10k-firmware-qca4019-ct is not set
# CONFIG_PACKAGE_ath10k-firmware-qca9888-ct is not set

Last build was off

2019-09-17 22:17:45 +0200

commit 2b342d01a2
Author: Hans Dedecker <redacted>

    glibc: update to latest 2.27 commit (BZ#23637)

Trying a new build off today's master.

Edit: Not connecting on 802.11s for some reason, so no immediate data available.

I'm seeing more of the same from 19.07 rc2 on my first floor EA6350v3 (IPQ4018 256MB memory) AP:

Climbing/falling and even "baseline" EA6350v3 memory usage look pretty sad in comparison to memory usage on my second floor EA8500 (IPQ8064 512MB memory) running 19.07 rc2 and performing essentially the same wired AP duties (actually, with both children streaming Netflix simultaneously on it now, probably more duty):

On the plus side, I scored a second EA8500 on ebay for $45 shipped last week :man_dancing:

I really think that the problem is not in master. Maybe trying master can help.

See from my master build:

If you're wondering why 30% is cache: that's because I'm using it as a NAS.

Is "master" different from "snapshot"?

I've tried the latest (as of a few days ago) snapshot builds - same problem. If "master" means building from the latest source myself, that will need to wait until my next long weekend and time to relearn some rusty skills, which could be a while.

The goal being functional main OpenWrt builds notwithstanding, I'll probably flash your build next weekend NoTengoBattery - that will give one more data point. Seeing a couple others are having similar problems on zyxel 6617 and RT-AC58U at least tells me I'm not going crazy. Both of those are IPQ4018 (as is the EA6350v3), not IPQ4019. Might be a clue there - or not.

There used to be 19.07 snapshots as well (I think the 19.07 branch has now switched to RCs), but in general the term 'snapshot' refers to builds off the master branch.

Like @slh pointed out the RT-AC58U is crippled by design by its lack of RAM, so it's a bad device to profile issues. It's like pointing to an immunodeficiency patient and saying he's ill a lot.

On my device (Zyxel 6617) the RAM usage doesn't look so bad. I have it running as a WDS client on 5 GHz and as an access point on 2.4 GHz. It disconnects from 5 GHz client mode while 2.4 GHz keeps working, so I'm not sure RAM is the problem here.

root@zyxel:~# free -m
              total        used        free      shared  buff/cache   available
Mem:         250952       79496      160464         272       10992      142708
Swap:             0           0           0
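
A side note on reading that output: the figures look like kilobytes rather than megabytes (250952 is about 245 MiB, which fits a 256 MB board). Below is a minimal sketch for printing total and available memory in MiB, assuming a shell with awk on the router. The function name `mem_summary_mib` is made up for illustration.

```shell
#!/bin/sh
# Print MemTotal and MemAvailable in MiB from a meminfo-style file.
# $1 (optional): file to read; defaults to /proc/meminfo.
# Values in /proc/meminfo are reported in kB, so divide by 1024.
mem_summary_mib() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { printf "total=%d MiB available=%d MiB\n", total / 1024, avail / 1024 }
    ' "${1:-/proc/meminfo}"
}
```

Calling `mem_summary_mib` with no argument reads the live `/proc/meminfo`; passing a saved copy lets you compare snapshots taken at different times.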

Don't get it wrong. My build is actually tuned and configured in a different way. I've seen this problem some time ago, but it eventually vanished. So I can't directly confirm if it's a misconfiguration.

LOL NoTengo... misconfiguration is definitely an option. I've no pride there, and some worry. On that note, I did a quick full reset to defaults this morning, and re-installed only the luci_statistics package. I did pull up a backup for a guide, but minimally restored my 4 wifi interfaces (primary and secondary guest on 2.4G and 5G), VLANs, and firewall rules for same; saved the config files and rebooted. Looked great for 20 minutes. Went to work, came home, looked at logs and statistics graphs, and, well, same old problem. No luck there.

This doesn't completely rule out a config problem of course. I've made my share of stupid mistakes to be sure. However, the config on this unit is essentially a twin of the EA8500 AP I use on the second floor of the house, and that has been rock solid since day one on 18.06, snapshots, and 19.07 rc2. So that gives me a little bit of confidence that I've got it configured OK.

I'm going to try removing ath10k-ct and replacing it with the non-CT driver next. That's a little easier than flashing your build (which I'll save for the weekend), and something else to try during the week. Might be a few days, but I'll report what I learn here, whether nothing or something.

I mean misconfiguration of, for example, the kernel/firmware that is fixed at build time and can't be changed later. Not a misconfiguration on your part. Just to clarify, when I said I configured the firmware, I was referring to that, because I build the firmware.

I can't get into my users' heads to know how they configure things once the firmware is deployed and installed; I can only configure my own device, so it might be a misconfiguration in the build process.

Anyway, it looks like a bug and we want to know what's wrong. That's why I'm suggesting to try master snapshots. If it doesn't work, you can try my build that I build with a different configuration (i.e. different packages, different layout, different kernel version/patches and configuration, etc) so we can track down the bug to OpenWrt's stock firmware.

To clarify again: I'm not blaming your configuration, I'm comparing OpenWrt's build configuration against my build configuration.

@NoTengoBattery - I understand what you mean now. Thank you for the clarification. I'm linking to my config backups below anyway - not because I think you suggested a problem, but because, who knows, maybe there is one?

In other news, I tried dropping the ath10K-CT for non-CT with nothing good to report. Rather than repeat myself, see this post: https://forum.openwrt.org/t/openwrt-19-07-0-first-release-candidate/48040/165?u=eginnc

I'll give NoTengoBattery's build a try this weekend and see how that works next. I also cleaned up the backup archives of my gateway Edgerouter X and the EA6350v3 problem child plugged into it as an AP to remove passwords, etc. and posted them here if anyone cares to peruse. Never know. Maybe there is a clue in there someone else can see.

EA6350v3 Config
Edge Router X Gateway Config

I don't see anything weird in your config, so I keep thinking it's a bug, because there is no way (that's not always true, but there should be no way) to break a kernel driver through a misconfiguration.

So if you somehow managed to do just that (to be clear, I'm speaking rhetorically) then it's the driver's fault!

Torvalds (the creator of Linux) has always said just that: the kernel should not break from userspace actions.

Eureka! I swapped the EA6350v3 for the extra EA8500 this weekend - configured the same - same problem. So far, so same. But there is one difference from my other EA8500 AP - the problem AP uses an IOT VLAN. I went into the interface's physical settings menu for this interface and checked "bridge interfaces" (even though this VLAN is only bridged with itself - so I'm thinking there should be no need for this). Memory use has stayed low and stable for ~18 hours (<60 MB) now. Don't know why it works, but it seems to be working so far. This last debug is all with snapshot r11626 on the EA8500 and r11625 on the ER-X.
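
For reference, here is a rough command-line equivalent of ticking "bridge interfaces" in LuCI. It is only a sketch, and it assumes the 19.07-era network config where that checkbox corresponds to `option type 'bridge'` on the interface section (newer releases configure bridges differently). The interface name `iot` and the function name `enable_bridge` are placeholders, not taken from the posted config.

```shell
#!/bin/sh
# Mark a logical interface as a bridge in /etc/config/network.
# Assumes the older UCI syntax (option type 'bridge' on the interface).
# $1: UCI interface section name, e.g. "iot" (placeholder).
enable_bridge() {
    uci set "network.$1.type=bridge" &&
    uci commit network
}

# Usage on the router, followed by a network reload:
#   enable_bridge iot && /etc/init.d/network reload
```

Checking `uci show network` before and after is a cheap way to confirm what actually changed.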

I think the ath10k-ct driver has a memory leak bug. My ASUS RT-AC58U only has 128 MB of memory, and I know ath10k-ct needs much more memory, but when I use the ath10k driver it's stable. If I switch the driver to ath10k-ct and test it with iperf3, the memory is exhausted within several minutes and the router reboots immediately. Hi @eginnc, can you test the ath10k-ct driver with iperf3 and watch the memory usage?
From what I've read on the forum about the ipq40xx and ipq80xx SoCs, it seems that ath10k-ct is much more suitable for ipq80xx but not stable on ipq40xx.
Any ideas?
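
One way to watch memory during such a test is to sample `MemAvailable` at a fixed interval while iperf3 runs from a client. This is a minimal sketch, assuming a POSIX shell with awk on the router; the function name `log_mem_available` and its arguments are invented for this example.

```shell
#!/bin/sh
# Print "<epoch seconds> <MemAvailable in kB>" COUNT times, INTERVAL seconds apart.
# $1: number of samples, $2: seconds between samples,
# $3 (optional): meminfo-style file to read; defaults to /proc/meminfo.
log_mem_available() {
    count=$1
    interval=$2
    src=${3:-/proc/meminfo}
    i=0
    while [ "$i" -lt "$count" ]; do
        kb=$(awk '$1 == "MemAvailable:" { print $2 }' "$src")
        echo "$(date +%s) $kb"
        i=$((i + 1))
        # Skip the sleep after the final sample.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `log_mem_available 60 5 > /tmp/mem.log` in one SSH session records five minutes of samples while the iperf3 run happens elsewhere. A steadily falling second column would point at a leak; a dip that recovers would not.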

ath10k and ath10k-CT are very stable with three radios running on the IPQ4019-based Linksys EA8300.

The Asus units with only 128 MB of memory are reported to be unstable, in general.

There is some talk on the mailing list to patch the ath10k-CT drivers for reduced RAM, but I don’t think that will fix the 128 MB issues. I recall one recent post indicating instability without wireless enabled.

One of my many past debug attempts on the ipq4018 was to remove ath10k-ct and replace with ath10k. It made no difference - memory use climbed unto death with either driver and ath10k was much worse than ath10k-ct with respect to connection speed, link speed and signal to noise. I don't think the problem in my case is unique to ath10k-ct. EA8500 memory use has been steady around ~50-60MB ever since enabling bridge interfaces on the IOT network interface with ath10k-ct and all has been well since.

For completeness, I'll risk the wrath of my family (mostly my insolent children using "air quotes" on my "fixing" the wifi again) and swap out the EA8500 for the EA6350 again, with bridged interfaces set for the IOT network interface. We'll see what happens and report back. Probably be the weekend before I get to that though.

Unless interfaces are supposed to be bridged to themselves to work properly, I do think there remains a memory killing ghost in the machine tied to not using this option, but I seem to have confirmed the problem is not unique to ath10k-ct or ipq4018, and I've confirmed it is present on both ipq8064 and ipq4018.

Sorry for the inconvenience.
I have an ASUS RT-AC58U that I use as an AP. The ath10k driver can provide about 40MB/s (5G wifi) download speed in an iperf3 test, but ath10k-ct can't finish the test because the memory is exhausted. So I want to know how it performs on a device with 256 MB of memory.
Thank you.

When I was testing the IPQ4019-based Linksys EA8300 I recall seeing throughput over 500 Mbps over 802.11ac with a nearby MacBook Pro client. I didn't quantify it further as, for me, performance under "real" conditions (moderate and weak signal) is more important. I swapped out my trusty Archer C7v2 units for the EA8300s as the 802.11ac performance was noticeably better, even with 2x2 vs. 3x3 on the older chip set in the C7v2 units.

Right now, on a relatively low SNR link (-83 dBm), I'm getting 50-60 Mbps over a batman-adv link.

@sotux Sorry it took me longer than expected to swap out routers again. I did swap the EA6350v3 back into service as our first floor access point. It's running fine on snapshot r11735 (with ath10k-ct) now that I have bridge interfaces checked. I ran repeated iperf tests on the 5G wifi, achieving ~350 Mbit/s between the AP and my Acer Swift 1 laptop across ~30 feet of air, line of sight. I could induce a blip in memory use, but nothing broke.

Thanks, @eginnc. I did make a mistake. The ath10k-ct driver does work well and has no memory leaks.
I've tried @chunkeey's patch to reduce the memory usage on my ASUS RT-AC58U and tested it with iperf3. No reboot occurred. Sorry for the noise. @jeff, you are right.