Potential Memory Leak Introduced in Snapshot (August - Sept 2023?)

Hi,

Before creating a GitHub issue, I wanted to check in here first.
I'm running snapshot builds on qualcommax/ipq807x (Xiaomi AX3600), built using ASU.

Up to and including build a33f1d3515, the router is very stable. I have this flash file saved locally.

I just did an ASU to ce7209b and ran a large transfer which saturated my bandwidth. After about 10-20 minutes of continuous transfer, I noticed that the router was giving me an incorrect-password error when I tried to connect to LuCI, and at times a bad gateway error, etc.

The router didn't crash, but I had to reboot it.

I resumed my transfer while watching LuCI, and noticed that the free memory was dropping FAST from ~120MB free. I killed my transfer with ~20MB free memory left.

Flashing back to a33f1d3515 and resuming the download showed a minimal drop in memory, and stability was back.

The transfers were all via wired Ethernet. As I have other APs, they took over the wifi devices after the reboot, and the memory drop happened before most wifi devices came back to this router.

Anyone else seeing this / able to reproduce?

I've rolled back to a33f1d for the time being. As I'm not sure what's causing it, I'll leave this as is in case someone else is noticing the same.

You should edit the discussion thread title to be clearer if you suspect a problem with pbr or other additional packages, instead of a bug in the plain core system. (Currently the title just blames the core system.)

Regarding the memory leak, you might try using top, htop, etc. to identify which process consumes memory in growing amounts. Also test whether the memory leak happens without pbr.
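
A minimal sketch of that idea, assuming Python is available somewhere to run it (it is not part of a default OpenWrt image): take two snapshots of each process's VmRSS from /proc/<pid>/status and sort by growth. The function names and the sample numbers in the usage note are made up for illustration.

```python
import os
import re


def parse_status(status_text):
    """Return (name, VmRSS in KiB) from /proc/<pid>/status text.

    Kernel threads have no VmRSS line, so the RSS is None for them.
    """
    name = re.search(r"^Name:\s+(.+)$", status_text, re.MULTILINE)
    rss = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return (name.group(1) if name else "?", int(rss.group(1)) if rss else None)


def snapshot(proc_root="/proc"):
    """Map 'pid/name' to resident memory in KiB for every userspace process."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                name, rss = parse_status(handle.read())
        except OSError:
            continue  # the process exited while we were reading
        if rss is not None:
            result[f"{entry}/{name}"] = rss
    return result


def rss_growth(before, after):
    """Return (process, KiB delta) pairs, largest growth first."""
    deltas = {key: after[key] - before.get(key, 0) for key in after}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)
```

Usage would be something like calling snapshot() before the transfer, again a few minutes into it, and printing the first few entries of rss_growth(before, after). If nothing there grows while free memory still falls, the usage is probably not in a userspace process.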

If it's PBR related, perhaps it should be posted in the Policy-Based-Routing (pbr) package discussion?

Thanks both. (Again makes me smile seeing your name frollic as "we" go back to Twonky days).

Anyhow, it appears that it is not pbr related. I tried it last night without pbr (i.e. a direct ASU from a33f1d3515), and the system doesn't exhibit the behavior as far as I can tell.

What's interesting is that I did try htop, and sorting by m_share doesn't show any specific process' memory ballooning. All under 1%... But clearly memory was dropping like a rock.
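
When no single process grows, one possible next step (my assumption, not something confirmed in this thread) is to compare two /proc/meminfo snapshots and see whether kernel-side counters such as Slab or SUnreclaim are moving. A rough Python sketch; the function names and the default key list are just illustrative choices:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> integer value.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def meminfo_delta(before, after,
                  keys=("MemFree", "MemAvailable", "Cached", "Slab", "SUnreclaim")):
    """Return after - before for each key present in both snapshots."""
    return {key: after[key] - before[key]
            for key in keys if key in before and key in after}
```

Feeding it the contents of /proc/meminfo saved before and during a transfer gives a small table of deltas. A large negative MemFree together with a growing SUnreclaim would point at kernel allocations rather than at a daemon.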

EDIT: It would be good to have other ipq807 folks check, I think. This target is running 6.1, and between these two versions there have been kernel backports (e.g. 0b6d62c) as well.

At least on my own ipq807x DL-WRX36 the memory usage has been rather stable since yesterday evening when I flashed up-to-date master r23995-ce7209bd21 (https://git.openwrt.org/?p=openwrt/openwrt.git;a=shortlog;h=ce7209bd21661e3daa4a7f2f58dafdff990da19f)

Thank you for sharing that. If you don't mind keeping an eye on it, that would be great (especially with larger downloads).
I'll update as new snapshots become available and keep testing and keeping an eye on it as well.

As there has been no widespread flood of bug reports about that, it may be something related to your specific package/traffic combination.

You should try to identify the minimal config that produces the error (shut down unnecessary services, etc.).

Is it about certain downloads, torrents, etc.? Something running on the router, or just traffic flowing through?

I am going to need somebody to track this down to a specific OpenWrt commit at least, as otherwise it's way too broad.
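
For anyone wondering how many flashes that implies, here is a minimal bisection sketch in Python. It assumes the builds between a known-good and a known-bad one can be ordered and tested individually; treating every revision number between r23939 (a33f1d3515) and r23995 (ce7209bd21) as a testable build is my assumption, and the is_bad check below is a stand-in for "flash it, run the transfer, watch the memory".

```python
def first_bad(builds, is_bad):
    """Binary search for the first bad build.

    builds is ordered oldest to newest, builds[0] is known good and
    builds[-1] is known bad. Returns (first bad build, number of tests run).
    """
    lo, hi = 0, len(builds) - 1
    tested = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested += 1
        if is_bad(builds[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return builds[hi], tested


# Hypothetical example: pretend the regression landed in r23960.
revisions = list(range(23939, 23996))
culprit, flashes = first_bad(revisions, lambda rev: rev >= 23960)
print(f"first bad revision: r{culprit}, found after {flashes} test flashes")
```

The point is only that the number of test flashes grows with the logarithm of the range, so even a range of 50-odd revisions needs around half a dozen tests rather than one per commit.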

It is going to be very difficult for a true bisect I'm afraid as it's my production router.
Would it be acceptable / helpful at all if I start at what appears to be a working build (a33f1d35), and install in groups the packages for which there are updates available?

Any thoughts on this approach, and if acceptable, on the order and grouping of items (groups separated by newline)?

ipq-wifi-xiaomi_ax3600 - 2023-06-03-cd9c30ca-2 - 2023-09-16-57aa1b15-1
curl - 8.2.1-1 - 8.3.0-1
libcurl4 - 8.2.1-1 - 8.3.0-1
luci-mod-status - git-23.248.23193-2ca117c - git-23.261.42034-318ef4c

procd - 2023-06-25-2db83655-2 - 2023-06-25-2db83655-3
procd-seccomp - 2023-06-25-2db83655-2 - 2023-06-25-2db83655-3
procd-ujail - 2023-06-25-2db83655-2 - 2023-06-25-2db83655-3

netifd - 2023-08-31-1a07f1df-3 - 2023-09-19-7a58b995-1

wpad-openssl - 2023-09-08-e5ccbfc6-2 - 2023-09-08-e5ccbfc6-3
hostapd-common - 2023-09-08-e5ccbfc6-2 - 2023-09-08-e5ccbfc6-3

base-files - 1535-r23939-a33f1d3515 - 1535-r23995-ce7209bd21

Well, I just OOM'd on a33f1d35.

I'll try to dig, but for now I'll update the title of this thread, but at this point I realize this isn't going to be helpful in tracking anything down given that I can't offer any information (e.g. even when did it start given I just haven't noticed).

Well, the thing is that opkg updates of individual components may in turn break stuff. In particular, the recent wifi configuration rework spanned multiple packages, which will break if they are not in sync. Individual opkg package updates are meant for optional packages only, not core packages.

Yup agree.

I looked through my files and found that I had 7f54d9ba1a build on disk as well. So flashed that, and putting load on it seemed fine. I couldn't get the memory to fall much. I pulled about 10Gigs down. I'm going to stick with this build for a while to see if the memory stays fairly consistent.

The changes I see between 7f54d9ba1a and a33f1d3515 pertain pretty much only to hostapd and netifd.

Of the updated list based on this build (below), as a sanity check I updated the packages for curl, libcurl4, dawn, procd, procd-seccomp, procd-ujail, openvpn-openssl. With these updated, putting load on did not cause memory to drop.

I believe the base-files package is updated due to the 2 second wait for sysupgrade (to fix the ipq807 upgrade issue). So I didn't upgrade that.

wifi items I didn't touch.

To your point, I couldn't update netifd without breaking things. To me, though, this seems like the most logical place for the issue to be, given the transfers are all on the wired side.

********  NO LEAK AFTER UPGRADING THESE PACKAGES TO LATEST WITH 7f54d9ba1a AS BASE ********  
openvpn-openssl - 2.6.5-1 - 2.6.6-1
dawn - 2022-07-24-9e8060ea-3 - 2023-05-14-e036905a-1
curl - 8.2.1-1 - 8.3.0-1
libcurl4 - 8.2.1-1 - 8.3.0-1
procd - 2023-06-25-2db83655-2 - 2023-06-25-2db83655-3
procd-seccomp - 2023-06-25-2db83655-2 - 2023-06-25-2db83655-3
procd-ujail - 2023-06-25-2db83655-2 - 2023-06-25-2db83655-3
ipq-wifi-xiaomi_ax3600 - 2023-06-03-cd9c30ca-2 - 2023-09-16-57aa1b15-1
ath11k-firmware-ipq8074 - 2023-07-28-006a4e2a-1 - 2023-08-22-d8f82a98-1


********  DID NOT UPGRADE PACKAGES *****
base-files - 1534-r23900-7f54d9ba1a - 1535-r23995-ce7209bd21
netifd - 2023-08-31-1a07f1df-3 - 2023-09-19-7a58b995-1
kmod-ath11k - 6.1.52+6.1.24-4 - 6.1.52+6.5-1
luci-mod-status - git-23.248.23193-2ca117c - git-23.261.42034-318ef4c
wpad-openssl - 2023-06-22-599d00be-2 - 2023-09-08-e5ccbfc6-3
kmod-mac80211 - 6.1.52+6.1.24-4 - 6.1.52+6.5-1
kmod-ath11k-ahb - 6.1.52+6.1.24-4 - 6.1.52+6.5-1
kmod-ath - 6.1.52+6.1.24-4 - 6.1.52+6.5-1
hostapd-common - 2023-06-22-599d00be-2 - 2023-09-08-e5ccbfc6-3
kmod-cfg80211 - 6.1.52+6.1.24-4 - 6.1.52+6.5-1
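
As a small aid for comparing such lists, here is a Python sketch that parses the "name - old version - new version" lines used above and reports which packages are still untested. The function names are my own, and the sample in the usage note is only a subset of the lists in this post.

```python
def parse_upgrades(text):
    """Parse 'name - old - new' lines into {name: (old, new)}.

    Lines that do not have exactly three ' - ' separated fields are skipped.
    """
    upgrades = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(" - ")]
        if len(parts) == 3 and all(parts):
            name, old, new = parts
            upgrades[name] = (old, new)
    return upgrades


def remaining_suspects(all_upgradable, already_upgraded):
    """Packages with a pending update that have not been cleared by a test."""
    return sorted(set(all_upgradable) - set(already_upgraded))
```

For example, parsing the first list into one dict and the second into another, then calling remaining_suspects on their union and the first dict, leaves netifd, hostapd-common, wpad-openssl, the kmod packages, base-files and luci-mod-status as the untested set.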

You also seem to have several other packages installed. They might react badly to netifd, mac80211 or hostapd changes with certain configs.

Based on the thread, you have installed (and running?) at least PBR, dawn, openvpn, ...

Did you try disabling services with the crashing firmware? (to get into a minimal config that still causes OOMs)

(You might try disabling at least all wifi stuff including dawn, and then test if the router still OOMs.)

I thought I tried that, but I don't think all at once.

Ok, will wait until 3a5ad is built for ASU, install that version (as it speaks to a patch from 2 days ago being broken), and try.

I have performed the following additional tests:
(Note: I don't use the IoT Antenna. It is always off, and I kept it off for all tests.)

I read these results to mean that although my downloads are end-to-end on wired side, the hostapd and/or netifd commits are the suspects.

  • Upgrade to e3559f
    ** SQM, OpenVPN, PBR, DAWN off; Wifi on --> Large transfer from Usenet WAN to LAN on wired --> Yes memory leak
    ** SQM, OpenVPN, PBR, DAWN on; Wifi off --> Large transfer from Usenet WAN to LAN on wired --> No memory leak

From this point SQM, OpenVPN, PBR, DAWN on:

  • Downgrade to 7f54d
    ** Wifi on --> Large transfer from Usenet WAN to LAN on wired --> No memory leak
    ** Same as above, but update packages from commits 033069 (ath11k-firmware: update to stable WLAN.HK.2.9.0.1-01890) and eb8ddf (ipq-wifi-xiaomi_ax3600 ) --> Large transfer from Usenet WAN to LAN on wired --> No memory leak

  • Upgrade to a33f1d
    ** SQM, OpenVPN, PBR, DAWN on; Wifi on --> Large transfer from Usenet WAN to LAN on wired --> Yes memory leak. Slow leak; definitely not as fast as e3559fb445
    ** SQM, OpenVPN, PBR, DAWN on; Wifi off --> Large transfer from Usenet WAN to LAN on wired --> No memory leak

  • Upgrade to e3559f
    ** Disable 2.4GHz antenna, leave 5.4GHz antenna enabled --> No memory leak as far as I can tell
    ** Leave 2.4GHz antenna enabled, disable 5.4GHz antenna --> No memory leak as far as I can tell
    ** Both 2.4GHz and 5.4GHz antenna enabled --> Yes memory leak
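
To make the pattern in these results easier to see, here is the same test matrix written out as data in Python, with two small checks. Reading "Wifi on" as "both main radios enabled" and "Wifi off" as "both disabled" is my assumption, as are the names used in the code.

```python
from collections import namedtuple

Obs = namedtuple("Obs", "build services_on radio_2g radio_5g leak")

# One row per test above, in the order listed.
OBSERVATIONS = [
    Obs("e3559f", False, True, True, True),
    Obs("e3559f", True, False, False, False),
    Obs("7f54d", True, True, True, False),
    Obs("a33f1d", True, True, True, True),   # slow leak
    Obs("a33f1d", True, False, False, False),
    Obs("e3559f", True, False, True, False),
    Obs("e3559f", True, True, False, False),
    Obs("e3559f", True, True, True, True),
]


def leaking_builds(observations):
    """Builds with at least one test that showed a leak."""
    return sorted({obs.build for obs in observations if obs.leak})


def leak_requires_both_radios(observations):
    """True if every leaking test had both the 2.4GHz and 5GHz radios on."""
    return all(obs.radio_2g and obs.radio_5g for obs in observations if obs.leak)


print(leaking_builds(OBSERVATIONS))
print(leak_requires_both_radios(OBSERVATIONS))
```

In this data the leak never shows up on 7f54d, never shows up with a radio disabled, and shows up regardless of whether SQM, OpenVPN, PBR and DAWN are running, which is why the wifi-related commits look like the suspects.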

I have reverted back to 7f54d which is stable.

I updated to build ff95f859eb. Same issue persists.

My hypothesis is that the updates to hostapd have resulted in either a leak or increased memory consumption in the process. Coupled with the high memory usage of the IPQ807 wifi, the 512MB profile (at least for me) buckles under high data transfer stress.

As I had mentioned, my IoT radio (Qualcomm Atheros QCA9887 802.11ac/b/g/n) is always disabled. For the last 2 days I have left the other 2.4GHz radio (Qualcomm Atheros IPQ8074 802.11ax/b/g/n) disabled; having just transferred roughly 15GB of data, the memory moved from ~160MB free down to ~110MB free, and back up when the transfer completed.

Compare that to the radio being enabled, where Free+Cached drops to under 10MB and the router OOMs if the large transfers, averaging 160Mbit/s, continue too long.

I'll open a ticket for this with this info.

I have created issue13562.

An update to the above: I have found that if a wireless stream (e.g. YouTube) is started while the large transfer is under way, the memory usage grows significantly more than it did prior to the recent hostapd updates.

Previous tests were undertaken when family was home, so there must have been wireless streaming going on. Today, I was able to test with nobody else at home.

I have put this relevant info in the ticket:

DOWNGRADE to build 7f54d9b
DISABLED: Qualcomm Atheros QCA9887 802.11ac/b/g/n
ENABLED: Qualcomm Atheros IPQ8074 802.11ac/ax/n
ENABLED: Qualcomm Atheros IPQ8074 802.11ax/b/g/n

Reboot Router. After ~5 minutes:

Total Available: ~174 MiB / 406.23 MiB
Used: ~260 MiB / 406.23 MiB
Cached: ~65 MiB / 406.23 MiB

Transfer ~5GB at roughly 22MiB/s WAN->Wired LAN download and simultaneously start a wifi data stream (e.g. YouTube); in this case the device was attached to the 5GHz radio:
Total Available: ~138 MiB / 406.23 MiB
Used: ~295 MiB / 406.23 MiB
Cached: ~65 MiB / 406.23 MiB

UPGRADE to build ff95f85

DISABLED: Qualcomm Atheros QCA9887 802.11ac/b/g/n
ENABLED: Qualcomm Atheros IPQ8074 802.11ac/ax/n
DISABLED: Qualcomm Atheros IPQ8074 802.11ax/b/g/n

Reboot Router. After ~5 minutes:

Total Available: ~182 MiB / 406.23 MiB
Used: ~245 MiB / 406.23 MiB
Cached: ~58 MiB / 406.23 MiB

Transfer ~5GB at roughly 22MiB/s WAN->Wired LAN download and simultaneously start a wifi data stream (e.g. YouTube); in this case the device was attached to the 5GHz radio:
Total Available: ~75 MiB / 406.23 MiB
Used: ~350 MiB / 406.23 MiB
Cached: ~58 MiB / 406.23 MiB

DISABLED: Qualcomm Atheros QCA9887 802.11ac/b/g/n
ENABLED: Qualcomm Atheros IPQ8074 802.11ac/ax/n
ENABLED: Qualcomm Atheros IPQ8074 802.11ax/b/g/n

Reboot Router. After ~5 minutes:

Total Available: ~143 MiB / 406.23 MiB
Used: ~284 MiB / 406.23 MiB
Cached: ~59 MiB / 406.23 MiB

Transfer ~5GB at roughly 22MiB/s WAN->Wired LAN download and simultaneously start a wifi data stream (e.g. YouTube); in this case the device was attached to the 5GHz radio:
Total Available: ~33 MiB / 406.23 MiB
Used: ~368 MiB / 406.23 MiB
Cached: ~32 MiB / 406.23 MiB
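
To put the three runs side by side, this short Python snippet computes how much "Total Available" fell during the transfer in each case. The numbers are the approximate values quoted above; the labels are my own shorthand.

```python
# (label, Total Available in MiB ~5 minutes after reboot, Total Available under load)
RUNS = [
    ("7f54d9b, both IPQ8074 radios enabled", 174, 138),
    ("ff95f85, 2.4GHz IPQ8074 radio disabled", 182, 75),
    ("ff95f85, both IPQ8074 radios enabled", 143, 33),
]


def drops(runs):
    """Return (label, MiB lost between idle and load) for each run."""
    return [(label, idle - loaded) for label, idle, loaded in runs]


for label, drop in drops(RUNS):
    print(f"{label}: {drop} MiB less available under load")
```

That works out to a drop of roughly 36 MiB on 7f54d9b versus 107 MiB and 110 MiB on ff95f85. With both radios enabled, ff95f85 also starts from a lower idle figure, which is why it ends up at only ~33 MiB available.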

I have updated to snapshot a181b9f0f9. The issue is still there. With no wireless activity, starting a fast.com speed test WAN -> Wired LAN consumes 34MB of RAM.

I disabled SQM and rebooted before this test.

Rolling back resolves the issue.

Hi,
I noticed this trouble too...
Since I moved from wpad-openssl to wpad-basic-mbedtls, it seems to be stable the way it was before.
Cannot be 100% sure, more tests needed...
