Debugging memory use / OOM crashes

I am running OpenWRT 24.10 on an Archer MR600 v2 with 128 MB of RAM. Every now and then the device runs out of memory and either reboots or slowly recovers. I am trying to understand what (the kernel? drivers? conntrack?) is using the memory.

In normal idle operation the device uses 60-70 MB of the 118 MB listed as available by free. During periods of heavy network traffic, I see used memory shooting up to 100 MB or more. If memory use goes above that, I first see sys CPU usage spiking (I suspect the kernel is decompressing the same squashfs pages again and again), then the OOM killer kicking in, or the device outright rebooting.

I looked at the process list, /proc/meminfo, conntrack -L (or rather the count of connections it prints) and the /tmp file sizes, both while idle and under high memory load, but I cannot find out what allocates memory during the spikes. The sum of memory listed by ps is almost the same - in fact it is 100 kB higher in the idle case. /proc/meminfo lists less memory as free and available in the busy case, but doesn't show an equivalent increase in any of the other fields. If I sum up all rows of /proc/meminfo in the busy case and the idle case, about 30 MB of memory seem to be missing in the busy case.

Conntrack shows approximately 200 connections in either case. It can fall lower, but I haven’t seen it go much above 200.

The memory is not leaked, though. If the router doesn't crash, it eventually frees whatever it allocated. Something uses it, but it doesn't show up in /proc/meminfo, ps, or df (for tmpfs mounts).

What am I missing? What else can use considerable amounts of memory (on the order of 30-40 MB) without showing up in the numbers I looked at?
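For anyone who wants to reproduce the comparison: a small helper along these lines diffs two saved /proc/meminfo snapshots field by field. The function name and the 1 MB reporting threshold are my own choices; the example values are two fields from the dumps below.

```shell
# Diff two /proc/meminfo snapshots, printing fields that moved by >= 1 MB.
# meminfo_diff and the 1024 kB threshold are my own choices, not from any package.
meminfo_diff() {
    awk 'NR == FNR { before[$1] = $2; next }
         { d = $2 - before[$1]
           if (d >= 1024 || d <= -1024) printf "%-16s %+d kB\n", $1, d
         }' "$1" "$2"
}

# Example with two fields taken from the idle and busy dumps in this thread:
cat > /tmp/idle.txt <<'EOF'
MemFree: 41564 kB
Cached: 6504 kB
EOF
cat > /tmp/busy.txt <<'EOF'
MemFree: 23692 kB
Cached: 8120 kB
EOF
meminfo_diff /tmp/idle.txt /tmp/busy.txt
```

Run against full snapshots taken with `cat /proc/meminfo > /tmp/idle.txt` (and again during a spike), this shows at a glance which fields moved and which didn't.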

Here is /proc/meminfo in the busy case:

MemTotal: 118820 kB
MemFree: 23692 kB
MemAvailable: 4416 kB
Buffers: 0 kB
Cached: 8120 kB
SwapCached: 20 kB
Active: 4256 kB
Inactive: 6616 kB
Active(anon): 1624 kB
Inactive(anon): 1240 kB
Active(file): 2632 kB
Inactive(file): 5376 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 59388 kB
SwapFree: 51384 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 2744 kB
Mapped: 5168 kB
Shmem: 112 kB
KReclaimable: 2592 kB
Slab: 22844 kB
SReclaimable: 2592 kB
SUnreclaim: 20252 kB
KernelStack: 920 kB
PageTables: 500 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 118796 kB
Committed_AS: 20240 kB
VmallocTotal: 1048372 kB
VmallocUsed: 3344 kB
VmallocChunk: 0 kB
Percpu: 288 kB

And this is the idle case:

MemTotal: 118820 kB
MemFree: 41564 kB
MemAvailable: 21388 kB
Buffers: 0 kB
Cached: 6504 kB
SwapCached: 124 kB
Active: 3488 kB
Inactive: 5988 kB
Active(anon): 1716 kB
Inactive(anon): 1548 kB
Active(file): 1772 kB
Inactive(file): 4440 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 59388 kB
SwapFree: 51836 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 2956 kB
Mapped: 3588 kB
Shmem: 284 kB
KReclaimable: 2584 kB
Slab: 22832 kB
SReclaimable: 2584 kB
SUnreclaim: 20248 kB
KernelStack: 920 kB
PageTables: 500 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 118796 kB
Committed_AS: 20132 kB
VmallocTotal: 1048372 kB
VmallocUsed: 3344 kB
VmallocChunk: 0 kB
Percpu: 288 kB

MemFree and MemAvailable show a clear difference, but where did it go?
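One way to put a number on it: sum the fields that itemize pages and subtract from MemTotal. The field selection below is my own rough choice (it ignores SwapCached and statically reserved memory), so treat the result as an estimate; here it is fed with the busy-case numbers from above rather than a live /proc/meminfo.

```shell
# Rough "unaccounted" memory: MemTotal minus the fields meminfo itemizes.
# The field selection is approximate and my own choice.
unaccounted() {
    awk '{ v[$1] = $2 }
         END {
           acct = v["MemFree:"] + v["Buffers:"] + v["Cached:"] + v["Slab:"] \
                + v["AnonPages:"] + v["PageTables:"] + v["KernelStack:"] \
                + v["Percpu:"]
           printf "unaccounted: %d kB\n", v["MemTotal:"] - acct
         }' "$1"
}

cat > /tmp/busy_meminfo.txt <<'EOF'
MemTotal: 118820 kB
MemFree: 23692 kB
Buffers: 0 kB
Cached: 8120 kB
Slab: 22844 kB
AnonPages: 2744 kB
PageTables: 500 kB
KernelStack: 920 kB
Percpu: 288 kB
EOF
unaccounted /tmp/busy_meminfo.txt
```

Pages taken directly from the page allocator - driver DMA buffers, network packet data in flight - are not itemized in any /proc/meminfo field, which is one place memory can "vanish" to.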

Does this system actually have swap space? Otherwise both SwapTotal and SwapFree would normally be 0.

zram-swap it is.


What are you running on the system? A normal fresh start would be in the 20-30 MB range, then add roughly 1 MB per client.

Take conntrack out of the picture: it pre-allocates memory for net.netfilter.nf_conntrack_max entries, 230 or so bytes per entry, 2 entries per connection.
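For scale, taking the ~230 bytes/entry figure above at face value (the exact struct size varies by kernel version and config, and entries are in practice allocated from a slab on demand, so this is an upper bound):

```shell
# Upper-bound estimate of the conntrack table footprint.
# 16384 is a common default for nf_conntrack_max on small devices; check the
# real value with: sysctl net.netfilter.nf_conntrack_max
max_entries=16384
bytes_per_entry=230   # figure quoted above; actual size varies by kernel
echo "$((max_entries * bytes_per_entry / 1024)) kB"
```

Even fully populated, that is a few MB; at the ~200 tracked connections reported here, conntrack is nowhere near explaining a 30 MB swing.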


Doesn't this sound like a zram-swap malfunction?

Nope, something huge crashed before the "loaded" snapshot was taken. All eyes on the OOM report in the kernel log.

Yes, the “swap” is zram-swap. I tried disabling it and it (subjectively) made the problem worse. So I think zram is working as intended.

Here is what’s running on the device:

wpad-full (for WPA3), dnsmasq-full, udhcpc, miniupnpd
LuCI (uhttpd), DropBear
https-dns-proxy
nlbwmon, nft-qos, collectd

One thing that drives up the static RAM use is that I am using OpenSSL rather than mbedtls as the TLS library. Using mbedtls frees up about 10 MB in total, but especially with LuCI over https the speed difference is night and day. And it doesn't solve the problem, it just mitigates it somewhat: I still get memory use spikes of dozens of MB with no clue where they are going.

I grabbed the /proc/meminfo (and ps, free, df -h) output I posted in the second post while (high memory use) and after (low memory use) one of my computers was downloading in Battle.net. The low use state came after the high use one, which to me shows there is no classic leak where something loses track of allocated memory. The OOM killer was - barely - not triggered, but the kernel did re-read data from squashfs over and over for 10 or so seconds, suggesting it had to throw away page cache data that was actually needed. If the OOM killer does get triggered, it grabs a more or less random victim - most of the time uhttpd, dnsmasq or https-dns-proxy - and kills it; procd then tries to restart it. I don't think any of those processes is actually at fault for the high memory consumption, though.
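To catch what changes during a spike rather than relying on two hand-taken snapshots, it may help to sample a few meminfo fields continuously. The function name and field choice below are mine; redirect the output to persistent storage if reboots are the failure mode, since /tmp is a tmpfs and is lost on reboot.

```shell
# Sample selected /proc/meminfo fields <count> times, <interval> seconds apart,
# one timestamped line per sample.
mem_sample() {
    count=$1; interval=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s ' "$(date +%s)"
        awk '/^(MemFree|MemAvailable|Slab|SUnreclaim):/ { printf "%s %s ", $1, $2 }
             END { print "" }' /proc/meminfo
        sleep "$interval"
        i=$((i + 1))
    done
}

# e.g. 3 samples, 1 second apart:
mem_sample 3 1 > /tmp/memlog.txt
```

Comparing the last few lines before a crash against idle lines narrows down whether the missing memory tracks SUnreclaim (kernel slab) or disappears entirely from the itemized fields.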

My Internet connection is an LTE connection with a bandwidth of only 30 Mbit/s, so I am not moving massive amounts of data during those downloads. The OpenWRT device is managing this connection.

I put the process list, /proc/meminfo and free output into a LibreOffice Calc sheet for easy comparison. Sadly the forum doesn't allow me to attach it.

My OpenWRT image is self-compiled with everything in squashfs. I reduced the squashfs block size from the default 256 kB to 128 kB, in the hope that it makes discarding data from the page cache and re-reading it a bit more efficient, and because I still have a few MB of flash space to spare.

I have uploaded my memory comparison sheet to google docs, https://drive.google.com/drive/folders/1eQDGzEu7jdpxbInuk0V4QPFRa-fYEkrq should allow you to view/download it. This folder also contains my .config and built image that I have uploaded to help a user in TP-Link Archer MR600 exploration - #213 by Iggy87100 test a wifi patch.

I do not recall any similar issue being described on this forum. Obviously I'm not reading every thread, so I might be missing something, but to me this sounds like a unique issue. Considering that (based on the information you provided so far) nothing in your usage pattern or config is unique, but you are running a self-built image with some pretty uncommon changes (particularly the block size), my best guess is that something in those modifications is the source of the problem.


Thanks for looking at my problem description anyhow :slight_smile: .

The block size change was something I tried in response to the memory issues - I saw the same with the default squashfs settings.

A while ago I came across a claim that the MediaTek Ethernet driver is pretty greedy and allocates a few MB for DMA or so. I don't remember where I read that, though, and at the moment I can't find it again. I think it was a pull request that suggested dropping support for 64 MB RAM ramips devices. But I doubt that it is causing my issues. While I do have two wired devices attached, one of them (an Android TV) is usually off, and the other (a Raspberry Pi) is idle.

What is the purpose/application of dnsmasq-full?

How big are the datasets from the two competing monitoring tools?

dnsmasq-full (as opposed to the default one) is for classless-static-route, i.e.

dhcp-option=tag:sometag,option:classless-static-route,0.0.0.0/0,192.168.0.1,10.x.y.z/24,192.168.0.254

To send some traffic through a VPN running on my Raspberry Pi. I couldn't get this to work with the stripped-down dnsmasq, even when I replaced the string "classless-static-route" with a numerical value.

The (compressed) nlbwmon data is a few kB on the file system, and it is written to disk:

root@tplink:~# ls -lh /nlbwmon/
-rw-r----- 1 root root 2.4K Jun 1 00:00 20250501.db.gz
-rw-r----- 1 root root 1.1K Jul 1 00:00 20250601.db.gz
-rw-r----- 1 root root 1006 Aug 2 21:04 20250701.db.gz
-rw-r----- 1 root root 2.1K Aug 10 11:17 20250801.db.gz

collectd (which keeps track of CPU and RAM use, which nlbwmon doesn't) stores 560 kB in /tmp/rrd. Total tmpfs use is between 928 kB right after boot and 1100 kB after a day of uptime. I haven't seen it grow beyond 1100 kB. I'm not persisting collectd data on disk.

I found the old "64 MB ramips devices run out of RAM" PR:

Comment https://github.com/openwrt/openwrt/pull/16692#issuecomment-2409004924 points towards similar issues with memory use on MT7621, like my device, and links it to the number of CPU cores.

I also found pull request 17628, which wasn’t merged so far - I’ll give it a try and see what happens.

I read the comments in that PR and I failed to find similarities to your case. There is one user saying that MT7621 build consumes significantly more memory compared to MT7620, however: 1. this is only one report which may or may not be accurate and representative, 2. the report (to my understanding) is about static increase in memory use, rather than dynamic spikes as in your case.


Actually most of pull request 17628 was merged in 15887235 and cherry-picked into 24.10 in 642b5b61, but the part that adjusted the DMA size in mtk_eth_soc.c seems to be missing. I have manually added it to my build, and it reduced the memory used immediately after boot from 60 MB to 54 MB. I'll see if it has an influence on spikes during load.
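A 6 MB drop is roughly the right magnitude for smaller DMA rings. Purely illustrative arithmetic follows; the ~2 kB per-buffer size and the ring lengths compared are my assumptions, not values read from mtk_eth_soc.c:

```shell
# Illustrative only: each rx ring descriptor pins a receive buffer.
# Assuming ~2 kB (2048 bytes) per buffer, enough for a 1500-byte frame
# plus overhead, compare a large ring against a 512-entry one:
for ring in 2048 512; do
    echo "ring=$ring -> $((ring * 2048 / 1024)) kB per rx queue"
done
```

Multiply by the number of rx/tx queues and ring sizing alone can plausibly account for several MB of memory that never shows up itemized in /proc/meminfo.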

Yeah the issues described in that PR aren’t the same, but it is the best breadcrumb I have for now. I’ll see what that dma_size change brings. The two answers I am hoping for:

  1. Does it magically reduce dynamic memory use too? (Probably not.)
  2. Are the 6 MB of static use freed enough to make the router survive the dynamic spikes without crashing? (More likely, but not a given.)

At least wifi<->ethernet speed seems unaffected by the change. As for the actual impact, I'll have to gather data for a few days.

Since that one comment talks about the number of CPU cores: I have packet steering set to “Enabled (all CPUs)”. It did improve speed between wifi and ethernet a little bit. Disabling it is something I’ll test eventually.

Flow offloading is disabled as it breaks QoS (tested) and probably nlbwmon too (assumed, not tested). Software flow offloading works in the sense that it reduces the CPU load, but it doesn’t have a noticeable effect on memory use (I tested that a while ago). There’s no HW flow offloading support on my hardware.
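For reference, software flow offloading is toggled via the defaults section of /etc/config/firewall (flow_offloading_hw being the hardware variant):

```
config defaults
	option flow_offloading '1'
```

Worth noting for anyone finding this thread later, since it interacts with QoS and nlbwmon as described above.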

If neither the DMA change nor flow offloading bring an answer I’ll flash the prebuilt image and see if I can reproduce the problem with it.


So here are a few hours of data. The mt2701_data.{rx,tx}.dma_size = MTK_DMA_SIZE(512) change seems to have made a surprisingly big difference. This is from free:

               total        used        free      shared  buff/cache   available
Mem:          118820       59780       30404         732       28636       19780
Swap:          59388         512       58876

Only 512 kB of swap space used, and still 28 MB used for the disk cache.

Here’s the statistics graph:

I sadly don’t have one without the change, but the “used” there used to hover around 60-70 on idle, with only about 10mb used for the page cache and 5-8mb in swap. The highest spike I saw in the graph earlier was 85mb; If it went higher the system rebooted and the collectd data was lost. (semi off topic rant: Every tool seems to have a different idea what used, cached, and free memory means)

Yes, I did test downloading things. I updated some games with the Wargaming.net Game Center, which uses libtorrent under the hood. It has not caused any spikes yet; I'll have to keep it running for a while to be sure. Also, my family is out of the house, so there are fewer WiFi clients connected than usual.

I can’t entirely explain this yet. Static use reduced ok. But why would changing the ethernet driver eliminate the spikes?