I've been fighting with a kernel memleak for a bit now. Originally, it was suspected to be dnsmasq or adblock, but I've since ruled those MOSTLY out (see "Oom-killer: dnsmasq when Physical Free RAM remains").
A few baseline facts:
1. This does not happen under 5.4.x
2. This DOES happen under 5.10 and 5.15
3. KMEMLEAK reports the ethernet driver - thousands of times
3a) However, according to the maintainer:
Those are not real memory leaks. If you unload the driver, and run the
kmemleak again you'll see they are gone.
The reason kmemleak thinks those are unreferenced is because those
memory buffers are given to FPA, and they are not visible to kmemleak
until FPA gives them back.
And yes, if I unload octeon-ethernet.ko (or don't install it at all), KMEMLEAK stops reporting, but the memory leak also goes away. Similarly, if I stop the network, no leaks.
root@OpenWrt:/# echo scan > /sys/kernel/debug/kmemleak
root@OpenWrt:/# service network start
root@OpenWrt:/# [ 1083.489783] br-lan: port 1(eth1) entered blocking state
[ 1083.495104] br-lan: port 1(eth1) entered disabled state
[ 1083.501355] device eth1 entered promiscuous mode
[ 1083.533991] br-lan: port 2(eth2) entered blocking state
[ 1083.539353] br-lan: port 2(eth2) entered disabled state
[ 1083.545620] device eth2 entered promiscuous mode
[ 1087.744465] eth0: 1000 Mbps Full duplex, port 0, queue 0
[ 1087.749826] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
root@OpenWrt:/# echo scan > /sys/kernel/debug/kmemleak
[ 1092.778041] kmemleak: 3054 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
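To put a number on reports like these, a saved kmemleak dump could be summarized with something like the sketch below. `kmemleak_summary` is a hypothetical helper name, and the parsing assumes the usual `unreferenced object 0x... (size N):` header line that kmemleak prints for each suspect.

```shell
# Sketch only: count kmemleak suspects and add up their reported sizes.
# kmemleak_summary is a hypothetical helper, not part of any tool.
kmemleak_summary() {
    # $1: file holding a kmemleak report (default: the live debugfs file)
    awk '/^unreferenced object/ {
             n++
             size = $5                 # field looks like "2048):"
             gsub(/[^0-9]/, "", size)  # keep only the digits
             bytes += size
         }
         END { printf "%d objects, %d bytes\n", n, bytes }' \
        "${1:-/sys/kernel/debug/kmemleak}"
}
```

Usage might look like `cat /sys/kernel/debug/kmemleak > /tmp/kmemleak.txt; kmemleak_summary /tmp/kmemleak.txt`. Keep in mind the maintainer's point quoted above: buffers handed to FPA are reported as suspects too, so this total is an upper bound at best.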
@Ansuel - I'll ping you on this so you can keep track, and in case you have any more suggestions.
I need help with how to quantify this, trace it, or at least report it so it gets taken seriously (or at the very least, provides a fix for my issue, even if it isn't the driver). This seems to affect ALL Octeon targets under 5.10+...
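One way to quantify it might be to log a few /proc/meminfo fields at a fixed interval and watch which ones drift over time. A rough sketch follows; `meminfo_kb` and `log_mem` are hypothetical helper names, and picking `MemFree`, `Slab` and `SUnreclaim` as the fields to watch is an assumption, not something established in this thread.

```shell
# Sketch only: sample selected /proc/meminfo fields over time.
meminfo_kb() {
    # $1: field name (e.g. MemFree); $2: optional file, defaults to /proc/meminfo
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

log_mem() {
    # $1: seconds between samples; $2: number of samples
    i=0
    while [ "$i" -lt "$2" ]; do
        echo "$(date +%s) MemFree=$(meminfo_kb MemFree)" \
             "Slab=$(meminfo_kb Slab) SUnreclaim=$(meminfo_kb SUnreclaim)"
        i=$((i + 1))
        if [ "$i" -lt "$2" ]; then
            sleep "$1"
        fi
    done
}
```

For example, `log_mem 60 120 > /tmp/mem.log` would take one sample a minute for two hours. Values are in kB as reported by the kernel; a steadily growing `SUnreclaim` with flat userspace usage would point at kernel allocations rather than a process.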
@Ansuel Would getting the overview be as 'easy' as checking the git history for drivers/staging/octeon?
Because that's quickly found and the changes don't seem that many (going by number of commits, not code wise). Last commit in 5.4 that concerned octeon was 9cbc634.
Of course, it might just as well be something else in the network code wreaking havoc on Octeon networking.
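Checking that history could look roughly like the sketch below. It assumes a local clone of the mainline kernel with the `v5.4` and `v5.10` tags available; `octeon_history` is a hypothetical helper name, and the default path is the driver directory named above.

```shell
# Sketch only: list commits touching the given paths between two revisions.
octeon_history() {
    # $1: path to a Linux kernel git checkout
    # $2, $3: revisions bounding the range (e.g. v5.4 and v5.10)
    # $4...: optional paths; defaults to drivers/staging/octeon
    repo=$1
    from=$2
    to=$3
    shift 3
    [ "$#" -gt 0 ] || set -- drivers/staging/octeon
    git -C "$repo" log --oneline "$from..$to" -- "$@"
}
```

For example, `octeon_history ~/src/linux v5.4 v5.10` lists the driver commits, and extra paths can be appended to widen the search. This only shows what changed in those directories; it says nothing about changes elsewhere in the network stack.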
I have a spare EdgeRouter Lite (Octeon Plus) I can test on. My EdgeRouter 4 (Octeon III) is in 'production'.
@Grommish I assume you're still seeing the memleak? You have an Octeon III as well right (Itus Shield Pro)?
Yes, I have 2 Shields (Octeon 3 CN7020), one I run in production using 5.4 and a dev box using 5.4 for my rust testing, but nothing I can't blow up to test.
A leak in the generic network code is hard to believe; it would already have been noticed and fixed by others... It could be that an API is used in a bad way and the driver is missing a memory release.
Bisecting the code, or checking whether the 5.4 version compiles on 5.10, would be a quick and dirty way to tell whether the leak is caused by the driver itself, or whether the kernel introduced more memory-usage tracking and the leak was already present in 5.4 but not tracked.
OK, so I gave this a first try by throwing the 5.4 drivers/staging/octeon codebase into the 5.10 tree and seeing where that gets me (compilation errors).
There's also a TODO in that directory that says the following:
This driver is functional and supports Ethernet on OCTEON+/OCTEON2/OCTEON3
chips at least up to CN7030.
TODO:
- general code review and clean up
- make driver self-contained instead of being split between staging and
arch/mips/cavium-octeon.
So there's stuff in arch/mips/cavium-octeon as well, not just in staging/. I used the 5.4.162 codebase; I'll be giving it another shot with the most recent 5.4.166 release (I'm running 5.4.161 on my EdgeRouter 4 without issues and don't expect the newer bumps to be any different).
Going to see if this patch set makes any difference in getting the 5.4 octeon Ethernet driver to compile on 5.10:
This is the diff between 5.10.83 and 5.4.166. From what I can tell, that's pretty much this patch set sent in October 2019, plus some other minor stuff. I don't know offhand where to get those patches in a downloadable format, so I patched them into the 5.4.166 codebase manually; pasted here.
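If both source trees are unpacked locally, a patch between the two driver directories can also be generated directly with plain `diff`. A minimal sketch, assuming hypothetical directory names for the unpacked trees and a made-up helper name `make_tree_patch`:

```shell
# Sketch only: write a unified diff between two directory trees to a file.
make_tree_patch() {
    # $1: old tree, $2: new tree, $3: output patch file
    rc=0
    diff -ruN "$1" "$2" > "$3" || rc=$?
    # diff exits 0 for identical trees, 1 when they differ, >1 on real errors
    [ "$rc" -le 1 ]
}
```

For example: `make_tree_patch linux-5.4.166/drivers/staging/octeon linux-5.10.83/drivers/staging/octeon octeon.patch`. The same call with the arch/mips/cavium-octeon directories would cover the other half of the driver.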
Hey, I would like to confirm that I experienced this too with self-compiled linux kernels on Octeons (Ubiquiti ER-4 and ER-6) running Debian. Specifically kernel 5.14 would OOM all the time, mysteriously, with plenty of memory available.
I was tearing my hair out. I tried all three flavors (mips, mipsel, mips64el) of userspace and both kernel endiannesses.
Thank you so, so much to @Grommish for posting this. Without your post I would never have figured out the problem.
Downgraded to 5.4.163 (also self-compiled), problems disappeared, rock solid.
This is definitely an upstream problem. The Linux octeon maintainers need to acknowledge this bug. IIRC octeon-ethernet got ripped out of the kernel around 5.9.x and then hastily re-added just in time for the 5.10 LTS release.
Drivers are in staging/ which means they're far from optimal, and no one seems to bother (least of all Cavium or Marvell, the new owners). I put my EdgeRouter 4 up for sale and the EdgeRouter Lite is gonna go too.
Edit: I meant nobody upstream or at the companies owning the IP seems to bother (and kernel maintainers typically don't for stuff in staging/, I reckon).
It made it because the leak was reduced, but I'm still seeing it. The alternative was to set Octeon to source-only and end the build-bot builds. As far as I can tell, the memleak is still present, more with networking on than off, but there's still a persistent leak even with no networking.
Personally, I've reverted 3a14580411 ("kernel: delete Linux 5.4 config and patches") because I consider 5.10 unusable (up to and including 5.10.108). I'm waiting for 5.15 to fully drop for testing to see whether it's a lost cause or not.
Even with networking off and disabled (service network stop && service network disable && reboot) I'm still seeing a small leak (~500k every 5-10 minutes). This is a combination of things.
At this point, I'm going thru and disabling services and seeing what continues to leak. The Octeon Maintainers upstream say it isn't the driver, and unfortunately, the only way to test this is to remove the driver, which removes anything else that might be tangentially related to networking that could be leaking.
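Alongside disabling services, comparing two slab snapshots taken some minutes apart might show which kernel caches are actually growing. This is a sketch under assumptions: it needs /proc/slabinfo, which depends on the kernel config and is usually readable by root only, and `slab_snapshot` and `slab_growth` are hypothetical helper names.

```shell
# Sketch only: find kernel slab caches that grew between two snapshots.
slab_snapshot() {
    # $1: optional slabinfo file (default: /proc/slabinfo)
    # Skips the two header lines and prints "cache_name bytes",
    # where bytes = num_objs * objsize.
    awk 'NR > 2 { printf "%s %.0f\n", $1, $3 * $4 }' "${1:-/proc/slabinfo}"
}

slab_growth() {
    # $1, $2: snapshot files written by slab_snapshot, older one first
    # Prints "grown_bytes cache_name", largest growth first.
    # Caches missing from the older snapshot count as starting from 0.
    awk 'NR == FNR { before[$1] = $2; next }
         { d = $2 - before[$1]; if (d > 0) printf "%.0f %s\n", d, $1 }' \
        "$1" "$2" | sort -rn
}
```

For example: `slab_snapshot > /tmp/a; sleep 600; slab_snapshot > /tmp/b; slab_growth /tmp/a /tmp/b`. A cache that keeps growing across several intervals with networking stopped would be a more specific lead than the overall free-memory figure.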