Upstream Kernel Memleak 5.10+? octeon-ethernet.ko

I've been fighting a kernel memory leak for a while now. Originally it was suspected to be dnsmasq or adblock, but I've since MOSTLY ruled those out. See: Oom-killer: dnsmasq when Physical Free RAM remains

A few base-line facts:

  1. This does not happen under 5.4.x
  2. This DOES happen under 5.10 and 5.15
  3. KMEMLEAK reports the ethernet driver - thousands of times
    3a) However, according to the maintainer:
Those are not real memory leaks. If you unload the driver, and run the
kmemleak again you'll see they are gone.

The reason kmemleak thinks those are unreferenced is because those
memory buffers are given to FPA, and they are not visible to kmemleak
until FPA gives them back.

And yes, if I unload octeon-ethernet.ko (or don't install it at all), KMEMLEAK stops reporting, but the memory leak also goes away. Similarly, if I stop the network, there are no leaks.

root@OpenWrt:/# echo scan > /sys/kernel/debug/kmemleak
root@OpenWrt:/# service network start
root@OpenWrt:/# [ 1083.489783] br-lan: port 1(eth1) entered blocking state
[ 1083.495104] br-lan: port 1(eth1) entered disabled state
[ 1083.501355] device eth1 entered promiscuous mode
[ 1083.533991] br-lan: port 2(eth2) entered blocking state
[ 1083.539353] br-lan: port 2(eth2) entered disabled state
[ 1083.545620] device eth2 entered promiscuous mode
[ 1087.744465] eth0: 1000 Mbps Full duplex, port 0, queue 0
[ 1087.749826] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

root@OpenWrt:/# echo scan > /sys/kernel/debug/kmemleak
[ 1092.778041] kmemleak: 3054 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
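
To put a number on those reports rather than just the count from dmesg, a small helper along these lines might do. It is only a sketch: it assumes every entry in the kmemleak file begins with a line of the form `unreferenced object 0x... (size N):`, and `kmemleak_summary` is a made-up name, not an existing tool.

```shell
# Summarise a kmemleak report: number of suspected leaks and their total size.
# Usage: kmemleak_summary [report-file]   (defaults to the debugfs file)
# Assumption: each entry starts with "unreferenced object 0x... (size N):".
kmemleak_summary() {
    report="${1:-/sys/kernel/debug/kmemleak}"
    awk '
        /^unreferenced object/ {
            count++
            size = $NF                 # last field looks like "2048):"
            gsub(/[^0-9]/, "", size)   # keep only the digits
            total += size
        }
        END { printf "%d objects, %d bytes\n", count, total }
    ' "$report"
}
```

Running it once before `service network start` and once after the following scan would show how many bytes kmemleak attributes to the new reports.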

@Ansuel - I'll ping you on this so you can keep track, in case you have any more suggestions

@daniel Maybe you have some ideas?

I need help with how to quantify this, trace it, or at least report it so it gets taken seriously (or at the very least, get a fix for my issue, even if it isn't the driver). This seems to affect ALL Octeon targets under 5.10+...
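
One possible way to quantify it is to sample /proc/meminfo at a fixed interval and compare a 5.4 box against a 5.10 box under the same traffic. The sketch below is just that, a sketch: the function names and the 60 second default are made up, and which counter actually grows (MemFree dropping, SUnreclaim rising, or something else) is an open question, not something established in this thread.

```shell
# Print one /proc/meminfo field in kB, e.g. "meminfo_kb SUnreclaim".
# The optional second argument is an alternative file in the same format.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Log a timestamp, free memory and unreclaimable slab every N seconds
# (default 60, a hypothetical interval). Runs until interrupted.
log_meminfo() {
    while :; do
        echo "$(date +%s) MemFree=$(meminfo_kb MemFree) SUnreclaim=$(meminfo_kb SUnreclaim)"
        sleep "${1:-60}"
    done
}
```

Something like `log_meminfo 60 > /tmp/meminfo.log &` left running until the OOM killer fires would give a series of numbers to attach to a bug report.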

Can you build your own image?

That's all I build :slight_smile:

Currently master at 49f615022c, but I can rebase

A crazy idea would be to check the changes in the octeon driver, put them on top of 5.10, and see if it compiles....

@Ansuel Would getting the overview be as 'easy' as checking the git history for drivers/staging/octeon?

Because that's quickly found and the changes don't seem that many (going by number of commits, not code wise). Last commit in 5.4 that concerned octeon was 9cbc634.
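
For reference, the overview could be pulled with plain `git log` limited to a path. This is a sketch that assumes a mainline kernel checkout with the `v5.4` and `v5.10` tags present; `changes_between` is a made-up wrapper name.

```shell
# List commits that touch the given paths between two refs.
# Usage: changes_between <old-ref> <new-ref> <path>...
changes_between() {
    old="$1"; new="$2"; shift 2
    git log --oneline "$old..$new" -- "$@"
}

# In a kernel checkout (tags assumed to exist there):
#   changes_between v5.4 v5.10 drivers/staging/octeon
```

Adding more paths after the first one widens the search if the driver turns out to depend on code elsewhere in the tree.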

Of course, it might just as well be something else in the network code wreaking havoc on Octeon networking.

I have a spare EdgeRouter Lite (Octeon Plus) I can test on. My EdgeRouter 4 (Octeon III) is in 'production'.

@Grommish I assume you're still seeing the memleak? You have an Octeon III as well right (Itus Shield Pro)?

Yes, I have 2 Shields (Octeon 3 CN7020), one I run in production using 5.4 and a dev box using 5.4 for my rust testing, but nothing I can't blow up to test.

A leak in the generic network code is unlikely; it would already have been noticed and fixed by others... It could be that an API is used in a bad way and the driver fails to release some memory.
Bisecting the code, or checking whether the 5.4 version compiles on 5.10, would be a quick and dirty way to tell whether the leak is caused by the driver itself, or whether the kernel introduced more tracking of memory usage and the leak was already present in 5.4 but not tracked.

OK, so I gave this a first try by throwing the 5.4 drivers/staging/octeon codebase into the 5.10 tree and seeing where that gets me (compilation errors :stuck_out_tongue_winking_eye: ).

There's also a TODO in that directory that says the following:

This driver is functional and supports Ethernet on OCTEON+/OCTEON2/OCTEON3
chips at least up to CN7030.

TODO:
- general code review and clean up
- make driver self-contained instead of being split between staging and
arch/mips/cavium-octeon.

Contact: Aaro Koskinen aaro.xxxxxxxx@xxx.xx

So there's stuff in arch/mips/cavium-octeon as well, not just in staging/. I used the 5.4.162 codebase; I'll be giving it another shot with the most recent 5.4.166 release (I'm running 5.4.161 on my EdgeRouter 4 without issues and don't expect the newer bumps to be any different).

Going to see if this patch set makes any difference in getting the 5.4 octeon Ethernet driver to compile on 5.10:

https://www.spinics.net/lists/kernel/msg3281112.html

This is the diff between 5.10.83 and 5.4.166. From what I can tell, it's pretty much the patch set sent in October 2019 plus some other minor stuff. I don't know offhand where to get those patches in a downloadable format, so I patched them into the 5.4.166 codebase manually; paste here.
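
A diff like that can be regenerated from two unpacked source trees with plain `diff`. The following is a sketch under assumptions: the directory names in the usage comment are hypothetical, and `tree_diff` is a made-up helper.

```shell
# Unified diff of one subdirectory between two unpacked kernel trees.
# Usage: tree_diff <old-tree> <new-tree> <subdir>
tree_diff() {
    # diff exits 1 when the trees differ; treat that as success and
    # only report failure for real errors (exit status above 1).
    diff -ruN "$1/$3" "$2/$3" || [ $? -eq 1 ]
}

# Hypothetical directory names for the two unpacked tarballs:
#   tree_diff linux-5.4.166 linux-5.10.83 drivers/staging/octeon > octeon.diff
```

The same call with `arch/mips/cavium-octeon` as the subdirectory would cover the other half of the driver mentioned in the TODO.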

Hey, I would like to confirm that I experienced this too with self-compiled linux kernels on Octeons (Ubiquiti ER-4 and ER-6) running Debian. Specifically kernel 5.14 would OOM all the time, mysteriously, with plenty of memory available.

I was tearing my hair out. I tried all three flavors (mips, mipsel, mips64el) of userspace and both kernel endiannesses.

Thank you so, so much to @Grommish for posting this. Without your post I would never have figured out the problem.

Downgraded to 5.4.163 (also self-compiled), problems disappeared, rock solid.

This is definitely an upstream problem. The Linux octeon maintainers need to acknowledge this bug. IIRC octeon-ethernet got ripped out of the kernel around 5.9.x and then hastily re-added just in time for the 5.10 LTS release.

Thanks again!

Drivers are in staging/, which means they're far from optimal, and no one seems to bother (least of all Cavium or Marvell, the new owners). I put my EdgeRouter 4 up for sale and the EdgeRouter Lite is gonna go too.

Edit: I meant nobody upstream or at the companies owning the IP seems to bother (and kernel maintainers typically don't for stuff in staging/, I reckon).

Sad when things go that way.... Well, this situation looks very similar to mwlwifi... A driver that was in the process of maturing and then RIP.

I've not given up hope yet, but my inexperience with things means it'll be a long, drawn out process to try and figure out what broke.