Upstream kernel memleak 5.10+? octeon-ethernet.ko

I've been fighting a kernel memory leak for a while now. Originally I suspected dnsmasq or adblock, but I've since MOSTLY ruled those out (see the earlier thread: Oom-killer: dnsmasq when Physical Free RAM remains).

A few baseline facts:

  1. This does not happen under 5.4.x
  2. This DOES happen under 5.10 and 5.15
  3. KMEMLEAK reports the ethernet driver - thousands of times
    3a) However, according to the maintainer:

"Those are not real memory leaks. If you unload the driver, and run the
kmemleak again you'll see they are gone.

The reason kmemleak thinks those are unreferenced is because those
memory buffers are given to FPA, and they are not visible to kmemleak
until FPA gives them back."

And yes, if I unload octeon-ethernet.ko (or don't install it at all), KMEMLEAK stops reporting, but the memory leak also goes away. Similarly, if I stop the network, there are no leaks.

root@OpenWrt:/# echo scan > /sys/kernel/debug/kmemleak
root@OpenWrt:/# service network start
root@OpenWrt:/# [ 1083.489783] br-lan: port 1(eth1) entered blocking state
[ 1083.495104] br-lan: port 1(eth1) entered disabled state
[ 1083.501355] device eth1 entered promiscuous mode
[ 1083.533991] br-lan: port 2(eth2) entered blocking state
[ 1083.539353] br-lan: port 2(eth2) entered disabled state
[ 1083.545620] device eth2 entered promiscuous mode
[ 1087.744465] eth0: 1000 Mbps Full duplex, port 0, queue 0
[ 1087.749826] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

root@OpenWrt:/# echo scan > /sys/kernel/debug/kmemleak
[ 1092.778041] kmemleak: 3054 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
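
To put a number on what kmemleak is reporting, a saved copy of the report could be summarised with something like the sketch below. `kmemleak_summary` is a made-up helper name, and it assumes the usual `unreferenced object 0x... (size N):` header lines in the report. Going by the maintainer's note, buffers handed to FPA will be counted here too, so the totals show how much kmemleak flags, not how much is really leaked.

```shell
#!/bin/sh
# Hypothetical helper: count kmemleak reports and add up their sizes.
# $1 = file holding a copy of /sys/kernel/debug/kmemleak
kmemleak_summary() {
    awk '
        /^unreferenced object/ {
            n++
            s = $0
            sub(/.*\(size /, "", s)   # drop everything up to the size
            sub(/\).*/, "", s)        # drop the trailing "):"
            bytes += s
        }
        END { printf "%d objects, %d bytes\n", n, bytes }
    ' "$1"
}

# On the device (not run here; /tmp/kmemleak.txt is an assumed path):
#   echo clear > /sys/kernel/debug/kmemleak   # forget earlier reports
#   service network start
#   echo scan > /sys/kernel/debug/kmemleak
#   cat /sys/kernel/debug/kmemleak > /tmp/kmemleak.txt
#   kmemleak_summary /tmp/kmemleak.txt
```

Running it once with the driver loaded and once after unloading it would give two numbers to compare against the maintainer's explanation.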

@Ansuel - I'll ping you on this so you can keep track, and in case you have any more suggestions.

@daniel Maybe you have some ideas?

I need help with how to quantify this, trace it, or at least report it so it gets taken seriously (or at the very least, gets me a fix for my issue, even if it isn't the driver). This seems to affect ALL Octeon targets under 5.10+...

Can you build your own image?

That's all I build :slight_smile:

Currently master at 49f615022c, but I can rebase

A crazy idea would be to check the changes in the octeon driver, put them on top of 5.10, and check whether it compiles...

@Ansuel Would getting the overview be as 'easy' as checking the git history for drivers/staging/octeon?

Because that's quickly found and the changes don't seem that many (going by number of commits, not code wise). Last commit in 5.4 that concerned octeon was 9cbc634.

Of course it might just as well be something else in the network code wreaking havoc on Octeon networking.
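
Checking the git history for a path between two releases could look like the sketch below. `commits_touching` is a hypothetical helper name, and the `v5.4` / `v5.10` tags in the example assume a mainline or stable kernel checkout with tags fetched.

```shell
#!/bin/sh
# Hypothetical helper: list non-merge commits in OLD..NEW touching some paths.
# $1 = kernel tree, $2 = older ref, $3 = newer ref, remaining args = paths
commits_touching() {
    tree=$1 old=$2 new=$3
    shift 3
    git -C "$tree" log --oneline --no-merges "$old..$new" -- "$@"
}

# Example (not run here; ~/linux is an assumed checkout location):
#   commits_touching ~/linux v5.4 v5.10 drivers/staging/octeon
```

Extra paths can be appended to the call if the driver turns out to depend on code elsewhere in the tree.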

I have a spare EdgeRouter Lite (Octeon Plus) I can test on. My EdgeRouter 4 (Octeon III) is in 'production'.

@Grommish I assume you're still seeing the memleak? You have an Octeon III as well right (Itus Shield Pro)?

Yes, I have 2 Shields (Octeon 3 CN7020), one I run in production using 5.4 and a dev box using 5.4 for my rust testing, but nothing I can't blow up to test.

A leak in the generic network code is unlikely, and it would already have been noticed and fixed by others... It could be that an API is used in a bad way and the driver is missing a memory release.
Bisecting the code, or checking whether the 5.4 version compiles on 5.10, would be a quick and dirty way to tell whether the leak is caused by the driver itself, or whether the kernel introduced more memory-usage tracking and the leak was already present in 5.4 but not tracked.

OK so I gave this a first try by throwing the 5.4 drivers/staging/octeon codebase into the 5.10 tree and seeing where that gets me (compilation errors :stuck_out_tongue_winking_eye: ).

There's also a TODO in that directory that says the following:

This driver is functional and supports Ethernet on OCTEON+/OCTEON2/OCTEON3
chips at least up to CN7030.

TODO:
- general code review and clean up
- make driver self-contained instead of being split between staging and
arch/mips/cavium-octeon.

Contact: Aaro Koskinen aaro.xxxxxxxx@xxx.xx

So there's stuff in arch/mips/cavium-octeon as well, not just in staging/. I used the 5.4.162 codebase, and I'll be giving it another shot with the most recent 5.4.166 release (I'm running 5.4.161 on my EdgeRouter 4 without issues and don't expect the newer bumps to be any different).

Going to see if this patch set makes any difference in getting the 5.4 octeon Ethernet driver to compile on 5.10:

https://www.spinics.net/lists/kernel/msg3281112.html

This is the diff between 5.10.83 and 5.4.166. From what I can tell, that's pretty much this patch set sent in October 2019 plus some other minor stuff. I don't know offhand where to get those patches in a downloadable format, so I patched them into the 5.4.166 codebase manually; paste here.
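
For getting the driver changes in a downloadable, appliable form, `git format-patch` can write one file per commit. A sketch, assuming a stable kernel checkout that carries the `v5.4.166` and `v5.10.83` tags; `export_patches` is a hypothetical helper name and the paths in the example are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: write one .patch file per commit in OLD..NEW
# that touches the given paths.
# $1 = kernel tree, $2 = older ref, $3 = newer ref,
# $4 = output directory (absolute path), remaining args = paths
export_patches() {
    tree=$1 old=$2 new=$3 out=$4
    shift 4
    git -C "$tree" format-patch -q -o "$out" "$old..$new" -- "$@"
}

# Example (not run here; ~/linux and /tmp/octeon-patches are assumed paths):
#   export_patches ~/linux v5.4.166 v5.10.83 /tmp/octeon-patches \
#       drivers/staging/octeon arch/mips/cavium-octeon
```

The resulting files can be applied one at a time with `git am` or `patch -p1`, which makes it easier to see which individual change stops applying or compiling.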

Hey, I would like to confirm that I experienced this too with self-compiled linux kernels on Octeons (Ubiquiti ER-4 and ER-6) running Debian. Specifically kernel 5.14 would OOM all the time, mysteriously, with plenty of memory available.

I was tearing my hair out. I tried all three flavors (mips, mipsel, mips64el) of userspace and both kernel endiannesses.

Thank you so, so much to @Grommish for posting this. Without your post I would never have figured out the problem.

Downgraded to 5.4.163 (also self-compiled), problems disappeared, rock solid.

This is definitely an upstream problem. The Linux octeon maintainers need to acknowledge this bug. IIRC octeon-ethernet got ripped out of the kernel around 5.9.x and then hastily re-added just in time for the 5.10 LTS release.

Thanks again!

Drivers are in staging/ which means they're far from optimal, and no one seems to bother (least of all Cavium or Marvell, the new owners). I put my EdgeRouter 4 up for sale and the EdgeRouter Lite is gonna go too.

Edit: I meant nobody upstream or at the companies owning the IP seems to bother (and kernel maintainers typically don't for stuff in staging/, I reckon).

Sad when things go that way.... well, this situation looks very similar to mwlwifi... A driver that was in the process of maturing and then RIP.

I've not given up hope yet, but my inexperience with things means it'll be a long, drawn out process to try and figure out what broke.

What is the status of this? As far as I can see, octeon made it into the 22.03 branch.

It made it because the leak was reduced, but I'm still seeing it. The alternative was to set Octeon to source-only and end the build-bot builds. As far as I can tell, the memleak is still present, more with networking on than off, but still a persistent leak even with no networking.

Personally, I've reverted 3a14580411 ("kernel: delete Linux 5.4 config and patches") because I consider 5.10 unusable (up to and including 5.10.108). I'm waiting for 5.15 to fully drop for testing to see whether it's a lost cause or not.

Well.. There goes that..

Under 5.15 - https://gist.github.com/Grommish/2cede3cceff2b8671b8f31ebeafd74fc

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;h=1fa8780056a8c7a2e26c8b4d5e6979232f117349

And so, it is going source-only for now. Building octeon targets will become a one-off for anyone who wants to use it.

I'm still looking, but it's not looking good.

For anyone who wants to try 5.15, here is the draft-PR I put up: https://github.com/openwrt/openwrt/pull/9614

Too bad not even 5.15 seems to show improvements :frowning:.

It's not surprising, considering that these old Octeons have been dead upstream.

Hmm, sounds like I have to find something new again that is mountable in a 19” rack and has high performance for AES encryption.

Even with networking off and disabled (service network stop && service network disable && reboot) I'm still seeing a small leak (~500k every 5-10 minutes).. This is a combination of things.

At this point, I'm going thru and disabling services and seeing what continues to leak. The Octeon Maintainers upstream say it isn't the driver, and unfortunately, the only way to test this is to remove the driver, which removes anything else that might be tangentially related to networking that could be leaking.

Interesting.. dnsmasq being turned off (in addition to network) seems to stop the residual leak.

root@OpenWrt:/# cat /etc/memlog_nonet_nodnsmasq.log
Fri Apr  1 17:51:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41364      907844          28       16800      891656
Swap:             0           0           0
Fri Apr  1 17:56:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41140      908052          28       16816      891872
Swap:             0           0           0
Fri Apr  1 18:01:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       40840      908344          28       16824      892168
Swap:             0           0           0
Fri Apr  1 18:06:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41060      908120          28       16828      891948
Swap:             0           0           0
Fri Apr  1 18:11:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       40856      908324          28       16828      892152
Swap:             0           0           0
Fri Apr  1 18:16:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41336      907844          28       16828      891672
Swap:             0           0           0
Fri Apr  1 18:21:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41288      907892          28       16828      891720
Swap:             0           0           0
Fri Apr  1 18:26:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41276      907904          28       16828      891732
Swap:             0           0           0
Fri Apr  1 18:31:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41268      907912          28       16828      891740
Swap:             0           0           0
Fri Apr  1 18:36:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41300      907880          28       16828      891708
Swap:             0           0           0
Fri Apr  1 18:41:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41048      908132          28       16828      891960
Swap:             0           0           0
Fri Apr  1 18:46:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41484      907696          28       16828      891524
Swap:             0           0           0
Fri Apr  1 18:51:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41280      907900          28       16828      891728
Swap:             0           0           0
Fri Apr  1 18:56:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41464      907712          28       16832      891540
Swap:             0           0           0
Fri Apr  1 19:01:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41212      907964          28       16832      891792
Swap:             0           0           0
Fri Apr  1 19:06:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41208      907964          28       16836      891796
Swap:             0           0           0
Fri Apr  1 19:11:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41484      907688          28       16836      891520
Swap:             0           0           0
Fri Apr  1 19:16:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41232      907940          28       16836      891772
Swap:             0           0           0
Fri Apr  1 19:21:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41460      907708          28       16840      891540
Swap:             0           0           0
Fri Apr  1 19:26:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41412      907756          28       16840      891588
Swap:             0           0           0
Fri Apr  1 19:31:11 UTC 2022
              total        used        free      shared  buff/cache   available
Mem:         966008       41392      907772          28       16844      891604
Swap:             0           0           0
root@OpenWrt:/#
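
For quantifying the trend in a log like the one above, a small awk wrapper can compare the first and last `Mem:` lines. This is only a sketch under assumptions: `memlog_trend` is a hypothetical helper name, the log is assumed to be plain `date` plus `free` output appended at a fixed interval (300 seconds, matching the timestamps above), and the column positions follow the `free` header shown.

```shell
#!/bin/sh
# Hypothetical helper: report how "used" and "available" moved between
# the first and last sample of a date+free log.
# $1 = log file, $2 = seconds between samples (defaults to 300)
memlog_trend() {
    awk -v step="${2:-300}" '
        $1 == "Mem:" {
            used = $3; avail = $7          # columns as printed by free
            if (n == 0) { first_used = used; first_avail = avail }
            n++
        }
        END {
            if (n < 2) { print "need at least two samples"; exit 1 }
            minutes = (n - 1) * step / 60
            printf "%d samples over %d min: used %d kB, available %d kB\n",
                   n, minutes, used - first_used, avail - first_avail
        }
    ' "$1"
}

# Example (not run here):
#   memlog_trend /etc/memlog_nonet_nodnsmasq.log 300
```

Going by the first and last samples in the log above (21 samples, 5 minutes apart), `used` moves from 41364 to 41392 kB and `available` from 891656 to 891604 kB, so about +28 kB and -52 kB over 100 minutes. That is effectively flat compared with the ~500k every 5-10 minutes seen with dnsmasq running.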

If anyone has any suggestions, I'm open to them.
