iperf with a static IP and no other services running should be a good test
but again, it can't be something in userspace, as we would see the same leak on every other target
the fact that the driver is still in staging in the kernel makes me think the maintainers' assertion that the driver doesn't leak is BS...
Under the 5.4 kernel I saw no issues, but when testing under the 5.10 kernel I see a series of memory leaks. I've enabled KMEMLEAK and begun trying to trace this down. I am not a programmer, but I can follow direction. In this case, it was decided I should talk to you as the listed Cavium-Octeon maintainer.
Any suggestion or insight you could provide would be much appreciated. This seems to be affecting Octeon+/Octeon2/Octeon3 targets currently supported by OpenWrt.
I've only included the first four reports, though there were thousands on boot:
unreferenced object 0x8000000005cc6800 (size 2048):
comm "swapper/0", pid 1, jiffies 4294937844 (age 3174.260s)
hex dump (first 32 bytes):
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
backtrace:
[<0000000092547866>] __kmalloc+0x1c4/0x768
[<000000002902bd0d>] cvm_oct_mem_fill_fpa+0x60/0x1a8
[<00000000b84a9f23>] cvm_oct_probe+0xb4/0xab8
[<00000000bdc4ede7>] platform_drv_probe+0x28/0x88
[<00000000a567e8b8>] really_probe+0xfc/0x4e0
[<0000000096837a2a>] device_driver_attach+0x120/0x130
[<00000000ec1cb103>] __driver_attach+0x7c/0x148
[<00000000fb6265da>] bus_for_each_dev+0x68/0xa8
[<000000004feb0e7d>] bus_add_driver+0x1d0/0x218
[<0000000069658853>] driver_register+0x98/0x160
[<00000000cec7f896>] do_one_initcall+0x54/0x168
[<0000000035c2e6f9>] kernel_init_freeable+0x280/0x31c
[<0000000046a35530>] kernel_init+0x14/0x104
[<00000000991d0df4>] ret_from_kernel_thread+0x14/0x1c
unreferenced object 0x8000000005323680 (size 216):
comm "softirq", pid 0, jiffies 4295252469 (age 28.020s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000078af28d6>] kmem_cache_alloc+0x1ac/0x708
[<000000001d074ea2>] __build_skb+0x34/0xd8
[<00000000339b0f83>] __netdev_alloc_skb+0x118/0x1f0
[<00000000db3556b0>] cvm_oct_mem_fill_fpa+0x154/0x1a8
[<00000000a49e80de>] cvm_oct_napi_poll+0x4c0/0x988
[<00000000d0c3cba0>] __napi_poll+0x3c/0x158
[<00000000bb0c10eb>] net_rx_action+0xe8/0x210
[<000000003322eb9f>] __do_softirq+0x168/0x360
[<00000000d4037fcb>] irq_exit+0x9c/0xe8
[<00000000224da306>] plat_irq_dispatch+0x48/0xd0
[<00000000327ba56b>] handle_int+0x14c/0x158
[<000000003eae4681>] __r4k_wait+0x20/0x40
unreferenced object 0x80000000052fd880 (size 216):
comm "softirq", pid 0, jiffies 4295253111 (age 21.600s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000078af28d6>] kmem_cache_alloc+0x1ac/0x708
[<000000001d074ea2>] __build_skb+0x34/0xd8
[<00000000339b0f83>] __netdev_alloc_skb+0x118/0x1f0
[<00000000db3556b0>] cvm_oct_mem_fill_fpa+0x154/0x1a8
[<00000000a49e80de>] cvm_oct_napi_poll+0x4c0/0x988
[<00000000d0c3cba0>] __napi_poll+0x3c/0x158
[<00000000bb0c10eb>] net_rx_action+0xe8/0x210
[<000000003322eb9f>] __do_softirq+0x168/0x360
[<00000000d4037fcb>] irq_exit+0x9c/0xe8
[<00000000224da306>] plat_irq_dispatch+0x48/0xd0
[<00000000327ba56b>] handle_int+0x14c/0x158
[<000000003eae4681>] __r4k_wait+0x20/0x40
unreferenced object 0x80000000052fc680 (size 216):
comm "softirq", pid 0, jiffies 4295253112 (age 21.590s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000078af28d6>] kmem_cache_alloc+0x1ac/0x708
[<000000001d074ea2>] __build_skb+0x34/0xd8
[<00000000339b0f83>] __netdev_alloc_skb+0x118/0x1f0
[<00000000db3556b0>] cvm_oct_mem_fill_fpa+0x154/0x1a8
[<00000000a49e80de>] cvm_oct_napi_poll+0x4c0/0x988
[<00000000d0c3cba0>] __napi_poll+0x3c/0x158
[<00000000bb0c10eb>] net_rx_action+0xe8/0x210
[<000000003322eb9f>] __do_softirq+0x168/0x360
[<00000000d4037fcb>] irq_exit+0x9c/0xe8
[<00000000224da306>] plat_irq_dispatch+0x48/0xd0
[<00000000327ba56b>] handle_int+0x14c/0x158
[<000000003eae4681>] __r4k_wait+0x20/0x40
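For anyone trying to reproduce these reports: they come from the kmemleak debugfs file. A typical scan session looks something like the following (it needs a kernel built with CONFIG_DEBUG_KMEMLEAK and debugfs mounted, and is a harmless no-op elsewhere):

```shell
#!/bin/sh
# Force a kmemleak scan and dump any "unreferenced object" reports.
KML=/sys/kernel/debug/kmemleak
if [ -w "$KML" ]; then
    echo scan > "$KML"       # trigger an immediate scan
    sleep 2                  # give the scan time to finish
    cat "$KML"               # prints reports like the ones quoted above
    # echo clear > "$KML"    # optionally discard current reports before re-testing
else
    echo "kmemleak not available on this kernel" >&2
fi
```

Clearing between runs makes it easier to tell boot-time reports apart from ones that keep accumulating under traffic.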
The response:
Those are not real memory leaks. If you unload the driver and run
kmemleak again, you'll see they are gone.
The reason kmemleak thinks those are unreferenced is that those
memory buffers are given to the FPA, and they are not visible to kmemleak
until the FPA gives them back.
This is FACTUALLY true, from what I can tell (and from what others have told me). I even went so far as to build kmod-octeon-ethernet.ko as a module so I could unload it. However, if the driver is the issue, then removing the driver would stop the leak, and we have no way to say differently.
Then why not use the damn kmemleak_not_leak() and handle them correctly?
Also, what if the driver just allocates memory and only frees all of it on unload? I.e., a logic error where a buffer is allocated every time but never freed in normal operation, and only freed on driver unload?
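For context on the kmemleak_not_leak() suggestion: the idea is that right after allocating a buffer it is about to hand to the FPA hardware pool, the driver tells kmemleak to stop tracking that object. A hypothetical sketch of the pattern (this is not the actual cvm_oct_mem_fill_fpa code; the real driver's locals and FPA calls differ):

```c
#include <linux/slab.h>
#include <linux/kmemleak.h>

/* Hypothetical refill helper: once the buffer is handed to the FPA
 * hardware pool, no pointer to it is visible to kmemleak, so it must
 * be annotated explicitly or it shows up as "unreferenced object". */
static int fill_one_fpa_buffer(int pool, int size)
{
	void *mem = kmalloc(size, GFP_ATOMIC);

	if (!mem)
		return -ENOMEM;

	kmemleak_not_leak(mem);      /* FPA owns this buffer from here on */
	cvmx_fpa_free(mem, pool, 0); /* hand it to the hardware pool */
	return 0;
}
```

The flip side is that once annotated, a *real* leak on this path would no longer be reported either, which is presumably why the maintainers would want to be sure first.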
Is this something I should be doing, or was that directed at upstream? I'm very far outside my knowledge zone on this, but I don't want to see the Octeon tree die because, frankly, I'm not up to replacing routers that work great (when they aren't leaking like a sieve).
I'm open to testing anything that needs to be done, and will spend whatever time is required, but I had to find out what kmemleak even was and how to use it before sending this to the maintainer.
Both... if the buffer is used like that, then why don't they do it the correct way and flag the memory as not a leak?
The problem is always the same: working out whether it's actually the driver or something else. For sure it's something in the kernel that the driver handles badly.
So not strictly the driver, but a defect in how the driver handles things.
It could well be that something changed in the net code and nobody cared to fix this in the staging driver.
That would also explain why, with the same driver code, it leaks on 5.10 but doesn't leak on 5.4.
Right, I looked at that, but when I asked about bisecting the OpenWrt kernel I got looks of pity and condolences, and I'm not entirely sure how to go about it. One of those "if you have to ask..." kind of things, I guess.
I know there were some krb_free changes (I think?) or something similar, but I couldn't take it any further than that.
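On the bisecting question, the mechanics are less scary than they sound. Here is a toy walk-through of the git bisect workflow in a throwaway repo; on the real kernel tree the endpoints would be v5.4 (good) and v5.10 (bad), and every step means rebuild, reflash, and re-check kmemleak:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
for i in 1 2 3 4 5; do
    echo "step $i" > state.txt
    git add state.txt
    git commit -qm "commit $i"
done
# Pretend HEAD~4 is the good release and HEAD the bad one.
git bisect start HEAD HEAD~4
# git has now checked out a midpoint commit; on real hardware you would
# build and boot it, watch kmemleak, then answer with one of:
#   git bisect good   # leak absent at this commit
#   git bisect bad    # leak present at this commit
log=$(git bisect log)
echo "$log"
git bisect reset >/dev/null
```

With roughly 80,000 commits between two kernel releases, a bisect converges in about 17 build-and-boot cycles; it's tedious but mechanical.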
I think it's more an issue of the arch not being widely used (I mean, I put in a PR for the 5.15 kernel and it involved, what, six patches that haven't really changed since 4.19), and the "serious" devs don't have the hardware or can't test. The SNIC was a good shot at getting interest because it was cheap and plentiful, but the people who can test and the people who have the knowledge just don't intersect.
As you can see below, the memleak isn't there when I turn network and dnsmasq off, but then when I start turning additional services off (dropbear, uhttpd, sysntpd) the memleak reappears?!? What the actual hell?
Fri Apr 1 17:51:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41364 907844 28 16800 891656
Swap: 0 0 0
Fri Apr 1 17:56:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41140 908052 28 16816 891872
Swap: 0 0 0
Fri Apr 1 18:01:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 40840 908344 28 16824 892168
Swap: 0 0 0
Fri Apr 1 18:06:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41060 908120 28 16828 891948
Swap: 0 0 0
Fri Apr 1 18:11:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 40856 908324 28 16828 892152
Swap: 0 0 0
Fri Apr 1 18:16:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41336 907844 28 16828 891672
Swap: 0 0 0
Fri Apr 1 18:21:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41288 907892 28 16828 891720
Swap: 0 0 0
Fri Apr 1 18:26:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41276 907904 28 16828 891732
Swap: 0 0 0
Fri Apr 1 18:31:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41268 907912 28 16828 891740
Swap: 0 0 0
Fri Apr 1 18:36:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41300 907880 28 16828 891708
Swap: 0 0 0
Fri Apr 1 18:41:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41048 908132 28 16828 891960
Swap: 0 0 0
Fri Apr 1 18:46:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41484 907696 28 16828 891524
Swap: 0 0 0
Fri Apr 1 18:51:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41280 907900 28 16828 891728
Swap: 0 0 0
Fri Apr 1 18:56:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41464 907712 28 16832 891540
Swap: 0 0 0
Fri Apr 1 19:01:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41212 907964 28 16832 891792
Swap: 0 0 0
Fri Apr 1 19:06:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41208 907964 28 16836 891796
Swap: 0 0 0
Fri Apr 1 19:11:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41484 907688 28 16836 891520
Swap: 0 0 0
Fri Apr 1 19:16:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41232 907940 28 16836 891772
Swap: 0 0 0
Fri Apr 1 19:21:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41460 907708 28 16840 891540
Swap: 0 0 0
Fri Apr 1 19:26:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41412 907756 28 16840 891588
Swap: 0 0 0
Fri Apr 1 19:31:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41392 907772 28 16844 891604
Swap: 0 0 0
Fri Apr 1 19:36:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41668 907496 28 16844 891328
Swap: 0 0 0
Fri Apr 1 19:41:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41416 907748 28 16844 891580
Swap: 0 0 0
Fri Apr 1 19:46:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41620 907544 28 16844 891376
Swap: 0 0 0
Fri Apr 1 19:51:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41412 907752 28 16844 891584
Swap: 0 0 0
Fri Apr 1 19:56:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41604 907560 28 16844 891392
Swap: 0 0 0
Fri Apr 1 20:01:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41388 907776 28 16844 891608
Swap: 0 0 0
Fri Apr 1 20:06:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 42088 907076 28 16844 890908
Swap: 0 0 0
Fri Apr 1 20:11:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41868 907296 28 16844 891128
Swap: 0 0 0
Fri Apr 1 20:16:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41808 907356 28 16844 891188
Swap: 0 0 0
Fri Apr 1 20:21:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41852 907312 28 16844 891144
Swap: 0 0 0
Fri Apr 1 20:26:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 42036 907124 28 16848 890960
Swap: 0 0 0
Fri Apr 1 20:31:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 42032 907128 28 16848 890964
Swap: 0 0 0
Fri Apr 1 20:36:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41812 907348 28 16848 891184
Swap: 0 0 0
Fri Apr 1 20:41:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41764 907396 28 16848 891232
Swap: 0 0 0
Fri Apr 1 20:46:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 42016 907144 28 16848 890980
Swap: 0 0 0
Fri Apr 1 20:51:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 41776 907380 28 16852 891220
Swap: 0 0 0
Usage: service <service> [command]
/etc/init.d/boot enabled stopped
/etc/init.d/cron enabled stopped
/etc/init.d/dnsmasq disabled stopped
/etc/init.d/done enabled stopped
/etc/init.d/dropbear disabled stopped
/etc/init.d/firewall enabled stopped
/etc/init.d/gpio_switch enabled stopped
/etc/init.d/led enabled stopped
/etc/init.d/log enabled running
/etc/init.d/network disabled stopped
/etc/init.d/odhcpd enabled running
/etc/init.d/rpcd enabled running
/etc/init.d/sysctl enabled stopped
/etc/init.d/sysfixtime enabled stopped
/etc/init.d/sysntpd disabled stopped
/etc/init.d/system enabled stopped
/etc/init.d/ucitrack enabled stopped
/etc/init.d/uhttpd disabled stopped
/etc/init.d/umount enabled stopped
/etc/init.d/urandom_seed enabled stopped
/etc/init.d/urngd enabled running
Fri Apr 1 20:56:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 54876 894272 24 16860 878116
Swap: 0 0 0
Fri Apr 1 21:01:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 54672 894476 24 16860 878320
Swap: 0 0 0
Fri Apr 1 21:06:11 UTC 2022
total used free shared buff/cache available
Mem: 966008 65176 883968 24 16864 867816
Swap: 0 0 0
root@OpenWrt:/#
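The snapshots above came from watching memory on a timer. A loop like the following reproduces that kind of log, shown here reading /proc/meminfo (which free itself uses); the interval and sample count are illustrative, and for a real soak test under iperf traffic you would set INTERVAL=300 and let it run for hours:

```shell
#!/bin/sh
# Snapshot memory counters every INTERVAL seconds, SAMPLES times.
INTERVAL=${INTERVAL:-1}
SAMPLES=${SAMPLES:-3}
i=0
while [ "$i" -lt "$SAMPLES" ]; do
    date -u
    grep -E '^(MemFree|MemAvailable):' /proc/meminfo
    i=$((i + 1))
    sleep "$INTERVAL"
done
```

Redirecting the output to a file (`>> /tmp/memwatch.log`) gives exactly the date-stamped series pasted above.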
Well, it's getting closer, and some folks have been bravely trying to assist.
On 5.15, if I turn off ALL services but networking, it still leaks. However, if I then remove the lan interface and wan6? You get the below.
The current theory from @neg2led is that it's the way UDP packets are handled (or not) for sockets bound to the 0.0.0.0 and :: addresses, which seems to correlate. Hopefully it'll let the smart folks start tracing where it might be. But! BUT! Remember, I had the network running the entire time. Granted, I had to set it static, but I'm not sure we can blame the octeon-ethernet.ko driver exclusively; at the very least, we can narrow it even further.
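To make the 0.0.0.0/:: part concrete: services like dnsmasq and sysntpd typically bind their UDP sockets to the wildcard addresses, so inbound UDP to any local address traverses that socket-lookup path. A small illustration of what such a binding looks like (plain Python, nothing driver-specific):

```python
import socket

# Bind a UDP socket to the IPv4 wildcard address, as dnsmasq/sysntpd do.
s4 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s4.bind(("0.0.0.0", 0))            # wildcard address, kernel-chosen port
addr4 = s4.getsockname()
print("IPv4 wildcard bind:", addr4)

# Same for the IPv6 wildcard; guarded, since IPv6 may be disabled.
try:
    s6 = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    s6.bind(("::", 0))
    print("IPv6 wildcard bind:", s6.getsockname())
    s6.close()
except OSError:
    print("IPv6 unavailable here")
s4.close()
```

On the router, `netstat -ulpn` (or `ss -ulpn`) shows which services hold such wildcard sockets, which is a quick way to correlate the theory with the service on/off experiments above.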
what a joke: for 9 kernel versions the entire page allocation logic was flawed... (fun stuff; it also shows how little the upstream kernel is used on this kind of arch)
My only issue with this is that, at least in 5.15, they are actively updating the code. The e300 was included upstream, so someone is submitting it and someone is accepting it. If they want it to die and be left to those old SDKs, they should do that and properly tell us to bugger off. *shrug*
I think you misunderstood my reply; upstream obviously doesn't want an arch to die, and it's great that somebody is sending stuff upstream.
My point is that people always seem pissed when a regression or bug is found after X releases, and go "why wasn't this found in testing, we need more testing", etc.
They think there must be some huge testing and automation setup for everything, while the truth is that unless somebody notices the bug, it won't get caught; the more obscure the arch, the lower the chance.