VXLAN-related memory leak?

Hello,

I have hardware that is not officially supported, a D-LINK DIR-2150 A1, but I don't believe it's a hardware problem, because the router has been running for a while without any issues, thanks to @Lucky1's great work. The problem seems to be related to the VXLAN part.

I decided to configure a separate SSID for my IoT devices. This SSID is attached to a bridge together with a VXLAN tunnel to the homelab. Since the image I built myself a while ago didn't have VXLAN support, I built a new one from yesterday's snapshot (12 Oct 2023) and configured it. Now, about a day later, the router crashed.

It seems that the cause of the crash is a memory leak. The LibreNMS memory utilization graph went like this:

The last entries in the syslog before the crash looked like this:

[86348.987973] ubi0 warning: ubi_io_read: error -77 (ECC error) while reading 4096 bytes from PEB 196:32768, read only 4096 bytes, retry

Apparently something is fishy here, and I don't really know what. Is there anything obviously wrong?

Here's a uci export:

I highly doubt that the kernel VXLAN module is to blame. VXLAN has been in the mainline kernel since 3.7 (2012), and it is used around the globe in nearly every cloud environment. VXLAN is also a dead simple UDP encapsulation, so it should never cause high memory consumption...

TL;DR: spend your time digging somewhere else. I would bet a couple of beers that this is not VXLAN-related, at least not the kernel module. If you use some fancy helper tooling, then maybe, but in that case I would consider that tool badly broken.
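To put a number on how thin that encapsulation is, here is a small sketch of the per-packet overhead. It assumes an IPv4 underlay without IP options and untagged inner frames; the function names are made up for illustration.

```shell
#!/bin/sh
# Per-packet overhead that VXLAN adds on an IPv4 underlay, in bytes.
# Assumptions: no IP options on the outer header, no VLAN tag inside.
OUTER_IPV4=20   # outer IPv4 header
OUTER_UDP=8     # outer UDP header
VXLAN_HDR=8     # VXLAN header carrying the 24-bit VNI
INNER_ETH=14    # encapsulated Ethernet header

# Print the total encapsulation overhead.
vxlan_overhead() {
    echo $((OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETH))
}

# Print the largest inner IP packet that fits an underlay MTU given as $1.
vxlan_inner_mtu() {
    echo $(( $1 - $(vxlan_overhead) ))
}

vxlan_inner_mtu 1500
```

With a 1500-byte underlay MTU this leaves 1450 bytes for the inner packet. The overhead is fixed per packet, which fits the point that the encapsulation itself has little reason to accumulate memory.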

The thing is, the only change from the previous state, other than the newer snapshot, was the VXLAN tunnel. I had virtually the same setup before, but the IoT traffic was trunked via another VLAN into OPNsense, which terminated the tunnel. Unfortunately, that didn't really work for some reason.

Is there a tool that can tell me which process is taking all the memory?

atop/top seem rather lacking, and free only offers a summary. It seems to be happening again after the reboot.

Edit: The only helper, as far as I know, is LuCI, which I used to configure all this.
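One way to answer the "which process" question without extra packages is to read VmRSS from /proc directly. The following is a minimal sketch, assuming a POSIX shell with awk and sort available (BusyBox provides both); the function name `list_rss` and its optional argument are made up for illustration.

```shell
#!/bin/sh
# Print "RSS_kB PID NAME" for every process, largest resident set first.
# $1 is an optional proc root (defaults to /proc); the parameter exists
# only so the function can be pointed at a test directory.
list_rss() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        pid="${status%/status}"
        pid="${pid##*/}"
        # Kernel threads have no VmRSS line and are skipped.
        awk -v pid="$pid" '
            /^Name:/  { name = $2 }
            /^VmRSS:/ { rss = $2 }
            END { if (rss != "") printf "%d %s %s\n", rss, pid, name }
        ' "$status" 2>/dev/null
    done | sort -rn
}

# Show the ten largest processes.
list_rss | head -n 10
```

Running this a few hours apart and comparing the top entries should show whether a single userspace process keeps growing. If nothing in the list grows while free memory still shrinks, the leak is more likely on the kernel side.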

Since you use a snapshot, could there have been a recent change (a WiFi driver or the like) that produces the memory leak?

There is https://www.kernel.org/doc/html/v4.14/dev-tools/kmemleak.html but you have to recompile your image.
I would first check the bug tracker and/or try a newer snapshot or one of the current release candidates.
If you only use the LuCI "plugin" to configure the VXLAN interface, I would not blame it :person_shrugging:
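Before rebuilding an image for kmemleak, a cheaper first check is whether kernel memory is growing at all. This is a rough sketch that logs a few /proc/meminfo fields; the function names are made up, and reading steady growth in Slab or SUnreclaim as a sign of a kernel-side leak is a rule of thumb, not proof.

```shell
#!/bin/sh
# Print the value in kB of field $1 from a meminfo-style file.
# $2 is an optional file path (defaults to /proc/meminfo), used for testing.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}" 2>/dev/null
}

# Print one timestamped line with available memory and kernel slab usage.
log_kernel_mem() {
    echo "$(date +%s) MemAvailable=$(meminfo_kb MemAvailable)" \
         "Slab=$(meminfo_kb Slab) SUnreclaim=$(meminfo_kb SUnreclaim)"
}

log_kernel_mem
```

If MemAvailable keeps dropping while Slab and SUnreclaim stay flat, the growth is probably in a userspace process rather than in a kernel module.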

I think I found the culprit. It's the SNMP daemon. It used 20 MB of RAM, and restarting it caused this dent in the memory usage:

I will try reconfiguring it, replacing it with the -mini variant, or simply restarting it periodically via cron, and I will post whatever helps.

Thank you for your help


OK, periodic snmpd restarts fixed it :frowning:

Done via
crontab -e

2 0 * * * /etc/init.d/snmpd restart

on Saturday around noon. Since then, memory usage has not been rising visibly.

It's a rather crude solution and the root cause is still unknown, but it works.
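A slightly less crude variant, sketched here as an untested assumption rather than what was actually deployed, is to restart snmpd only when its resident memory crosses a limit. The threshold and the function names are made up for illustration; only `/etc/init.d/snmpd restart` comes from the setup above.

```shell
#!/bin/sh
# Restart snmpd only when its resident memory exceeds a limit.
LIMIT_KB=15360   # hypothetical threshold (15 MB), tune to the device

# Print VmRSS in kB for PID $1; prints nothing if the process is gone.
# $2 is an optional proc root (defaults to /proc), used for testing.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "${2:-/proc}/$1/status" 2>/dev/null
}

# Succeed when PID $1 uses more than $2 kB of resident memory.
over_limit() {
    rss=$(rss_kb "$1" "$3")
    [ -n "$rss" ] && [ "$rss" -gt "$2" ]
}

# Look up snmpd and restart it through its init script if it is too large.
check_snmpd() {
    pid=$(pidof snmpd 2>/dev/null | awk '{ print $1 }')
    [ -n "$pid" ] || return 0
    if over_limit "$pid" "$LIMIT_KB"; then
        logger -t snmpd-watch "RSS above ${LIMIT_KB} kB, restarting snmpd"
        /etc/init.d/snmpd restart
    fi
}

check_snmpd
```

Saved as a script and run from cron every few minutes, this would leave snmpd alone while it behaves and log each forced restart, which also gives a rough record of how fast the leak grows.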

Did you find a similar bug report for snmpd? If not, please file one :smiley:

They currently have 12 open issues for memory leaks on their GitHub :sweat_smile:

Oh dear, oh dear.

