nlbwmon hangs after a few days of use

In my case 1MB did not help.

I am currently at almost 6 days of uptime with the parameters set to 8MB. So far so good...

I am also using a cron job to log nlbwmon errors from logread to a file on a USB drive every hour, so I will not miss any.
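
A minimal sketch of what such an hourly logging job could look like. The log path and script location are hypothetical, and the filter assumes the usual "daemon.err nlbwmon[PID]:" line format that logread prints on OpenWrt; the poster's actual job may differ.

```shell
# Sketch only: LOG_FILE is a hypothetical path on the mounted USB drive.
LOG_FILE="${LOG_FILE:-/mnt/usb/nlbwmon-errors.log}"

# Keep only nlbwmon error lines from a syslog stream on stdin.
filter_nlbwmon_errors() {
    grep 'daemon\.err nlbwmon\['
}

# Append the current matches to the log file; meant to be called from cron.
log_nlbwmon_errors() {
    logread | filter_nlbwmon_errors >> "$LOG_FILE"
}

# Example crontab entry (hourly), assuming this file is saved as the
# hypothetical /root/nlbwmon-log.sh and ends with a call to log_nlbwmon_errors:
#   0 * * * * /root/nlbwmon-log.sh
```

Because logread returns the whole ring buffer each time, the same error line may be appended more than once between log rotations.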

No errors after 4 days with 1MB. If it does error out, I'll either find another solution or schedule a nightly restart of nlbwmon. I also have USB storage set up, and it should write to and read from USB storage on restart.

8MB seems excessive for a socket buffer; something is up.

FYI, updates every hour to a basic USB stick might kill it. That's 24 x 30 = 720 writes to the same flash in a typical month. Unless you know it's MLC, it's a risk. I'm pretty sure most USB thumb drives do not have wear leveling.

I use a commit interval of 4 hours to avoid wearing out the flash.
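
The arithmetic behind those numbers can be sketched as a small helper. It assumes a 30-day month and that each commit counts as exactly one write to the drive, which is a simplification.

```shell
# Writes per 30-day month for a commit interval given in minutes.
# Assumption: each commit counts as exactly one write to the drive.
writes_per_month() {
    interval_minutes=$1
    echo $(( 30 * 24 * 60 / interval_minutes ))
}

writes_per_month 60    # hourly commits -> 720
writes_per_month 240   # 4-hour commits -> 180
```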

My experience: 1440 pictures x 5 cameras = 7200 pictures per day on a Sandisk Cruzer. Working since 20.01.2020.

YMMV

That is the kind of use case that cheap USB flash drives excel at. Adding many large files spreads the wear over more flash vs updating the same small file.

I had not thought about that, to be honest. Thanks for pointing that out.

I have a SanDisk Cruzer Fit 3.1 attached. I have also been writing the vnstat database to it for more than a year now. I save both the vnstat and nlbwmon data every single minute. I have not noticed any problems with the drive so far...

If this post is true, then it looks like all SanDisk flash drives have wear leveling implemented:

Back to the topic...

With 8MB I have managed to survive almost 11 days of uptime. As @dan3 suggested, 8MB may be excessive, so I am lowering the parameters to 2MB, since 1MB did not work for me. This way I will try to find the lowest working parameter values, rounded to whole MB.

Interesting post on wear leveling; maybe I'm too paranoid about wearing out the flash. I'm using a 64GB Lexar drive that should last for decades with basic wear leveling.

In other news, my nlbwmon is down again. GUI says it's unable to fetch statistics, and nlbw -c show results in Error while processing command: Bad file descriptor

logread | grep nlbw returns nothing

Not sure where to go from here. ps does show something weird, the subnets being monitored don't match what I configured... I only want to see 192.168.1.0 but it shows three:
/usr/sbin/nlbwmon -o /mnt/sda1/nlbwmon -b 1048576 -i 4h -r 1m -p /usr/share/nlbwmon/protocols -G 24 -I 1 -L 10000 -Z -s 192.168.1.0/24 -s 192.168.1.1/24 -s fdba:843f:53b5::1/60

I've modified /etc/config/nlbwmon to remove list local_network 'lan'
Now ps shows only my one subnet being monitored, instead of the duplicate IPv4 subnet and the unused IPv6 one. Let's see how long it goes this time....
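
A rough sketch of how the check and the change described above could be scripted. It assumes the settings live in the first "config nlbwmon" section of /etc/config/nlbwmon; both function names are made up for illustration.

```shell
# Print each subnet that follows a -s flag on a command line read from stdin.
list_monitored_subnets() {
    tr ' ' '\n' | awk 'prev == "-s" { print } { prev = $0 }'
}

# Sketch of the config change, assuming the first "config nlbwmon" section.
# Not called here; it only makes sense on the router itself.
remove_lan_network() {
    uci del_list "nlbwmon.@nlbwmon[0].local_network=lan"
    uci commit nlbwmon
    /etc/init.d/nlbwmon restart
}

# On the router, the running daemon's arguments can be read from /proc:
#   tr '\0' ' ' < "/proc/$(pidof nlbwmon)/cmdline" | list_monitored_subnets
```

Fed the command line shown above, list_monitored_subnets prints the three subnets, one per line.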

With

cat /etc/sysctl.conf

net.core.rmem_default=2097152
net.core.wmem_default=2097152
net.core.rmem_max=2097152
net.core.wmem_max=2097152

I have managed to survive 12 days of uptime without nlbwmon getting stuck...

No errors were logged by the cron logread job mentioned earlier.

I am using a 1 Gbit/40 Mbit internet link.
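
For context, a receive-buffer request made with the ordinary SO_RCVBUF socket option is capped by the kernel at net.core.rmem_max. Assuming nlbwmon requests its netlink buffer that way (not verified here), a netlink_buffer_size larger than rmem_max would not take full effect. A minimal sketch of that comparison, with a hypothetical helper name:

```shell
# Succeeds when the requested netlink buffer does not exceed the kernel limit.
# $1: requested size in bytes (nlbwmon's netlink_buffer_size / -b value)
# $2: limit in bytes (net.core.rmem_max)
buffer_fits() {
    [ "$1" -le "$2" ]
}

# On the router, compare the configured size with the live kernel limit:
#   buffer_fits 1048576 "$(cat /proc/sys/net/core/rmem_max)" || echo "raise net.core.rmem_max"
# After editing /etc/sysctl.conf, reload it without rebooting:
#   sysctl -p /etc/sysctl.conf
```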

Nice!

Removing list local_network 'lan' has helped. I'm at 17 days with no crash, while I only made it 5 days last time.

All my net.core.* settings are at 1048576

With the 2MB values I have managed to survive 21 days of uptime without nlbwmon getting stuck. I have to restart the router today because of updates. I think 2MB solves the issue in my case. I will leave the cron job in place and keep checking the log for any errors. If something comes up, I will post about it.

I'm now at 32 days with 1MB buffers, the duplicate IPv4 subnet removed, and IPv6 removed.

December shows 620GB total data, 32 hosts, and 5 million connections. That's more than I thought. The #1 downloader is the smart TV, while the #1 uploader is my work laptop.

It looks like nlbwmon getting stuck is not related to how long the router has been running without a restart. With the 2MB values I got these errors:

Fri Jan 20 04:54:13 2023 daemon.err nlbwmon[2553]: Netlink receive failure: Out of memory
Fri Jan 20 04:54:13 2023 daemon.err nlbwmon[2553]: Unable to dump conntrack: No buffer space available

after just 2 days of uptime, and nlbwmon stopped counting again...

I am starting to doubt whether nlbwmon can be trusted as a reliable source of information.

I am increasing the values to 4MB and will keep logging errors.
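
One possible way to react to the two errors quoted above automatically is to check the log for them and restart nlbwmon when they appear. This is only a sketch with a made-up function name, not something anyone in this thread reports using.

```shell
# Succeeds if the syslog stream on stdin contains one of the nlbwmon
# netlink errors quoted above.
nlbwmon_buffer_error() {
    grep -q -e 'nlbwmon\[[0-9]*\]: Netlink receive failure' \
            -e 'nlbwmon\[[0-9]*\]: Unable to dump conntrack'
}

# Possible cron usage on the router (restarts only when an error was logged):
#   logread | nlbwmon_buffer_error && /etc/init.d/nlbwmon restart
```

Since logread keeps old entries until the buffer rotates, a job like this would restart nlbwmon on every run while the old error is still in the log.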

Did you try my solution in my Dec 22 post? I was failing every few days, but have not failed since Dec 22.

My last reply here unless someone needs help. Now at 3+ months stable without a restart.

I lowered the values to 1048576 and removed the duplicate interface from the config file. I am now at 28 days of uptime and nlbwmon is still counting. This seems to be the real solution to the problem so far...

Thanks dan3 for your solution.

I'm currently testing the proposed solution of removing list local_network 'lan' from /etc/config/nlbwmon.
nlbwmon stopped updating the counters in less than 12 hours.
This is the config:

config nlbwmon
	option netlink_buffer_size '1048576'
	option commit_interval '24h'
	option database_directory '/var/lib/nlbwmon'
	option database_generations '10'
	option database_interval '1'
	option database_limit '10000'
	option protocol_database '/usr/share/nlbwmon/protocols'
	option refresh_interval '5m'
	list local_network '192.168.0.0/16'

I do not know if it matters, but I removed the duplicate entry with the IP address and kept the one with the interface name.

The reason is that when you add a new interface for counting (WAN, for example, just as a test), it is added by interface name, not by IP address/mask.

But I really have no clue whether that is relevant...

Can you show your nlbwmon config?

There you go:

config nlbwmon
        option netlink_buffer_size '1048576'
        option database_interval '1'
        option protocol_database '/usr/share/nlbwmon/protocols'
        option database_limit '0'
        option database_generations '0'
        option commit_interval '60s'
        option database_directory '/mnt/usb/nlbwmon'
        option refresh_interval '60'
        list local_network 'lan'