Nlbwmon hangs after a few days of use

I had not thought about that, to be honest. Thanks for pointing it out.

I have a SanDisk Cruzer Fit 3.1 attached. I have also been writing the vnstat database to it for more than a year now. I save both the vnstat and nlbwmon databases every single minute. I have not noticed any problem with the drive so far...

If this post is accurate, it looks like all SanDisk flash drives have wear leveling implemented:

Back to the topic...

With 8MB I managed to survive almost 11 days of uptime. As @dan3 suggested, 8MB may be excessive, so I am lowering the parameters to 2MB, since 1MB did not work for me. This way I will try to find the lowest working values, rounded to the nearest MB.

Interesting post on wear leveling, maybe I'm too paranoid about wearing out the flash. I'm using a 64GB lexar drive that should last for decades with basic wear leveling.

In other news, my nlbwmon is down again. GUI says it's unable to fetch statistics, and nlbw -c show results in Error while processing command: Bad file descriptor

logread | grep nlbw returns nothing

Not sure where to go from here. ps does show something weird, the subnets being monitored don't match what I configured... I only want to see 192.168.1.0 but it shows three:
/usr/sbin/nlbwmon -o /mnt/sda1/nlbwmon -b 1048576 -i 4h -r 1m -p /usr/share/nlbwmon/protocols -G 24 -I 1 -L 10000 -Z -s 192.168.1.0/24 -s 192.168.1.1/24 -s fdba:843f:53b5::1/60

I've modified /etc/config/nlbwmon to remove list local_network 'lan'
Now ps shows only my one subnet being monitored, instead of the duplicate IPv4 subnet and the unused IPv6 one. Let's see how long it goes this time....
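
For anyone wondering why the second entry counts as a duplicate: 192.168.1.0/24 and 192.168.1.1/24 describe the same network. Below is a minimal sketch for spotting that, assuming a shell with 64-bit arithmetic (bash, or BusyBox ash as it is usually built). The helper name network_of is made up for illustration and is not part of nlbwmon. It masks an IPv4 CIDR down to its network address so duplicates compare equal. It only catches entries with the same prefix length, and it ignores IPv6.

```shell
# Hypothetical helper: print the network address of an IPv4 CIDR,
# e.g. 192.168.1.1/24 becomes 192.168.1.0/24.
network_of() {
    ip="${1%/*}"      # part before the slash
    bits="${1#*/}"    # prefix length after the slash
    # Split the dotted quad into four octets.
    IFS=. read -r a b c d <<EOF
$ip
EOF
    n=$(( (a << 24) | (b << 16) | (c << 8) | d ))
    # Build the netmask; a /0 prefix masks everything away.
    mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    n=$(( n & mask ))
    echo "$(( n >> 24 & 255 )).$(( n >> 16 & 255 )).$(( n >> 8 & 255 )).$(( n & 255 ))/$bits"
}
```

Running network_of on each IPv4 value passed to -s and piping the results through sort | uniq -d should print any network that appears more than once.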

With

cat /etc/sysctl.conf

net.core.rmem_default=2097152
net.core.wmem_default=2097152
net.core.rmem_max=2097152
net.core.wmem_max=2097152

I managed to survive 12 days of uptime without nlbwmon getting stuck...
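
To double-check which buffer settings a sysctl file actually contains, something like the sketch below may help. The function name buffers_below is hypothetical. It assumes the file uses plain key=value lines like the ones above and prints every net.core rmem/wmem entry whose value is below a given minimum in bytes.

```shell
# Hypothetical helper: list net.core.{r,w}mem_{default,max} entries in FILE
# whose value is lower than MIN (bytes).
# Usage: buffers_below MIN FILE
buffers_below() {
    min="$1"
    file="$2"
    awk -F= -v min="$min" '
        /^[ \t]*net\.core\.[rw]mem_(default|max)[ \t]*=/ {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)   # strip spaces around the key
            gsub(/[ \t]/, "", val)   # strip spaces around the value
            if (val + 0 < min + 0) print key "=" val
        }
    ' "$file"
}
```

For example, buffers_below 2097152 /etc/sysctl.conf prints nothing when all four values are at least 2MB. After editing the file, sysctl -p /etc/sysctl.conf should load the values without a reboot.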

No errors were logged by the cron logread job mentioned earlier.

I am using a 1 Gbit/40 Mbit internet link.

Nice!

Removing list local_network 'lan' has helped. I'm at 17 days with no crash, while I only made it 5 days last time.

All my net.core.* settings are at 1048576

With 2MB values I managed 21 days of uptime without nlbwmon getting stuck. I have to restart the router today for updates. I think 2MB solves the issue in my case. I will leave the cron job in place and keep checking the log for errors. If anything comes up, I will post about it.

I'm now at 32 days with 1MB buffers and removal of duplicate ipv4 subnet and removal of ipv6.

December shows 620GB total data, 32 hosts, and 5 million connections. That's more than I thought. The #1 downloader is the smart TV, while the #1 uploader is my work laptop.

It looks like nlbwmon getting stuck is not related to how long the router has been running without a restart. With 2MB values I got these errors:

Fri Jan 20 04:54:13 2023 daemon.err nlbwmon[2553]: Netlink receive failure: Out of memory
Fri Jan 20 04:54:13 2023 daemon.err nlbwmon[2553]: Unable to dump conntrack: No buffer space available

after just 2 days of uptime, and nlbwmon stopped counting again...

I am starting to doubt whether nlbwmon can be trusted as a reliable source of information.

I am increasing the values to 4MB and will keep logging errors.
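
For reference, here is a rough sketch of how a cron job could detect this failure. The function name nlbwmon_pid_failed is invented for illustration, and the patterns are just the two messages from the log lines quoted in this thread. Matching on the PID means that old errors from an earlier instance are ignored once the daemon has been restarted.

```shell
# Hypothetical helper: succeed (exit 0) when the log text on stdin contains
# one of the known error messages from the nlbwmon process with the given PID.
# Usage: logread | nlbwmon_pid_failed PID
nlbwmon_pid_failed() {
    pid="$1"
    grep -E "nlbwmon\[$pid\]: (Netlink receive failure|Unable to dump conntrack)" >/dev/null
}
```

On the router this could be used as logread | nlbwmon_pid_failed "$(pidof nlbwmon)" && /etc/init.d/nlbwmon restart, assuming a single nlbwmon process and the stock init script. Whether a restart keeps the counters accurate is a separate question.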

Did you try my solution in my Dec 22 post? I was failing every few days, but have not failed since Dec 22.

My last reply here unless someone needs help. Now at 3+ months stable without a restart.

I lowered the values to 1048576 and removed the duplicate interface from the config file. nlbwmon is now at 28 days of uptime and still counting. This seems to be the real solution to the problem so far...

Thanks dan3 for your solution.

I'm currently testing the proposed solution of removing list local_network 'lan' from /etc/config/nlbwmon.
nlbwmon stopped updating the counters in less than 12 hours.
This is the config:

config nlbwmon
	option netlink_buffer_size '1048576'
	option commit_interval '24h'
	option database_directory '/var/lib/nlbwmon'
	option database_generations '10'
	option database_interval '1'
	option database_limit '10000'
	option protocol_database '/usr/share/nlbwmon/protocols'
	option refresh_interval '5m'
	list local_network '192.168.0.0/16'

I do not know if it matters, but I removed the duplicate entry with the IP address and kept the one with the interface name.

The reason is that when you add a new interface for counting (WAN, for example, just as a test), it is added by interface name, not by IP address/mask.

But I really have no clue whether that is relevant...

Can you show your nlbwmon config?

There you go:

config nlbwmon
        option netlink_buffer_size '1048576'
        option database_interval '1'
        option protocol_database '/usr/share/nlbwmon/protocols'
        option database_limit '0'
        option database_generations '0'
        option commit_interval '60s'
        option database_directory '/mnt/usb/nlbwmon'
        option refresh_interval '60'
        list local_network 'lan'

Oh, I forgot to mention that I survived 40 days of uptime and nlbwmon was still counting just before reboot due to updates.

I don't understand how your config list local_network 'lan' follows this

Hi All,

OK, a few things. If your nlbwmon is crashing, the first thing to do is to see what parameters it was started with.

ps | grep nlbwmon

Look at the -s parameters. If you have multiple monitored networks with overlapping address spaces, you should remove list local_network 'lan' from /etc/config/nlbwmon.
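
If the command line is long, a small filter can pull out just the -s values. This is only a sketch. The function name list_subnets is made up, and it assumes the arguments are separated by spaces, as in the ps output shown in this thread.

```shell
# Hypothetical helper: read an nlbwmon command line on stdin and print
# every value that follows a -s flag, one per line.
list_subnets() {
    # Put each argument on its own line, then print the one after each "-s".
    tr -s ' ' '\n' | awk 'prev == "-s" { print } { prev = $0 }'
}
```

For example, ps | grep '[n]lbwmon' | list_subnets should print one monitored network per line. More than one line is a hint to check for overlaps.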

As of today I'm at 129 days uptime, and still functional nlbwmon. My router is on a UPS, and yeah, I need to patch. This is my config:

cat /etc/config/nlbwmon

config nlbwmon
        option netlink_buffer_size '1048576'
        option database_interval '1'
        option database_limit '10000'
        option protocol_database '/usr/share/nlbwmon/protocols'
        option database_generations '24'
        option database_directory '/mnt/sda1/nlbwmon'
        option commit_interval '4h'
        option refresh_interval '1m'
        list local_network '192.168.1.0/24'

cat /etc/sysctl.d/12-nlbwmon.conf

net.core.rmem_default=1048576
net.core.wmem_default=1048576
net.core.rmem_max=1048576
net.core.wmem_max=1048576

ps | grep nlb

12378 root      1456 SN   /usr/sbin/nlbwmon -o /mnt/sda1/nlbwmon -b 1048576 -i 4h -r 1m -p /usr/share/nlbwmon/protocols -G 24 -I 1 -L 10000 -Z -s 192.168.1.0/24

After 157 days of uptime, my nlbwmon has crashed.

Fri May  5 08:11:11 2023 daemon.err nlbwmon[12378]: Netlink receive failure: Out of memory
Fri May  5 08:11:11 2023 daemon.err nlbwmon[12378]: Unable to dump conntrack: No buffer space available
Fri May  5 17:13:54 2023 daemon.err nlbwmon[12378]: Netlink receive failure: Object busy
Fri May  5 17:13:54 2023 daemon.err nlbwmon[12378]: Unable to dump conntrack: I/O error

I suppose I should patch, then reboot at least once a month. Ugh.
