nlbwmon stops gathering data

I've been using OpenWrt for years now, as well as nlbwmon. I'm currently running OpenWrt 21.02.2 r16495-bf0c965af0 on an x86-64 PC.

nlbwmon has started getting into a state where it no longer gathers new statistics. The process is running, but no host shows increased traffic, even if I do something like a speed test or a 6 GB Steam download.

If I run /etc/init.d/nlbwmon restart, the statistics start gathering again, but usually within an hour it stops.

I've considered putting the restart into cron and restarting every 15 minutes or so, but that seems like an awfully large hammer.

Any suggestions on where I might look to figure this out?

I know that nlbwmon clears counters when it gets stats. I had been playing with collectd too, and it occurred to me that the two could be stomping on each other (although I swear I'd used both together in the past). I've turned off collectd (disabled and stopped). I don't know whether any other package could also be stomping on the counters that nlbwmon uses.
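To check whether anything besides nlbwmon currently has a netfilter netlink socket open, one option is to read /proc/net/netlink: the Eth column is the netlink protocol number (12 is NETLINK_NETFILTER) and Drops counts messages the socket had to discard. A minimal sketch, assuming the usual column layout of that file (sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode). The Pid column is really the netlink port ID, which usually, but not always, equals the PID of the process that opened the socket. The function name `netfilter_sockets` is made up for this example.

```shell
#!/bin/sh
# List netlink sockets of protocol 12 (NETLINK_NETFILTER) with their port ID,
# the matching process name (if any) and the Drops counter.
# Assumed /proc/net/netlink layout:
#   sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
netfilter_sockets() {
	table="${1:-/proc/net/netlink}"
	# Skip the header line, keep protocol 12 only, print "Pid Drops"
	awk 'NR > 1 && $2 == 12 { print $3, $9 }' "$table" |
	while read -r portid drops; do
		# The port ID usually equals the PID of the socket's owner;
		# kernel sockets (port 0) and other IDs show up as "?"
		name=$(cat "/proc/$portid/comm" 2>/dev/null || echo "?")
		echo "portid=$portid process=$name drops=$drops"
	done
}

netfilter_sockets
```

A drops value that keeps growing on nlbwmon's own socket would point at its receive buffer overflowing rather than at another package interfering.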

Thank you for any thoughts/input.


Check logread for nlbwmon messages. Things to consider:

  • disable flow offloading: offloaded traffic bypasses the normal kernel path, so nlbwmon doesn't see it
  • netlink_buffer_size (/etc/config/nlbwmon) probably needs to be increased (you should see a message about this in logread)
  • the net.core.rmem_max sysctl value probably needs to be bumped (/etc/sysctl.d/XXX.conf) correspondingly.
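The three checks above can be run in one go from an SSH session. A minimal sketch; the UCI option names (`flow_offloading` and `flow_offloading_hw` in the firewall defaults section, `netlink_buffer_size` in the nlbwmon section) are the ones I believe OpenWrt 21.02 uses, and treating an unset offloading option as "off" is an assumption. The function name `nlbwmon_diag` is made up for this example.

```shell
#!/bin/sh
# Run the checks from the list above on the router.
# Assumed UCI option names: firewall.@defaults[0].flow_offloading,
# firewall.@defaults[0].flow_offloading_hw and nlbwmon.@nlbwmon[0].netlink_buffer_size
nlbwmon_diag() {
	echo "--- nlbwmon lines in logread ---"
	logread | grep -i nlbwmon

	# An unset offloading option is assumed to mean "off"
	sw=$(uci -q get 'firewall.@defaults[0].flow_offloading')
	hw=$(uci -q get 'firewall.@defaults[0].flow_offloading_hw')
	echo "flow offloading: software=${sw:-0} hardware=${hw:-0}"

	# Configured nlbwmon buffer next to the kernel's receive buffer limit
	buf=$(uci -q get 'nlbwmon.@nlbwmon[0].netlink_buffer_size')
	echo "netlink_buffer_size=${buf:-unset} rmem_max=$(sysctl -n net.core.rmem_max)"
}

nlbwmon_diag
```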

Thank you for the response. I have software/hardware offloading off, primarily because I wanted to make sure SQM was working. I might be able to turn software offloading on, but on this hardware, it doesn't really matter, and so I've left that all off.

logread|grep nlbwmon shows nothing, even if I restart nlbwmon. That's part of my problem: I see no logging/info for nlbwmon at all. I'm not aware of any config info for nlbwmon that enables logging.

I read through logread to see if nlbwmon messages were being logged without "nlbwmon" in them. I didn't see anything. In fact, if I do a logread | grep -v dnsmasq, there are only 13 lines: dropbear and upstream DHCP renewal. That's it.

The only nlbwmon settings I have changed are the refresh settings. netlink_buffer_size is the default. I have plenty of RAM on this box, so I could increase it, but your comment about seeing a message in logread makes me think this isn't the problem, as there aren't any messages.

net.core.rmem_max = 212992

Is that too low? It is the default, no setting anywhere for that.
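Some arithmetic that may help answer this: for a plain SO_RCVBUF request, socket(7) documents that the maximum is net.core.rmem_max and that the kernel doubles the value it stores (the doubled value is what getsockopt reports); a privileged process can bypass the cap with SO_RCVBUFFORCE. Whether nlbwmon uses one or the other is an assumption I haven't verified against its source. A minimal sketch of the calculation for the values in this thread; the function name `effective_rcvbuf` is made up for this example.

```shell
#!/bin/sh
# Effective receive buffer for a plain SO_RCVBUF request (see socket(7)):
# the kernel caps the request at net.core.rmem_max, then doubles it.
# The small minimum floor the kernel also applies is ignored here.
effective_rcvbuf() {
	req=$1
	rmem_max=$2
	if [ "$req" -gt "$rmem_max" ]; then
		echo "warning: $req is above rmem_max ($rmem_max), capping" >&2
		req=$rmem_max
	fi
	echo $((req * 2))
}

# Default nlbwmon buffer against the default rmem_max on this box
effective_rcvbuf 524288 212992    # prints 425984, plus a warning on stderr
# A 1 MiB request with rmem_max raised to match
effective_rcvbuf 1048576 1048576  # prints 2097152
```

So with the defaults, a 524288-byte request would be cut back to 212992 before doubling, unless the forced variant is used.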

nlbwmon is a binary. I guess the next step is to read the source code to see how it is logging/what I

Here's /etc/config/nlbwmon:

config nlbwmon
        option netlink_buffer_size '524288'
        option refresh_interval '30s'
        option database_directory '/var/lib/nlbwmon'
        option database_interval '1'
        option protocol_database '/usr/share/nlbwmon/protocols'
        option commit_interval '15m'
        list local_network '192.168.0.0/16'
        list local_network '172.16.0.0/12'
        list local_network '10.0.0.0/8'
        list local_network 'lan'
        option database_limit '0'
        option database_generations '0'

The only thing in the crontab is a call to acme.

This is probably noise, but here is the process table after the initial block of [] kernel threads. I was worried that something else was running, but I don't see anything there that I don't recognize or can't account for.

ubus      1648     1  0 Apr08 ?        00:00:51 /sbin/ubusd
root      1649     1  0 Apr08 ttyS0    00:00:00 /sbin/askfirst /usr/libexec/login.sh
root      1650     1  0 Apr08 tty1     00:00:00 /sbin/askfirst /usr/libexec/login.sh
root      1680     1  0 Apr08 ?        00:03:04 /sbin/urngd
root      1889     2  0 Apr08 ?        00:00:00 [ixgbe]
logd      2317     1  0 Apr08 ?        00:00:00 /sbin/logd -S 64
root      2369     1  0 Apr08 ?        00:00:22 /sbin/rpcd -s /var/run/ubus/ubus.sock -t 30
root      3165     1  0 Apr08 ?        00:00:00 /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
root      3275     1  0 Apr08 ?        00:01:04 /sbin/netifd
root      3332     1  0 Apr08 ?        00:02:26 /usr/sbin/odhcpd
root      3469     2  0 Apr08 ?        00:00:00 [kworker/0:2-events]
root      3491     1  0 Apr08 ?        00:00:01 /usr/sbin/crond -f -c /etc/crontabs -l 5
root      3742     2  0 Apr08 ?        00:00:00 [wg-crypt-vpn]
root      5016     1  0 Apr08 ?        00:00:05 /sbin/blockd
root      5982     1  0 Apr08 ?        00:00:00 /usr/sbin/ntpd -n -N -S /usr/sbin/ntpd-hotplug -p 192.168.10.240
root      7278  3275  0 Apr08 ?        00:00:00 udhcpc -p /var/run/udhcpc-eth1.pid -s /lib/netifd/dhcp.script -f -t 0 -i eth1 -x hostname:guardhouse -C -R -O 121
root      7279  3275  0 Apr08 ?        00:00:32 odhcp6c -s /lib/netifd/dhcpv6.script -Ntry -P56 -t120 eth1
dnsmasq   8858     1  0 Apr08 ?        00:00:28 /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
root      9244     1  0 00:00 ?        00:00:01 /usr/sbin/uhttpd -f -h /www -r guardhouse -x /cgi-bin -u /ubus -t 60 -T 30 -k 20 -A 1 -n 3 -N 100 -R -p 0.0.0.0:80 -p [::]:80 -C /etc/acme/guardhouse.kevbo.o
root     12106  3165  0 19:51 ?        00:00:00 /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
root     12112 12106  0 19:51 pts/0    00:00:00 -ash
root     13110     2  0 12:17 ?        00:00:14 [kworker/u8:0-events_unbound]
root     13746     2  0 20:02 ?        00:00:00 [kworker/u8:1-events_unbound]
root     13753  3165  0 20:02 ?        00:00:00 /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
root     13800 13753  0 20:02 pts/1    00:00:00 -ash
root     13970 14749  0 20:03 ?        00:00:00 sleep 600
root     14312 17383  0 20:06 ?        00:00:00 sleep 600
root     14535 16769  0 20:07 ?        00:00:00 sleep 600
root     14537 17056  0 20:07 ?        00:00:00 sleep 600
root     14613 16451  0 20:07 ?        00:00:00 sleep 600
root     14690 16109  0 20:07 ?        00:00:00 sleep 600
root     14749     1  0 Apr08 ?        00:00:10 /bin/sh /usr/lib/ddns/dynamic_dns_updater.sh -v 0 -S gandi_comshackkevboorg -- start
root     14757     2  0 20:07 ?        00:00:00 [kworker/u8:2-events_unbound]
root     14934     1  0 20:08 ?        00:00:00 /usr/sbin/nlbwmon -o /var/lib/nlbwmon -b 524288 -i 15m -r 30s -p /usr/share/nlbwmon/protocols -G 0 -I 1 -L 0 -Z -s 192.168.0.0/16 -s 172.16.0.0/12 -s 10.0.0.
root     15165 19554  0 20:10 ?        00:00:00 sleep 60
root     15187 12112  0 20:11 pts/0    00:00:00 ps -ef
root     16109     1  0 Apr08 ?        00:00:10 /bin/sh /usr/lib/ddns/dynamic_dns_updater.sh -v 0 -S gandi_kevboorg -- start
root     16451     1  0 Apr08 ?        00:00:10 /bin/sh /usr/lib/ddns/dynamic_dns_updater.sh -v 0 -S gandi_mkwhitecom -- start
root     16769     1  0 Apr08 ?        00:00:10 /bin/sh /usr/lib/ddns/dynamic_dns_updater.sh -v 0 -S gandi_kayleighwhitecom -- start
root     17056     1  0 Apr08 ?        00:00:10 /bin/sh /usr/lib/ddns/dynamic_dns_updater.sh -v 0 -S gandi_sophiawhiteorg -- start
root     17383     1  0 Apr08 ?        00:00:10 /bin/sh /usr/lib/ddns/dynamic_dns_updater.sh -v 0 -S gandi_sophiwhiteorg -- start
root     19554     1  0 Apr08 ?        00:01:20 /bin/sh /usr/lib/ddns/dynamic_dns_updater.sh -v 0 -S gandi_guardhousekevboorg -- start

Personally I'm using net.core.rmem_max=1048576 and option netlink_buffer_size 1048576.


I tripled the buffer size (3 × 524288 = 1572864), as @KONG recommends, and it works fine.

Copy and paste these commands into your SSH session; this may fix the problem:

# Increase the buffer size in nlbwmon and save it
uci set nlbwmon.@nlbwmon[0].netlink_buffer_size="1572864"
uci commit nlbwmon

# Increase the maximum socket buffer sizes (appended, so existing settings are kept)
cat << "EOF" >> /etc/sysctl.conf
net.core.rmem_max=1572864
net.core.wmem_max=1572864
EOF

# Apply the sysctl values before restarting, so nlbwmon starts with the new limit
sysctl -p
/etc/init.d/nlbwmon restart

I increased the buffer size to 1048576 (from slh's post), but didn't increase rmem_max. From reading nlbwmon's source code, it appears that it should print a warning if it tries to get memory that it can't get, and I wanted to see whether I got that warning.

I did not. nlbwmon also stopped updating stats again, with the increased buffer size.

I have now done everything suggested by elan (including the even larger setting of 1572864). sysctl -a shows the new rmem_max and wmem_max settings. We'll see.

I think I'm going to need to actually set up a dev environment so I can recompile nlbwmon and add some more debugging output to see what's going on. The lack of any information at all from nlbwmon is making this difficult. It appears to just print to stderr, which I assume means the output should end up in syslog/logread, but I'm not seeing anything there.
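Before recompiling, it may be worth running nlbwmon by hand so that anything it writes to stderr lands in the terminal. As far as I know, procd only passes a service's stderr on to logd when the init script sets `procd_set_param stderr 1`; I haven't checked whether nlbwmon's init script does. A minimal sketch, assuming nlbwmon stays in the foreground when started directly (procd-managed daemons normally do). It reads the full arguments from /proc/<pid>/cmdline, which avoids the truncation in the ps listing. The function name `cmdline_of` is made up for this example.

```shell
#!/bin/sh
# Turn a /proc/<pid>/cmdline file (NUL-separated arguments) into one line.
cmdline_of() {
	tr '\0' ' ' < "$1" | sed 's/ $//'
}

# Take the first PID if more than one is reported
pid=$(pidof nlbwmon | cut -d' ' -f1)
if [ -n "$pid" ]; then
	cmd=$(cmdline_of "/proc/$pid/cmdline")
	echo "starting in the foreground: $cmd"
	/etc/init.d/nlbwmon stop
	# Left unquoted on purpose so it splits into arguments;
	# this assumes none of nlbwmon's arguments contain spaces
	$cmd
fi
```

Stop it with Ctrl-C when done and run /etc/init.d/nlbwmon start to hand it back to procd.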


I just restart the nlbwmon service every hour. There's a small discrepancy compared to vnstat, but not much.

I placed this line in Scheduled Tasks:

0 * * * * service nlbwmon restart
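A less blunt variant is to restart only when the counters have stopped moving. A minimal sketch, assuming the `nlbw` client's `-c csv` command dumps the current accounting table (I'm going from memory here, so check the client's help output on your build); the script path /root/nlbwmon-watchdog.sh and the state file are hypothetical. It would also restart on a network that genuinely had no traffic between two runs, which is harmless, and the cron interval should be well above refresh_interval.

```shell
#!/bin/sh
# Hypothetical conditional restart for nlbwmon, e.g. saved as
# /root/nlbwmon-watchdog.sh and run from cron instead of a blind restart.
# Assumes `nlbw -c csv` dumps the current accounting table: if two consecutive
# runs return identical output, the counters are treated as stalled.

STATE=/tmp/nlbwmon-watchdog.md5   # hypothetical state file (tmpfs, cleared on reboot)

restart_nlbwmon() {
	logger -t nlbwmon-watchdog "counters unchanged, restarting nlbwmon"
	/etc/init.d/nlbwmon restart
}

check_counters() {
	# Do nothing if the client fails or returns nothing
	data=$(nlbw -c csv 2>/dev/null) || return 0
	[ -n "$data" ] || return 0

	# Compare a checksum of the table with the one from the previous run
	current=$(printf '%s\n' "$data" | md5sum | cut -d' ' -f1)
	previous=$(cat "$STATE" 2>/dev/null)
	echo "$current" > "$STATE"

	if [ "$current" = "$previous" ]; then
		restart_nlbwmon
	fi
}

check_counters
```

Scheduled with something like `*/15 * * * * /bin/sh /root/nlbwmon-watchdog.sh`, this only restarts nlbwmon when two checks 15 minutes apart see exactly the same table.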

I've had this problem with nlbwmon intermittently stopping data collection.
On my most recent router config (WRT32X with the 22170 snapshot), it was working without the suggested mods until I enabled irqbalance. I don't know whether that was coincidental, and I'm now doing some other checks. Is it possible that irqbalance is a culprit in this nlbwmon problem?