Nlbwmon 100% cpu usage

@jow after about a day of uptime, nlbwmon goes to 100% CPU usage on my WRT3200ACM and the status page shows nothing.
After a service restart the status page works again. Is there a debug mode I can enable?


Can you capture some log lines with strace -p $(pidof nlbwmon) while it is using high CPU resources? That might help pin the problem down.

It looks something like this, but it only happens after a reboot; once I restart the process, the problem does not occur anymore.

Lots of these (and I mean a lot!):

recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)

and in between:

recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000001}, msg_namelen=12, msg_iov=[{iov_base={{len=148, type=0x100 /* NLMSG_??? */, flags=0x600 /* NLM_F_??? */, seq=0, pid=0}, "\x02\x00\x00\x00\x34\x00\x01\x80\x14\x00\x01\x80\x08\x00\x01\x00\xc0\xa8\x7b\x76\x08\x00\x02\x00\x18\xc5\x8d\xf5\x1c\x00\x02\x80"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000004}, msg_namelen=12, msg_iov=[{iov_base={{len=196, type=0x102 /* NLMSG_??? */, flags=0, seq=0, pid=0}, "\x02\x00\x00\x00\x34\x00\x01\x80\x14\x00\x01\x80\x08\x00\x01\x00\x2d\x20\xa5\x56\x08\x00\x02\x00\x4f\x77\x61\xba\x1c\x00\x02\x80"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 196
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000001}, msg_namelen=12, msg_iov=[{iov_base={{len=148, type=0x100 /* NLMSG_??? */, flags=0x600 /* NLM_F_??? */, seq=0, pid=0}, "\x02\x00\x00\x00\x34\x00\x01\x80\x14\x00\x01\x80\x08\x00\x01\x00\xc0\xa8\x7b\x76\x08\x00\x02\x00\x82\xcc\x59\x05\x1c\x00\x02\x80"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148

Same thing happened on my R7800.

I didn't see this thread, but next time it happens, I will try doing what @jow said.

Fwiw, here is some evidence of high usage.


As lucize posted, it is mostly a lot of this (primarily the failing recvmsg calls):

recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000001}, msg_namelen=12, msg_iov=[{iov_base={{len=196, type=0x100 /* NLMSG_??? */, flags=0x600 /* NLM_F_??? */, seq=0, pid=0}, "\x0a\x00\x00\x00\x4c\x00\x01\x80\x2c\x00\x01\x80\x14\x00\x03\x00\xfd\x2d\x5d\x2a\xa3\x0a\x00\x00\x6d\x78\x5e\x63\x6e\xed\x3d\x71"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 196
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000001}, msg_namelen=12, msg_iov=[{iov_base={{len=196, type=0x100 /* NLMSG_??? */, flags=0x600 /* NLM_F_??? */, seq=0, pid=0}, "\x0a\x00\x00\x00\x4c\x00\x01\x80\x2c\x00\x01\x80\x14\x00\x03\x00\xfd\x2d\x5d\x2a\xa3\x0a\x00\x00\x6d\x78\x5e\x63\x6e\xed\x3d\x71"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 196
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000001}, msg_namelen=12, msg_iov=[{iov_base={{len=148, type=0x100 /* NLMSG_??? */, flags=0x600 /* NLM_F_??? */, seq=0, pid=0}, "\x02\x00\x00\x00\x34\x00\x01\x80\x14\x00\x01\x80\x08\x00\x01\x00\x05\x43\x15\x0b\x08\x00\x02\x00\x5a\xcf\xee\x63\x1c\x00\x02\x80"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000001}, msg_namelen=12, msg_iov=[{iov_base={{len=148, type=0x100 /* NLMSG_??? */, flags=0x600 /* NLM_F_??? */, seq=0, pid=0}, "\x02\x00\x00\x00\x34\x00\x01\x80\x14\x00\x01\x80\x08\x00\x01\x00\x05\x43\x15\x0b\x08\x00\x02\x00\x5a\xcf\xee\x63\x1c\x00\x02\x80"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000001}, msg_namelen=12, msg_iov=[{iov_base={{len=196, type=0x100 /* NLMSG_??? */, flags=0x600 /* NLM_F_??? */, seq=0, pid=0}, "\x0a\x00\x00\x00\x4c\x00\x01\x80\x2c\x00\x01\x80\x14\x00\x03\x00\x2a\x02\x0c\x7f\xc2\x41\xae\x00\xc0\x57\x02\x93\x77\xfe\xcc\xa8"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 196
recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000001}, msg_namelen=12, msg_iov=[{iov_base={{len=196, type=0x100 /* NLMSG_??? */, flags=0x600 /* NLM_F_??? */, seq=0, pid=0}, "\x02\x00\x00\x00\x34\x00\x01\x80\x14\x00\x01\x80\x08\x00\x01\x00\xc0\xa8\x01\xb4\x08\x00\x02\x00\x34\x6d\x20\x17\x1c\x00\x02\x80"...}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 196
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)

had the same on R7800:

recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=12}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
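
For anyone who wants to put a number on "a lot", here is a minimal sketch that counts failed versus successful recvmsg calls in a saved trace. It assumes the trace was written to a file first, for example with strace's -o option; the function name summarize_recvmsg and the log path in the usage line are made up for illustration.

```shell
# Summarize a saved strace log: how many recvmsg calls returned EAGAIN
# versus how many actually delivered data.
# $1 = path to the strace log file (hypothetical, pick your own).
summarize_recvmsg() {
    log="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    eagain=$(grep -c 'recvmsg(.*= -1 EAGAIN' "$log" || true)
    total=$(grep -c '^recvmsg(' "$log" || true)
    echo "total=$total eagain=$eagain ok=$((total - eagain))"
}
```

Usage would be something like `strace -p $(pidof nlbwmon) -o /tmp/nlbwmon.strace` for a few seconds, then `summarize_recvmsg /tmp/nlbwmon.strace`. A very high share of EAGAIN results matches the traces posted above, where the process keeps reading a socket that has nothing queued.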

Since nlbwmon frequently goes wild and consumes all the available processing power of at least one core, I've put together some workarounds to be placed in the /etc/rc.local file before the exit 0 line.
They have been tested on the R7800 only.

  1. E-mail alert to be sent in case of high temperature:
(while true; do (sleep 15m; if [ `cat /sys/devices/virtual/thermal/thermal_zone9/temp` -ge 58000 ] && [ `sleep 15m; cat /sys/devices/virtual/thermal/thermal_zone9/temp` -ge 58000 ]; then echo -e "Subject: Router name thermal alert\n\n"`cat /sys/devices/virtual/thermal/thermal_zone9/temp`"\n"`uptime`"\n"`top -b -n 3`"\n" | sendmail mymail@gmail.com; logger "WARNING: high temperature"; sleep 3h; fi); done) &

or a variant that checks the average load instead:

(while true; do (sleep 15m; [ `uptime | awk '{print (int($8*100))}'` -ge 125 ] && (echo -e "Subject: RouterName load alert\n\n"`cat /sys/devices/virtual/thermal/thermal_zone9/temp`"\n"`uptime`"\n"`top -b -n 3`"\n" | sendmail mymail@gmail.com; logger "WARNING: high load"; sleep 3h)); done) &

Both variants assume that sendmail is installed and configured.

  2. Automatic restart of the nlbwmon process:

(while true; do (NLB_T=45; sleep 15m; NLB_A=$(top -b -n 1 | grep nlbwmon | grep -v grep | awk '{print (int($7))}' | head -1); [ $NLB_A -ge $NLB_T ] && ( sleep 5m; NLB_B=$(top -b -n 1 | grep nlbwmon | grep -v grep | awk '{print (int($7))}' | head -1); [ $NLB_B -ge $NLB_T ] && ( /etc/init.d/nlbwmon restart; logger "WARNING: nlbwmon restarted due to process runaway, proc% was $NLB_A then $NLB_B" ))); done) &
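
The restart one-liner is dense, so here is a sketch of its two building blocks pulled out as small functions. It assumes BusyBox-style `top -b -n 1` output with %CPU in the 7th column and the command in the 8th, the same layout the one-liner relies on; builds with a different column layout would need other field numbers. The names proc_cpu and over_threshold are invented for this example.

```shell
# Print the integer CPU percentage of the first process whose command
# is NAME (or ends in /NAME), reading top output from stdin.
# Assumption: %CPU is column 7 and the command is column 8.
# Prints 0 if no such process is listed, so the comparison below
# never sees an empty value.
proc_cpu() {
    awk -v name="$1" '
        $8 ~ ("(^|/)" name "$") { print int($7); found = 1; exit }
        END { if (!found) print 0 }
    '
}

# Succeed when the measured percentage ($1) is at or above the threshold ($2).
over_threshold() {
    [ "$1" -ge "$2" ]
}
```

With these, one check reads `NLB_A=$(top -b -n 1 | proc_cpu nlbwmon); over_threshold "$NLB_A" 45 && ...`. Matching on the command column also keeps the awk process itself from being picked up by mistake.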

edit: some fixes applied on June 5th
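
One caveat on the load-based variant: it takes the load from field 8 of the uptime output, and that field position can shift depending on how the uptime part is printed (minutes, hours, days). A possible alternative, sketched here and not tested on the R7800, is to read the 1-minute load from /proc/loadavg, where it is always the first field. The function name load_x100 is hypothetical.

```shell
# Print the 1-minute load average multiplied by 100, as an integer.
# $1 is an optional file to read; it defaults to /proc/loadavg.
# Adding 0.5 before the integer conversion rounds to the nearest value
# instead of truncating.
load_x100() {
    awk '{ printf "%d\n", $1 * 100 + 0.5; exit }' "${1:-/proc/loadavg}"
}
```

The test in the one-liner would then become `[ "$(load_x100)" -ge 125 ]`.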

I pushed an update to nlbwmon which hopefully addresses the 100% CPU Load issue.

