Oom-killer: dnsmasq when Physical Free RAM remains

does sound like its choking on something or a mem leak... but then if u have 350mb free? why is it killing it... Unless its trying to allocate more ram as a single block and can not? Try removing one your lists and see if it stablises?

I do know there are some list cleaners for adblock to help with memory issues but its been a while since i've used adblock. (they basically import all the lists and merge dupes etc so it just has one master list from all the ones you give it)

2 Likes

I've pared back the Adblock lists, and now, I'm showing:

So, we will see what happens.

1 Like

at least will help you narrow down if it dnsmasq or adblock causing issues.
There is one thing i do remember i had issues with. adblock when it did an update. if it didnt have enough memory to have 2x your lists in memory would fall over. i wonder if that is what might be happening? So while you have 350mb of ram "free" if your lists in total are over that then the loading and swapping to the updated lists might be why it dies? (he may have patched this. i honestly dont know but it is something i remembered.)

Could it be a variation of https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2021q2/014920.html ff.?

4 Likes

Well, your "used" goes from 548 MB down to 129 MB, by 420 MB...

During the blocklist update you might need 548+420 = 968 MB for that. And if you only have 350 MB free, the handling of the 420 MB list sounds unrealistic. (The new list needs to be prepared, while the old dnsmasq process still has the previous list.)

Almost half a gigabyte for adblock lists sounds extreme.

3 Likes

Here's a link to the original thread on this forum related to that particular report

3 Likes

I've taken the patch file you put in that thread and have updated it reflect dnsmasq-2.38. I'll test it and post it there for people to use, or submit it as a PR patch if it works (unless you want to.. It's your code, I just reworked it for the new source)

package/network/services/dnsmasq/patches/200-fix_max_procs.patch

--- a/src/dnsmasq.c
+++ b/src/dnsmasq.c
@@ -1036,7 +1036,7 @@ int main (int argc, char **argv)
   pid = getpid();

   daemon->pipe_to_parent = -1;
-  for (i = 0; i < MAX_PROCS; i++)
+  for (i = 0; i < daemon->max_procs; i++)
     daemon->tcp_pipes[i] = -1;

 #ifdef HAVE_INOTIFY
@@ -1497,7 +1497,7 @@ static void async_event(int pipe, time_t
                break;
            }
          else
-           for (i = 0 ; i < MAX_PROCS; i++)
+           for (i = 0 ; i < daemon->max_procs; i++)
              if (daemon->tcp_pids[i] == p)
                daemon->tcp_pids[i] = 0;
        break;
@@ -1561,7 +1561,7 @@ static void async_event(int pipe, time_t

       case EVENT_TERM:
        /* Knock all our children on the head. */
-       for (i = 0; i < MAX_PROCS; i++)
+       for (i = 0; i < daemon->max_procs; i++)
          if (daemon->tcp_pids[i] != 0)
            kill(daemon->tcp_pids[i], SIGALRM);

@@ -1732,7 +1732,7 @@ static void set_dns_listeners(void)
     poll_listen(rfl->rfd->fd, POLLIN);

   /* check to see if we have free tcp process slots. */
-  for (i = MAX_PROCS - 1; i >= 0; i--)
+  for (i = daemon->max_procs - 1; i >= 0; i--)
     if (daemon->tcp_pids[i] == 0 && daemon->tcp_pipes[i] == -1)
       break;

@@ -1756,7 +1756,7 @@ static void set_dns_listeners(void)
     }

   if (!option_bool(OPT_DEBUG))
-    for (i = 0; i < MAX_PROCS; i++)
+    for (i = 0; i < daemon->max_procs; i++)
       if (daemon->tcp_pipes[i] != -1)
        poll_listen(daemon->tcp_pipes[i], POLLIN);
 }
@@ -1791,7 +1791,7 @@ static void check_dns_listeners(time_t n
      to free the process slot. Once the child process has gone, poll()
      returns POLLHUP, not POLLIN, so have to check for both here. */
   if (!option_bool(OPT_DEBUG))
-    for (i = 0; i < MAX_PROCS; i++)
+    for (i = 0; i < daemon->max_procs; i++)
       if (daemon->tcp_pipes[i] != -1 &&
          poll_check(daemon->tcp_pipes[i], POLLIN | POLLHUP) &&
          !cache_recv_insert(now, daemon->tcp_pipes[i]))
@@ -1815,7 +1815,7 @@ static void check_dns_listeners(time_t n
         at least one a poll() time, that we still do.
         There may be more waiting connections after
         poll() returns then free process slots. */
-      for (i = MAX_PROCS - 1; i >= 0; i--)
+      for (i = daemon->max_procs - 1; i >= 0; i--)
        if (daemon->tcp_pids[i] == 0 && daemon->tcp_pipes[i] == -1)
          break;

--- a/src/dnsmasq.h
+++ b/src/dnsmasq.h
@@ -1240,6 +1240,9 @@ extern struct daemon {
   /* file for packet dumps. */
   int dumpfd;
 #endif
+
+  /* maximum number of child processes to fork */
+  unsigned int max_procs;
 } *daemon;

 /* cache.c */
--- a/src/option.c
+++ b/src/option.c
@@ -174,7 +174,8 @@ struct myoption {
 #define LOPT_CMARK_ALST_EN 365
 #define LOPT_CMARK_ALST    366
 #define LOPT_QUIET_TFTP    367
-
+#define LOPT_MAX_PROCS     368
+
 #ifdef HAVE_GETOPT_LONG
 static const struct option opts[] =
 #else
@@ -351,8 +352,9 @@ static const struct myoption opts[] =
     { "dhcp-ignore-clid", 0, 0,  LOPT_IGNORE_CLID },
     { "dynamic-host", 1, 0, LOPT_DYNHOST },
     { "log-debug", 0, 0, LOPT_LOG_DEBUG },
-       { "umbrella", 2, 0, LOPT_UMBRELLA },
+    { "umbrella", 2, 0, LOPT_UMBRELLA },
     { "quiet-tftp", 0, 0, LOPT_QUIET_TFTP },
+    { "max-procs", 1, 0, LOPT_MAX_PROCS },
     { NULL, 0, 0, 0 }
   };

@@ -539,8 +541,9 @@ static struct {
   { LOPT_SCRIPT_TIME, OPT_LEASE_RENEW, NULL, gettext_noop("Call dhcp-script when lease expiry changes."), NULL },
   { LOPT_UMBRELLA, ARG_ONE, "[=<optspec>]", gettext_noop("Send Cisco Umbrella identifiers including remote IP."), NULL },
   { LOPT_QUIET_TFTP, OPT_QUIET_TFTP, NULL, gettext_noop("Do not log routine TFTP."), NULL },
+  { LOPT_MAX_PROCS, ARG_ONE, "<number>", gettext_noop("Specify maximum number of child process to fork."), NULL },
   { 0, 0, NULL, NULL, NULL }
-};
+};

 /* We hide metacharacters in quoted strings by mapping them into the ASCII control
    character space. Note that the \0, \t \b \r \033 and \n characters are carefully placed in the
@@ -4800,7 +4803,12 @@ err:
        break;
       }
 #endif
-
+    case LOPT_MAX_PROCS:  /* --max-procs */
+      if (!atoi_check16(arg, &daemon->max_procs))
+       ret_err(gen_err);
+      if (daemon->max_procs > MAX_PROCS) daemon->max_procs = MAX_PROCS;
+        break;
+
     default:
       ret_err(_("unsupported option (check that dnsmasq was compiled with DHCP/TFTP/DNSSEC/DBus support)"));

@@ -5282,7 +5290,7 @@ void read_opts(int argc, char **argv, ch
   daemon->soa_refresh = SOA_REFRESH;
   daemon->soa_retry = SOA_RETRY;
   daemon->soa_expiry = SOA_EXPIRY;
-
+  daemon->max_procs = MAX_PROCS;
 #ifndef NO_ID
   add_txt("version.bind", "dnsmasq-" VERSION, 0 );
   add_txt("authors.bind", "Simon Kelley", 0);

Sure, I can submit a PR. Given that Simon Kelley didn't seem to think that a patch like this belongs in the upstream source, it probably makes sense to add it to Openwrt - the combination of large adblock lists and constrained router memory makes this a problem just waiting to reoccur.

Let us know if that patch actually solves your problem.

1 Like

Funny enough:

root@OpenWrt:/# service dnsmasq start
[  433.275260] do_page_fault(): sending SIGSEGV to ujail for invalid read access from 00000100f32271e3
[  433.284367] epc = 000000aaab68a828 in ujail[aaab680000+14000]
[  433.290141] ra  = 000000aaab68adc4 in ujail[aaab680000+14000]
root@OpenWrt:/#

Fresh rebase with master as of this post.. Any suggestions?

Edit: I'm going to try building without the procd-ujail and see if that helps

You should alert @daniel who has been the main author of the ujail functionality.

Please share platform and configuration which will allow me to reproduce this error.

Other suggestion is that you could build with debugging enabled, and you should be able to a core dump out of it, which you can then analyse with gdb in your buildhost (where you have the non-stripped binaries in staging_dir)

If gdb core dump debugging in new for you, see a recent example in:

Successful debugging (finding the crashing source line) pretty much requires that you have the unstripped binary still available in the buildhost. (or you copy and use the unstripped binary into the router itself.)

Hi @daniel - MIPS64 Octeon3 Itus Shield, built from source (master), HEAD at c4e994011f. Swapping procd-ujail for just procd solved the ujail issue (obviousy) and I'm more than happy to help test whatever once I get the image stabilized (this started with dnsmasq memory issues)

I've used remotegdb once before, but I don't use gdb all that much. I'll look into it, or maybe someone will be able to give me specific directions, if it comes to it :slight_smile:
I did remove sstrip striping from the build to leave the symbols intact though, so it's an option going forward.

I did manage to get dnsmasq patched to allow for the --max-procs call, and updated /etc/init.d/dnsmasq to call it.

# auto-generated config file from /etc/config/dhcp
conf-file=/etc/dnsmasq.conf
dhcp-authoritative
domain-needed
localise-queries
read-ethers
enable-ubus=dnsmasq
expand-hosts
bind-dynamic
local-service
edns-packet-max=1232
domain=lan
local=/lan/
max-procs=1
addn-hosts=/tmp/hosts
dhcp-leasefile=/tmp/dhcp.leases
resolv-file=/tmp/resolv.conf.d/resolv.conf.auto
stop-dns-rebind
rebind-localhost-ok
dhcp-broadcast=tag:needs-broadcast
conf-dir=/tmp/dnsmasq.d
user=dnsmasq
group=dnsmasq
dhcp-ignore-names=tag:dhcp_bogus_hostname
conf-file=/usr/share/dnsmasq/dhcpbogushostname.conf
bogus-priv
conf-file=/usr/share/dnsmasq/rfc6761.conf
dhcp-range=set:lan,192.168.1.100,192.168.1.249,255.255.255.0,12h
no-dhcp-interface=eth0

We will see if this fixes it or not.. it usually takes a few days though.

@dl12345

Ok.. So. Changes for dnsmasq. I'm currently testing to see if it dies, but it'll take a few days now that I'm stablized, but these are the changes that'll need to go into the PR you put up.

Patch to fix the dnsmasq init service

--- a/package/network/services/dnsmasq/files/dnsmasq.init
+++ b/package/network/services/dnsmasq/files/dnsmasq.init
@@ -937,6 +937,8 @@
        append_parm "$cfg" "maxport" "--max-port"
        append_parm "$cfg" "domain" "--domain"
        append_parm "$cfg" "local" "--local"
+       append_parm "$cfg" "maxprocs" "--max-procs"
+
        config_list_foreach "$cfg" "listen_address" append_listenaddress
        config_list_foreach "$cfg" "server" append_server
        config_list_foreach "$cfg" "rev_server" append_rev_server
--- a/package/network/services/dnsmasq/files/dhcp.conf
+++ b/package/network/services/dnsmasq/files/dhcp.conf
@@ -21,6 +21,7 @@
        #list bogusnxdomain     '64.94.110.11'
        option localservice     1  # disable to allow DNS requests from non-local subnets
        option ednspacket_max   1232
+       option maxprocs         1  # Max Forked Processes

 config dhcp lan
        option interface        lan

package/network/services/dnsmasq/patches/200-fix_max_procs.patch

--- a/src/dnsmasq.c
+++ b/src/dnsmasq.c
@@ -1036,7 +1036,7 @@ int main (int argc, char **argv)
   pid = getpid();

   daemon->pipe_to_parent = -1;
-  for (i = 0; i < MAX_PROCS; i++)
+  for (i = 0; i < daemon->max_procs; i++)
     daemon->tcp_pipes[i] = -1;

 #ifdef HAVE_INOTIFY
@@ -1497,7 +1497,7 @@ static void async_event(int pipe, time_t
                break;
            }
          else
-           for (i = 0 ; i < MAX_PROCS; i++)
+           for (i = 0 ; i < daemon->max_procs; i++)
              if (daemon->tcp_pids[i] == p)
                daemon->tcp_pids[i] = 0;
        break;
@@ -1561,7 +1561,7 @@ static void async_event(int pipe, time_t

       case EVENT_TERM:
        /* Knock all our children on the head. */
-       for (i = 0; i < MAX_PROCS; i++)
+       for (i = 0; i < daemon->max_procs; i++)
          if (daemon->tcp_pids[i] != 0)
            kill(daemon->tcp_pids[i], SIGALRM);

@@ -1732,7 +1732,7 @@ static void set_dns_listeners(void)
     poll_listen(rfl->rfd->fd, POLLIN);

   /* check to see if we have free tcp process slots. */
-  for (i = MAX_PROCS - 1; i >= 0; i--)
+  for (i = daemon->max_procs - 1; i >= 0; i--)
     if (daemon->tcp_pids[i] == 0 && daemon->tcp_pipes[i] == -1)
       break;

@@ -1756,7 +1756,7 @@ static void set_dns_listeners(void)
     }

   if (!option_bool(OPT_DEBUG))
-    for (i = 0; i < MAX_PROCS; i++)
+    for (i = 0; i < daemon->max_procs; i++)
       if (daemon->tcp_pipes[i] != -1)
        poll_listen(daemon->tcp_pipes[i], POLLIN);
 }
@@ -1791,7 +1791,7 @@ static void check_dns_listeners(time_t n
      to free the process slot. Once the child process has gone, poll()
      returns POLLHUP, not POLLIN, so have to check for both here. */
   if (!option_bool(OPT_DEBUG))
-    for (i = 0; i < MAX_PROCS; i++)
+    for (i = 0; i < daemon->max_procs; i++)
       if (daemon->tcp_pipes[i] != -1 &&
          poll_check(daemon->tcp_pipes[i], POLLIN | POLLHUP) &&
          !cache_recv_insert(now, daemon->tcp_pipes[i]))
@@ -1815,7 +1815,7 @@ static void check_dns_listeners(time_t n
         at least one a poll() time, that we still do.
         There may be more waiting connections after
         poll() returns then free process slots. */
-      for (i = MAX_PROCS - 1; i >= 0; i--)
+      for (i = daemon->max_procs - 1; i >= 0; i--)
        if (daemon->tcp_pids[i] == 0 && daemon->tcp_pipes[i] == -1)
          break;

--- a/src/dnsmasq.h
+++ b/src/dnsmasq.h
@@ -1240,6 +1240,9 @@ extern struct daemon {
   /* file for packet dumps. */
   int dumpfd;
 #endif
+
+  /* maximum number of child processes to fork */
+  unsigned int max_procs;
 } *daemon;

 /* cache.c */
--- a/src/option.c
+++ b/src/option.c
@@ -174,7 +174,8 @@ struct myoption {
 #define LOPT_CMARK_ALST_EN 365
 #define LOPT_CMARK_ALST    366
 #define LOPT_QUIET_TFTP    367
-
+#define LOPT_MAX_PROCS     368
+
 #ifdef HAVE_GETOPT_LONG
 static const struct option opts[] =
 #else
@@ -351,8 +352,9 @@ static const struct myoption opts[] =
     { "dhcp-ignore-clid", 0, 0,  LOPT_IGNORE_CLID },
     { "dynamic-host", 1, 0, LOPT_DYNHOST },
     { "log-debug", 0, 0, LOPT_LOG_DEBUG },
-       { "umbrella", 2, 0, LOPT_UMBRELLA },
+    { "umbrella", 2, 0, LOPT_UMBRELLA },
     { "quiet-tftp", 0, 0, LOPT_QUIET_TFTP },
+    { "max-procs", 1, 0, LOPT_MAX_PROCS },
     { NULL, 0, 0, 0 }
   };

@@ -539,8 +541,9 @@ static struct {
   { LOPT_SCRIPT_TIME, OPT_LEASE_RENEW, NULL, gettext_noop("Call dhcp-script when lease expiry changes."), NULL },
   { LOPT_UMBRELLA, ARG_ONE, "[=<optspec>]", gettext_noop("Send Cisco Umbrella identifiers including remote IP."), NULL },
   { LOPT_QUIET_TFTP, OPT_QUIET_TFTP, NULL, gettext_noop("Do not log routine TFTP."), NULL },
+  { LOPT_MAX_PROCS, ARG_ONE, "<number>", gettext_noop("Specify maximum number of child process to fork."), NULL },
   { 0, 0, NULL, NULL, NULL }
-};
+};

 /* We hide metacharacters in quoted strings by mapping them into the ASCII control
    character space. Note that the \0, \t \b \r \033 and \n characters are carefully placed in the
@@ -4800,7 +4803,12 @@ err:
        break;
       }
 #endif
-
+    case LOPT_MAX_PROCS:  /* --max-procs */
+      if (!atoi_check16(arg, &daemon->max_procs))
+       ret_err(gen_err);
+      if (daemon->max_procs > MAX_PROCS) daemon->max_procs = MAX_PROCS;
+        break;
+
     default:
       ret_err(_("unsupported option (check that dnsmasq was compiled with DHCP/TFTP/DNSSEC/DBus support)"));

@@ -5282,7 +5290,7 @@ void read_opts(int argc, char **argv, ch
   daemon->soa_refresh = SOA_REFRESH;
   daemon->soa_retry = SOA_RETRY;
   daemon->soa_expiry = SOA_EXPIRY;
-
+  daemon->max_procs = MAX_PROCS;
 #ifndef NO_ID
   add_txt("version.bind", "dnsmasq-" VERSION, 0 );
   add_txt("authors.bind", "Simon Kelley", 0);

I'm not sure we'd want to enable the process count restriction by default. I suspect it is only necessary when large adblock lists are being used. A setting of 1 is quite aggressive, meaning that only a single child process can be forked to handle tcp requests

A more robust solution might be to add functionality to luci's dnsmasq page to allow it to be configured, but leaving it unset by default, so not changing the init.d script...

Well, whatever is felt as appropriate.

I just needed to put something in there so it was recognized, as I know the discussion said for sure that setting it to 1 made the issue go away.

Since the default is 20, and the min would be 1, what would be the safe middle ground as a default? 3? 5? 10? Just leave it out of the dhcp config entirely? Will that leave it to the default?

for a quick test you can try merging this developer's branch https://git.openwrt.org/?p=openwrt/staging/ldir.git;a=shortlog;h=refs/heads/mine

You can also see what happens if you revert back to dnsmasq 2.85 by :

git revert ed7769aa405fe246b89c9c97b7fb552dfb0b4995
git revert 02a2b44eabf607fb5405ff0d7da4ad0748d3e1b1
git revert d2d0044ebf01b71f63cde609e09f6ac68cdfeccb

1 Like

Yes, just leaving it out entirely would be the way to go. That will leave it on default settings of 20.

If this is the cause of your problem, it's likely only to occur when a large adblock list is being used AND DNS lookups are being made over TCP.

Probably the kind of thing that should be documented in the wiki so that anyone experiencing oom-killer crashes of this nature can go and set the parameter.

1 Like

I've set it to max-procs=5 and loaded a set of blocked domains..

I will see how long it takes to kick over, and then reduce it and repeat. I'll let you know.

It would probably be quicker if you were to first establish exactly how many times the process can fork before you're out of memory. In the thread where I originally posted the dnsmasq patch, I mentioned how you can use netcat to determine this. Just setup a loop in a shell script that will netcat -t 127.0.0.1 53 and see how many times it iterates before you're out of memory. Once you've determined that, then configure an appropriate max-procs value and leave it running to see if it still falls over

2 Likes