Zombie process spawned by firewall include

While working on my project geoip-shell, I noticed the following weirdness.

For a simplified example, I create a script at /usr/bin/ttt.sh with the following contents:

#!/bin/sh

/bin/sh /usr/bin/ttt2.sh

I create a script at /usr/bin/ttt2.sh with the following contents:

#!/bin/sh

sleep 3 &
wait $!

Then I add the following section to /etc/config/firewall:

config include 'ttt'
	option enabled '1'
	option type 'script'
	option path '/usr/bin/ttt.sh'

Essentially, the firewall include script /usr/bin/ttt.sh calls another script /usr/bin/ttt2.sh, which forks sleep and waits on the forked process to finish.

Then I issue the command

service firewall restart

This command never finishes. The reason: the spawned sleep process becomes a zombie:

root@OpenWrt:~# ps | grep sleep | grep -v grep
 2929 root         0 Z    [sleep]

If either script is run directly, sleep does not become a zombie and the process exits as expected.

So something about how the firewall include script is executed is different and I'm not sure what. Is there a workaround?

(The reason I want to run sleep with &, i.e. fork it and then use wait, is that while executing sleep the script ttt2.sh ignores kill commands, whereas while executing wait it doesn't. I want the script to respond to kill commands.)

P.s. this is the command line of the call (issued by firewall4):
8641 root 1184 S sh -c exec 1000>&-; config() { echo "You cannot use UCI in firewall includes!" >&2; exit 1; }; . '/usr/bin/ttt.sh'

Issuing this command manually doesn't spawn a zombie either:

sh -c exec 1000>&-; config() { echo "You cannot use UCI in firewall includes!" >&2; exit 1; }; . '/usr/bin/ttt.sh'

What happens if you replace this with the following?
sleep 3; exit &

I've seen similar when the procd file does not have any parameters set, specifically a timeout.
I set procd_set_param term_timeout 60 # wait before sending SIGKILL
But this is probably not applicable for what you are doing.

The firewall procd script does not have any parameters set because it is not a service daemon, it just runs and stops. The fork, I think, runs in another instance of ash, so it has to be terminated before procd will exit. Something along those lines anyway.... Worth a try.

Could nohup be a more suitable option here, if it needs to run in the background?

If you want to dig into the ucode system() call, the code is here:

Interesting that the 30 second timeout doesn’t return control to the main script.


Thanks. I'll take a look, although I'm not sure this will be of much help even if I figure out what's wrong. My goal with geoip-shell is wide compatibility, including with older versions of OpenWrt, so fixing this in a future release doesn't solve the issue (unless some sort of ugly hack is used). I'm more interested in a solution/workaround with the current implementation.

Actually, control is returned to the system caller (I'm not sure which process specifically runs the code in the firewall include script) after 30 seconds - I just didn't wait long enough. The ttt2.sh script remains hanging forever though.
Edit: looks like the firewall include is executed by main.uc in firewall4.

I'm not sure how this can be used to solve the problem at hand. I specifically need to use the sleep command, and I want that sleep command to run asynchronously in order for the calling script (ttt2.sh in this example) to remain responsive to kill commands.

This makes no sense to me. First, I don't want my script to exit. Second, I don't understand how one can exit asynchronously. If you mean (sleep 3; exit) & then this doesn't change the behavior.

OK, so sleep was not a placeholder. I thought you just needed to properly exec something in the background, in which case nohup would be the choice.


After looking at the ucode function which appears to be the firewall4 component which executes firewall includes (here), I came up with a simplified test using the following ucode script:

let path = '/usr/bin/ttt2.sh';
let rc = system([
        'sh', '-c',
        `exec 1000>&-; /bin/sh '${path}'`
], 5000);

This directly calls ttt2.sh (without sourcing ttt.sh). Same behavior here: sleep becomes a zombie and ttt2.sh hangs forever.

So I guess it's a bug in ucode? Perhaps @efahl knows something about this (as a prolific ucode programmer)? If it's a bug then probably it's a good idea to report it. But first and foremost, I'd like to know if there is a workaround with the current implementation.

BTW I have no idea what this does:

exec 1000>&-

Maybe someone can explain it before I ask the clanker.

Edit: removing exec 1000>&- from the ucode script doesn't change the behavior, regardless of what that command does.

I think that ties back to the flock file descriptor in the fw4 script.
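For reference, `exec N>&-` closes file descriptor N in the current shell, so nothing started afterwards inherits it. Here is a minimal sketch of that behaviour, using fd 9 instead of 1000 because multi-digit descriptors are not supported by every shell. The helper name `fd9_is_open` and the temp file are made up for illustration and are not taken from fw4:

```shell
#!/bin/sh
# Hypothetical helper: succeed if fd 9 is currently open for writing.
fd9_is_open() {
    # The redirection fails inside the subshell if fd 9 is closed.
    ( : >&9 ) 2>/dev/null
}

tmp=$(mktemp)          # stand-in for a lock file (assumption)
exec 9>"$tmp"          # open fd 9 in the current shell
fd9_is_open && echo "fd 9 is open"

exec 9>&-              # close it, the same way 'exec 1000>&-' closes fd 1000
fd9_is_open || echo "fd 9 is closed"

rm -f "$tmp"
```

If fw4 does hold its lock on fd 1000, as suggested above, closing the descriptor first would keep the include script and its children from keeping that lock file open.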

firewall4 is practically unmaintained at the moment, with no obvious successor to jow.


Stupid question(?) Does spawning a subshell change things?

( sleep 3 )&

I take it that jow has left the project? If so then that's too bad as firewall4 is such a core component.

I tried various combinations with spawning a subshell (including the one you suggested) to no avail.

I don’t know what his particular situation is (life happens to us all), but the repo is getting stale with old issues and PRs. No commits in nearly a year.


In the end, it sounds like trying to hook in via the include mechanism isn’t going to work for the persistence you want. Luckily there are a lot of other ways to manipulate nftables.

It works and works reliably. The only issue is when one instance of geoip-shell is running and another instance is trying to stop the running one. It can't achieve this via the normal kill command as long as the first instance is sleeping, unless kill -9 is used (but kill -9 prevents the first instance from cleaning up).

This is what I'm trying to solve now. I don't think it's worth changing the persistence implementation for this, so if no workaround is found, I'll live without asynchronous sleep and implement some way for the 2nd instance to clean up after the first one.
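One possible middle ground, sketched here as an untested assumption rather than a confirmed fix: split a long sleep into short foreground sleeps. A trap handler runs once the current foreground command finishes, so the script would react to a kill within about one second without using wait. This assumes that foreground sleep is not affected under firewall4, since the hang was only reported for a background sleep plus wait. The function name `chunked_sleep` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: sleep for $1 seconds in 1-second foreground steps,
# so a pending trap handler gets a chance to run between the steps.
chunked_sleep() {
    remaining=$1
    while [ "$remaining" -gt 0 ]; do
        sleep 1
        remaining=$((remaining - 1))
    done
}

trap 'echo "Exiting"; exit 0' INT HUP TERM

chunked_sleep 2
echo "Sleeping done"
```

The trade-off is one extra process spawn per second of sleeping, which may or may not be acceptable on a small router.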

I see that ucode is also a jow project.

Filed an issue:


It looks like ucode is blocking SIGCHLD and that is getting inherited by the forked process. ash relies on this signal to detect when the child process exits.

You can unblock it in your shell script with the following: (this was wrong, see below)

trap - SIGCHLD

Thanks for the hint. Unfortunately, this doesn't work:

#!/bin/sh

trap - SIGCHLD

sleep 3 &
sleep_pid=$!
echo "Waiting on $sleep_pid"
wait "$sleep_pid"
echo "Done"

still behaves the same way (wait never returns). Am I doing something wrong?

Whoops, you're right. Sorry, I thought it was working for me with your test case, but I forgot I had set the timeout in the ucode system() call to 0. Looks like that trap won't unblock the signal.
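To check from inside a script whether SIGCHLD really is blocked, one option on Linux is to read the SigBlk mask from /proc. The following is only a diagnostic sketch: the function names are made up, and the signal number is an assumption (17 on x86 and ARM Linux, 18 on MIPS), so it can be overridden through the hypothetical `SIGCHLD_NUM` variable:

```shell
#!/bin/sh
# Hypothetical helper: succeed if signal number $2 is set in the hex mask $1.
sig_in_mask() {
    _m=$1
    _n=$2
    # Keep only the last 8 hex digits (signals 1-32); some architectures
    # print masks that are too long for shell arithmetic.
    case $_m in
        ?????????*) _m=${_m#"${_m%????????}"} ;;
    esac
    # Signal N is represented by bit N-1 of the mask.
    [ $(( (0x$_m >> (_n - 1)) & 1 )) -eq 1 ]
}

# Print the SigBlk field of the current shell (Linux-specific).
blocked_mask() {
    while read -r key value; do
        [ "$key" = "SigBlk:" ] && { echo "$value"; return 0; }
    done < "/proc/$$/status"
    return 1
}

chld=${SIGCHLD_NUM:-17}   # assumption: 17 on x86/ARM, 18 on MIPS

if ! mask=$(blocked_mask 2>/dev/null); then
    echo "Could not read SigBlk"
elif sig_in_mask "$mask" "$chld"; then
    echo "SIGCHLD is blocked (SigBlk: $mask)"
else
    echo "SIGCHLD is not blocked (SigBlk: $mask)"
fi
```

Running this once from a terminal and once from the firewall include should show whether the mask differs between the two environments.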


For completeness, I tried to reset the signal in the ucode script, i.e.:

signal('SIGCHLD', 'default');
let path = '/usr/bin/test.sh';
system([
        'sh', '-c',
        `/bin/sh '${path}'`
], 5000);

as per documentation, but that doesn't work either.

Yeah, the problem is happening inside the implementation of the system() call, so nothing done before that will help. If you set the timeout to 0, that would avoid the problem, because ucode doesn't block the signal in that case, but I'm assuming that's not an option, given that the ucode usage is coming from firewall4, not your project.

There are probably plenty of ways to hack around this, but I'm not clear on why you need to use wait here. You said the shell script ignores kill unless you do it that way, but that shouldn't be the case. Killing the shell process won't kill its children, but that's true for using wait, too, unless you have a trap that implements cleanup.

Of course trap is used and it calls a cleanup function. Now, when the trap is used, the script won't respond to a kill signal while sleep is running in the foreground:

#!/bin/sh

IFS=$'\n' read -r -n512 -d '' _ _ _ _ _ pid_line _ < /proc/self/status
__pid="${pid_line##*[^0-9]}"
echo "My pid: '$__pid'"

trap 'echo "Exiting"; exit' INT HUP TERM  # SIGKILL cannot be trapped, so KILL is omitted

sleep 15

echo "Sleeping done"

If you want to see for yourself, run this script, take the PID it prints and send it the normal kill signal from a second terminal. You'll see that the script continues to run and responds to the signal only when sleep completes. With sleep forked and the script wait'ing on it, the script responds immediately when it receives the signal.
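For comparison, here is a minimal sketch of the fork-and-wait variant with a cleanup handler that also stops the background sleep, so it doesn't linger after the script exits. The names `interruptible_sleep` and `cleanup` are made up for this example. Note that this is the same pattern that hangs when run from a firewall include, so it only illustrates the behaviour outside of firewall4:

```shell
#!/bin/sh
sleep_pid=

# Runs on INT, HUP or TERM: stop the background sleep (if any), then exit.
cleanup() {
    [ -n "$sleep_pid" ] && kill "$sleep_pid" 2>/dev/null
    echo "Exiting"
    exit 0
}
trap cleanup INT HUP TERM

# Hypothetical helper: sleep in the background and wait on it.
# 'wait' returns as soon as a trapped signal arrives, so the trap
# handler runs immediately instead of after sleep completes.
interruptible_sleep() {
    sleep "$1" &
    sleep_pid=$!
    wait "$sleep_pid"
    sleep_pid=
}

interruptible_sleep 1
echo "Sleeping done"
```

With a longer duration, sending the default kill signal from a second terminal should print "Exiting" right away.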