[SOLVED] OpenVPN sometimes loses the ability to respawn when restarted

Since upgrading to 18.06 (in the last few months only), I have noticed that some of my processes - including OpenVPN - do not seem to be respawned by procd when they crash.

I recently came across a reproducible scenario which led me to the following findings:

  • as listed by @antismap in a previous post (Here) and in a very well documented procd bug (Here), the respawn element of a procd instance gets lost during a restart under certain conditions (most typically when the process fails to respond to SIGTERM in time).
  • this seems to happen quite easily with openvpn instances (I'm running on MT7621).
    Wed Apr 17 16:17:21 2019 daemon.info procd: Instance openvpn::openvpn pid 1720 not stopped on SIGTERM, sending SIGKILL instead
    

As a quick way to observe the issue, I can typically reproduce this by running "/etc/init.d/openvpn restart" 4 or 5 times in a row.
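
For instance, a rough loop along these lines can automate the check (the iteration count, the sleep, and the grep-based test are my own additions, not part of the capture below):

#!/bin/sh
# Repro sketch: restart openvpn repeatedly and report when procd
# drops the respawn settings for the instance.
for i in $(seq 1 10); do
    /etc/init.d/openvpn restart
    sleep 2
    if ubus call service list '{"name":"openvpn","verbose":true}' | grep -q '"respawn"'; then
        echo "attempt $i: respawn still present"
    else
        echo "attempt $i: respawn LOST after restart"
        break
    fi
done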

Question: What would be the best way to get around this? Or is there a plan to address that bug?

Example of the issue as captured:

root@openwrt:/# /etc/init.d/openvpn restart
root@openwrt:/# ubus call service list '{"name":"openvpn","verbose": true}'
{
        "openvpn": {
                "instances": {
                        "openvpn": {
                                "running": true,
                                "pid": 5085,
                                "command": [
                                        "\/usr\/sbin\/openvpn",
                                        "--syslog",
                                        "openvpn(openvpn)",
                                        "--status",
                                        "\/var\/run\/openvpn.openvpn.status",
                                        "--cd",
                                        "\/etc\/openvpn",
                                        "--config",
                                        "\/etc\/openvpn\/openvpn.conf"
                                ],
                                "term_timeout": 15,
                                "respawn": {
                                        "threshold": 3600,
                                        "timeout": 5,
                                        "retry": -1
                                }
                        }
                },
                "triggers": [...]
        }
}
root@openwrt:/# /etc/init.d/openvpn restart
root@openwrt:/# ubus call service list '{"name":"openvpn","verbose": true}'
{
        "openvpn": {
                "instances": {
                        "openvpn": {
                                "running": true,
                                "pid": 5127,
                                "command": [
                                        "\/usr\/sbin\/openvpn",
                                        "--syslog",
                                        "openvpn(openvpn)",
                                        "--status",
                                        "\/var\/run\/openvpn.openvpn.status",
                                        "--cd",
                                        "\/etc\/openvpn",
                                        "--config",
                                        "\/etc\/openvpn\/openvpn.conf"
                                ],
                                "term_timeout": 15
                        }
                },
                "triggers": [...]
        }
}

Thanks for your feedback!

The best way to get around this issue would be not to have any crashing services :slight_smile: The second best, which I have been doing since ancient times: just have a simple script, always active, that checks for the existence of the required programs/services and triggers a reboot if anything is missing.
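
As an illustration, a minimal sketch of such a script could look like this (the program list and the check interval are example values only):

#!/bin/sh
# Watchdog sketch: reboot if a required daemon is no longer running.
# The program list and the interval are example values.
REQUIRED="openvpn dnsmasq"

while true; do
    for prog in $REQUIRED; do
        if ! pidof "$prog" > /dev/null; then
            logger -t watchdog "$prog is missing, rebooting"
            reboot
        fi
    done
    sleep 60
done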

@reinerotto, thanks for your answer. Let me expand a bit.

These services are not crashing as such; they are usually stuck in networking calls (select and the like) or waiting for a resource, and are unable to respond to SIGTERM in time. This can't really be avoided given the nature of OpenWrt, which takes in third-party packages and runs them in all kinds of environments.
Your suggestion is to replace procd (or complement it) with another process manager (cron + ps, ...). As much as that would solve this issue in most cases, it is not what I'm looking for here. It also means duplicating the chances that start/stop/restart commands are issued at unexpected times, since two process managers are now competing with each other.

I've started looking at the procd code, but I'll probably need a bit of time to get my head around it. I think the latest commit dealing with deleting stopped instances may have a small impact (as there is now more deletion of instances in the AVL tree). The complete resolution could be to use a dedicated flag to prevent restarting/respawning, rather than modifying the instance values when stopping the process, since those values are still likely to be used, as shown in the bug. This would allow the new values to be duplicated/set correctly.

Other possibilities (not sure if they are viable) could include:

  • waiting for SIGKILL to have done its job and rereading data from config
  • copying data before setting respawn to false
  • setting respawn to config values after copying in restart situations

You need to monitor not the service process but its role.
So, use a watchdog script / cron job that pings the remote gateway / peer and restarts the service / reboots if the check fails.
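
For example, a rough sketch of such a check (the peer address, failure threshold and interval are placeholders):

#!/bin/sh
# Connectivity watchdog sketch: ping the VPN peer and restart the
# service if several checks in a row fail. Values are placeholders.
PEER="10.8.0.1"
FAILS=0

while true; do
    if ping -c 1 -W 5 "$PEER" > /dev/null 2>&1; then
        FAILS=0
    else
        FAILS=$((FAILS + 1))
        if [ "$FAILS" -ge 3 ]; then
            logger -t vpn-watchdog "peer unreachable, restarting openvpn"
            /etc/init.d/openvpn restart
            FAILS=0
        fi
    fi
    sleep 30
done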

Thanks @vgaetera,
unfortunately, that's also not what I'm looking for.
I don't want to replace procd here. I suppose I'm either looking for a workaround to get procd back on track (from the init script configuration, so I can at least modify the scripts for the processes that matter to me) or a way/timeline to fix that code.
This is not limited to one specific service where I could implement such a feature (I used openvpn as an example because it's easy to reproduce with that package). This issue happens with any service that is unfortunate enough to fail to stop within the time configured for SIGTERM. From then on, it will have lost its ability to respawn.
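
In the meantime, the closest thing to a workaround I can think of is a periodic check (e.g. from cron) along these lines. This is only a sketch and assumes that a clean stop followed by a start makes the init script register a fresh instance with its respawn settings again:

#!/bin/sh
# Stopgap sketch: if procd has dropped the respawn settings for the
# instance, stop and start the service so the init script registers
# a fresh instance. Assumes stop/start restores the respawn values.
SERVICE="openvpn"

if ubus call service list "{\"name\":\"$SERVICE\",\"verbose\":true}" | grep -q '"respawn"'; then
    exit 0
fi

logger -t respawn-check "$SERVICE lost its respawn settings, restarting it"
/etc/init.d/"$SERVICE" stop
sleep 5
/etc/init.d/"$SERVICE" start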

Well, I believe this is now solved: Commit details
Thanks to all involved in the fix!

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.