DescriptionVojtech Juranek
2019-10-29 08:24:15 UTC
+++ This bug was initially created as a clone of Bug #1761260 +++
Description of problem:
When using KillMode=mixed, and ExecStopPost is set, systemd does not deliver
the expected SIGKILL to child processes, leaving child processes running after
the service is stopped.
Version-Release number of selected component (if applicable):
systemd-239-14.git33ccd62.fc29.x86_64
How reproducible:
Always
Steps to Reproduce:
1. Install a service with KillMode=mixed and ExecStopPost set
2. Run a service starting child processes
3. Stop the service
Actual results:
systemd does not send SIGTERM or SIGKILL to child process.
Child process started by the service is running after the service is stopped.
Expected results:
Systemd sends SIGKILL to child process.
Child process started by the service is not running after service is stopped.
Here are example scripts and service file reproducing this issue.
## Service script
$ cat /usr/local/bin/kill-children-test
#!/usr/bin/python3 -u
import signal
import subprocess
import time
def terminate(signo, frame):
print("Received termination signal: {}".format(signo))
raise SystemExit
signal.signal(signal.SIGTERM, terminate)
try:
p = subprocess.Popen(["/usr/local/bin/non-terminating-child"])
print("started child process pid={}".format(p.pid))
p.wait()
print("child terminated rc={}".format(p.returncode))
except SystemExit:
pass
print("Terminating")
time.sleep(2)
## Child process script
This script intentionally ignores SIGTERM, for testing that systemd
send a SIGKILL.
$ cat /usr/local/bin/non-terminating-child
#!/usr/bin/python3 -u
import signal
import time
def terminate(signo, frame):
print("Child ignoring signal {}".format(signo))
signal.signal(signal.SIGTERM, terminate)
while 1:
print("Child waiting")
time.sleep(300)
print("Child woke up")
## Service file
The service simulate pre/post scripts that take some time to run.
$ cat /etc/systemd/system/kill-children-test.service
[Unit]
Description=Testing killMode=mixed with ExecStopPost
[Service]
Type=simple
ExecStartPre=/usr/bin/true
ExecStart=/usr/local/bin/kill-children-test
ExecStopPost=/usr/bin/true
Restart=always
TimeoutStopSec=10
KillMode=mixed
## Reproduce leftover process
$ sudo systemctl daemon-reload
$ sudo systemctl start kill-children-test.service; sleep 5; sudo systemctl stop kill-children-test
$ ps -eF | grep non-terminating-child | grep -v grep
root 4931 1 0 57994 9024 5 22:35 ? 00:00:00 /usr/bin/python3 -u /usr/local/bin/non-terminating-child
$ sudo kill -9 4931
$ journalctl -f -u kill-children-test -n0
-- Logs begin at Sat 2019-09-28 20:04:15 IDT. --
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/start/replace
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/start as 25707
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/start as 25707
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/bin/true
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Forked /usr/bin/true as 4928
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Changed dead -> start-pre
Oct 13 22:35:58 sparse-local systemd[1]: Starting Testing killMode=mixed with ExecStopPost...
Oct 13 22:35:58 sparse-local systemd[4928]: kill-children-test.service: Executing: /usr/bin/true
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Child 4928 belongs to kill-children-test.service.
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Control process exited, code=exited status=0
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Got final SIGCHLD for state start-pre.
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Passing 0 fds to service
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/local/bin/kill-children-test
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Forked /usr/local/bin/kill-children-test as 4929
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Changed start-pre -> running
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/start finished, result=done
Oct 13 22:35:58 sparse-local systemd[1]: Started Testing killMode=mixed with ExecStopPost.
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Failed to send unit change signal for kill-children-test.service: Connection reset by peer
Oct 13 22:35:58 sparse-local systemd[4929]: kill-children-test.service: Executing: /usr/local/bin/kill-children-test
Oct 13 22:35:58 sparse-local kill-children-test[4929]: started child process pid=4931
Child started here...
Oct 13 22:35:58 sparse-local kill-children-test[4929]: Child waiting
Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/stop/replace
Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/stop as 25814
Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/stop as 25814
Oct 13 22:36:03 sparse-local kill-children-test[4929]: Received termination signal: 15
Main process received SIGTERM
Oct 13 22:36:03 sparse-local kill-children-test[4929]: Terminating
Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Changed running -> stop-sigterm
Oct 13 22:36:03 sparse-local systemd[1]: Stopping Testing killMode=mixed with ExecStopPost...
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Child 4929 belongs to kill-children-test.service.
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Main process exited, code=exited, status=0/SUCCESS
Main process exited, system running ExecStopPost, ignoring child process
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/bin/true
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Forked /usr/bin/true as 4939
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Changed stop-sigterm -> stop-post
Oct 13 22:36:05 sparse-local systemd[4939]: kill-children-test.service: Executing: /usr/bin/true
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Child 4939 belongs to kill-children-test.service.
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Control process exited, code=exited status=0
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Got final SIGCHLD for state stop-post.
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Changed stop-post -> dead
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Failed to destroy cgroup /system.slice/kill-children-test.service, ignoring: Device or resource busy
Systemd fail fail to destroy the cgroup because it left the child
running.
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/stop finished, result=done
Oct 13 22:36:05 sparse-local systemd[1]: Stopped Testing killMode=mixed with ExecStopPost.
Service is reported as stopped, but it has running child processes
Oct 13 22:38:45 sparse-local systemd[1]: kill-children-test.service: Child 4931 belongs to kill-children-test.service.
Child was killed externally
Oct 13 22:38:45 sparse-local systemd[1]: kill-children-test.service: cgroup is empty
Oct 13 22:38:45 sparse-local systemd[1]: kill-children-test.service: Collecting.
## Same flow without setting ExecStopPost
(Removed the ExecStopPost from the service file)
$ journalctl -f -u kill-children-test -n0
-- Logs begin at Sat 2019-09-28 20:04:15 IDT. --
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/start/replace
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/start as 26137
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/start as 26137
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/bin/true
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Forked /usr/bin/true as 5332
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Changed dead -> start-pre
Oct 13 22:44:38 sparse-local systemd[1]: Starting Testing killMode=mixed with ExecStopPost...
Oct 13 22:44:38 sparse-local systemd[5332]: kill-children-test.service: Executing: /usr/bin/true
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Child 5332 belongs to kill-children-test.service.
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Control process exited, code=exited status=0
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Got final SIGCHLD for state start-pre.
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Passing 0 fds to service
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/local/bin/kill-children-test
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Forked /usr/local/bin/kill-children-test as 5333
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Changed start-pre -> running
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/start finished, result=done
Oct 13 22:44:38 sparse-local systemd[1]: Started Testing killMode=mixed with ExecStopPost.
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Failed to send unit change signal for kill-children-test.service: Connection reset by peer
Oct 13 22:44:38 sparse-local systemd[5333]: kill-children-test.service: Executing: /usr/local/bin/kill-children-test
Oct 13 22:44:38 sparse-local kill-children-test[5333]: started child process pid=5335
Oct 13 22:44:38 sparse-local kill-children-test[5333]: Child waiting
Oct 13 22:44:43 sparse-local kill-children-test[5333]: Received termination signal: 15
Main process received SIGTERM
Oct 13 22:44:43 sparse-local kill-children-test[5333]: Terminating
Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/stop/replace
Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/stop as 26244
Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/stop as 26244
Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Changed running -> stop-sigterm
Oct 13 22:44:43 sparse-local systemd[1]: Stopping Testing killMode=mixed with ExecStopPost...
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Child 5333 belongs to kill-children-test.service.
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Main process exited, code=exited, status=0/SUCCESS
Main process terminated
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Killing process 5335 (non-terminating) with signal SIGKILL.
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Changed stop-sigterm -> final-sigkill
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Child 5335 belongs to kill-children-test.service.
Child process was terminated by SIGKILL
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: cgroup is empty
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Changed final-sigkill -> dead
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/stop finished, result=done
Oct 13 22:44:45 sparse-local systemd[1]: Stopped Testing killMode=mixed with ExecStopPost.
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Collecting.
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Collecting.
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Collecting.
I think the expected behavior would be:
1. Stop main process (SIGTERM + SIGKILL if needed)
2. Stop child processes with SIGKILL
3. Run execStopPost if set
If systemd does not handle this, the only reliable way to ensure that
your child processes are killed is using prctl(). This is not easy to
do in many cases (e.g. python), and hard to get right, since the all
processes started by a service, including child processes started the
child processes need to use prctl() after forking the child, before
execve the child.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Low: systemd security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:4553