Created attachment 1625376 [details] Example scripts and service for reproducing this issue Description of problem: When using KillMode=mixed, and ExecStopPost is set, systemd does not deliver the expected SIGKILL to child processes, leaving child processes running after the service is stopped. Version-Release number of selected component (if applicable): systemd-239-14.git33ccd62.fc29.x86_64 How reproducible: Always Steps to Reproduce: 1. Install a service with KillMode=mixed and ExecStopPost set 2. Run a service starting child processes 3. Stop the service Actual results: systemd does not send SIGTERM or SIGKILL to child process. Child process started by the service is running after the service is stopped. Expected results: Systemd sends SIGKILL to child process. Child process started by the service is not running after service is stopped. Here are example scripts and service file reproducing this issue. ## Service script $ cat /usr/local/bin/kill-children-test #!/usr/bin/python3 -u import signal import subprocess import time def terminate(signo, frame): print("Received termination signal: {}".format(signo)) raise SystemExit signal.signal(signal.SIGTERM, terminate) try: p = subprocess.Popen(["/usr/local/bin/non-terminating-child"]) print("started child process pid={}".format(p.pid)) p.wait() print("child terminated rc={}".format(p.returncode)) except SystemExit: pass print("Terminating") time.sleep(2) ## Child process script This script intentionally ignores SIGTERM, for testing that systemd send a SIGKILL. $ cat /usr/local/bin/non-terminating-child #!/usr/bin/python3 -u import signal import time def terminate(signo, frame): print("Child ignoring signal {}".format(signo)) signal.signal(signal.SIGTERM, terminate) while 1: print("Child waiting") time.sleep(300) print("Child woke up") ## Service file The service simulate pre/post scripts that take some time to run. $ cat /etc/systemd/system/kill-children-test.service [Unit] Description=Testing killMode=mixed with ExecStopPost [Service] Type=simple ExecStartPre=/usr/bin/true ExecStart=/usr/local/bin/kill-children-test ExecStopPost=/usr/bin/true Restart=always TimeoutStopSec=10 KillMode=mixed ## Reproduce leftover process $ sudo systemctl daemon-reload $ sudo systemctl start kill-children-test.service; sleep 5; sudo systemctl stop kill-children-test $ ps -eF | grep non-terminating-child | grep -v grep root 4931 1 0 57994 9024 5 22:35 ? 00:00:00 /usr/bin/python3 -u /usr/local/bin/non-terminating-child $ sudo kill -9 4931 $ journalctl -f -u kill-children-test -n0 -- Logs begin at Sat 2019-09-28 20:04:15 IDT. -- Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/start/replace Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/start as 25707 Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/start as 25707 Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/bin/true Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Forked /usr/bin/true as 4928 Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Changed dead -> start-pre Oct 13 22:35:58 sparse-local systemd[1]: Starting Testing killMode=mixed with ExecStopPost... Oct 13 22:35:58 sparse-local systemd[4928]: kill-children-test.service: Executing: /usr/bin/true Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Child 4928 belongs to kill-children-test.service. Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Control process exited, code=exited status=0 Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Got final SIGCHLD for state start-pre. Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Passing 0 fds to service Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/local/bin/kill-children-test Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Forked /usr/local/bin/kill-children-test as 4929 Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Changed start-pre -> running Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/start finished, result=done Oct 13 22:35:58 sparse-local systemd[1]: Started Testing killMode=mixed with ExecStopPost. Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Failed to send unit change signal for kill-children-test.service: Connection reset by peer Oct 13 22:35:58 sparse-local systemd[4929]: kill-children-test.service: Executing: /usr/local/bin/kill-children-test Oct 13 22:35:58 sparse-local kill-children-test[4929]: started child process pid=4931 Child started here... Oct 13 22:35:58 sparse-local kill-children-test[4929]: Child waiting Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/stop/replace Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/stop as 25814 Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/stop as 25814 Oct 13 22:36:03 sparse-local kill-children-test[4929]: Received termination signal: 15 Main process received SIGTERM Oct 13 22:36:03 sparse-local kill-children-test[4929]: Terminating Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Changed running -> stop-sigterm Oct 13 22:36:03 sparse-local systemd[1]: Stopping Testing killMode=mixed with ExecStopPost... Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Child 4929 belongs to kill-children-test.service. Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Main process exited, code=exited, status=0/SUCCESS Main process exited, system running ExecStopPost, ignoring child process Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/bin/true Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Forked /usr/bin/true as 4939 Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Changed stop-sigterm -> stop-post Oct 13 22:36:05 sparse-local systemd[4939]: kill-children-test.service: Executing: /usr/bin/true Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Child 4939 belongs to kill-children-test.service. Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Control process exited, code=exited status=0 Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Got final SIGCHLD for state stop-post. Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Changed stop-post -> dead Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Failed to destroy cgroup /system.slice/kill-children-test.service, ignoring: Device or resource busy Systemd fail fail to destroy the cgroup because it left the child running. Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/stop finished, result=done Oct 13 22:36:05 sparse-local systemd[1]: Stopped Testing killMode=mixed with ExecStopPost. Service is reported as stopped, but it has running child processes Oct 13 22:38:45 sparse-local systemd[1]: kill-children-test.service: Child 4931 belongs to kill-children-test.service. Child was killed externally Oct 13 22:38:45 sparse-local systemd[1]: kill-children-test.service: cgroup is empty Oct 13 22:38:45 sparse-local systemd[1]: kill-children-test.service: Collecting. ## Same flow without setting ExecStopPost (Removed the ExecStopPost from the service file) $ journalctl -f -u kill-children-test -n0 -- Logs begin at Sat 2019-09-28 20:04:15 IDT. -- Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/start/replace Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/start as 26137 Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/start as 26137 Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/bin/true Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Forked /usr/bin/true as 5332 Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Changed dead -> start-pre Oct 13 22:44:38 sparse-local systemd[1]: Starting Testing killMode=mixed with ExecStopPost... Oct 13 22:44:38 sparse-local systemd[5332]: kill-children-test.service: Executing: /usr/bin/true Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Child 5332 belongs to kill-children-test.service. Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Control process exited, code=exited status=0 Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Got final SIGCHLD for state start-pre. Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Passing 0 fds to service Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/local/bin/kill-children-test Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Forked /usr/local/bin/kill-children-test as 5333 Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Changed start-pre -> running Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/start finished, result=done Oct 13 22:44:38 sparse-local systemd[1]: Started Testing killMode=mixed with ExecStopPost. Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Failed to send unit change signal for kill-children-test.service: Connection reset by peer Oct 13 22:44:38 sparse-local systemd[5333]: kill-children-test.service: Executing: /usr/local/bin/kill-children-test Oct 13 22:44:38 sparse-local kill-children-test[5333]: started child process pid=5335 Oct 13 22:44:38 sparse-local kill-children-test[5333]: Child waiting Oct 13 22:44:43 sparse-local kill-children-test[5333]: Received termination signal: 15 Main process received SIGTERM Oct 13 22:44:43 sparse-local kill-children-test[5333]: Terminating Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/stop/replace Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/stop as 26244 Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/stop as 26244 Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Changed running -> stop-sigterm Oct 13 22:44:43 sparse-local systemd[1]: Stopping Testing killMode=mixed with ExecStopPost... Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Child 5333 belongs to kill-children-test.service. Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Main process exited, code=exited, status=0/SUCCESS Main process terminated Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Killing process 5335 (non-terminating) with signal SIGKILL. Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Changed stop-sigterm -> final-sigkill Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Child 5335 belongs to kill-children-test.service. Child process was terminated by SIGKILL Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: cgroup is empty Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Changed final-sigkill -> dead Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/stop finished, result=done Oct 13 22:44:45 sparse-local systemd[1]: Stopped Testing killMode=mixed with ExecStopPost. Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Collecting. Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Collecting. Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Collecting. I think the expected behavior would be: 1. Stop main process (SIGTERM + SIGKILL if needed) 2. Stop child processes with SIGKILL 3. Run execStopPost if set If systemd does not handle this, the only reliable way to ensure that your child processes are killed is using prctl(). This is not easy to do in many cases (e.g. python), and hard to get right, since the all processes started by a service, including child processes started the child processes need to use prctl() after forking the child, before execve the child.
This message is a reminder that Fedora 29 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '29'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 29 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.
This bug appears to have been reported against 'rawhide' during the Fedora 32 development cycle. Changing version to 32.
https://github.com/systemd/systemd/pull/14871
FEDORA-2020-645de57f2e has been submitted as an update to Fedora 32. https://bodhi.fedoraproject.org/updates/FEDORA-2020-645de57f2e
systemd-245.2-1.fc32 has been pushed to the Fedora 32 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-645de57f2e
FEDORA-2020-645de57f2e has been pushed to the Fedora 32 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2020-287a343a35 has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2020-287a343a35
FEDORA-2020-287a343a35 has been pushed to the Fedora 31 stable repository. If problem still persists, please make note of it in this bug report.