Bug 1761260 - Systemd does not attempt to kill child processes if ExecStopPost is set
Summary: Systemd does not attempt to kill child processes if ExecStopPost is set
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 31
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1766477 1766479
TreeView+ depends on / blocked
 
Reported: 2019-10-13 20:06 UTC by Nir Soffer
Modified: 2020-04-02 09:55 UTC (History)
9 users (show)

Fixed In Version: systemd-243.8-1.fc31
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1766477 1766479 (view as bug list)
Environment:
Last Closed: 2020-04-02 09:55:24 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Example scripts and service for reproducing this issue (659 bytes, application/gzip)
2019-10-13 20:06 UTC, Nir Soffer
no flags Details

Description Nir Soffer 2019-10-13 20:06:31 UTC
Created attachment 1625376 [details]
Example scripts and service for reproducing this issue

Description of problem:

When using KillMode=mixed, and ExecStopPost is set, systemd does not deliver
the expected SIGKILL to child processes, leaving child processes running after
the service is stopped.

Version-Release number of selected component (if applicable):
systemd-239-14.git33ccd62.fc29.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install a service with KillMode=mixed and ExecStopPost set
2. Run a service starting child processes
3. Stop the service

Actual results:
systemd does not send SIGTERM or SIGKILL to child process.
Child process started by the service is running after the service is stopped.

Expected results:
Systemd sends SIGKILL to child process.
Child process started by the service is not running after service is stopped.

Here are example scripts and service file reproducing this issue.

## Service script

$ cat /usr/local/bin/kill-children-test
#!/usr/bin/python3 -u

import signal
import subprocess
import time

def terminate(signo, frame):
    print("Received termination signal: {}".format(signo))
    raise SystemExit

signal.signal(signal.SIGTERM, terminate)

try:
    p = subprocess.Popen(["/usr/local/bin/non-terminating-child"])
    print("started child process pid={}".format(p.pid))
    p.wait()
    print("child terminated rc={}".format(p.returncode))
except SystemExit:
    pass

print("Terminating")
time.sleep(2)


## Child process script

This script intentionally ignores SIGTERM, for testing that systemd
send a SIGKILL.

$ cat /usr/local/bin/non-terminating-child
#!/usr/bin/python3 -u

import signal
import time

def terminate(signo, frame):
    print("Child ignoring signal {}".format(signo))

signal.signal(signal.SIGTERM, terminate)

while 1:
    print("Child waiting")
    time.sleep(300)
    print("Child woke up")


## Service file

The service simulate pre/post scripts that take some time to run.

$ cat /etc/systemd/system/kill-children-test.service
[Unit]
Description=Testing killMode=mixed with ExecStopPost

[Service]
Type=simple
ExecStartPre=/usr/bin/true
ExecStart=/usr/local/bin/kill-children-test
ExecStopPost=/usr/bin/true
Restart=always
TimeoutStopSec=10
KillMode=mixed


## Reproduce leftover process

$ sudo systemctl daemon-reload

$ sudo systemctl start kill-children-test.service; sleep 5; sudo systemctl stop kill-children-test

$ ps -eF | grep non-terminating-child | grep -v grep
root      4931     1  0 57994  9024   5 22:35 ?        00:00:00 /usr/bin/python3 -u /usr/local/bin/non-terminating-child

$ sudo kill -9 4931

$ journalctl -f -u kill-children-test -n0
-- Logs begin at Sat 2019-09-28 20:04:15 IDT. --
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/start/replace
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/start as 25707
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/start as 25707
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/bin/true
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Forked /usr/bin/true as 4928
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Changed dead -> start-pre
Oct 13 22:35:58 sparse-local systemd[1]: Starting Testing killMode=mixed with ExecStopPost...
Oct 13 22:35:58 sparse-local systemd[4928]: kill-children-test.service: Executing: /usr/bin/true
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Child 4928 belongs to kill-children-test.service.
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Control process exited, code=exited status=0
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Got final SIGCHLD for state start-pre.
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Passing 0 fds to service
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/local/bin/kill-children-test
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Forked /usr/local/bin/kill-children-test as 4929
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Changed start-pre -> running
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/start finished, result=done
Oct 13 22:35:58 sparse-local systemd[1]: Started Testing killMode=mixed with ExecStopPost.
Oct 13 22:35:58 sparse-local systemd[1]: kill-children-test.service: Failed to send unit change signal for kill-children-test.service: Connection reset by peer
Oct 13 22:35:58 sparse-local systemd[4929]: kill-children-test.service: Executing: /usr/local/bin/kill-children-test
Oct 13 22:35:58 sparse-local kill-children-test[4929]: started child process pid=4931

    Child started here...

Oct 13 22:35:58 sparse-local kill-children-test[4929]: Child waiting
Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/stop/replace
Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/stop as 25814
Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/stop as 25814
Oct 13 22:36:03 sparse-local kill-children-test[4929]: Received termination signal: 15

    Main process received SIGTERM

Oct 13 22:36:03 sparse-local kill-children-test[4929]: Terminating
Oct 13 22:36:03 sparse-local systemd[1]: kill-children-test.service: Changed running -> stop-sigterm
Oct 13 22:36:03 sparse-local systemd[1]: Stopping Testing killMode=mixed with ExecStopPost...
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Child 4929 belongs to kill-children-test.service.
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Main process exited, code=exited, status=0/SUCCESS

    Main process exited, system running ExecStopPost, ignoring child process

Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/bin/true
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Forked /usr/bin/true as 4939
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Changed stop-sigterm -> stop-post
Oct 13 22:36:05 sparse-local systemd[4939]: kill-children-test.service: Executing: /usr/bin/true
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Child 4939 belongs to kill-children-test.service.
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Control process exited, code=exited status=0
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Got final SIGCHLD for state stop-post.
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Changed stop-post -> dead
Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Failed to destroy cgroup /system.slice/kill-children-test.service, ignoring: Device or resource busy

    Systemd fail fail to destroy the cgroup because it left the child
    running.

Oct 13 22:36:05 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/stop finished, result=done
Oct 13 22:36:05 sparse-local systemd[1]: Stopped Testing killMode=mixed with ExecStopPost.

    Service is reported as stopped, but it has running child processes

Oct 13 22:38:45 sparse-local systemd[1]: kill-children-test.service: Child 4931 belongs to kill-children-test.service.

    Child was killed externally

Oct 13 22:38:45 sparse-local systemd[1]: kill-children-test.service: cgroup is empty
Oct 13 22:38:45 sparse-local systemd[1]: kill-children-test.service: Collecting.


## Same flow without setting ExecStopPost

(Removed the ExecStopPost from the service file)

$ journalctl -f -u kill-children-test -n0
-- Logs begin at Sat 2019-09-28 20:04:15 IDT. --
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/start/replace
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/start as 26137
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/start as 26137
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/bin/true
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Forked /usr/bin/true as 5332
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Changed dead -> start-pre
Oct 13 22:44:38 sparse-local systemd[1]: Starting Testing killMode=mixed with ExecStopPost...
Oct 13 22:44:38 sparse-local systemd[5332]: kill-children-test.service: Executing: /usr/bin/true
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Child 5332 belongs to kill-children-test.service.
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Control process exited, code=exited status=0
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Got final SIGCHLD for state start-pre.
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Passing 0 fds to service
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: About to execute: /usr/local/bin/kill-children-test
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Forked /usr/local/bin/kill-children-test as 5333
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Changed start-pre -> running
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/start finished, result=done
Oct 13 22:44:38 sparse-local systemd[1]: Started Testing killMode=mixed with ExecStopPost.
Oct 13 22:44:38 sparse-local systemd[1]: kill-children-test.service: Failed to send unit change signal for kill-children-test.service: Connection reset by peer
Oct 13 22:44:38 sparse-local systemd[5333]: kill-children-test.service: Executing: /usr/local/bin/kill-children-test
Oct 13 22:44:38 sparse-local kill-children-test[5333]: started child process pid=5335
Oct 13 22:44:38 sparse-local kill-children-test[5333]: Child waiting
Oct 13 22:44:43 sparse-local kill-children-test[5333]: Received termination signal: 15

    Main process received SIGTERM

Oct 13 22:44:43 sparse-local kill-children-test[5333]: Terminating
Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Trying to enqueue job kill-children-test.service/stop/replace
Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Installed new job kill-children-test.service/stop as 26244
Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Enqueued job kill-children-test.service/stop as 26244
Oct 13 22:44:43 sparse-local systemd[1]: kill-children-test.service: Changed running -> stop-sigterm
Oct 13 22:44:43 sparse-local systemd[1]: Stopping Testing killMode=mixed with ExecStopPost...
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Child 5333 belongs to kill-children-test.service.
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Main process exited, code=exited, status=0/SUCCESS

    Main process terminated

Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Killing process 5335 (non-terminating) with signal SIGKILL.
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Changed stop-sigterm -> final-sigkill
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Child 5335 belongs to kill-children-test.service.

    Child process was terminated by SIGKILL

Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: cgroup is empty
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Changed final-sigkill -> dead
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Job kill-children-test.service/stop finished, result=done
Oct 13 22:44:45 sparse-local systemd[1]: Stopped Testing killMode=mixed with ExecStopPost.
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Collecting.
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Collecting.
Oct 13 22:44:45 sparse-local systemd[1]: kill-children-test.service: Collecting.


I think the expected behavior would be:

1. Stop main process (SIGTERM + SIGKILL if needed)
2. Stop child processes with SIGKILL
3. Run execStopPost if set

If systemd does not handle this, the only reliable way to ensure that
your child processes are killed is using prctl(). This is not easy to
do in many cases (e.g. python), and hard to get right, since the all
processes started by a service, including child processes started the
child processes need to use prctl() after forking the child, before
execve the child.

Comment 1 Ben Cotton 2019-10-31 18:43:21 UTC
This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 2 Ben Cotton 2019-11-27 23:37:15 UTC
Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 3 Ben Cotton 2020-02-11 17:43:20 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 32 development cycle.
Changing version to 32.

Comment 4 Zbigniew Jędrzejewski-Szmek 2020-03-11 09:18:11 UTC
https://github.com/systemd/systemd/pull/14871

Comment 5 Fedora Update System 2020-03-18 21:21:05 UTC
FEDORA-2020-645de57f2e has been submitted as an update to Fedora 32. https://bodhi.fedoraproject.org/updates/FEDORA-2020-645de57f2e

Comment 6 Fedora Update System 2020-03-19 02:26:40 UTC
systemd-245.2-1.fc32 has been pushed to the Fedora 32 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-645de57f2e

Comment 7 Fedora Update System 2020-03-23 20:24:47 UTC
FEDORA-2020-645de57f2e has been pushed to the Fedora 32 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 8 Fedora Update System 2020-04-01 18:25:15 UTC
FEDORA-2020-287a343a35 has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2020-287a343a35

Comment 9 Fedora Update System 2020-04-02 09:55:24 UTC
FEDORA-2020-287a343a35 has been pushed to the Fedora 31 stable repository.
If problem still persists, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.