Bug 1201657
| Summary: | Running docker stop does not properly stop systemd-container-based container, SIGTERM causes reexecution | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jan Pazdziora (Red Hat) <jpazdziora> |
| Component: | docker-latest | Assignee: | Lokesh Mandvekar <lsm5> |
| Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.1 | CC: | ajia, dwalsh, jpazdziora, liko, lsm5, lsu, praiskup, smccarty, vlisivka, vpavlin |
| Target Milestone: | rc | Keywords: | Extras |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
If you are running systemd as PID 1 inside your container, you need to send it the proper signal for it to shut down correctly.
By default docker sends SIGTERM to PID 1, but you can now specify the stop signal using the --stop-signal option:
docker run --stop-signal=RTMIN+3 ...
This causes the RTMIN+3 signal to be sent to PID 1 when docker stop is executed. If systemd is running as PID 1, it will shut down correctly.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-05-12 14:53:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1303656 | ||
The following patch against systemd-container-208.20-4.el7 fixes the behaviour:
From: Jan Pazdziora <jpazdziora>
Date: Fri Mar 13 06:09:04 EDT 2015
Subject: [PATCH] Fix SIGTERM handling in docker container
diff -ru systemd-208.dist/src/core/manager.c systemd-208/src/core/manager.c
--- systemd-208.dist/src/core/manager.c 2015-03-13 05:56:30.487603754 -0400
+++ systemd-208/src/core/manager.c 2015-03-13 06:05:08.268935431 -0400
@@ -1483,8 +1483,13 @@
if (m->running_as == SYSTEMD_SYSTEM) {
/* This is for compatibility with the
* original sysvinit */
- m->exit_code = MANAGER_REEXECUTE;
- break;
+ char *container = NULL;
+ detect_container(&container);
+ if (!(container && streq(container, "docker"))) {
+ /* But in a docker container, SIGTERM should just invoke exit */
+ m->exit_code = MANAGER_REEXECUTE;
+ break;
+ }
}
/* Fall through */
With this, running docker stop <the-container-id> takes
real 0m1.575s
while on the container console, there is
[ OK ] Started The Apache HTTP Server.
[ OK ] Reached target Multi-User System.
Red Hat Enterprise Linux Server 7.1 (Maipo)
Kernel 3.10.0-229.el7.x86_64 on an x86_64
5f17a0fa1205 login: [ OK ] Stopped target Multi-User System.
Stopping Enable periodic update of entitlement certificates....
Stopping The Apache HTTP Server...
[ OK ] Stopped target Login Prompts.
Stopping Console Getty...
[ OK ] Stopped Enable periodic update of entitlement certificates..
[ OK ] Stopped Console Getty.
[ OK ] Stopped The Apache HTTP Server.
[ OK ] Stopped target Basic System.
[ OK ] Stopped target Slices.
[ OK ] Stopped target Paths.
[ OK ] Stopped target Timers.
[ OK ] Stopped target Sockets.
[ OK ] Stopped target System Initialization.
Stopping Create Volatile Files and Directories...
[ OK ] Stopped Create Volatile Files and Directories.
[ OK ] Reached target Shutdown.
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Storage is finalized.
Exiting container.
#
Please consider patching systemd-container or proposing this change to systemd as well.
No, this is a defined interface in systemd. The correct way is to send SIGRTMIN+3. Docker should allow specifying another signal. Dan, could you take a look and see what we can do about it?

Well, you can add other commands, but how is docker supposed to know what is running within the container? I think systemd should work properly in a container environment. Otherwise we would need to have administrators configure systemd-based images differently.

(In reply to Daniel Walsh from comment #5)
> Well you can add other commands, but how is docker supposed to know what is
> running within the container.

We could use LABEL to define the behaviour. Not sure if docker would be open to acting based on it, though.

> I think systemd should work properly in a
> container environment.

And the question is how do we define "properly"? Docker defines that it sends SIGTERM for graceful shutdown. Systemd defines that it expects SIGRTMIN+3 for graceful shutdown and that SIGTERM does reload/reexec. Clearly, there is a clash.

Yes, I agree that it should be systemd that adapts to the environment it's running in, rather than forcing the surrounding environment to change its behaviour.

So do you agree with the (direction of the) proposed patch from comment 2 (with the understanding that this change of behaviour under container=docker would need to be documented in systemd documentation as well)?

Yes, I agree that systemd should know it is running in a docker container and change its behaviour to handle signals.

I am pretty sure docker should make the kill signal configurable.

When you send PID 1 a SIGTERM, this means that PID 1 shall reexecute. It has always meant that, regardless of whether you look at sysvinit, upstart or systemd. It's one of the special semantics that PID 1 has. Others are that you get foreign child processes reparented to you, that SIGPWR is sent to you, and so on. Systemd will only react to SIGTERM like this if it is PID 1.
If docker starts a process as PID 1 then it needs to speak the init protocol, and hence SIGTERM results in reexec, nothing else. If docker wants to support non-init processes in containers, then it should not run them as PID 1, to avoid this confusion.

There's really nothing to fix here in systemd. Docker should just follow UNIX semantics on this one.

I mean, to turn this around, what actually happens if docker runs arbitrary processes as PID 1, which don't happen to reap foreign child processes, and something down the process tree dies? Will it collect zombies indefinitely?

Looks like it would be nice if Docker was ready to run any program, without changes, correctly as its first process (must that necessarily be PID 1?). Being dependent on a fully featured PID 1 defined in the container is not ideal. For systemd, respecting not only SIGTERM in a container (and shutting down), but also SIGINT would be very neat (running a container with -i -t, being able to CTRL-C the process). The problem is probably that the "reexec" feature would need to be moved from the SIGTERM handler somewhere else.

In any case, this seems to be a really blocking issue for systemd & docker cooperation. Is there some easily maintainable workaround?

If systemd will not change and docker will not change, we are at a crossroads. We could attempt to add an option to docker run/create that states you are running a systemd container, which is what Lennart would prefer, and send the signals systemd expects. Or we could ship a bash script that installs itself at /bin/init in the base container, traps SIGINT, and sends its child the systemd signals to die. It would then execute systemd, but this would mean systemd would not be running as PID 1, which might cause other problems. I am sure this is what docker would prefer, although docker would prefer that systemd just realize that it is running in a docker container.
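The wrapper idea from the last comment can be sketched roughly as follows. This is only an illustration in Python rather than the bash script that was proposed; the names translate and run_wrapped are hypothetical, and the choice to map both SIGTERM and SIGINT to SIGRTMIN+3 is an assumption based on the signals discussed in this thread.

```python
import signal
import subprocess

# Signal that systemd treats as a shutdown request when it runs as PID 1,
# according to this thread.
SYSTEMD_SHUTDOWN = signal.SIGRTMIN + 3


def translate(signum):
    """Map the signals docker or a terminal would send to the systemd one.

    Hypothetical helper: anything other than SIGTERM/SIGINT passes through.
    """
    if signum in (signal.SIGTERM, signal.SIGINT):
        return SYSTEMD_SHUTDOWN
    return signum


def run_wrapped(argv):
    """Start argv as a child and forward translated signals to it."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Relay the (translated) signal to the child process.
        child.send_signal(translate(signum))

    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, forward)

    # Block until the child exits and hand back its exit status.
    return child.wait()
```

As the comment itself points out, with such a wrapper the wrapped program is a child of the wrapper and no longer PID 1, so this does not address the objection raised about running systemd as a non-PID-1 process.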
(In reply to Lennart Poettering from comment #8)
> If docker wants to support non-init processes in containers, then it should
> not run them as PID 1, to avoid this confusion.

Docker does not want to support non-init processes, or init processes for that matter. Docker behaves in a certain way and there is a business need to run systemd (or systemd-container) in Docker containers. Systemd has a way to detect it is running in a Docker container and act accordingly.

> There's really nothing to fix here in systemd. Docker should just follow
> UNIX semantics on this one.

So what are the UNIX semantics for gracefully shutting down PID 1 in a container?

(In reply to Lennart Poettering from comment #9)
> I mean, to turn this around, what actually happens if docker runs arbitrary
> processes as PID 1, which don't happen to reap foreign child processes and
> something down the process tree dies? Will it collect zombies indefinitely?

Yes, running

docker run -ti fedora-21-perl perl -e 'while (1) { if (fork()) { sleep 3 } else { sleep 5 ; exit }}'

shows that the zombies will continue accumulating and staying around.

(In reply to Daniel Walsh from comment #11)
> It would then execute systemd,
> but this would mean systemd would not be running as pid1, which might cause
> other problems.

I assume you mean fork systemd here, not exec. Yes, then systemd is not running as PID 1 and then we can just not use systemd at all.

BTW, Docker relies on the killing of the PID namespace as a mechanism to clean up container processes. If PID 1 of a PID namespace dies, the kernel will kill all processes.

As of right now I am in a holding pattern. I cannot fix this, since neither party will change its default behavior.

Well, it's not systemd that redefines UNIX semantics here, it's docker. It shouldn't run code that is not prepared to run as PID 1 as PID 1, and it should not run code that expects to be run as PID 1 as non-PID 1. It's that simple.
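For readers unfamiliar with the zombie problem raised above, here is a minimal sketch of the reaping that an init-like process is expected to do. It only reaps its own direct children, whereas a real PID 1 also receives reparented processes, but the waitpid(-1, WNOHANG) loop is the same mechanism. The function names are hypothetical.

```python
import os


def spawn_short_lived_children(count):
    """Fork `count` children that exit immediately; return their PIDs."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit at once, becoming a zombie until reaped
        pids.append(pid)
    return pids


def reap_exited_children():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

A process that never calls something like reap_exited_children leaves its exited children in the zombie state, which is what the perl one-liner in the comment demonstrates.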
I have submitted a pull request to run docker in systemd mode.

https://github.com/docker/docker/pull/13525

They are currently objecting to the --systemd option and we are suggesting alternative options to allow for either a standard multi-service container framework, or the specification of multiple init systems. Something like

--boot=systemd

Then others who invent newer init systems could add

--boot=foobar

and submit patches for that. I think this gives us the best way forward to bridge the divide between docker and systemd in a container.

I have some success with "docker exec CONTAINER shutdown -h now" followed by "docker kill CONTAINER", but sometimes a hard lock happens: docker cannot kill the container because of a strange error:
# docker exec 89dcce833701 bash
nsenter: Failed to open ns file /proc/17433/ns for ns ipc: No such file or directory
Cannot run exec command dd50f760c4e428f8c58b554c36df9957eb2cf9a224ed7c397e9ed4316e1a1a7a in container 89dcce833701f93988b522862b43df1b30e5e69aea4b081de801db381548e6d3: [8] System error: exit status 1
Error starting exec command in container dd50f760c4e428f8c58b554c36df9957eb2cf9a224ed7c397e9ed4316e1a1a7a: Cannot run exec command dd50f760c4e428f8c58b554c36df9957eb2cf9a224ed7c397e9ed4316e1a1a7a in container 89dcce833701f93988b522862b43df1b30e5e69aea4b081de801db381548e6d3: [8] System error: exit status 1
"docker kill 89dcce833701" just hangs.
# LANG=C ls -l /proc/17433/{cwd,exe,root}
ls: cannot read symbolic link /proc/17433/cwd: No such file or directory
ls: cannot read symbolic link /proc/17433/exe: No such file or directory
ls: cannot read symbolic link /proc/17433/root: No such file or directory
lrwxrwxrwx 1 root root 0 Jun 3 02:15 /proc/17433/cwd
lrwxrwxrwx 1 root root 0 Jun 3 02:15 /proc/17433/exe
lrwxrwxrwx 1 root root 0 Jun 3 02:15 /proc/17433/root
It looks like a bug in the kernel (CentOS7, kernel 3.10.0-229.1.2.el7.x86_64, docker 1.5 or 1.6).
PS. Additional information about the stuck systemd process in the container after "shutdown -h now", if somebody is interested:

# ps l 17433
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 0 17433 1 20 0 0 0 zap_pi Ds ? 0:00 [systemd-shutdow]

# cat /proc/17433/syscall
128 0x7fff3edf8a30 0x0 0x7fff3edf8a20 0x8 0x2ee88696 0x2 0x7fff3edf8940 0x7f38622f95de

I.e. the process is waiting for something.

# docker rm 89dcce833701
Error response from daemon: Cannot destroy container 89dcce833701: Driver devicemapper failed to remove root filesystem 89dcce833701f93988b522862b43df1b30e5e69aea4b081de801db381548e6d3: Device is Busy
FATA[0010] Error: failed to remove one or more containers

The workaround is to stop all services in the container first using the command:

docker exec CONTAINER bash -c 'systemctl stop $( cd /etc/systemd/system/; ls *.service ; cd /usr/lib/systemd/system; ls *.service ; cd /run/systemd/system/; ls *.service )'

Then kill the container using the command

docker kill CONTAINER

docker run --init=systemd will be in docker-1.7 release.

(In reply to Daniel Walsh from comment #22)
> docker run --init=systemd
>
> will be in docker-1.7 release.

Hi Dan,

In docker-1.7.0-4.el7.x86_64, I see the docker daemon still sends 15 to the "--init=systemd" container. What I wonder about is the two newly added variables config.init and container.init: the first one receives its value from *flInit, but I don't find anything that passes the value to container.init. Any suggestion?
# docker run -it --init=systemd rhel7 /bin/bash
# docker inspect 9d403b0ce2d2 | grep -i init
"Init": "systemd"
# time docker stop 9d403b0ce2d2
real 0m10.307s
user 0m0.018s
sys 0m0.016s

DEBU[0094] Sending 15 to 9d403b0ce2d2f2cda8d2e29a699410d51adb9a6a30a11d54472024b96c732607
INFO[0104] Container 9d403b0ce2d2f2cda8d2e29a699410d51adb9a6a30a11d54472024b96c732607 failed to exit within 10 seconds of SIGTERM - using the force
DEBU[0104] Sending 9 to 9d403b0ce2d2f2cda8d2e29a699410d51adb9a6a30a11d54472024b96c732607

And ```container.init``` is empty after recompiling the src rpm.

Ok, looks like a bug. I will look into it.

Dan, with docker-1.7.1-115.el7.x86_64, running

# docker run -ti --init=systemd -v /sys/fs/cgroup:/sys/fs/cgroup bz1201657

yields

flag provided but not defined: --init
See 'docker run --help'.

Am I correct in assuming that the --init plan is now obsolete and https://github.com/docker/docker/pull/15307 in Docker 1.9 is the new solution?

Yes, docker refused --init=systemd, so we are going about this a different way. There is a new flag in docker run

--stop-signal=SIGTERM Signal to stop a container, SIGTERM by default

which will be in docker-1.9.

We are looking at adding systemd support under runc; you should be seeing something on this in the next week or two.

In docker-1.9.1-15.el7.x86_64, with the new option ``docker run --stop-signal`` (also supported in Dockerfile, see keyword ``STOPSIGNAL``):
#docker run --stop-signal=SIGKILL -it rhel7 /bin/bash
#time docker stop e59b38587eff
e59b38587eff
real 0m0.484s
user 0m0.038s
sys 0m0.014s
Even so, there is something wrong with --stop-signal=SIGTERM:
INFO[0259] {Action=stop, ID=1fe42ecccda47a11cf7e12930eb229766429cc5afc89dd19e1e629695ae77541, Username=root, LoginUID=0, PID=5293}
DEBU[0259] Sending 15 to 1fe42ecccda47a11cf7e12930eb229766429cc5afc89dd19e1e629695ae77541
INFO[0269] Container 1fe42ecccda47a11cf7e12930eb229766429cc5afc89dd19e1e629695ae77541 failed to exit within 10 seconds of SIGTERM - using the force
I'm debugging, and I'd like to verify this one and then open a new bug once I figure out what happened with SIGTERM.
# docker run --stop-signal=RTMIN+3 -ti -v /sys/fs/cgroup:/sys/fs/cgroup httpd /sbin/init

The proper signal to send systemd if it is running as PID 1 is RTMIN+3.

The example above is the command I run when running docker fully locked down.

(In reply to Daniel Walsh from comment #30)
> # docker run --stop-signal=RTMIN+3 -ti -v /sys/fs/cgroup:/sys/fs/cgroup
> httpd /sbin/init
>
> The proper signal to send systemd if it is running as PID 1 is RTMIN+3.
>
> The example above is the command I run when running docker fully locked down.

With docker-1.8.2-10.el7.x86_64 I get

flag provided but not defined: --stop-signal
See 'docker run --help'.

and with the newer docker-1.9.1-6.git6ec29ef.fc23.x86_64 or docker-1.9.1-16.el7.x86_64 I get

Error response from daemon: Invalid signal: RTMIN+3

Do you use the verbatim value 37?

Could you please reverify with a systemd-based container? Ideally we probably also want this documented somewhere, so a documentation bugzilla might be needed.

You need to use --stop-signal=$(kill -l RTMIN+3). See https://github.com/vlisivka/docker-centos7-systemd-unpriv/blob/master/run.sh

This feature is in docker-1.10

(In reply to Daniel Walsh from comment #34)
> This feature is in docker-1.10

In that case let me revert the status back to ASSIGNED, waiting for the 1.10 build to be listed in Fixed-in-version for proper QA.

Well, Lokesh is using the modified field to know what fixes are available. Sadly we cannot build docker-1.10 until we ship docker-1.9, so we are kind of frozen right now. But I think leaving this in Modified is the best state, since we would have to search through all assigned bugs to find which ones are fixed in docker-1.10.
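The kill -l RTMIN+3 workaround mentioned above turns a symbolic real-time signal name into the number that the docker 1.9 builds accepted. A rough Python equivalent is sketched below. The helper name resolve_signal is hypothetical, and the resulting number depends on the platform's SIGRTMIN, so no particular value is assumed here beyond what the thread reports.

```python
import signal


def resolve_signal(spec):
    """Resolve specs such as 'RTMIN+3', 'SIGTERM', 'TERM' or '37' to a number.

    Hypothetical helper; raises AttributeError for unknown signal names.
    """
    spec = spec.strip().upper()
    if spec.isdigit():
        return int(spec)  # already numeric, use as given
    if spec.startswith("SIG"):
        spec = spec[3:]  # accept both 'SIGTERM' and 'TERM'
    base, plus, offset = spec.partition("+")
    number = int(getattr(signal, "SIG" + base))
    if plus:
        number += int(offset)  # e.g. RTMIN+3 is three above SIGRTMIN
    return number
```

Calling resolve_signal("RTMIN+3") returns signal.SIGRTMIN + 3 for the running platform, which could then be passed as a plain number to --stop-signal on versions that reject the symbolic form.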
I've verified that with docker-1.10.2-9.git0f5ac89.fc23.x86_64 and Dockerfile

[root@dell-pe-fc630-01 ~]# cat httpd-systemd/Dockerfile
FROM fedora:23
RUN dnf -y install httpd && dnf clean all
RUN systemctl enable httpd.service
RUN echo "Test Server" > /var/www/html/index.html
EXPOSE 80
ENV container docker
VOLUME [ "/tmp", "/run" ]
CMD [ "/usr/sbin/init" ]

running

docker run -t --stop-signal=RTMIN+3 -v /sys/fs/cgroup:/sys/fs/cgroup:ro --security-opt seccomp:unconfined --rm --name httpd httpd-systemd

will set .Config.StopSignal in the container to RTMIN+3 and docker stop will actually initiate the systemd shutdown right away; it gets evaluated and it works.

On the other hand, using STOPSIGNAL in the Dockerfile does not seem to work. It will set the ContainerConfig.StopSignal and Config.StopSignal to the value provided (either RTMIN+3 or 37) but when the container is run, its Config.StopSignal is set to SIGTERM. Should I file a separate bugzilla for that?

What command is atomic running? This does not make much sense.

Comment 38 and comment 39 were with no atomic -- just plain docker 1.10.

Sorry, misread. I did not know there was a STOPSIGNAL.

(In reply to Daniel Walsh from comment #42)
> Sorry misread. I did not know there was a STOPSIGNAL

It's noted at https://docs.docker.com/engine/reference/builder/#stopsignal

What does docker inspect show of the image?

Both ContainerConfig.StopSignal and Config.StopSignal are set to the value of STOPSIGNAL, in the image. However, the value does not get from the image to the container's Config.StopSignal.

Ok, and if you specified SIGKILL, same thing? Probably just not propagating from image to container.

It seems to have been fixed in a252516ec19c9c83055a882da894712f2e812ecc via https://github.com/docker/docker/issues/19300 and https://github.com/docker/docker/pull/20290. If I understand the git history of docker correctly, it will only be fixed in 1.11. Could we carry that patch in 1.10 builds?
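The image-versus-container comparison described above can be checked mechanically from docker inspect output. The sketch below assumes the usual JSON array that docker inspect prints and reads the Config.StopSignal field named in the comments; the function name and the sample documents in the usage note are made up for illustration and are not real output from this bug.

```python
import json


def stop_signal_from_inspect(inspect_json):
    """Return Config.StopSignal from `docker inspect` JSON, or None if unset.

    Hypothetical helper: expects the JSON array printed by `docker inspect`
    for a single image or container.
    """
    entries = json.loads(inspect_json)
    config = entries[0].get("Config") or {}
    return config.get("StopSignal")
```

For example, feeding it an invented image document such as '[{"Config": {"StopSignal": "RTMIN+3"}}]' and an invented container document such as '[{"Config": {"StopSignal": "SIGTERM"}}]' would return different values, which is the mismatch reported in this comment.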
runcom, can you see about getting this backported into docker-1.10?

Sure thing, on it.

Changing component to 'docker-latest' because 1.10.

# rpm -q docker-latest
docker-latest-1.10.3-19.el7.x86_64
1. open a terminal
# docker-latest run -t --stop-signal=RTMIN+3 -v /sys/fs/cgroup:/sys/fs/cgroup:ro --security-opt seccomp:unconfined --rm --name httpd httpd:v1
systemd 222 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
Detected virtualization docker.
Detected architecture x86-64.
Running with unpopulated /etc.
Welcome to Fedora 23 (Twenty Three)!
Set hostname to <ae815f86379b>.
Initializing machine ID from random generator.
<ignore/>
2. open second terminal
# docker-latest ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ae815f86379b httpd:v1 "/usr/sbin/init" About a minute ago Up 59 seconds 80/tcp httpd
# time docker-latest stop ae815f86379b
ae815f86379b
real 0m1.599s
user 0m0.024s
sys 0m0.008s
3. Return to the first terminal and check the output.
<slice>
Stopping First Boot Wizard...
[ OK ] Reached target Shutdown.
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Halting system.
Exiting container.
</slice>
# docker-latest ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
384243e04d06 httpd:v1 "/sbin/init" 13 minutes ago Exited (0) 8 minutes ago goofy_mirzakhani
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1057.html |
Description of problem:
Using the rhel7.1 image with systemd (well, systemd-container) based services like

FROM rhel7.1
RUN yum -y install httpd && yum clean all
RUN systemctl enable httpd.service
RUN echo "Test Server" > /var/www/html/index.html
EXPOSE 80
ENV container docker
CMD [ "/usr/sbin/init" ]

I would expect the services to be properly stopped when docker stop is run, not killed abruptly.

However, docker stop sends SIGTERM, waits 10 seconds, then sends SIGKILL. And systemd(-container) upon receiving SIGTERM merely reexecutes. This matches the description in man systemd(1):

SIGTERM
Upon receiving this signal the systemd system manager serializes its state, reexecutes itself and deserializes the saved state again. This is mostly equivalent to systemctl daemon-reexec.

systemd user managers will start the exit.target unit when this signal is received. This is mostly equivalent to systemctl --user start exit.target.

Version-Release number of selected component (if applicable):
Image rhel7.1
docker-1.5.0-16.el7.x86_64

How reproducible:
Deterministic.

Steps to Reproduce:
1. Have Dockerfile as shown above.
2. Build, run a container.
3. Verify that curl against that container's IP address works.
4. Run time docker stop <the-container-id> &
5. Continue running the curl commands.

Actual results:
The time command reports that the docker stop finishes after 10+ seconds and during those 10 seconds, curl commands work because the Apache in the container is still running.

Expected results:
Apache being stopped right away, docker stop returning sooner because it does not need to wait those 10 seconds to send the SIGKILL.

Additional info:
You can also do docker exec <the-container-id> journalctl to see that systemd was reexecuted upon receiving the SIGTERM:

Mar 11 11:11:50 d72c3ff96c5d systemd[1]: Reexecuting.
Mar 11 11:11:50 d72c3ff96c5d systemd[1]: systemd 208 running in system mode. (+PAM -LIBWRAP -AUDIT +SELINUX -IMA +SYSVINIT -LIBCRYPTSETUP -GCRYPT -ACL -XZ)
Mar 11 11:11:50 d72c3ff96c5d systemd[1]: Detected virtualization 'docker'.
Mar 11 11:11:50 d72c3ff96c5d systemd[1]: Failed to open private bus connection: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory

Either systemd-container needs to be changed to run the proper target (exit.target?) for SIGTERM, or docker stop needs to be changed to do something other than sending SIGTERM, which does not work, or /usr/sbin/init in the container needs to be changed to run systemd as a user manager (if that is possible).

Filing against systemd-container but please feel free to move to a different component.
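The stop sequence described in this report (send a signal, wait up to 10 seconds, then send SIGKILL) can be modelled against an ordinary child process. This is a sketch of the general pattern only, not docker's actual implementation; stop_process and its parameters are hypothetical names, and the 10 second default simply mirrors the timeout quoted above.

```python
import signal
import subprocess


def stop_process(proc, stop_signal=signal.SIGTERM, grace_seconds=10.0):
    """Ask `proc` to stop, then force-kill it if it outlives the grace period.

    `proc` is a subprocess.Popen object; returns its exit status.
    """
    proc.send_signal(stop_signal)
    try:
        # Graceful path: the process reacts to the stop signal in time.
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # Forced path: the signal did not end the process, so use SIGKILL.
        proc.kill()
        return proc.wait()
```

With stop_signal left at SIGTERM, a process that merely reexecutes on SIGTERM, as systemd does here, would always end up on the forced path after the full grace period. Passing signal.SIGRTMIN + 3 instead corresponds to the --stop-signal approach that eventually resolved this bug.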