| Summary: | Docker command crashes after journald is stopped | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jhon Honce <jhonce> | |
| Component: | docker | Assignee: | Daniel Walsh <dwalsh> | |
| Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> | |
| Severity: | high | Docs Contact: | Yoana Ruseva <yruseva> | |
| Priority: | high | |||
| Version: | 7.2 | CC: | agoldste, ajia, amurdaca, cww, dornelas, dwalsh, harald, jamills, jhonce, jkrieger, jyundt, lfriedma, lnykryn, lpoetter, lsm5, lsu, mmcgrath, mpatel, nalin, pweil, rrajaram, systemd-maint, vbatts, walters | |
| Target Milestone: | rc | Keywords: | Extras | |
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | docker-1.9.1-17.el7 | Doc Type: | Bug Fix | |
| Doc Text: |
Cause:
Docker constantly writes to stdout and stderr, and its unit file directs that output to systemd-journald, which writes it to the journal. If the journal is restarted, the docker daemon receives a SIGPIPE.
Consequence:
When the docker daemon receives a SIGPIPE, it shuts down, stopping all containers.
Fix:
We have added a helper program that captures stdout and stderr from docker and forwards the content to the journal. This helper program handles SIGPIPE better than the docker daemon does.
Result:
If systemd-journald is restarted, the docker daemon no longer shuts down and containers keep running.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1318361 (view as bug list) | Environment: | ||
| Last Closed: | 2016-03-31 23:23:18 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1186913, 1203710 | |||
Description
Jhon Honce
2016-01-19 22:06:19 UTC
(In reply to Jhon Honce from comment #0)
> How reproducible:
> always
>
> Steps to Reproduce:
> 1. systemctl stop systemd-journald
> 2. systemctl stop systemd-journald.socket
> 3. Repeat docker ps until crash is seen

It is very weird; I can't reproduce it per the above steps on docker-1.9.1-11.el7.x86_64. Here are some details:

# for i in {1..1000}; do systemctl stop systemd-journald; systemctl stop systemd-journald.socket; docker ps; done

After waiting some time and running docker ps again, the docker daemon is still running without crashing. Jhon, any important steps I missed?

(In reply to Alex Jia from comment #3)
> And wait for some time, to run docker ps again, the docker daemon is running
> w/o crashing.
>
> Jhon, any important steps I missed?

Well, sometimes I can reproduce the bug after running the for loop in a shell and waiting for several minutes. I will try to find a simple approach to reproduce it every time.

Any reason for this being a real-world problem? I am figuring something is being written to syslog and failing because the process listening on /dev/log goes away. But is this something worth even looking to fix?

Alex, can you try with a process constantly pinging docker at the same time? I believe the issue must be related to docker communicating with the journal, and I believe the only way this happens is via syslog. Making the docker daemon active while doing your systemctl restart of journald could cause it. Do you have any logs from docker when it crashes?

(In reply to Daniel Walsh from comment #6)
> alex, can you try with a process constantly pinging docker at the same time.
> I believe the issue must be related to docker communicating with the
> journal, and I believe the only way this happens is via syslog.

I can always reproduce the bug based on the following steps:

[cloud-user@atomic-00 ~]$ sudo systemctl stop systemd-journald.socket
[cloud-user@atomic-00 ~]$ for i in {1..200}; do sudo docker ps & done

> Making the docker daemon active, and doing your systemctl restart journald
> could cause it.
>
> Do you have any logs from docker when it crashes?

But I can't find any useful information from systemctl status docker.

Do you have journald set up as your default logger?
systemctl status docker | grep journald
└─1038 /usr/bin/docker daemon --selinux-enabled --exec-opt native.cgroupdriver=systemd --log-driver=journald --storage-opt dm.basesize=
Setting the following in /etc/systemd/system/systemd-journald.service.d/local.conf:

[Service]
WatchdogSec=10min

could provide a workaround for docker being signaled during log rollovers on slow storage devices. The 10min would be tuned for the size of the logs vs. the speed of the storage devices. This does nothing for the root issue of the docker daemon crashing when journald is shut down or crashes.

Yes, I added the following to my setup:

# grep OPTIONS /etc/sysconfig/docker
OPTIONS='--selinux-enabled --log-driver=journald'

This looks like you are not using the new journald back end for docker. I guess for some reason systemd is killing the docker daemon? I added the systemd maintainers to the list to see if they know why this is happening. If the journald service is killed, does systemd take down individual services?

Looks like we are hitting this problem: https://bugs.freedesktop.org/show_bug.cgi?id=84923

So, bottom line: are we blaming this on journald or on golang? If journald is restarted, is systemd supposed to reconnect the stdout/stderr sockets of services in unit files, or does all of the output just go to /dev/null with the services ignoring SIGPIPE? If we can get golang to ignore SIGPIPE, is that good enough?

golang shouldn't rely on SIGPIPE, I would say... except then I found: https://github.com/golang/go/blob/master/src/os/signal/doc.go

Basically, it's documented that writing to fd 0, 1, or 2 may raise SIGPIPE. I guess it's trying to be an in-between of default SIGPIPE (broken for pretty much any nontrivial program) and masked SIGPIPE (your app needs to properly propagate errors immediately to exit). And it looks like the journal is fd 1. So, since we're not using the docker daemon in a Unix shell command pipeline, let's change Docker to just entirely mask SIGPIPE.

From go-1.5.1/src/os/file_unix.go:66, Go is hard coded to catch 10 EPIPEs/SIGPIPEs on a fd and then die. Note the use of the private sigpipe() function, which bypasses the signal.Notify() mechanism.
func epipecheck(file *File, e error) {
if e == syscall.EPIPE {
if atomic.AddInt32(&file.nepipe, 1) >= 10 {
sigpipe()
}
} else {
atomic.StoreInt32(&file.nepipe, 0)
}
}
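The discussion above lands on masking SIGPIPE in the daemon entirely. Below is a minimal stand-alone sketch of that idea, not Docker's actual patch; `brokenPipeWrite` is an illustrative helper. Note that Go by default only raises SIGPIPE for writes to fds 1 and 2, and `signal.Ignore` suppresses even those, so every broken-pipe write surfaces as an EPIPE error instead of a fatal signal:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// brokenPipeWrite closes the read end of a fresh pipe and then writes to
// the write end. With SIGPIPE masked, the write fails with EPIPE while the
// process survives; it reports whether EPIPE was observed.
func brokenPipeWrite() bool {
	r, w, err := os.Pipe()
	if err != nil {
		return false
	}
	r.Close() // simulate journald going away: the reader disappears
	defer w.Close()
	_, err = w.Write([]byte("log line\n"))
	return errors.Is(err, syscall.EPIPE)
}

func main() {
	// Mask SIGPIPE process-wide, as suggested for the docker daemon:
	// broken-pipe writes then surface as EPIPE errors, never as a signal.
	signal.Ignore(syscall.SIGPIPE)
	fmt.Println(brokenPipeWrite())
}
```

On Linux this prints `true`: the write to the readerless pipe returns EPIPE and the process keeps running, which is exactly the behavior the daemon needs when journald restarts underneath it.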
Vincent, any thoughts on this? We should probably bring this up for discussion with docker upstream, to tell them about this problem. Or should we change docker to write all its messages to a log driver rather than to stdout/stderr? (Not sure if it writes to stderr.)

Moving this bug to golang, since this does not seem easily fixable with this SIGPIPE handling. I have no idea why they are doing this; it would be nice to know some history here.

Docker has had this issue a couple of times. Interesting links:

https://github.com/docker/docker/issues/19728
https://github.com/docker/docker/issues/7087
https://github.com/golang/go/issues/11845

(In reply to Daniel Walsh from comment #35)
> Or should we change docker to write all its messages to a log driver rather
> than to stdout/stderr (Not sure if it writes to stderr).

By default, messages are logged to stderr. See:

- https://github.com/fedora-cloud/docker-rpm/pull/2
- https://github.com/projectatomic/forward-journald

With docker-1.9.1-23.el7.x86_64 and golang-1.4.2-9.el7.x86_64, after killing systemd-journald the docker daemon is still alive. Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0536.html

*** Bug 1300056 has been marked as a duplicate of this bug. ***