Bug 1588773
| Summary: | Active controller-manager hung because container stdout/stderr is blocked |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Containers |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Version: | 3.10.0 |
| Target Release: | 3.10.0 |
| Reporter: | Derek Carr <decarr> |
| Assignee: | Mrunal Patel <mpatel> |
| QA Contact: | yasun |
| CC: | aos-bugs, ccoleman, eparis, jokerman, jupierce, mmccomas, mpatel, scuppett |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Clones: | 1589939 (view as bug list) |
| Bug Blocks: | 1589939 |
| Last Closed: | 2018-12-20 21:37:20 UTC |
Description: Derek Carr, 2018-06-07 20:12:43 UTC
/cc Clayton

At this point it looks like we have one goroutine which is holding the glog mutex and which is inside the write() syscall. While I haven't looked at the code, I'm guessing it is calling write() on the /dev/stdout pipe. If it has hung for 171 minutes, that likely means the pipe is full. For the pipe to be full, the other side must either be (a) not draining the pipe at all, or (b) draining it so slowly that nothing appears to be making progress.

If this happens again, we can get some information about what the kernel is up to by running the following on the host:

    echo d > /proc/sysrq-trigger
    echo t > /proc/sysrq-trigger

The output should land in dmesg (and hopefully the log/journal). If sysrq is disabled, you can run

    echo "1" > /proc/sys/kernel/sysrq

and then try again.

The thing on the other end of the pipe should be docker, and it appears to be the JSON log file. I have no idea how to find out whether docker is doing its job on that end, or why it might be hung. The containerd-shim process should be reading from the other end. Can we get an strace of it?

    ps -ef | grep $container_id | grep docker-containerd-shim

Created attachment 1449418 [details]
debugging showing that docker is not draining stderr (docker-shim is sending the data)
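For anyone trying to reproduce this: the blocking behavior described above is easy to demonstrate outside of docker. The following is a minimal Go sketch (illustrative only, not code from this bug) that fills a pipe without draining it; once the kernel's default 64 KiB pipe buffer is full, the next write blocks indefinitely, which is the state the glog goroutine was stuck in.

```go
// pipefill.go: illustrative sketch -- fills a pipe that nothing reads.
// Once the kernel's pipe buffer (64 KiB by default on Linux) is full,
// the next Write blocks forever, mirroring the hung write() syscall
// described in this bug.
package main

import (
	"fmt"
	"os"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close() // keep the read end open, but never read from it

	buf := make([]byte, 4096)
	total := 0
	for {
		n, err := w.Write(buf) // blocks for good once ~64 KiB is queued
		if err != nil {
			panic(err)
		}
		total += n
		// report progress on stderr so the demo's own output is not
		// what fills up
		fmt.Fprintf(os.Stderr, "wrote %d bytes so far\n", total)
	}
}
```

Run it and the progress output stops at 65536 bytes; the process then sits in an uninterruptible write, just like the stuck controller-manager and etcd.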
I found that one of the etcd instances was having major trouble. It would respond to some requests, but for most it returned "context deadline exceeded". I looked at etcd; it had been trying to write to stderr for 29 hours. I looked at docker-shim; it was not doing much of anything, but it had been trying to write to the init-stderr pipe for 29 hours. I looked at docker; it was not reading from the init-stderr pipe at all. I wrote a tiny program to see how full the pipes were: init-stderr had 64k of crap in it, with no room for more messages. So etcd is locked up waiting on dockerd-current.

    # rpm -qa \*container\* \*docker\* | sort
    atomic-openshift-docker-excluder-3.10.0-0.54.0.git.0.00a8b84.el7.noarch
    container-selinux-2.21-2.gitba103ac.el7.noarch
    container-storage-setup-0.7.0-1.git4ca59c5.el7.noarch
    docker-1.13.1-53.git774336d.el7.x86_64
    docker-client-1.13.1-53.git774336d.el7.x86_64
    docker-common-1.13.1-53.git774336d.el7.x86_64
    docker-rhel-push-plugin-1.13.1-53.git774336d.el7.x86_64
    python-docker-2.4.2-1.3.el7.noarch
    python-docker-pycreds-1.10.6-1.el7.noarch
    skopeo-containers-0.1.28-1.git0270e56.el7.x86_64
    subscription-manager-plugin-container-1.19.21-1.el7.x86_64

Eric, can we get a stack trace of the docker daemon? `kill -SIGUSR1 <docker_pid>` should do it. There might also be something in the docker logs, so it is worth grabbing those as well.

Antonio and I looked at some commits and think we could be hitting https://github.com/moby/moby/pull/33897. We are backporting it to our docker.

Created attachment 1449436 [details]
Docker goroutine stacks
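The "tiny program" used to measure pipe fullness was not attached to this bug. A plausible reconstruction is a FIONREAD ioctl against the pipe fd, reached via /proc/<pid>/fd/<n>; the Go sketch below is an assumption about what such a tool might look like, not the actual program used here.

```go
// pipefull.go: hypothetical reconstruction of the pipe-fullness probe
// mentioned above (the original tool was not attached to this bug).
// Usage: pipefull /proc/<pid>/fd/<n>
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: pipefull /proc/<pid>/fd/<n>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// FIONREAD reports how many unread bytes sit in the pipe buffer.
	// A value of 65536 means a default-sized pipe is completely full
	// and any further write() on it will block.
	n, err := unix.IoctlGetInt(int(f.Fd()), unix.FIONREAD)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s: %d bytes queued\n", os.Args[1], n)
}
```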
https://github.com/openshift/origin/pull/19952 adds a log healthz test. https://github.com/openshift/origin/pull/19953 sets the DialTimeout to help the apiserver deal with an unresponsive etcd better; that issue is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1589342.

The controller logging health checks are merged. Moving this to the container team to resolve the stdout/stderr issue.

Confirmed with Ops that k8s_controllers_master-controllers is running well in starter-ca-central-1.

*** Bug 1588692 has been marked as a duplicate of this bug. ***
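The actual implementation in PR 19952 is not reproduced here. As a rough illustration of what a "log healthz" check can look like, the Go sketch below probes whether a log write to stdout completes within a deadline; the endpoint path, timeout, and overall structure are invented for illustration and differ from the openshift/origin code.

```go
// loghealth.go: rough sketch of a "can we still write logs?" health
// check, in the spirit of the healthz test referenced above. Endpoint
// path and timeout are assumptions, not the real implementation.
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)

func logHealthz(w http.ResponseWriter, r *http.Request) {
	done := make(chan struct{})
	go func() {
		// If the stdout pipe is full, this write blocks and done is
		// never closed, so the probe times out below. (Each blocked
		// probe leaks a goroutine -- acceptable for a sketch.)
		fmt.Fprintln(os.Stdout, "healthz log probe")
		close(done)
	}()
	select {
	case <-done:
		w.Write([]byte("ok"))
	case <-time.After(5 * time.Second):
		http.Error(w, "log write blocked", http.StatusInternalServerError)
	}
}

func main() {
	http.HandleFunc("/healthz/log", logHealthz)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A check like this turns the silent hang described in this bug into a failing health probe, so the platform can restart or reschedule the component instead of leaving it wedged for hours.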