Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1588773

Summary: Active controller-manager hung because container stdout/stderr is blocked
Product: OpenShift Container Platform
Reporter: Derek Carr <decarr>
Component: Containers
Assignee: Mrunal Patel <mpatel>
Status: CLOSED CURRENTRELEASE
QA Contact: yasun
Severity: high
Docs Contact:
Priority: high
Version: 3.10.0
CC: aos-bugs, ccoleman, eparis, jokerman, jupierce, mmccomas, mpatel, scuppett
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1589939 (view as bug list)
Environment:
Last Closed: 2018-12-20 21:37:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1589939    
Attachments:
  debugging showing that docker is not draining stderr (docker-shim is sending the data) (flags: none)
  Docker goroutine stacks (flags: none)

Description Derek Carr 2018-06-07 20:12:43 UTC
Description of problem:
The kube controller manager was hung and did not release the leader lease.

Version-Release number of selected component (if applicable):


How reproducible:
Happened on ca-central-1; the controller manager was hung for ~2 hours.

The actual steps to reproduce are unclear.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 Michal Fojtik 2018-06-08 09:37:22 UTC
/cc Clayton

Comment 12 Eric Paris 2018-06-08 20:01:30 UTC
At this point it looks like we have one goroutine which is holding the glog mutex and which is inside the write() syscall. While I haven't looked at the code, I'm guessing it is calling write() on the /dev/stdout pipe. If it has hung for 171 minutes, this likely means the pipe is full. For the pipe to be full, the other side must either be (a) not draining the pipe at all, or (b) draining it so slowly that nothing appears to be making progress.

If this happens again we can get some info about what the kernel is up to by running:

echo d > /proc/sysrq-trigger
echo t > /proc/sysrq-trigger 
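# ('d' dumps the locks currently held; 't' dumps the state and stack trace of every task)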

on the host. The output should be in dmesg (and hopefully in the log/journal). If sysrq is disabled, you can run

echo "1" > /proc/sys/kernel/sysrq

Then try again.


The thing on the other end of the pipe should be docker, and it appears to be the json log file. I have no idea how to get info about whether docker is doing its job on that end, or why it might be hung...
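[Editorial note] To make the failure mode concrete, here is a minimal sketch (not part of the bug, written for illustration) of how a writer hits the default 64 KiB Linux pipe buffer when nothing drains the read end. The write end is made non-blocking so the program reports the limit instead of hanging the way the blocked logger did.

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Create a pipe; nothing ever reads from fds[0], just like a stalled
	// log consumer on the other side of a container's stdout/stderr.
	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		panic(err)
	}

	// Make the write end non-blocking so a full buffer returns EAGAIN
	// instead of blocking the way the goroutine holding the glog mutex did.
	if err := syscall.SetNonblock(fds[1], true); err != nil {
		panic(err)
	}

	buf := make([]byte, 1024)
	total := 0
	for {
		n, err := syscall.Write(fds[1], buf)
		if n > 0 {
			total += n
		}
		if err != nil {
			// On Linux this typically reports 65536 bytes (64 KiB), the
			// default pipe buffer size, before EAGAIN is returned.
			fmt.Printf("pipe full after %d bytes: %v\n", total, err)
			return
		}
	}
}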

Comment 14 Mrunal Patel 2018-06-08 20:08:21 UTC
The containerd-shim process should be reading from the other end. Can we get an strace of it? 

ps -ef | grep $container_id | grep docker-containerd-shim

Comment 19 Eric Paris 2018-06-09 17:08:16 UTC
Created attachment 1449418 [details]
debugging showing that docker is not draining stderr (docker-shim is sending the data)

Comment 20 Eric Paris 2018-06-09 17:11:01 UTC
I found that one of the etcds was having major trouble. It would respond to some requests, but for most of them it returned context deadline exceeded. Looking at etcd, it had been trying to write to stderr for 29 hours. Looking at docker-shim, it was not doing much of anything, but it had been trying to write to the init-stderr pipe for 29 hours. Looking at docker, it was not reading from the init-stderr pipe at all.

Wrote a tiny program to see how full the pipes were; init-stderr had 64k of crap in it, with no room left for more messages.
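[Editorial note] The tiny program itself is not attached to the bug; the following is a hypothetical reconstruction (the file layout, argument handling, and names are mine) of how such a check can be done on Linux: reopen the pipe through /proc/<pid>/fd/<fd> and ask the kernel how many unread bytes it holds via the FIONREAD ioctl.

// pipefill.go - run as: pipefill <pid> <fd>  (needs permission to open /proc/<pid>/fd)
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintf(os.Stderr, "usage: %s <pid> <fd>\n", os.Args[0])
		os.Exit(1)
	}
	// Reopening /proc/<pid>/fd/<fd> gives us our own handle on the same pipe.
	path := fmt.Sprintf("/proc/%s/fd/%s", os.Args[1], os.Args[2])
	f, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	// FIONREAD reports the number of bytes buffered in the pipe and not yet
	// read. A default Linux pipe holds 64 KiB, matching the 64k seen above.
	var pending int32
	_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(),
		uintptr(syscall.FIONREAD), uintptr(unsafe.Pointer(&pending)))
	if errno != 0 {
		fmt.Fprintln(os.Stderr, "ioctl(FIONREAD):", errno)
		os.Exit(1)
	}
	fmt.Printf("%s: %d bytes buffered\n", path, pending)
}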

So etcd is locked up waiting on dockerd-current.

# rpm -qa \*container\* \*docker\* | sort
atomic-openshift-docker-excluder-3.10.0-0.54.0.git.0.00a8b84.el7.noarch
container-selinux-2.21-2.gitba103ac.el7.noarch
container-storage-setup-0.7.0-1.git4ca59c5.el7.noarch
docker-1.13.1-53.git774336d.el7.x86_64
docker-client-1.13.1-53.git774336d.el7.x86_64
docker-common-1.13.1-53.git774336d.el7.x86_64
docker-rhel-push-plugin-1.13.1-53.git774336d.el7.x86_64
python-docker-2.4.2-1.3.el7.noarch
python-docker-pycreds-1.10.6-1.el7.noarch
skopeo-containers-0.1.28-1.git0270e56.el7.x86_64
subscription-manager-plugin-container-1.19.21-1.el7.x86_64

Comment 21 Mrunal Patel 2018-06-09 17:42:28 UTC
Eric, can we get a stack trace of the docker daemon? 
kill -SIGUSR1 <docker_pid> should do it. Also, there might be something in the docker logs, so it's worth getting those as well.

Comment 22 Mrunal Patel 2018-06-09 17:53:48 UTC
Antonio and I looked at some commits and think that we could be hitting https://github.com/moby/moby/pull/33897. We are backporting it to our docker.

Comment 23 Eric Paris 2018-06-09 19:11:49 UTC
Created attachment 1449436 [details]
Docker goroutine stacks

Comment 25 Jordan Liggitt 2018-06-09 23:46:23 UTC
https://github.com/openshift/origin/pull/19952 adds a log healthz test
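[Editorial note] A check along these lines can be sketched as follows; this is an illustrative approximation, not the code from the PR, and the function name and timeout are invented: attempt a log write in a goroutine and fail the health check if it does not complete in time, which is exactly what happens when the stdout/stderr pipe is full.

package main

import (
	"fmt"
	"os"
	"time"
)

// checkLogWrite returns an error if a write to stderr does not complete
// within the given timeout. If the write never completes, the goroutine is
// leaked; acceptable for a sketch, since the process is about to be
// reported unhealthy anyway.
func checkLogWrite(timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		_, err := fmt.Fprintln(os.Stderr, "healthz: verifying log writes are not blocked")
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("log write blocked for more than %v", timeout)
	}
}

func main() {
	if err := checkLogWrite(5 * time.Second); err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	fmt.Println("ok")
}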

Comment 26 Jordan Liggitt 2018-06-10 01:05:48 UTC
https://github.com/openshift/origin/pull/19953 sets the DialTimeout to help the apiserver deal with an unresponsive etcd better. tracking that issue in https://bugzilla.redhat.com/show_bug.cgi?id=1589342
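[Editorial note] As an illustration of the kind of change involved (not the actual origin patch; the endpoint and timeout values below are placeholders), configuring the etcd v3 client with a bounded DialTimeout so calls fail fast against an unresponsive etcd looks roughly like this:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd.example.com:2379"}, // placeholder endpoint
		DialTimeout: 10 * time.Second,                          // bound connection setup
	})
	if err != nil {
		fmt.Println("dial etcd:", err)
		return
	}
	defer cli.Close()

	// DialTimeout only bounds connection setup; individual requests still
	// need their own context timeouts.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "health"); err != nil {
		fmt.Println("get:", err)
	}
}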

Comment 28 Jordan Liggitt 2018-06-14 14:39:24 UTC
Controller logging health checks are merged.

Moving to the container team to resolve the stdout/stderr issue.

Comment 31 yasun 2018-06-21 07:44:10 UTC
Confirmed with Ops that k8s_controllers_master-controllers is running well in starter-ca-central-1.

Comment 32 Derek Carr 2018-06-22 15:07:19 UTC
*** Bug 1588692 has been marked as a duplicate of this bug. ***