Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1588773

Summary: Active controller-manager hung because container stdout/stderr is blocked
Product: OpenShift Container Platform
Reporter: Derek Carr <decarr>
Component: Containers
Assignee: Mrunal Patel <mpatel>
Status: CLOSED CURRENTRELEASE
QA Contact: yasun
Severity: high
Docs Contact:
Priority: high
Version: 3.10.0
CC: aos-bugs, ccoleman, eparis, jokerman, jupierce, mmccomas, mpatel, scuppett
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1589939 (view as bug list)
Environment:
Last Closed: 2018-12-20 21:37:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1589939    
Attachments:
  debugging showing that docker is not draining stderr (docker-shim is sending the data) (flags: none)
  Docker goroutine stacks (flags: none)

Description Derek Carr 2018-06-07 20:12:43 UTC
Description of problem:
The kube controller manager was hung and did not release the leader lease.

Version-Release number of selected component (if applicable):


How reproducible:
Happened on ca-central-1; the controller manager was hung for ~2 hours.

The actual steps to reproduce are unclear.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 Michal Fojtik 2018-06-08 09:37:22 UTC
/cc Clayton

Comment 12 Eric Paris 2018-06-08 20:01:30 UTC
At this point it looks like we have one goroutine which is holding the glog mutex and which is inside the write() syscall. While I haven't looked at the code, I'm guessing it is calling write() on the /dev/stdout pipe. If it has hung for 171 minutes, this likely means the pipe is full. For the pipe to be full, the other side must either be (a) not draining the pipe at all, or (b) draining it so slowly that nothing appears to be making progress.

If this happens again we can get some info about what the kernel is up to by running:

echo d > /proc/sysrq-trigger
echo t > /proc/sysrq-trigger 
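# ('d' dumps the locks currently held; 't' dumps the state and stack trace of every task)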

on the host. The output should be in dmesg (and hopefully in the log/journal). If sysrq is disabled, you can run

echo "1" > /proc/sys/kernel/sysrq

Then try again.


The thing on the other end of the pipe should be docker, and it appears to be the json log file. I have no idea how to get info about whether docker is doing its job on that end, or why it might be hung...
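[Editorial note] To make the failure mode concrete, here is a minimal sketch (not part of the bug, written for illustration) of how a writer hits the default 64 KiB Linux pipe buffer when nothing drains the read end. The write end is made non-blocking so the program reports the limit instead of hanging the way the blocked logger did.

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Create a pipe; nothing ever reads from fds[0], just like a stalled
	// log consumer on the other side of a container's stdout/stderr.
	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		panic(err)
	}

	// Make the write end non-blocking so a full buffer returns EAGAIN
	// instead of blocking the way the goroutine holding the glog mutex did.
	if err := syscall.SetNonblock(fds[1], true); err != nil {
		panic(err)
	}

	buf := make([]byte, 1024)
	total := 0
	for {
		n, err := syscall.Write(fds[1], buf)
		if n > 0 {
			total += n
		}
		if err != nil {
			// On Linux this typically reports 65536 bytes (64 KiB), the
			// default pipe buffer size, before EAGAIN is returned.
			fmt.Printf("pipe full after %d bytes: %v\n", total, err)
			return
		}
	}
}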

Comment 14 Mrunal Patel 2018-06-08 20:08:21 UTC
The containerd-shim process should be reading from the other end. Can we get an strace of it? 

ps -ef | grep $container_id | grep docker-containerd-shim

Comment 19 Eric Paris 2018-06-09 17:08:16 UTC
Created attachment 1449418 [details]
debugging showing that docker is not draining stderr (docker-shim is sending the data)

Comment 20 Eric Paris 2018-06-09 17:11:01 UTC
I found that one of the etcds was having major trouble. It would respond to some requests, but for most of them it returned context deadline exceeded. Looking at etcd, it had been trying to write to stderr for 29 hours. Looking at docker-shim, it was not doing much of anything, but it had been trying to write to the init-stderr pipe for 29 hours. Looking at docker, it was not reading from the init-stderr pipe at all.

Wrote a tiny program to see how full the pipes were; init-stderr had 64k of crap in it, with no room left for more messages.
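[Editorial note] The tiny program itself is not attached to the bug; the following is a hypothetical reconstruction (the file layout, argument handling, and names are mine) of how such a check can be done on Linux: reopen the pipe through /proc/<pid>/fd/<fd> and ask the kernel how many unread bytes it holds via the FIONREAD ioctl.

// pipefill.go - run as: pipefill <pid> <fd>  (needs permission to open /proc/<pid>/fd)
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintf(os.Stderr, "usage: %s <pid> <fd>\n", os.Args[0])
		os.Exit(1)
	}
	// Reopening /proc/<pid>/fd/<fd> gives us our own handle on the same pipe.
	path := fmt.Sprintf("/proc/%s/fd/%s", os.Args[1], os.Args[2])
	f, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	// FIONREAD reports the number of bytes buffered in the pipe and not yet
	// read. A default Linux pipe holds 64 KiB, matching the 64k seen above.
	var pending int32
	_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(),
		uintptr(syscall.FIONREAD), uintptr(unsafe.Pointer(&pending)))
	if errno != 0 {
		fmt.Fprintln(os.Stderr, "ioctl(FIONREAD):", errno)
		os.Exit(1)
	}
	fmt.Printf("%s: %d bytes buffered\n", path, pending)
}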

So etcd is locked up waiting on dockerd-current.

# rpm -qa \*container\* \*docker\* | sort
atomic-openshift-docker-excluder-3.10.0-0.54.0.git.0.00a8b84.el7.noarch
container-selinux-2.21-2.gitba103ac.el7.noarch
container-storage-setup-0.7.0-1.git4ca59c5.el7.noarch
docker-1.13.1-53.git774336d.el7.x86_64
docker-client-1.13.1-53.git774336d.el7.x86_64
docker-common-1.13.1-53.git774336d.el7.x86_64
docker-rhel-push-plugin-1.13.1-53.git774336d.el7.x86_64
python-docker-2.4.2-1.3.el7.noarch
python-docker-pycreds-1.10.6-1.el7.noarch
skopeo-containers-0.1.28-1.git0270e56.el7.x86_64
subscription-manager-plugin-container-1.19.21-1.el7.x86_64

Comment 21 Mrunal Patel 2018-06-09 17:42:28 UTC
Eric, can we get a stack trace of the docker daemon? 
kill -SIGUSR1 <docker_pid> should do it. Also, there might be something in the docker logs, so it's worth getting those as well.

Comment 22 Mrunal Patel 2018-06-09 17:53:48 UTC
Antonio and I looked at some commits and think that we could be hitting https://github.com/moby/moby/pull/33897. We are backporting it to our docker.

Comment 23 Eric Paris 2018-06-09 19:11:49 UTC
Created attachment 1449436 [details]
Docker goroutine stacks

Comment 25 Jordan Liggitt 2018-06-09 23:46:23 UTC
https://github.com/openshift/origin/pull/19952 adds a log healthz test
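[Editorial note] A check along these lines can be sketched as follows; this is an illustrative approximation, not the code from the PR, and the function name and timeout are invented: attempt a log write in a goroutine and fail the health check if it does not complete in time, which is exactly what happens when the stdout/stderr pipe is full.

package main

import (
	"fmt"
	"os"
	"time"
)

// checkLogWrite returns an error if a write to stderr does not complete
// within the given timeout. If the write never completes, the goroutine is
// leaked; acceptable for a sketch, since the process is about to be
// reported unhealthy anyway.
func checkLogWrite(timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		_, err := fmt.Fprintln(os.Stderr, "healthz: verifying log writes are not blocked")
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("log write blocked for more than %v", timeout)
	}
}

func main() {
	if err := checkLogWrite(5 * time.Second); err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	fmt.Println("ok")
}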

Comment 26 Jordan Liggitt 2018-06-10 01:05:48 UTC
https://github.com/openshift/origin/pull/19953 sets the DialTimeout to help the apiserver deal with an unresponsive etcd better. tracking that issue in https://bugzilla.redhat.com/show_bug.cgi?id=1589342
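[Editorial note] As an illustration of the kind of change involved (not the actual origin patch; the endpoint and timeout values below are placeholders), configuring the etcd v3 client with a bounded DialTimeout so calls fail fast against an unresponsive etcd looks roughly like this:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd.example.com:2379"}, // placeholder endpoint
		DialTimeout: 10 * time.Second,                          // bound connection setup
	})
	if err != nil {
		fmt.Println("dial etcd:", err)
		return
	}
	defer cli.Close()

	// DialTimeout only bounds connection setup; individual requests still
	// need their own context timeouts.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "health"); err != nil {
		fmt.Println("get:", err)
	}
}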

Comment 28 Jordan Liggitt 2018-06-14 14:39:24 UTC
Controller logging health checks are merged.

Moving to the container team to resolve the stdout/stderr issue.

Comment 31 yasun 2018-06-21 07:44:10 UTC
Confirmed with Ops that k8s_controllers_master-controllers is running well in starter-ca-central-1.

Comment 32 Derek Carr 2018-06-22 15:07:19 UTC
*** Bug 1588692 has been marked as a duplicate of this bug. ***