Description of problem:

We are doing an upgrade from 3.5.x to another version of 3.5.x. During this, we have a pod that is stuck in "Terminating". This happens during a node drain operation.

--------------------------------------------------------------------------------
[root@xxxx ~]# oc get pod -n xxxxxx django-psql-persistent-1-74fj2
NAME                             READY     STATUS        RESTARTS   AGE
django-psql-persistent-1-74fj2   0/1       Terminating   12         3d
--------------------------------------------------------------------------------

On the node, this is the output of docker ps:

--------------------------------------------------------------------------------
[root@xxxxxxxx-compute-949c9 ~]# docker ps
CONTAINER ID   IMAGE                                                         COMMAND       CREATED        STATUS        PORTS   NAMES
1ad9094ae160   registry.ops.openshift.com/openshift3/logging-fluentd:v3.4   "sh run.sh"   25 hours ago   Up 25 hours           k8s_fluentd-elasticsearch.37ec2843_logging-fluentd-4070t_logging_a11ea457-26fc-11e7-a103-0e63b9c1c48f_b3fe7e96
5ea96c31a492   registry.ops.openshift.com/openshift3/ose-pod:v3.5.5.10      "/pod"        25 hours ago   Up 25 hours           k8s_POD.13c4e6bc_django-psql-persistent-1-74fj2_test-hungy_559472e9-31a0-11e7-af6d-0e63b9c1c48f_79706ad4
ef2296fd8075   registry.ops.openshift.com/openshift3/ose-pod:v3.5.5.10      "/pod"        25 hours ago   Up 25 hours           k8s_POD.13c4e6bc_logging-fluentd-4070t_logging_a11ea457-26fc-11e7-a103-0e63b9c1c48f_6eae74c3
--------------------------------------------------------------------------------

From "oc describe pod django-psql-persistent-1-74fj2", this error appears:

--------------------------------------------------------------------------------
1d  7s  6883  {kubelet ip-172-31-8-207.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: Error response from daemon: {"message":"devmapper: Unknown device 1f651f526b213c98550db3a320745f2413e62d418f3aaf590dd91212f5e77f1d"}
--------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):

oc v3.5.5.10
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.preview.openshift.com
openshift v3.5.5.10
kubernetes v1.5.2+43a9be4

How reproducible:
Not sure.

Actual results:
The pod is stuck in Terminating, and the upgrade in openshift-ansible hangs indefinitely and times out.

Expected results:
The pod should terminate and be moved off the node during the drain.

Additional info:
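For what it's worth, here is a diagnostic sketch (not from the original report) for finding which container processes on the node still hold copies of other containers' /shm mounts in their private mount namespaces; it assumes the default devicemapper/json-file setup shown above and needs to run as root on the node:

--------------------------------------------------------------------------------
# Compare each process's mount namespace against PID 1's; leaked /shm mounts
# only matter when they live on inside a container's private namespace.
host_mnt_ns=$(readlink /proc/1/ns/mnt)
for d in /proc/[0-9]*; do
    pid=${d#/proc/}
    [ "$(readlink "$d/ns/mnt" 2>/dev/null)" = "$host_mnt_ns" ] && continue
    shm=$(grep -o '/var/lib/docker/containers/[^ ]*/shm' "$d/mountinfo" 2>/dev/null | sort -u)
    [ -n "$shm" ] || continue
    echo "== PID $pid ($(cat "$d/comm" 2>/dev/null)) still holds:"
    echo "$shm"
done
--------------------------------------------------------------------------------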
The running fluentd pod is the cause of the other 4 pods being stuck in Terminating. Fluentd bind-mounts /var/lib/docker/containers into itself to get access to the json-log files. When the fluentd container is created, all of the mount points under /var/lib/docker/containers (i.e. all of the per-container /shm mount points) are also replicated inside the fluentd container's mount namespace. When docker then tries to tear down some other container, it fails because that container's /shm is still in use inside the fluentd container. The only way to recover is to delete the fluentd container.

There were other failures as well, for example:

May 10 20:39:32 ip-172-31-2-90 dockerd-current: time="2017-05-10T20:39:32.234460958Z" level=error msg="Handler for DELETE /v1.24/containers/212db8317613fd4d00fa4688c79ccb7cfd976fe4d9bacc1497800ad9a4b1b179 returned error: Driver devicemapper failed to remove root filesystem 212db8317613fd4d00fa4688c79ccb7cfd976fe4d9bacc1497800ad9a4b1b179: remove /var/lib/docker/devicemapper/mnt/6481c69096ad5e51061ee3e9e93d3bcd3a5c0aac087c2253ada5dfb726cd4b5c: device or resource busy"

But this has cleared itself up, and I can't say why. I filed https://bugzilla.redhat.com/show_bug.cgi?id=1450426 so that in the future we might get more info about those particular failures.
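To make that mechanism concrete, here is a minimal illustration on a plain docker host. The container names and the rhel7 image are placeholders, and whether removal actually fails depends on the storage driver and kernel, so treat this as a sketch rather than a guaranteed reproducer:

--------------------------------------------------------------------------------
# "holder" plays the role of fluentd: it bind-mounts the docker state
# directory, so it captures a copy of every existing container's /shm mount
# (including "victim"'s) in its own mount namespace at creation time.
docker run -d --name victim rhel7 sleep 3600
docker run -d --name holder -v /var/lib/docker:/var/lib/docker:ro rhel7 sleep 3600

# Tearing down "victim" can now fail with "device or resource busy" /
# "devmapper: Unknown device ..." because its /shm is still referenced
# from inside "holder".
docker rm -f victim

# Recovery: remove the holder first, then the victim can be cleaned up.
docker rm -f holder
docker rm -f victim
--------------------------------------------------------------------------------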
A potential workaround/solution to this problem might be for the fluentd pod to unmount, on startup, everything it has picked up under /var/lib/docker; basically:

umount /var/lib/docker/containers/*/shm
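A rough sketch of what that could look like, assuming it runs at the top of the fluentd image's run.sh and that the container is privileged enough to unmount (both are assumptions on my part; this is not a tested patch):

--------------------------------------------------------------------------------
# Release any per-container /shm mounts that were copied into this container's
# mount namespace when it was created, so docker can tear those containers
# down while fluentd keeps running.
for m in /var/lib/docker/containers/*/shm; do
    if mountpoint -q "$m"; then
        umount "$m" || echo "warning: could not unmount $m" >&2
    fi
done
--------------------------------------------------------------------------------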