Bug 1449277 - Pod stuck in terminating state
Summary: Pod stuck in terminating state
Keywords:
Status: CLOSED DUPLICATE of bug 1437952
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 3.x
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Jhon Honce
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-09 14:03 UTC by Matt Woodson
Modified: 2017-05-12 18:13 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1450461 (view as bug list)
Environment:
Last Closed: 2017-05-12 14:54:44 UTC
Target Upstream Version:
Embargoed:



Description Matt Woodson 2017-05-09 14:03:13 UTC
Description of problem:

We are upgrading from one 3.5.x release to another 3.5.x release. During the node drain step of this upgrade, one pod is stuck in "Terminating".

--------------------------------------------------------------------------------

[root@xxxx ~]# oc get pod -n xxxxxx django-psql-persistent-1-74fj2
NAME                             READY     STATUS        RESTARTS   AGE
django-psql-persistent-1-74fj2   0/1       Terminating   12         3d
--------------------------------------------------------------------------------
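The drain itself is driven by openshift-ansible; a roughly equivalent manual invocation (the node name below is a placeholder, and the flags are our assumption of what the playbook passes) hangs the same way while this pod never finishes terminating:
--------------------------------------------------------------------------------
# approximate manual equivalent of the drain performed during the upgrade
oc adm drain xxxxxxxx-compute-949c9 --force --delete-local-data --ignore-daemonsets
--------------------------------------------------------------------------------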


On the node, this is the output of docker ps:
--------------------------------------------------------------------------------
[root@xxxxxxxx-compute-949c9 ~]# docker ps
CONTAINER ID        IMAGE                                                           COMMAND                  CREATED             STATUS              PORTS               NAMES
1ad9094ae160        registry.ops.openshift.com/openshift3/logging-fluentd:v3.4      "sh run.sh"              25 hours ago        Up 25 hours                             k8s_fluentd-elasticsearch.37ec2843_logging-fluentd-4070t_logging_a11ea457-26fc-11e7-a103-0e63b9c1c48f_b3fe7e96
5ea96c31a492        registry.ops.openshift.com/openshift3/ose-pod:v3.5.5.10         "/pod"                   25 hours ago        Up 25 hours                             k8s_POD.13c4e6bc_django-psql-persistent-1-74fj2_test-hungy_559472e9-31a0-11e7-af6d-0e63b9c1c48f_79706ad4
ef2296fd8075        registry.ops.openshift.com/openshift3/ose-pod:v3.5.5.10         "/pod"                   25 hours ago        Up 25 hours                             k8s_POD.13c4e6bc_logging-fluentd-4070t_logging_a11ea457-26fc-11e7-a103-0e63b9c1c48f_6eae74c3

--------------------------------------------------------------------------------

from "oc describe pod django-psql-persistent-1-74fj2", this error appears:
--------------------------------------------------------------------------------
  1d		7s		6883	{kubelet ip-172-31-8-207.ec2.internal}			Warning		FailedSync	Error syncing pod, skipping: Error response from daemon: {"message":"devmapper: Unknown device 1f651f526b213c98550db3a320745f2413e62d418f3aaf590dd91212f5e77f1d"}

--------------------------------------------------------------------------------
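To cross-check the devmapper error on the node itself, something like the following can be used (service unit names are assumed for a RHEL-based OpenShift 3.5 host):
--------------------------------------------------------------------------------
# kubelet-side view of the failed pod sync (assumed unit name: atomic-openshift-node)
journalctl -u atomic-openshift-node --since "1 hour ago" | grep django-psql-persistent-1-74fj2

# docker-side view of the devicemapper errors
journalctl -u docker --since "1 hour ago" | grep -i devmapper
--------------------------------------------------------------------------------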



Version-Release number of selected component (if applicable):

oc v3.5.5.10
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.preview.openshift.com
openshift v3.5.5.10
kubernetes v1.5.2+43a9be4



How reproducible:

Not sure.

Actual results:

The pod is stuck in Terminating, so the node drain never completes; the openshift-ansible upgrade hangs on this step and eventually times out.

Expected results:

The pod should terminate and be rescheduled onto another node during the drain.


Additional info:

Comment 14 Eric Paris 2017-05-12 14:04:50 UTC
The running fluentd pod is the cause of the other 4 pods being stuck in Terminating.

Fluentd mounts /var/lib/docker/containers into itself to get access to the JSON log files. When the fluentd container is created, all of the mount points under /var/lib/docker/containers (i.e. every per-container /shm mount) are also created inside the fluentd container.

When docker later tries to tear down one of those other containers, the removal fails because that container's /shm mount is still in use inside the fluentd container.

The only way to recover is to delete the fluentd container.
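One way to confirm that diagnosis and recover (container ID and pod names taken from the docker ps output above; adjust for the affected node, and note this assumes grep is available in the fluentd image):
--------------------------------------------------------------------------------
# list the stale per-container /shm mounts held open inside the fluentd container
docker exec 1ad9094ae160 grep '/shm' /proc/mounts

# recover by deleting the fluentd pod; the logging-fluentd daemonset should recreate it
oc delete pod logging-fluentd-4070t -n logging
--------------------------------------------------------------------------------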



There were other failures as well, for example:
May 10 20:39:32 ip-172-31-2-90 dockerd-current: time="2017-05-10T20:39:32.234460958Z" level=error msg="Handler for DELETE /v1.24/containers/212db8317613fd4d00fa4688c79ccb7cfd976fe4d9bacc1497800ad9a4b1b179 returned error: Driver devicemapper failed to remove root filesystem 212db8317613fd4d00fa4688c79ccb7cfd976fe4d9bacc1497800ad9a4b1b179: remove /var/lib/docker/devicemapper/mnt/6481c69096ad5e51061ee3e9e93d3bcd3a5c0aac087c2253ada5dfb726cd4b5c: device or resource busy"

But that one cleared itself up, and I can't say why. I filed https://bugzilla.redhat.com/show_bug.cgi?id=1450426 so that in the future we might get more information about those particular failures.
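For failures like the one above, a way to see which processes still hold the busy mount (the ID is copied from the log line; <pid> is whatever PID turns up in the output):
--------------------------------------------------------------------------------
# list every process whose mount namespace still contains the busy devicemapper mount
grep 6481c69096ad5e51061ee3e9e93d3bcd3a5c0aac087c2253ada5dfb726cd4b5c /proc/*/mountinfo

# map a matching PID back to a running container
docker ps -q | xargs docker inspect --format '{{.State.Pid}} {{.Name}}' | grep <pid>
--------------------------------------------------------------------------------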

Comment 15 Eric Paris 2017-05-12 14:09:58 UTC
A potential workaround/solution to this problem might be for the fluentd pod to unmount everything it has inside /var/lib/docker on startup.

Basically, run the following on startup:

umount /var/lib/docker/containers/*/shm
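A minimal sketch of what that could look like at the top of the fluentd image's run.sh (untested; it simply skips mounts that are already gone):
--------------------------------------------------------------------------------
#!/bin/bash
# unmount any per-container /shm mounts that leaked into this mount namespace
for shm in /var/lib/docker/containers/*/shm; do
    umount "$shm" 2>/dev/null || true
done
--------------------------------------------------------------------------------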

