Description of problem:

We are doing an upgrade from 3.5.x to another version of 3.5.x. During this, we have a pod that is stuck in "Terminating". This happens during a node drain operation.

--------------------------------------------------------------------------------
[root@xxxx ~]# oc get pod -n xxxxxx django-psql-persistent-1-74fj2
NAME                             READY     STATUS        RESTARTS   AGE
django-psql-persistent-1-74fj2   0/1       Terminating   12         3d
--------------------------------------------------------------------------------

On the node, this is the output of docker ps:

--------------------------------------------------------------------------------
[root@xxxxxxxx-compute-949c9 ~]# docker ps
CONTAINER ID   IMAGE                                                         COMMAND       CREATED        STATUS        PORTS   NAMES
1ad9094ae160   registry.ops.openshift.com/openshift3/logging-fluentd:v3.4   "sh run.sh"   25 hours ago   Up 25 hours           k8s_fluentd-elasticsearch.37ec2843_logging-fluentd-4070t_logging_a11ea457-26fc-11e7-a103-0e63b9c1c48f_b3fe7e96
5ea96c31a492   registry.ops.openshift.com/openshift3/ose-pod:v3.5.5.10      "/pod"        25 hours ago   Up 25 hours           k8s_POD.13c4e6bc_django-psql-persistent-1-74fj2_test-hungy_559472e9-31a0-11e7-af6d-0e63b9c1c48f_79706ad4
ef2296fd8075   registry.ops.openshift.com/openshift3/ose-pod:v3.5.5.10      "/pod"        25 hours ago   Up 25 hours           k8s_POD.13c4e6bc_logging-fluentd-4070t_logging_a11ea457-26fc-11e7-a103-0e63b9c1c48f_6eae74c3
--------------------------------------------------------------------------------

From "oc describe pod django-psql-persistent-1-74fj2", this error appears:

--------------------------------------------------------------------------------
1d  7s  6883  {kubelet ip-172-31-8-207.ec2.internal}  Warning  FailedSync  Error syncing pod, skipping: Error response from daemon: {"message":"devmapper: Unknown device 1f651f526b213c98550db3a320745f2413e62d418f3aaf590dd91212f5e77f1d"}
--------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):

oc v3.5.5.10
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.preview.openshift.com
openshift v3.5.5.10
kubernetes v1.5.2+43a9be4

How reproducible:
Not sure.

Actual results:
The pod is stuck in Terminating, and the upgrade in openshift-ansible hangs indefinitely and times out.

Expected results:
The pod should terminate and be moved off the node during the drain.

Additional info:
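For what it's worth, here is a diagnostic sketch (not from the original report) for finding which container processes on the node still hold copies of other containers' /shm mounts in their private mount namespaces; it assumes the default devicemapper/json-file setup shown above and needs to run as root on the node:

--------------------------------------------------------------------------------
# Compare each process's mount namespace against PID 1's; leaked /shm mounts
# only matter when they live on inside a container's private namespace.
host_mnt_ns=$(readlink /proc/1/ns/mnt)
for d in /proc/[0-9]*; do
    pid=${d#/proc/}
    [ "$(readlink "$d/ns/mnt" 2>/dev/null)" = "$host_mnt_ns" ] && continue
    shm=$(grep -o '/var/lib/docker/containers/[^ ]*/shm' "$d/mountinfo" 2>/dev/null | sort -u)
    [ -n "$shm" ] || continue
    echo "== PID $pid ($(cat "$d/comm" 2>/dev/null)) still holds:"
    echo "$shm"
done
--------------------------------------------------------------------------------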
The running fluentd pod is the cause of the other 4 pods being stuck in Terminating. Fluentd bind-mounts /var/lib/docker/containers into itself to get access to the json-log files. When the fluentd container is created, all of the mount points under /var/lib/docker/containers (i.e. all of the per-container /shm mount points) are also replicated inside the fluentd container's mount namespace. When docker then tries to tear down some other container, it fails because that container's /shm is still in use inside the fluentd container. The only way to recover is to delete the fluentd container.

There were other failures as well, for example:

May 10 20:39:32 ip-172-31-2-90 dockerd-current: time="2017-05-10T20:39:32.234460958Z" level=error msg="Handler for DELETE /v1.24/containers/212db8317613fd4d00fa4688c79ccb7cfd976fe4d9bacc1497800ad9a4b1b179 returned error: Driver devicemapper failed to remove root filesystem 212db8317613fd4d00fa4688c79ccb7cfd976fe4d9bacc1497800ad9a4b1b179: remove /var/lib/docker/devicemapper/mnt/6481c69096ad5e51061ee3e9e93d3bcd3a5c0aac087c2253ada5dfb726cd4b5c: device or resource busy"

But this has cleared itself up, and I can't say why. I filed https://bugzilla.redhat.com/show_bug.cgi?id=1450426 so that in the future we might get more info about those particular failures.
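To make that mechanism concrete, here is a minimal illustration on a plain docker host. The container names and the rhel7 image are placeholders, and whether removal actually fails depends on the storage driver and kernel, so treat this as a sketch rather than a guaranteed reproducer:

--------------------------------------------------------------------------------
# "holder" plays the role of fluentd: it bind-mounts the docker state
# directory, so it captures a copy of every existing container's /shm mount
# (including "victim"'s) in its own mount namespace at creation time.
docker run -d --name victim rhel7 sleep 3600
docker run -d --name holder -v /var/lib/docker:/var/lib/docker:ro rhel7 sleep 3600

# Tearing down "victim" can now fail with "device or resource busy" /
# "devmapper: Unknown device ..." because its /shm is still referenced
# from inside "holder".
docker rm -f victim

# Recovery: remove the holder first, then the victim can be cleaned up.
docker rm -f holder
docker rm -f victim
--------------------------------------------------------------------------------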
A potential workaround/solution to this problem might be for the fluentd pod to unmount, on startup, everything it has picked up under /var/lib/docker; basically:

umount /var/lib/docker/containers/*/shm
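A rough sketch of what that could look like, assuming it runs at the top of the fluentd image's run.sh and that the container is privileged enough to unmount (both are assumptions on my part; this is not a tested patch):

--------------------------------------------------------------------------------
# Release any per-container /shm mounts that were copied into this container's
# mount namespace when it was created, so docker can tear those containers
# down while fluentd keeps running.
for m in /var/lib/docker/containers/*/shm; do
    if mountpoint -q "$m"; then
        umount "$m" || echo "warning: could not unmount $m" >&2
    fi
done
--------------------------------------------------------------------------------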