Description of problem:
Pods stuck in Terminating status for more than 50 minutes. I sometimes hit this while upgrading OCP 3.7.40 to v3.9.14 on OpenStack; it results in upgrade failure in project kube-service-catalog.

# oc get pods
NAME                       READY     STATUS        RESTARTS   AGE
apiserver-sbjfj            0/1       Terminating   15         3h
controller-manager-9nkpw   0/1       Terminating   2          3h

# oc describe ds/controller-manager
Events:
  Type    Reason             Age   From                  Message
  ----    ------             ----  ----                  -------
  Normal  SuccessfulCreate   3h    daemon-set            Created pod: controller-manager-9nkpw
  Normal  SuccessfulDelete   51m   daemonset-controller  Deleted pod: controller-manager-9nkpw

# oc get nodes
NAME             STATUS    ROLES     AGE       VERSION
172.16.120.107   Ready     compute   3h        v1.7.6+a08f5eeb62
172.16.120.126   Ready     master    3h        v1.9.1+a0ce1bc657
172.16.120.127   Ready     master    3h        v1.9.1+a0ce1bc657
172.16.120.133   Ready     master    3h        v1.9.1+a0ce1bc657
172.16.120.139   Ready     compute   3h        v1.7.6+a08f5eeb62

Log from atomic-openshift-node on host-172-16-120-127:

Mar 28 07:53:08 host-172-16-120-127 atomic-openshift-node[21939]: E0328 07:53:08.200965 22008 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-service-catalog-apiserver-token-xq2kg\" (\"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\")" failed. No retries permitted until 2018-03-28 07:53:08.700924633 +0000 UTC m=+638.555869093 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"service-catalog-apiserver-token-xq2kg\" (UniqueName: \"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-service-catalog-apiserver-token-xq2kg\") pod \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\" (UID: \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\") : remove /var/lib/origin/openshift.local.volumes/pods/2e5276fa-3249-11e8-92bc-fa163ed8c2e8/volumes/kubernetes.io~secret/service-catalog-apiserver-token-xq2kg: device or resource busy"

Mar 28 07:53:08 host-172-16-120-127 atomic-openshift-node[21939]: E0328 07:53:08.203675 22008 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-apiserver-ssl\" (\"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\")" failed. No retries permitted until 2018-03-28 07:53:08.703629544 +0000 UTC m=+638.558574027 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"apiserver-ssl\" (UniqueName: \"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-apiserver-ssl\") pod \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\" (UID: \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\") : remove /var/lib/origin/openshift.local.volumes/pods/2e5276fa-3249-11e8-92bc-fa163ed8c2e8/volumes/kubernetes.io~secret/apiserver-ssl: device or resource busy"

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.9.14
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16

openshift-ansible-3.9.14-1.git.3.c62bc34.el7.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Set up OCP v3.7.40 on OpenStack.
2. Upgrade to 3.9.

Actual results:
Upgrade failed. Pods in project kube-service-catalog stuck in Terminating status for more than 50 minutes.

Expected results:
Upgrade succeeds and pods are running.
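Note: one possible way to confirm on the node that the secret tmpfs mounts for the stuck pod are still present (a diagnostic sketch only; the pod UID below is taken from the TearDown errors above) is to grep the node's mount table for the pod UID:

# grep 2e5276fa-3249-11e8-92bc-fa163ed8c2e8 /proc/mounts

If the kubernetes.io~secret mounts still appear there, something on the node is keeping them busy.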
Is this still happening? If so, use the path in the TearDown error message as an argument to "lsof +D" on the node to figure out what process is holding the mount point open.
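For example, with the path from the first TearDown error above (assuming lsof is installed on the node):

# lsof +D /var/lib/origin/openshift.local.volumes/pods/2e5276fa-3249-11e8-92bc-fa163ed8c2e8/volumes/kubernetes.io~secret/service-catalog-apiserver-token-xq2kg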
What should I do when there is no such command on Atomic Host? Thanks.

# which lsof
/usr/bin/which: no lsof in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/.local/bin:/root/bin)

Operating System: Red Hat Enterprise Linux Atomic Host 7.4.5
Sorry. I tried upgrading OCP 3.7.42 (latest 3.7) to 3.9.19 (latest 3.9) on AH 7.4.5 and AH 7.5.0 and did not hit this issue. I do keep an environment around for debugging when I hit it, but unfortunately that one is gone since more than 10 days have passed. May I have your suggestion on what configuration to try next, and how long the environment needs to be kept for debugging if the issue occurs again?
I don't have a suggestion on how it could be recreated, as it shouldn't happen in normal operation. Without a recreation environment/procedure or logs, there isn't sufficient information to make progress on this bug. There could be a number of reasons why the mount point was busy, but there is no way to tell without more information from an environment where the situation is happening. If this is recreated, please reopen and provide access to the host that is experiencing the issue. The key will be finding the process that has a file open on the mount point.
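For reference, if lsof is not available on the node (as on the Atomic Host above), one possible alternative, sketched here with the volume name and path from the TearDown errors in the description, is to search the per-process mount tables (this only needs grep) and then look at the matching PIDs:

# grep -l service-catalog-apiserver-token-xq2kg /proc/*/mountinfo
# ps -fp <PID from the matching /proc/<PID>/mountinfo paths above>

If fuser happens to be present, it can point at processes using the mounted filesystem directly:

# fuser -vm /var/lib/origin/openshift.local.volumes/pods/2e5276fa-3249-11e8-92bc-fa163ed8c2e8/volumes/kubernetes.io~secret/service-catalog-apiserver-token-xq2kg

(On Atomic Host, lsof itself can usually be obtained via the rhel7/rhel-tools support container; treat that as a suggestion to verify rather than a confirmed availability.)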
Can you take a look at BZ 1566150? It seems related. Thanks. "We have a fluentd pod that won't terminate. We issued an "oc delete pod logging-fluentd-dn9xt", and then just see the pod stay in the "Terminating" state. This is on OCP 3.9 and one of the starter clusters."