Bug 1561375

Summary: pods in terminating status for more than 50 mins, sometimes
Product: OpenShift Container Platform
Component: Node
Version: 3.9.0
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Reporter: Weihua Meng <wmeng>
Assignee: Seth Jennings <sjenning>
QA Contact: DeShuai Ma <dma>
CC: aos-bugs, jokerman, mmccomas, sjenning, wjiang, wmeng
Severity: unspecified
Priority: unspecified
Status: CLOSED INSUFFICIENT_DATA
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2018-04-11 01:54:18 UTC

Description Weihua Meng 2018-03-28 08:52:17 UTC
Description of problem:
Pods sometimes remain in Terminating status for more than 50 minutes.

I hit this while upgrading OCP 3.7.40 to v3.9.14 on OpenStack.
It causes the upgrade to fail.

In project kube-service-catalog:
# oc get pods
NAME                       READY     STATUS        RESTARTS   AGE
apiserver-sbjfj            0/1       Terminating   15         3h
controller-manager-9nkpw   0/1       Terminating   2          3h


# oc describe ds/controller-manager
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  3h    daemon-set            Created pod: controller-manager-9nkpw
  Normal  SuccessfulDelete  51m   daemonset-controller  Deleted pod: controller-manager-9nkpw

# oc get nodes
NAME             STATUS    ROLES     AGE       VERSION
172.16.120.107   Ready     compute   3h        v1.7.6+a08f5eeb62
172.16.120.126   Ready     master    3h        v1.9.1+a0ce1bc657
172.16.120.127   Ready     master    3h        v1.9.1+a0ce1bc657
172.16.120.133   Ready     master    3h        v1.9.1+a0ce1bc657
172.16.120.139   Ready     compute   3h        v1.7.6+a08f5eeb62


Mar 28 07:53:08 host-172-16-120-127 atomic-openshift-node[21939]: E0328 07:53:08.200965   22008 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-service-catalog-apiserver-token-xq2kg\" (\"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\")" failed. No retries permitted until 2018-03-28 07:53:08.700924633 +0000 UTC m=+638.555869093 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"service-catalog-apiserver-token-xq2kg\" (UniqueName: \"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-service-catalog-apiserver-token-xq2kg\") pod \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\" (UID: \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\") : remove /var/lib/origin/openshift.local.volumes/pods/2e5276fa-3249-11e8-92bc-fa163ed8c2e8/volumes/kubernetes.io~secret/service-catalog-apiserver-token-xq2kg: device or resource busy"
Mar 28 07:53:08 host-172-16-120-127 atomic-openshift-node[21939]: E0328 07:53:08.203675   22008 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-apiserver-ssl\" (\"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\")" failed. No retries permitted until 2018-03-28 07:53:08.703629544 +0000 UTC m=+638.558574027 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"apiserver-ssl\" (UniqueName: \"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-apiserver-ssl\") pod \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\" (UID: \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\") : remove /var/lib/origin/openshift.local.volumes/pods/2e5276fa-3249-11e8-92bc-fa163ed8c2e8/volumes/kubernetes.io~secret/apiserver-ssl: device or resource busy"

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.9.14
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16

openshift-ansible-3.9.14-1.git.3.c62bc34.el7.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Set up OCP v3.7.40 on OpenStack.
2. Upgrade to 3.9.

Actual results:
Upgrade failed.
Pods in project kube-service-catalog stayed in Terminating status for more than 50 minutes.

Expected results:
Upgrade succeeds and pods are running.

Comment 2 Seth Jennings 2018-04-10 02:16:24 UTC
Is this still happening?

If so, use the path in the TearDown error message as an argument to "lsof +D" on the node to figure out what process is holding the mount point open.
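
For example, a minimal check using the volume path from the TearDown error above (the pod UID and volume name are taken from that log line; substitute the path from your own error message):

# lsof +D /var/lib/origin/openshift.local.volumes/pods/2e5276fa-3249-11e8-92bc-fa163ed8c2e8/volumes/kubernetes.io~secret/service-catalog-apiserver-token-xq2kg

Any process listed is holding a file, or the mount itself, open under that directory.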

Comment 3 Weihua Meng 2018-04-10 13:05:00 UTC
What should I do when there is no such command on Atomic Host?

Thanks.

# which lsof
/usr/bin/which: no lsof in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/.local/bin:/root/bin)

  Operating System: Red Hat Enterprise Linux Atomic Host 7.4.5

Comment 5 Weihua Meng 2018-04-10 23:26:51 UTC
Sorry. 
I tried upgrading OCP 3.7.42 (latest 3.7) to 3.9.19 (latest 3.9) on AH 7.4.5 and AH 7.5.0, and did not hit this issue.

I did keep an env for debugging when I hit this; unfortunately it is gone now, since 10+ days have passed.

May I have your suggestion on what config to try next,
and how long the env needs to be kept for debugging if it occurs again?

Comment 6 Seth Jennings 2018-04-11 01:54:18 UTC
I don't have a suggestion on how it could be recreated as it shouldn't happen in normal operation.

Without a recreation environment/procedure or logs, there isn't sufficient information to make progress on this bug. There could be a number of reasons why the mount point was busy, but there is no way to tell without more information from an environment where the situation is happening.

If this is recreated, please reopen and provide access to the host that is experiencing the issue. The key will be finding the process that has a file open on the mount point.
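
If lsof is not available (as noted in comment 3 for Atomic Host), one possible sketch that needs no extra packages: "device or resource busy" on a secret tmpfs commonly means another process's mount namespace still holds the mount, so you can search every process's mount table for the volume name and then inspect the matching PIDs. This is an illustrative approach, not something verified on the affected host:

# grep -l service-catalog-apiserver-token-xq2kg /proc/[0-9]*/mountinfo
# ps -fp <pid>    # for each PID whose mountinfo still lists the volume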

Comment 7 Weihua Meng 2018-04-12 01:24:44 UTC
Can you take a look at BZ 1566150?

Seems related.
Thanks.

"We have a fluentd point that won't terminate.  We issued an "ocp delete pod logging-fluentd-dn9xt", and then just see the pod stay in the "Terminating" state.

This is on OCP 3.9 and one of the starter clusters."