Bug 1561375 - pods in terminating status for more than 50 mins, sometimes
Summary: pods in terminating status for more than 50 mins, sometimes
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.9.z
Assignee: Seth Jennings
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-28 08:52 UTC by Weihua Meng
Modified: 2018-08-27 13:36 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-11 01:54:18 UTC
Target Upstream Version:
Embargoed:


Links:
Red Hat Bugzilla 1566150 (high, CLOSED): Fluentd pod won't terminate, even after pid of container was killed (last updated 2021-02-22 00:41:40 UTC)

Internal Links: 1566150

Description Weihua Meng 2018-03-28 08:52:17 UTC
Description of problem:
Pods sometimes stay in Terminating status for more than 50 minutes.

I hit this while upgrading OCP 3.7.40 to v3.9.14 on OpenStack;
it caused the upgrade to fail.

In the kube-service-catalog project:
# oc get pods
NAME                       READY     STATUS        RESTARTS   AGE
apiserver-sbjfj            0/1       Terminating   15         3h
controller-manager-9nkpw   0/1       Terminating   2          3h


# oc describe ds/controller-manager
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  3h    daemon-set            Created pod: controller-manager-9nkpw
  Normal  SuccessfulDelete  51m   daemonset-controller  Deleted pod: controller-manager-9nkpw

# oc get nodes
NAME             STATUS    ROLES     AGE       VERSION
172.16.120.107   Ready     compute   3h        v1.7.6+a08f5eeb62
172.16.120.126   Ready     master    3h        v1.9.1+a0ce1bc657
172.16.120.127   Ready     master    3h        v1.9.1+a0ce1bc657
172.16.120.133   Ready     master    3h        v1.9.1+a0ce1bc657
172.16.120.139   Ready     compute   3h        v1.7.6+a08f5eeb62


Mar 28 07:53:08 host-172-16-120-127 atomic-openshift-node[21939]: E0328 07:53:08.200965   22008 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-service-catalog-apiserver-token-xq2kg\" (\"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\")" failed. No retries permitted until 2018-03-28 07:53:08.700924633 +0000 UTC m=+638.555869093 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"service-catalog-apiserver-token-xq2kg\" (UniqueName: \"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-service-catalog-apiserver-token-xq2kg\") pod \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\" (UID: \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\") : remove /var/lib/origin/openshift.local.volumes/pods/2e5276fa-3249-11e8-92bc-fa163ed8c2e8/volumes/kubernetes.io~secret/service-catalog-apiserver-token-xq2kg: device or resource busy"
Mar 28 07:53:08 host-172-16-120-127 atomic-openshift-node[21939]: E0328 07:53:08.203675   22008 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-apiserver-ssl\" (\"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\")" failed. No retries permitted until 2018-03-28 07:53:08.703629544 +0000 UTC m=+638.558574027 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"apiserver-ssl\" (UniqueName: \"kubernetes.io/secret/2e5276fa-3249-11e8-92bc-fa163ed8c2e8-apiserver-ssl\") pod \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\" (UID: \"2e5276fa-3249-11e8-92bc-fa163ed8c2e8\") : remove /var/lib/origin/openshift.local.volumes/pods/2e5276fa-3249-11e8-92bc-fa163ed8c2e8/volumes/kubernetes.io~secret/apiserver-ssl: device or resource busy"
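One possible check while a node is still in this state (a general diagnostic sketch, not a confirmed root cause; the volume name is taken from the errors above, and a lingering mount in another container's mount namespace is only one way to end up with "device or resource busy"):

# grep -l service-catalog-apiserver-token-xq2kg /proc/[0-9]*/mountinfo

Each file listed corresponds to a PID whose mount namespace still references the secret volume; /proc/<PID>/cmdline then identifies the process or container.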

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.9.14
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16

openshift-ansible-3.9.14-1.git.3.c62bc34.el7.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Set up OCP v3.7.40 on OpenStack.
2. Upgrade to 3.9 (a sketch of the upgrade command follows below).
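For reference, a sketch of the upgrade invocation used in step 2 (this assumes the standard openshift-ansible playbook layout for 3.9; the inventory path is only an example):

# ansible-playbook -i /etc/ansible/hosts \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml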

Actual results:
Upgrade failed.
Pods in the kube-service-catalog project stayed in Terminating status for more than 50 minutes.

Expected results:
Upgrade succeeds and the pods are running.
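As a general workaround (not taken from this report, and it does not address the underlying unmount failure), a pod stuck in Terminating can usually be removed from the API by force-deleting it:

# oc delete pod apiserver-sbjfj -n kube-service-catalog --grace-period=0 --force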

Comment 2 Seth Jennings 2018-04-10 02:16:24 UTC
Is this still happening?

If so, use the path in the TearDown error message as an argument to "lsof +D" on the node to figure out what process is holding the mount point open.
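For example, using the path from the first TearDown error above (substitute the path from your own error message):

# lsof +D /var/lib/origin/openshift.local.volumes/pods/2e5276fa-3249-11e8-92bc-fa163ed8c2e8/volumes/kubernetes.io~secret/service-catalog-apiserver-token-xq2kg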

Comment 3 Weihua Meng 2018-04-10 13:05:00 UTC
What should I do when that command is not available on Atomic Host?

Thanks.

# which lsof
/usr/bin/which: no lsof in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/.local/bin:/root/bin)

  Operating System: Red Hat Enterprise Linux Atomic Host 7.4.5

Comment 5 Weihua Meng 2018-04-10 23:26:51 UTC
Sorry.
I tried upgrading OCP 3.7.42 (latest 3.7) to 3.9.19 (latest 3.9) on AH 7.4.5 and AH 7.5.0, and did not hit this issue.

I did keep an environment for debugging when I hit it, but unfortunately it is gone now that more than 10 days have passed.

Could you suggest which configuration to try next,
and how long the environment needs to be kept for debugging if the issue occurs again?

Comment 6 Seth Jennings 2018-04-11 01:54:18 UTC
I don't have a suggestion for how to recreate it, as it shouldn't happen in normal operation.

Without a recreation environment/procedure or logs, there isn't sufficient information to make progress on this bug.  There could be a number of reasons why the mount point was busy, but there is no way to tell without more information from an environment where the situation is happening.

If this is recreated, please reopen the bug and provide access to the host that is experiencing the issue.  The key will be finding the process that has a file open on the mount point.
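If lsof is not available on the Atomic Host (comment 3), two possible alternatives (sketches, assuming a standard Atomic Host setup with the rhel-tools support container available) are to run lsof from the tools container, or to scan /proc directly for file descriptors that point under the pod volume directories:

# atomic run rhel7/rhel-tools
(then run the lsof command from inside the tools container)

# find /proc/[0-9]*/fd -lname '/var/lib/origin/openshift.local.volumes/pods/*' 2>/dev/null

The second command prints /proc/<PID>/fd entries whose targets live under the pod volume paths, which identifies the PIDs holding them open.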

Comment 7 Weihua Meng 2018-04-12 01:24:44 UTC
Can you take a look at BZ 1566150?

It seems related.
Thanks.

"We have a fluentd pod that won't terminate.  We issued an "oc delete pod logging-fluentd-dn9xt", and then just see the pod stay in the "Terminating" state.

This is on OCP 3.9 and one of the starter clusters."

