Bug 1502129

Summary: OpenShift Container Platform and CNS, pods stuck in terminating state
Product: OpenShift Container Platform
Component: Node
Version: 3.6.1
Reporter: Magnus Glantz <sudo>
Assignee: Joel Smith <joelsmith>
QA Contact: DeShuai Ma <dma>
CC: aos-bugs, jokerman, mmccomas, sudo, tahonen, wmeng
Severity: high
Priority: unspecified
Status: CLOSED DUPLICATE
Type: Bug
Last Closed: 2017-10-20 16:54:14 UTC
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified

Attachments:
- Ansible hosts file
- sosreport from master
- sosreport from node associated to stuck pod

Description Magnus Glantz 2017-10-14 11:59:57 UTC
Created attachment 1338537 [details]
Ansible hosts file

Description of problem:
When provisioning an OCP 3.6 cluster with CNS, pods fairly often cannot be deleted: they get stuck in the Terminating state, which also prevents deletion of the project that contains them. I have observed that restarting the node on which a pod is stuck resolves the state, though I am not sure it does so reliably.


Version-Release number of selected component (if applicable):
OCP 3.6-latest with container native storage.


How reproducible:
I can reliably recreate the issue by deploying a new OCP cluster with CNS and then creating and destroying 1-3 projects containing pods that have persistent storage via CNS.


Steps to Reproduce:
1. Install CNS enabled cluster using https://github.com/mglantz/ocp36-azure-cns
2. Create 1-3 projects using the Jenkins persistent template
3. Delete the projects and observe pods stuck in the terminating state.
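The steps above can be sketched as a script (a sketch only: the project names are illustrative, and using the `jenkins-persistent` template is an assumption based on the description; a stub `oc` function is defined so the sketch can be dry-run without a cluster):

```shell
#!/bin/bash
# Sketch of the reproduction loop described above (assumes a CNS-enabled cluster).
if ! command -v oc >/dev/null 2>&1; then
  # No real cluster available: stub oc so the sketch can be dry-run offline.
  oc() { echo "(dry-run) oc $*"; }
fi

# Create 1-3 projects, each running a Jenkins pod with persistent storage.
for i in 1 2 3; do
  oc new-project "cns-test-$i"
  oc new-app jenkins-persistent -n "cns-test-$i"
done

# Then delete the projects and watch for pods stuck in Terminating.
for i in 1 2 3; do
  oc delete project "cns-test-$i"
done
```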

Actual results:
Pods stuck in terminating state.

Expected results:
Projects and pods deleted.

Additional info:
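A quick way to spot pods left in this state is to filter `oc get pods` output on the STATUS column (a sketch; the sample output is inlined here so the filter can be checked without a cluster):

```shell
# Filter pod listings for pods stuck in Terminating.
# On a live cluster: oc get pods --all-namespaces | awk '$4 == "Terminating"'
# (with --all-namespaces, STATUS is the 4th column; in a single project it is the 3rd).
sample='NAME                 READY     STATUS        RESTARTS   AGE
po/jenkins-1-pjhzn   0/1       Terminating   0          1h
po/docker-registry-1-abcde   1/1   Running   0          1h'
echo "$sample" | awk 'NR > 1 && $3 == "Terminating" { print $1 }'
```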

Comment 1 Magnus Glantz 2017-10-14 12:00:56 UTC
Created attachment 1338538 [details]
sosreport from master

Comment 2 Magnus Glantz 2017-10-14 12:03:03 UTC
Created attachment 1338539 [details]
sosreport from node associated to stuck pod

This node also runs CNS/infra, but that does not seem to be related; I've seen pods stuck on nodes that do not run CNS.

Comment 3 Magnus Glantz 2017-10-14 12:05:19 UTC
Adding tahonen (SSA/OpenShift) who has also seen this issue.

Comment 4 Magnus Glantz 2017-10-14 12:07:52 UTC
Restarting atomic-openshift-master-api and atomic-openshift-master-controllers does not resolve the issue.

Comment 5 Magnus Glantz 2017-10-14 12:35:58 UTC
From stuck pod (which is detailed in sosreports)

[root@ocpm-0 ~]# oc project test
Now using project "test" on server "https://ocpb.eazdhewkr11upilhlavwerjpeb.fx.internal.cloudapp.net:8443".
[root@ocpm-0 ~]# oc get all
NAME                 READY     STATUS        RESTARTS   AGE
po/jenkins-1-pjhzn   0/1       Terminating   0          1h
[root@ocpm-0 ~]# oc describe pod jenkins-1-pjhzn 
Name:				jenkins-1-pjhzn
Namespace:			test
Security Policy:		restricted
Node:				ocpi-2.eazdhewkr11upilhlavwerjpeb.fx.internal.cloudapp.net/192.168.2.5
Start Time:			Sat, 14 Oct 2017 11:15:11 +0000
Labels:				deployment=jenkins-1
				deploymentconfig=jenkins
				name=jenkins
Annotations:			kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"test","name":"jenkins-1","uid":"ecdb2278-b0d0-11e7-8117-000d3ab79a02",...
				openshift.io/deployment-config.latest-version=1
				openshift.io/deployment-config.name=jenkins
				openshift.io/deployment.name=jenkins-1
				openshift.io/scc=restricted
Status:				Terminating (expires Sat, 14 Oct 2017 11:20:29 +0000)
Termination Grace Period:	30s
IP:				
Controllers:			ReplicationController/jenkins-1
Containers:
  jenkins:
    Container ID:	docker://625c14709fb05348b896bb0078ebed4f352e7cfa537a2a3fe8f358be6cf79b57
    Image:		registry.access.redhat.com/openshift3/jenkins-2-rhel7@sha256:c47b5d8c9ba8a57255e5191cbf0ed9e0cb998bc823846ba52c34cca11a3cf2a0
    Image ID:		docker-pullable://registry.access.redhat.com/openshift3/jenkins-2-rhel7@sha256:c47b5d8c9ba8a57255e5191cbf0ed9e0cb998bc823846ba52c34cca11a3cf2a0
    Port:		
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      memory:	512Mi
    Requests:
      memory:	512Mi
    Liveness:	http-get http://:8080/login delay=420s timeout=3s period=10s #success=1 #failure=30
    Readiness:	http-get http://:8080/login delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      OPENSHIFT_ENABLE_OAUTH:		true
      OPENSHIFT_ENABLE_REDIRECT_PROMPT:	true
      OPENSHIFT_JENKINS_JVM_ARCH:	i386
      KUBERNETES_MASTER:		https://kubernetes.default:443
      KUBERNETES_TRUST_CERTIFICATES:	true
      JNLP_SERVICE_NAME:		jenkins-jnlp
    Mounts:
      /var/lib/jenkins from jenkins-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from jenkins-token-82fgt (ro)
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  jenkins-data:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	jenkins
    ReadOnly:	false
  jenkins-token-82fgt:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	jenkins-token-82fgt
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	<none>
Tolerations:	<none>
Events:		<none>
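For a pod wedged like the one above, a possible workaround (an assumption on my part, not confirmed anywhere in this report) is to force-delete it, bypassing the grace period; note the kubelet may still hold the volume mount. A stub `oc` is defined so the sketch can be dry-run without a cluster:

```shell
#!/bin/bash
if ! command -v oc >/dev/null 2>&1; then
  # No real cluster available: stub oc so the command can be dry-run offline.
  oc() { echo "(dry-run) oc $*"; }
fi

# Force-delete the stuck pod, skipping the 30s termination grace period.
oc delete pod jenkins-1-pjhzn -n test --grace-period=0 --force
```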

Comment 6 Magnus Glantz 2017-10-14 12:38:09 UTC
Please note that this terminating state does not seem to expire, at least not within 2 hours.

Comment 7 Magnus Glantz 2017-10-14 19:33:59 UTC
The registry is not on CNS, in case that matters.

Comment 8 Seth Jennings 2017-10-18 04:00:31 UTC
Joel, PTAL. Might be related to (or a dup of) https://bugzilla.redhat.com/show_bug.cgi?id=1489082

Comment 9 Joel Smith 2017-10-20 06:42:39 UTC
Magnus,
Can you confirm whether just deleting the pod triggers the issue? Or does it only occur if you delete the namespace while the pod still exists?

Comment 10 Joel Smith 2017-10-20 16:54:14 UTC
We're marking this as a duplicate for now. If fixes for 1489082 don't remedy this, we'll re-open it.

*** This bug has been marked as a duplicate of bug 1489082 ***

Comment 11 Magnus Glantz 2019-11-25 09:42:51 UTC
Duplicate