Created attachment 1338537 [details]
Ansible hosts file

Description of problem:
When provisioning an OCP 3.6 cluster with CNS, created pods fairly often cannot be deleted. They get stuck in the Terminating state, so the project they belong to cannot be deleted either. I have observed that restarting the nodes on which pods are stuck resolves this state, though I am not sure it does so reliably.

Version-Release number of selected component (if applicable):
OCP 3.6-latest with container native storage.

How reproducible:
I can reliably recreate the issue by deploying a new OCP cluster with CNS and then creating and destroying 1-3 projects with pods that have persistent storage via CNS.

Steps to Reproduce:
1. Install a CNS-enabled cluster using https://github.com/mglantz/ocp36-azure-cns
2. Create 1-3 projects with Jenkins persistent
3. Delete the projects and observe pods stuck in the Terminating state.

Actual results:
Pods stuck in the Terminating state.

Expected results:
Projects and pods deleted.

Additional info:
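The reproduction steps can be sketched as a small script. This is only an illustration under assumptions: the project names are hypothetical, and `jenkins-persistent` is assumed to be the template meant by "Jenkins persistent" in step 2. By default it only prints the `oc` commands; the delete phase should be run only after the Jenkins pods have started.

```python
import shlex
import subprocess


def create_commands(projects, template="jenkins-persistent"):
    """Build the oc commands for step 2 (template name is an assumption)."""
    commands = []
    for project in projects:
        commands.append(["oc", "new-project", project])
        commands.append(["oc", "new-app", template, "-n", project])
    return commands


def delete_commands(projects):
    """Build the oc commands for step 3."""
    return [["oc", "delete", "project", project] for project in projects]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical project names, 1-3 projects as in the report.
    projects = ["stuck-test-1", "stuck-test-2"]
    run(create_commands(projects))
    # Wait here until the Jenkins pods are running, then delete.
    run(delete_commands(projects))
```

After the delete phase, `oc get pods -n <project>` on an affected cluster would be expected to keep showing the pod as Terminating.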
Created attachment 1338538 [details] sosreport from master
Created attachment 1338539 [details]
sosreport from node associated with the stuck pod

This node also runs CNS/infra, but that does not seem to be related: I have seen pods stuck on nodes that do not run CNS.
Adding tahonen (SSA/OpenShift) who has also seen this issue.
Restarting atomic-openshift-master-api and atomic-openshift-master-controllers does not resolve the issue.
From stuck pod (which is detailed in sosreports)

[root@ocpm-0 ~]# oc project test
Now using project "test" on server "https://ocpb.eazdhewkr11upilhlavwerjpeb.fx.internal.cloudapp.net:8443".
[root@ocpm-0 ~]# oc get all
NAME                 READY     STATUS        RESTARTS   AGE
po/jenkins-1-pjhzn   0/1       Terminating   0          1h
[root@ocpm-0 ~]# oc describe pod jenkins-1-pjhzn
Name:                           jenkins-1-pjhzn
Namespace:                      test
Security Policy:                restricted
Node:                           ocpi-2.eazdhewkr11upilhlavwerjpeb.fx.internal.cloudapp.net/192.168.2.5
Start Time:                     Sat, 14 Oct 2017 11:15:11 +0000
Labels:                         deployment=jenkins-1
                                deploymentconfig=jenkins
                                name=jenkins
Annotations:                    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"test","name":"jenkins-1","uid":"ecdb2278-b0d0-11e7-8117-000d3ab79a02",...
                                openshift.io/deployment-config.latest-version=1
                                openshift.io/deployment-config.name=jenkins
                                openshift.io/deployment.name=jenkins-1
                                openshift.io/scc=restricted
Status:                         Terminating (expires Sat, 14 Oct 2017 11:20:29 +0000)
Termination Grace Period:       30s
IP:
Controllers:                    ReplicationController/jenkins-1
Containers:
  jenkins:
    Container ID:       docker://625c14709fb05348b896bb0078ebed4f352e7cfa537a2a3fe8f358be6cf79b57
    Image:              registry.access.redhat.com/openshift3/jenkins-2-rhel7@sha256:c47b5d8c9ba8a57255e5191cbf0ed9e0cb998bc823846ba52c34cca11a3cf2a0
    Image ID:           docker-pullable://registry.access.redhat.com/openshift3/jenkins-2-rhel7@sha256:c47b5d8c9ba8a57255e5191cbf0ed9e0cb998bc823846ba52c34cca11a3cf2a0
    Port:
    State:              Terminated
      Exit Code:        0
      Started:          Mon, 01 Jan 0001 00:00:00 +0000
      Finished:         Mon, 01 Jan 0001 00:00:00 +0000
    Ready:              False
    Restart Count:      0
    Limits:
      memory:   512Mi
    Requests:
      memory:   512Mi
    Liveness:   http-get http://:8080/login delay=420s timeout=3s period=10s #success=1 #failure=30
    Readiness:  http-get http://:8080/login delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      OPENSHIFT_ENABLE_OAUTH:           true
      OPENSHIFT_ENABLE_REDIRECT_PROMPT: true
      OPENSHIFT_JENKINS_JVM_ARCH:       i386
      KUBERNETES_MASTER:                https://kubernetes.default:443
      KUBERNETES_TRUST_CERTIFICATES:    true
      JNLP_SERVICE_NAME:                jenkins-jnlp
    Mounts:
      /var/lib/jenkins from jenkins-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from jenkins-token-82fgt (ro)
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  jenkins-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  jenkins
    ReadOnly:   false
  jenkins-token-82fgt:
    Type:       Secret (a volume populated by a Secret)
    SecretName: jenkins-token-82fgt
    Optional:   false
QoS Class:      Burstable
Node-Selectors: <none>
Tolerations:    <none>
Events:         <none>
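To spot stuck pods without reading the listing by eye, a table like the `oc get` output above can be filtered on its STATUS column. The sketch below is illustrative only: `terminating_pods` is a hypothetical helper, and it assumes a single whitespace-separated table with NAME and STATUS headers and no empty cells.

```python
def terminating_pods(oc_get_output):
    """Return the NAME of every row whose STATUS column is 'Terminating'.

    Assumes one table with a header row containing NAME and STATUS, and
    whitespace-separated columns with no empty cells.
    """
    lines = [line for line in oc_get_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    stuck = []
    for line in lines[1:]:
        cols = line.split()
        # Skip rows too short to hold both columns.
        if len(cols) <= max(name_idx, status_idx):
            continue
        if cols[status_idx] == "Terminating":
            stuck.append(cols[name_idx])
    return stuck


# Sample taken from the `oc get all` output in this report.
SAMPLE = """\
NAME                 READY     STATUS        RESTARTS   AGE
po/jenkins-1-pjhzn   0/1       Terminating   0          1h
"""

if __name__ == "__main__":
    print(terminating_pods(SAMPLE))
```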
Please note that this Terminating state does not seem to expire, at least not within two hours.
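One way to quantify "does not expire" is to compare the deadline printed in the `Status: Terminating (expires ...)` line with the current time. The following is a minimal sketch under assumptions: `overdue_by` is a hypothetical helper, and the timestamp is assumed to be in the RFC 2822 style shown in the `oc describe` output above.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Matches the status line as printed in the describe output of this report.
STATUS_RE = re.compile(r"Terminating \(expires (?P<when>[^)]+)\)")


def overdue_by(describe_text, now):
    """Return how long the pod has been past its printed deadline.

    Returns None if no 'Terminating (expires ...)' status is found, and a
    zero timedelta if the deadline has not passed yet.
    """
    match = STATUS_RE.search(describe_text)
    if match is None:
        return None
    deadline = parsedate_to_datetime(match.group("when"))
    return max(now - deadline, timedelta(0))


# Status line copied from the report.
SAMPLE = "Status: Terminating (expires Sat, 14 Oct 2017 11:20:29 +0000)"

if __name__ == "__main__":
    # In practice `now` would be datetime.now(timezone.utc); a fixed value
    # two hours after the deadline is used here for illustration.
    now = datetime(2017, 10, 14, 13, 20, 29, tzinfo=timezone.utc)
    print(overdue_by(SAMPLE, now))
```

A pod that stays overdue long after its grace period has passed matches the behaviour described in this report.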
Registry is not on CNS, if this matters.
Joel, PTAL. Might be related to (or a dup of) https://bugzilla.redhat.com/show_bug.cgi?id=1489082
Magnus, can you confirm whether just deleting the pod triggers the issue? Or does it only occur if you delete the namespace while the pod still exists?
We're marking this as a duplicate for now. If fixes for 1489082 don't remedy this, we'll re-open it. *** This bug has been marked as a duplicate of bug 1489082 ***
Duplicate