Description of problem:
When triggering two deployments in parallel, the pod generated by the previous deployment cannot be deleted; it stays in "Terminating" indefinitely.

Version-Release number of selected component (if applicable):
openshift v3.1.0.4-3-ga6353c7
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
Always

Steps to Reproduce:
1. Create a dc
[root@dma origin]# oc new-app -f examples/db-templates/mysql-ephemeral-template.json
--> Deploying template mysql-ephemeral for "examples/db-templates/mysql-ephemeral-template.json"
     With parameters:
      DATABASE_SERVICE_NAME=mysql
      MYSQL_USER=userCPE # generated
      MYSQL_PASSWORD=dOws2sFqr08HMFCB # generated
      MYSQL_DATABASE=sampledb
--> Creating resources ...
    Service "mysql" created
    DeploymentConfig "mysql" created
--> Success
    Run 'oc status' to view your app.
2. Trigger a deployment manually before the automatic deployment is triggered
[root@dma origin]# oc deploy mysql --latest
Started deployment #1
3. Check pod status
[root@dma origin]# oc get pod
NAME            READY     STATUS        RESTARTS   AGE
mysql-1-8no34   0/1       Terminating   0          12m
mysql-2-xc2z0   1/1       Running       0          10m

Actual results:
3. The pod "mysql-1-8no34" stays in "Terminating" indefinitely.

Expected results:
3. The pod "mysql-1-8no34" should be deleted.

Additional info:
[root@dma origin]# oc describe pod/mysql-1-8no34 |grep "Termination Grace Period"
Termination Grace Period:	30s
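A minimal sketch for confirming that deletion was actually requested for the stuck pod (pod name taken from the report above):

    # check whether the API server has marked the pod for deletion
    oc get pod mysql-1-8no34 -o yaml | grep deletionTimestamp
    # a non-empty deletionTimestamp means deletion was requested and the
    # kubelet has not yet finalized it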
Can this be consistently reproduced? I haven't had any success reproducing it. When you observe the pod in a "Terminating" state, please capture the output of `oc get pod -o yaml` and also `docker inspect` the containers for the pod. If the pod is "Terminating" when it should be in a terminal state relative to container status, the issue is with the kubelet. You might also need to wait longer for the kubelet to reconcile the pods with the container states. In any case, there is no bug with deployments: the deployment system shouldn't delete any of these pods until they reach a terminal state, which "Terminating" is not. The deployment system isn't responsible for transitioning the pod to a new phase based on its containers: that responsibility lies with the kubelet. The deployment system itself is behaving as designed here.
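A minimal sketch of how the requested diagnostics could be gathered, assuming the stuck pod from the report above and shell access to the node hosting it:

    # capture the pod object as seen by the API server
    oc get pod mysql-1-8no34 -o yaml > mysql-1-8no34.yaml
    # on the node: find the pod's containers, including exited ones ...
    docker ps -a | grep mysql-1-8no34
    # ... and inspect each container ID that turns up
    docker inspect <container-id>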
A correction to my previous statements: the deployment system would never delete these pods in any case. The pods you listed are owned by the deployment RCs. If the RC owning "mysql-1-8no34" has a replica count of 0, the deployment process did its job, and the responsibility for taking down the pod lies with the replication manager in Kube.
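A quick sketch for checking that side of it, assuming the old deployment's RC is named mysql-1 as the pod name suggests:

    # list the deployment RCs and their replica counts
    oc get rc
    # the old deployment's RC should show replicas: 0 once the deployment
    # process has finished scaling it down
    oc get rc mysql-1 -o yaml | grep replicas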
Dan Mace: The last time we tested, it could be consistently reproduced on our OSE environment, but now I can't reproduce it either. In this bug, "mysql-1-8no34" had been waiting 12m and was still "Terminating"; usually it isn't deleted quickly. I don't know what's wrong with this. Once this bug occurs, it can always be reproduced.
Reassigning to Node since the root issue is a pod stuck in "Terminating". We still need the pod YAML and output of docker inspect for one of the stuck pods in order to diagnose.
I am also able to reproduce this 100% of the time.

┌─[root@master1]─[~]
└──> oc get pods
NAME                         READY     STATUS        RESTARTS   AGE
hawkular-cassandra-1-008sq   1/1       Running       0          3h
hawkular-metrics-dbpwi       1/1       Running       0          43m
hawkular-metrics-gxi1s       0/1       Terminating   0          1h
hawkular-metrics-ki0pk       0/1       Terminating   0          3h
hawkular-metrics-r7ofl       0/1       Terminating   0          3h
heapster-hszpa               1/1       Running       0          12m
metrics-deployer-m619c       0/1       Completed     0          3h
┌─[root@master1]─[~]
└──> oc delete pod hawkular-metrics-r7ofl hawkular-metrics-gxi1s
pod "hawkular-metrics-r7ofl" deleted
pod "hawkular-metrics-gxi1s" deleted
┌─[root@master1]─[~]
└──> oc get pods
NAME                         READY     STATUS        RESTARTS   AGE
hawkular-cassandra-1-008sq   1/1       Running       0          3h
hawkular-metrics-dbpwi       1/1       Running       0          43m
hawkular-metrics-gxi1s       0/1       Terminating   0          1h
hawkular-metrics-ki0pk       0/1       Terminating   0          3h
hawkular-metrics-r7ofl       0/1       Terminating   0          3h
heapster-hszpa               1/1       Running       0          12m
metrics-deployer-m619c       0/1       Completed     0          3h

First deleting the pods and then restarting atomic-openshift-master-controllers resolves the issue for me.
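For reference, a sketch of that workaround, assuming the controllers run as the atomic-openshift-master-controllers systemd unit on this master:

    # delete the stuck pods, then restart the controllers process
    oc delete pod hawkular-metrics-r7ofl hawkular-metrics-gxi1s hawkular-metrics-ki0pk
    systemctl restart atomic-openshift-master-controllers
    # re-check; the Terminating pods should no longer be listed
    oc get pods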
This is an upstream issue related to the kubelet.
Do you have any more details on what the upstream issue is?
Andy, I don't know of any specific #. We don't set TerminationGracePeriodSeconds in the podspec anywhere in the deployment code, so it defaults to DefaultTerminationGracePeriodSeconds, which is 30s.
https://github.com/kubernetes/kubernetes/blob/beb5d01f9c72730768d875361ee9e0c08367a52e/pkg/api/v1/types.go#L1299
https://github.com/openshift/origin/blob/a001f36e711a0797fed6bb8e0b722e73fa26d306/examples/deployment/README.md#graceful-termination
I would expect terminating pods to be handled by the kubelet.
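A quick sketch for verifying that, using the dc and pod names from the original report:

    # the deployment code does not set terminationGracePeriodSeconds in the
    # pod template, so this should return nothing ...
    oc get dc mysql -o yaml | grep terminationGracePeriodSeconds
    # ... while the pod itself picks up the API default of 30s
    oc get pod mysql-1-8no34 -o yaml | grep terminationGracePeriodSeconds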
Solly, I talked with Ryan and he said he was able to reproduce when:
1. Node has a pod with a PV (NFS)
2. NFS server runs out of disk space
3. Attempts to delete pods running on that node leave the pods in Terminating state
Not sure if this is a coincidence or the actual root cause. Can you take a look?
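A rough sketch of that scenario, assuming an NFS-backed PV and access to the NFS server (the export path below is hypothetical):

    # on the NFS server: fill the export backing the PV so writes start failing
    dd if=/dev/zero of=/exports/pv0001/filler bs=1M
    # then delete a pod on the node that mounts that PV and check its status
    oc delete pod <pod-using-the-nfs-pv>
    oc get pods        # in the reported scenario the pod stays in Terminating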
I've been unable to reproduce using the above steps, but after some discussion with Ryan, it looks like it may be a property of certain containers. We'll look into it further.
Are you still able to reproduce this?
Now I can't reproduce this.
Closing. Feel free to reopen if you can provide steps to reproduce.