Bug 1281286 - Can't delete previous pod when trigger deployment concurrently
Summary: Can't delete previous pod when trigger deployment concurrently
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Solly Ross
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-11-12 09:00 UTC by DeShuai Ma
Modified: 2016-03-30 05:41 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-18 11:04:19 UTC


Attachments (Terms of Use)

Description DeShuai Ma 2015-11-12 09:00:50 UTC
Description of problem:
When two deployments are triggered in parallel, the pod generated by the previous deployment can't be deleted; it stays in "Terminating".

Version-Release number of selected component (if applicable):
openshift v3.1.0.4-3-ga6353c7
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
Always

Steps to Reproduce:
1. Create a dc
[root@dma origin]# oc new-app -f examples/db-templates/mysql-ephemeral-template.json
--> Deploying template mysql-ephemeral for "examples/db-templates/mysql-ephemeral-template.json"
     With parameters:
      DATABASE_SERVICE_NAME=mysql
      MYSQL_USER=userCPE # generated
      MYSQL_PASSWORD=dOws2sFqr08HMFCB # generated
      MYSQL_DATABASE=sampledb
--> Creating resources ...
    Service "mysql" created
    DeploymentConfig "mysql" created
--> Success
    Run 'oc status' to view your app.

2. Trigger a deployment manually before the automatic deployment triggers
[root@dma origin]# oc deploy mysql --latest 
Started deployment #1

3. Check the pod status
[root@dma origin]# oc get pod 
NAME            READY     STATUS        RESTARTS   AGE
mysql-1-8no34   0/1       Terminating   0          12m
mysql-2-xc2z0   1/1       Running       0          10m

Actual results:
3. The pod "mysql-1-8no34" stays in "Terminating" and is never deleted.

Expected results:
3. The pod "mysql-1-8no34" is deleted.

Additional info:
[root@dma origin]# oc describe pod/mysql-1-8no34 |grep "Termination Grace Period"
Termination Grace Period:    30s

Comment 1 Dan Mace 2015-11-18 14:28:07 UTC
Can this be consistently reproduced? I haven't had any success reproducing it.

When you observe the pod in a "Terminating" state, please capture the output of `oc get pod -o yaml` and also `docker inspect` the containers for the pod. If the pod is "Terminating" when it should be in a terminal state relative to container status, the issue is with the kubelet. You might also need to wait longer for the kubelet to reconcile the pods with the container states.
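The requested capture could look like the following (pod name taken from the report above; this is a sketch and assumes you run the `docker` commands on the node hosting the pod):

```shell
# Full pod object, including status, conditions, and deletionTimestamp
oc get pod mysql-1-8no34 -o yaml > mysql-1-8no34.yaml

# On the node: inspect every container belonging to the stuck pod.
# The kubelet embeds the pod name in container names, so a name filter finds them.
for c in $(docker ps -aq --filter "name=mysql-1-8no34"); do
  docker inspect "$c"
done > mysql-1-8no34-containers.json
```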

In any case, there is no bug with deployments: the deployment system shouldn't delete any of these pods until they reach a terminal state, which "Terminating" is not. The deployment system isn't responsible for transitioning the pod to a new phase based on its containers: that responsibility lies with the kubelet.

The deployment system itself is behaving as designed here.

Comment 2 Dan Mace 2015-11-18 14:30:45 UTC
A correction to my previous statements: the deployment system would never delete these pods in any case. The pods you listed are owned by the deployment RCs. If the RC owning "mysql-1-8no34" has a replica count of 0, the deployment process did its job, and the responsibility for taking down the pod lies with the replication manager in Kube.

Comment 3 DeShuai Ma 2015-11-19 04:40:10 UTC
Dan Mace: Last time I tested, it could be consistently reproduced on our OSE environment, but now I can't reproduce it either. In this bug, "mysql-1-8no34" had been waiting 12 minutes and was still "Terminating"; usually the pod is not deleted quickly. I don't know what's wrong here. When the bug does occur, it can be reproduced consistently.

Comment 4 Dan Mace 2015-11-19 14:21:19 UTC
Reassigning to Node since the root issue is a pod stuck in "Terminating". We still need the pod YAML and output of docker inspect for one of the stuck pods in order to diagnose.

Comment 5 Ryan Howe 2015-12-30 23:49:07 UTC
I am also able to reproduce this 100% of the time.

┌─[root@master1]─[~]
└──> oc get pods
NAME                         READY     STATUS        RESTARTS   AGE
hawkular-cassandra-1-008sq   1/1       Running       0          3h
hawkular-metrics-dbpwi       1/1       Running       0          43m
hawkular-metrics-gxi1s       0/1       Terminating   0          1h
hawkular-metrics-ki0pk       0/1       Terminating   0          3h
hawkular-metrics-r7ofl       0/1       Terminating   0          3h
heapster-hszpa               1/1       Running       0          12m
metrics-deployer-m619c       0/1       Completed     0          3h
┌─[root@master1]─[~]
└──> oc delete pod hawkular-metrics-r7ofl hawkular-metrics-gxi1s
pod "hawkular-metrics-r7ofl" deleted
pod "hawkular-metrics-gxi1s" deleted
┌─[root@master1]─[~]
└──> oc get pods
NAME                         READY     STATUS        RESTARTS   AGE
hawkular-cassandra-1-008sq   1/1       Running       0          3h
hawkular-metrics-dbpwi       1/1       Running       0          43m
hawkular-metrics-gxi1s       0/1       Terminating   0          1h
hawkular-metrics-ki0pk       0/1       Terminating   0          3h
hawkular-metrics-r7ofl       0/1       Terminating   0          3h
heapster-hszpa               1/1       Running       0          12m
metrics-deployer-m619c       0/1       Completed     0          3h 

First deleting the pods and then restarting the atomic-openshift-master-controllers service resolves the issue for me.
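That workaround might be sketched as follows (pod and service names taken from the transcript above; `--grace-period=0` forces immediate deletion and is a common last resort for pods stuck in Terminating, not something the reporter states they used):

```shell
# Force-delete the stuck pods, bypassing the 30s grace period
oc delete pod hawkular-metrics-r7ofl hawkular-metrics-gxi1s --grace-period=0

# Restart the master controllers so the replication manager resyncs its state
systemctl restart atomic-openshift-master-controllers
```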

Comment 6 Michail Kargakis 2016-02-03 14:01:18 UTC
This is an upstream issue related to kubelet.

Comment 7 Andy Goldstein 2016-02-08 21:29:18 UTC
Do you have any more details on what the upstream issue is?

Comment 8 Michail Kargakis 2016-02-09 08:45:22 UTC
Andy,

I don't know of a specific issue number. We don't set TerminationGracePeriodSeconds in the pod spec anywhere in the deployment code, so it defaults to DefaultTerminationGracePeriodSeconds, which is 30s.

https://github.com/kubernetes/kubernetes/blob/beb5d01f9c72730768d875361ee9e0c08367a52e/pkg/api/v1/types.go#L1299
https://github.com/openshift/origin/blob/a001f36e711a0797fed6bb8e0b722e73fa26d306/examples/deployment/README.md#graceful-termination


I would expect terminating pods to be handled by the kubelet.
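For reference, that default can be overridden in the dc's pod template; a sketch (dc name from this report, the 5s value purely illustrative):

```shell
# Lower the grace period from the 30s default to 5s on the mysql dc's pod template
oc patch dc mysql -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":5}}}}'
```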

Comment 9 Andy Goldstein 2016-02-15 16:40:31 UTC
Solly, I talked with Ryan and he said he was able to reproduce when:

1. Node has a pod with a PV (NFS)
2. NFS server runs out of disk space
3. Attempts to delete pods running on that node leave the pods in Terminating state

Not sure if this is a coincidence or the actual root cause. Can you take a look?

Comment 10 Solly Ross 2016-02-15 21:46:15 UTC
I've been unable to reproduce using the above steps, but after some discussion with Ryan, it looks like it may be a property of certain containers.  We'll look into it further.

Comment 14 Andy Goldstein 2016-03-16 16:03:16 UTC
Are you still able to reproduce this?

Comment 16 DeShuai Ma 2016-03-18 01:36:53 UTC
Now I can't reproduce this.

Comment 17 Andy Goldstein 2016-03-18 11:04:19 UTC
Closing. Feel free to reopen if you can provide steps to reproduce.

