Bug 1289819

Summary: [platformmanagement_public_547]When the previous deployment fails due to cancellation, subsequent deployments always time out and the replica count is wrong
Product: OKD Reporter: zhou ying <yinzhou>
Component: Deployments Assignee: Dan Mace <dmace>
Status: CLOSED CURRENTRELEASE QA Contact: zhou ying <yinzhou>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.x CC: aos-bugs
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-12 17:15:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description zhou ying 2015-12-09 03:19:22 UTC
Description of problem:
Start a deployment. When the pre-hook pod has completed and some new pods are running, but the deployment as a whole has not yet completed, cancel the running deployment; it then fails with status DeadlineExceeded. After that, every subsequent deployment times out and ends up with the wrong number of replicas.

Version-Release number of selected component (if applicable):
openshift v1.1-370-g3818f29
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
always

Steps to Reproduce:
1. Create a dc using the following JSON:
{
    "kind": "DeploymentConfig",
    "apiVersion": "v1",
    "metadata": {
        "name": "hooks",
        "creationTimestamp": null,
        "labels": {
            "name": "mysql"
        }
    },
    "spec": {
        "strategy": {
            "type": "Rolling",
            "rollingParams": {
                "pre": {
                    "failurePolicy": "Retry",
                    "execNewPod": {
                        "command": [
                            "/bin/bash",
                            "-c",
                            "/usr/bin/sleep 20"
                        ],
                        "env": [
                            {
                                "name": "VAR",
                                "value": "pre-deployment"
                            }
                        ],
                        "containerName": "mysql-55-centos7"
                    }
                },
                "post": {
                    "failurePolicy": "Ignore",
                    "execNewPod": {
                        "command": [
                            "/bin/false"
                        ],
                        "env": [
                            {
                                "name": "VAR",
                                "value": "post-deployment"
                            }
                        ],
                        "containerName": "mysql-55-centos7"
                    }
                }
            },
            "resources": {}
        },
        "triggers": [
            {
                "type": "ConfigChange"
            }
        ],
        "replicas": 1,
        "selector": {
            "name": "mysql"
        },
        "template": {
            "metadata": {
                "creationTimestamp": null,
                "labels": {
                    "name": "mysql"
                }
            },
            "spec": {
                "containers": [
                    {
                        "name": "mysql-55-centos7",
                        "image": "openshift/mysql-55-centos7:latest",
                        "ports": [
                            {
                                "containerPort": 3306,
                                "protocol": "TCP"
                            }
                        ],
                        "env": [
                            {
                                "name": "MYSQL_USER",
                                "value": "user8Y2"
                            },
                            {
                                "name": "MYSQL_PASSWORD",
                                "value": "Plqe5Wev"
                            },
                            {
                                "name": "MYSQL_DATABASE",
                                "value": "root"
                            }
                        ],
                        "resources": {},
                        "terminationMessagePath": "/dev/termination-log",
                        "imagePullPolicy": "Always",
                        "securityContext": {
                            "capabilities": {},
                            "privileged": false
                        }
                    }
                ],
                "restartPolicy": "Always",
                "dnsPolicy": "ClusterFirst"
            }
        }
    },
    "status": {}
}

2. Scale the dc up to 10 replicas.
3. Start a new deployment:
 `oc deploy hooks --latest`
4. Once the pre hook has completed and some new pods have been created, cancel the running deployment, then check the status.
5. Start a new deployment again, wait for it to complete, and check the status.
6. Start a new deployment once more, wait for it to complete, and check the status (a sketch of the corresponding oc commands follows below).
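
A rough sketch of those commands, assuming the JSON above is saved as hooks.json (the file name and the exact oc scale syntax on this oc version are assumptions):

  oc create -f hooks.json           # step 1: create the dc from the JSON above
  oc scale dc hooks --replicas=10   # step 2: scale the dc to 10 replicas
  oc deploy hooks --latest          # step 3: start a new deployment
  oc deploy hooks --cancel          # step 4: cancel once the pre hook has completed
  oc get pods                       # step 4: check pod status
  oc get rc                         # step 4: check replica counts
  oc deploy hooks --latest          # steps 5/6: deploy again and wait for completion
  oc logs -f dc/hooks               # steps 5/6: watch the deployer logs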

Actual results:
4. Both deployments end up with non-zero replica counts and running pods:
[root@ip-172-18-15-24 amd64]# oc get pods
NAME              READY     STATUS             RESTARTS   AGE
hooks-1-1j6iu     1/1       Running            0          3m
hooks-1-x25hu     1/1       Running            0          3m
hooks-2-30il5     1/1       Running            0          31s
hooks-2-4qkch     1/1       Running            0          40s
hooks-2-8crv8     1/1       Running            0          40s
hooks-2-ako90     1/1       Running            0          40s
hooks-2-deploy    0/1       DeadlineExceeded   0          1m
hooks-2-ev471     1/1       Running            0          31s
hooks-2-prehook   0/1       Completed          0          1m
hooks-2-zn4hu     1/1       Running            0          31s
[root@ip-172-18-15-24 amd64]# oc get rc
CONTROLLER   CONTAINER(S)       IMAGE(S)                            SELECTOR                                               REPLICAS   AGE
hooks-1      mysql-55-centos7   openshift/mysql-55-centos7:latest   deployment=hooks-1,deploymentconfig=hooks,name=mysql   2          5m
hooks-2      mysql-55-centos7   openshift/mysql-55-centos7:latest   deployment=hooks-2,deploymentconfig=hooks,name=mysql   6          1m


5/6. Subsequent deployments always time out; the logs look like this:
[root@ip-172-18-15-24 amd64]# oc get pods
NAME              READY     STATUS             RESTARTS   AGE
hooks-1-1j6iu     1/1       Running            0          31m
hooks-1-x25hu     1/1       Running            0          31m
hooks-2-deploy    0/1       DeadlineExceeded   0          29m
hooks-2-prehook   0/1       Completed          0          29m
hooks-3-deploy    0/1       Error              0          27m
hooks-3-prehook   0/1       Completed          0          27m
hooks-4-8k9w8     1/1       Running            0          10m
hooks-4-deploy    0/1       Error              0          10m
hooks-4-prehook   0/1       Completed          0          10m
hooks-4-vhuvy     1/1       Running            0          10m

[root@ip-172-18-15-24 amd64]# oc logs -f dc/hooks
I1209 02:32:25.539436       1 deployer.go:198] Deploying from zhouyt/hooks-1 to zhouyt/hooks-4 (replicas: 2)
I1209 02:32:25.557740       1 lifecycle.go:109] Created lifecycle pod zhouyt/hooks-4-prehook for deployment zhouyt/hooks-4
I1209 02:32:25.557797       1 lifecycle.go:122] Watching logs for hook pod zhouyt/hooks-4-prehook while awaiting completion
I1209 02:32:46.897480       1 lifecycle.go:162] Finished reading logs for hook pod zhouyt/hooks-4-prehook
I1209 02:32:50.685463       1 rolling.go:148] Pre hook finished
I1209 02:32:51.708990       1 rolling.go:232] RollingUpdater: Continuing update with existing controller hooks-4.
I1209 02:32:51.709026       1 rolling.go:232] RollingUpdater: Scaling up hooks-4 from 0 to 2, scaling down hooks-1 from 2 to 0 (keep 9 pods available, don't exceed 11 pods)
I1209 02:32:51.709034       1 rolling.go:232] RollingUpdater: Scaling hooks-4 up to 2
F1209 02:42:56.825494       1 deployer.go:65] timed out waiting for any update progress to be made

Expected results:
Subsequent deployments should succeed, and the replica count should be 10.

Additional info:

Comment 1 Dan Mace 2015-12-09 20:43:45 UTC
Zhou,

Good find. Here's an easier way to reproduce the root issue:

1. Create the deployment config with no hooks and replicas=5
2. Deploy once successfully
3. Deploy again and cancel midway through scaling, so that the new RC is partially scaled up and the old RC is partially scaled down.

The latest (failed) RC will eventually be scaled back down to zero (correct), but the last successfully completed RC (version 1) will remain at whatever scale it had when the deployer was cancelled. The DC is then updated to match the replica count of RC version 1, and because the DC now carries an incorrect replica count, new RCs inherit that wrong value.
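
(For reference, a sketch of how the stale counts described above can be observed; reading the dc's replica field via -o yaml and grep is just one way to do it:)

  oc get rc                                  # old RC stuck at its partial scale, failed RC eventually 0
  oc get dc hooks -o yaml | grep replicas    # dc now carries the same partial count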

Comment 2 Dan Mace 2015-12-10 16:35:17 UTC
https://github.com/openshift/origin/pull/6260

Comment 3 Dan Mace 2015-12-11 13:56:54 UTC
Please keep in mind that after a failed deployment, it can take a while (up to a minute) for the cluster to reconcile, scaling the failed deployment back to 0 and the old active deployment back up to the correct scale (assuming you haven't already started another deployment).
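
(A small sketch of how to watch that reconciliation, assuming the -w/--watch flag is available in this oc version:)

  oc get rc -w    # watch the failed RC drop to 0 and the previous RC scale back up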

Comment 4 zhou ying 2015-12-14 03:03:41 UTC
I've confirmed on the latest AMI; the issue has been fixed.
[root@ip-172-18-4-35 amd64]# openshift version
openshift v1.1-428-ged29520
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2