Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1539143

Summary:	Unable to rollout or rollback if node becomes NotReady during deployment.
Product:	OpenShift Container Platform	Reporter:	Ryan Howe <rhowe>
Component:	Master	Assignee:	Tomáš Nožička <tnozicka>
Status:	CLOSED NOTABUG	QA Contact:	zhou ying <yinzhou>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	3.7.0	CC:	aos-bugs, decarr, erich, jhonce, jokerman, keprice, mfojtik, mmccomas, rhowe, xxia
Target Milestone:	---
Target Release:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-10-19 12:10:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Ryan Howe 2018-01-26 18:50:07 UTC

Description of problem:

When a node becomes NotReady while a deployment is running on that node, we are unable to redeploy or roll back until the node becomes Ready again. Deployment pods cannot be removed.

Version-Release number of selected component (if applicable):
Reproduced on 3.6 and 3.7 


How reproducible:
100%

Steps to Reproduce:

```
> oc get pods
NAME                     READY     STATUS      RESTARTS   AGE
httpd-example-6-jnlnm    1/1       Running     0          1m

> oc rollout latest dc/httpd-example

> oc get pods -o wide

httpd-example-6-jnlnm    1/1       Running     0          1m        10.128.2.30   node-0.openshift.com
httpd-example-7-4shgv    0/1       Pending     0          14s       <none>        node-0.openshift.com
httpd-example-7-deploy   1/1       Running     0          21s       10.128.2.31   node-0.openshift.com


*** NODE-0 gets killed ****
> oc rollout cancel dc/httpd-example 
deploymentconfig "httpd-example" cancelling

> oc get pods
NAME                     READY     STATUS        RESTARTS   AGE
httpd-example-6-jnlnm    1/1       Running       0          6m
httpd-example-7-4shgv    0/1       Pending       0          4m
httpd-example-7-deploy   1/1       Terminating   0          5m

> oc rollout latest dc/httpd-example 
error: #7 is already in progress (Running).

> oc rollout cancel dc/httpd-example 
deploymentconfig "httpd-example" already cancelled
No rollout is in progress (latest rollout #7 running (cancelling) 10 minutes ago)

> oc rollout latest dc/httpd-example 
error: #7 is already in progress (Running).

> oc get pods
NAME                     READY     STATUS    RESTARTS   AGE
httpd-example-7-4shgv    0/1       Unknown   0          13m
httpd-example-7-8qr6s    0/1       Running   0          8m    
httpd-example-7-deploy   1/1       Unknown   0          13m


The RC does schedule a pod on a different node. We are still und

> oc delete pod httpd-example-7-deploy --grace-period=0 
*** HANGS **** 
ctrl-c

Can't recover and redeploy until node comes back. 

> oc rollback dc/httpd-example
#8 rolled back to httpd-example-6
Warning: the following images triggers were disabled: httpd-example:latest
  You can re-enable them with: oc set triggers dc/httpd-example --auto

> oc get pods
NAME                     READY     STATUS    RESTARTS   AGE
httpd-example-7-4shgv    0/1       Unknown   0          20m
httpd-example-7-8qr6s    0/1       Running   0          15m    
httpd-example-7-deploy   1/1       Unknown   0          20m

> oc rollout status dc/httpd-example
Waiting for latest deployment config spec to be observed by the controller loop...

**** Start node-0 
Everything goes through: the deployment gets rolled back and we are able to roll out latest again

> oc get pods
NAME                    READY     STATUS        RESTARTS   AGE
httpd-example-8-sjbcq   1/1       Running       0          1m

> oc rollout latest dc/httpd-example
deploymentconfig "httpd-example" rolled out

> oc get pods
NAME                      READY     STATUS         RESTARTS   AGE
httpd-example-9-deploy   1/1       Running        0          27s
httpd-example-9-m6z5d    0/1       Running        0          21s
```


Actual results:
Unable to deploy any pods

Expected results:
Be able to deploy pods

Comment 1 Michal Fojtik 2018-01-29 10:56:45 UTC

If you create a new rollout, a new replication controller is created for the latest version. If that replication controller is being scaled up and the new pod is created on a node that becomes unschedulable, the pod will not become ready, so the rollout will fail and be rolled back.

I think you can use MaxUnavailable and MaxSurge to prevent the rollout from being stuck:

https://github.com/openshift/origin/blob/master/pkg/apps/apis/apps/types.go#L275
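
As a rough sketch of what setting those two fields could look like, assuming the DeploymentConfig uses the Rolling strategy (where they live under `spec.strategy.rollingParams`): the 25% values below are placeholders chosen only for illustration, and nothing in this bug confirms that tuning them unblocks a deployer pod stuck on a NotReady node.

```shell
# Name of the DeploymentConfig from the reproduction steps.
DC=httpd-example

# JSON patch for the rolling parameters. The 25% values are illustrative
# assumptions, not values recommended anywhere in this bug.
PATCH='{"spec":{"strategy":{"rollingParams":{"maxUnavailable":"25%","maxSurge":"25%"}}}}'

# Print the patch so it can be reviewed before anything is changed.
echo "$PATCH"

# Apply it only when the oc client is installed; report a failure instead of aborting.
if command -v oc >/dev/null 2>&1; then
  oc patch "dc/$DC" -p "$PATCH" || echo "oc patch failed (not logged in, or dc/$DC not found)" >&2
fi
```

The resulting values can be checked afterwards with `oc get dc/httpd-example -o yaml`.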

Comment 2 Tomáš Nožička 2018-01-30 14:51:36 UTC

Adding --force to oc delete might help.

The issue seems to be that the deployer Pod is in Unknown state. If the node goes down it should transition to Failed phase. (Likely after some time.)

I don't think DeploymentConfig controller has a chance at reconciling here; this has to be fixed on Pod level so that the Pod reaches Failed state, which I think it should after a while.

Comment 3 Jhon Honce 2018-01-31 15:33:36 UTC

Using force will create a mess in storage.  It does _NOT_ force the delete, it just ignores the errors.

Comment 4 Ryan Howe 2018-02-05 16:58:32 UTC

The main issue here is that if a node becomes NotReady while a deployment is running and scheduled to that node, there is no way to do any rollouts or deploys for that deployment without bringing the NotReady node back into the cluster.

Comment 5 Ryan Howe 2018-02-05 17:03:00 UTC

The actual deployer pod is stuck in the unknown state, and we are prevented from taking any deployment action until that pod is deleted or enters a different state.
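
To spot which pods are in that state, a small filter over the `oc get pods` table may help. This is only a sketch: the `unknown_pods` function name is made up for this example, and it assumes the default column layout shown in the description (NAME, READY, STATUS, ...). The sample input is copied from the reproduction output above.

```shell
# Print the names of pods whose STATUS column reads Unknown.
# Expects the default `oc get pods` table on stdin; the header row is skipped.
unknown_pods() {
  awk 'NR > 1 && $3 == "Unknown" { print $1 }'
}

# Sample input taken from the reproduction steps in this bug.
sample='NAME                     READY     STATUS    RESTARTS   AGE
httpd-example-7-4shgv    0/1       Unknown   0          13m
httpd-example-7-8qr6s    0/1       Running   0          8m
httpd-example-7-deploy   1/1       Unknown   0          13m'

printf '%s\n' "$sample" | unknown_pods
```

On a live cluster the same filter would be fed with `oc get pods | unknown_pods`.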

Comment 7 Kevin Price 2018-03-26 09:40:39 UTC

What is the status of this bug fix? Do we have a reasonable workaround to provide in the meantime?