Bug 1539143
| Summary: | Unable to rollout or rollback if node becomes NotReady during deployment. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ryan Howe <rhowe> |
| Component: | Master | Assignee: | Tomáš Nožička <tnozicka> |
| Status: | CLOSED NOTABUG | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.7.0 | CC: | aos-bugs, decarr, erich, jhonce, jokerman, keprice, mfojtik, mmccomas, rhowe, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-10-19 12:10:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ryan Howe
2018-01-26 18:50:07 UTC
If you create a new rollout, a new replication controller is created for the latest version. While that replication controller is being scaled up, if the new pod lands on a node that becomes unschedulable, the pod will not become ready, so the rollout will fail and will be rolled back. I think you can use MaxUnavailable and MaxSurge to prevent the rollout from getting stuck: https://github.com/openshift/origin/blob/master/pkg/apps/apis/apps/types.go#L275 Adding --force to oc delete might help.

The issue seems to be that the deployer Pod is in the Unknown state. If the node goes down, the Pod should transition to the Failed phase (likely after some time). I don't think the DeploymentConfig controller has a chance of reconciling here; this has to be fixed at the Pod level so that the Pod reaches the Failed state, which I think it should after a while. Using --force will create a mess in storage. It does _NOT_ force the delete, it just ignores the errors.

The main issue here is that if a node becomes NotReady while a deployment is currently running and scheduled to that node, there is no way to do any rollout or deploy for that deployment without bringing the NotReady node back into the cluster. The actual deployer pod is stuck in the Unknown state, and we are prevented from taking any deployment action until that pod is deleted or enters a different state.

What is the status of this bug fix? Do we have a reasonable workaround to provide in the meantime?
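
The suggestion above about MaxUnavailable and MaxSurge amounts to relaxing the DeploymentConfig's rolling strategy so a rollout can make progress even when one replica cannot become ready. The sketch below is a minimal illustration, not part of the original report: it assumes a recent client-go, that the DeploymentConfig is served through the `apps.openshift.io/v1` API group, and it uses placeholder names `myproject`, `myapp`, and 50% values for the two parameters.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (the same credentials `oc` uses).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// DeploymentConfig resource in the apps.openshift.io API group (assumption:
	// the cluster serves DCs via this group rather than the legacy /oapi path).
	dcGVR := schema.GroupVersionResource{
		Group:    "apps.openshift.io",
		Version:  "v1",
		Resource: "deploymentconfigs",
	}

	// Merge patch that loosens the rolling strategy; 50%/50% are example values.
	patch := []byte(`{"spec":{"strategy":{"rollingParams":{"maxUnavailable":"50%","maxSurge":"50%"}}}}`)

	// "myproject" and "myapp" are placeholders for the affected namespace and DC.
	_, err = client.Resource(dcGVR).Namespace("myproject").Patch(
		context.TODO(), "myapp", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("patched rolling strategy")
}
```

The same merge patch can be applied from the command line with `oc patch dc/<name> -p '<json>'`; the Go form is shown only because it makes the exact fields being changed explicit.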