Bug 1685074
| Summary: | Rollouts continuously get cancelled when using oc replace | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Robert Sandu <rsandu> | |
| Component: | Installer | Assignee: | Russell Teague <rteague> | |
| Installer sub component: | openshift-ansible | QA Contact: | Weihua Meng <wmeng> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | medium | |||
| Priority: | high | CC: | aos-bugs, jokerman, maszulik, mfojtik, mmccomas, rsandu, rteague, scuppett, vlaad, wmeng | |
| Version: | 3.9.0 | Keywords: | Reopened | |
| Target Milestone: | --- | |||
| Target Release: | 3.11.z | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Cause: When using `oc replace --force`, dependent objects were not being properly removed/updated. Consequence: Deployment rollouts would not complete and would be canceled. Fix: The options `--cascade` and `--grace-period` are added to the module using `oc replace`. Result: Deployments are properly rolled out when using `oc replace`. | Story Points: | --- | |
| Clone Of: | ||||
| : | 1745027 1745030 | Environment: | ||
| Last Closed: | 2019-09-03 15:56:02 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1745027, 1745030 | |||
The issue does not happen when using a higher oc client version, such as atomic-openshift-clients-3.11.82-1.git.0.08bc31b.el7.x86_64.

This is related to the GC changes that were introduced after 3.9, i.e. previously we needed to manually remove all dependent objects, and it looks like we didn't do a great job in the case of replace and delete. Newer versions have that fixed with proper deletion strategies.

This was fixed in newer versions, and based on my previous comment we're not going to fix it in 3.9.

Hi Maciej. Following up on our earlier conversation, I'm reopening this as it seems the issue affects the ansible service broker role in openshift-ansible and 3.9 z-stream upgrades:
- https://github.com/openshift/openshift-ansible/blob/e88b6afadd622cf2e9f6f3a3ac5e85a22c2c425d/roles/ansible_service_broker/tasks/install.yml#L174-L180
- https://github.com/openshift/openshift-ansible/blob/5f79e1cb1a6c697e17749a169cd9fcccecd0ee09/roles/lib_openshift/library/oc_obj.py#L950-L962
Can we either reassess this as a backport fix for 3.9 or include the "--cascade=true" flag when using "oc replace" in the openshift-ansible service broker role?

I'll check what's possible.

Hi. Any update regarding this bug? Thank you.

Hi Maciej. Any progress regarding this issue?

Maciej, can you speak to the safety of `oc replace --force --cascade` in 3.10? Is it generally safe to use in all situations?

Yeah, I don't see any objections to using a newer version.

Opened a release-3.11 PR for discussion, https://github.com/openshift/openshift-ansible/pull/11848.

Fixed.
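For illustration only, a minimal Python sketch of how a wrapper around `oc replace` might append the `--cascade` and `--grace-period` options discussed above. The helper name `build_replace_cmd`, its parameters, and the example values are assumptions made for this sketch; it is not the code from `oc_obj.py` or from the PR.

```python
from typing import List, Optional


def build_replace_cmd(
    path: str,
    force: bool = False,
    cascade: Optional[bool] = None,
    grace_period: Optional[int] = None,
) -> List[str]:
    """Assemble an `oc replace` argument list.

    Hypothetical helper for illustration; the real module may build
    its command differently.
    """
    cmd = ["oc", "replace", "-f", path]
    if force:
        # --force deletes and re-creates the object instead of updating it.
        cmd.append("--force")
    if cascade is not None:
        # --cascade controls whether dependent objects are removed as well.
        cmd.append("--cascade=" + ("true" if cascade else "false"))
    if grace_period is not None:
        # --grace-period is the number of seconds allowed for termination.
        cmd.append("--grace-period={}".format(grace_period))
    return cmd
```

The returned list could be handed to `subprocess.run`; which values the role should actually pass is a decision for the PR, not something this sketch settles.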
openshift-ansible-3.11.141-1.git.0.a7e91cd.el7

before fix
rhel7-atomic-1-deploy 0/1 Terminating 0 3s
rhel7-atomic-3-x4m4l 1/1 Running 1 1h
Normal DeploymentCancelled 1h (x2555 over 1h) deploymentconfig-controller Cancelled deployment "rhel7-atomic-1" superceded by version 1

after fix
# oc get pods
NAME READY STATUS RESTARTS AGE
rhel7-atomic-1-vfwkr 1/1 Running 0 7m
no DeploymentCancelled event reported

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:2580
Description of problem:
Using "oc replace --force" in v3.9.{60,68} ends up with rollouts being continuously cancelled.

The "oc describe dc/rhel7-atomic" output:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DeploymentAwaitingCancellation 5s (x2 over 5s) deploymentconfig-controller Deployment of version 1 awaiting cancellation of older running deployments
Normal DeploymentCancelled 5s (x2 over 5s) deploymentconfig-controller Cancelled deployment "rhel7-atomic-1" superceded by version 1
Normal DeploymentCreated 4s (x21 over 5s) deploymentconfig-controller Created new replication controller "rhel7-atomic-1" for version 1

# oc get pods -o wide -w
[...]
rhel7-atomic-1-deploy 0/1 ContainerCreating 0 0s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Terminating 0 0s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Terminating 0 0s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Terminating 0 10s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Terminating 0 10s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Pending 0 0s <none> <none>
rhel7-atomic-1-deploy 0/1 Pending 0 0s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Terminating 0 0s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Terminating 0 0s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Terminating 0 10s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Terminating 0 10s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Pending 0 0s <none> <none>
rhel7-atomic-1-deploy 0/1 Pending 0 0s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 ContainerCreating 0 0s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Terminating 0 0s <none> node-1.local.lab
rhel7-atomic-1-deploy 0/1 Terminating 0 0s <none> node-1.local.lab
[...]

Version-Release number of selected component (if applicable):
atomic-openshift-clients-3.9.68-1.git.0.76fd86e.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a project called "test-force-replace"
2. Run the attached break_dc.sh script
3. See "oc get pods -o wide -w" output

Actual results:
Rollout pods are continuously terminated in the background.

Expected results:
Successful deployments.

Additional info:
Seems to be a similar issue to the one described in [1].

---
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1632654
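A possible way to spot the cancellation loop from saved `oc describe dc/...` output, sketched in Python. The function name `cancelled_count` and the idea of reading the `(xN over ...)` repeat counter are assumptions for illustration; this is not part of the attached reproducer.

```python
import re
from typing import Optional

# Matches the repeat counter that `oc describe` prints, e.g. "(x2555 over 1h)".
_REPEAT = re.compile(r"\(x(\d+) over [^)]+\)")


def cancelled_count(event_line: str) -> Optional[int]:
    """Return how often a DeploymentCancelled event was seen.

    Returns None for lines that are not DeploymentCancelled events.
    A line without a repeat counter is treated as a single occurrence.
    """
    if "DeploymentCancelled" not in event_line:
        return None
    match = _REPEAT.search(event_line)
    return int(match.group(1)) if match else 1
```

A count that keeps growing between two runs of `oc describe` would be consistent with the behaviour reported here.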