Description of problem:
Using "oc replace --force" in v3.9.{60,68} ends up with rollouts being continuously cancelled.

The "oc describe dc/rhel7-atomic" output:

Events:
  Type    Reason                          Age                From                         Message
  ----    ------                          ----               ----                         -------
  Normal  DeploymentAwaitingCancellation  5s (x2 over 5s)    deploymentconfig-controller  Deployment of version 1 awaiting cancellation of older running deployments
  Normal  DeploymentCancelled             5s (x2 over 5s)    deploymentconfig-controller  Cancelled deployment "rhel7-atomic-1" superceded by version 1
  Normal  DeploymentCreated               4s (x21 over 5s)   deploymentconfig-controller  Created new replication controller "rhel7-atomic-1" for version 1

# oc get pods -o wide -w
[...]
rhel7-atomic-1-deploy   0/1   ContainerCreating   0   0s    <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Terminating         0   0s    <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Terminating         0   0s    <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Terminating         0   10s   <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Terminating         0   10s   <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Pending             0   0s    <none>   <none>
rhel7-atomic-1-deploy   0/1   Pending             0   0s    <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Terminating         0   0s    <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Terminating         0   0s    <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Terminating         0   10s   <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Terminating         0   10s   <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Pending             0   0s    <none>   <none>
rhel7-atomic-1-deploy   0/1   Pending             0   0s    <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   ContainerCreating   0   0s    <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Terminating         0   0s    <none>   node-1.local.lab
rhel7-atomic-1-deploy   0/1   Terminating         0   0s    <none>   node-1.local.lab
[...]

Version-Release number of selected component (if applicable):
atomic-openshift-clients-3.9.68-1.git.0.76fd86e.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a project called "test-force-replace"
2. Run the attached break_dc.sh script
3. See "oc get pods -o wide -w" output

Actual results:
Rollout pods are continuously terminated in the background.

Expected results:
Successful deployments.

Additional info:
This seems to be a similar issue to the one described in [1].

---
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1632654
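To put a number on the churn visible in the watch output above, something like the following could tally how often each pod is seen in each status. This is only a rough sketch: the function name count_statuses is made up for illustration, and it assumes the whitespace-separated NAME READY STATUS ... column order shown in the report.

```python
from collections import Counter


def count_statuses(watch_output: str) -> Counter:
    """Count how often each (pod, status) pair appears in `oc get pods -w` output.

    Hypothetical helper; assumes the NAME READY STATUS ... column order
    shown in the report above.
    """
    counts = Counter()
    for line in watch_output.splitlines():
        fields = line.split()
        # Skip blanks, "[...]" elision markers, and the header row. A data
        # row has at least three fields and a READY column such as "0/1".
        if len(fields) < 3 or fields[0] == "NAME" or "/" not in fields[1]:
            continue
        counts[(fields[0], fields[2])] += 1
    return counts
```

A deployer pod that shows up as Terminating many times within a few seconds, as rhel7-atomic-1-deploy does here, would be a strong hint that the rollout is being cancelled in a loop.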
The issue does not happen when using a newer oc client version, such as atomic-openshift-clients-3.11.82-1.git.0.08bc31b.el7.x86_64
This is related to the GC changes that were introduced after 3.9. In other words, previously we needed to manually remove all dependent objects, and it looks like we didn't do a great job of that in the replace and delete cases. Newer versions have that fixed with proper deletion strategies.
This was fixed in newer versions and based on my previous comment we're not going to fix it in 3.9.
Hi Maciej. Following up on our earlier conversation, I'm reopening this as it seems the issue affects the ansible service broker role in openshift-ansible and 3.9 z-stream upgrades:

- https://github.com/openshift/openshift-ansible/blob/e88b6afadd622cf2e9f6f3a3ac5e85a22c2c425d/roles/ansible_service_broker/tasks/install.yml#L174-L180
- https://github.com/openshift/openshift-ansible/blob/5f79e1cb1a6c697e17749a169cd9fcccecd0ee09/roles/lib_openshift/library/oc_obj.py#L950-L962

Can we either reassess this as a backport fix for 3.9, or include the "--cascade=true" flag when using "oc replace" in the openshift-ansible service broker role?
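For illustration, the second option could look roughly like the sketch below. It is not the actual oc_obj.py code: the function build_replace_cmd and its parameters are hypothetical, and it only assembles the argument list for "oc replace" with the --force and --cascade=true flags mentioned above.

```python
def build_replace_cmd(manifest_path, force=False, cascade=False, namespace=None):
    """Assemble an `oc replace` argument list (hypothetical helper).

    --cascade is only appended together with --force, on the assumption
    that it only matters for the delete-and-recreate path.
    """
    cmd = ["oc", "replace", "-f", manifest_path]
    if namespace:
        cmd += ["-n", namespace]
    if force:
        cmd.append("--force")
        if cascade:
            # Ask the client to also delete dependent objects (for example
            # replication controllers) before the object is recreated.
            cmd.append("--cascade=true")
    return cmd
```

The resulting list could then be passed to subprocess.run(cmd, check=True) on a host where the oc client is installed and logged in.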
I'll check what's possible.
Hi. Any update regarding this bug? Thank you.
Hi Maciej. Any progress regarding this issue?
Maciej, can you speak to the safety of `oc replace --force --cascade` in 3.10? Is it generally safe to use in all situations?
Yeah, I don't see any objections to using a newer version.
Opened a release-3.11 PR for discussion, https://github.com/openshift/openshift-ansible/pull/11848.
Fixed.
openshift-ansible-3.11.141-1.git.0.a7e91cd.el7

Before fix:
rhel7-atomic-1-deploy   0/1   Terminating   0   3s
rhel7-atomic-3-x4m4l    1/1   Running       1   1h

Normal  DeploymentCancelled  1h (x2555 over 1h)  deploymentconfig-controller  Cancelled deployment "rhel7-atomic-1" superceded by version 1

After fix:
# oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
rhel7-atomic-1-vfwkr   1/1     Running   0          7m

No DeploymentCancelled event is reported.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2580