1685074 – Rollouts continuously get cancelled when using oc replace

Summary: Rollouts continuously get cancelled when using oc replace

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Russell Teague
QA Contact:	Weihua Meng
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1745027 1745030

Reported:	2019-03-04 10:12 UTC by Robert Sandu
Modified:	2019-09-03 15:56 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When using `oc replace --force`, dependent objects were not properly removed or updated. Consequence: Deployment rollouts would not complete and would be cancelled. Fix: The options `--cascade` and `--grace-period` were added to the module that uses `oc replace`. Result: Deployments are properly rolled out when using `oc replace`.
Clone Of:
Clones:	1745027 1745030
Environment:
Last Closed:	2019-09-03 15:56:02 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openshift-ansible pull 11848	0	None	None	None	2019-08-22 19:30:32 UTC
Red Hat Product Errata	RHBA-2019:2580	0	None	None	None	2019-09-03 15:56:21 UTC

Description Robert Sandu 2019-03-04 10:12:32 UTC

Description of problem: Using "oc replace --force" in v3.9.{60,68} results in rollouts being continuously cancelled.

The "oc describe dc/rhel7-atomic" output:

Events:
  Type		Reason				Age			From				Message
  ----		------				----			----				-------
  Normal	DeploymentAwaitingCancellation	5s (x2 over 5s)		deploymentconfig-controller	Deployment of version 1 awaiting cancellation of older running deployments
  Normal	DeploymentCancelled		5s (x2 over 5s)		deploymentconfig-controller	Cancelled deployment "rhel7-atomic-1" superceded by version 1
  Normal	DeploymentCreated		4s (x21 over 5s)	deploymentconfig-controller	Created new replication controller "rhel7-atomic-1" for version 1

# oc get pods -o wide -w
[...]
rhel7-atomic-1-deploy   0/1       ContainerCreating   0         0s        <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Terminating   0         0s        <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Terminating   0         0s        <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Terminating   0         10s       <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Terminating   0         10s       <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Pending   0         0s        <none>    <none>
rhel7-atomic-1-deploy   0/1       Pending   0         0s        <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Terminating   0         0s        <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Terminating   0         0s        <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Terminating   0         10s       <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Terminating   0         10s       <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Pending   0         0s        <none>    <none>
rhel7-atomic-1-deploy   0/1       Pending   0         0s        <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       ContainerCreating   0         0s        <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Terminating   0         0s        <none>    node-1.local.lab
rhel7-atomic-1-deploy   0/1       Terminating   0         0s        <none>    node-1.local.lab
[...]

Version-Release number of selected component (if applicable):
atomic-openshift-clients-3.9.68-1.git.0.76fd86e.el7.x86_64

How reproducible: always


Steps to Reproduce:
1. Create a project called "test-force-replace"
2. Run the attached break_dc.sh script
3. See "oc get pods -o wide -w" output

Actual results: rollout pods are continuously terminated in the background.


Expected results: successful deployments.


Additional info: this seems to be similar to the issue described in [1].

---

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1632654

Comment 2 Robert Sandu 2019-03-04 10:21:40 UTC

The issue does not happen when using a newer oc client version, such as atomic-openshift-clients-3.11.82-1.git.0.08bc31b.el7.x86_64.

Comment 3 Maciej Szulik 2019-03-07 12:09:09 UTC

This is related to the GC changes that were introduced after 3.9. In other words, previously we needed to manually
remove all dependent objects, and it looks like we didn't do a great job of that in the case of replace and delete.
Newer versions have that fixed with proper deletion strategies.
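
To illustrate what "manually remove all dependent objects" involves, here is a minimal sketch that walks a list of already-fetched API objects and collects everything transitively owned by a given object. It assumes the dependents carry `metadata.ownerReferences` entries pointing at their owner's `uid`, which is how Kubernetes garbage collection links objects; the function name `find_dependents` and the sample data are hypothetical and are not taken from the oc client source.

```python
from typing import Dict, List


def find_dependents(owner_uid: str, objects: List[Dict]) -> List[Dict]:
    """Return objects transitively owned by owner_uid.

    Ownership is read from metadata.ownerReferences[].uid.
    Hypothetical helper for illustration only.
    """
    dependents: List[Dict] = []
    seen = {owner_uid}
    frontier = {owner_uid}
    while frontier:
        next_frontier = set()
        for obj in objects:
            meta = obj.get("metadata", {})
            uid = meta.get("uid")
            if uid is None or uid in seen:
                continue
            owners = {ref.get("uid") for ref in meta.get("ownerReferences", [])}
            owners.discard(None)
            # An object is a dependent if any of its owners was found in the previous pass.
            if owners & frontier:
                dependents.append(obj)
                seen.add(uid)
                next_frontier.add(uid)
        frontier = next_frontier
    return dependents


if __name__ == "__main__":
    # Made-up objects: a replication controller owned by a deployment config,
    # a pod owned by that replication controller, and an unrelated pod.
    sample = [
        {"metadata": {"name": "pod-a", "uid": "pod-1",
                      "ownerReferences": [{"uid": "rc-1"}]}},
        {"metadata": {"name": "rc-a", "uid": "rc-1",
                      "ownerReferences": [{"uid": "dc-1"}]}},
        {"metadata": {"name": "pod-b", "uid": "pod-2"}},
    ]
    for obj in find_dependents("dc-1", sample):
        print(obj["metadata"]["name"])
```

A client that deletes only the top-level object and skips this walk leaves the dependents behind, which is the kind of leftover state this comment describes.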

Comment 4 Maciej Szulik 2019-03-07 12:11:47 UTC

This was fixed in newer versions and based on my previous comment we're not going to fix it in 3.9.

Comment 5 Robert Sandu 2019-03-07 14:50:14 UTC

Hi Maciej.

Following up on our earlier conversation, I'm reopening this, as the issue seems to affect the ansible service broker role in openshift-ansible and 3.9 z-stream upgrades:

- https://github.com/openshift/openshift-ansible/blob/e88b6afadd622cf2e9f6f3a3ac5e85a22c2c425d/roles/ansible_service_broker/tasks/install.yml#L174-L180
- https://github.com/openshift/openshift-ansible/blob/5f79e1cb1a6c697e17749a169cd9fcccecd0ee09/roles/lib_openshift/library/oc_obj.py#L950-L962

Can we either reassess this as a backport fix for 3.9, or include the "--cascade=true" flag when using "oc replace" in the openshift-ansible service broker role?
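
As a rough sketch of the second option, the helper below assembles an `oc replace` argument list with optional `--force`, `--cascade`, and `--grace-period` flags. It is not the code from `oc_obj.py`; the function name `build_replace_cmd`, its parameters, and the file name `dc.yaml` are assumptions made for illustration. Only the flags themselves come from this report.

```python
from typing import List, Optional


def build_replace_cmd(
    filename: str,
    namespace: Optional[str] = None,
    force: bool = False,
    cascade: Optional[bool] = None,
    grace_period: Optional[int] = None,
) -> List[str]:
    """Build an `oc replace` argument list (hypothetical helper)."""
    cmd = ["oc", "replace", "-f", filename]
    if namespace:
        cmd += ["-n", namespace]
    if force:
        cmd.append("--force")
    # Emit --cascade only when the caller sets it, so the client default applies otherwise.
    if cascade is not None:
        cmd.append("--cascade=" + ("true" if cascade else "false"))
    if grace_period is not None:
        cmd.append("--grace-period=" + str(grace_period))
    return cmd


if __name__ == "__main__":
    # "dc.yaml" is a placeholder file name.
    args = build_replace_cmd(
        "dc.yaml", namespace="test-force-replace", force=True, cascade=True
    )
    print(" ".join(args))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.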

Comment 6 Maciej Szulik 2019-03-08 12:55:21 UTC

I'll check what's possible.

Comment 7 Robert Sandu 2019-04-01 10:03:58 UTC

Hi.

Any update regarding this bug?

Thank you.

Comment 8 Robert Sandu 2019-07-12 13:07:43 UTC

Hi Maciej.

Any progress regarding this issue?

Comment 11 Scott Dodson 2019-08-16 21:48:53 UTC

Maciej, Can you speak to the safety of `oc replace --force --cascade` in 3.10? Is it generally safe to use in all situations?

Comment 12 Maciej Szulik 2019-08-19 15:05:44 UTC

Yeah, I don't see any objections to using the newer version.

Comment 13 Russell Teague 2019-08-21 20:49:31 UTC

Opened a release-3.11 PR for discussion, https://github.com/openshift/openshift-ansible/pull/11848.

Comment 18 Weihua Meng 2019-08-29 13:42:19 UTC

Fixed.

openshift-ansible-3.11.141-1.git.0.a7e91cd.el7


before fix
rhel7-atomic-1-deploy      0/1       Terminating   0          3s
rhel7-atomic-3-x4m4l       1/1       Running       1          1h

 Normal	DeploymentCancelled		1h (x2555 over 1h)	deploymentconfig-controller	Cancelled deployment "rhel7-atomic-1" superceded by version 1


after fix
# oc get pods
NAME                   READY     STATUS    RESTARTS   AGE
rhel7-atomic-1-vfwkr   1/1       Running   0          7m

no DeploymentCancelled event reported
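
The before/after check above can also be scripted. The sketch below totals event occurrences per reason from the Events table of `oc describe dc/...` output, reading the `(xN over ...)` suffix as the repeat count and counting a line without that suffix as one occurrence. The regular expression is an assumption based on the event lines quoted in this report, not a documented output format, and the name `event_counts` is hypothetical.

```python
import re
from typing import Dict

# Matches lines such as:
#   Normal  DeploymentCancelled  1h (x2555 over 1h)  deploymentconfig-controller  ...
EVENT_RE = re.compile(
    r"^\s*(?P<type>Normal|Warning)\s+(?P<reason>\S+)\s+(?P<age>\S+)"
    r"(?:\s+\(x(?P<count>\d+) over (?P<span>[^)]+)\))?\s"
)


def event_counts(describe_output: str) -> Dict[str, int]:
    """Sum event occurrences per reason; header and non-event lines are skipped."""
    counts: Dict[str, int] = {}
    for line in describe_output.splitlines():
        match = EVENT_RE.match(line)
        if not match:
            continue
        occurrences = int(match.group("count") or 1)
        reason = match.group("reason")
        counts[reason] = counts.get(reason, 0) + occurrences
    return counts


if __name__ == "__main__":
    import sys

    # Usage: oc describe dc/rhel7-atomic | python this_script.py
    cancelled = event_counts(sys.stdin.read()).get("DeploymentCancelled", 0)
    print("DeploymentCancelled occurrences:", cancelled)
```

A nonzero and growing `DeploymentCancelled` total would correspond to the "before fix" state; zero corresponds to the "after fix" state.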

Comment 20 errata-xmlrpc 2019-09-03 15:56:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2580
