Bug 1792004 - Cannot abort an upgrade from 4.2 to 4.3 and rollback to 4.2 - probe changes are not correctly applied (console-operator cannot be reverted)
Summary: Cannot abort an upgrade from 4.2 to 4.3 and rollback to 4.2 - probe changes are not correctly applied (console-operator cannot be reverted)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.3.z
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On: 1791863
Blocks: 1792005
 
Reported: 2020-01-16 20:45 UTC by W. Trevor King
Modified: 2020-02-25 06:18 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1791863
Clones: 1792005
Environment:
Last Closed: 2020-02-25 06:17:59 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 299 0 None closed Bug 1792004: lib/resourcemerge/core: Clear livenessProbe and readinessProbe if nil in required 2020-06-15 07:45:25 UTC
Red Hat Product Errata RHBA-2020:0528 0 None None None 2020-02-25 06:18:15 UTC

Description W. Trevor King 2020-01-16 20:45:29 UTC
+++ This bug was initially created as a clone of Bug #1791863 +++

--- Additional comment from Samuel Padgett on 2020-01-16 19:53:02 UTC ---

I believe this is happening because the CVO is not removing the console-operator readiness probe added in 4.3 when downgrading to 4.2.

The 4.2 operator deployment manifest does not have a readiness probe:
https://github.com/openshift/console-operator/blob/release-4.2/manifests/07-operator.yaml

The 4.3 operator deployment does:
https://github.com/openshift/console-operator/blob/release-4.3/manifests/07-operator.yaml#L66-L75
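
For illustration, here is a minimal Go sketch of the kind of merge behavior the linked cluster-version-operator PR 299 describes ("Clear livenessProbe and readinessProbe if nil in required"). The helper name ensureProbesCleared and its signature are hypothetical, not the actual lib/resourcemerge/core code; the point is that when the required (4.2) manifest omits a probe, the CVO should clear it on the in-cluster container instead of leaving the stale 4.3 probe behind:

package resourcemerge

import corev1 "k8s.io/api/core/v1"

// ensureProbesCleared is a hypothetical helper illustrating the fix: if the
// required manifest does not define a probe, drop any probe that is still set
// on the in-cluster container and record that a modification was made.
func ensureProbesCleared(modified *bool, existing *corev1.Container, required corev1.Container) {
    if required.LivenessProbe == nil && existing.LivenessProbe != nil {
        existing.LivenessProbe = nil
        *modified = true
    }
    if required.ReadinessProbe == nil && existing.ReadinessProbe != nil {
        existing.ReadinessProbe = nil
        *modified = true
    }
}

Without logic along these lines, the 4.3 readinessProbe survives the rollback and keeps probing an endpoint the 4.2 operator never serves.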

The failing 4.2 console-operator pod has the probe when it shouldn't:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/235/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-57eb844d3650921acacb016df97664a30e55a45a554fe71a4fda297015321d0e/namespaces/openshift-console-operator/pods/console-operator-7bb76df6d6-m4qh7/console-operator-7bb76df6d6-m4qh7.yaml

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/235/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-57eb844d3650921acacb016df97664a30e55a45a554fe71a4fda297015321d0e/namespaces/openshift-console-operator/apps/deployments.yaml

Since the 4.2 console operator has no `/readyz` endpoint, the readiness probe fails:
> Jan 16 03:34:39.681 W ns/openshift-console-operator pod/console-operator-7bb76df6d6-m4qh7 Readiness probe failed: HTTP probe failed with statuscode: 404 (390 times)

The workaround would be to edit the console-operator deployment YAML after the downgrade and manually remove the liveness and readiness probes.
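
If you would rather script that workaround than hand-edit the deployment, a client-go sketch like the following should work. This is only an illustration: it assumes the container in the console-operator deployment is named "console-operator" and that a kubeconfig is in the default location; editing the deployment directly with oc achieves the same thing.

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the default kubeconfig (assumes this runs outside the cluster).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // In a strategic merge patch, setting a field to null removes it, so this
    // strips both probes from the (assumed) "console-operator" container.
    patch := []byte(`{"spec":{"template":{"spec":{"containers":[{"name":"console-operator","livenessProbe":null,"readinessProbe":null}]}}}}`)
    if _, err := client.AppsV1().Deployments("openshift-console-operator").Patch(
        context.TODO(), "console-operator", types.StrategicMergePatchType, patch, metav1.PatchOptions{},
    ); err != nil {
        panic(err)
    }
}

Once the probes are gone, the pod can go Ready and the stalled deployment rollout can complete; the 4.2 manifest has no probes, so the CVO will not add them back.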

Comment 4 liujia 2020-02-17 06:15:29 UTC
Version:
4.3.0-0.nightly-2020-02-16-235204

1. install OCP 4.2.19
2. trigger an upgrade from 4.2.19 to 4.3.0-0.nightly-2020-02-16-235204
3. monitor the upgrade progress and the status of all cluster operators
4. abort the above upgrade and trigger a downgrade to 4.2.19 after all operators have updated to the target version but the upgrade is still not 100% complete
5. check that the downgrade from 4.3.0-0.nightly-2020-02-16-235204 to 4.2.19 failed:
{
      "lastTransitionTime": "2020-02-17T06:10:44Z",
      "message": "Could not update deployment \"openshift-console-operator/console-operator\" (293 of 433)",
      "reason": "UpdatePayloadFailed",
      "status": "True",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2020-02-17T05:09:21Z",
      "message": "Unable to apply 4.2.19: the update could not be applied",
      "reason": "UpdatePayloadFailed",
      "status": "True",
      "type": "Progressing"
    },

Moreover, the CI job https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/296 also failed:
Feb 16 22:23:30.354: INFO: cluster upgrade is Progressing: Working towards 4.2.19: 82% complete
Feb 16 22:23:40.354: INFO: cluster upgrade is Progressing: Unable to apply 4.2.19: the update could not be applied
Feb 16 22:23:40.354: INFO: cluster upgrade is Failing: Could not update deployment "openshift-console-operator/console-operator" (293 of 433)
Feb 16 22:23:50.355: INFO: cluster upgrade is Progressing: Unable to apply 4.2.19: the update could not be applied

Comment 5 W. Trevor King 2020-02-17 20:54:09 UTC
> 5. check downgrade from 4.3.0-0.nightly-2020-02-16-235204 to 4.2.19 failed.

On upgrades and downgrades, the CVO that matters is the one from the target release.  More on that in [1].  So to verify the 4.3.z fix for this bug, you'd need to use a whatever -> 4.3-nightly update in which a manifest change removes either a container or a service port, or removes a livenessProbe and/or a readinessProbe.  If you want to check on rollback in particular, that would be 4.3-nightly -> whatever -> rollback to 4.3-nightly.  What you did, 4.2.19 -> 4.3-nightly -> rollback to 4.2.19, shows that the 4.2.19 CVO is still impacted by the bug, which makes sense with the 4.2.z bug 1792005 still in NEW (blocked on this bug getting VERIFIED).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1798049#c4

Comment 6 liujia 2020-02-18 06:32:23 UTC
@W. Trevor King 
The original bug 1791863 happened on v4.3.0 in job [1], where the rollback test path was 4.2 -> 4.3 -> 4.2 (4.2 -> 4.3 succeeded, 4.3 -> 4.2 failed).  So this cloned bug is targeted at 4.3 and should be verified following the same test path as [1] (4.2 -> 4.3 -> 4.2), right?  Or is bug 1791863 really a 4.2 issue, meaning it should be fixed in the 4.2 CVO rather than the 4.3 CVO?  In that case, is this bug actually for 4.3 and verifiable with 4.3 -> 4.4 -> 4.3?

[1]https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/235

Comment 7 W. Trevor King 2020-02-19 00:17:29 UTC
> Since the original bug1791863 which happened on v4.3.0...

I pointed bug 1791863 at Target Release 4.4.0 on 2020-01-16 20:24:05 UTC.  [1] links out the 4.4, 4.3, and 4.2 bug chain.  Ideally that one would have been verified with a whatever -> 4.4 update.  Your [2] included a 4.3.0 -> 4.4.0-0.nightly-2020-01-22-045318 update, which exercised the patched 4.4 CVO (appropriate for that 4.4-targeted bug).  The rollback from your test was orthogonal.  The test in [2] was weakened by the fact that we may not have made any probe removals in 4.3.0 -> 4.4.0-0.nightly-2020-01-22-045318.  But it showed that at least the patched 4.4 CVO wasn't regressing for normal operation.

> Then this bug is actually for 4.3, which can be verified with 4.3-4.4-4.3?

Yeah, this bug is for 4.3.  4.3 -> 4.4 -> 4.3 would exercise the patched 4.3 CVO, but without a probe removal in 4.4 -> 4.3 (I'm not aware of any, but haven't actively looked for them either), it won't be exercising the bug fixed in this series.  You could work up a custom 4.4 release image that added a probe to a Deployment which lacked probes in the 4.3 nightly.  But I'm also fine with a simple regression test here, so we can get on to landing the 4.2 bug 1792005 and see if we actually fixed the 4.3 -> 4.2 problem.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1791863#c7
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1791863#c5

Comment 8 liujia 2020-02-19 06:33:33 UTC
OK.  Since the 4.3 -> 4.4 -> 4.3 path was covered in bug 1791863, I will verify this bug with the 4.3-nightly -> whatever -> rollback to 4.3-nightly path suggested in comment 5.

Version:
4.3.0-0.nightly-2020-02-17-205936

1. install OCP 4.3.2
2. trigger an upgrade from 4.3.2 to 4.3.0-0.nightly-2020-02-17-205936
3. monitor the upgrade progress and the status of all cluster operators
4. abort the above upgrade and trigger a downgrade to 4.3.2 after all operators have updated to the target version but the upgrade is still not 100% complete
5. check that the downgrade from 4.3.0-0.nightly-2020-02-17-205936 to 4.3.2 succeeded

Comment 10 errata-xmlrpc 2020-02-25 06:17:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0528

