Bug 1791863 - Cannot abort an upgrade from 4.2 to 4.3 and rollback to 4.2 - probe changes are not correctly applied (console-operator cannot be reverted)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 1792004
Reported: 2020-01-16 16:04 UTC by Clayton Coleman
Modified: 2020-05-04 11:25 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1792004
Environment:
Last Closed: 2020-05-04 11:24:47 UTC
Target Upstream Version:
Embargoed:


Links
Github openshift/cluster-version-operator pull 298 (status: closed, last updated 2020-09-19 11:11:50 UTC): Bug 1791863: lib/resourcemerge/core: Clear livenessProbe and readinessProbe if nil in required
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:25:19 UTC)

Description Clayton Coleman 2020-01-16 16:04:43 UTC
A 4.2 to 4.3 upgrade is failing to roll back successfully to 4.2 because of the console-operator.

Jan 16 02:48:44.969: INFO: cluster upgrade is Progressing: Working towards 4.2.14: 82% complete
Jan 16 02:48:54.969: INFO: cluster upgrade is Progressing: Unable to apply 4.2.14: the update could not be applied
Jan 16 02:48:54.969: INFO: cluster upgrade is Failing: Could not update deployment "openshift-console-operator/console-operator" (293 of 433)
Jan 16 02:49:04.968: INFO: cluster upgrade is Progressing: Unable to apply 4.2.14: the update could not be applied
Jan 16 02:49:04.968: INFO: cluster upgrade is Failing: Could not update deployment "openshift-console-operator/console-operator" (293 of 433)

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/235

Rolling upgrades require toleration of old and new state at the same time, so any error on downgrade must be investigated because it likely indicates that a one-way upgrade cannot succeed either.  Second, we must preserve the ability to roll back so that if a customer hits an issue we can recover from it.

This blocks 4.3.0 upgrades from being made available to end users in the fast channel (if a customer hit a serious issue with their apps, we would be unable to roll back).

Comment 1 Samuel Padgett 2020-01-16 19:53:02 UTC
I believe this is happening because the CVO is not removing the console-operator readiness probe added in 4.3 when downgrading to 4.2.

The 4.2 operator deployment manifest does not have a readiness probe:
https://github.com/openshift/console-operator/blob/release-4.2/manifests/07-operator.yaml

The 4.3 operator deployment does:
https://github.com/openshift/console-operator/blob/release-4.3/manifests/07-operator.yaml#L66-L75

The failing 4.2 console-operator pod has the probe when it shouldn't:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/235/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-57eb844d3650921acacb016df97664a30e55a45a554fe71a4fda297015321d0e/namespaces/openshift-console-operator/pods/console-operator-7bb76df6d6-m4qh7/console-operator-7bb76df6d6-m4qh7.yaml

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/235/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-57eb844d3650921acacb016df97664a30e55a45a554fe71a4fda297015321d0e/namespaces/openshift-console-operator/apps/deployments.yaml

Since the 4.2 console operator has no `/readyz` endpoint, the readiness probe fails:
> Jan 16 03:34:39.681 W ns/openshift-console-operator pod/console-operator-7bb76df6d6-m4qh7 Readiness probe failed: HTTP probe failed with statuscode: 404 (390 times)
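
To make the failure mode concrete, here is a minimal sketch of the merge behavior described above, written in Python rather than the CVO's Go. It is not the actual lib/resourcemerge code: `ensure_container`, the `clear_missing_probes` flag, and the probe values (including port 8443) are assumptions made for illustration. It shows that a merge which only copies fields present in the required manifest leaves a stale probe in place, while clearing probes that are absent from the required manifest (what the title of cvo#298 describes) removes it.

```python
import copy

PROBE_FIELDS = ("livenessProbe", "readinessProbe")


def ensure_container(existing, required, clear_missing_probes):
    """Return (merged, modified) after applying `required` onto `existing`."""
    merged = copy.deepcopy(existing)
    modified = False
    for field in PROBE_FIELDS:
        wanted = required.get(field)
        if wanted is None:
            # The manifest declares no probe. Without clearing, whatever the
            # live object already carries is silently kept.
            if clear_missing_probes and field in merged:
                del merged[field]
                modified = True
            continue
        if merged.get(field) != wanted:
            merged[field] = copy.deepcopy(wanted)
            modified = True
    return merged, modified


# Hypothetical container specs; the probe values are illustrative only.
live_43 = {
    "name": "console-operator",
    "readinessProbe": {"httpGet": {"path": "/readyz", "port": 8443}},
}
required_42 = {"name": "console-operator"}

kept, changed_old = ensure_container(live_43, required_42, clear_missing_probes=False)
cleared, changed_new = ensure_container(live_43, required_42, clear_missing_probes=True)
print("readinessProbe" in kept, changed_old)     # True False
print("readinessProbe" in cleared, changed_new)  # False True
```

With clearing disabled the merge reports no modification at all, which would match the symptom here: the 4.2 deployment keeps a probe for an endpoint that the 4.2 operator does not serve.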

The workaround would be to edit the console-operator deployment YAML after the downgrade and manually remove the liveness and readiness probes.
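
One way to script that workaround is to generate a JSON Patch (RFC 6902) that removes only the probes actually present, since a `remove` operation fails when its path does not exist. The sketch below is an untested illustration: `probe_removal_patch` is a hypothetical helper, and the sample deployment is a trimmed stand-in with made-up probe values, not the real object.

```python
import json


def probe_removal_patch(deployment):
    """Build JSON Patch ops that remove any probes found on the containers."""
    ops = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for index, container in enumerate(containers):
        for field in ("livenessProbe", "readinessProbe"):
            if field in container:
                ops.append({
                    "op": "remove",
                    "path": f"/spec/template/spec/containers/{index}/{field}",
                })
    return ops


# Trimmed, hypothetical deployment; probe paths and ports are illustrative.
sample_deployment = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "console-operator",
        "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8443}},
        "readinessProbe": {"httpGet": {"path": "/readyz", "port": 8443}},
    }]}}}
}

patch_ops = probe_removal_patch(sample_deployment)
print(json.dumps(patch_ops))
```

The printed JSON could then be passed to something like `oc -n openshift-console-operator patch deployment console-operator --type=json -p '<patch>'`, assuming the live deployment has the same container layout as the sample.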

Comment 4 liujia 2020-01-20 09:17:26 UTC
Version:
4.4.0-0.nightly-2020-01-20-002356
Tried to abort the upgrade from 4.3.0-rc.3 to 4.4.0-0.nightly-2020-01-20-002356 and roll back when the upgrade was 99% finished, but hit bz1792842: the downgrade cannot start, so I cannot check whether the original console-operator issue during downgrade is fixed. Will verify this bug after bz1792842 is fixed.

Comment 5 liujia 2020-01-23 08:31:01 UTC
Version:
4.4.0-0.nightly-2020-01-22-045318

1. Install OCP 4.3.0.
2. Trigger an upgrade from 4.3.0 to 4.4.0-0.nightly-2020-01-22-045318.
3. Monitor the upgrade progress and the status of all cluster operators.
4. Abort the upgrade and trigger a downgrade to 4.3.0 after all operators have updated to the target version but before the upgrade status reaches 100%.
5. Check that the downgrade from 4.4.0-0.nightly-2020-01-22-045318 to 4.3.0 succeeds.
# ./oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2020-01-23T07:59:27Z|2020-01-23T08:27:58Z|Completed|4.3.0
2020-01-23T07:19:09Z|2020-01-23T07:59:27Z|Partial|4.4.0-0.nightly-2020-01-22-221818
2020-01-23T04:26:54Z|2020-01-23T04:47:24Z|Completed|4.3.0
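
The same check can be scripted instead of read by eye. The sketch below assumes the JSON shape implied by the jq expression above (`items[0].status.history[]` with `startedTime`, `completionTime`, `state`, and `version`) and that history is listed newest first, as in the output. `history_lines` and `rollback_completed` are hypothetical helper names, and the sample data is copied from the three lines above.

```python
def history_lines(clusterversion_list):
    """Format history entries as started|completed|state|version."""
    history = clusterversion_list["items"][0]["status"]["history"]
    return [
        "|".join([
            entry.get("startedTime") or "",
            # completionTime may be unset while an update is still running.
            entry.get("completionTime") or "",
            entry["state"],
            entry["version"],
        ])
        for entry in history
    ]


def rollback_completed(clusterversion_list, version):
    """True if the newest history entry is a Completed update to `version`."""
    latest = clusterversion_list["items"][0]["status"]["history"][0]
    return latest["state"] == "Completed" and latest["version"] == version


# Sample data transcribed from the command output in this comment.
sample = {"items": [{"status": {"history": [
    {"startedTime": "2020-01-23T07:59:27Z", "completionTime": "2020-01-23T08:27:58Z",
     "state": "Completed", "version": "4.3.0"},
    {"startedTime": "2020-01-23T07:19:09Z", "completionTime": "2020-01-23T07:59:27Z",
     "state": "Partial", "version": "4.4.0-0.nightly-2020-01-22-221818"},
    {"startedTime": "2020-01-23T04:26:54Z", "completionTime": "2020-01-23T04:47:24Z",
     "state": "Completed", "version": "4.3.0"},
]}}]}

for line in history_lines(sample):
    print(line)
print(rollback_completed(sample, "4.3.0"))
```

In practice the input would come from `json.loads` on the output of `oc get clusterversion -o json`; the Partial entry followed by a newer Completed 4.3.0 entry is the pattern that indicates the aborted upgrade rolled back successfully.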

Comment 6 Sinny Kumari 2020-02-10 11:54:38 UTC
buildcop update:

Still seeing the 4.2 to 4.3 upgrade failing to roll back (the issue originally reported in #c0) in various CI runs:
Feb 10 10:16:37.817: INFO: cluster upgrade is Progressing: Working towards 4.2.16: 82% complete
Feb 10 10:16:47.817: INFO: cluster upgrade is Progressing: Unable to apply 4.2.16: the update could not be applied
Feb 10 10:16:47.817: INFO: cluster upgrade is Failing: Could not update deployment "openshift-console-operator/console-operator" (293 of 433)
Feb 10 10:16:57.817: INFO: cluster upgrade is Progressing: Unable to apply 4.2.16: the update could not be applied

prow job link - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/283

Let me know if we are tracking this issue at some other place.

Comment 7 W. Trevor King 2020-02-11 22:48:11 UTC
> Still seeing 4.2 to 4.3 upgrade failing to rollback issue...

This bug targets 4.4, and the associated cvo#1791863 landed in master.  Walking the "Blocks" dependency chain back:

* Bug 1792004 targets 4.3.z and is still in POST (I just dropped a CVO-maintainer approval on cvo#299, but it's still blocked on a patch manager's cherry-pick-approved).
* Bug 1792005 targets 4.2.z and is still in POST (cvo#301 is blocked on the 4.3 bug 1792004 getting VERIFIED, a CVO-maintainer approval, and a patch manager's cherry-pick-approved).

Once the 4.2 PR lands, the 4.2<->4.3 rollback tests will start passing again, so the 4.2 bug 1792005 is probably the best place to track.

Comment 8 W. Trevor King 2020-02-11 22:48:47 UTC
> ...and the associated cvo#1791863 landed in master...

Oops, I meant cvo#298.

Comment 10 errata-xmlrpc 2020-05-04 11:24:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
