Bug 2072389

Summary: CVO exits upgrade immediately rather than waiting for etcd backup
Product: OpenShift Container Platform
Component: Cluster Version Operator
Reporter: Yang Yang <yanyang>
Assignee: Jack Ottofaro <jack.ottofaro>
QA Contact: Yang Yang <yanyang>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.10
CC: jack.ottofaro, jhou, lmohanty, lxia, wking
Target Milestone: ---
Keywords: TestBlocker, UpgradeBlocker, Upgrades
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: UpdateRecommendationsBlocked
Doc Type: No Doc Update
Last Closed: 2022-08-10 11:04:00 UTC
Type: Bug
Bug Blocks: 2076793, 2083370
Attachments: CVO log file

Description Yang Yang 2022-04-06 08:35:17 UTC
Created attachment 1871007 [details]
CVO log file

Description of problem:

During a minor upgrade from 4.10 to 4.11, CVO sets ReleaseAccepted=False as soon as it finds that the etcd RecentBackup condition is not True, so the upgrade never starts. Previously, when RecentBackup was not True, CVO set Failing=True and etcd started taking a backup. Once the backup had been taken, CVO set Failing=False and proceeded with the upgrade.

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2022-04-06T06:34:30Z"
    generation: 3
    name: version
    resourceVersion: "36529"
    uid: d3d4b24e-9b77-4e49-8796-350c2f8cd96f
  spec:
    channel: stable-4.10
    clusterID: c6e49e12-4a08-4795-beb9-fd819a14ea33
    desiredUpdate:
      force: false
      image: registry.ci.openshift.org/ocp/release@sha256:28d4c78bd2ce3fa33479c0ee57372777908fead95e150f664fec8e4310cd85e4
      version: ""
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2022-04-06T06:34:31Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster
        version 4.10.7 not found in the "stable-4.10" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2022-04-06T07:14:51Z"
      message: 'Preconditions failed for payload loaded version="4.11.0-0.nightly-2022-03-29-152521"
        image="registry.ci.openshift.org/ocp/release@sha256:28d4c78bd2ce3fa33479c0ee57372777908fead95e150f664fec8e4310cd85e4":
        Precondition "EtcdRecentBackup" failed because of "ControllerStarted": '
      reason: PreconditionChecks
      status: "False"
      type: ReleaseAccepted
    - lastTransitionTime: "2022-04-06T06:55:47Z"
      message: Done applying 4.10.7
      status: "True"
      type: Available
    - lastTransitionTime: "2022-04-06T06:55:47Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2022-04-06T07:15:47Z"
      message: Cluster version is 4.10.7
      status: "False"
      type: Progressing
    desired:
      image: quay.io/openshift-release-dev/ocp-release@sha256:347fcefa4cff84074fa56ff73a483b9fee7ba98b9a71752763502f11182a11af
      url: https://access.redhat.com/errata/RHBA-2022:1162
      version: 4.10.7
    history:
    - completionTime: "2022-04-06T06:55:47Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:347fcefa4cff84074fa56ff73a483b9fee7ba98b9a71752763502f11182a11af
      startedTime: "2022-04-06T06:34:31Z"
      state: Completed
      verified: false
      version: 4.10.7
    observedGeneration: 3
    versionHash: o09_Mvm2ad0=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


# oc get co/etcd -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-04-06T06:34:31Z"
  generation: 1
  name: etcd
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d3d4b24e-9b77-4e49-8796-350c2f8cd96f
  resourceVersion: "29598"
  uid: b77ad218-07a7-4db5-b862-7d3df7954c36
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-04-06T06:39:47Z"
    message: |-
      NodeControllerDegraded: All master nodes are ready
      EtcdMembersDegraded: No unhealthy members found
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-04-06T06:56:30Z"
    message: |-
      NodeInstallerProgressing: 3 nodes are at revision 7
      EtcdMembersProgressing: No unstarted etcd members found
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-04-06T06:41:55Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 7
      EtcdMembersAvailable: 3 members are available
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-04-06T06:39:47Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-04-06T06:39:47Z"
    reason: ControllerStarted
    status: Unknown
    type: RecentBackup
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: etcds
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-etcd-operator
    resource: namespaces
  - group: ""
    name: openshift-etcd
    resource: namespaces
  versions:
  - name: raw-internal
    version: 4.10.7
  - name: etcd
    version: 4.10.7
  - name: operator
    version: 4.10.7


Version-Release number of the following components:
4.10.7

How reproducible:
Always

Steps to Reproduce:
1. Install a 4.10 cluster
2. Upgrade to 4.11

# oc adm upgrade --allow-explicit-upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:28d4c78bd2ce3fa33479c0ee57372777908fead95e150f664fec8e4310cd85e4

Actual results:
The upgrade exits because the precondition check for the etcd backup fails.
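
As a rough illustration of the check that fails here, the sketch below evaluates the RecentBackup condition from the `oc get co/etcd` output above. It is written in Python for brevity and is not the CVO's actual implementation; the function names `find_condition` and `recent_backup_ok` are hypothetical, and the rule "pass only when status is True" is an assumption drawn from the behavior described in this report.

```python
from typing import Iterable, Mapping, Optional, Tuple


def find_condition(
    conditions: Iterable[Mapping[str, str]], cond_type: str
) -> Optional[Mapping[str, str]]:
    """Return the first condition whose type matches, or None."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def recent_backup_ok(conditions: Iterable[Mapping[str, str]]) -> Tuple[bool, str]:
    """Hypothetical check: pass only when RecentBackup is explicitly True."""
    cond = find_condition(conditions, "RecentBackup")
    if cond is None:
        return False, "RecentBackup condition missing"
    if cond.get("status") != "True":
        return False, f'RecentBackup={cond.get("status")} ({cond.get("reason", "")})'
    return True, ""


# Subset of the etcd ClusterOperator conditions shown in this report.
etcd_conditions = [
    {"type": "Upgradeable", "status": "True", "reason": "AsExpected"},
    {"type": "RecentBackup", "status": "Unknown", "reason": "ControllerStarted"},
]
print(recent_backup_ok(etcd_conditions))
```

With status Unknown and reason ControllerStarted, the sketch reports a failure, which corresponds to the `Precondition "EtcdRecentBackup" failed because of "ControllerStarted"` message in the ClusterVersion output.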


Expected results:
CVO sets Failing=True and waits for the etcd backup


Comment 1 W. Trevor King 2022-04-06 19:09:38 UTC
(In reply to Yang Yang from comment #0)
> During minor upgrade from 4.10 to 4.11, CVO sets ReleaseAccepted=False once
> it finds etcd RecentBackup not true so that upgrade is never started.
> Previously, CVO checked etcd RecentBackup if it’s not true, CVO set
> Failing=true, then etcd started to take backup. After backup has been taken,
> CVO set Failing to false and proceeded the upgrade.
> ...
> Steps to Reproduce:
> 1. Install a 4.10 cluster
> 2. Upgrade to 4.11

This makes it a problem in the 4.10 CVO, probably introduced into 4.10.z by [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2064991

Comment 2 W. Trevor King 2022-04-06 19:14:06 UTC
And we'll want etcd snapshots working again by the time we are recommending 4.10 -> 4.11 updates, so setting blocker+ on this 4.11.0-targeted bug.

Comment 5 Yang Yang 2022-05-05 03:38:43 UTC
It seems we need to wait for signed 4.12 builds to become available before we can verify this.

Comment 6 W. Trevor King 2022-05-08 05:46:27 UTC
This is blocker+, so we're committed to fixing before 4.11 GAs, which means we're unlikely to make update-graph changes based on this.  Since update-graph changes are what UpgradeBlocker is for [1], I'm dropping the keyword.

[1]: https://github.com/openshift/enhancements/blob/bdf15e7a57a1f5a766e67c27c4ed9e0d03ef4bb4/enhancements/update/update-blocker-lifecycle/README.md

Comment 7 Yang Yang 2022-05-09 07:02:58 UTC
Verifying with 4.11.0-0.nightly-2022-05-06-215225

Steps are as below:
1. Install a cluster with 4.11.0-0.nightly-2022-05-06-215225

2. Set an override for openshift-network-operator/network-operator
# oc patch clusterversion version --type=merge -p '{"spec": {"overrides":[{"kind": "Deployment", "name": "network-operator", "namespace": "openshift-network-operator", "unmanaged": true, "group": "apps/v1"}]}}'

3. Upgrade to 4.11.0-0.nightly-2022-05-07-161754

# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:a655fcffc1bf299563471eb71625eedf142b4a953f15dc2c8fa76438092495ac --allow-explicit-upgrade 
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:a655fcffc1bf299563471eb71625eedf142b4a953f15dc2c8fa76438092495ac

4. Check cv conditions
# oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-09T06:14:01Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-05-06-215225 not found in the "stable-4.11" channel
2022-05-09T06:14:01Z Upgradeable=False MultipleReasons: Cluster should not be upgraded between minor versions for multiple reasons: ClusterVersionOverridesSet,AdminAckRequired
* Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
* Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see
the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.

2022-05-09T06:14:01Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-05-09T06:51:11Z ReleaseAccepted=False PreconditionChecks: Preconditions failed for payload loaded version="4.11.0-0.nightly-2022-05-07-161754" image="registry.ci.openshift.org/ocp/release@sha256:a655fcffc1bf299563471eb71625eedf142b4a953f15dc2c8fa76438092495ac": Precondition "ClusterVersionUpgradeable" failed because of "ClusterVersionOverridesSet": Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
2022-05-09T06:34:01Z Available=True : Done applying 4.11.0-0.nightly-2022-05-06-215225
2022-05-09T06:49:31Z Failing=False : 
2022-05-09T06:49:46Z Progressing=False : Cluster version is 4.11.0-0.nightly-2022-05-06-215225
2022-05-09T06:50:47Z UpgradeableAdminAckRequired=False AdminAckRequired: Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see
the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.

2022-05-09T06:50:47Z UpgradeableClusterVersionOverrides=False ClusterVersionOverridesSet: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

Nice, we get ReleaseAccepted=False due to overrides
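
For readers without jq, the condition summary used in these steps can be reproduced with a short Python sketch. It assumes the same ClusterVersion JSON shape that `oc get -o json clusterversion version` returns; the function name `format_conditions` and the inline sample are illustrative only. Missing `reason` or `message` fields are rendered as empty strings, which mirrors how the jq filter prints lines such as `Failing=False : `.

```python
import json
from typing import List


def format_conditions(clusterversion_json: str) -> List[str]:
    """Render each status condition as 'time type=status reason: message'."""
    obj = json.loads(clusterversion_json)
    lines = []
    for c in obj.get("status", {}).get("conditions", []):
        lines.append(
            f'{c.get("lastTransitionTime", "")} {c.get("type", "")}={c.get("status", "")} '
            f'{c.get("reason", "")}: {c.get("message", "")}'
        )
    return lines


# Hand-written sample with two conditions, trimmed from the output above.
sample = json.dumps(
    {
        "status": {
            "conditions": [
                {
                    "lastTransitionTime": "2022-05-09T06:49:31Z",
                    "type": "Failing",
                    "status": "False",
                },
                {
                    "lastTransitionTime": "2022-05-09T06:14:01Z",
                    "type": "ImplicitlyEnabledCapabilities",
                    "status": "False",
                    "reason": "AsExpected",
                    "message": "Capabilities match configured spec",
                },
            ]
        }
    }
)
for line in format_conditions(sample):
    print(line)
```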

5. Remove overrides
# oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'
clusterversion.config.openshift.io/version patched

6. Check cv conditions
# oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-09T06:14:01Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-05-07-161754 not found in the "stable-4.11" channel
2022-05-09T06:14:01Z Upgradeable=False AdminAckRequired: Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see
the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.

2022-05-09T06:14:01Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-05-09T06:56:36Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.11.0-0.nightly-2022-05-07-161754" image="registry.ci.openshift.org/ocp/release@sha256:a655fcffc1bf299563471eb71625eedf142b4a953f15dc2c8fa76438092495ac"
2022-05-09T06:34:01Z Available=True : Done applying 4.11.0-0.nightly-2022-05-06-215225
2022-05-09T06:49:31Z Failing=False : 
2022-05-09T06:56:36Z Progressing=True : Working towards 4.11.0-0.nightly-2022-05-07-161754: 105 of 795 done (13% complete)

Payload loaded and upgrade proceeds. Looks good to me.

Comment 9 Yang Yang 2022-05-10 01:59:04 UTC
Thanks to Trevor and Justin, we finally have a 4.12 build. Verifying with 4.11.0-0.nightly-2022-05-07-161754

Steps to verify:
1. Install a 4.11 cluster
2. Upgrade to 4.12

# oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release-nightly@sha256:fb152ef66937c9cbb05467ff5b23f3b327485a90cae6686a5742375c980fea26 --allow-explicit-upgrade 

3. Check cv conditions
# oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-10T01:12:02Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-05-07-161754 not found in the "stable-4.11" channel
2022-05-10T01:12:02Z Upgradeable=False AdminAckRequired: Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see
the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.

2022-05-10T01:12:02Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-05-10T01:45:35Z ReleaseAccepted=False PreconditionChecks: Preconditions failed for payload loaded version="4.12.0-0.nightly-0" image="quay.io/openshift-release-dev/ocp-release-nightly@sha256:fb152ef66937c9cbb05467ff5b23f3b327485a90cae6686a5742375c980fea26": Multiple precondition checks failed:
* Precondition "ClusterVersionUpgradeable" failed because of "AdminAckRequired": Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see
the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.

* Precondition "EtcdRecentBackup" failed because of "UpgradeBackupInProgress": RecentBackup: Backup pod phase: "Pending"
2022-05-10T01:30:19Z Available=True : Done applying 4.11.0-0.nightly-2022-05-07-161754
2022-05-10T01:30:19Z Failing=False : 
2022-05-10T01:30:19Z Progressing=False : Cluster version is 4.11.0-0.nightly-2022-05-07-161754


ReleaseAccepted=False due to EtcdRecentBackup and AdminAckRequired
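
When several preconditions fail at once, the individual failures can be pulled out of the ReleaseAccepted message with a small parser. The sketch below is a Python illustration that assumes the message keeps the `Precondition "<name>" failed because of "<reason>"` wording seen in this report; the helper name `failed_preconditions` is hypothetical, and the sample message is abbreviated from the output above.

```python
import re
from typing import List, Tuple

# Matches the per-precondition wording seen in the ReleaseAccepted message.
PRECONDITION_RE = re.compile(r'Precondition "([^"]+)" failed because of "([^"]+)"')


def failed_preconditions(message: str) -> List[Tuple[str, str]]:
    """Return (precondition name, reason) pairs found in the message."""
    return PRECONDITION_RE.findall(message)


# Abbreviated copy of the message from the output above.
message = (
    "Multiple precondition checks failed:\n"
    '* Precondition "ClusterVersionUpgradeable" failed because of "AdminAckRequired": '
    "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs\n"
    '* Precondition "EtcdRecentBackup" failed because of "UpgradeBackupInProgress": '
    'RecentBackup: Backup pod phase: "Pending"'
)
for name, reason in failed_preconditions(message):
    print(f"{name}: {reason}")
```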


# oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-10T01:12:02Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-05-07-161754 not found in the "stable-4.11" channel
2022-05-10T01:12:02Z Upgradeable=False AdminAckRequired: Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see
the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.

2022-05-10T01:12:02Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-05-10T01:45:35Z ReleaseAccepted=False PreconditionChecks: Preconditions failed for payload loaded version="4.12.0-0.nightly-0" image="quay.io/openshift-release-dev/ocp-release-nightly@sha256:fb152ef66937c9cbb05467ff5b23f3b327485a90cae6686a5742375c980fea26": Precondition "ClusterVersionUpgradeable" failed because of "AdminAckRequired": Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see
the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.

2022-05-10T01:30:19Z Available=True : Done applying 4.11.0-0.nightly-2022-05-07-161754
2022-05-10T01:30:19Z Failing=False : 
2022-05-10T01:30:19Z Progressing=False : Cluster version is 4.11.0-0.nightly-2022-05-07-161754

EtcdRecentBackup precondition validation passed.

Then we manually provide the administrator acknowledgement
# oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.11-kube-1.25-api-removals-in-4.12":"true"}}' --type=merge
configmap/admin-acks patched

# oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-10T01:12:02Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-0 not found in the "stable-4.11" channel
2022-05-10T01:12:02Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-05-10T01:47:37Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.12.0-0.nightly-0" image="quay.io/openshift-release-dev/ocp-release-nightly@sha256:fb152ef66937c9cbb05467ff5b23f3b327485a90cae6686a5742375c980fea26"
2022-05-10T01:30:19Z Available=True : Done applying 4.11.0-0.nightly-2022-05-07-161754
2022-05-10T01:30:19Z Failing=False : 
2022-05-10T01:47:37Z Progressing=True : Working towards 4.12.0-0.nightly-0: 9 of 795 done (1% complete)

AdminAck precondition validation passed and the upgrade proceeds. Looks good to me. Moving it to the verified state.

Comment 11 Lalatendu Mohanty 2022-06-16 19:00:23 UTC
Impact assessment
 
Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?
  * Upgrades are impacted if the current cluster version is >= 4.10.8 and <= 4.10.14.
  * Upgrades to 4.11 are especially impacted because the upgrade requires an etcd backup, and because of this bug CVO does not wait for it.
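
The version range above can be expressed as a simple check. This is a Python sketch with a hypothetical helper name, `is_affected`; the bounds are copied from this assessment and are not derived independently. Note that the original report was filed against 4.10.7, so the lower bound may not be exact.

```python
import re

# Bounds as stated in the impact assessment: 4.10.8 through 4.10.14.
AFFECTED_MAJOR_MINOR = (4, 10)
FIRST_AFFECTED_PATCH = 8
LAST_AFFECTED_PATCH = 14


def is_affected(version: str) -> bool:
    """Return True if a plain x.y.z version falls inside the stated range."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"expected an x.y.z version, got {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (
        (major, minor) == AFFECTED_MAJOR_MINOR
        and FIRST_AFFECTED_PATCH <= patch <= LAST_AFFECTED_PATCH
    )


for v in ("4.10.10", "4.10.15"):
    print(v, is_affected(v))
```

Nightly or pre-release version strings are rejected on purpose, since the assessment only speaks about released 4.10.z versions.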

What is the impact? Is it serious enough to warrant removing update recommendations?
  * If the initial precondition check fails, CVO does not re-check the precondition.
  * The upgrade does not proceed: because of this bug, CVO does not accept the new target release and continues to reconcile the current version.
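
The difference between checking once and re-checking can be shown with a toy model. The Python sketch below is not CVO code: `check_once`, `check_until_accepted`, and the simulated `make_backup_check` are hypothetical names, and the "backup finishes after a few checks" behavior is an assumption used only to show the control flow.

```python
from typing import Callable, Tuple

Check = Callable[[], Tuple[bool, str]]


def check_once(check: Check) -> bool:
    """Buggy shape: the first failure is treated as final."""
    ok, _ = check()
    return ok


def check_until_accepted(check: Check, max_attempts: int) -> Tuple[bool, int]:
    """Expected shape: re-run the check until it passes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        ok, _ = check()
        if ok:
            return True, attempt
    return False, max_attempts


def make_backup_check(ready_after: int) -> Check:
    """Simulated precondition that starts passing on call number ready_after."""
    calls = {"n": 0}

    def check() -> Tuple[bool, str]:
        calls["n"] += 1
        if calls["n"] >= ready_after:
            return True, ""
        return False, "UpgradeBackupInProgress"

    return check


print(check_once(make_backup_check(ready_after=3)))
print(check_until_accepted(make_backup_check(ready_after=3), max_attempts=5))
```

In this model the single check gives up while the backup is still in progress, whereas the re-checking loop accepts the release on the third attempt.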

How involved is remediation?
  * The bug only impacts upgrades, so as long as the cluster stays on the current version there is nothing to remediate.
  * To avoid the etcd backup issue, we suggest updating to 4.10.15 or later (which contains the fix) before updating to 4.11.z. This works because CVO does not check the etcd backup precondition for z-stream updates.
  * If the cluster is stuck at ReleaseAccepted=False, address the condition, clear the upgrade with "$ oc adm upgrade --clear", and try the upgrade again.

Is this a regression?
  * Yes. The fix for https://bugzilla.redhat.com/show_bug.cgi?id=2064991 introduced it.

Comment 12 W. Trevor King 2022-06-22 23:13:48 UTC
[1] landed a conditional risk for 4.10.14 to 4.11.0-fc.0, so I'm setting UpdateRecommendationsBlocked.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/2069

Comment 13 errata-xmlrpc 2022-08-10 11:04:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069