Bug 2076793

Summary: CVO exits upgrade immediately rather than waiting for etcd backup
Product: OpenShift Container Platform
Reporter: Jack Ottofaro <jack.ottofaro>
Component: Etcd
Assignee: W. Trevor King <wking>
Status: CLOSED ERRATA
QA Contact: Yang Yang <yanyang>
Severity: high
Priority: high
Version: 4.10
CC: alray, emoss, jack.ottofaro, jhou, lmohanty, lxia, wking, yanyang
Keywords: Upgrades
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clone Of: 2072389
Last Closed: 2022-08-10 11:07:44 UTC
Bug Depends On: 2072389, 2083370
Bug Blocks: 2079660

Description Jack Ottofaro 2022-04-19 20:32:09 UTC
+++ This bug was initially created as a clone of Bug #2072389 +++

Created attachment 1871007 [details]
CVO log file

Description of problem:

During a minor upgrade from 4.10 to 4.11, CVO sets ReleaseAccepted=False as soon as it finds that etcd's RecentBackup condition is not True, so the upgrade never starts. Previously, when CVO found that RecentBackup was not True, it set Failing=True, and etcd then started taking a backup. Once the backup had been taken, CVO set Failing to False and proceeded with the upgrade.

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2022-04-06T06:34:30Z"
    generation: 3
    name: version
    resourceVersion: "36529"
    uid: d3d4b24e-9b77-4e49-8796-350c2f8cd96f
  spec:
    channel: stable-4.10
    clusterID: c6e49e12-4a08-4795-beb9-fd819a14ea33
    desiredUpdate:
      force: false
      image: registry.ci.openshift.org/ocp/release@sha256:28d4c78bd2ce3fa33479c0ee57372777908fead95e150f664fec8e4310cd85e4
      version: ""
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2022-04-06T06:34:31Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster
        version 4.10.7 not found in the "stable-4.10" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2022-04-06T07:14:51Z"
      message: 'Preconditions failed for payload loaded version="4.11.0-0.nightly-2022-03-29-152521"
        image="registry.ci.openshift.org/ocp/release@sha256:28d4c78bd2ce3fa33479c0ee57372777908fead95e150f664fec8e4310cd85e4":
        Precondition "EtcdRecentBackup" failed because of "ControllerStarted": '
      reason: PreconditionChecks
      status: "False"
      type: ReleaseAccepted
    - lastTransitionTime: "2022-04-06T06:55:47Z"
      message: Done applying 4.10.7
      status: "True"
      type: Available
    - lastTransitionTime: "2022-04-06T06:55:47Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2022-04-06T07:15:47Z"
      message: Cluster version is 4.10.7
      status: "False"
      type: Progressing
    desired:
      image: quay.io/openshift-release-dev/ocp-release@sha256:347fcefa4cff84074fa56ff73a483b9fee7ba98b9a71752763502f11182a11af
      url: https://access.redhat.com/errata/RHBA-2022:1162
      version: 4.10.7
    history:
    - completionTime: "2022-04-06T06:55:47Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:347fcefa4cff84074fa56ff73a483b9fee7ba98b9a71752763502f11182a11af
      startedTime: "2022-04-06T06:34:31Z"
      state: Completed
      verified: false
      version: 4.10.7
    observedGeneration: 3
    versionHash: o09_Mvm2ad0=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


# oc get co/etcd -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-04-06T06:34:31Z"
  generation: 1
  name: etcd
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d3d4b24e-9b77-4e49-8796-350c2f8cd96f
  resourceVersion: "29598"
  uid: b77ad218-07a7-4db5-b862-7d3df7954c36
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-04-06T06:39:47Z"
    message: |-
      NodeControllerDegraded: All master nodes are ready
      EtcdMembersDegraded: No unhealthy members found
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-04-06T06:56:30Z"
    message: |-
      NodeInstallerProgressing: 3 nodes are at revision 7
      EtcdMembersProgressing: No unstarted etcd members found
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-04-06T06:41:55Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 7
      EtcdMembersAvailable: 3 members are available
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-04-06T06:39:47Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-04-06T06:39:47Z"
    reason: ControllerStarted
    status: Unknown
    type: RecentBackup
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: etcds
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-etcd-operator
    resource: namespaces
  - group: ""
    name: openshift-etcd
    resource: namespaces
  versions:
  - name: raw-internal
    version: 4.10.7
  - name: etcd
    version: 4.10.7
  - name: operator
    version: 4.10.7


Version-Release number of the following components:
4.10.7

How reproducible:
Always

Steps to Reproduce:
1. Install a 4.10 cluster
2. Upgrade to 4.11

# oc adm upgrade --allow-explicit-upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:28d4c78bd2ce3fa33479c0ee57372777908fead95e150f664fec8e4310cd85e4

Actual results:
Upgrade exits because precondition check fails on etcd backup


Expected results:
CVO sets failing to true and waits for etcd backup


--- Additional comment from W. Trevor King on 2022-04-06 19:09:38 UTC ---

(In reply to Yang Yang from comment #0)
> During minor upgrade from 4.10 to 4.11, CVO sets ReleaseAccepted=False once
> it finds etcd RecentBackup not true so that upgrade is never started.
> Previously, CVO checked etcd RecentBackup if it’s not true, CVO set
> Failing=true, then etcd started to take backup. After backup has been taken,
> CVO set Failing to false and proceeded the upgrade.
> ...
> Steps to Reproduce:
> 1. Install a 4.10 cluster
> 2. Upgrade to 4.11

This makes it a problem in the 4.10 CVO, probably introduced into 4.10.z by [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2064991

--- Additional comment from W. Trevor King on 2022-04-06 19:14:06 UTC ---

And we'll want etcd snapshots working again by the time we are recommending 4.10 -> 4.11 updates, so setting blocker+ on this 4.11.0-targeted bug.

Comment 1 Jack Ottofaro 2022-04-19 20:36:29 UTC
As of https://github.com/openshift/cluster-version-operator/pull/683, CVO no longer sets Failing=True when preconditions, including the etcd backup precondition, fail. CVO now sets the ReleaseAccepted condition to indicate whether the payload has been successfully loaded. Etcd should instead check for ReleaseAccepted!=True.
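
The change of trigger described in this comment can be sketched as follows. This is a minimal Python illustration, not the etcd operator's actual Go code: the function names are hypothetical, the "old" trigger follows the behavior described in this report, and treating a missing ReleaseAccepted condition as "no backup wanted" is an assumption.

```python
def find_condition(conditions, cond_type):
    """Return the first condition dict with the given type, or None."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def backup_wanted_old(cv_conditions):
    """Trigger as described for the earlier behavior: Failing=True."""
    cond = find_condition(cv_conditions, "Failing")
    return cond is not None and cond.get("status") == "True"


def backup_wanted_new(cv_conditions):
    """Trigger suggested in this comment: ReleaseAccepted present and not True.

    Treating an absent condition as "no backup wanted" is an assumption.
    """
    cond = find_condition(cv_conditions, "ReleaseAccepted")
    return cond is not None and cond.get("status") != "True"


# Conditions abridged from the ClusterVersion dump in the description.
cv_conditions = [
    {"type": "ReleaseAccepted", "status": "False", "reason": "PreconditionChecks"},
    {"type": "Available", "status": "True"},
    {"type": "Failing", "status": "False"},
    {"type": "Progressing", "status": "False"},
]

print(backup_wanted_old(cv_conditions))  # False: the old trigger never fires
print(backup_wanted_new(cv_conditions))  # True: the new trigger fires
```

With the conditions from the description, only the ReleaseAccepted-based check asks for a backup, which matches the stall reported here.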

Comment 4 ge liu 2022-04-26 10:30:15 UTC
Verified with 4.11.0-0.nightly-2022-04-26-030643: the upgrade from 4.10.11 to 4.11.0-0.nightly-2022-04-26-030643 succeeded.

Comment 5 Yang Yang 2022-04-27 12:51:05 UTC
Verifying with 4.11.0-0.nightly-2022-04-26-181148 by patching the cv status to set ReleaseAccepted to False.

Before patching cv status

# oc get co/etcd -oyaml
  - lastTransitionTime: "2022-04-27T05:50:20Z"
    reason: ControllerStarted
    status: Unknown
    type: RecentBackup


Patching cv to change ReleaseAccepted to false

# oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator
deployment.apps/cluster-version-operator scaled

# oc proxy &
# curl -k -XPATCH -H "Accept: application/json" -H "Content-Type: application/json-patch+json" 'http://127.0.0.1:8001/apis/config.openshift.io/v1/clusterversions/version/status' -d '[{"op": "add", "path": "/status/conditions", "value": [{"type":"ReleaseAccepted", "status": "False", "reason": "UpgradePreconditionCheckFailed", "message": "EtcdRecentBackup failed", "lastTransitionTime": "2022-04-27T18:25:51Z"}]}]'
{
  "apiVersion": "config.openshift.io/v1",
  "kind": "ClusterVersion",
  "metadata": {
    "creationTimestamp": "2022-04-27T05:47:06Z",
    "generation": 4,
    "managedFields": [
      {
        "apiVersion": "config.openshift.io/v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:spec": {
            ".": {},
            "f:channel": {},
            "f:clusterID": {}
          }
        },
        "manager": "cluster-bootstrap",
        "operation": "Update",
        "time": "2022-04-27T05:47:06Z"
      },
      {
        "apiVersion": "config.openshift.io/v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:status": {
            ".": {},
            "f:availableUpdates": {},
            "f:capabilities": {
              ".": {},
              "f:enabledCapabilities": {},
              "f:knownCapabilities": {}
            },
            "f:desired": {
              ".": {},
              "f:image": {},
              "f:version": {}
            },
            "f:history": {},
            "f:observedGeneration": {},
            "f:versionHash": {}
          }
        },
        "manager": "cluster-version-operator",
        "operation": "Update",
        "subresource": "status",
        "time": "2022-04-27T05:47:10Z"
      },
      {
        "apiVersion": "config.openshift.io/v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:status": {
            "f:conditions": {}
          }
        },
        "manager": "curl",
        "operation": "Update",
        "subresource": "status",
        "time": "2022-04-27T12:42:23Z"
      }
    ],
    "name": "version",
    "resourceVersion": "165197",
    "uid": "f1924212-4134-4bfb-a860-b24d8e084bad"
  },
  "spec": {
    "channel": "stable-4.11",
    "clusterID": "09edcc03-502b-4f63-81f3-d307a002253f"
  },
  "status": {
    "availableUpdates": null,
    "capabilities": {
      "enabledCapabilities": [
        "baremetal",
        "marketplace",
        "openshift-samples"
      ],
      "knownCapabilities": [
        "baremetal",
        "marketplace",
        "openshift-samples"
      ]
    },
    "conditions": [
      {
        "lastTransitionTime": "2022-04-27T18:25:51Z",
        "message": "EtcdRecentBackup failed",
        "reason": "UpgradePreconditionCheckFailed",
        "status": "False",
        "type": "ReleaseAccepted"
      }
    ],
    "desired": {
      "image": "registry.ci.openshift.org/ocp/release@sha256:30452e14cbefed21f883ac38652b9dbaf653a922a1ca0efd6f3a1a10acfc2e1c",
      "version": "4.11.0-0.nightly-2022-04-26-181148"
    },
    "history": [
      {
        "completionTime": "2022-04-27T06:07:32Z",
        "image": "registry.ci.openshift.org/ocp/release@sha256:30452e14cbefed21f883ac38652b9dbaf653a922a1ca0efd6f3a1a10acfc2e1c",
        "startedTime": "2022-04-27T05:47:10Z",
        "state": "Completed",
        "verified": false,
        "version": "4.11.0-0.nightly-2022-04-26-181148"
      }
    ],
    "observedGeneration": 2,
    "versionHash": "QNLRulmodCo="
  }
}
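
Note that the response above contains only the single ReleaseAccepted condition: a JSON Patch "add" aimed at an existing object member replaces that member's value, so the whole conditions list is overwritten. The sketch below builds the same patch body in Python and applies it to a toy document to show that effect. The helper names are hypothetical, and apply_add_to_object is a deliberately simplified stand-in for a real JSON Patch implementation (object-member paths only).

```python
import json


def build_release_accepted_patch(status, reason, message, transition_time):
    """Build a JSON Patch document that sets /status/conditions."""
    return [
        {
            "op": "add",
            "path": "/status/conditions",
            "value": [
                {
                    "type": "ReleaseAccepted",
                    "status": status,
                    "reason": reason,
                    "message": message,
                    "lastTransitionTime": transition_time,
                }
            ],
        }
    ]


def apply_add_to_object(document, patch):
    """Apply "add" operations whose paths name object members only.

    Simplified for illustration: no array indices, no "~" escapes.
    """
    for op in patch:
        if op["op"] != "add":
            raise ValueError("only 'add' is supported in this sketch")
        keys = op["path"].lstrip("/").split("/")
        target = document
        for key in keys[:-1]:
            target = target[key]
        # "add" on an existing member replaces its value.
        target[keys[-1]] = op["value"]
    return document


patch = build_release_accepted_patch(
    "False",
    "UpgradePreconditionCheckFailed",
    "EtcdRecentBackup failed",
    "2022-04-27T18:25:51Z",
)
# Toy stand-in for a ClusterVersion object with one pre-existing condition.
cluster_version = {
    "status": {"conditions": [{"type": "Available", "status": "True"}]}
}
apply_add_to_object(cluster_version, patch)

print(json.dumps(patch))  # body that could be passed to curl -d
print([c["type"] for c in cluster_version["status"]["conditions"]])
```

After the patch is applied, the pre-existing Available condition in the toy document is gone and only ReleaseAccepted remains.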

# oc get co/etcd -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-04-27T05:47:10Z"
  generation: 1
  name: etcd
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: f1924212-4134-4bfb-a860-b24d8e084bad
  resourceVersion: "165237"
  uid: 1ebee225-51a9-4de3-9b2d-1a1c9d240a4c
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-04-27T05:59:50Z"
    message: |-
      NodeControllerDegraded: All master nodes are ready
      EtcdMembersDegraded: No unhealthy members found
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-04-27T06:10:04Z"
    message: |-
      EtcdMembersProgressing: No unstarted etcd members found
      NodeInstallerProgressing: 3 nodes are at revision 8
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-04-27T05:52:50Z"
    message: |-
      EtcdMembersAvailable: 3 members are available
      StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 8
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-04-27T05:50:19Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-04-27T12:42:29Z"
    message: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-04-27_124223
      on node "yanyang-0427a-j7zrw-master-0.c.openshift-qe.internal"
    reason: UpgradeBackupSuccessful
    status: "True"
    type: RecentBackup
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: etcds
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-etcd-operator
    resource: namespaces
  - group: ""
    name: openshift-etcd
    resource: namespaces
  versions:
  - name: raw-internal
    version: 4.11.0-0.nightly-2022-04-26-181148
  - name: etcd
    version: 4.11.0-0.nightly-2022-04-26-181148
  - name: operator
    version: 4.11.0-0.nightly-2022-04-26-181148

Etcd RecentBackup goes to True. Looks good to me.

Comment 6 Yang Yang 2022-04-28 02:30:36 UTC
Moving it to verified state based on comment#5.

Comment 9 errata-xmlrpc 2022-08-10 11:07:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069