Bug 1952266

Summary: etcd operator bumps status.version[name=operator] before operands update
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Etcd Assignee: Sam Batschelet <sbatsche>
Status: CLOSED ERRATA QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: high    
Version: 4.8 CC: sbatsche, wlewis
Target Milestone: --- Keywords: Upgrades
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: On upgrade, the etcd operator immediately marked itself as fully upgraded, before it began rolling out the 4.7 versions of the openshift-etcd pods.
Consequence: If the etcd pod upgrade failed, the cluster might mistakenly report that it had fully upgraded to the new version even though some nodes were still running an older etcd. (The etcd operator would be Degraded in this case, but it would mistakenly report that it was on the newer version and Degraded, rather than on the older version and Degraded.)
Fix: The etcd operator now waits for etcd to be upgraded before declaring itself upgraded.
Result: Version reporting should be correct, and upgrades should proceed in the proper sequence.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:02:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description W. Trevor King 2021-04-21 21:52:29 UTC
Similar to bug 1928157 and bug 1952174, but different operator.  Example update from 4.7.8 to 4.8.0-0.ci-2021-04-21-123839 [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1384851693719523328/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'I clusteroperator/etcd.*versions'
  Apr 21 13:31:30.629 I clusteroperator/etcd versions: operator 4.7.8 -> 4.8.0-0.ci-2021-04-21-123839, raw-internal 4.7.8 -> 4.8.0-0.ci-2021-04-21-123839
  Apr 21 13:35:29.825 I clusteroperator/etcd versions: etcd 4.7.8 -> 4.8.0-0.ci-2021-04-21-123839

But from [2]:

  An operator reports a new "operator" version when it has rolled out the new version to all of its operands.

The operator should delay the 'operator' version bump until the operands have all leveled.  This isn't as bad as the bug 1952174 case, because the etcd operator is asking the cluster-version operator to wait on both the 'operator' and 'etcd' entries [3,4].  So we're still waiting on you to update.  But things like the cluster_operator_conditions metrics will be confused about what version the etcd component is at until this bug gets fixed.
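
The requested behavior can be sketched roughly as follows. This is a minimal illustration in Python, not the cluster-etcd-operator's actual Go code; the function name `next_operator_version` and the node names in the usage lines are hypothetical.

```python
from typing import Mapping


def next_operator_version(
    current_operator: str,
    target: str,
    operand_versions: Mapping[str, str],
) -> str:
    """Return the version to publish under status.versions[name=operator].

    operand_versions maps an operand instance (here, a hypothetical node
    name) to the version that instance is currently running.
    """
    # Only report the target once every operand has leveled at it.
    # An empty map is treated as "not leveled" so the bump is never early.
    if operand_versions and all(v == target for v in operand_versions.values()):
        return target
    # Otherwise keep reporting the version that is actually rolled out.
    return current_operator


target = "4.8.0-0.ci-2021-04-21-123839"
mid_rollout = {"master-0": target, "master-1": "4.7.8", "master-2": "4.7.8"}
leveled = {"master-0": target, "master-1": target, "master-2": target}

print(next_operator_version("4.7.8", target, mid_rollout))
print(next_operator_version("4.7.8", target, leveled))
```

With this shape, a failed or stalled etcd rollout leaves the 'operator' entry at the old version, so anything reading the ClusterOperator sees the version that is really in place.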

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1384851693719523328
[2]: https://github.com/openshift/api/blob/a99ffa1cac6709edf8f502b16890b16f9a557e00/config/v1/types_cluster_operator.go#L43-L47
[3]: https://github.com/openshift/cluster-etcd-operator/blob/a254ec3cfafed3cc4787cbf511070d6e5dd1517c/manifests/0000_12_etcd-operator_07_clusteroperator.yaml#L10-L15
[4]: https://github.com/openshift/cluster-version-operator/blob/6fdd1e0f313f9c67ddf93037a0d4e17ce62e89ab/docs/user/reconciliation.md#clusteroperator

Comment 2 ge liu 2021-04-29 10:15:23 UTC
Verified: the clusteroperator (co) version is not updated while the upgrade is still in progress.
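
One possible way to check this ordering mechanically is to scan e2e.log for the "clusteroperator/etcd versions" lines quoted in the description. The sketch below is not the procedure used for this verification. It assumes the log line format shown in the description, assumes that 'etcd' is the operand entry to compare against, and uses hypothetical helper names.

```python
import re

# Matches lines such as:
#   Apr 21 13:31:30.629 I clusteroperator/etcd versions: operator 4.7.8 -> 4.8.0-...
LINE_RE = re.compile(
    r"(\w{3} +\d{1,2} [\d:.]+) I clusteroperator/(\S+) versions: (.*)"
)
CHANGE_RE = re.compile(r"(\S+) (\S+) -> (\S+)")


def version_bumps(log_text, cluster_operator="etcd"):
    """Return (line_number, timestamp, name, old, new) for each version change."""
    bumps = []
    for line_number, line in enumerate(log_text.splitlines()):
        match = LINE_RE.search(line)
        if not match or match.group(2) != cluster_operator:
            continue
        # One log line can carry several comma-separated changes.
        for part in match.group(3).split(", "):
            change = CHANGE_RE.fullmatch(part.strip())
            if change:
                bumps.append((line_number, match.group(1)) + change.groups())
    return bumps


def operator_bumped_first(bumps, operand="etcd"):
    """True if 'operator' changed on an earlier log line than the operand."""
    first = {}
    for line_number, _timestamp, name, _old, _new in bumps:
        first.setdefault(name, line_number)
    if "operator" not in first or operand not in first:
        return False
    return first["operator"] < first[operand]


sample = """\
Apr 21 13:31:30.629 I clusteroperator/etcd versions: operator 4.7.8 -> 4.8.0-0.ci-2021-04-21-123839, raw-internal 4.7.8 -> 4.8.0-0.ci-2021-04-21-123839
Apr 21 13:35:29.825 I clusteroperator/etcd versions: etcd 4.7.8 -> 4.8.0-0.ci-2021-04-21-123839
"""
print(operator_bumped_first(version_bumps(sample)))
```

The sample is the excerpt from the description, so the check flags it. A log from a fixed build would be expected not to show 'operator' changing on an earlier line than 'etcd'.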

Comment 6 errata-xmlrpc 2021-07-27 23:02:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438