Bug 2097431 - Degraded=True noise with: UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.12.0
Assignee: melbeher
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 2079803
Blocks: 2091604 2105146
 
Reported: 2022-06-15 16:25 UTC by W. Trevor King
Modified: 2023-01-17 19:50 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2079803
Clones: 2105146 2105148
Environment:
Last Closed: 2023-01-17 19:50:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 857 0 None Merged Bug 2097431: fix degraded missing cluster version 2022-07-08 07:19:52 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:50:34 UTC

Description W. Trevor King 2022-06-15 16:25:46 UTC
The fix for bug 2079803 included a ClusterVersion fetch and history inspection.  But until installation completes, there are no completed history entries, so installation failures that include entries like [1]:

  level=error msg=Cluster operator etcd Degraded is True with UpgradeBackupController_Error: UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-06-08 19:26:01 +0000 UTC <nil> 4.11.0-0.ci.test-2022-06-08-190030-ci-op-zq5cm5gx-initial registry.build02.ci.openshift.org/ci-op-zq5cm5gx/release@sha256:e08abf8ba61271954f9b785a4cbdf6571723b925872b05fd9f4d3ecc1dc6e135 false }]

may be distracting to users trying to understand a failed install.

We can fix this by reading the current version from the etcd operator's OPERATOR_IMAGE_VERSION environment variable [2] instead.

I'm filing a new bug for this, because bug 2079803 has already been backported to 4.10.z as bug 2091604, and that bug is likely to ship in this week's 4.10.z (likely to be named 4.10.19, but not actually built yet).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3167/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1534611208269729792#1:build-log.txt%3A84
[2]: https://github.com/openshift/cluster-etcd-operator/blob/28a4ae406ff736b00af68c4f4d249319d62e48dd/manifests/0000_20_etcd-operator_06_deployment.yaml#L71-L72

Comment 1 Xingxing Xia 2022-06-17 04:13:18 UTC
Today I launched a cluster with the latest payload, 4.11.0-0.nightly-2022-06-16-221335; installation failed with similar errors:

06-17 03:04:45.610  level=error msg=Cluster operator etcd Degraded is True with UpgradeBackupController_Error: UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-06-17 02:13:35 +0000 UTC <nil> 4.11.0-0.nightly-2022-06-16-221335 registry.ci.openshift.org/ocp/release@sha256:7d6c5e2594bd9d89592712c60f0af8f1ec750951c3ded3a16326551f431c8719 false }]
06-17 03:04:45.610  level=info msg=Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required

06-17 03:04:45.610  level=info msg=Cluster operator monitoring Available is False with MultipleTasksFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
06-17 03:04:45.610  level=info msg=Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
06-17 03:04:45.610  level=error msg=Cluster operator monitoring Degraded is True with MultipleTasksFailed: Failed to rollout the stack. Error: updating alertmanager: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas
06-17 03:04:45.611  level=error msg=updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas

06-17 03:04:45.611  level=error msg=Cluster operator network Degraded is True with RolloutHung: DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2022-06-17T02:25:11Z
06-17 03:04:45.611  level=info msg=Cluster operator network ManagementStateDegraded is False with : 
06-17 03:04:45.611  level=info msg=Cluster operator network Progressing is True with Deploying: DaemonSet "/openshift-sdn/sdn" is not available (awaiting 6 nodes)

oc get co | grep -v "True .*False .*False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd                                       4.11.0-0.nightly-2022-06-16-221335   True        False         True       59m     UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-06-17 02:13:35 +0000 UTC <nil> 4.11.0-0.nightly-2022-06-16-221335 registry.ci.openshift.org/ocp/release@sha256:7d6c5e2594bd9d89592712c60f0af8f1ec750951c3ded3a16326551f431c8719 false }]
monitoring                                                                      False       True          True       40m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.11.0-0.nightly-2022-06-16-221335   True        True          True       62m     DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2022-06-17T02:25:11Z

I checked the other 2 clusteroperators; they seem to show a separate issue, so I filed bug 2097954 for them.

Comment 3 ge liu 2022-07-05 10:01:13 UTC
Tried several installations covering different platforms and have not hit this issue; closing it.

Comment 9 errata-xmlrpc 2023-01-17 19:50:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

