Bug 2097431
| Field | Value |
|---|---|
| Summary | Degraded=True noise with: UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history |
| Product | OpenShift Container Platform |
| Component | Etcd |
| Reporter | W. Trevor King <wking> |
| Assignee | melbeher |
| QA Contact | ge liu <geliu> |
| Status | CLOSED ERRATA |
| Severity | low |
| Priority | medium |
| Version | 4.10 |
| Target Release | 4.12.0 |
| Hardware / OS | Unspecified / Unspecified |
| CC | aos-bugs, geliu, jiajliu, maxu, melbeher, wking, xxia |
| Doc Type | If docs needed, set a value |
| Clone Of | 2079803 |
| Cloned To | 2105146, 2105148 |
| Last Closed | 2023-01-17 19:50:02 UTC |
| Bug Depends On | 2079803 |
| Bug Blocks | 2091604, 2105146 |
Description
W. Trevor King, 2022-06-15 16:25:46 UTC
Today I launched a cluster with the latest payload, 4.11.0-0.nightly-2022-06-16-221335, and the installation failed with similar errors:

```
06-17 03:04:45.610 level=error msg=Cluster operator etcd Degraded is True with UpgradeBackupController_Error: UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-06-17 02:13:35 +0000 UTC <nil> 4.11.0-0.nightly-2022-06-16-221335 registry.ci.openshift.org/ocp/release@sha256:7d6c5e2594bd9d89592712c60f0af8f1ec750951c3ded3a16326551f431c8719 false }]
06-17 03:04:45.610 level=info msg=Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
06-17 03:04:45.610 level=info msg=Cluster operator monitoring Available is False with MultipleTasksFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
06-17 03:04:45.610 level=info msg=Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
06-17 03:04:45.610 level=error msg=Cluster operator monitoring Degraded is True with MultipleTasksFailed: Failed to rollout the stack. Error: updating alertmanager: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas
06-17 03:04:45.611 level=error msg=updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas
06-17 03:04:45.611 level=error msg=Cluster operator network Degraded is True with RolloutHung: DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2022-06-17T02:25:11Z
06-17 03:04:45.611 level=info msg=Cluster operator network ManagementStateDegraded is False with :
06-17 03:04:45.611 level=info msg=Cluster operator network Progressing is True with Deploying: DaemonSet "/openshift-sdn/sdn" is not available (awaiting 6 nodes)
```

Filtering for unhealthy cluster operators:

```
$ oc get co | grep -v "True .*False .*False"
NAME         VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd         4.11.0-0.nightly-2022-06-16-221335   True        False         True       59m     UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-06-17 02:13:35 +0000 UTC <nil> 4.11.0-0.nightly-2022-06-16-221335 registry.ci.openshift.org/ocp/release@sha256:7d6c5e2594bd9d89592712c60f0af8f1ec750951c3ded3a16326551f431c8719 false }]
monitoring                                        False       True          True       40m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network      4.11.0-0.nightly-2022-06-16-221335   True        True          True       62m     DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2022-06-17T02:25:11Z
```

I checked the other two cluster operators; they appear to show a separate issue, so I filed bug 2097954 to cover them. I also tried installations on several different platforms and did not hit this issue again, so I am closing this bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
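The Degraded message comes from the backup controller failing to find a Completed entry in ClusterVersion's status.history: during a fresh install the only history entry is still in the Partial state, as the log above shows. A minimal Go sketch of that lookup, using a simplified stand-in type rather than the actual cluster-etcd-operator or config/v1 code:

```go
package main

import (
	"errors"
	"fmt"
)

// UpdateHistory is a simplified stand-in for a ClusterVersion
// status.history entry; only the fields relevant here are included.
type UpdateHistory struct {
	State   string // "Completed" or "Partial"
	Version string
}

// completedVersion returns the most recent completed update, or an
// error like the one in this bug when the history holds only Partial
// entries (e.g. mid-install, before the first update finishes).
func completedVersion(history []UpdateHistory) (string, error) {
	for _, h := range history {
		if h.State == "Completed" {
			return h.Version, nil
		}
	}
	return "", errors.New("no completed update was found in cluster version status history")
}

func main() {
	// A fresh install's history: a single Partial entry, which is
	// exactly the window in which the controller reported Degraded.
	history := []UpdateHistory{
		{State: "Partial", Version: "4.11.0-0.nightly-2022-06-16-221335"},
	}
	if _, err := completedVersion(history); err != nil {
		fmt.Println(err)
	}
}
```

Under this reading, the noise is a timing issue rather than a real failure: once the install finishes and the history entry flips to Completed, the lookup succeeds and the condition clears.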
For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update) and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399