+++ This bug was initially created as a clone of Bug #1957991 +++

During install, the CVO has pushed manifests into the cluster as fast as possible without blocking on "has the in-cluster resource leveled?" since way back in [1]. That can lead to ClusterOperatorDown and ClusterOperatorDegraded firing during install, as we see in [2], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [3].
* Install completed at 5:09:58Z [4].
* ClusterOperatorDegraded started firing at 5:10:04Z [3].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [3].
* The e2e suite complained about [2]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a separate bug. For ClusterOperatorDegraded, the degraded condition should not be particularly urgent [5], so we should be fine bumping it to 'warning' and using 'for: 30m' or something more relaxed than the current 10m.

[1]: https://github.com/openshift/cluster-version-operator/pull/136
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[3]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776 with the query: group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[5]: https://github.com/openshift/api/pull/916
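For illustration, the proposed relaxation might look like the following alerting-rule fragment. This is a sketch only: the expression is an assumption based on the CVO's cluster_operator_conditions metric and the alert text quoted later in this bug, not a copy of the merged rule.

```yaml
# Sketch, not the shipped manifest: severity bumped from critical to
# warning, and the pending window widened from 10m to 30m.
- alert: ClusterOperatorDegraded
  expr: |
    cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} == 1
  for: 30m
  labels:
    severity: warning
  annotations:
    message: Cluster operator {{ $labels.name }} has been degraded for 30 minutes.
```

With 'for: 30m', a Degraded condition that clears during the install window (as in [2] above) would never leave the pending state, so the e2e suite would not see the alert fire.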
Checked a cluster launched by cluster-bot: 4.7.0-0.nightly,openshift/cluster-version-operator#587

The authentication operator went Degraded, and the ClusterOperatorDegraded alert was raised with severity warning:

# oc get co authentication -ojson | jq -r '.status.conditions[]|select(.type=="Degraded").status'
True

{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.4:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-65cbf4cf85-vxqn8",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-06-10T03:47:29.21303266Z",
  "value": "1e+00"
}

After 10 minutes the alert was still pending, and 20 minutes later it was firing:

# curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://$(oc get route prometheus-k8s -n openshift-monitoring --no-headers | awk '{print $2}')/api/v1/alerts | jq -r '.data.alerts[]|select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.4:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-65cbf4cf85-vxqn8",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-06-10T03:47:29.21303266Z",
  "value": "1e+00"
}
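The observed pending-to-firing interval can be sanity-checked against the new 30-minute window from the activeAt timestamp. A minimal sketch using GNU date; the firing timestamp here is a hypothetical value 30 minutes after the (rounded) activeAt above, not one taken from the cluster:

```shell
# activeAt rounded from the alert JSON above; fired_at is a hypothetical
# timestamp exactly 30 minutes later, matching the 'for: 30m' window.
active_at="2021-06-10T03:47:29Z"
fired_at="2021-06-10T04:17:29Z"
pending=$(( $(date -u -d "$fired_at" +%s) - $(date -u -d "$active_at" +%s) ))
echo "pending for ${pending}s"   # → pending for 1800s
```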
# oc adm release info registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-06-12-151209 --commits | grep cluster-version
cluster-version-operator  https://github.com/openshift/cluster-version-operator  f3d25082a09312718718fa3a85b8aba8b4574781

# git log --date local --pretty="%h %an %cd - %s" f3d2508 | grep '#587'
a0eacf89 OpenShift Merge Robot Fri Jun 11 17:48:32 2021 - Merge pull request #587 from wking/ClusterOperatorDegraded-softening

The PR was included in 4.7.0-0.nightly-2021-06-12-151209. The bug was verified pre-merge (comment #1), but the bot did not move it to "verified" automatically, so I am changing the status manually.
OpenShift engineering has decided not to ship Red Hat OpenShift Container Platform 4.7.17 due to a regression, https://bugzilla.redhat.com/show_bug.cgi?id=1973006. All the fixes which were part of 4.7.17 will now be part of 4.7.18, which is planned to be available in the candidate channel on June 23, 2021, and in the fast channel on June 28.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2502