During install, the CVO has pushed manifests into the cluster as fast as possible, without blocking on "has the in-cluster resource leveled?", since way back in [1]. That can lead to ClusterOperatorDown and ClusterOperatorDegraded firing during install, as we see in [2], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [3].
* Install completed at 5:09:58Z [4].
* ClusterOperatorDegraded started firing at 5:10:04Z [3].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [3].
* The e2e suite complained about [2]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a separate bug. For ClusterOperatorDegraded, the Degraded condition should not be particularly urgent [5], so we should be fine bumping the alert to 'warning' severity and using 'for: 30m', or something else more relaxed than the current 10m.

[1]: https://github.com/openshift/cluster-version-operator/pull/136
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[3]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776 group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[5]: https://github.com/openshift/api/pull/916
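The timeline above is just Prometheus's `for:` handling at work: once the rule expression evaluates true, the alert sits in "pending" until the expression has held for the full `for:` window, then flips to "firing". A minimal sketch of that transition (the helper is hypothetical, not CVO code; seconds are approximate, since Prometheus only re-evaluates at each rule-evaluation interval):

```python
def alert_state(held_seconds, for_seconds):
    """Return the Prometheus alert state for a rule whose expression has
    evaluated true for `held_seconds`, under a `for: for_seconds` clause.
    """
    if held_seconds <= 0:
        return "inactive"
    # Pending until the expression has held for the full `for:` window,
    # firing afterwards.
    return "firing" if held_seconds >= for_seconds else "pending"

# With the original `for: 10m` (600s) rule, an alert pending since
# 5:00:15Z is ~583 seconds old when install completes at 5:09:58Z...
assert alert_state(583, 600) == "pending"
# ...and fires within one evaluation interval of the 10-minute mark,
# which is why the e2e suite saw it firing at 5:10:04Z.
assert alert_state(600, 600) == "firing"
```

Bumping `for:` to 30m simply widens that pending window, so a condition that clears shortly after install never reaches "firing".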
*** Bug 1958792 has been marked as a duplicate of this bug. ***
Reproduced this bug with 4.8.0-fc.2 using the following steps.

1. Create a cluster.

2. Run the following command to try to catch ClusterOperatorDegraded alerts:

[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'

This returns empty.

[root@preserve-jialiu-ansible ~]# oc get co
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-fc.2   True        False         True       28h

3. Configure the authentication cluster operator with a wrong setting on purpose:

# cat <<EOF >oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: test
      clientSecret:
        name: test
      claims:
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
EOF
# oc apply -f oauth.yaml

4.
Wait some minutes, then catch ClusterOperatorDegraded alerts again. This time a ClusterOperatorDegraded alert has started pending:

[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.58.146:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-67cc6c7b4f-7x4wj",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "critical"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-05-11T07:38:29.21303266Z",
  "value": "1e+00"
}
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.2   True        False         28h     Cluster version is 4.8.0-fc.2
[root@preserve-jialiu-ansible ~]# oc get co
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-fc.2   True        False         True       28h

5. After about 10 minutes, the alert starts firing:
[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.58.146:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-67cc6c7b4f-7x4wj",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "critical"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-05-11T07:38:29.21303266Z",
  "value": "1e+00"
}
[root@preserve-jialiu-ansible ~]# oc get co authentication
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-fc.2   True        False         True       28h
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.2   True        False         28h     Error while reconciling 4.8.0-fc.2: the cluster operator authentication is degraded

From https://prometheus-k8s-openshift-monitoring.apps.jialiu48.qe.devcluster.openshift.com/graph?g0.expr=ALERTS&g0.tab=0&g0.stacked=0&g0.range_input=2d, I can see the alert started pending at 07:38:59 and started firing at 07:50:30, about 10 minutes later.
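The `jq` filter used throughout the reproduction above can also be expressed as a short Python sketch against the same `/api/v1/alerts` response shape (the sample payload below is abbreviated from the output above; fetching and bearer-token auth are out of scope here):

```python
import json

def degraded_alerts(payload):
    """Return all ClusterOperatorDegraded alerts from a Prometheus
    /api/v1/alerts response, mirroring the jq filter
    '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'.
    """
    return [
        alert
        for alert in payload.get("data", {}).get("alerts", [])
        if alert.get("labels", {}).get("alertname") == "ClusterOperatorDegraded"
    ]

# Abbreviated sample payload, shaped like the curl output above.
sample = json.loads("""
{
  "data": {
    "alerts": [
      {"labels": {"alertname": "ClusterOperatorDegraded", "name": "authentication"},
       "state": "firing"},
      {"labels": {"alertname": "Watchdog"},
       "state": "firing"}
    ]
  }
}
""")

matches = degraded_alerts(sample)
print(len(matches), matches[0]["labels"]["name"])  # → 1 authentication
```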
Verified this bug with 4.8.0-0.nightly-2021-05-10-225140 using the steps in comment 4; passed.

[root@preserve-jialiu-ansible ~]# oc get PrometheusRule -n openshift-cluster-version -o yaml
<--snip-->
    - alert: ClusterOperatorDegraded
      annotations:
        message: Cluster operator {{ $labels.name }} has been degraded for 30 minutes.
          Operator is degraded because {{ $labels.reason }} and cluster upgrades
          will be unstable.
      expr: |
        (
          cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
          or on (name)
          group by (name) (cluster_operator_up{job="cluster-version-operator"})
        ) == 1
      for: 30m
      labels:
        severity: warning
<--snip-->

[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
Tue May 11 11:41:50 UTC 2021
[root@preserve-jialiu-ansible ~]# date -u; oc get co authentication; oc get clusterversion
Tue May 11 11:42:11 UTC 2021
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         False      132m
NAME   VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
[root@preserve-jialiu-ansible ~]# date -u; oc get co authentication; oc get clusterversion
Tue May 11 11:43:36 UTC 2021
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       133m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         137m
Cluster version is 4.8.0-0.nightly-2021-05-10-225140

[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
Tue May 11 11:43:44 UTC 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.53.0:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-74cc585456-ll8tz",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}

Wait for 10+ minutes; the alert is still in "pending" state:
[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'; oc get co authentication; oc get clusterversion
Tue May 11 11:55:57 UTC 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.53.0:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-74cc585456-ll8tz",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       145m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         150m    Error while reconciling 4.8.0-0.nightly-2021-05-10-225140: the cluster operator authentication is degraded

After 30 minutes, the alert gets into "firing" state:
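As a sanity check on the new `for: 30m` window, the earliest expected firing time is just `activeAt` plus 30 minutes. A small sketch of that arithmetic (the parsing helper is hypothetical, written to cope with the 8-digit fractional seconds Prometheus emits above, which exceed the 6 digits `datetime` supports):

```python
from datetime import datetime, timedelta, timezone

def parse_active_at(ts):
    """Parse a Prometheus activeAt timestamp such as
    '2021-05-11T11:44:29.21303266Z', truncating fractional seconds
    to the microsecond precision datetime supports."""
    head, frac = ts.rstrip("Z").split(".")
    dt = datetime.strptime(head, "%Y-%m-%dT%H:%M:%S")
    return dt.replace(microsecond=int(frac[:6].ljust(6, "0")), tzinfo=timezone.utc)

active_at = parse_active_at("2021-05-11T11:44:29.21303266Z")
expected_firing = active_at + timedelta(minutes=30)
# 11:44:29Z + 30m = 12:14:29Z, consistent with the alert being observed
# still pending at 11:55:57Z and firing by the 12:17:32Z check.
print(expected_firing.isoformat())  # → 2021-05-11T12:14:29.213032+00:00
```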
[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'; oc get co authentication; oc get clusterversion
Tue May 11 12:17:32 UTC 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.53.0:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-74cc585456-ll8tz",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       167m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         171m    Error while reconciling 4.8.0-0.nightly-2021-05-10-225140: the cluster operator authentication is degraded
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438