Bug 1957991
| Summary: | ClusterOperatorDegraded can fire during installation | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> | |
| Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> | |
| Status: | CLOSED ERRATA | QA Contact: | Johnny Liu <jialiu> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | high | |||
| Version: | 4.1.z | CC: | aos-bugs, bleanhar, jiajliu, jokerman | |
| Target Milestone: | --- | |||
| Target Release: | 4.8.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | | Doc Type: | Bug Fix | |
| Doc Text: | Cause: The cluster-version operator fired ClusterOperatorDegraded after 10 minutes of unhappy Degraded conditions on ClusterOperator resources. During installs, the ClusterOperator resources are pre-created by the cluster-version operator well before some of the second-level operators are running. Consequence: Second-level operators that only become happy later in an install would have ClusterOperatorDegraded firing because their ClusterOperator had a sad or missing Degraded condition for more than 10 minutes. Fix: ClusterOperatorDegraded now requires 30 minutes of sad or missing Degraded conditions before it fires. Result: With this phase of installation generally completing within 30 minutes, ClusterOperatorDegraded is now much less likely to fire prematurely. When second-level operators go Degraded post-install, administrators will be alerted to that degradation within 30 minutes, which still seems sufficiently low-latency for that level of degradation. | | | |
| Story Points: | --- | | | |
| Clone Of: | | | | |
| : | 1969501 (view as bug list) | Environment: | | |
| Last Closed: | 2021-07-27 23:07:23 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1969501 | |||
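
The Doc Text above describes the relaxed threshold. As a quick sanity check on a running cluster, the deployed rule can be inspected directly; this is only a convenience sketch reusing the `oc get PrometheusRule` check from the verification comments below (the `grep -A 12` context width is an arbitrary guess at how many lines the rule spans):

```
# Dump the CVO's alerting rules and show the ClusterOperatorDegraded rule;
# on clusters carrying the fix this should include "for: 30m" and
# "severity: warning" rather than the old 10m/critical settings.
oc get prometheusrule -n openshift-cluster-version -o yaml \
  | grep -A 12 'alert: ClusterOperatorDegraded'
```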
*** Bug 1958792 has been marked as a duplicate of this bug. ***

Reproduced this bug with 4.8.0-fc.2 using the following steps.

1. Create a cluster.

2. Run the following command to try to catch ClusterOperatorDegraded alerts.

```
[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
```

This returns empty.

```
[root@preserve-jialiu-ansible ~]# oc get co
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-fc.2   True        False         True       28h
```

3. Configure the authentication ClusterOperator with a wrong setting on purpose.

```
# cat <<EOF >oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: test
      clientSecret:
        name: test
      claims:
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
EOF
# oc apply -f oauth.yaml
```

4. Wait a few minutes and catch the ClusterOperatorDegraded alerts again; this time a ClusterOperatorDegraded alert has started pending.

```
[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": { "alertname": "ClusterOperatorDegraded", "condition": "Degraded", "endpoint": "metrics", "instance": "10.0.58.146:9099", "job": "cluster-version-operator", "name": "authentication", "namespace": "openshift-cluster-version", "pod": "cluster-version-operator-67cc6c7b4f-7x4wj", "reason": "OAuthServerConfigObservation_Error", "service": "cluster-version-operator", "severity": "critical" },
  "annotations": { "message": "Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable." },
  "state": "pending",
  "activeAt": "2021-05-11T07:38:29.21303266Z",
  "value": "1e+00"
}
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.2   True        False         28h     Cluster version is 4.8.0-fc.2
[root@preserve-jialiu-ansible ~]# oc get co
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-fc.2   True        False         True       28h
```

5. After about 10 minutes, the alert starts firing.

```
[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": { "alertname": "ClusterOperatorDegraded", "condition": "Degraded", "endpoint": "metrics", "instance": "10.0.58.146:9099", "job": "cluster-version-operator", "name": "authentication", "namespace": "openshift-cluster-version", "pod": "cluster-version-operator-67cc6c7b4f-7x4wj", "reason": "OAuthServerConfigObservation_Error", "service": "cluster-version-operator", "severity": "critical" },
  "annotations": { "message": "Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable." },
  "state": "firing",
  "activeAt": "2021-05-11T07:38:29.21303266Z",
  "value": "1e+00"
}
[root@preserve-jialiu-ansible ~]# oc get co authentication
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-fc.2   True        False         True       28h
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.2   True        False         28h     Error while reconciling 4.8.0-fc.2: the cluster operator authentication is degraded
```

From https://prometheus-k8s-openshift-monitoring.apps.jialiu48.qe.devcluster.openshift.com/graph?g0.expr=ALERTS&g0.tab=0&g0.stacked=0&g0.range_input=2d, I can see the alert start pending at 07:38:59 and start firing at 07:50:30, about 10 minutes.
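
The reproduce steps above poll the alerts endpoint by hand; a small watch loop makes the pending-to-firing transition easier to catch. This is only a convenience sketch built from the same curl/jq pipeline used in the transcripts, with an arbitrary 30-second poll interval:

```
# Poll the in-cluster Prometheus alerts API and print the state of any
# ClusterOperatorDegraded alert roughly every 30 seconds.
HOST="$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')"
TOKEN="$(oc sa get-token prometheus-k8s -n openshift-monitoring)"
while true; do
  date -u
  curl -s -k -H "Authorization: Bearer ${TOKEN}" "https://${HOST}/api/v1/alerts" \
    | jq -r '.data.alerts[] | select(.labels.alertname == "ClusterOperatorDegraded") | "\(.state) since \(.activeAt)"'
  sleep 30
done
```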
Verified this bug with 4.8.0-0.nightly-2021-05-10-225140 using the steps in comment 4; passed.

```
[root@preserve-jialiu-ansible ~]# oc get PrometheusRule -n openshift-cluster-version -o yaml
<--snip-->
    - alert: ClusterOperatorDegraded
      annotations:
        message: Cluster operator {{ $labels.name }} has been degraded for 30 minutes.
          Operator is degraded because {{ $labels.reason }} and cluster upgrades
          will be unstable.
      expr: |
        (
          cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
          or on (name)
          group by (name) (cluster_operator_up{job="cluster-version-operator"})
        ) == 1
      for: 30m
      labels:
        severity: warning
<--snip-->
[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
Tue May 11 11:41:50 UTC 2021
[root@preserve-jialiu-ansible ~]# date -u; oc get co authentication; oc get clusterversion
Tue May 11 11:42:11 UTC 2021
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         False      132m
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
[root@preserve-jialiu-ansible ~]# date -u; oc get co authentication; oc get clusterversion
Tue May 11 11:43:36 UTC 2021
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       133m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         137m    Cluster version is 4.8.0-0.nightly-2021-05-10-225140
[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
Tue May 11 11:43:44 UTC 2021
{
  "labels": { "alertname": "ClusterOperatorDegraded", "condition": "Degraded", "endpoint": "metrics", "instance": "10.0.53.0:9099", "job": "cluster-version-operator", "name": "authentication", "namespace": "openshift-cluster-version", "pod": "cluster-version-operator-74cc585456-ll8tz", "reason": "OAuthServerConfigObservation_Error", "service": "cluster-version-operator", "severity": "warning" },
  "annotations": { "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable." },
  "state": "pending",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}
```

After waiting 10+ minutes, the alert is still in the "pending" state.

```
[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'; oc get co authentication; oc get clusterversion
Tue May 11 11:55:57 UTC 2021
{
  "labels": { "alertname": "ClusterOperatorDegraded", "condition": "Degraded", "endpoint": "metrics", "instance": "10.0.53.0:9099", "job": "cluster-version-operator", "name": "authentication", "namespace": "openshift-cluster-version", "pod": "cluster-version-operator-74cc585456-ll8tz", "reason": "OAuthServerConfigObservation_Error", "service": "cluster-version-operator", "severity": "warning" },
  "annotations": { "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable." },
  "state": "pending",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       145m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         150m    Error while reconciling 4.8.0-0.nightly-2021-05-10-225140: the cluster operator authentication is degraded
```

After 30 minutes, the alert goes into the "firing" state.

```
[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'; oc get co authentication; oc get clusterversion
Tue May 11 12:17:32 UTC 2021
{
  "labels": { "alertname": "ClusterOperatorDegraded", "condition": "Degraded", "endpoint": "metrics", "instance": "10.0.53.0:9099", "job": "cluster-version-operator", "name": "authentication", "namespace": "openshift-cluster-version", "pod": "cluster-version-operator-74cc585456-ll8tz", "reason": "OAuthServerConfigObservation_Error", "service": "cluster-version-operator", "severity": "warning" },
  "annotations": { "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable." },
  "state": "firing",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       167m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         171m    Error while reconciling 4.8.0-0.nightly-2021-05-10-225140: the cluster operator authentication is degraded
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
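
For reference alongside the rule shown in the verification transcripts above: ClusterOperatorDegraded keys off the cluster_operator_conditions metric (with cluster_operator_up as a fallback), so the raw signal behind the alert can also be checked directly. This is a rough sketch only, reusing the same token and route plumbing as the transcripts and the standard Prometheus /api/v1/query endpoint; the PromQL comparison is just an illustration:

```
# Query the metric the ClusterOperatorDegraded rule is built on; any series
# returned here names a ClusterOperator whose Degraded condition is currently set.
HOST="$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')"
TOKEN="$(oc sa get-token prometheus-k8s -n openshift-monitoring)"
curl -s -k -H "Authorization: Bearer ${TOKEN}" "https://${HOST}/api/v1/query" \
  --data-urlencode 'query=cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} == 1' \
  | jq -r '.data.result[] | "\(.metric.name) \(.value[1])"'
```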
During install, the CVO has pushed manifests into the cluster as fast as possible, without blocking on "has the in-cluster resource leveled?", since way back in [1]. That can lead to ClusterOperatorDown and ClusterOperatorDegraded firing during install, as we see in [2], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [3].
* Install completed at 5:09:58Z [4].
* ClusterOperatorDegraded started firing at 5:10:04Z [3].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [3].
* The e2e suite complained about [2]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a separate bug. For ClusterOperatorDegraded, the Degraded condition should not be particularly urgent [5], so we should be fine bumping it to 'warning' and using 'for: 30m' or something more relaxed than the current 10m.

[1]: https://github.com/openshift/cluster-version-operator/pull/136
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[3]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776 , with the query: group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[5]: https://github.com/openshift/api/pull/916
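
The promecieus link in [3] runs `group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})` against the CI job's Prometheus data. A rough equivalent against a live cluster's in-cluster Prometheus, reusing the token and route plumbing from the QA transcripts above, would look something like the sketch below; the one-hour window and 15s step are arbitrary placeholders to adjust:

```
# Look at which alert states (pending/firing) ClusterOperatorDegraded passed
# through over the last hour, via the Prometheus range-query API.
HOST="$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')"
TOKEN="$(oc sa get-token prometheus-k8s -n openshift-monitoring)"
curl -s -k -H "Authorization: Bearer ${TOKEN}" "https://${HOST}/api/v1/query_range" \
  --data-urlencode 'query=group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})' \
  --data-urlencode "start=$(date -u -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=15s' \
  | jq -r '.data.result[].metric.alertstate'
```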