Bug 1957991
Summary: | ClusterOperatorDegraded can fire during installation | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> | |
Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> | |
Status: | CLOSED ERRATA | QA Contact: | Johnny Liu <jialiu> | |
Severity: | medium | Docs Contact: | ||
Priority: | high | |||
Version: | 4.1.z | CC: | aos-bugs, bleanhar, jiajliu, jokerman | |
Target Milestone: | --- | |||
Target Release: | 4.8.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: The cluster-version operator fired ClusterOperatorDegraded after 10 minutes of unhappy Degraded conditions on ClusterOperator resources. During installs, the ClusterOperator resources are pre-created by the cluster-version operator well before some of the second-level operators are running.
Consequence: Second-level operators that only become happy later in installs would trigger ClusterOperatorDegraded, because their ClusterOperator had a sad or missing Degraded condition for more than 10 minutes.
Fix: ClusterOperatorDegraded now requires 30 minutes of sad or missing Degraded conditions before it fires.
Result: With this phase of installation generally completing within 30 minutes, ClusterOperatorDegraded is now much less likely to fire prematurely. When second-level operators go Degraded post-install, we will alert administrators to that degradation within 30 minutes, which still seems sufficiently low-latency for that level of degradation.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1969501 | Environment: | ||
Last Closed: | 2021-07-27 23:07:23 UTC | Type: | Bug | |
Bug Blocks: | 1969501 |
Description
W. Trevor King
2021-05-06 23:56:43 UTC
*** Bug 1958792 has been marked as a duplicate of this bug. ***

Reproduced this bug with 4.8.0-fc.2 using the following steps.

1. Create a cluster.

2. Run the following command to try to catch ClusterOperatorDegraded alerts.

[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'

The command returns nothing.

[root@preserve-jialiu-ansible ~]# oc get co
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-fc.2   True        False         True       28h

3. Configure the authentication co with a wrong setting on purpose.

# cat <<EOF >oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: test
      clientSecret:
        name: test
      claims:
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
EOF
# oc apply -f oauth.yaml

4. Wait a few minutes and check for ClusterOperatorDegraded alerts again. This time a ClusterOperatorDegraded alert is pending.

[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.58.146:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-67cc6c7b4f-7x4wj",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "critical"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-05-11T07:38:29.21303266Z",
  "value": "1e+00"
}

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.2   True        False         28h     Cluster version is 4.8.0-fc.2

[root@preserve-jialiu-ansible ~]# oc get co
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-fc.2   True        False         True       28h

5. After about 10 minutes, the alert is firing.

[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.58.146:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-67cc6c7b4f-7x4wj",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "critical"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-05-11T07:38:29.21303266Z",
  "value": "1e+00"
}

[root@preserve-jialiu-ansible ~]# oc get co authentication
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-fc.2   True        False         True       28h

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.2   True        False         28h     Error while reconciling 4.8.0-fc.2: the cluster operator authentication is degraded

From https://prometheus-k8s-openshift-monitoring.apps.jialiu48.qe.devcluster.openshift.com/graph?g0.expr=ALERTS&g0.tab=0&g0.stacked=0&g0.range_input=2d, I can see the alert started pending at 07:38:59 and started firing at 07:50:30, about 10 minutes later.

Verified this bug with 4.8.0-0.nightly-2021-05-10-225140 using the steps in comment 4, passed.

[root@preserve-jialiu-ansible ~]# oc get PrometheusRule -n openshift-cluster-version -o yaml
<--snip-->
    - alert: ClusterOperatorDegraded
      annotations:
        message: Cluster operator {{ $labels.name }} has been degraded for 30 minutes. Operator is degraded because {{ $labels.reason }} and cluster upgrades will be unstable.
      expr: |
        (
          cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
          or on (name)
          group by (name) (cluster_operator_up{job="cluster-version-operator"})
        ) == 1
      for: 30m
      labels:
        severity: warning
<--snip-->

[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'

[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
Tue May 11 11:41:50 UTC 2021

[root@preserve-jialiu-ansible ~]# date -u; oc get co authentication; oc get clusterversion
Tue May 11 11:42:11 UTC 2021
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         False      132m
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS

[root@preserve-jialiu-ansible ~]# date -u; oc get co authentication; oc get clusterversion
Tue May 11 11:43:36 UTC 2021
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       133m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         137m    Cluster version is 4.8.0-0.nightly-2021-05-10-225140

[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
Tue May 11 11:43:44 UTC 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.53.0:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-74cc585456-ll8tz",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}

After waiting 10+ minutes, the alert is still in the "pending" state.

[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'; oc get co authentication; oc get clusterversion
Tue May 11 11:55:57 UTC 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.53.0:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-74cc585456-ll8tz",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       145m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         150m    Error while reconciling 4.8.0-0.nightly-2021-05-10-225140: the cluster operator authentication is degraded

After 30 minutes, the alert enters the "firing" state.

[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'; oc get co authentication; oc get clusterversion
Tue May 11 12:17:32 UTC 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.53.0:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-74cc585456-ll8tz",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       167m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         171m    Error while reconciling 4.8.0-0.nightly-2021-05-10-225140: the cluster operator authentication is degraded

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
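For anyone re-checking the timings in the verification above without a live cluster, here is a minimal Python sketch of the same check. It mirrors the jq filter used in the transcripts and computes the earliest time an alert can leave "pending" for a given `for:` duration. The helper names (`select_alerts`, `parse_active_at`, `earliest_firing`) and the second sample alert are illustrative assumptions, not part of any OpenShift or Prometheus tooling; the first sample alert is trimmed from the output quoted in this report.

```python
import re
from datetime import datetime, timedelta, timezone

# Sample payload shaped like the /api/v1/alerts response quoted in this report,
# trimmed to the fields used below. The second alert is an illustrative
# placeholder so the filter has something to exclude.
SAMPLE = {
    "data": {
        "alerts": [
            {
                "labels": {
                    "alertname": "ClusterOperatorDegraded",
                    "name": "authentication",
                    "severity": "warning",
                },
                "state": "pending",
                "activeAt": "2021-05-11T11:44:29.21303266Z",
            },
            {
                "labels": {"alertname": "SomeOtherAlert"},  # hypothetical
                "state": "firing",
                "activeAt": "2021-05-11T09:00:00Z",
            },
        ]
    }
}


def select_alerts(payload, alertname):
    """Rough equivalent of: jq '.data.alerts[] | select(.labels.alertname == NAME)'."""
    return [
        alert
        for alert in payload["data"]["alerts"]
        if alert["labels"].get("alertname") == alertname
    ]


def parse_active_at(value):
    """Parse an activeAt timestamp as UTC, dropping fractional seconds.

    Assumption: the timestamp is RFC 3339 with a trailing 'Z', as in the
    transcripts. Fractional seconds are discarded for simplicity.
    """
    match = re.match(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", value)
    if match is None:
        raise ValueError(f"unexpected timestamp: {value!r}")
    parsed = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def earliest_firing(alert, for_duration):
    """Earliest time a pending alert can start firing: activeAt + 'for' duration."""
    return parse_active_at(alert["activeAt"]) + for_duration


if __name__ == "__main__":
    for alert in select_alerts(SAMPLE, "ClusterOperatorDegraded"):
        fires_at = earliest_firing(alert, timedelta(minutes=30))
        print(alert["labels"]["name"], alert["state"], fires_at.isoformat())
        # prints: authentication pending 2021-05-11T12:14:29+00:00
```

With `for: 30m` the earliest firing time works out to 12:14:29 UTC, which matches the transcripts: the alert was still pending at 11:55:57 and was firing by 12:17:32. With the old 10-minute threshold the same alert could have fired from 11:54:29.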