Bug 1969501
Summary: | ClusterOperatorDegraded can fire during installation | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
Status: | CLOSED ERRATA | QA Contact: | liujia <jiajliu> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 4.1.z | CC: | aos-bugs, bleanhar, jiajliu, jialiu, jokerman |
Target Milestone: | --- | ||
Target Release: | 4.7.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: |
Cause: The cluster-version operator fired ClusterOperatorDegraded after 10 minutes of unhappy Degraded conditions on ClusterOperator resources. During installation, the ClusterOperator resources are pre-created by the cluster-version operator well before some of the second-level operators are running.
Consequence: Second-level operators that only become happy later in installation would have ClusterOperatorDegraded fire because their ClusterOperator had a sad or missing Degraded condition for more than 10 minutes.
Fix: ClusterOperatorDegraded now requires 30 minutes of sad or missing Degraded conditions before it fires.
Result: With this phase of installation generally completing within 30 minutes, ClusterOperatorDegraded is now much less likely to fire prematurely. When second-level operators go Degraded post-install, administrators are alerted to that degradation within 30 minutes, which is still sufficiently low-latency for that level of degradation.
|
Story Points: | --- |
Clone Of: | 1957991 | Environment: | |
Last Closed: | 2021-06-29 04:20:14 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1957991 | ||
Bug Blocks: |
Description
W. Trevor King
2021-06-08 14:15:39 UTC
Checked a cluster launched by cluster-bot with 4.7.0-0.nightly,openshift/cluster-version-operator#587. The authentication operator went Degraded, and the ClusterOperatorDegraded alert was enabled with severity warning:

```
# oc get co authentication -ojson | jq -r '.status.conditions[]|select(.type=="Degraded").status'
True
```

The alert from the Prometheus API, initially pending:

```json
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.4:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-65cbf4cf85-vxqn8",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-06-10T03:47:29.21303266Z",
  "value": "1e+00"
}
```

After 10 minutes the alert was still pending; 20 minutes later it was firing:

```
# curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://$(oc get route prometheus-k8s -n openshift-monitoring --no-headers|awk '{print $2}')/api/v1/alerts | jq -r '.data.alerts[]|select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": { ... identical to the pending alert above ... },
  "annotations": { ... identical to the pending alert above ... },
  "state": "firing",
  "activeAt": "2021-06-10T03:47:29.21303266Z",
  "value": "1e+00"
}
```

Confirming the fix is in the release image:

```
# oc adm release info registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-06-12-151209 --commits | grep cluster-version
cluster-version-operator https://github.com/openshift/cluster-version-operator f3d25082a09312718718fa3a85b8aba8b4574781
# git log --date local --pretty="%h %an %cd - %s" f3d2508 | grep '#587'
a0eacf89 OpenShift Merge Robot Fri Jun 11 17:48:32 2021 - Merge pull request #587 from wking/ClusterOperatorDegraded-softening
```

The PR was included in 4.7.0-0.nightly-2021-06-12-151209. The bug was verified via pre-merge testing (comment#1), but the bot did not move it to VERIFIED automatically, so the status was changed manually.

OpenShift engineering has decided not to ship Red Hat OpenShift Container Platform 4.7.17 due to a regression (https://bugzilla.redhat.com/show_bug.cgi?id=1973006). All the fixes that were part of 4.7.17 will now be part of 4.7.18, planned to be available in the candidate channel on June 23, 2021 and in the fast channel on June 28.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502
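The pending-versus-firing check above can be condensed into a small filter. This is only a sketch: the `alerts.json` file and its contents are a trimmed, hypothetical payload mirroring the structure of the `/api/v1/alerts` response shown in the verification comment; against a live cluster you would pipe the `curl` output into `jq` instead.

```shell
# Sketch: summarize ClusterOperatorDegraded alerts by operator name and state.
# alerts.json below is a trimmed sample payload (an assumption for this
# example), not output captured from a real cluster.
cat > alerts.json <<'EOF'
{"data":{"alerts":[
  {"labels":{"alertname":"ClusterOperatorDegraded","name":"authentication",
             "reason":"OAuthServerConfigObservation_Error","severity":"warning"},
   "state":"firing","activeAt":"2021-06-10T03:47:29Z"},
  {"labels":{"alertname":"Watchdog","severity":"none"},"state":"firing"}
]}}
EOF
# Keep only ClusterOperatorDegraded, print name, state, and activation time.
jq -r '.data.alerts[]
       | select(.labels.alertname == "ClusterOperatorDegraded")
       | [.labels.name, .state, .activeAt] | @tsv' alerts.json
```

For the sample payload this prints a single tab-separated line for the authentication operator; the unrelated Watchdog alert is filtered out by the `select`.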