Bug 1991010
Summary: | Backport Ignore Degraded for cluster_operator_up | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Matthew Robson <mrobson> |
Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 4.3.z | CC: | aos-bugs, jokerman, wking, yanyang |
Target Milestone: | --- | ||
Target Release: | 4.7.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-10-20 19:33:06 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1834551 | ||
Bug Blocks: | | ||
Description
Matthew Robson
2021-08-06 19:34:16 UTC
Attempting the PR pre-merge verification process:

1. Launch a cluster with cluster-bot with the PR.

2. Check that there are no ClusterOperatorDown or ClusterOperatorDegraded alerts:

```
# token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDown").annotations.message'
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDegraded").annotations.message'
```

3. Degrade the authentication operator:

```
# cat <<EOF >oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: does-not-exist
      clientSecret:
        name: does-not-exist
      claims:
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
EOF
# oc apply -f oauth.yaml
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured
```

4. Check that the authentication operator goes degraded:

```
# oc get co authentication --watch
authentication   4.7.0-0.ci.test-2021-08-09-063231-ci-ln-ryvs38b-latest   True   False   False   31m
...
# oc get co | grep auth
authentication   4.7.0-0.ci.test-2021-08-09-063231-ci-ln-ryvs38b-latest   True   False   True    41m
```

5. Check the ClusterOperatorDown and ClusterOperatorDegraded alerts:

```
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDegraded").annotations.message'
Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable.
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDown").annotations.message'
```

Only ClusterOperatorDegraded fired. But the time in the message is incorrect; it had been degraded for only 10 minutes.

> But the time in the message is incorrect, it's been degraded only 10 minutes.

Hmm, not sure how that's possible given 'for: 30m' [1]. And bug 1969501 should already have verified the 30m response time for ClusterOperatorDegraded. So... I dunno? Maybe we can test again post-merge?

[1]: https://github.com/openshift/cluster-version-operator/pull/638/files#diff-fabad9e1d73a4f70c3d47836ed62e1982b1c6fbb947fce9a633b9cb0a98ecb24R90

Thanks Trevor! I can test it again after it goes to ON_QA. In my case, the authentication operator was in a Degraded=False state after install. I made it go degraded manually by applying an incorrect YAML file post-install, and then got the ClusterOperatorDegraded alert as expected. Whenever I check the ClusterOperatorDegraded alert message, it always tells me "has been degraded for 30 minutes", but in my case you can see that the authentication operator had been degraded for 10 minutes, more or less. I think the hardcoded 30 minutes makes sense during install, because installation generally takes 30 minutes or so. Do you think it makes sense post-install as well?
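The curl/jq pipeline used in the steps above simply filters the Prometheus `/api/v1/alerts` response by alert name. For readers less familiar with jq, here is a rough Python equivalent; the payload is a hand-written illustrative sample (not captured from a real cluster), and token handling is omitted:

```python
import json

# Illustrative sample of a Prometheus /api/v1/alerts response body.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {"alertname": "ClusterOperatorDegraded", "name": "authentication"},
        "annotations": {"message": "Cluster operator authentication has been degraded for 30 minutes."},
        "state": "firing",
        "activeAt": "2021-10-08T04:05:29Z"
      },
      {
        "labels": {"alertname": "Watchdog"},
        "annotations": {"message": "An alert that should always be firing."},
        "state": "firing",
        "activeAt": "2021-10-08T02:00:00Z"
      }
    ]
  }
}
""")

def alert_messages(payload, alertname):
    """Equivalent of:
    jq -r '.data.alerts[]| select(.labels.alertname=="<name>").annotations.message'
    """
    return [alert["annotations"]["message"]
            for alert in payload["data"]["alerts"]
            if alert["labels"].get("alertname") == alertname]

print(alert_messages(sample, "ClusterOperatorDegraded"))  # one firing alert
print(alert_messages(sample, "ClusterOperatorDown"))      # [] -- not firing
```

An empty list here corresponds to the empty output from the second curl in step 5: the alert is simply not firing.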
Bugzilla bot seems to have missed this one, and the PR landed in 4.7.25 [1]. Which was tombstoned [2]. But then this code shipped with 4.7.28 [3]:

```
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.7.28-x86_64 | grep cluster-version
  cluster-version-operator  https://github.com/openshift/cluster-version-operator  cc81827c1bfe322bd78d2fa0d9b34d532190d850
$ git --no-pager log --oneline -1 cc81827c1bfe322bd78d2fa0d9b34d532190d850
cc81827c Merge pull request #638 from wking/decouple-degraded-from-ClusterOperatorDown
```

So... I guess the upside is that it should be possible to verify now :p.

> Whenever I check the ClusterOperatorDegraded alert message, it always tells me "has been degraded for 30 minutes". In my case, you can see that the authentication operator had been degraded for 10 minutes more or less.

Moving to 30m was the separate bug 1969501, which should have shipped in 4.7.18. If it's firing more quickly than that, it seems like it would be a monitoring bug, because it's definitely 'for: 30m':

```
$ git cat-file -p cc81827c1bfe322bd78d2fa0d9b34d532190d850:install/0000_90_cluster-version-operator_02_servicemonitor.yaml | grep -A11 ClusterOperatorDegraded
    - alert: ClusterOperatorDegraded
      annotations:
        message: Cluster operator {{ "{{ $labels.name }}" }} has been degraded for 30 minutes. Operator is degraded because {{ "{{ $labels.reason }}" }} and cluster upgrades will be unstable.
      expr: |
        (
          cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
          or on (name)
          group by (name) (cluster_operator_up{job="cluster-version-operator"})
        ) == 1
      for: 30m
      labels:
        severity: warning
```

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.7.25
[2]: https://github.com/openshift/cincinnati-graph-data/pull/995
[3]: https://access.redhat.com/errata/RHSA-2021:3262

Verifying with 4.7.0-0.nightly-2021-10-07-212101.
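The interesting part of that rule is the `or on (name)` fallback in the expr. Informally (this is a sketch with made-up sample values, not the CVO's or Prometheus's actual evaluation), it behaves like this:

```python
# Informal sketch of the ClusterOperatorDegraded expr quoted above:
#
#   (cluster_operator_conditions{condition="Degraded"}
#    or on (name)
#    group by (name) (cluster_operator_up)) == 1
#
# Sample values below are made up for illustration.

# Degraded condition as reported per operator (authentication is the one
# degraded in this bug's reproducer; machine-config is healthy).
degraded_conditions = {"authentication": 1, "machine-config": 0}

# Operators with a cluster_operator_up series; `group by (name)` collapses
# each matching series to the constant value 1.
operator_up = {"authentication": 1, "machine-config": 1, "etcd": 1}

def degraded_alert_names(conditions, up):
    """For each name, prefer the Degraded-condition series; fall back to the
    grouped up series when no Degraded series exists; keep entries == 1."""
    union = {}
    for name in set(conditions) | set(up):
        union[name] = conditions.get(name, up.get(name))
    return sorted(name for name, value in union.items() if value == 1)

# authentication fires because Degraded=1. machine-config does not: its
# Degraded=0 series wins over the fallback. etcd (which, in this made-up
# sample, reports no Degraded condition at all) fires via the fallback, so
# an operator that fails to report the condition is treated as degraded.
print(degraded_alert_names(degraded_conditions, operator_up))
```

In other words, the fallback is what keeps the alert meaningful for operators whose Degraded series is missing entirely, while healthy operators are masked by their explicit Degraded=0 series.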
Tested with the procedure described in comment#1:

```
# date; oc get co | grep auth
Fri Oct 8 00:05:26 EDT 2021
authentication   4.7.0-0.nightly-2021-10-07-212101   True   False   True   99m
# date; curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang1008a.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDegraded")'
Fri Oct 8 00:35:26 EDT 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.5:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-76df8c8788-2fbhf",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-10-08T04:05:29.21303266Z",
  "value": "1e+00"
}
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang1008a.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDown")'
```

Only ClusterOperatorDegraded fired. Moving it to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.34 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3824
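A footnote on the "30 minutes" wording that caused confusion earlier in this bug: that text is a static annotation template, so the message reads "30 minutes" no matter how long the operator has actually been degraded. The real timing comes from the alert's activeAt field plus the rule's `for: 30m` pending delay. A small sketch using the timestamps from the verification output above (truncated to whole seconds):

```python
from datetime import datetime, timezone

# activeAt from the ClusterOperatorDegraded alert in the verification output,
# truncated to whole seconds (04:05:29 UTC).
active_at = datetime.fromisoformat("2021-10-08T04:05:29+00:00")
# Timestamp of the second `date` call (00:35:26 EDT == 04:35:26 UTC).
observed = datetime(2021, 10, 8, 4, 35, 26, tzinfo=timezone.utc)

elapsed = observed - active_at  # how long the alert expr has been true
print(f"expr has been true for ~{elapsed.total_seconds() / 60:.0f} minutes")

# With `for: 30m`, the alert spends its first 30 minutes in the "pending"
# state, so at 04:35 UTC it had only just transitioned to "firing". At that
# moment the static "30 minutes" in the message happens to match; checked an
# hour later, the message would still say 30 minutes.
```

This is consistent with Trevor's point that a ClusterOperatorDegraded alert firing earlier than 30 minutes after the condition went true would be a monitoring bug, not a CVO one.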