Description of problem:

Backport https://github.com/openshift/cluster-version-operator/pull/550 for 4.7.z.

The customer is getting combined alerts for ClusterOperatorDegraded and ClusterOperatorDown due to ImagePrunerJobFailed events, and currently both of these are critical. ClusterOperatorDegraded was moved to warning in 4.7, but without PR 550 both alerts will still fire when the operator is Available=True and Degraded=True. ImagePrunerJobFailed is known to happen from time to time for a variety of reasons; triggering ClusterOperatorDegraded makes sense, but not the critical ClusterOperatorDown alert. With the introduction of ignoreInvalidImageReferences in OCP 4.6, even when an ImagePrunerJobFailed occurs, the pruner keeps going and the failure can be revisited as necessary.

alertname = ClusterOperatorDegraded
condition = Degraded
endpoint = metrics
instance = 142.34.151.135:9099
job = cluster-version-operator
name = image-registry
namespace = openshift-cluster-version
pod = cluster-version-operator-5f4559dfbb-sxjcv
prometheus = openshift-monitoring/k8s
reason = ImagePrunerJobFailed
service = cluster-version-operator
severity = critical

Annotations
message = Cluster operator image-registry has been degraded for 10 minutes. Operator is degraded because ImagePrunerJobFailed and cluster upgrades will be unstable.

alertname = ClusterOperatorDown
endpoint = metrics
instance = 142.34.151.135:9099
job = cluster-version-operator
name = image-registry
namespace = openshift-cluster-version
pod = cluster-version-operator-5f4559dfbb-sxjcv
prometheus = openshift-monitoring/k8s
service = cluster-version-operator
severity = critical
version = 4.6.25

Annotations
message = Cluster operator image-registry has not been available for 10 minutes. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.

Actual results:
Both alerts fire, with ClusterOperatorDown being unnecessary / incorrect in this situation.

Expected results:
Only ClusterOperatorDegraded fires while the operator is still Available=True.
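As a quick cross-check (a sketch added here, not part of the original report; <prometheus-route> is a placeholder for the cluster's openshift-monitoring Prometheus route), the Available and Degraded conditions behind the two alerts can be read directly from the CVO's cluster_operator_conditions metric:

# token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
# curl -s -k -G -H "Authorization: Bearer $token" \
    --data-urlencode 'query=cluster_operator_conditions{job="cluster-version-operator",name="image-registry",condition=~"Available|Degraded"}' \
    https://<prometheus-route>/api/v1/query \
    | jq -r '.data.result[] | "\(.metric.condition)=\(.value[1])"'

Available=1 together with Degraded=1 is the combination where, once the PR is in place, only ClusterOperatorDegraded should fire.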
Attempting the PR pre-merge verification process:

1. Launch a cluster with cluster-bot with the PR.

2. Check there are no ClusterOperatorDown and ClusterOperatorDegraded alerts:

# token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDown").annotations.message'
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDegraded").annotations.message'

3. Degrade the authentication operator:

# cat <<EOF >oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: does-not-exist
      clientSecret:
        name: does-not-exist
      claims:
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
EOF
# oc apply -f oauth.yaml
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured

4. Check the authentication operator is degraded:

# oc get co authentication --watch
authentication   4.7.0-0.ci.test-2021-08-09-063231-ci-ln-ryvs38b-latest   True   False   False   31m
...
# oc get co | grep auth
authentication   4.7.0-0.ci.test-2021-08-09-063231-ci-ln-ryvs38b-latest   True   False   True   41m

5. Check the ClusterOperatorDown and ClusterOperatorDegraded alerts:

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDegraded").annotations.message'
Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable.
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDown").annotations.message'

Only ClusterOperatorDegraded fired. But the time in the message is incorrect; it has been degraded for only about 10 minutes.
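Since the same two alert queries recur in steps 2 and 5, a small shell helper can save retyping (a sketch only; check_alert is a hypothetical function name, and the route is the one from the cluster under test):

# check_alert() {
    curl -s -k -H "Authorization: Bearer $token" \
      https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts \
      | jq -r --arg name "$1" '.data.alerts[] | select(.labels.alertname==$name).annotations.message'
  }
# check_alert ClusterOperatorDown
# check_alert ClusterOperatorDegraded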
> But the time in the message is incorrect, it's been degraded only 10 minutes.

Hmm, not sure how that's possible given 'for: 30m' [1]. And bug 1969501 should already have verified the 30m response time for ClusterOperatorDegraded. So... I dunno? Maybe we can test again post-merge?

[1]: https://github.com/openshift/cluster-version-operator/pull/638/files#diff-fabad9e1d73a4f70c3d47836ed62e1982b1c6fbb947fce9a633b9cb0a98ecb24R90
Thanks Trevor! I can test it again after it goes to ON_QA.

In my case, the authentication operator was in a Degraded=False state after install. I made it go into a degraded state manually by applying an incorrect YAML file post-install, and I then got the ClusterOperatorDegraded alert as expected. Whenever I check the ClusterOperatorDegraded alert message, it always tells me "has been degraded for 30 minutes", yet as you can see, the authentication operator had been degraded for only 10 minutes or so. I think the hardcoded 30 minutes makes sense during install, because installation generally takes around 30 minutes. Do you think it makes sense post-install as well?
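One way to see how long the alert has actually been active, independent of the hardcoded "30 minutes" wording in the message annotation, is to compare its activeAt timestamp with the current time (a sketch; <prometheus-route> is a placeholder, and activeAt marks when the rule expression first started evaluating true):

# date -u
# curl -s -k -H "Authorization: Bearer $token" \
    https://<prometheus-route>/api/v1/alerts \
    | jq -r '.data.alerts[] | select(.labels.alertname=="ClusterOperatorDegraded") | "\(.state) since \(.activeAt)"'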
Bugzilla bot seems to have missed this one, and the PR landed in 4.7.25 [1], which was tombstoned [2]. But then this code shipped with 4.7.28 [3]:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.7.28-x86_64 | grep cluster-version
  cluster-version-operator   https://github.com/openshift/cluster-version-operator   cc81827c1bfe322bd78d2fa0d9b34d532190d850
$ git --no-pager log --oneline -1 cc81827c1bfe322bd78d2fa0d9b34d532190d850
cc81827c Merge pull request #638 from wking/decouple-degraded-from-ClusterOperatorDown

So... I guess the upside is that it should be possible to verify now :p.

> Whenever I check the ClusterOperatorDegraded alert message, it always tells me "has been degraded for 30 minutes". In my case, you can see that the authentication operator had been degraded for 10 minutes more or less.

Moving to 30m was the separate bug 1969501, which should have shipped in 4.7.18. If it's firing more quickly than that, it seems like it would be a monitoring bug, because it's definitely 'for: 30m':

$ git cat-file -p cc81827c1bfe322bd78d2fa0d9b34d532190d850:install/0000_90_cluster-version-operator_02_servicemonitor.yaml | grep -A11 ClusterOperatorDegraded
    - alert: ClusterOperatorDegraded
      annotations:
        message: Cluster operator {{ "{{ $labels.name }}" }} has been degraded for 30 minutes. Operator is degraded because {{ "{{ $labels.reason }}" }} and cluster upgrades will be unstable.
      expr: |
        (
          cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
          or on (name)
          group by (name) (cluster_operator_up{job="cluster-version-operator"})
        ) == 1
      for: 30m
      labels:
        severity: warning

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.7.25
[2]: https://github.com/openshift/cincinnati-graph-data/pull/995
[3]: https://access.redhat.com/errata/RHSA-2021:3262
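For completeness, the ClusterOperatorDown rule in the same shipped manifest can be pulled out with an analogous command (shown here without its output; this assumes the alert is defined in the same file as ClusterOperatorDegraded):

$ git cat-file -p cc81827c1bfe322bd78d2fa0d9b34d532190d850:install/0000_90_cluster-version-operator_02_servicemonitor.yaml | grep -A11 ClusterOperatorDown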
Verifying with 4.7.0-0.nightly-2021-10-07-212101. Tested with the procedure described in comment #1.

# date; oc get co | grep auth
Fri Oct 8 00:05:26 EDT 2021
authentication   4.7.0-0.nightly-2021-10-07-212101   True   False   True   99m

# date; curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang1008a.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDegraded")'
Fri Oct 8 00:35:26 EDT 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.5:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-76df8c8788-2fbhf",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-10-08T04:05:29.21303266Z",
  "value": "1e+00"
}

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang1008a.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDown")'

Only ClusterOperatorDegraded fired. Moving it to verified state.
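After verification, the intentionally broken OAuth configuration can be reverted so the authentication operator clears its Degraded condition again (a sketch, not part of the original verification; it simply drops the identityProviders list added in comment #1):

# oc patch oauth.config.openshift.io cluster --type=json \
    -p '[{"op": "remove", "path": "/spec/identityProviders"}]'
# oc get co authentication --watch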
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.34 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3824