Bug 1991010
Summary: | Backport Ignore Degraded for cluster_operator_up | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Matthew Robson <mrobson> |
Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 4.3.z | CC: | aos-bugs, jokerman, wking, yanyang |
Target Milestone: | --- | ||
Target Release: | 4.7.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-10-20 19:33:06 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1834551 | ||
Bug Blocks: | | ||
Description
Matthew Robson
2021-08-06 19:34:16 UTC
Attempting the PR pre-merge verification process:

1. Launch a cluster with cluster-bot with the PR.

2. Check that there are no ClusterOperatorDown or ClusterOperatorDegraded alerts:

```
# token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDown").annotations.message'
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDegraded").annotations.message'
```

3. Degrade the authentication operator:

```
# cat <<EOF >oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: does-not-exist
      clientSecret:
        name: does-not-exist
      claims:
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
EOF
# oc apply -f oauth.yaml
Warning: resource oauths/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
oauth.config.openshift.io/cluster configured
```

4. Check that the authentication operator goes degraded:

```
# oc get co authentication --watch
authentication   4.7.0-0.ci.test-2021-08-09-063231-ci-ln-ryvs38b-latest   True   False   False   31m
...
# oc get co | grep auth
authentication   4.7.0-0.ci.test-2021-08-09-063231-ci-ln-ryvs38b-latest   True   False   True    41m
```

5. Check the ClusterOperatorDown and ClusterOperatorDegraded alerts:

```
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDegraded").annotations.message'
Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable.
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.ci-ln-ryvs38b-d5d6b.origin-ci-int-aws.dev.rhcloud.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDown").annotations.message'
```

Only ClusterOperatorDegraded fired. But the time in the message is incorrect; it had been degraded for only 10 minutes.

> But the time in the message is incorrect, it's been degraded only 10 minutes.

Hmm, not sure how that's possible given 'for: 30m' [1]. And bug 1969501 should already have verified the 30m response time for ClusterOperatorDegraded. So... I dunno? Maybe we can test again post-merge?

[1]: https://github.com/openshift/cluster-version-operator/pull/638/files#diff-fabad9e1d73a4f70c3d47836ed62e1982b1c6fbb947fce9a633b9cb0a98ecb24R90

Thanks Trevor! I can test it again after it goes to ON_QA. In my case, the authentication operator was in a Degraded=False state after install. I made it go degraded manually by applying an incorrect YAML file post-install, and then got the ClusterOperatorDegraded alert as expected. Whenever I check the ClusterOperatorDegraded alert message, it always tells me "has been degraded for 30 minutes", but in my case you can see that the authentication operator had been degraded for 10 minutes, more or less. I think the hardcoded 30 minutes makes sense during install, because installation generally takes 30 minutes or so. Do you think it makes sense post-install as well?
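The curl/jq pipeline used in the steps above simply filters the Prometheus `/api/v1/alerts` response by alert name. For readers less familiar with jq, here is a rough Python equivalent; the payload is a hand-written illustrative sample (not captured from a real cluster), and token handling is omitted:

```python
import json

# Illustrative sample of a Prometheus /api/v1/alerts response body.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {"alertname": "ClusterOperatorDegraded", "name": "authentication"},
        "annotations": {"message": "Cluster operator authentication has been degraded for 30 minutes."},
        "state": "firing",
        "activeAt": "2021-10-08T04:05:29Z"
      },
      {
        "labels": {"alertname": "Watchdog"},
        "annotations": {"message": "An alert that should always be firing."},
        "state": "firing",
        "activeAt": "2021-10-08T02:00:00Z"
      }
    ]
  }
}
""")

def alert_messages(payload, alertname):
    """Equivalent of:
    jq -r '.data.alerts[]| select(.labels.alertname=="<name>").annotations.message'
    """
    return [alert["annotations"]["message"]
            for alert in payload["data"]["alerts"]
            if alert["labels"].get("alertname") == alertname]

print(alert_messages(sample, "ClusterOperatorDegraded"))  # one firing alert
print(alert_messages(sample, "ClusterOperatorDown"))      # [] -- not firing
```

An empty list here corresponds to the empty output from the second curl in step 5: the alert is simply not firing.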
Bugzilla bot seems to have missed this one, and the PR landed in 4.7.25 [1]. Which was tombstoned [2]. But then this code shipped with 4.7.28 [3]:

```
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.7.28-x86_64 | grep cluster-version
  cluster-version-operator  https://github.com/openshift/cluster-version-operator  cc81827c1bfe322bd78d2fa0d9b34d532190d850
$ git --no-pager log --oneline -1 cc81827c1bfe322bd78d2fa0d9b34d532190d850
cc81827c Merge pull request #638 from wking/decouple-degraded-from-ClusterOperatorDown
```

So... I guess the upside is that it should be possible to verify now :p.

> Whenever I check the ClusterOperatorDegraded alert message, it always tells me "has been degraded for 30 minutes". In my case, you can see that the authentication operator had been degraded for 10 minutes more or less.

Moving to 30m was the separate bug 1969501, which should have shipped in 4.7.18. If it's firing more quickly than that, it seems like it would be a monitoring bug, because it's definitely 'for: 30m':

```
$ git cat-file -p cc81827c1bfe322bd78d2fa0d9b34d532190d850:install/0000_90_cluster-version-operator_02_servicemonitor.yaml | grep -A11 ClusterOperatorDegraded
    - alert: ClusterOperatorDegraded
      annotations:
        message: Cluster operator {{ "{{ $labels.name }}" }} has been degraded for 30 minutes. Operator is degraded because {{ "{{ $labels.reason }}" }} and cluster upgrades will be unstable.
      expr: |
        (
          cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
          or on (name)
          group by (name) (cluster_operator_up{job="cluster-version-operator"})
        ) == 1
      for: 30m
      labels:
        severity: warning
```

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.7.25
[2]: https://github.com/openshift/cincinnati-graph-data/pull/995
[3]: https://access.redhat.com/errata/RHSA-2021:3262

Verifying with 4.7.0-0.nightly-2021-10-07-212101.
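The interesting part of that rule is the `or on (name)` fallback in the expr. Informally (this is a sketch with made-up sample values, not the CVO's or Prometheus's actual evaluation), it behaves like this:

```python
# Informal sketch of the ClusterOperatorDegraded expr quoted above:
#
#   (cluster_operator_conditions{condition="Degraded"}
#    or on (name)
#    group by (name) (cluster_operator_up)) == 1
#
# Sample values below are made up for illustration.

# Degraded condition as reported per operator (authentication is the one
# degraded in this bug's reproducer; machine-config is healthy).
degraded_conditions = {"authentication": 1, "machine-config": 0}

# Operators with a cluster_operator_up series; `group by (name)` collapses
# each matching series to the constant value 1.
operator_up = {"authentication": 1, "machine-config": 1, "etcd": 1}

def degraded_alert_names(conditions, up):
    """For each name, prefer the Degraded-condition series; fall back to the
    grouped up series when no Degraded series exists; keep entries == 1."""
    union = {}
    for name in set(conditions) | set(up):
        union[name] = conditions.get(name, up.get(name))
    return sorted(name for name, value in union.items() if value == 1)

# authentication fires because Degraded=1. machine-config does not: its
# Degraded=0 series wins over the fallback. etcd (which, in this made-up
# sample, reports no Degraded condition at all) fires via the fallback, so
# an operator that fails to report the condition is treated as degraded.
print(degraded_alert_names(degraded_conditions, operator_up))
```

In other words, the fallback is what keeps the alert meaningful for operators whose Degraded series is missing entirely, while healthy operators are masked by their explicit Degraded=0 series.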
Tested with the procedure described in comment#1:

```
# date; oc get co | grep auth
Fri Oct 8 00:05:26 EDT 2021
authentication   4.7.0-0.nightly-2021-10-07-212101   True   False   True   99m
# date; curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang1008a.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDegraded")'
Fri Oct 8 00:35:26 EDT 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.5:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-76df8c8788-2fbhf",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-10-08T04:05:29.21303266Z",
  "value": "1e+00"
}
# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang1008a.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname=="ClusterOperatorDown")'
```

Only ClusterOperatorDegraded fired. Moving it to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.34 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3824
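A footnote on the "30 minutes" wording that caused confusion earlier in this bug: that text is a static annotation template, so the message reads "30 minutes" no matter how long the operator has actually been degraded. The real timing comes from the alert's activeAt field plus the rule's `for: 30m` pending delay. A small sketch using the timestamps from the verification output above (truncated to whole seconds):

```python
from datetime import datetime, timezone

# activeAt from the ClusterOperatorDegraded alert in the verification output,
# truncated to whole seconds (04:05:29 UTC).
active_at = datetime.fromisoformat("2021-10-08T04:05:29+00:00")
# Timestamp of the second `date` call (00:35:26 EDT == 04:35:26 UTC).
observed = datetime(2021, 10, 8, 4, 35, 26, tzinfo=timezone.utc)

elapsed = observed - active_at  # how long the alert expr has been true
print(f"expr has been true for ~{elapsed.total_seconds() / 60:.0f} minutes")

# With `for: 30m`, the alert spends its first 30 minutes in the "pending"
# state, so at 04:35 UTC it had only just transitioned to "firing". At that
# moment the static "30 minutes" in the message happens to match; checked an
# hour later, the message would still say 30 minutes.
```

This is consistent with Trevor's point that a ClusterOperatorDegraded alert firing earlier than 30 minutes after the condition went true would be a monitoring bug, not a CVO one.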