Bug 1957991

Summary: ClusterOperatorDegraded can fire during installation
Product: OpenShift Container Platform
Component: Cluster Version Operator
Reporter: W. Trevor King <wking>
Assignee: W. Trevor King <wking>
Status: CLOSED ERRATA
QA Contact: Johnny Liu <jialiu>
Severity: medium
Priority: high
Version: 4.1.z
CC: aos-bugs, bleanhar, jiajliu, jokerman
Target Release: 4.8.0   
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator fired ClusterOperatorDegraded after 10 minutes of unhappy Degraded conditions on ClusterOperator resources. During installs, the ClusterOperator resources are pre-created by the cluster-version operator well before some of the second-level operators are running.
Consequence: Second-level operators that only become happy later in the install would have ClusterOperatorDegraded firing, because their ClusterOperator had a sad or missing Degraded condition for more than 10 minutes.
Fix: ClusterOperatorDegraded now requires 30 minutes of sad or missing Degraded conditions before it fires.
Result: With this phase of installation generally completing within 30 minutes, ClusterOperatorDegraded is now much less likely to fire prematurely. When second-level operators go Degraded post-install, we will alert administrators to that degradation within 30 minutes, which still seems sufficiently low-latency for that level of degradation.
Cloned As: 1969501 (view as bug list)
Last Closed: 2021-07-27 23:07:23 UTC
Type: Bug
Bug Blocks: 1969501    

Description W. Trevor King 2021-05-06 23:56:43 UTC
During install, the CVO has pushed manifests into the cluster as fast as possible without blocking on "has the in-cluster resource leveled?" since way back in [1].  That can lead to ClusterOperatorDown and ClusterOperatorDegraded firing during install, as we see in [2], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [3].
* Install completed at 5:09:58Z [4].
* ClusterOperatorDegraded started firing at 5:10:04Z [3].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [3].
* The e2e suite complained about [2]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a separate bug.  For ClusterOperatorDegraded, the Degraded condition should not be particularly urgent [5], so we should be fine bumping it to 'warning' and using 'for: 30m' or something else more relaxed than the current 10m (a sketch of the change follows the references below).

[1]: https://github.com/openshift/cluster-version-operator/pull/136
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[3]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[5]: https://github.com/openshift/api/pull/916
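
A minimal sketch of the proposed relaxation, showing only the fields that would change (the complete rule as shipped is quoted in the verification output in comment 5):

    - alert: ClusterOperatorDegraded
      # expr and the full annotation text are elided here; see comment 5 for
      # the rule as actually shipped.
      for: 30m             # relaxed from the current 10m
      labels:
        severity: warning  # demoted from the current critical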

Comment 2 W. Trevor King 2021-05-10 22:27:53 UTC
*** Bug 1958792 has been marked as a duplicate of this bug. ***

Comment 4 Johnny Liu 2021-05-11 08:20:50 UTC
Reproduced this bug with 4.8.0-fc.2 using the following steps.

1. Create a cluster.
2. Run the following command to check for ClusterOperatorDegraded alerts.
[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)"  https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'

The query returns no alerts.
[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-fc.2   True        False         True       28h

3. Configure the authentication cluster operator with a deliberately wrong setting:
# cat <<EOF >oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: test
      clientSecret:
        name: test
      claims:
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
EOF

# oc apply -f oauth.yaml 
4. Wait a few minutes, then query the ClusterOperatorDegraded alerts again; this time the alert shows up in the pending state:
[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)"  https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.58.146:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-67cc6c7b4f-7x4wj",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "critical"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-05-11T07:38:29.21303266Z",
  "value": "1e+00"
}
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.2   True        False         28h     Cluster version is 4.8.0-fc.2
[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-fc.2   True        False         True       28h


5. After about 10 minutes, the alert starts firing.
[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)"  https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.58.146:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-67cc6c7b4f-7x4wj",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "critical"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-05-11T07:38:29.21303266Z",
  "value": "1e+00"
}
[root@preserve-jialiu-ansible ~]# oc get co authentication
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-fc.2   True        False         True       28h
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.2   True        False         28h     Error while reconciling 4.8.0-fc.2: the cluster operator authentication is degraded

From https://prometheus-k8s-openshift-monitoring.apps.jialiu48.qe.devcluster.openshift.com/graph?g0.expr=ALERTS&g0.tab=0&g0.stacked=0&g0.range_input=2d, I can see the alert started pending at 07:38:59 and started firing at 07:50:30, about 10 minutes later.
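
The same query as in [3] above, graphed in that Prometheus UI over the reproduction window, shows the transition directly; it only uses the built-in ALERTS series:

    # One series per alertstate; the point where the "firing" series begins
    # marks the end of the rule's "for" window.
    group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})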

Comment 5 Johnny Liu 2021-05-11 12:19:59 UTC
Verified this bug with 4.8.0-0.nightly-2021-05-10-225140 using the steps in comment 4; it passed.

[root@preserve-jialiu-ansible ~]# oc get PrometheusRule -n openshift-cluster-version -o yaml
<--snip-->
      - alert: ClusterOperatorDegraded
        annotations:
          message: Cluster operator {{ $labels.name }} has been degraded for 30 minutes. Operator is degraded because {{ $labels.reason }} and cluster upgrades will be unstable.
        expr: |
          (
            cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
            or on (name)
            group by (name) (cluster_operator_up{job="cluster-version-operator"})
          ) == 1
        for: 30m
        labels:
          severity: warning
<--snip-->
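
As a quicker spot check than reading the full YAML above, something like the following pulls out just the relaxed threshold and severity (a sketch; the jq path assumes the standard PrometheusRule schema of spec.groups[].rules[], and it lists every PrometheusRule in the namespace rather than naming one):

    oc get prometheusrule -n openshift-cluster-version -o json \
      | jq '.items[].spec.groups[].rules[]
            | select(.alert == "ClusterOperatorDegraded")
            | {"for": .["for"], "severity": .labels.severity}'
    # expected after the fix: {"for": "30m", "severity": "warning"}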

[root@preserve-jialiu-ansible ~]# curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)"  https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)"  https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
Tue May 11 11:41:50 UTC 2021
[root@preserve-jialiu-ansible ~]# date -u; oc get co authentication; oc get clusterversion
Tue May 11 11:42:11 UTC 2021
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         False      132m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS

[root@preserve-jialiu-ansible ~]# date -u; oc get co authentication; oc get clusterversion
Tue May 11 11:43:36 UTC 2021
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       133m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         137m    Cluster version is 4.8.0-0.nightly-2021-05-10-225140
[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)"  https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
Tue May 11 11:43:44 UTC 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.53.0:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-74cc585456-ll8tz",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}

After waiting 10+ minutes, the alert is still in the "pending" state.

[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)"  https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'; oc get co authentication; oc get clusterversion
Tue May 11 11:55:57 UTC 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.53.0:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-74cc585456-ll8tz",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       145m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         150m    Error while reconciling 4.8.0-0.nightly-2021-05-10-225140: the cluster operator authentication is degraded


After about 30 minutes, the alert goes into the "firing" state.
[root@preserve-jialiu-ansible ~]# date -u; curl -s -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)"  https://$(oc get route -n openshift-monitoring prometheus-k8s --no-headers -o json | jq -r '.status.ingress[].host')/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'; oc get co authentication; oc get clusterversion
Tue May 11 12:17:32 UTC 2021
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.53.0:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-74cc585456-ll8tz",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-05-11T11:44:29.21303266Z",
  "value": "1e+00"
}
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-10-225140   True        False         True       167m
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-10-225140   True        False         171m    Error while reconciling 4.8.0-0.nightly-2021-05-10-225140: the cluster operator authentication is degraded

Comment 8 errata-xmlrpc 2021-07-27 23:07:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438