Bug 1969501

Summary: ClusterOperatorDegraded can fire during installation
Product: OpenShift Container Platform
Component: Cluster Version Operator
Reporter: W. Trevor King <wking>
Assignee: W. Trevor King <wking>
Status: CLOSED ERRATA
QA Contact: liujia <jiajliu>
Severity: medium
Priority: high
Version: 4.1.z
CC: aos-bugs, bleanhar, jiajliu, jialiu, jokerman
Target Milestone: ---
Target Release: 4.7.z
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator fired ClusterOperatorDegraded after 10 minutes of unhappy Degraded conditions on ClusterOperator resources. During installs, the ClusterOperator resources are pre-created by the cluster-version operator well before some of the second-level operators are running.
Consequence: Second-level operators that only became happy later in the install had ClusterOperatorDegraded fire, because their ClusterOperator had a sad or missing Degraded condition for more than 10 minutes.
Fix: ClusterOperatorDegraded now requires 30 minutes of sad or missing Degraded conditions before it fires.
Result: With this phase of installation generally completing within 30 minutes, ClusterOperatorDegraded is now much less likely to fire prematurely. When second-level operators go Degraded post-install, administrators will still be alerted to that degradation within 30 minutes, which seems sufficiently low-latency for that level of degradation.
Clone Of: 1957991
Last Closed: 2021-06-29 04:20:14 UTC
Bug Depends On: 1957991

Description W. Trevor King 2021-06-08 14:15:39 UTC
+++ This bug was initially created as a clone of Bug #1957991 +++

During install, the CVO has pushed manifests into the cluster as fast as possible without blocking on "has the in-cluster resource leveled?" since way back in [1].  That can lead to ClusterOperatorDown and ClusterOperatorDegraded firing during install, as we see in [2], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [3].
* Install completed at 5:09:58Z [4].
* ClusterOperatorDegraded started firing at 5:10:04Z [3].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [3].
* The e2e suite complained about [2]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a separate bug.  For ClusterOperatorDegraded, the Degraded condition should not be particularly urgent [5], so we should be fine bumping it to 'warning' and using 'for: 30m' or something more relaxed than the current 10m.

[1]: https://github.com/openshift/cluster-version-operator/pull/136
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[3]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[5]: https://github.com/openshift/api/pull/916
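For illustration, the proposed relaxation would look something like the following Prometheus rule sketch. This is not the shipped manifest: the actual rule lives in the cluster-version operator's install manifests, and the expr and message shown here are assumptions; only the 'severity: warning' and 'for: 30m' values come from the proposal above.

```yaml
# Hypothetical sketch of the relaxed alert, not the shipped CVO manifest.
- alert: ClusterOperatorDegraded
  # Assumed expression; the real rule in the CVO manifests may differ.
  expr: cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} == 1
  for: 30m             # relaxed from the current 10m
  labels:
    severity: warning  # bumped down per [5]
  annotations:
    message: Cluster operator {{ $labels.name }} has been degraded for 30 minutes.
```

With 'for: 30m', Prometheus keeps the alert in the pending state until the expression has been continuously true for 30 minutes, so an install that levels its ClusterOperators inside that window never produces a firing alert.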

Comment 1 liujia 2021-06-10 04:25:42 UTC
Checked a cluster launched by cluster-bot: 4.7.0-0.nightly,openshift/cluster-version-operator#587

The authentication operator is Degraded, and the ClusterOperatorDegraded alert is active with severity warning.
# oc get co authentication -ojson|jq -r '.status.conditions[]|select(.type=="Degraded").status'
True

{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.4:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-65cbf4cf85-vxqn8",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-06-10T03:47:29.21303266Z",
  "value": "1e+00"
}

After 10 minutes the alert was still pending; 20 minutes after that, it was firing.
# curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://$(oc get route prometheus-k8s -n openshift-monitoring --no-headers|awk '{print $2}')/api/v1/alerts | jq -r '.data.alerts[]|select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.4:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-65cbf4cf85-vxqn8",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-06-10T03:47:29.21303266Z",
  "value": "1e+00"
}
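As a quick sanity check on the 30-minute window, the gap between the alert's activeAt and the observation time can be computed with GNU date arithmetic. This is a throwaway sketch using the timestamps above (the observation time is taken from this comment's timestamp, and GNU coreutils date is assumed):

```shell
# Throwaway check of how long the alert had been active, using GNU date.
active_at="2021-06-10T03:47:29Z"   # activeAt from the alert JSON above
observed="2021-06-10T04:25:42Z"    # timestamp of this comment
start=$(date -u -d "$active_at" +%s)
now=$(date -u -d "$observed" +%s)
minutes=$(( (now - start) / 60 ))
echo "active for ${minutes} minutes"   # comfortably past the 30m 'for' window
```

That puts the alert well beyond the 30-minute pending window, consistent with it having transitioned to firing.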

Comment 4 liujia 2021-06-15 06:19:58 UTC
# oc adm release info registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-06-12-151209 --commits|grep cluster-version
  cluster-version-operator                       https://github.com/openshift/cluster-version-operator                       f3d25082a09312718718fa3a85b8aba8b4574781

# git log --date local --pretty="%h %an %cd - %s" f3d2508|grep '#587'
a0eacf89 OpenShift Merge Robot Fri Jun 11 17:48:32 2021 - Merge pull request #587 from wking/ClusterOperatorDegraded-softening

The PR was included in 4.7.0-0.nightly-2021-06-12-151209. The bug was verified pre-merge (comment#1), but the bot did not move it to VERIFIED automatically, so I am changing the status manually.

Comment 5 OpenShift Automated Release Tooling 2021-06-17 12:29:08 UTC
OpenShift engineering has decided not to ship Red Hat OpenShift Container Platform 4.7.17 due to a regression, https://bugzilla.redhat.com/show_bug.cgi?id=1973006. All the fixes that were part of 4.7.17 will now be part of 4.7.18, which is planned to be available in the candidate channel on June 23, 2021 and in the fast channel on June 28.

Comment 9 errata-xmlrpc 2021-06-29 04:20:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502