Bug 1969501 - ClusterOperatorDegraded can fire during installation
Summary: ClusterOperatorDegraded can fire during installation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Importance: high medium
Target Milestone: ---
Target Release: 4.7.z
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On: 1957991
Blocks:
 
Reported: 2021-06-08 14:15 UTC by W. Trevor King
Modified: 2021-06-29 04:20 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator fired ClusterOperatorDegraded after 10 minutes of unhappy Degraded conditions on ClusterOperator resources. During installs, the ClusterOperator resources are pre-created by the cluster-version operator well before some of the second-level operators are running.
Consequence: Second-level operators that only become happy later in installs would have ClusterOperatorDegraded firing because their ClusterOperator had a sad or missing Degraded condition for more than 10 minutes.
Fix: ClusterOperatorDegraded now requires 30 minutes of sad or missing Degraded conditions before it fires.
Result: With this phase of installation generally completing within 30 minutes, ClusterOperatorDegraded is now much less likely to fire prematurely. When second-level operators go Degraded post-install, we will alert administrators to that degradation within 30 minutes, which still seems sufficiently low-latency for that level of degradation.
Clone Of: 1957991
Environment:
Last Closed: 2021-06-29 04:20:14 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 587 0 None open Bug 1969501: install/0000_90_cluster-version-operator_02_servicemonitor: Soften ClusterOperatorDegraded 2021-06-08 14:45:52 UTC
Red Hat Product Errata RHBA-2021:2502 0 None None None 2021-06-29 04:20:37 UTC

Description W. Trevor King 2021-06-08 14:15:39 UTC
+++ This bug was initially created as a clone of Bug #1957991 +++

During install, the CVO has pushed manifests into the cluster as fast as possible without blocking on "has the in-cluster resource leveled?" since way back in [1].  That can lead to ClusterOperatorDown and ClusterOperatorDegraded firing during install, as we see in [2], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [3].
* Install completed at 5:09:58Z [4].
* ClusterOperatorDegraded started firing at 5:10:04Z [3].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [3].
* The e2e suite complained about [2]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a separate bug.  For ClusterOperatorDegraded, the degraded condition should not be particularly urgent [5], so we should be fine bumping it to 'warning' and using 'for: 30m' or something more relaxed than the current 10m.  A sketch of what the softened rule should look like in-cluster follows the footnotes below.

[1]: https://github.com/openshift/cluster-version-operator/pull/136
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[3]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[5]: https://github.com/openshift/api/pull/916
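
As a hypothetical verification sketch (not part of the original report): once the softening lands, the relaxed duration and severity should be visible in the alerting rule that the 0000_90_cluster-version-operator_02_servicemonitor manifest (named in the PR title above) creates. Assuming that manifest's PrometheusRule is named cluster-version-operator in the openshift-cluster-version namespace, something like:

# oc -n openshift-cluster-version get prometheusrule cluster-version-operator -o json \
    | jq '.spec.groups[].rules[] | select(.alert == "ClusterOperatorDegraded") | {"for": .["for"], "severity": .labels.severity}'

should print "for": "30m" and "severity": "warning" after the change, versus "10m" and "critical" before it.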

Comment 1 liujia 2021-06-10 04:25:42 UTC
Checked a cluster launched by cluster-bot: 4.7.0-0.nightly,openshift/cluster-version-operator#587

The authentication operator reports Degraded=True, and the ClusterOperatorDegraded alert is active with severity warning:
# oc get co authentication -ojson|jq -r '.status.conditions[]|select(.type=="Degraded").status'
True

{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.4:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-65cbf4cf85-vxqn8",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "pending",
  "activeAt": "2021-06-10T03:47:29.21303266Z",
  "value": "1e+00"
}

After 10 minutes the alert was still pending; 20 minutes after that (30 minutes past activeAt), the alert was firing.
# curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://$(oc get route prometheus-k8s -n openshift-monitoring --no-headers|awk '{print $2}')/api/v1/alerts | jq -r '.data.alerts[]|select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.4:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-65cbf4cf85-vxqn8",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 30 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-06-10T03:47:29.21303266Z",
  "value": "1e+00"
}
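
As a supplementary sketch (an illustration, not part of the original verification): Prometheus records each pending/firing alert's activation timestamp in its internal ALERTS_FOR_STATE series, so the elapsed 'for' time can be computed directly; once it passes 1800 seconds, the alert should transition from pending to firing. Reusing the same route and token as the curl above:

# curl -s -k -G -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" \
    --data-urlencode 'query=time() - ALERTS_FOR_STATE{alertname="ClusterOperatorDegraded"}' \
    https://$(oc get route prometheus-k8s -n openshift-monitoring --no-headers|awk '{print $2}')/api/v1/query

A value above 1800 lines up with the "firing" state shown in the output above.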

Comment 4 liujia 2021-06-15 06:19:58 UTC
# oc adm release info registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-06-12-151209 --commits|grep cluster-version
  cluster-version-operator                       https://github.com/openshift/cluster-version-operator                       f3d25082a09312718718fa3a85b8aba8b4574781

# git log --date local --pretty="%h %an %cd - %s" f3d2508|grep '#587'
a0eacf89 OpenShift Merge Robot Fri Jun 11 17:48:32 2021 - Merge pull request #587 from wking/ClusterOperatorDegraded-softening

The PR was included in 4.7.0-0.nightly-2021-06-12-151209. The bug was verified pre-merge (comment#1), but the bot did not move it to VERIFIED automatically, so I am changing the status manually. A sketch for double-checking the shipped manifest follows.
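
For completeness, a hypothetical way to confirm the softened rule inside the release payload itself, assuming the manifest keeps the filename from the PR title:

# oc adm release extract --to=/tmp/manifests registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-06-12-151209
# grep -B1 -A1 'for: 30m' /tmp/manifests/0000_90_cluster-version-operator_02_servicemonitor.yaml

The grep should land inside the ClusterOperatorDegraded rule if the change verified in comment#1 is in the payload.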

Comment 5 OpenShift Automated Release Tooling 2021-06-17 12:29:08 UTC
OpenShift engineering has decided not to ship Red Hat OpenShift Container Platform 4.7.17 due to a regression: https://bugzilla.redhat.com/show_bug.cgi?id=1973006. All the fixes which were part of 4.7.17 will now be part of 4.7.18, planned to be available in the candidate channel on June 23, 2021 and in the fast channel on June 28.

Comment 9 errata-xmlrpc 2021-06-29 04:20:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502

