Bug 1958792

Summary:	Alert pending time is not consistent with the alert message
Product:	OpenShift Container Platform	Reporter:	liujia <jiajliu>
Component:	Cluster Version Operator	Assignee:	Over the Air Updates <aos-team-ota>
Status:	CLOSED DUPLICATE	QA Contact:	liujia <jiajliu>
Severity:	low	Docs Contact:
Priority:	low
Version:	4.8	CC:	aos-bugs, jokerman, wking
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-05-10 22:27:56 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description liujia 2021-05-10 07:28:49 UTC

Description of problem:
Degraded one operator to trigger the ClusterOperatorDegraded alert. It took more than 30 minutes for the alert to go from pending to firing, but the alert message says the operator has been degraded for 10 minutes.

# curl -s -k -H "Authorization: Bearer $token"  https://prometheus-k8s-openshift-monitoring.apps.jliu-48.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.7:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-5bcbddcc86-lqlk7",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-05-10T03:04:29.21303266Z",
  "value": "1e+00"
}
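For anyone re-checking this from a saved response, here is a minimal Python sketch of the same filter as the jq expression above, plus a helper that estimates the earliest time the alert could fire. It assumes that activeAt marks the moment the alert entered the pending state; the function names and the second alert in the sample payload are hypothetical and only illustrate the filtering.

```python
import re
from datetime import datetime, timedelta, timezone


def select_alerts(payload, alertname):
    """Return alerts whose alertname label matches, like the jq select() filter."""
    return [
        alert
        for alert in payload["data"]["alerts"]
        if alert["labels"].get("alertname") == alertname
    ]


def parse_active_at(stamp):
    """Parse a UTC RFC 3339 timestamp, dropping digits beyond microseconds."""
    match = re.fullmatch(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z", stamp)
    if match is None:
        raise ValueError(f"unsupported timestamp: {stamp!r}")
    whole = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    micros = int((match.group(2) or "0")[:6].ljust(6, "0"))
    return whole.replace(microsecond=micros, tzinfo=timezone.utc)


def earliest_firing(alert, wait_minutes):
    """Earliest time the alert can fire: activeAt plus the rule's wait time."""
    return parse_active_at(alert["activeAt"]) + timedelta(minutes=wait_minutes)


# Trimmed copy of the response from this report, plus one made-up alert
# (hypothetical, not from the cluster) so the filter has something to skip.
payload = {
    "data": {
        "alerts": [
            {
                "labels": {"alertname": "ClusterOperatorDegraded", "name": "authentication"},
                "state": "firing",
                "activeAt": "2021-05-10T03:04:29.21303266Z",
            },
            {
                "labels": {"alertname": "SomeOtherAlert"},
                "state": "pending",
                "activeAt": "2021-05-10T03:00:00Z",
            },
        ]
    }
}

degraded = select_alerts(payload, "ClusterOperatorDegraded")
# With a 10-minute wait, the alert could not fire before this time.
print(earliest_firing(degraded[0], 10).isoformat())
```

Comparing that timestamp with the time the alert was actually seen firing shows whether the 10 minutes in the message matches the real wait.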

The wait time was changed recently in [1]; before that change it was 10 minutes.

[1] https://github.com/openshift/cluster-version-operator/commit/fb5257d4be8e1b18a80a171a24ba6e8386026b94#diff-fabad9e1d73a4f70c3d47836ed62e1982b1c6fbb947fce9a633b9cb0a98ecb24

Version-Release number of the following components:
4.8.0-0.nightly-2021-05-08-025039

How reproducible:
always

Steps to Reproduce:
1. Degrade a cluster operator and check whether ClusterOperatorDegraded fires correctly and in a timely manner.

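Actual results:
The alert stays pending for more than 30 minutes before firing, while its message says the operator has been degraded for 10 minutes.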

Expected results:
The alert message should be consistent with the wait time.
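The expected result could be checked mechanically along these lines. This is a rough Python sketch, not part of the operator or its tests: it pulls the number of minutes out of the alert message and compares it with the rule's wait time in seconds. The helper names are hypothetical, and the 1800-second value in the usage is only an assumption based on the 30min+ delay observed above; the actual value is in the commit linked in [1].

```python
import re


def minutes_in_message(message):
    """Extract N from a phrase like 'degraded for N minutes', or None if absent."""
    match = re.search(r"for (\d+) minutes?", message)
    return int(match.group(1)) if match else None


def message_matches_wait(message, wait_seconds):
    """True when the minutes named in the message equal the rule's wait time."""
    claimed = minutes_in_message(message)
    return claimed is not None and claimed * 60 == wait_seconds


# Message text copied from the alert annotations in this report.
message = (
    "Cluster operator authentication has been degraded for 10 minutes. "
    "Operator is degraded because OAuthServerConfigObservation_Error "
    "and cluster upgrades will be unstable."
)

print(message_matches_wait(message, 600))   # wait time of 10 minutes
print(message_matches_wait(message, 1800))  # assumed wait time of 30 minutes
```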

Additional info:

Comment 1 W. Trevor King 2021-05-10 22:27:56 UTC

I'd just attached the fix for this to bug 1957991, no need for a separate bug.

*** This bug has been marked as a duplicate of bug 1957991 ***