Bug 1958792

Summary: Alert pending time is not consistent with the alert message
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.8
Status: CLOSED DUPLICATE
Severity: low
Priority: low
Reporter: liujia <jiajliu>
Assignee: Over the Air Updates <aos-team-ota>
QA Contact: liujia <jiajliu>
CC: aos-bugs, jokerman, wking
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-05-10 22:27:56 UTC

Description liujia 2021-05-10 07:28:49 UTC
Description of problem:
I degraded one cluster operator to trigger the ClusterOperatorDegraded alert. It took more than 30 minutes for the alert to go from pending to firing, but the alert message says the operator has been degraded for 10 minutes.

# curl -s -k -H "Authorization: Bearer $token"  https://prometheus-k8s-openshift-monitoring.apps.jliu-48.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")'
{
  "labels": {
    "alertname": "ClusterOperatorDegraded",
    "condition": "Degraded",
    "endpoint": "metrics",
    "instance": "10.0.0.7:9099",
    "job": "cluster-version-operator",
    "name": "authentication",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-5bcbddcc86-lqlk7",
    "reason": "OAuthServerConfigObservation_Error",
    "service": "cluster-version-operator",
    "severity": "warning"
  },
  "annotations": {
    "message": "Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable."
  },
  "state": "firing",
  "activeAt": "2021-05-10T03:04:29.21303266Z",
  "value": "1e+00"
}
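To watch the pending-to-firing transition directly, the jq filter from the curl command above can be extended to print each alert's state together with the minutes elapsed since `activeAt` (the time Prometheus first saw the alert condition, i.e. when it entered pending). This is a minimal sketch, not part of the original report: the function name `alert_age` is hypothetical, and it assumes jq 1.6+ built with regex support for `sub`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a Prometheus /api/v1/alerts response on stdin and,
# for each ClusterOperatorDegraded alert, prints the operator name, alert state,
# activeAt, and whole minutes elapsed since activeAt.
# jq's fromdateiso8601 rejects fractional seconds, so they are stripped first.
alert_age() {
  jq -r '
    .data.alerts[]
    | select(.labels.alertname == "ClusterOperatorDegraded")
    | (.activeAt | sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601) as $since
    | "\(.labels.name) \(.state) since \(.activeAt): \(((now - $since) / 60) | floor) min"
  '
}

# Usage: pipe the output of the curl command above (without its jq filter)
# into alert_age, e.g.
#   curl -s -k -H "Authorization: Bearer $token" https://<prometheus-route>/api/v1/alerts | alert_age
```

Running this periodically while the operator is degraded shows how many minutes the alert stays in `pending` before its state changes to `firing`, which can then be compared with the "10 minutes" in the message.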

The wait time was recently increased in [1]; before that change it was 10 minutes, so the message still reflects the old value.

[1] https://github.com/openshift/cluster-version-operator/commit/fb5257d4be8e1b18a80a171a24ba6e8386026b94#diff-fabad9e1d73a4f70c3d47836ed62e1982b1c6fbb947fce9a633b9cb0a98ecb24

Version-Release number of the following components:
4.8.0-0.nightly-2021-05-08-025039

How reproducible:
always

Steps to Reproduce:
1. Degrade a cluster operator and check that ClusterOperatorDegraded fires correctly and on time.

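Actual results:
The alert stayed pending for more than 30 minutes before firing, while its message says the operator has been degraded for 10 minutes.
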
Expected results:
The alert message should be consistent with the wait time.

Comment 1 W. Trevor King 2021-05-10 22:27:56 UTC
I'd just attached the fix for this to bug 1957991, so there's no need for a separate bug.

*** This bug has been marked as a duplicate of bug 1957991 ***