Bug 1997596

Summary: UpdateAvailable alert is re-triggered on pod and other label changes
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.9
Target Release: 4.9.0
Status: CLOSED ERRATA
Severity: low
Priority: medium
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Vadim Rutkovsky <vrutkovs>
Assignee: W. Trevor King <wking>
QA Contact: Yang Yang <yanyang>
CC: aos-bugs, jokerman, wking, yanyang
Last Closed: 2021-10-18 17:49:05 UTC

Description Vadim Rutkovsky 2021-08-25 13:39:02 UTC
Description of problem:
When a cluster has available updates, the UpdateAvailable alert is triggered. The alert is then re-triggered every 4 hours, even when no new updates are available.

The alert should be displayed once per CVO pod's lifetime (until the pod is restarted or the cluster is updated).

Comment 3 Yang Yang 2021-08-27 08:04:51 UTC
Reproducing with 4.9.0-0.nightly-2021-08-25-185404

Steps to reproduce:
1. Install a cluster with 4.9.0-0.nightly-2021-08-25-185404
2. Check that the UpdateAvailable alert has many labels

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0826.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "nightly-4.9",
    "endpoint": "metrics",
    "instance": "10.0.0.2:9099",
    "job": "cluster-version-operator",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-594c7d895b-5sjf2",
    "service": "cluster-version-operator",
    "severity": "info",
    "upstream": "https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy.json"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0826.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-27T08:00:16.457044785Z",
  "value": "1e+00"
}

3. Delete the CVO pod, and then re-check the alert
# oc delete pod/cluster-version-operator-594c7d895b-5sjf2
pod "cluster-version-operator-594c7d895b-5sjf2" deleted

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0826.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "nightly-4.9",
    "endpoint": "metrics",
    "instance": "10.0.0.4:9099",
    "job": "cluster-version-operator",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-594c7d895b-rcdq7",
    "service": "cluster-version-operator",
    "severity": "info",
    "upstream": "https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy.json"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0826.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-27T08:03:16.457044785Z",
  "value": "1e+00"
}

The alert is re-triggered.
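
Comparing the two label sets above makes the cause visible. The following is a minimal Python sketch, not part of the original report: the helper name `changed_labels` is hypothetical, and the label values are copied from the two alert payloads in this comment. Prometheus identifies an alert by its label set, so any label that differs gives the alert a new identity.

```python
def changed_labels(before: dict, after: dict) -> dict:
    """Return {label: (old, new)} for every label whose value differs."""
    keys = sorted(set(before) | set(after))
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }

# Labels copied from the alert payload captured before the pod deletion.
before = {
    "alertname": "UpdateAvailable",
    "channel": "nightly-4.9",
    "endpoint": "metrics",
    "instance": "10.0.0.2:9099",
    "job": "cluster-version-operator",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-594c7d895b-5sjf2",
    "service": "cluster-version-operator",
    "severity": "info",
    "upstream": "https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy.json",
}
# After the deletion, only these two values are different in the payload.
after = dict(
    before,
    instance="10.0.0.4:9099",
    pod="cluster-version-operator-594c7d895b-rcdq7",
)

diff = changed_labels(before, after)
for name, (old, new) in diff.items():
    print(f"{name}: {old} -> {new}")
```

Only `instance` and `pod` differ, and both are tied to the CVO pod rather than to the update recommendation itself.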

Comment 4 Yang Yang 2021-08-27 08:11:40 UTC
Verifying with 4.9.0-0.nightly-2021-08-26-040328

Steps to verify:

1. Install a cluster with 4.9.0-0.nightly-2021-08-26-040328

2. Check that the UpdateAvailable alert has fewer labels

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0827.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "stable-4.9",
    "severity": "info",
    "upstream": "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0827.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-27T06:17:16.457044785Z",
  "value": "2e+00"
}

3. Delete the CVO pod, and then re-check the alert
# oc delete pod/cluster-version-operator-7f45c4f96d-vtcgx
pod "cluster-version-operator-7f45c4f96d-vtcgx" deleted

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0827.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "stable-4.9",
    "severity": "info",
    "upstream": "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0827.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-27T08:04:16.457044785Z",
  "value": "2e+00"
}

The alert is still re-triggered by the CVO pod change.

Trevor, from my understanding, with your fix the alert should only be affected by a channel and/or upstream change. But in my testing, it is re-triggered by a CVO pod change. Could you please check whether this is expected? Thanks!
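
The reduced label set in step 2 is consistent with a rule that aggregates the metric by `channel` and `upstream`, which would drop per-pod labels. The thread does not quote the alert rule, so that reading is an assumption. The sketch below, in Python, illustrates sum-by aggregation in general; the helper `sum_by` is hypothetical, and the per-pod labels and value are taken from the payloads in comment 3.

```python
from collections import defaultdict

def sum_by(series, keep):
    """Sum sample values grouped by the labels in `keep`, dropping all others."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels[name]) for name in keep)
        totals[key] += value
    return dict(totals)

# Labels shared by the series before and after the pod deletion (comment 3).
shared = {
    "channel": "nightly-4.9",
    "upstream": "https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy.json",
}
before = [({**shared, "instance": "10.0.0.2:9099",
            "pod": "cluster-version-operator-594c7d895b-5sjf2"}, 1.0)]
after = [({**shared, "instance": "10.0.0.4:9099",
           "pod": "cluster-version-operator-594c7d895b-rcdq7"}, 1.0)]

keep = ("channel", "upstream")
# The aggregated label set is the same for both pods.
print(sum_by(before, keep) == sum_by(after, keep))
```

This only addresses label churn. It says nothing about what happens if the underlying series is briefly missing while the pod is replaced.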

Comment 5 W. Trevor King 2021-08-28 22:47:30 UTC
In this case, I think the new 'activeAt' was because, while the new CVO was coming up, monitoring would fail to scrape us, and we'd get a gap, which... hmm, it's not clear to me why it took over 1h40m for the new alert to come back... There's no 'for' on the alert [1], so it should fire immediately once Prom scrapes us. And I'm not sure what Prom's scrape period is; it's somewhat surprising to me that it's fast enough for them to care about the brief break while we swapped CVO pods...

[1]: https://github.com/openshift/cluster-version-operator/blob/17d9690bb6b85f786367382a279f058311772828/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L58-L65
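
The gap hypothesis above can be sketched as a toy model. This is a simplified Python illustration, not Prometheus code: the function name `active_at_history`, the 30-second evaluation spacing, and the sample data are all hypothetical. It assumes an alert with no 'for' clause, which becomes active on the first evaluation that sees the series and resolves on the first evaluation that does not.

```python
from typing import Iterable, List, Tuple

def active_at_history(evaluations: Iterable[Tuple[int, bool]]) -> List[int]:
    """Return the activeAt value assigned each time the alert (re)starts.

    `evaluations` holds (timestamp, series_present) pairs, one per rule
    evaluation. With no 'for' clause the alert starts as soon as the series
    is seen and resolves as soon as it is missing.
    """
    starts = []
    active = False
    for ts, present in evaluations:
        if present and not active:
            starts.append(ts)  # a new activation gets a fresh activeAt
        active = present
    return starts

# Hypothetical timeline: the series is missing at t=60 while the pod is swapped.
evals = [(0, True), (30, True), (60, False), (90, True), (120, True)]
print(active_at_history(evals))
```

In this model a single missed evaluation is enough to produce a second activation with a later activeAt, even though the labels never change.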

Comment 6 Yang Yang 2021-08-31 12:18:22 UTC
Trevor, thanks for looking into it.

> not clear to me why it took over 1h40m for the new alert to come back...

Ah, it's because I deleted the CVO pod after the cluster had been running for 1h40m.
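
The "over 1h40m" figure appears to correspond to the distance between the two 'activeAt' values in comment 4. A small Python sketch to compute it follows; the helper `parse_active_at` is hypothetical, and it truncates the nanosecond fraction because Python's `datetime` stores only microseconds.

```python
from datetime import datetime, timezone

def parse_active_at(value: str) -> datetime:
    """Parse a UTC timestamp such as 2021-08-27T06:17:16.457044785Z."""
    base, _, frac = value.rstrip("Z").partition(".")
    # Keep at most six fractional digits (microsecond precision).
    micros = int(frac[:6].ljust(6, "0")) if frac else 0
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)

# activeAt values copied from the two payloads in comment 4.
first = parse_active_at("2021-08-27T06:17:16.457044785Z")
second = parse_active_at("2021-08-27T08:04:16.457044785Z")

gap = second - first
print(gap)
```

The result is 1 hour 47 minutes, which matches the pod having been deleted well after the first activation rather than a slow recovery.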

Comment 7 Yang Yang 2021-08-31 12:24:20 UTC
With the change, the alert is re-triggered without any upstream URI or channel change.

# oc adm upgrade
Cluster version is 4.9.0-0.nightly-2021-08-28-082738

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.9
Updates:

VERSION                           IMAGE
4.9.0-0.nightly-2021-08-28-134805 registry.ci.openshift.org/ocp/release@sha256:a10a4358850af7b5c288e81be38b92673cabd79f1f59d8e632dc122e5ab0561b
4.9.0-0.nightly-2021-08-28-150051 registry.ci.openshift.org/ocp/release@sha256:4f0ee87e83419d2e0a86bb386585a66652e6a072f50bcb42180ff547b0c995d6

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0830-1.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "stable-4.9",
    "severity": "info",
    "upstream": "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0830-1.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-31T09:03:16.457044785Z",
  "value": "2e+00"
}

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0830-1.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "stable-4.9",
    "severity": "info",
    "upstream": "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0830-1.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-31T10:38:46.457044785Z",
  "value": "2e+00"
}

Oops, the alert is re-triggered, although the upstream URI and channel have not changed.

# oc adm upgrade 
Cluster version is 4.9.0-0.nightly-2021-08-28-082738

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.9
Updates:

VERSION                           IMAGE
4.9.0-0.nightly-2021-08-28-134805 registry.ci.openshift.org/ocp/release@sha256:a10a4358850af7b5c288e81be38b92673cabd79f1f59d8e632dc122e5ab0561b
4.9.0-0.nightly-2021-08-28-150051 registry.ci.openshift.org/ocp/release@sha256:4f0ee87e83419d2e0a86bb386585a66652e6a072f50bcb42180ff547b0c995d6

Is it because the Cincinnati graph changed?

Comment 8 Yang Yang 2021-09-01 09:51:00 UTC
Created a cluster without the change:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-25-111423   True        False         2d6h    Cluster version is 4.9.0-0.nightly-2021-08-25-111423

After watching the UpdateAvailable alert for over 1 day, I found that the alert was not re-triggered. I'm not able to reproduce it.

Created a cluster with the change:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-28-082738   True        False         2d6h    Cluster version is 4.9.0-0.nightly-2021-08-28-082738

The UpdateAvailable alert is re-triggered when the channel and/or upstream URI changes. In addition, I updated the dummy Cincinnati graph by adding nodes, and the alert was not re-triggered. I have no idea how the re-trigger in comment 7 happened.

Trevor, could you please check whether this is enough to verify the BZ? Thanks!

Comment 9 Yang Yang 2021-09-03 01:08:09 UTC
Based on comment 8, I think it works as designed. Moving it to the verified state at this point.

Comment 12 errata-xmlrpc 2021-10-18 17:49:05 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759