Description of problem: When a cluster has available updates, the UpdateAvailable alert is triggered. The alert is re-triggered every 4 hours, even when no new updates are available. It should be displayed once per CVO pod lifetime (until the pod is restarted or the cluster is updated).
Reproducing with 4.9.0-0.nightly-2021-08-25-185404

Steps to reproduce:

1. Install a cluster with 4.9.0-0.nightly-2021-08-25-185404.

2. Check that the UpdateAvailable alert has many labels:

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0826.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "nightly-4.9",
    "endpoint": "metrics",
    "instance": "10.0.0.2:9099",
    "job": "cluster-version-operator",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-594c7d895b-5sjf2",
    "service": "cluster-version-operator",
    "severity": "info",
    "upstream": "https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy.json"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0826.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-27T08:00:16.457044785Z",
  "value": "1e+00"
}

3. Delete the CVO pod, and then re-check the alert:

# oc delete pod/cluster-version-operator-594c7d895b-5sjf2
pod "cluster-version-operator-594c7d895b-5sjf2" deleted

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0826.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "nightly-4.9",
    "endpoint": "metrics",
    "instance": "10.0.0.4:9099",
    "job": "cluster-version-operator",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-594c7d895b-rcdq7",
    "service": "cluster-version-operator",
    "severity": "info",
    "upstream": "https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy.json"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0826.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-27T08:03:16.457044785Z",
  "value": "1e+00"
}

The alert is re-triggered.
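For readers following along: Prometheus treats each distinct label set as its own alert instance, so a quick way to see why the pod swap produces a "new" alert is to diff the labels of the two payloads above. The following is only a small sketch; the helper name changed_labels is made up for illustration, and the label values are copied from the reproduction output.

```python
# Labels copied from the first alert payload in the reproduction steps.
BEFORE = {
    "alertname": "UpdateAvailable",
    "channel": "nightly-4.9",
    "endpoint": "metrics",
    "instance": "10.0.0.2:9099",
    "job": "cluster-version-operator",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-594c7d895b-5sjf2",
    "service": "cluster-version-operator",
    "severity": "info",
    "upstream": "https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy.json",
}

# Labels from the second payload: only instance and pod were reported differently.
AFTER = dict(
    BEFORE,
    instance="10.0.0.4:9099",
    pod="cluster-version-operator-594c7d895b-rcdq7",
)


def changed_labels(old, new):
    """Return {label: (old_value, new_value)} for every label that differs.

    Hypothetical helper for illustration; not part of any Prometheus client.
    """
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


diff = changed_labels(BEFORE, AFTER)
for label, (old_value, new_value) in diff.items():
    print(f"{label}: {old_value} -> {new_value}")
```

Only the per-pod labels (instance and pod) differ, which is consistent with the alert identity changing whenever the CVO pod is replaced.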
Verifying with 4.9.0-0.nightly-2021-08-26-040328

Steps to verify:

1. Install a cluster with 4.9.0-0.nightly-2021-08-26-040328.

2. Check that the UpdateAvailable alert has fewer labels:

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0827.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "stable-4.9",
    "severity": "info",
    "upstream": "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0827.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-27T06:17:16.457044785Z",
  "value": "2e+00"
}

3. Delete the CVO pod, and then re-check the alert:

# oc delete pod/cluster-version-operator-7f45c4f96d-vtcgx
pod "cluster-version-operator-7f45c4f96d-vtcgx" deleted

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0827.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "stable-4.9",
    "severity": "info",
    "upstream": "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0827.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-27T08:04:16.457044785Z",
  "value": "2e+00"
}

The alert is still re-triggered by the CVO pod change.
Trevor, from my understanding, with your fix the alert should only be affected by a channel and/or upstream change. But in my testing, it's re-triggered by a CVO pod change. Could you please check whether that's expected? Thanks!
In this case, I think the new 'activeAt' was because, while the new CVO was coming up, monitoring would fail to scrape us, and we'd get a gap, which... hmm, it's not clear to me why it took over 1h40m for the new alert to come back... There's no 'for' on the alert [1], so it should fire immediately once Prom scrapes us. And I'm not sure what Prom's scrape period is; it's somewhat surprising to me that it's fast enough for them to care about the brief break while we swapped CVO pods...

[1]: https://github.com/openshift/cluster-version-operator/blob/17d9690bb6b85f786367382a279f058311772828/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L58-L65
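To make the scrape-gap reasoning concrete, here is a toy model of an alerting rule with no 'for' clause: it fires on the first evaluation that returns a result, and any evaluation without a result ends the episode, so the next result starts a fresh 'activeAt'. This is only a sketch of the idea, not Prometheus's actual implementation. The function name active_since and the two intermediate evaluation times are assumptions; only the first and last timestamps come from the verification output above.

```python
def active_since(evaluations):
    """Toy model of an alert rule with no 'for' clause.

    evaluations: list of (timestamp, has_result) pairs, one per rule evaluation.
    Returns the activeAt timestamp of each distinct firing episode.
    """
    episodes = []
    firing = False
    for timestamp, has_result in evaluations:
        if has_result and not firing:
            # No 'for' clause: the alert fires on the first evaluation with a result.
            episodes.append(timestamp)
        firing = has_result
    return episodes


timeline = [
    ("06:17:16", True),   # activeAt reported before the pod was deleted
    ("08:03:46", True),   # hypothetical evaluation, old pod still scraped
    ("08:04:01", False),  # hypothetical evaluation during the pod swap (assumed gap)
    ("08:04:16", True),   # activeAt reported after the new pod came up
]

episodes = active_since(timeline)
print(episodes)  # ['06:17:16', '08:04:16']
```

In this model a single evaluation without a result is enough to reset 'activeAt', which would match a new timestamp appearing right after the pod swap even though the labels are unchanged.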
Trevor, thanks for looking into it.

> not clear to me why it took over 1h40m for the new alert to come back...

Ah, it's because I deleted the CVO pod after the cluster had been running for 1h40m.
With the change, the alert is re-triggered without any upstream URI or channel change.

# oc adm upgrade
Cluster version is 4.9.0-0.nightly-2021-08-28-082738

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.9
Updates:

VERSION                             IMAGE
4.9.0-0.nightly-2021-08-28-134805   registry.ci.openshift.org/ocp/release@sha256:a10a4358850af7b5c288e81be38b92673cabd79f1f59d8e632dc122e5ab0561b
4.9.0-0.nightly-2021-08-28-150051   registry.ci.openshift.org/ocp/release@sha256:4f0ee87e83419d2e0a86bb386585a66652e6a072f50bcb42180ff547b0c995d6

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0830-1.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "stable-4.9",
    "severity": "info",
    "upstream": "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0830-1.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-31T09:03:16.457044785Z",
  "value": "2e+00"
}

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0830-1.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select (.labels.alertname == "UpdateAvailable")'
{
  "labels": {
    "alertname": "UpdateAvailable",
    "channel": "stable-4.9",
    "severity": "info",
    "upstream": "https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"
  },
  "annotations": {
    "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.yangyang0830-1.qe.gcp.devcluster.openshift.com/settings/cluster/.",
    "summary": "Your upstream update recommendation service recommends you update your cluster."
  },
  "state": "firing",
  "activeAt": "2021-08-31T10:38:46.457044785Z",
  "value": "2e+00"
}

Oops, the alert is re-triggered. The upstream URI and channel have not changed.

# oc adm upgrade
Cluster version is 4.9.0-0.nightly-2021-08-28-082738

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.9
Updates:

VERSION                             IMAGE
4.9.0-0.nightly-2021-08-28-134805   registry.ci.openshift.org/ocp/release@sha256:a10a4358850af7b5c288e81be38b92673cabd79f1f59d8e632dc122e5ab0561b
4.9.0-0.nightly-2021-08-28-150051   registry.ci.openshift.org/ocp/release@sha256:4f0ee87e83419d2e0a86bb386585a66652e6a072f50bcb42180ff547b0c995d6

Is it because the Cincinnati graph changed?
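As a quick sanity check on the two 'activeAt' values in this comment, the gap between them can be computed with a few lines of Python. This is just a sketch; the helper parse_active_at is made up for illustration and truncates the nanosecond fraction to microseconds, since strptime's %f accepts at most six digits.

```python
from datetime import datetime, timezone


def parse_active_at(value):
    """Parse a UTC timestamp like 2021-08-31T09:03:16.457044785Z.

    Hypothetical helper: truncates the fractional part to microseconds.
    """
    head, _, fraction = value.rstrip("Z").partition(".")
    fraction = (fraction + "000000")[:6]  # pad or truncate to exactly six digits
    parsed = datetime.strptime(f"{head}.{fraction}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


first = parse_active_at("2021-08-31T09:03:16.457044785Z")
second = parse_active_at("2021-08-31T10:38:46.457044785Z")

gap = second - first
print(gap)  # 1:35:30
```

So the alert was re-activated roughly an hour and a half after the first 'activeAt', with identical labels on both payloads.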
Created a cluster without the change:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-25-111423   True        False         2d6h    Cluster version is 4.9.0-0.nightly-2021-08-25-111423

After watching the UpdateAvailable alert for over 1 day, the alert was not re-triggered. I'm not able to reproduce it.

Created a cluster with the change:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-28-082738   True        False         2d6h    Cluster version is 4.9.0-0.nightly-2021-08-28-082738

The UpdateAvailable alert is re-triggered when the channel and/or upstream URI changes. In addition, I also updated the dummy cincy graph by adding nodes, and the alert was not re-triggered. I have no idea how comment#7 happened.

Trevor, could you please check whether this is enough to verify the BZ? Thanks!
Based on comment#8, I think it works as designed. Moving it to the verified state at this point.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759