Description of problem:
After an OLM pod restart, the OLM csv_succeeded metrics are no longer reported.

Version-Release number of selected component (if applicable):
4.9.26

How reproducible:
100%

Steps to Reproduce:
1. Check the OLM pod metrics:

~$ oc port-forward olm-operator-78958bfb4-zlvbj 8443
Forwarding from 127.0.0.1:8443 -> 8443
~$ curl --insecure https://localhost:8443/metrics

2. Observe that the csv_succeeded metrics are present. Example:

# TYPE csv_succeeded gauge
csv_succeeded{name="managed-upgrade-operator.v0.1.807-d70ffc7",namespace="openshift-managed-upgrade-operator",version="0.1.807-d70ffc7"} 1
csv_succeeded{name="ocm-agent-operator.v0.1.93-608a6f5",namespace="openshift-ocm-agent-operator",version="0.1.93-608a6f5"} 1
csv_succeeded{name="ocs-operator.v4.10.0",namespace="openshift-storage",version="4.10.0"} 1

3. Restart the OLM pod:

~$ oc delete pod olm-operator-78958bfb4-zlvbj

4. Check the OLM pod metrics again:

~$ oc port-forward olm-operator-6477cdfddf-c7mxz 8443
Forwarding from 127.0.0.1:8443 -> 8443
~$ curl --insecure https://localhost:8443/metrics

Actual results:
csv_succeeded metrics are gone.

Expected results:
csv_succeeded metrics for all CSVs in the cluster are reported.

Additional info:
csv_succeeded metrics are a key component in the Managed OpenShift space, as we use them as a signal from the fleet to manage the lifecycle of the clusters and Managed Services.
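For anyone who wants to script the check in steps 1 and 4 instead of reading the curl output by eye, here is a minimal sketch. It is not part of OLM; the function name `csv_succeeded_series` is made up for illustration, and the parsing assumes the simple sample layout shown in step 2 (no `}` or escaped quotes inside label values).

```python
import re

# Matches one csv_succeeded sample line in the Prometheus text format, e.g.
#   csv_succeeded{name="a",namespace="b",version="c"} 1
# Assumption: label values contain no '}' and no escaped quotes.
SAMPLE_RE = re.compile(r'^csv_succeeded\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def csv_succeeded_series(metrics_text):
    """Return a list of (labels, value) tuples, one per csv_succeeded sample.

    Hypothetical helper: metrics_text is the body returned by the
    /metrics endpoint queried in the reproduction steps.
    """
    series = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            # "# TYPE ..." comments and all other metrics are skipped.
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        series.append((labels, float(match.group("value"))))
    return series
```

Pass it the output of the curl command before and after the pod restart. On an affected 4.9.26 cluster the second call should return an empty list, which matches the "Actual results" above.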
I could not observe the issue in 4.10. This may be related: https://bugzilla.redhat.com/show_bug.cgi?id=1952576
Was that fix backported to 4.9? If not, can it be?
I'm creating a cherry-pick PR from the 4.10 fix. We should probably start a discussion about the metrics, what they mean, and whether they meet your requirements as SRE. The original intention of the metrics was to provide information for PM. They were never built to be a resilient cluster health status metric.
*** This bug has been marked as a duplicate of bug 2072995 ***