Bug 2072923

Summary: OLM csv_succeeded metrics not reported after OLM pod restart
Product: OpenShift Container Platform Reporter: apahim
Component: OLM Assignee: Per da Silva <pegoncal>
OLM sub component: OLM QA Contact: Jian Zhang <jiazha>
Status: CLOSED DUPLICATE Docs Contact:
Severity: urgent    
Priority: urgent CC: nschiede
Version: 4.9   
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-04-07 12:39:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description apahim 2022-04-07 09:39:35 UTC
Description of problem:

After an OLM pod restart, the OLM csv_succeeded metrics are no longer reported.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Check the OLM pod metrics:

~$ oc port-forward olm-operator-78958bfb4-zlvbj 8443
Forwarding from -> 8443

~$ curl --insecure https://localhost:8443/metrics

2. Observe that the csv_succeeded metrics are present. Example:

# TYPE csv_succeeded gauge
csv_succeeded{name="managed-upgrade-operator.v0.1.807-d70ffc7",namespace="openshift-managed-upgrade-operator",version="0.1.807-d70ffc7"} 1
csv_succeeded{name="ocm-agent-operator.v0.1.93-608a6f5",namespace="openshift-ocm-agent-operator",version="0.1.93-608a6f5"} 1
csv_succeeded{name="ocs-operator.v4.10.0",namespace="openshift-storage",version="4.10.0"} 1

3. Restart the OLM pod:

~$ oc delete pod olm-operator-78958bfb4-zlvbj

4. Check the OLM pod metrics again:

~$ oc port-forward olm-operator-6477cdfddf-c7mxz 8443
Forwarding from -> 8443

~$ curl --insecure https://localhost:8443/metrics

Actual results:

csv_succeeded metrics are gone.

Expected results:

csv_succeeded metrics for all CSVs in the cluster are reported.
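
The comparison between the two scrapes can be automated. The following is a minimal sketch, not part of OLM or any existing tooling: it assumes the output of each `curl .../metrics` call was saved as text, and the names `csv_succeeded_samples` and `missing_after_restart` are hypothetical helpers introduced for illustration. It uses a simple regular expression, so it does not handle label values that contain `}` or escaped quotes. The `AFTER` text is an invented stand-in for a scrape with no csv_succeeded samples.

```python
import re

# One csv_succeeded sample line in the Prometheus text exposition format:
# metric name, label set in braces, value, optional integer timestamp.
SAMPLE_RE = re.compile(
    r'^csv_succeeded\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+-?\d+)?\s*$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def csv_succeeded_samples(metrics_text):
    """Return {(namespace, name): value} for every csv_succeeded sample."""
    samples = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # "# TYPE" comments, other metrics, blank lines
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("namespace", ""), labels.get("name", ""))
        samples[key] = float(match.group("value"))
    return samples


def missing_after_restart(before_text, after_text):
    """Return sorted (namespace, name) keys present before but absent after."""
    before = csv_succeeded_samples(before_text)
    after = csv_succeeded_samples(after_text)
    return sorted(set(before) - set(after))


# Scrape taken in step 2 (one sample from this report).
BEFORE = (
    '# TYPE csv_succeeded gauge\n'
    'csv_succeeded{name="ocs-operator.v4.10.0",'
    'namespace="openshift-storage",version="4.10.0"} 1\n'
)
# Illustrative scrape for step 4: no csv_succeeded lines (content is made up).
AFTER = '# TYPE go_goroutines gauge\ngo_goroutines 42\n'

for namespace, name in missing_after_restart(BEFORE, AFTER):
    print(f"missing after restart: {namespace}/{name}")
```

An empty result from `missing_after_restart` would correspond to the expected behavior; any entry it returns is a CSV whose metric disappeared across the restart.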

Additional info: 

csv_succeeded metrics are a key component in the Managed OpenShift space, as we use them as a signal from the fleet to manage the lifecycle of the clusters and Managed Services.

Comment 2 apahim 2022-04-07 10:04:59 UTC
I could not observe the issue in 4.10.

Maybe related:


Was this backported to 4.9? If not, can we?

Comment 3 Per da Silva 2022-04-07 12:37:11 UTC
I'm creating a cherry-pick PR from the 4.10 fix. We should probably initiate some discussion around the metrics, their meaning, and whether they meet your requirements as SRE.
The original intention of the metrics was to provide information for PM. They were never built to be a resilient cluster health status metric.

Comment 4 Per da Silva 2022-04-07 12:39:04 UTC

*** This bug has been marked as a duplicate of bug 2072995 ***