Bug 2072923 - OLM csv_succeeded metrics not reported after OLM pod restart
Summary: OLM csv_succeeded metrics not reported after OLM pod restart
Keywords:
Status: CLOSED DUPLICATE of bug 2072995
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.9
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Per da Silva
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-04-07 09:39 UTC by apahim
Modified: 2022-04-07 12:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-07 12:39:04 UTC
Target Upstream Version:
Embargoed:



Description apahim 2022-04-07 09:39:35 UTC
Description of problem:

After an OLM pod restart, the OLM csv_succeeded metrics are no longer reported.


Version-Release number of selected component (if applicable):

4.9.26


How reproducible:

100%


Steps to Reproduce:
1. Check the OLM pod metrics:

~$ oc port-forward olm-operator-78958bfb4-zlvbj 8443
Forwarding from 127.0.0.1:8443 -> 8443

~$ curl --insecure https://localhost:8443/metrics

2. Observe that the csv_succeeded metrics are present. Example:

# TYPE csv_succeeded gauge
csv_succeeded{name="managed-upgrade-operator.v0.1.807-d70ffc7",namespace="openshift-managed-upgrade-operator",version="0.1.807-d70ffc7"} 1
csv_succeeded{name="ocm-agent-operator.v0.1.93-608a6f5",namespace="openshift-ocm-agent-operator",version="0.1.93-608a6f5"} 1
csv_succeeded{name="ocs-operator.v4.10.0",namespace="openshift-storage",version="4.10.0"} 1

3. Restart the OLM pod:

~$ oc delete pod olm-operator-78958bfb4-zlvbj

4. Check the OLM pod metrics again:

~$ oc port-forward olm-operator-6477cdfddf-c7mxz 8443
Forwarding from 127.0.0.1:8443 -> 8443

~$ curl --insecure https://localhost:8443/metrics
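
To make the comparison between steps 2 and 4 less error-prone than reading the full /metrics output by eye, the output can be filtered. The following is only a minimal sketch: the function names count_csv_succeeded and list_csv_succeeded are hypothetical helpers (not part of oc or OLM), and list_csv_succeeded assumes that "name" is the first label on each sample, as in the example output in step 2.

```shell
#!/bin/sh
# Hypothetical helpers for checking csv_succeeded in Prometheus text output.
# Both read the metrics text from stdin.

# Print the number of csv_succeeded samples.
# "# HELP" / "# TYPE" comment lines are not counted because they start with '#'.
# grep -c prints 0 and exits non-zero when nothing matches; "|| true" keeps
# the function's exit status at 0 in that case.
count_csv_succeeded() {
    grep -c '^csv_succeeded{' || true
}

# Print the CSV name of each csv_succeeded sample, one per line.
# Assumption: "name" is the first label, as in the example output above.
list_csv_succeeded() {
    sed -n 's/^csv_succeeded{name="\([^"]*\)".*/\1/p'
}

# Usage, with the port-forward from the steps above still running:
#   curl --insecure --silent https://localhost:8443/metrics | count_csv_succeeded
#   curl --insecure --silent https://localhost:8443/metrics | list_csv_succeeded
```

With the behaviour described in this report, the count would be non-zero before the restart and 0 after it.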


Actual results:

csv_succeeded metrics are gone.


Expected results:

csv_succeeded metrics for all CSVs in the cluster are reported.


Additional info: 

csv_succeeded metrics are a key component in the Managed OpenShift space, as we use them as a signal from the fleet to lifecycle the clusters and Managed Services.

Comment 2 apahim 2022-04-07 10:04:59 UTC
I could not observe the issue in 4.10.

Maybe related:

https://bugzilla.redhat.com/show_bug.cgi?id=1952576

Was this backported to 4.9? If not, can it be?

Comment 3 Per da Silva 2022-04-07 12:37:11 UTC
I'm creating a cherry-pick PR from the 4.10 fix. We should probably initiate some discussion around the metrics, their meaning, and whether they meet your requirements as SRE.
The original intention of the metrics was to provide information for PM. They were never built to be a resilient cluster health status metric.

Comment 4 Per da Silva 2022-04-07 12:39:04 UTC

*** This bug has been marked as a duplicate of bug 2072995 ***

