Bug 1914132 - No metrics available in the Object Service Dashboard in OCS 4.7, logs show "failed to retrieve metrics exporter servicemonitor"
Summary: No metrics available in the Object Service Dashboard in OCS 4.7, logs show "failed to retrieve metrics exporter servicemonitor"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: umanga
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Duplicates: 1919385
Depends On:
Blocks:
 
Reported: 2021-01-08 08:18 UTC by Neha Berry
Modified: 2021-06-01 08:47 UTC
CC List: 7 users

Fixed In Version: 4.7.0-696.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:17:47 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 983 0 None closed Remove usage of outdated coreos/prometheus-operator 2021-02-07 10:04:21 UTC
Github openshift ocs-operator pull 984 0 None closed Bug 1914132: [release-4.7] Remove usage of outdated coreos/prometheus-operator 2021-02-07 10:04:21 UTC
Red Hat Product Errata RHSA-2021:2041 0 None None None 2021-05-19 09:18:12 UTC

Description Neha Berry 2021-01-08 08:18:29 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
-----------------------------------------------------
After deployment failures for some time, the latest stable build of OCS that passed deployment is 4.7.0-228.ci.

Installed OCS 4.7.0-228.ci on a VMware dynamic cluster, and the Object Service Dashboard displays no metrics.

Also, almost since the beginning of OCS 4.7 we have been seeing the following error message in the ocs-operator logs, but due to the unsuccessful deployments we could never check the dashboard before:

2021-01-07T14:50:39.427852375Z {"level":"error","ts":1610031039.427809,"logger":"controllers.StorageCluster","msg":"failed to reconcile metrics exporter","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","error":"failed to retrieve metrics exporter servicemonitor openshift-storage/ocs-metrics-exporter. no kind is registered for the type v1.ServiceMonitor in scheme \"pkg/runtime/scheme.go:101\"

Note: Due to clock skew in the mons, the cluster has been in HEALTH_WARN state since the beginning, but that should not be the cause of this issue.
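
The "no kind is registered for the type v1.ServiceMonitor in scheme" error is what a controller-runtime client returns when the ServiceMonitor type was never added to its runtime scheme, which is consistent with the linked fix that drops the outdated coreos/prometheus-operator dependency. A minimal sketch of the registration involved, assuming controller-runtime and the current prometheus-operator API module (the package path, helper name, and main function below are illustrative, not taken from the ocs-operator source):

package main

import (
	"fmt"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
)

// buildScheme returns a runtime.Scheme that knows about the core Kubernetes
// types as well as the monitoring.coreos.com/v1 types (ServiceMonitor,
// PrometheusRule, ...). A client built on a scheme that is missing the
// ServiceMonitor type fails every Get/Create on it with
// "no kind is registered for the type v1.ServiceMonitor in scheme".
func buildScheme() *runtime.Scheme {
	scheme := runtime.NewScheme()
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	utilruntime.Must(monitoringv1.AddToScheme(scheme))
	return scheme
}

func main() {
	scheme := buildScheme()
	// Confirm the ServiceMonitor kind is now known to the scheme.
	gvks, _, err := scheme.ObjectKinds(&monitoringv1.ServiceMonitor{})
	if err != nil {
		fmt.Println("ServiceMonitor not registered:", err)
		return
	}
	fmt.Println("ServiceMonitor registered as", gvks)
}

With the type registered, the metrics exporter reconcile can retrieve or create the openshift-storage/ocs-metrics-exporter ServiceMonitor instead of failing as in the log above.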


Version of all relevant components (if applicable):
=====================================================
OCP = 4.7.0-0.nightly-2021-01-07-034013
OCS = ocs-operator.v4.7.0-228.ci, and also ocs-operator.v4.7.0-229.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
====================================================================
Yes. No MCG or RGW metrics are available in the dashboard.

Is there any workaround available to the best of your knowledge?
============================================================
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
============================

Is this issue reproducible?
=============================
Yes, observed on 2 clusters:

1. VMware dynamic - OCS build 4.7.0-228.ci
2. VMware LSO with arbiter enabled - OCS build 4.7.0-229.ci


Can this issue reproduce from the UI?
=====================================
OCS was installed via UI.


If this is a regression, please provide more details to justify this:
=====================================================
Yes

Steps to Reproduce:
=======================
1. Install the latest OCP 4.7 nightly build.
2. For VMware dynamic, install the OCS operator, here 4.7.0-228.ci. The operator pods are created (though the pods respinned a couple of times - bug 1909268).
3. Install the StorageCluster.
  a) The namespace openshift-storage gets labelled with openshift.io/cluster-monitoring: "true" once StorageCluster creation starts (see the sketch after these steps).
  b) All pods are up, but due to an NTP issue Ceph was in HEALTH_WARN with clock skew. See next comment.

4. Logged into the UI and checked the dashboards.
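
A quick way to double-check step 3a from inside the cluster, as a minimal client-go sketch (the in-cluster config and the log messages are illustrative, not from the report):

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("config: %v", err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("clientset: %v", err)
	}
	// Step 3a: openshift-storage should carry openshift.io/cluster-monitoring: "true"
	// once StorageCluster creation starts; without it, cluster monitoring does not
	// scrape the namespace and the dashboards stay empty.
	ns, err := clientset.CoreV1().Namespaces().Get(context.TODO(), "openshift-storage", metav1.GetOptions{})
	if err != nil {
		log.Fatalf("get namespace: %v", err)
	}
	if ns.Labels["openshift.io/cluster-monitoring"] != "true" {
		log.Println("cluster-monitoring label missing on openshift-storage")
		return
	}
	log.Println("cluster-monitoring label present")
}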

Actual results:
===================
The Object Service dashboard is not displaying any metrics

Expected results:
=====================
The Object Service dashboard should display some status and information for both MCG and RGW.

Comment 7 Martin Bukatovic 2021-02-02 22:27:42 UTC
I reproduced the issue with ocs-operator.v4.7.0-228.ci: I see 2 TargetDown alerts (100% of the noobaa-mgmt/noobaa-mgmt targets in the openshift-storage namespace are down, and 100% of the s3/s3 targets in the openshift-storage namespace are down), and no NooBaa metrics (I checked NooBaa_bucket_status) can be queried via OCP Prometheus. The Object dashboard is empty.

With 4.7.0-249.ci, there are no such TargetDown alerts, and the NooBaa_bucket_status metric is present in OCP Prometheus. There is no delay compared to other Ceph metrics. The Object dashboard reports data.
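
A minimal sketch of the kind of query used for this check, assuming the Prometheus Go client and a token that can reach the cluster monitoring (thanos-querier) route; the PROM_URL/PROM_TOKEN environment variables and the bearer transport are assumptions for illustration, and the route's CA is assumed to be trusted by the default transport:

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// bearerTransport adds an OpenShift bearer token (e.g. from `oc whoami -t`)
// to every request sent to the monitoring route.
type bearerTransport struct {
	token string
	next  http.RoundTripper
}

func (t *bearerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req.Header.Set("Authorization", "Bearer "+t.token)
	return t.next.RoundTrip(req)
}

func main() {
	client, err := api.NewClient(api.Config{
		Address:      os.Getenv("PROM_URL"), // e.g. the thanos-querier route URL
		RoundTripper: &bearerTransport{token: os.Getenv("PROM_TOKEN"), next: http.DefaultTransport},
	})
	if err != nil {
		log.Fatalf("prometheus client: %v", err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// The metric checked in this comment: on a fixed build this returns samples,
	// while on 4.7.0-228.ci the result is empty.
	result, warnings, err := promv1.NewAPI(client).Query(ctx, "NooBaa_bucket_status", time.Now())
	if err != nil {
		log.Fatalf("query: %v", err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}
	fmt.Println(result)
}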

Verified

Comment 8 Martin Bukatovic 2021-02-02 22:35:36 UTC
*** Bug 1919385 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2021-05-19 09:17:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

