Bug 1914132

Summary:	No metrics available in the Object Service Dashboard in OCS 4.7, logs show "failed to retrieve metrics exporter servicemonitor"
Product:	[Red Hat Storage] Red Hat OpenShift Container Storage	Reporter:	Neha Berry <nberry>
Component:	ocs-operator	Assignee:	umanga <uchapaga>
Status:	CLOSED ERRATA	QA Contact:	Martin Bukatovic <mbukatov>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	4.7	CC:	jefbrown, madam, mbukatov, muagarwa, ocs-bugs, sostapov, uchapaga
Target Milestone:	---	Keywords:	AutomationBackLog, Regression
Target Release:	OCS 4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.7.0-696.ci	Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-05-19 09:17:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Neha Berry 2021-01-08 08:18:29 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
-----------------------------------------------------
After a period of deployment failures, the latest stable OCS build that passed deployment is 4.7.0-228.ci.

Installed OCS 4.7.0-228.ci on a VMware dynamic cluster. The Object Service Dashboard shows no data and no metrics can be viewed.

Also, since almost the beginning of OCS 4.7, we have been seeing the following error message in the ocs-operator logs, but because deployments kept failing, we could never check the dashboard before:

2021-01-07T14:50:39.427852375Z {"level":"error","ts":1610031039.427809,"logger":"controllers.StorageCluster","msg":"failed to reconcile metrics exporter","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","error":"failed to retrieve metrics exporter servicemonitor openshift-storage/ocs-metrics-exporter. no kind is registered for the type v1.ServiceMonitor in scheme \"pkg/runtime/scheme.go:101\"
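
The wording "no kind is registered for the type ... in scheme" is what a Kubernetes-style runtime scheme reports when it is asked for the kind of a Go type that was never registered with it. This report does not state the root cause, so the following is only a toy model of that lookup, not the apimachinery or ocs-operator code. Scheme, AddKnownType, ObjectKind and the placeholder ServiceMonitor struct are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// GroupVersionKind identifies an API type, e.g. monitoring.coreos.com/v1 ServiceMonitor.
type GroupVersionKind struct {
	Group, Version, Kind string
}

// Scheme is a toy type registry (hypothetical): it maps Go types to API kinds.
type Scheme struct {
	name  string
	kinds map[reflect.Type]GroupVersionKind
}

// NewScheme returns an empty registry with the given name.
func NewScheme(name string) *Scheme {
	return &Scheme{name: name, kinds: map[reflect.Type]GroupVersionKind{}}
}

// AddKnownType registers the Go type of obj under the given kind.
func (s *Scheme) AddKnownType(gvk GroupVersionKind, obj interface{}) {
	s.kinds[reflect.TypeOf(obj)] = gvk
}

// ObjectKind looks up the kind for obj and fails if its type was never registered.
func (s *Scheme) ObjectKind(obj interface{}) (GroupVersionKind, error) {
	t := reflect.TypeOf(obj)
	gvk, ok := s.kinds[t]
	if !ok {
		return GroupVersionKind{}, fmt.Errorf("no kind is registered for the type %v in scheme %q", t, s.name)
	}
	return gvk, nil
}

// ServiceMonitor is a placeholder type for this sketch only.
type ServiceMonitor struct{}

func main() {
	scheme := NewScheme("toy-scheme")

	// A lookup before registration fails with an error shaped like the one in the log.
	if _, err := scheme.ObjectKind(ServiceMonitor{}); err != nil {
		fmt.Println("before registration:", err)
	}

	// After the type is registered, the same lookup succeeds.
	scheme.AddKnownType(GroupVersionKind{
		Group:   "monitoring.coreos.com",
		Version: "v1",
		Kind:    "ServiceMonitor",
	}, ServiceMonitor{})
	gvk, err := scheme.ObjectKind(ServiceMonitor{})
	fmt.Println("after registration:", gvk.Kind, err)
}
```

In this model, any client call that needs to resolve the type fails until the type is added to the registry, which would be consistent with the reconcile error repeating on every attempt.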

Note: Due to clock skew in the mons, the cluster has been in health warn state since the beginning, but that should not be the cause of this issue.


Version of all relevant components (if applicable):
=====================================================
OCP = 4.7.0-0.nightly-2021-01-07-034013
OCS = ocs-operator.v4.7.0-228.ci and ocs-operator.v4.7.0-229.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
====================================================================
Yes, no MCG or RGW metrics are available in the dashboard.

Is there any workaround available to the best of your knowledge?
============================================================
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
============================

Is this issue reproducible?
=============================
Yes, observed on 2 clusters:

1. VMware dynamic - OCS build 4.7.0-228.ci
2. VMware LSO with arbiter enabled - OCS build 4.7.0-229.ci


Can this issue be reproduced from the UI?
=====================================
OCS was installed via UI.


If this is a regression, please provide more details to justify this:
=====================================================
Yes

Steps to Reproduce:
=======================
1. Install the latest OCP 4.7 nightly build.
2. For VMware dynamic, install the OCS operator, here 4.7.0-228.ci. The operator pods are created (though the pods respun a couple of times, see bug 1909268).
3. Install the StorageCluster.
  a) The namespace openshift-storage gets labelled with [openshift.io/cluster-monitoring: "true"] once StorageCluster creation starts.
  b) All pods are UP, but due to an NTP issue, Ceph was in health warn with clock skew. See next comment.
4. Log in to the UI and check the dashboards.

Actual results:
===================
The Object Service dashboard does not display any metrics.

Expected results:
=====================
The Object Service dashboard should display some status and information for both MCG and RGW.

Comment 7 Martin Bukatovic 2021-02-02 22:27:42 UTC

I reproduced the issue with ocs-operator.v4.7.0-228.ci, where I see 2 TargetDown alerts (100% of the noobaa-mgmt/noobaa-mgmt targets in openshift-storage namespace are down; 100% of the s3/s3 targets in openshift-storage namespace are down) and no NooBaa metrics (I checked NooBaa_bucket_status) can be queried via OCP Prometheus. The Object dashboard is empty.

With 4.7.0-249.ci, there are no such TargetDown alerts, and the NooBaa_bucket_status metric is present in OCP Prometheus. There is no delay compared to other Ceph metrics. The Object dashboard reports data.
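
A check like the one above, whether a metric such as NooBaa_bucket_status is present in Prometheus, can be automated by counting the series in an instant-query reply. The sketch below assumes the standard Prometheus HTTP API response shape (status, data.resultType, data.result). The helper name seriesCount and both JSON payloads are hypothetical and were not captured from the clusters in this bug.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryResponse mirrors the parts of a Prometheus instant-query reply used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string            `json:"resultType"`
		Result     []json.RawMessage `json:"result"`
	} `json:"data"`
}

// seriesCount returns how many series a query reply contains.
// It returns an error if the body is not valid JSON or the query did not succeed.
func seriesCount(body []byte) (int, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, fmt.Errorf("decoding query response: %w", err)
	}
	if resp.Status != "success" {
		return 0, fmt.Errorf("query status %q", resp.Status)
	}
	return len(resp.Data.Result), nil
}

func main() {
	// Hypothetical payloads for illustration only.
	missing := []byte(`{"status":"success","data":{"resultType":"vector","result":[]}}`)
	present := []byte(`{"status":"success","data":{"resultType":"vector","result":[` +
		`{"metric":{"__name__":"NooBaa_bucket_status"},"value":[1612304862,"1"]}]}}`)

	for _, body := range [][]byte{missing, present} {
		n, err := seriesCount(body)
		if err != nil {
			fmt.Println("query failed:", err)
			continue
		}
		// Zero series means the metric is not being scraped.
		fmt.Printf("series returned: %d, metric present: %t\n", n, n > 0)
	}
}
```

An empty result with a "success" status corresponds to the failing state described for 4.7.0-228.ci, and a non-empty result to the state described for 4.7.0-249.ci.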

Verified

Comment 8 Martin Bukatovic 2021-02-02 22:35:36 UTC

*** Bug 1919385 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2021-05-19 09:17:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041