Bug 1940312

Summary: [OCS 4.8] Persistent Storage Dashboard throws Alert - Ceph Manager has disappeared from Prometheus target discovery and Object Service dashboard has Unknown Data Resiliency
Product:          [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter:         Neha Berry <nberry>
Component:        rook
Assignee:         Travis Nielsen <tnielsen>
Status:           CLOSED ERRATA
QA Contact:       Neha Berry <nberry>
Severity:         medium
Priority:         unspecified
Version:          4.8
CC:               asachan, madam, mbukatov, muagarwa, ocs-bugs, sostapov
Target Milestone: ---
Keywords:         TestBlocker
Target Release:   OCS 4.8.0
Hardware:         Unspecified
OS:               Unspecified
Doc Type:         No Doc Update
Story Points:     ---
Last Closed:      2021-08-03 18:15:14 UTC
Type:             Bug
Regression:       ---
Mount Type:       ---

Description Neha Berry 2021-03-18 07:05:41 UTC
Description of problem:
==================================
Installed OCS 4.8 in Internal Attached Mode on VMware, and post deployment the following alert is seen in the Persistent Storage dashboard:

>> Ceph Manager has disappeared from Prometheus target discovery.

Not sure of the impact of this, as the Ceph MGR is shown as up in ceph status.

POD
======
rook-ceph-mgr-a-6f79896dbf-qxpvx                                  2/2     Running     0          12m   10.131.0.33    compute-0   <none>           <none>




Version-Release number of selected component (if applicable):
================================================================
OCP = 4.8.0-0.nightly-2021-03-18-000857
OCS = ocs-operator.v4.8.0-303.ci and ocs-operator.v4.8.0-302.ci

"mgr": {
        "ceph version 14.2.11-133.el8cp (b35842cdf727a690afe60d0a32cdbca7da7171c8) nautilus (stable)": 1
    },


How reproducible:
====================
Always

Steps to Reproduce:
========================
1. Install OCP 4.8 nightly
2. Install the latest OCS 4.8 in Internal Attached mode (need to confirm whether a similar issue is seen in dynamic mode too)
3. Once OCS is installed, check the Overview -> Persistent Storage dashboard
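To confirm the alert outside the dashboard, one can query the in-cluster Prometheus alerts API. A minimal sketch of parsing that response is below; the alert name "CephMgrIsAbsent" and the sample payload are assumptions for illustration — in a real cluster the JSON would come from Prometheus's /api/v1/alerts endpoint.

```python
import json

# Hypothetical /api/v1/alerts response showing the mgr-absent alert firing.
# In practice this JSON would be fetched from the in-cluster Prometheus API.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "CephMgrIsAbsent", "severity": "critical"},
         "state": "firing"},
    ]},
})

def firing_alerts(raw: str) -> list:
    """Return the names of all alerts currently in the 'firing' state."""
    body = json.loads(raw)
    return [a["labels"]["alertname"]
            for a in body["data"]["alerts"]
            if a["state"] == "firing"]

print(firing_alerts(SAMPLE_RESPONSE))  # ['CephMgrIsAbsent']
```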

Actual results:
==================
The following alert is seen on the Status page:

Mar 18, 12:16 pm
Ceph Manager has disappeared from Prometheus target discovery.


Expected results:
=====================
No Alert should be seen.



Additional info:
=======================
ceph status
---------------

Thu Mar 18 07:00:31 UTC 2021
  cluster:
    id:     bab7a0f7-41bb-4de0-8ff6-526f4ce8b58f
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 14m)
    mgr: a(active, since 14m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 14m), 3 in (since 14m)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)
 


Namespace labelling

  labels:
    olm.operatorgroup.uid/9466e021-e5c1-4452-b9f4-0a734f5bf99b: ""
    olm.operatorgroup.uid/79202b46-7af8-4645-a0df-36daf35ce36e: ""
    openshift.io/cluster-monitoring: "true"

Comment 6 Travis Nielsen 2021-03-18 14:40:39 UTC
This looks related to the Rook change to support multiple mgrs... If there is only a single mgr, for consistency the label with the mgr name also needs to be added.
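The mechanism behind this can be sketched with plain label matching: a ServiceMonitor only discovers endpoints whose labels satisfy its selector, so if the single-mgr pod is missing the mgr-name label the selector expects, the Prometheus target silently disappears. The label keys and values below are illustrative assumptions, not the exact ones used by Rook.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Kubernetes-style equality matching: every selector key/value
    must be present in the object's labels."""
    return all(labels.get(k) == v for k, v in selector.items())

# Hypothetical selector on the mgr ServiceMonitor: it expects a
# mgr-name label in addition to the app label.
selector = {"app": "rook-ceph-mgr", "ceph_daemon_id": "a"}

# Before the fix: the single-mgr pod carries only the app label,
# so the endpoint is not selected and the Prometheus target vanishes.
pod_before = {"app": "rook-ceph-mgr"}

# After the fix (rook/rook#7440): the active mgr name label is
# always added, even when there is only one mgr.
pod_after = {"app": "rook-ceph-mgr", "ceph_daemon_id": "a"}

print(matches(selector, pod_before))  # False -> target missing, alert fires
print(matches(selector, pod_after))   # True  -> target discovered
```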

Comment 7 Travis Nielsen 2021-03-18 22:39:01 UTC
Now adding the active mgr name to the labels as required by the service monitor...
https://github.com/rook/rook/pull/7440

Comment 10 Travis Nielsen 2021-03-21 14:27:33 UTC
Will be picked up in the next 4.8 build with the sync from rook master:
https://github.com/openshift/rook/pull/197

Comment 15 errata-xmlrpc 2021-08-03 18:15:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003