Bug 2244873

Summary:	[RDR]Ceph reports "no active mgr" after workload deployment
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	kmanohar
Component:	ceph	Assignee:	Nitzan mordechai <nmordech>
ceph sub component:	RADOS	QA Contact:	Elad <ebenahar>
Status:	CLOSED WORKSFORME	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	amagrawa, bhubbard, bniver, edonnell, kramdoss, muagarwa, nojha, pdhange, prsurve, sagrawal, sheggodu, sostapov, srangana, tnielsen
Version:	4.14
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Known Issue
Doc Text:	.Ceph reports "no active mgr" after workload deployment
After workload deployment, the Ceph manager loses connectivity to MONs or is unable to respond to its liveness probe. This causes the ODF cluster status to report that there is "no active mgr". As a result, multiple operations that use the Ceph manager for request processing fail, for example, volume provisioning and creating CephFS snapshots.

To check the status of the ODF cluster, use the command `oc get cephcluster -n openshift-storage`. In the status output, the `status.ceph.details.MGR_DOWN` field will have the message "no active mgr" if your cluster has this issue.

To work around this issue, restart the Ceph manager pods using the following commands:
+
----
# oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=0
----
+
----
# oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=1
----

After running these commands, the ODF cluster status reports a healthy cluster, with no warnings or errors regarding `MGR_DOWN`.
Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-06-24 04:58:25 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2244409

Description kmanohar 2023-10-18 16:58:34 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
On the Regional DR setup, a few hours after workload deployment, Ceph reports "no active mgr" on managed cluster C2.
The cluster had RBD- and CephFS-based PVCs that were DR protected.

Output in C2

sh-5.1$ ceph status
  cluster:
    id:     018d44db-a132-443d-b7ff-7c1a07d303de
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon:        3 daemons, quorum d,e,f (age 22h)
    mgr:        no daemons active (since 17h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 22h), 3 in (since 23h)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 8.06k objects, 16 GiB
    usage:   45 GiB used, 6.0 TiB / 6.0 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   2.1 MiB/s rd, 25 MiB/s wr, 108 op/s rd, 224 op/s wr

Version of all relevant components (if applicable):

ODF - 4.14.0-150
OCP - 4.14.0-0.nightly-2023-10-15-164249
Submariner - 0.16.0(brew.registry.redhat.io/rh-osbs/iib:594788)
ACM - 2.9.0(2.9.0-DOWNSTREAM-2023-10-03-20-08-35)
Ceph Version - ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable) 

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On a Regional DR setup, keep the RBD- and CephFS-based workloads on the managed clusters (C1, C2) running for a few hours.
2. On managed cluster C2, Ceph reports "no active mgr" even though the Ceph mgr pod is in the Running state.

$ odf-pods | grep mgr
rook-ceph-mgr-a-7cdbc5b5db-9l74n                                  2/2     Running     0             27h

$ ceph status
  cluster:
    id:     018d44db-a132-443d-b7ff-7c1a07d303de
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon:        3 daemons, quorum d,e,f (age 27h)
    mgr:        no daemons active (since 21h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 27h), 3 in (since 27h)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 8.06k objects, 16 GiB
    usage:   45 GiB used, 6.0 TiB / 6.0 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   2.1 MiB/s rd, 25 MiB/s wr, 108 op/s rd, 224 op/s wr
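The condition visible in the `ceph status` output above could be checked from a script roughly as follows. This is a minimal sketch, not part of the original report: the function name `mgr_is_down` is hypothetical, and it assumes the two strings seen in this report ("no active mgr" in the health section, "no daemons active" on the mgr line) are how the condition appears.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ceph status` text on stdin and returns 0
# when it indicates that no manager is active, non-zero otherwise.
# The patterns are taken from the output quoted in this report.
mgr_is_down() {
    grep -Eq 'no active mgr|mgr:[[:space:]]+no daemons active'
}
```

It might be used as `ceph status | mgr_is_down && echo "mgr is down"`. Since the mgr pod can be Running while Ceph still reports no active mgr, checking pod state alone would not catch this case.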

$ ceph osd blocklist ls
10.129.2.72:0/688561354 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1561870797 2023-10-18T18:51:11.057931+0000
10.129.2.72:6801/2358525361 2023-10-18T18:51:11.057931+0000
10.129.2.72:6800/2358525361 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1224440434 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/704399710 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1701803902 2023-10-18T18:51:11.057931+0000
listed 7 entries

$ kubectl rook-ceph -n openshift-storage dr health
Info: fetching the cephblockpools with mirroring enabled
Info: found "ocs-storagecluster-cephblockpool" cephblockpool with mirroring enabled
Info: running ceph status from peer cluster
Info:   cluster:
    id:     af1877b4-e193-4373-be97-290e8eae4ce7
    health: HEALTH_WARN
            1 slow ops, oldest one blocked for 97448 sec, mon.f has slow ops
 
  services:
    mon:        3 daemons, quorum d,e,f (age 27h)
    mgr:        a(active, since 28h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 27h), 3 in (since 28h)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 22.89k objects, 74 GiB
    usage:   215 GiB used, 5.8 TiB / 6.0 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   80 KiB/s rd, 1.0 MiB/s wr, 57 op/s rd, 201 op/s wr
 

Info: running mirroring daemon health

=====> Final output hangs here  



Subctl verify 

C1 - http://pastebin.test.redhat.com/1110653
C2 - http://pastebin.test.redhat.com/1110652

Critical alert shown in the UI:

CephMgrIsAbsent 
Ceph Manager has disappeared from Prometheus target discovery. 

Expected results:
Ceph should not report "no active mgr" and should remain healthy.

Live Cluster is available for debugging:

HUB - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30559/

C1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30560/

C2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30561/


Must gather 

c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/c1/

c2(affected cluster) - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/c2/

hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/hub/

Comment 5 Mudit Agarwal 2023-10-26 13:59:54 UTC

Shyam, can you please help with documenting the workaround here?
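
For reference, the workaround recorded in the Doc Text field of this bug (scale the mgr deployment to 0 replicas, then back to 1) could be scripted roughly as below. This is a sketch under assumptions: the function and variable names are hypothetical, and the namespace and deployment name are the ones given in the Doc Text and may differ on other clusters.

```shell
#!/usr/bin/env bash
# Values taken from the Doc Text of this bug; adjust if your cluster differs.
NAMESPACE="openshift-storage"
DEPLOYMENT="rook-ceph-mgr-a"

# Hypothetical wrapper around the two documented `oc scale` commands.
# Stops at the first command that fails.
restart_ceph_mgr() {
    oc scale deployment -n "$NAMESPACE" "$DEPLOYMENT" --replicas=0 || return 1
    oc scale deployment -n "$NAMESPACE" "$DEPLOYMENT" --replicas=1 || return 1
}
```

After calling `restart_ceph_mgr`, the cluster status can be rechecked with `oc get cephcluster -n openshift-storage`, as described in the Doc Text.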

Comment 15 krishnaram Karthick 2024-01-16 06:23:12 UTC

We have seen this issue in 4.14 testing. 
With 4.15, we have 2 MGRs, so we might need to see the behaviour with 4.15 testing.

Comment 16 Mudit Agarwal 2024-01-21 13:52:58 UTC

(In reply to krishnaram Karthick from comment #15)
> We have seen this issue in 4.14 testing. 
> With 4.15, we have 2 MGRs, so we might need to see the behaviour with 4.15
> testing.

Hi Karthick,

I don't understand the reason for making this a blocker. In 4.15 we will have 2 mgrs by default, so according to https://bugzilla.redhat.com/show_bug.cgi?id=2255616#c3 we will have extra cushion in case this issue is hit. Please correct me if my understanding is wrong.

Comment 17 Mudit Agarwal 2024-01-21 13:53:08 UTC

*** Bug 2255616 has been marked as a duplicate of this bug. ***

Comment 18 Travis Nielsen 2024-02-15 19:07:58 UTC

Could this be related to https://bugzilla.redhat.com/show_bug.cgi?id=2171847?

Comment 20 Red Hat Bugzilla 2024-10-23 04:25:03 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days