Bug 2244873 - [RDR]Ceph reports "no active mgr" after workload deployment [NEEDINFO]
Summary: [RDR]Ceph reports "no active mgr" after workload deployment
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nitzan mordechai
QA Contact: Elad
URL:
Whiteboard:
Duplicates: 2255616 (view as bug list)
Depends On:
Blocks: 2244409
 
Reported: 2023-10-18 16:58 UTC by kmanohar
Modified: 2024-05-10 12:27 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Ceph reports "no active mgr" after workload deployment

After workload deployment, the Ceph Manager loses connectivity to the MONs or is unable to respond to its liveness probe. This causes the ODF cluster status to report that there is "no active mgr", and multiple operations that use the Ceph Manager for request processing, such as volume provisioning and creating CephFS snapshots, fail.

To check the status of the ODF cluster, use the command `oc get cephcluster -n openshift-storage`. In the status output, the `status.ceph.details.MGR_DOWN` field will have the message "no active mgr" if your cluster has this issue.

To work around this issue, restart the Ceph Manager pods using the following commands:
+
----
# oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=0
----
+
----
# oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=1
----
After running these commands, the ODF cluster status reports a healthy cluster, with no warnings or errors regarding `MGR_DOWN`.
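A minimal shell sketch of the check and workaround above (the exact jsonpath into the CephCluster status is an assumption based on the `status.ceph.details.MGR_DOWN` field named in the text; namespace and deployment names are taken from the text and may differ on other clusters):
+
----
# Check whether the cluster is reporting MGR_DOWN / "no active mgr"
# (jsonpath assumes a single CephCluster object in openshift-storage)
oc get cephcluster -n openshift-storage \
  -o jsonpath='{.items[0].status.ceph.details.MGR_DOWN.message}{"\n"}'

# Workaround: restart the Ceph Manager by scaling its deployment down and back up
oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=0
oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=1
----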
Clone Of:
Environment:
Last Closed:
Embargoed:
pdhange: needinfo? (kmanohar)
muagarwa: needinfo? (kramdoss)



Description kmanohar 2023-10-18 16:58:34 UTC
Description of problem (please be as detailed as possible and provide log snippets):
On the Regional DR setup, within a few hours of workload deployment, Ceph reports "no active mgr" on managed cluster C2.
The cluster had RBD- and CephFS-based PVCs that are DR protected.

Output on C2 (the affected cluster):

sh-5.1$ ceph status
  cluster:
    id:     018d44db-a132-443d-b7ff-7c1a07d303de
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon:        3 daemons, quorum d,e,f (age 22h)
    mgr:        no daemons active (since 17h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 22h), 3 in (since 23h)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 8.06k objects, 16 GiB
    usage:   45 GiB used, 6.0 TiB / 6.0 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   2.1 MiB/s rd, 25 MiB/s wr, 108 op/s rd, 224 op/s wr

Version of all relevant components (if applicable):

ODF - 4.14.0-150
OCP - 4.14.0-0.nightly-2023-10-15-164249
Submariner - 0.16.0(brew.registry.redhat.io/rh-osbs/iib:594788)
ACM - 2.9.0(2.9.0-DOWNSTREAM-2023-10-03-20-08-35)
Ceph Version - ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable) 

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On a Regional DR setup, keep the RBD- and CephFS-based workloads on the managed clusters (C1, C2) running for a few hours.
2. On managed cluster C2, Ceph reports "no active mgr" even though the Ceph MGR pod is in the Running state.

$ odf-pods | grep mgr
rook-ceph-mgr-a-7cdbc5b5db-9l74n                                  2/2     Running     0             27h

ceph status
  cluster:
    id:     018d44db-a132-443d-b7ff-7c1a07d303de
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon:        3 daemons, quorum d,e,f (age 27h)
    mgr:        no daemons active (since 21h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 27h), 3 in (since 27h)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 8.06k objects, 16 GiB
    usage:   45 GiB used, 6.0 TiB / 6.0 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   2.1 MiB/s rd, 25 MiB/s wr, 108 op/s rd, 224 op/s wr

$ ceph osd blocklist ls
10.129.2.72:0/688561354 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1561870797 2023-10-18T18:51:11.057931+0000
10.129.2.72:6801/2358525361 2023-10-18T18:51:11.057931+0000
10.129.2.72:6800/2358525361 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1224440434 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/704399710 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1701803902 2023-10-18T18:51:11.057931+0000
listed 7 entries
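
All of the blocklist entries above point to a single client address (10.129.2.72). If that address belongs to the mgr pod, it would be consistent with the manager being blocklisted after losing its MON session; a quick cross-check (assuming the standard wide pod listing) is:

$ oc get pod -n openshift-storage -o wide | grep rook-ceph-mgr   # IP column shows the mgr pod address
$ ceph osd blocklist ls                                          # compare against the blocklisted addresses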

$ kubectl rook-ceph -n openshift-storage dr health
Info: fetching the cephblockpools with mirroring enabled
Info: found "ocs-storagecluster-cephblockpool" cephblockpool with mirroring enabled
Info: running ceph status from peer cluster
Info:   cluster:
    id:     af1877b4-e193-4373-be97-290e8eae4ce7
    health: HEALTH_WARN
            1 slow ops, oldest one blocked for 97448 sec, mon.f has slow ops
 
  services:
    mon:        3 daemons, quorum d,e,f (age 27h)
    mgr:        a(active, since 28h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 27h), 3 in (since 28h)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 22.89k objects, 74 GiB
    usage:   215 GiB used, 5.8 TiB / 6.0 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   80 KiB/s rd, 1.0 MiB/s wr, 57 op/s rd, 201 op/s wr
 

Info: running mirroring daemon health

=====> Final output hangs here  



Subctl verify 

C1 - http://pastebin.test.redhat.com/1110653
C2 - http://pastebin.test.redhat.com/1110652

Critical alerts on the UI:

CephMgrIsAbsent 
Ceph Manager has disappeared from Prometheus target discovery. 
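
The same condition can also be confirmed from the CLI (assuming the rook-ceph-tools toolbox deployment is enabled in openshift-storage):

$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail   # should include "MGR_DOWN: no active mgr"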

Expected results:
Ceph should not report "no active mgr" and should remain healthy.

Live clusters are available for debugging:

HUB - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30559/

C1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30560/

C2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30561/


Must-gather logs:

c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/c1/

c2(affected cluster) - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/c2/

hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/hub/

Comment 5 Mudit Agarwal 2023-10-26 13:59:54 UTC
Shyam, can you please help with documenting the workaround here?

Comment 15 krishnaram Karthick 2024-01-16 06:23:12 UTC
We have seen this issue in 4.14 testing. 
With 4.15, we have 2 MGRs, so we might need to see the behaviour with 4.15 testing.
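
As a quick sanity check (assuming default Rook mgr deployment naming and an enabled toolbox), the number of MGR daemons on a cluster can be verified with:

$ oc get deployment -n openshift-storage | grep rook-ceph-mgr              # 4.15 should show two mgr deployments (e.g. -a and -b)
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph mgr stat         # reports the active mgr and the number of standbys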

Comment 16 Mudit Agarwal 2024-01-21 13:52:58 UTC
(In reply to krishnaram Karthick from comment #15)
> We have seen this issue in 4.14 testing. 
> With 4.15, we have 2 MGRs, so we might need to see the behaviour with 4.15
> testing.

Hi Karthick,

I don't understand the reason behind making this a blocker. In 4.15 we will have 2 MGRs by default, so according to https://bugzilla.redhat.com/show_bug.cgi?id=2255616#c3 we will have an extra cushion in case this issue is hit. Please correct me if my understanding is wrong.

Comment 17 Mudit Agarwal 2024-01-21 13:53:08 UTC
*** Bug 2255616 has been marked as a duplicate of this bug. ***

Comment 18 Travis Nielsen 2024-02-15 19:07:58 UTC
Could this be related to https://bugzilla.redhat.com/show_bug.cgi?id=2171847?

