Description of problem (please be as detailed as possible and provide log snippets):

On the Regional DR setup, a few hours after the workload deployment, Ceph reports no active mgr on managed cluster C2. The cluster had RBD and CephFS based PVCs DR protected.

Output on the affected cluster (C2):

sh-5.1$ ceph status
  cluster:
    id:     018d44db-a132-443d-b7ff-7c1a07d303de
    health: HEALTH_WARN
            no active mgr

  services:
    mon:        3 daemons, quorum d,e,f (age 22h)
    mgr:        no daemons active (since 17h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 22h), 3 in (since 23h)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 8.06k objects, 16 GiB
    usage:   45 GiB used, 6.0 TiB / 6.0 TiB avail
    pgs:     169 active+clean

  io:
    client:   2.1 MiB/s rd, 25 MiB/s wr, 108 op/s rd, 224 op/s wr

Version of all relevant components (if applicable):
ODF - 4.14.0-150
OCP - 4.14.0-0.nightly-2023-10-15-164249
Submariner - 0.16.0 (brew.registry.redhat.io/rh-osbs/iib:594788)
ACM - 2.9.0 (2.9.0-DOWNSTREAM-2023-10-03-20-08-35)
Ceph - ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. On a Regional DR setup, keep the RBD and CephFS based workloads on the managed clusters (C1, C2) running for a few hours.
2. On managed cluster C2, Ceph reports no active mgr while a ceph-mgr pod is in Running state:
$ odf-pods | grep mgr
rook-ceph-mgr-a-7cdbc5b5db-9l74n    2/2    Running    0    27h

$ ceph status
  cluster:
    id:     018d44db-a132-443d-b7ff-7c1a07d303de
    health: HEALTH_WARN
            no active mgr

  services:
    mon:        3 daemons, quorum d,e,f (age 27h)
    mgr:        no daemons active (since 21h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 27h), 3 in (since 27h)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 8.06k objects, 16 GiB
    usage:   45 GiB used, 6.0 TiB / 6.0 TiB avail
    pgs:     169 active+clean

  io:
    client:   2.1 MiB/s rd, 25 MiB/s wr, 108 op/s rd, 224 op/s wr

$ ceph osd blocklist ls
10.129.2.72:0/688561354 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1561870797 2023-10-18T18:51:11.057931+0000
10.129.2.72:6801/2358525361 2023-10-18T18:51:11.057931+0000
10.129.2.72:6800/2358525361 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1224440434 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/704399710 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1701803902 2023-10-18T18:51:11.057931+0000
listed 7 entries

$ kubectl rook-ceph -n openshift-storage dr health
Info: fetching the cephblockpools with mirroring enabled
Info: found "ocs-storagecluster-cephblockpool" cephblockpool with mirroring enabled
Info: running ceph status from peer cluster
Info:
  cluster:
    id:     af1877b4-e193-4373-be97-290e8eae4ce7
    health: HEALTH_WARN
            1 slow ops, oldest one blocked for 97448 sec, mon.f has slow ops

  services:
    mon:        3 daemons, quorum d,e,f (age 27h)
    mgr:        a (active, since 28h)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 27h), 3 in (since 28h)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 22.89k objects, 74 GiB
    usage:   215 GiB used, 5.8 TiB / 6.0 TiB avail
    pgs:     169 active+clean

  io:
    client:   80 KiB/s rd, 1.0 MiB/s wr, 57 op/s rd, 201 op/s wr

Info: running mirroring daemon health   =====> Final output hangs here

Subctl verify:
C1 - http://pastebin.test.redhat.com/1110653
C2 - http://pastebin.test.redhat.com/1110652

Critical alerts on the UI:
CephMgrIsAbsent - Ceph Manager has disappeared from Prometheus target discovery.

Expected results:
Ceph should not report "no active mgr", and the cluster should remain healthy.

Live cluster is available for debugging:
HUB - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30559/
C1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30560/
C2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30561/

Must gather:
c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/c1/
c2 (affected cluster) - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/c2/
hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/hub/
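For reference, a minimal sketch of how the checks above can be repeated on the affected managed cluster. This assumes the rook-ceph toolbox deployment (rook-ceph-tools) is enabled in the openshift-storage namespace and that the rook-ceph kubectl krew plugin is installed; pod and deployment names are taken from this cluster and may differ elsewhere.

# Enter the toolbox to run ceph commands on the managed cluster
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Inside the toolbox: confirm mgr state and look for stale blocklist entries
ceph status
ceph mgr stat
ceph osd blocklist ls

# From a workstation with the rook-ceph krew plugin: check DR/mirroring health
kubectl rook-ceph -n openshift-storage dr health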
Shyam, can you please help with documenting the workaround here?
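Until the workaround is documented here, the sketch below shows a generic recovery sequence for a stuck mgr. This is an assumption, not a verified workaround for this bug: it presumes that restarting the mgr pod lets it re-register with the monitors, and the blocklist address shown is only an example copied from the report above.

# Hypothetical recovery steps, pending the documented workaround:
# 1. Restart the mgr pod so it re-registers with the monitors
oc -n openshift-storage delete pod -l app=rook-ceph-mgr

# 2. If the mgr address is still blocklisted, remove the stale entry
#    (address below is an example from the report, not a prescription)
ceph osd blocklist rm 10.129.2.72:0/688561354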
We have seen this issue in 4.14 testing. With 4.15, we have 2 MGRs, so we might need to see the behaviour with 4.15 testing.
(In reply to krishnaram Karthick from comment #15)
> We have seen this issue in 4.14 testing.
> With 4.15, we have 2 MGRs, so we might need to see the behaviour with 4.15
> testing.

Hi Karthick, I don't understand the reason behind making this a blocker. In 4.15 we will have 2 mgrs by default, so according to https://bugzilla.redhat.com/show_bug.cgi?id=2255616#c3 we will have an extra cushion in case this issue is hit. Please correct me if my understanding is wrong.
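For reference, a quick way to confirm how many mgr daemons Rook is running on a given managed cluster. This is a sketch: ocs-storagecluster-cephcluster is the usual ODF default CephCluster name and spec.mgr.count is the Rook field that controls the mgr count, but both should be verified on the live cluster.

# Check the mgr count requested in the Rook CephCluster CR
oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
  -o jsonpath='{.spec.mgr.count}{"\n"}'

# Check how many mgr pods are actually running
oc -n openshift-storage get pods -l app=rook-ceph-mgr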
*** Bug 2255616 has been marked as a duplicate of this bug. ***
Could this be related to https://bugzilla.redhat.com/show_bug.cgi?id=2171847?