.Ceph reports "no active mgr" after workload deployment
After workload deployment, the Ceph manager can lose connectivity to the MONs or become unable to respond to its liveness probe.
As a result, the ODF cluster status reports "no active mgr", and operations that rely on the Ceph manager for request processing fail, for example, volume provisioning and CephFS snapshot creation.
To check the status of the ODF cluster, use the command `oc get cephcluster -n openshift-storage`. In the status output, the `status.ceph.details.MGR_DOWN` field will have the message "no active mgr" if your cluster has this issue.
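The check described above could be scripted roughly as follows. This is a minimal sketch, not a supported procedure: it assumes a single `CephCluster` resource in `openshift-storage` and that the `MGR_DOWN` detail exposes a `message` sub-field. The function names `get_mgr_down_message` and `is_mgr_down` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: detect the "no active mgr" condition from the CephCluster status.
# Assumptions (not from the original note): one CephCluster in the namespace,
# and the MGR_DOWN detail carries its text in a `message` sub-field.

# Print the MGR_DOWN message; expected to print nothing when the key is absent.
get_mgr_down_message() {
  oc get cephcluster -n openshift-storage \
    -o jsonpath='{.items[0].status.ceph.details.MGR_DOWN.message}'
}

# Return 0 (affected) when the given text mentions "no active mgr", else 1.
is_mgr_down() {
  case "$1" in
    *"no active mgr"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `if is_mgr_down "$(get_mgr_down_message)"; then echo "cluster is affected"; fi` prints a line only on an affected cluster.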
To work around this issue, restart the Ceph manager pod by running the following commands:
+
----
# oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=0
----
+
----
# oc scale deployment -n openshift-storage rook-ceph-mgr-a --replicas=1
----
After running these commands, the ODF cluster status reports a healthy cluster, with no warnings or errors regarding `MGR_DOWN`.
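The two scale commands can be wrapped in a small helper that also waits for the new manager pod to roll out. This is a sketch under stated assumptions: the deployment name and namespace are taken from the workaround above, while the function name `restart_ceph_mgr`, the `OC_CMD` override, and the 120-second timeout are illustrative choices, not part of the documented workaround.

```shell
#!/usr/bin/env bash
# Sketch: restart the Ceph manager by scaling its deployment down and back up.
# OC_CMD is a hypothetical override (for example OC_CMD=echo) that lets the
# command sequence be previewed without touching a cluster.

restart_ceph_mgr() {
  local oc_cmd="${OC_CMD:-oc}"
  local ns="openshift-storage"
  local deploy="rook-ceph-mgr-a"

  # Same two steps as the manual workaround.
  $oc_cmd scale deployment -n "$ns" "$deploy" --replicas=0 || return 1
  $oc_cmd scale deployment -n "$ns" "$deploy" --replicas=1 || return 1

  # Wait for the deployment to report the new pod as rolled out,
  # or give up after the (assumed) timeout.
  $oc_cmd rollout status deployment -n "$ns" "$deploy" --timeout=120s
}
```

Running `OC_CMD=echo restart_ceph_mgr` prints the three commands instead of executing them, which is a cheap way to review the sequence before running it for real.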
Description of problem (please be as detailed as possible and provide log
snippets):
On the Regional DR setup, within a few hours of workload deployment, Ceph reports no active mgr on managed cluster C2.
The cluster had RBD- and CephFS-based PVCs that were DR protected.
Output in C1
sh-5.1$ ceph status
cluster:
id: 018d44db-a132-443d-b7ff-7c1a07d303de
health: HEALTH_WARN
no active mgr
services:
mon: 3 daemons, quorum d,e,f (age 22h)
mgr: no daemons active (since 17h)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 22h), 3 in (since 23h)
rbd-mirror: 1 daemon active (1 hosts)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 169 pgs
objects: 8.06k objects, 16 GiB
usage: 45 GiB used, 6.0 TiB / 6.0 TiB avail
pgs: 169 active+clean
io:
client: 2.1 MiB/s rd, 25 MiB/s wr, 108 op/s rd, 224 op/s wr
Version of all relevant components (if applicable):
ODF - 4.14.0-150
OCP - 4.14.0-0.nightly-2023-10-15-164249
Submariner - 0.16.0(brew.registry.redhat.io/rh-osbs/iib:594788)
ACM - 2.9.0(2.9.0-DOWNSTREAM-2023-10-03-20-08-35)
Ceph Version - ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable)
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. On a Regional DR setup, keep the RBD- and CephFS-based workloads on the managed clusters (C1, C2) running for a few hours.
2. On managed cluster C2, Ceph reports no active mgr, even though a Ceph mgr pod is in the Running state.
$ odf-pods | grep mgr
rook-ceph-mgr-a-7cdbc5b5db-9l74n 2/2 Running 0 27h
ceph status
cluster:
id: 018d44db-a132-443d-b7ff-7c1a07d303de
health: HEALTH_WARN
no active mgr
services:
mon: 3 daemons, quorum d,e,f (age 27h)
mgr: no daemons active (since 21h)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 27h), 3 in (since 27h)
rbd-mirror: 1 daemon active (1 hosts)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 169 pgs
objects: 8.06k objects, 16 GiB
usage: 45 GiB used, 6.0 TiB / 6.0 TiB avail
pgs: 169 active+clean
io:
client: 2.1 MiB/s rd, 25 MiB/s wr, 108 op/s rd, 224 op/s wr
$ ceph osd blocklist ls
10.129.2.72:0/688561354 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1561870797 2023-10-18T18:51:11.057931+0000
10.129.2.72:6801/2358525361 2023-10-18T18:51:11.057931+0000
10.129.2.72:6800/2358525361 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1224440434 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/704399710 2023-10-18T18:51:11.057931+0000
10.129.2.72:0/1701803902 2023-10-18T18:51:11.057931+0000
listed 7 entries
$ kubectl rook-ceph -n openshift-storage dr health
Info: fetching the cephblockpools with mirroring enabled
Info: found "ocs-storagecluster-cephblockpool" cephblockpool with mirroring enabled
Info: running ceph status from peer cluster
Info: cluster:
id: af1877b4-e193-4373-be97-290e8eae4ce7
health: HEALTH_WARN
1 slow ops, oldest one blocked for 97448 sec, mon.f has slow ops
services:
mon: 3 daemons, quorum d,e,f (age 27h)
mgr: a(active, since 28h)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 27h), 3 in (since 28h)
rbd-mirror: 1 daemon active (1 hosts)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 169 pgs
objects: 22.89k objects, 74 GiB
usage: 215 GiB used, 5.8 TiB / 6.0 TiB avail
pgs: 169 active+clean
io:
client: 80 KiB/s rd, 1.0 MiB/s wr, 57 op/s rd, 201 op/s wr
Info: running mirroring daemon health
=====> Final output hangs here
Subctl verify
C1 - http://pastebin.test.redhat.com/1110653
C2 - http://pastebin.test.redhat.com/1110652
In critical alerts on UI
CephMgrIsAbsent
Ceph Manager has disappeared from Prometheus target discovery.
Expected results:
Ceph should not report "no active mgr", and the cluster should remain healthy.
Live Cluster is available for debugging:
HUB - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30559/
C1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30560/
C2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/30561/
Must gather
c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/c1/
c2(affected cluster) - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/c2/
hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/ceph-mgr/hub/
(In reply to krishnaram Karthick from comment #15)
> We have seen this issue in 4.14 testing.
> With 4.15, we have 2 MGRs, so we might need to see the behaviour with 4.15
> testing.
Hi Karthick,
I don't understand the reason for making this a blocker. In 4.15 we will have 2 MGRs by default, so according to https://bugzilla.redhat.com/show_bug.cgi?id=2255616#c3 we will have an extra cushion if this issue is hit. Please correct me if my understanding is wrong.