Description of problem: The cluster is unresponsive and ceph commands hang when a netsplit is created between the two data sites.

Workflow:

1. Deploy an RHCS stretch cluster with the daemon placement below.

HOST                                           ADDR          LABELS                                                 STATUS
ceph-pdhiran-22ki39-node1-installer (Arbiter)  10.0.211.122  _admin,alertmanager,grafana,installer,prometheus,mon
ceph-pdhiran-22ki39-node2 (DC1)                10.0.211.45   _admin,mgr,mon,mds,osd
ceph-pdhiran-22ki39-node3 (DC1)                10.0.209.24   nfs,osd,mgr,mon
ceph-pdhiran-22ki39-node4 (DC1)                10.0.211.12   _admin,osd,rgw
ceph-pdhiran-22ki39-node5 (DC2)                10.0.208.75   _admin,osd,mgr,mon                                     Offline
ceph-pdhiran-22ki39-node6 (DC2)                10.0.206.215  mon,osd,rgw,mgr                                        Offline
ceph-pdhiran-22ki39-node7 (DC2)                10.0.207.70   nfs,osd,mds                                            Offline

node1 is the arbiter node, carrying the monitoring stack components and the single tiebreaker mon; the other nodes carry the remaining daemons along with OSDs. This daemon placement follows the placement suggested in the MDR guide [1].

2. Run workloads and fill up the pools.

3. Introduce the netsplit by adding iptables rules on the DC1 hosts that block all incoming and outgoing traffic to and from DC2, using the commands below; a small helper sketch for applying and reverting these rules follows.

[root@ceph-pdhiran-22ki39-node4 ~]# iptables -A INPUT -s 10.0.208.75 -j DROP; iptables -A OUTPUT -d 10.0.208.75 -j DROP
[root@ceph-pdhiran-22ki39-node4 ~]# iptables -A INPUT -s 10.0.206.215 -j DROP; iptables -A OUTPUT -d 10.0.206.215 -j DROP
[root@ceph-pdhiran-22ki39-node4 ~]# iptables -A INPUT -s 10.0.207.70 -j DROP; iptables -A OUTPUT -d 10.0.207.70 -j DROP

With the above rules added on all DC1 hosts, all incoming and outgoing traffic between DC1 and DC2 is dropped.

Note: No iptables rules are added on the arbiter node. The arbiter node can still communicate with both data sites; only communication between DC1 and DC2 is dropped.
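The per-host commands above can be wrapped in a small helper so the same rules are applied and reverted consistently on every DC1 host. This is a minimal sketch and not part of the original reproducer; it assumes it is run as root on each DC1 host and that the DC2 addresses match the placement table above.

#!/usr/bin/env bash
# netsplit.sh -- apply or revert the DC1 -> DC2 traffic drop used in step 3.
# Hypothetical helper; run as root on each DC1 host.
set -euo pipefail

# DC2 host addresses taken from the placement table above.
DC2_IPS="10.0.208.75 10.0.206.215 10.0.207.70"

case "${1:-}" in
  apply)
    for ip in $DC2_IPS; do
      # Drop all traffic to and from this DC2 host.
      iptables -A INPUT  -s "$ip" -j DROP
      iptables -A OUTPUT -d "$ip" -j DROP
    done
    ;;
  revert)
    for ip in $DC2_IPS; do
      # Remove the rules added by 'apply'.
      iptables -D INPUT  -s "$ip" -j DROP
      iptables -D OUTPUT -d "$ip" -j DROP
    done
    ;;
  *)
    echo "usage: $0 {apply|revert}" >&2
    exit 1
    ;;
esac

Running "./netsplit.sh apply" on each DC1 host starts the netsplit, and "./netsplit.sh revert" restores connectivity.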
4. Observe that the ceph status shows that one site is down, which is as expected.

# ceph -s
  cluster:
    id:     d1baa022-68e9-11ee-8703-fa163ee116f5
    health: HEALTH_WARN
            Failed to apply 1 service(s): osd.all-available-devices
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum ceph-pdhiran-22ki39-node6,ceph-pdhiran-22ki39-node5,ceph-pdhiran-22ki39-node1-installer
            1 datacenter (12 osds) down
            12 osds down
            3 hosts (12 osds) down
            Degraded data redundancy: 46023/236072 objects degraded (19.495%), 122 pgs degraded, 238 pgs undersized
            794 slow ops, oldest one blocked for 193 sec, daemons [mon.ceph-pdhiran-22ki39-node2,mon.ceph-pdhiran-22ki39-node3,mon.ceph-pdhiran-22ki39-node6] have slow ops.

  services:
    mon: 5 daemons, quorum ceph-pdhiran-22ki39-node6,ceph-pdhiran-22ki39-node5,ceph-pdhiran-22ki39-node1-installer (age 61s), out of quorum: ceph-pdhiran-22ki39-node2, ceph-pdhiran-22ki39-node3
    mgr: ceph-pdhiran-22ki39-node2.fsqnfn(active, since 27h), standbys: ceph-pdhiran-22ki39-node3.tvqgjm, ceph-pdhiran-22ki39-node5.tabrcd, ceph-pdhiran-22ki39-node6.gulgol, ceph-pdhiran-22ki39-node1-installer.shjsea
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 12 up (since 72s), 24 in (since 26h)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 337 pgs
    objects: 59.02k objects, 1.4 GiB
    usage:   10 GiB used, 590 GiB / 600 GiB avail
    pgs:     46023/236072 objects degraded (19.495%)
             122 active+undersized+degraded
             116 active+undersized
             99  active+clean

[root@ceph-pdhiran-22ki39-node8 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                               STATUS  REWEIGHT  PRI-AFF
-1         0.58658  root default
-3         0.29358      datacenter DC1
-2         0.09799          host ceph-pdhiran-22ki39-node2
 2    hdd  0.02399              osd.2                         down   1.00000  1.00000
 8    hdd  0.02399              osd.8                         down   1.00000  1.00000
15    hdd  0.02399              osd.15                        down   1.00000  1.00000
21    hdd  0.02399              osd.21                        down   1.00000  1.00000
-4         0.09799          host ceph-pdhiran-22ki39-node3
 1    hdd  0.02399              osd.1                         down   1.00000  1.00000
 6    hdd  0.02399              osd.6                         down   1.00000  1.00000
12    hdd  0.02399              osd.12                        down   1.00000  1.00000
18    hdd  0.02399              osd.18                        down   1.00000  1.00000
-5         0.09760          host ceph-pdhiran-22ki39-node4
 0    hdd  0.02440              osd.0                         down   1.00000  1.00000
 7    hdd  0.02440              osd.7                         down   1.00000  1.00000
13    hdd  0.02440              osd.13                        down   1.00000  1.00000
19    hdd  0.02440              osd.19                        down   1.00000  1.00000
-7         0.29300      datacenter DC2
-6         0.09799          host ceph-pdhiran-22ki39-node5
 5    hdd  0.02399              osd.5                           up   1.00000  1.00000
11    hdd  0.02399              osd.11                          up   1.00000  1.00000
17    hdd  0.02399              osd.17                          up   1.00000  1.00000
23    hdd  0.02399              osd.23                          up   1.00000  1.00000
-8         0.09799          host ceph-pdhiran-22ki39-node6
 3    hdd  0.02399              osd.3                           up   1.00000  1.00000
 9    hdd  0.02399              osd.9                           up   1.00000  1.00000
14    hdd  0.02399              osd.14                          up   1.00000  1.00000
20    hdd  0.02399              osd.20                          up   1.00000  1.00000
-9         0.09799          host ceph-pdhiran-22ki39-node7
 4    hdd  0.02399              osd.4                           up   1.00000  1.00000
10    hdd  0.02399              osd.10                          up   1.00000  1.00000
16    hdd  0.02399              osd.16                          up   1.00000  1.00000
22    hdd  0.02399              osd.22                          up   1.00000  1.00000

[root@ceph-pdhiran-22ki39-node8 ~]# ceph orch host ls
HOST                                 ADDR          LABELS                                                     STATUS
ceph-pdhiran-22ki39-node1-installer  10.0.211.122  _admin,alertmanager,grafana,installer,prometheus,mon,mgr
ceph-pdhiran-22ki39-node2            10.0.211.45   _admin,mgr,mon,mds,osd
ceph-pdhiran-22ki39-node3            10.0.209.24   nfs,osd,mgr,mon
ceph-pdhiran-22ki39-node4            10.0.211.12   _admin,osd,rgw
ceph-pdhiran-22ki39-node5            10.0.208.75   _admin,osd,mgr,mon                                         Offline
ceph-pdhiran-22ki39-node6            10.0.206.215  mon,osd,rgw,mgr                                            Offline
ceph-pdhiran-22ki39-node7            10.0.207.70   nfs,osd,mds                                                Offline
7 hosts in cluster

[root@ceph-pdhiran-22ki39-node8 ~]# ceph health detail
HEALTH_WARN Failed to apply 1 service(s): osd.all-available-devices; We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer; 2/5 mons down, quorum ceph-pdhiran-22ki39-node6,ceph-pdhiran-22ki39-node5,ceph-pdhiran-22ki39-node1-installer; 1 datacenter (12 osds) down; 12 osds down; 3 hosts (12 osds) down; Degraded data redundancy: 46023/236072 objects degraded (19.495%), 122 pgs degraded, 238 pgs undersized; 4048 slow ops, oldest one blocked for 418 sec, daemons [mon.ceph-pdhiran-22ki39-node2,mon.ceph-pdhiran-22ki39-node3,mon.ceph-pdhiran-22ki39-node6] have slow ops.
[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s): osd.all-available-devices
    osd.all-available-devices: Failed to connect to ceph-pdhiran-22ki39-node5 (10.0.208.75): TimeoutError()
    Log:
        [conn=3, chan=27206] Command: sudo true
        [conn=2, chan=27130] Command: sudo true
        [conn=2, chan=27130] Received exit status 0
        [conn=2, chan=27130] Received channel close
        [conn=2, chan=27130] Channel closed
        [conn=2, chan=27131] Requesting new SSH session
        [conn=2, chan=27131] Command: sudo which python3
        [conn=3, chan=27206] Received exit status 0
        [conn=3, chan=27206] Received channel close
        [conn=3, chan=27206] Channel closed
        [conn=3, chan=27207] Requesting new SSH session
        [conn=3, chan=27207] Command: sudo which python3
        [conn=2, chan=27131] Received exit status 0
        [conn=2, chan=27131] Received channel close
        [conn=2, chan=27131] Channel closed
        [conn=2, chan=27132] Requesting new SSH session
        [conn=2, chan=27132] Command: sudo true
        [conn=3, chan=27207] Received exit status 0
        [conn=3, chan=27207] Received channel close
        [conn=3, chan=27207] Channel closed
        [conn=3, chan=27208] Requesting new SSH session
        [conn=3, chan=27208] Command: sudo true
        [conn=2, chan=27132] Received exit status 0
        [conn=2, chan=27132] Received channel close
        [conn=2, chan=27132] Channel closed
        [conn=2, chan=27133] Requesting new SSH session
        [conn=2, chan=27133] Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --env CEPH_VOLUME_OSDSPEC_AFFINITY=all-available-devices --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 --config-json - -- lvm batch --no-auto /dev/vdb /dev/vdc /dev/vdd /dev/vde --yes --no-systemd
        [conn=3, chan=27208] Received exit status 0
        [conn=3, chan=27208] Received channel close
        [conn=3, chan=27208] Channel closed
        [conn=3, chan=27209] Requesting new SSH session
        [conn=3, chan=27209] Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --env CEPH_VOLUME_OSDSPEC_AFFINITY=all-available-devices --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 --config-json - -- lvm batch --no-auto /dev/vdb /dev/vdc /dev/vdd /dev/vde --yes --no-systemd
        Opening SSH connection to 10.0.208.75, port 22
        Opening SSH connection to 10.0.206.215, port 22
        Opening SSH connection to 10.0.207.70, port 22
        [conn=3, chan=27209] Received exit status 0
        [conn=3, chan=27209] Received channel close
        [conn=3, chan=27209] Channel closed
        [conn=3, chan=27210] Requesting new SSH session
        [conn=2, chan=27133] Received exit status 0
        [conn=2, chan=27133] Received channel close
        [conn=2, chan=27133] Channel closed
        [conn=3, chan=27210] Command: sudo true
        [conn=2, chan=27134] Requesting new SSH session
        [conn=2, chan=27134] Command: sudo true
        [conn=3, chan=27210] Received exit status 0
        [conn=3, chan=27210] Received channel close
        [conn=3, chan=27210] Channel closed
        [conn=3, chan=27211] Requesting new SSH session
        [conn=3, chan=27211] Command: sudo which python3
        [conn=2, chan=27134] Received exit status 0
        [conn=2, chan=27134] Received channel close
        [conn=2, chan=27134] Channel closed
        [conn=2, chan=27135] Requesting new SSH session
        [conn=2, chan=27135] Command: sudo which python3
        [conn=3, chan=27211] Received exit status 0
        [conn=3, chan=27211] Received channel close
        [conn=3, chan=27211] Channel closed
        [conn=3, chan=27212] Requesting new SSH session
        [conn=2, chan=27135] Received exit status 0
        [conn=2, chan=27135] Received channel close
        [conn=2, chan=27135] Channel closed
        [conn=3, chan=27212] Command: sudo true
        [conn=2, chan=27136] Requesting new SSH session
        [conn=2, chan=27136] Command: sudo true
        [conn=3, chan=27212] Received exit status 0
        [conn=3, chan=27212] Received channel close
        [conn=3, chan=27212] Channel closed
        [conn=3, chan=27213] Requesting new SSH session
        [conn=2, chan=27136] Received exit status 0
        [conn=2, chan=27136] Received channel close
        [conn=2, chan=27136] Channel closed
        [conn=3, chan=27213] Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 -- lvm list --format json
        [conn=2, chan=27137] Requesting new SSH session
        [conn=2, chan=27137] Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 -- lvm list --format json
        [conn=3, chan=27213] Received exit status 0
        [conn=3, chan=27213] Received channel close
        [conn=3, chan=27213] Channel closed
        [conn=3, chan=27214] Requesting new SSH session
        [conn=3, chan=27214] Command: sudo true
        [conn=3, chan=27214] Received exit status 0
        [conn=3, chan=27214] Received channel close
        [conn=3, chan=27214] Channel closed
        [conn=3, chan=27215] Requesting new SSH session
        [conn=3, chan=27215] Command: sudo which python3
        [conn=3, chan=27215] Received exit status 0
        [conn=3, chan=27215] Received channel close
        [conn=3, chan=27215] Channel closed
        [conn=3, chan=27216] Requesting new SSH session
        [conn=3, chan=27216] Command: sudo true
        [conn=3, chan=27216] Received exit status 0
        [conn=3, chan=27216] Received channel close
        [conn=3, chan=27216] Channel closed
        [conn=3, chan=27217] Requesting new SSH session
        [conn=3, chan=27217] Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 -- raw list --format json
        [conn=2, chan=27137] Received exit status 0
        [conn=2, chan=27137] Received channel close
        [conn=2, chan=27137] Channel closed
        [conn=2, chan=27138] Requesting new SSH session
        [conn=2, chan=27138] Command: sudo true
        [conn=2, chan=27138] Received exit status 0
        [conn=2, chan=27138] Received channel close
        [conn=2, chan=27138] Channel closed
        [conn=2, chan=27139] Requesting new SSH session
        [conn=2, chan=27139] Command: sudo which python3
        [conn=2, chan=27139] Received exit status 0
        [conn=2, chan=27139] Received channel close
        [conn=2, chan=27139] Channel closed
        [conn=2, chan=27140] Requesting new SSH session
        [conn=2, chan=27140] Command: sudo true
        [conn=2, chan=27140] Received exit status 0
        [conn=2, chan=27140] Received channel close
        [conn=2, chan=27140] Channel closed
        [conn=2, chan=27141] Requesting new SSH session
        [conn=2, chan=27141] Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 -- raw list --format json
        [conn=3, chan=27217] Received exit status 0
        [conn=3, chan=27217] Received channel close
        [conn=3, chan=27217] Channel closed
        [conn=2, chan=27141] Received exit status 0
        [conn=2, chan=27141] Received channel close
        [conn=2, chan=27141] Channel closed
[WRN] DEGRADED_STRETCH_MODE: We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
[WRN] MON_DOWN: 2/5 mons down, quorum ceph-pdhiran-22ki39-node6,ceph-pdhiran-22ki39-node5,ceph-pdhiran-22ki39-node1-installer
    mon.ceph-pdhiran-22ki39-node2 (rank 0) addr [v2:10.0.211.45:3300/0,v1:10.0.211.45:6789/0] is down (out of quorum)
    mon.ceph-pdhiran-22ki39-node3 (rank 2) addr [v2:10.0.209.24:3300/0,v1:10.0.209.24:6789/0] is down (out of quorum)
[WRN] OSD_DATACENTER_DOWN: 1 datacenter (12 osds) down
    datacenter DC1 (root=default) (12 osds) is down
[WRN] OSD_DOWN: 12 osds down
    osd.0 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node4) is down
    osd.1 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node3) is down
    osd.2 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node2) is down
    osd.6 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node3) is down
    osd.7 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node4) is down
    osd.8 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node2) is down
    osd.12 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node3) is down
    osd.13 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node4) is down
    osd.15 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node2) is down
    osd.18 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node3) is down
    osd.19 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node4) is down
    osd.21 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node2) is down
[WRN] OSD_HOST_DOWN: 3 hosts (12 osds) down
    host ceph-pdhiran-22ki39-node2 (root=default,datacenter=DC1) (4 osds) is down
    host ceph-pdhiran-22ki39-node3 (root=default,datacenter=DC1) (4 osds) is down
    host ceph-pdhiran-22ki39-node4 (root=default,datacenter=DC1) (4 osds) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 46023/236072 objects degraded (19.495%), 122 pgs degraded, 238 pgs undersized
    pg 2.14 is stuck undersized for 7m, current state active+undersized+degraded, last acting [14,22,6]
    pg 2.16 is stuck undersized for 7m, current state active+undersized+degraded, last acting [9,22,21]
    pg 2.19 is stuck undersized for 7m, current state active+undersized+degraded, last acting [11,14,6]
    pg 2.1a is stuck undersized for 7m, current state active+undersized+degraded, last acting [3,23,8]
    pg 3.14 is stuck undersized for 7m, current state active+undersized+degraded, last acting [14,10,8]
    pg 3.15 is stuck undersized for 7m, current state active+undersized+degraded, last acting [10,23,18]
    pg 3.16 is stuck undersized for 7m, current state active+undersized+degraded, last acting [9,22,8]
    pg 3.18 is stuck undersized for 7m, current state active+undersized+degraded, last acting [10,11,21]
    pg 3.19 is stuck undersized for 7m, current state active+undersized+degraded, last acting [11,4,8]
    pg 3.1a is stuck undersized for 7m, current state active+undersized+degraded, last acting [4,17,1]
    pg 3.1b is stuck undersized for 7m, current state active+undersized+degraded, last acting [10,11,15]
    pg 4.10 is stuck undersized for 7m, current state active+undersized, last acting [9,23,1]
    pg 4.11 is stuck undersized for 7m, current state active+undersized, last acting [3,5,12]
    pg 4.13 is stuck undersized for 7m, current state active+undersized, last acting [9,4,6]
    pg 4.1c is stuck undersized for 7m, current state active+undersized, last acting [4,3,1]
    pg 4.1e is stuck undersized for 7m, current state active+undersized, last acting [14,4,18]
    pg 4.1f is stuck undersized for 7m, current state active+undersized, last acting [17,16,18]
    pg 5.10 is stuck undersized for 7m, current state active+undersized+degraded, last acting [22,14,15]
    pg 5.12 is stuck undersized for 7m, current state active+undersized+degraded, last acting [3,22,1]
    pg 5.13 is stuck undersized for 7m, current state active+undersized+degraded, last acting [4,5,15]
    pg 5.1c is stuck undersized for 7m, current state active+undersized+degraded, last acting [22,14,6]
    pg 5.1d is stuck undersized for 7m, current state active+undersized+degraded, last acting [4,23,6]
    pg 5.1e is stuck undersized for 7m, current state active+undersized+degraded, last acting [22,11,8]
    pg 5.1f is stuck undersized for 7m, current state active+undersized+degraded, last acting [23,10,6]
    pg 6.1d is stuck undersized for 7m, current state active+undersized, last acting [4,9,6]
    pg 6.1e is stuck undersized for 7m, current state active+undersized, last acting [5,10,8]
    pg 6.1f is stuck undersized for 7m, current state active+undersized+degraded, last acting [11,9,6]
    pg 7.10 is stuck undersized for 7m, current state active+undersized, last acting [3,10,6]
    pg 7.12 is stuck undersized for 7m, current state active+undersized, last acting [10,11,6]
    pg 7.13 is stuck undersized for 7m, current state active+undersized+degraded, last acting [23,16,15]
    pg 7.1d is stuck undersized for 7m, current state active+undersized+degraded, last acting [5,16,18]
    pg 7.1e is stuck undersized for 7m, current state active+undersized+degraded, last acting [22,11,18]
    pg 8.10 is stuck undersized for 7m, current state active+undersized, last acting [16,20,6]
    pg 8.11 is stuck undersized for 7m, current state active+undersized, last acting [10,3,8]
    pg 8.12 is stuck undersized for 7m, current state active+undersized, last acting [11,22,1]
    pg 8.1d is stuck undersized for 7m, current state active+undersized, last acting [23,9,15]
    pg 8.1e is stuck undersized for 7m, current state active+undersized, last acting [16,11,1]
    pg 8.1f is stuck undersized for 7m, current state active+undersized, last acting [17,14,2]
    pg 10.10 is stuck undersized for 7m, current state active+undersized+degraded, last acting [22,9,1]
    pg 10.11 is stuck undersized for 7m, current state active+undersized+degraded, last acting [5,22,8]
    pg 10.1d is stuck undersized for 7m, current state active+undersized+degraded, last acting [3,23,8]
    pg 10.1f is stuck undersized for 7m, current state active+undersized+degraded, last acting [3,4,21]
    pg 11.11 is stuck undersized for 7m, current state active+undersized, last acting [3,11,1]
    pg 11.13 is stuck undersized for 7m, current state active+undersized, last acting [14,23,8]
    pg 11.1c is stuck undersized for 7m, current state active+undersized+degraded, last acting [9,23,21]
    pg 11.1e is stuck undersized for 7m, current state active+undersized, last acting [22,3,2]
    pg 12.14 is stuck undersized for 7m, current state active+undersized, last acting [17,16,18]
    pg 12.15 is stuck undersized for 7m, current state active+undersized, last acting [5,22,2]
    pg 12.19 is stuck undersized for 7m, current state active+undersized, last acting [9,16,18]
    pg 12.1a is stuck undersized for 7m, current state active+undersized, last acting [23,9,8]
    pg 12.1b is stuck undersized for 7m, current state active+undersized, last acting [20,10,21]
[WRN] SLOW_OPS: 4048 slow ops, oldest one blocked for 418 sec, daemons [mon.ceph-pdhiran-22ki39-node2,mon.ceph-pdhiran-22ki39-node3,mon.ceph-pdhiran-22ki39-node6] have slow ops.

5. However, IOs hang and other ceph commands do not complete and remain stuck. This is not expected.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
As mentioned above

Actual results:
The cluster becomes unresponsive during netsplits.

Expected results:
The cluster should continue to work as expected during netsplit scenarios.

Additional info:
[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.14/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/metro-dr-solution#hardware_requirements

1. Is the cluster getting stuck expected behaviour?
2. We observed that when a Mgr daemon exists on the arbiter site, no issues were seen and the cluster worked as expected during netsplit scenarios. Is a Mgr daemon necessary on the arbiter node? If yes, this needs to be documented in the daemon placement section of the above guide. (A hedged sketch of this placement check and workaround follows below.)
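Regarding question 2, the following is a minimal sketch of how the mgr placement could be checked and, as a possible workaround, extended to include the arbiter node. The host names are taken from the placement table above; whether pinning a mgr to the arbiter is the supported fix is an assumption of this sketch, not something confirmed in this report.

#!/usr/bin/env bash
# Hypothetical check/workaround sketch: verify where mgr daemons run and,
# if none is placed on the arbiter, add the arbiter host to the mgr placement.
# Run from a host with the admin keyring (e.g. an _admin labelled node).
set -euo pipefail

# Arbiter/tiebreaker node from the placement table above.
ARBITER_HOST="ceph-pdhiran-22ki39-node1-installer"

# Show the current mgr daemons and the hosts they run on.
ceph orch ps --daemon_type mgr

# If no mgr is listed on the arbiter, explicitly include it in the placement.
# This pins mgr daemons to the named hosts (arbiter plus one host per data site).
ceph orch apply mgr --placement="${ARBITER_HOST} ceph-pdhiran-22ki39-node2 ceph-pdhiran-22ki39-node5"

Note that during an actual netsplit the orchestrator itself may be unable to apply changes, so any such placement would need to be in place before the split; that caveat is inferred from the behaviour described above.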
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:10216