Bug 2249962 - [Stretch Mode] Cluster unresponsive and commands are stuck during Netsplit scenario b/w the two data sites
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.0
Assignee: Kamoltat (Junior) Sirivadhna
QA Contact: Pawan
Docs Contact: Akash Raj
Blocks: 2267614 2298578 2298579 2310114 2310115 2317218
 
Reported: 2023-11-16 04:29 UTC by Pawan
Modified: 2024-11-25 09:00 UTC
CC: 14 users

Fixed In Version: ceph-19.1.1-3.el9cp
Doc Type: Bug Fix
Doc Text:
.Monitors no longer get stuck in elections during crash or shutdown tests
Previously, the `disallowed_leaders` attribute of the MonitorMap was conditionally filled only when entering `stretch_mode`. However, there were instances wherein monitors that had just been revived would not enter `stretch_mode` right away because they would still be in a `probing` state. This led to a mismatch in the `disallowed_leaders` set between the monitors across the cluster. Due to this, the monitors would fail to elect a leader, the election would be stuck, and Ceph would become unresponsive. With this fix, monitors do not have to be in `stretch_mode` to fill the `disallowed_leaders` attribute, and they no longer get stuck in elections during crash or shutdown tests.
Clones: 2310114 2310115
Last Closed: 2024-11-25 09:00:02 UTC
Links
- GitHub ceph/ceph pull 54003 (Merged): reef: src/mon/Monitor: Fix set_elector_disallowed_leaders (last updated 2023-12-14 21:15:40 UTC)
- Red Hat Issue Tracker RHCEPH-7921 (last updated 2023-11-16 04:30:05 UTC)
- Red Hat Product Errata RHBA-2024:10216 (last updated 2024-11-25 09:00:21 UTC)
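
For context on the fix described in the Doc Text and in the GitHub pull request above: the `disallowed_leaders` set lives in the monmap, so whether the monitors agree on it can be inspected from the CLI. A minimal sketch, assuming the field names printed by recent Ceph releases (adjust if your build differs):

# Cluster-wide monmap view, including the election strategy, the
# tiebreaker mon, and (when populated) the disallowed_leaders set:
ceph mon dump | grep -Ei 'election|tiebreaker|disallowed'

# Per-monitor view, useful during a netsplit when the in-memory sets
# can diverge; <name> is a mon name from 'ceph -s':
ceph tell mon.<name> mon_status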

Description Pawan 2023-11-16 04:29:39 UTC
Description of problem:

The cluster is unresponsive and ceph commands are stuck when a netsplit scenario is created between the two data sites.

Workflow:
1. Deploy an RHCS stretch cluster with the daemon placement below.
HOST                                 ADDR          LABELS                                                    STATUS
ceph-pdhiran-22ki39-node1-installer (Arbiter)  10.0.211.122  _admin,alertmanager,grafana,installer,prometheus,mon
ceph-pdhiran-22ki39-node2 (DC1)                10.0.211.45   _admin,mgr,mon,mds,osd
ceph-pdhiran-22ki39-node3 (DC1)                10.0.209.24   nfs,osd,mgr,mon
ceph-pdhiran-22ki39-node4 (DC1)                10.0.211.12   _admin,osd,rgw
ceph-pdhiran-22ki39-node5 (DC2)                10.0.208.75   _admin,osd,mgr,mon                                        Offline
ceph-pdhiran-22ki39-node6 (DC2)                10.0.206.215  mon,osd,rgw,mgr                                           Offline
ceph-pdhiran-22ki39-node7 (DC2)                10.0.207.70   nfs,osd,mds                                               Offline

Node1 is the arbiter node, hosting the monitoring stack components and the single tiebreaker mon; the other nodes host the remaining daemons along with the OSDs.
This daemon placement is in accordance with the daemon placement suggested in the MDR guide [1]; a sketch of the corresponding stretch-mode setup follows.
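
For reference, a stretch cluster of this shape is typically enabled along the following lines. This is a minimal sketch using the documented stretch-mode commands; the CRUSH rule name stretch_rule is an assumption about this deployment:

# Pin each monitor to its CRUSH location; the arbiter mon becomes the tiebreaker.
ceph mon set_location ceph-pdhiran-22ki39-node2 datacenter=DC1
ceph mon set_location ceph-pdhiran-22ki39-node3 datacenter=DC1
ceph mon set_location ceph-pdhiran-22ki39-node5 datacenter=DC2
ceph mon set_location ceph-pdhiran-22ki39-node6 datacenter=DC2
ceph mon set_location ceph-pdhiran-22ki39-node1-installer datacenter=arbiter

# Switch to the connectivity election strategy and enter stretch mode,
# naming the tiebreaker mon and 'datacenter' as the dividing bucket type:
ceph mon set election_strategy connectivity
ceph mon enable_stretch_mode ceph-pdhiran-22ki39-node1-installer stretch_rule datacenter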

2. Run workloads and fill up the pools.
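
Any sustained client workload works for this step; for example, a minimal sketch using rados bench (the pool name test_pool is a placeholder for whichever pools the test filled):

# Placeholder pool; create it (or use an existing one), then write
# objects for 120 seconds and keep them so the pool stays filled.
ceph osd pool create test_pool
rados bench -p test_pool 120 write --no-cleanup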

3. Introduce the netsplit scenario by adding iptables rules on the DC1 hosts, blocking all incoming and outgoing traffic to and from DC2, using the commands below.

[root@ceph-pdhiran-22ki39-node4 ~]# iptables -A INPUT -s 10.0.208.75  -j DROP; iptables -A OUTPUT -d 10.0.208.75 -j DROP
[root@ceph-pdhiran-22ki39-node4 ~]# iptables -A INPUT -s 10.0.206.215  -j DROP; iptables -A OUTPUT -d 10.0.206.215 -j DROP
[root@ceph-pdhiran-22ki39-node4 ~]# iptables -A INPUT -s 10.0.207.70  -j DROP; iptables -A OUTPUT -d 10.0.207.70 -j DROP


With the above rules added on all DC1 hosts, all incoming and outgoing traffic between DC1 and DC2 is dropped.

Note: No iptables rules are added on the arbiter node. The arbiter node is able to communicate with both data sites; only communication between DC1 and DC2 is dropped. (See the sketch below.)
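
For reference, a sketch that applies the same rules for all DC2 addresses on a DC1 host, and deletes them again to heal the netsplit (addresses taken from the host list above):

#!/bin/bash
# Run on every DC1 host. DC2 addresses are from 'ceph orch host ls' above.
DC2_IPS="10.0.208.75 10.0.206.215 10.0.207.70"
for ip in $DC2_IPS; do
    iptables -A INPUT  -s "$ip" -j DROP
    iptables -A OUTPUT -d "$ip" -j DROP
done
# To heal the netsplit later, delete the same rules:
#   for ip in $DC2_IPS; do
#       iptables -D INPUT  -s "$ip" -j DROP
#       iptables -D OUTPUT -d "$ip" -j DROP
#   done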

4. Observe that the ceph status shows one site down, as expected.
# ceph -s
  cluster:
    id:     d1baa022-68e9-11ee-8703-fa163ee116f5
    health: HEALTH_WARN
            Failed to apply 1 service(s): osd.all-available-devices
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum ceph-pdhiran-22ki39-node6,ceph-pdhiran-22ki39-node5,ceph-pdhiran-22ki39-node1-installer
            1 datacenter (12 osds) down
            12 osds down
            3 hosts (12 osds) down
            Degraded data redundancy: 46023/236072 objects degraded (19.495%), 122 pgs degraded, 238 pgs undersized
            794 slow ops, oldest one blocked for 193 sec, daemons [mon.ceph-pdhiran-22ki39-node2,mon.ceph-pdhiran-22ki39-node3,mon.ceph-pdhiran-22ki39-node6] have slow ops.

  services:
    mon: 5 daemons, quorum ceph-pdhiran-22ki39-node6,ceph-pdhiran-22ki39-node5,ceph-pdhiran-22ki39-node1-installer (age 61s), out of quorum: ceph-pdhiran-22ki39-node2, ceph-pdhiran-22ki39-node3
    mgr: ceph-pdhiran-22ki39-node2.fsqnfn(active, since 27h), standbys: ceph-pdhiran-22ki39-node3.tvqgjm, ceph-pdhiran-22ki39-node5.tabrcd, ceph-pdhiran-22ki39-node6.gulgol, ceph-pdhiran-22ki39-node1-installer.shjsea
    mds: 1/1 daemons up, 1 standby
    osd: 24 osds: 12 up (since 72s), 24 in (since 26h)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 337 pgs
    objects: 59.02k objects, 1.4 GiB
    usage:   10 GiB used, 590 GiB / 600 GiB avail
    pgs:     46023/236072 objects degraded (19.495%)
             122 active+undersized+degraded
             116 active+undersized
             99  active+clean

[root@ceph-pdhiran-22ki39-node8 ~]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                               STATUS  REWEIGHT  PRI-AFF
-1         0.58658  root default
-3         0.29358      datacenter DC1
-2         0.09799          host ceph-pdhiran-22ki39-node2
 2    hdd  0.02399              osd.2                         down   1.00000  1.00000
 8    hdd  0.02399              osd.8                         down   1.00000  1.00000
15    hdd  0.02399              osd.15                        down   1.00000  1.00000
21    hdd  0.02399              osd.21                        down   1.00000  1.00000
-4         0.09799          host ceph-pdhiran-22ki39-node3
 1    hdd  0.02399              osd.1                         down   1.00000  1.00000
 6    hdd  0.02399              osd.6                         down   1.00000  1.00000
12    hdd  0.02399              osd.12                        down   1.00000  1.00000
18    hdd  0.02399              osd.18                        down   1.00000  1.00000
-5         0.09760          host ceph-pdhiran-22ki39-node4
 0    hdd  0.02440              osd.0                         down   1.00000  1.00000
 7    hdd  0.02440              osd.7                         down   1.00000  1.00000
13    hdd  0.02440              osd.13                        down   1.00000  1.00000
19    hdd  0.02440              osd.19                        down   1.00000  1.00000
-7         0.29300      datacenter DC2
-6         0.09799          host ceph-pdhiran-22ki39-node5
 5    hdd  0.02399              osd.5                           up   1.00000  1.00000
11    hdd  0.02399              osd.11                          up   1.00000  1.00000
17    hdd  0.02399              osd.17                          up   1.00000  1.00000
23    hdd  0.02399              osd.23                          up   1.00000  1.00000
-8         0.09799          host ceph-pdhiran-22ki39-node6
 3    hdd  0.02399              osd.3                           up   1.00000  1.00000
 9    hdd  0.02399              osd.9                           up   1.00000  1.00000
14    hdd  0.02399              osd.14                          up   1.00000  1.00000
20    hdd  0.02399              osd.20                          up   1.00000  1.00000
-9         0.09799          host ceph-pdhiran-22ki39-node7
 4    hdd  0.02399              osd.4                           up   1.00000  1.00000
10    hdd  0.02399              osd.10                          up   1.00000  1.00000
16    hdd  0.02399              osd.16                          up   1.00000  1.00000
22    hdd  0.02399              osd.22                          up   1.00000  1.00000

[root@ceph-pdhiran-22ki39-node8 ~]# ceph orch host ls
HOST                                 ADDR          LABELS                                                    STATUS
ceph-pdhiran-22ki39-node1-installer  10.0.211.122  _admin,alertmanager,grafana,installer,prometheus,mon,mgr
ceph-pdhiran-22ki39-node2            10.0.211.45   _admin,mgr,mon,mds,osd
ceph-pdhiran-22ki39-node3            10.0.209.24   nfs,osd,mgr,mon
ceph-pdhiran-22ki39-node4            10.0.211.12   _admin,osd,rgw
ceph-pdhiran-22ki39-node5            10.0.208.75   _admin,osd,mgr,mon                                        Offline
ceph-pdhiran-22ki39-node6            10.0.206.215  mon,osd,rgw,mgr                                           Offline
ceph-pdhiran-22ki39-node7            10.0.207.70   nfs,osd,mds                                               Offline
7 hosts in cluster

[root@ceph-pdhiran-22ki39-node8 ~]# ceph health detail
HEALTH_WARN Failed to apply 1 service(s): osd.all-available-devices; We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer; 2/5 mons down, quorum ceph-pdhiran-22ki39-node6,ceph-pdhiran-22ki39-node5,ceph-pdhiran-22ki39-node1-installer; 1 datacenter (12 osds) down; 12 osds down; 3 hosts (12 osds) down; Degraded data redundancy: 46023/236072 objects degraded (19.495%), 122 pgs degraded, 238 pgs undersized; 4048 slow ops, oldest one blocked for 418 sec, daemons [mon.ceph-pdhiran-22ki39-node2,mon.ceph-pdhiran-22ki39-node3,mon.ceph-pdhiran-22ki39-node6] have slow ops.
[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s): osd.all-available-devices
    osd.all-available-devices: Failed to connect to ceph-pdhiran-22ki39-node5 (10.0.208.75): TimeoutError()
Log: [conn=3, chan=27206]   Command: sudo true
[conn=2, chan=27130]   Command: sudo true
[conn=2, chan=27130] Received exit status 0
[conn=2, chan=27130] Received channel close
[conn=2, chan=27130] Channel closed
[conn=2, chan=27131] Requesting new SSH session
[conn=2, chan=27131]   Command: sudo which python3
[conn=3, chan=27206] Received exit status 0
[conn=3, chan=27206] Received channel close
[conn=3, chan=27206] Channel closed
[conn=3, chan=27207] Requesting new SSH session
[conn=3, chan=27207]   Command: sudo which python3
[conn=2, chan=27131] Received exit status 0
[conn=2, chan=27131] Received channel close
[conn=2, chan=27131] Channel closed
[conn=2, chan=27132] Requesting new SSH session
[conn=2, chan=27132]   Command: sudo true
[conn=3, chan=27207] Received exit status 0
[conn=3, chan=27207] Received channel close
[conn=3, chan=27207] Channel closed
[conn=3, chan=27208] Requesting new SSH session
[conn=3, chan=27208]   Command: sudo true
[conn=2, chan=27132] Received exit status 0
[conn=2, chan=27132] Received channel close
[conn=2, chan=27132] Channel closed
[conn=2, chan=27133] Requesting new SSH session
[conn=2, chan=27133]   Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --env CEPH_VOLUME_OSDSPEC_AFFINITY=all-available-devices --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 --config-json - -- lvm batch --no-auto /dev/vdb /dev/vdc /dev/vdd /dev/vde --yes --no-systemd
[conn=3, chan=27208] Received exit status 0
[conn=3, chan=27208] Received channel close
[conn=3, chan=27208] Channel closed
[conn=3, chan=27209] Requesting new SSH session
[conn=3, chan=27209]   Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --env CEPH_VOLUME_OSDSPEC_AFFINITY=all-available-devices --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 --config-json - -- lvm batch --no-auto /dev/vdb /dev/vdc /dev/vdd /dev/vde --yes --no-systemd
Opening SSH connection to 10.0.208.75, port 22
Opening SSH connection to 10.0.206.215, port 22
Opening SSH connection to 10.0.207.70, port 22
[conn=3, chan=27209] Received exit status 0
[conn=3, chan=27209] Received channel close
[conn=3, chan=27209] Channel closed
[conn=3, chan=27210] Requesting new SSH session
[conn=2, chan=27133] Received exit status 0
[conn=2, chan=27133] Received channel close
[conn=2, chan=27133] Channel closed
[conn=3, chan=27210]   Command: sudo true
[conn=2, chan=27134] Requesting new SSH session
[conn=2, chan=27134]   Command: sudo true
[conn=3, chan=27210] Received exit status 0
[conn=3, chan=27210] Received channel close
[conn=3, chan=27210] Channel closed
[conn=3, chan=27211] Requesting new SSH session
[conn=3, chan=27211]   Command: sudo which python3
[conn=2, chan=27134] Received exit status 0
[conn=2, chan=27134] Received channel close
[conn=2, chan=27134] Channel closed
[conn=2, chan=27135] Requesting new SSH session
[conn=2, chan=27135]   Command: sudo which python3
[conn=3, chan=27211] Received exit status 0
[conn=3, chan=27211] Received channel close
[conn=3, chan=27211] Channel closed
[conn=3, chan=27212] Requesting new SSH session
[conn=2, chan=27135] Received exit status 0
[conn=2, chan=27135] Received channel close
[conn=2, chan=27135] Channel closed
[conn=3, chan=27212]   Command: sudo true
[conn=2, chan=27136] Requesting new SSH session
[conn=2, chan=27136]   Command: sudo true
[conn=3, chan=27212] Received exit status 0
[conn=3, chan=27212] Received channel close
[conn=3, chan=27212] Channel closed
[conn=3, chan=27213] Requesting new SSH session
[conn=2, chan=27136] Received exit status 0
[conn=2, chan=27136] Received channel close
[conn=2, chan=27136] Channel closed
[conn=3, chan=27213]   Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 -- lvm list --format json
[conn=2, chan=27137] Requesting new SSH session
[conn=2, chan=27137]   Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 -- lvm list --format json
[conn=3, chan=27213] Received exit status 0
[conn=3, chan=27213] Received channel close
[conn=3, chan=27213] Channel closed
[conn=3, chan=27214] Requesting new SSH session
[conn=3, chan=27214]   Command: sudo true
[conn=3, chan=27214] Received exit status 0
[conn=3, chan=27214] Received channel close
[conn=3, chan=27214] Channel closed
[conn=3, chan=27215] Requesting new SSH session
[conn=3, chan=27215]   Command: sudo which python3
[conn=3, chan=27215] Received exit status 0
[conn=3, chan=27215] Received channel close
[conn=3, chan=27215] Channel closed
[conn=3, chan=27216] Requesting new SSH session
[conn=3, chan=27216]   Command: sudo true
[conn=3, chan=27216] Received exit status 0
[conn=3, chan=27216] Received channel close
[conn=3, chan=27216] Channel closed
[conn=3, chan=27217] Requesting new SSH session
[conn=3, chan=27217]   Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 -- raw list --format json
[conn=2, chan=27137] Received exit status 0
[conn=2, chan=27137] Received channel close
[conn=2, chan=27137] Channel closed
[conn=2, chan=27138] Requesting new SSH session
[conn=2, chan=27138]   Command: sudo true
[conn=2, chan=27138] Received exit status 0
[conn=2, chan=27138] Received channel close
[conn=2, chan=27138] Channel closed
[conn=2, chan=27139] Requesting new SSH session
[conn=2, chan=27139]   Command: sudo which python3
[conn=2, chan=27139] Received exit status 0
[conn=2, chan=27139] Received channel close
[conn=2, chan=27139] Channel closed
[conn=2, chan=27140] Requesting new SSH session
[conn=2, chan=27140]   Command: sudo true
[conn=2, chan=27140] Received exit status 0
[conn=2, chan=27140] Received channel close
[conn=2, chan=27140] Channel closed
[conn=2, chan=27141] Requesting new SSH session
[conn=2, chan=27141]   Command: sudo /bin/python3 /var/lib/ceph/d1baa022-68e9-11ee-8703-fa163ee116f5/cephadm.e05194e09bd9a5a250c63b8f4e2484c62024f0e7358420e5e8d3824045ab1a14 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:096b1f24b40b4fca8f3c3ace7085e94e54a8b347226ee866d000c4e67cb3da2e --timeout 895 ceph-volume --fsid d1baa022-68e9-11ee-8703-fa163ee116f5 -- raw list --format json
[conn=3, chan=27217] Received exit status 0
[conn=3, chan=27217] Received channel close
[conn=3, chan=27217] Channel closed
[conn=2, chan=27141] Received exit status 0
[conn=2, chan=27141] Received channel close
[conn=2, chan=27141] Channel closed

[WRN] DEGRADED_STRETCH_MODE: We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
[WRN] MON_DOWN: 2/5 mons down, quorum ceph-pdhiran-22ki39-node6,ceph-pdhiran-22ki39-node5,ceph-pdhiran-22ki39-node1-installer
    mon.ceph-pdhiran-22ki39-node2 (rank 0) addr [v2:10.0.211.45:3300/0,v1:10.0.211.45:6789/0] is down (out of quorum)
    mon.ceph-pdhiran-22ki39-node3 (rank 2) addr [v2:10.0.209.24:3300/0,v1:10.0.209.24:6789/0] is down (out of quorum)
[WRN] OSD_DATACENTER_DOWN: 1 datacenter (12 osds) down
    datacenter DC1 (root=default) (12 osds) is down
[WRN] OSD_DOWN: 12 osds down
    osd.0 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node4) is down
    osd.1 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node3) is down
    osd.2 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node2) is down
    osd.6 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node3) is down
    osd.7 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node4) is down
    osd.8 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node2) is down
    osd.12 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node3) is down
    osd.13 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node4) is down
    osd.15 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node2) is down
    osd.18 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node3) is down
    osd.19 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node4) is down
    osd.21 (root=default,datacenter=DC1,host=ceph-pdhiran-22ki39-node2) is down
[WRN] OSD_HOST_DOWN: 3 hosts (12 osds) down
    host ceph-pdhiran-22ki39-node2 (root=default,datacenter=DC1) (4 osds) is down
    host ceph-pdhiran-22ki39-node3 (root=default,datacenter=DC1) (4 osds) is down
    host ceph-pdhiran-22ki39-node4 (root=default,datacenter=DC1) (4 osds) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 46023/236072 objects degraded (19.495%), 122 pgs degraded, 238 pgs undersized
    pg 2.14 is stuck undersized for 7m, current state active+undersized+degraded, last acting [14,22,6]
    pg 2.16 is stuck undersized for 7m, current state active+undersized+degraded, last acting [9,22,21]
    pg 2.19 is stuck undersized for 7m, current state active+undersized+degraded, last acting [11,14,6]
    pg 2.1a is stuck undersized for 7m, current state active+undersized+degraded, last acting [3,23,8]
    pg 3.14 is stuck undersized for 7m, current state active+undersized+degraded, last acting [14,10,8]
    pg 3.15 is stuck undersized for 7m, current state active+undersized+degraded, last acting [10,23,18]
    pg 3.16 is stuck undersized for 7m, current state active+undersized+degraded, last acting [9,22,8]
    pg 3.18 is stuck undersized for 7m, current state active+undersized+degraded, last acting [10,11,21]
    pg 3.19 is stuck undersized for 7m, current state active+undersized+degraded, last acting [11,4,8]
    pg 3.1a is stuck undersized for 7m, current state active+undersized+degraded, last acting [4,17,1]
    pg 3.1b is stuck undersized for 7m, current state active+undersized+degraded, last acting [10,11,15]
    pg 4.10 is stuck undersized for 7m, current state active+undersized, last acting [9,23,1]
    pg 4.11 is stuck undersized for 7m, current state active+undersized, last acting [3,5,12]
    pg 4.13 is stuck undersized for 7m, current state active+undersized, last acting [9,4,6]
    pg 4.1c is stuck undersized for 7m, current state active+undersized, last acting [4,3,1]
    pg 4.1e is stuck undersized for 7m, current state active+undersized, last acting [14,4,18]
    pg 4.1f is stuck undersized for 7m, current state active+undersized, last acting [17,16,18]
    pg 5.10 is stuck undersized for 7m, current state active+undersized+degraded, last acting [22,14,15]
    pg 5.12 is stuck undersized for 7m, current state active+undersized+degraded, last acting [3,22,1]
    pg 5.13 is stuck undersized for 7m, current state active+undersized+degraded, last acting [4,5,15]
    pg 5.1c is stuck undersized for 7m, current state active+undersized+degraded, last acting [22,14,6]
    pg 5.1d is stuck undersized for 7m, current state active+undersized+degraded, last acting [4,23,6]
    pg 5.1e is stuck undersized for 7m, current state active+undersized+degraded, last acting [22,11,8]
    pg 5.1f is stuck undersized for 7m, current state active+undersized+degraded, last acting [23,10,6]
    pg 6.1d is stuck undersized for 7m, current state active+undersized, last acting [4,9,6]
    pg 6.1e is stuck undersized for 7m, current state active+undersized, last acting [5,10,8]
    pg 6.1f is stuck undersized for 7m, current state active+undersized+degraded, last acting [11,9,6]
    pg 7.10 is stuck undersized for 7m, current state active+undersized, last acting [3,10,6]
    pg 7.12 is stuck undersized for 7m, current state active+undersized, last acting [10,11,6]
    pg 7.13 is stuck undersized for 7m, current state active+undersized+degraded, last acting [23,16,15]
    pg 7.1d is stuck undersized for 7m, current state active+undersized+degraded, last acting [5,16,18]
    pg 7.1e is stuck undersized for 7m, current state active+undersized+degraded, last acting [22,11,18]
    pg 8.10 is stuck undersized for 7m, current state active+undersized, last acting [16,20,6]
    pg 8.11 is stuck undersized for 7m, current state active+undersized, last acting [10,3,8]
    pg 8.12 is stuck undersized for 7m, current state active+undersized, last acting [11,22,1]
    pg 8.1d is stuck undersized for 7m, current state active+undersized, last acting [23,9,15]
    pg 8.1e is stuck undersized for 7m, current state active+undersized, last acting [16,11,1]
    pg 8.1f is stuck undersized for 7m, current state active+undersized, last acting [17,14,2]
    pg 10.10 is stuck undersized for 7m, current state active+undersized+degraded, last acting [22,9,1]
    pg 10.11 is stuck undersized for 7m, current state active+undersized+degraded, last acting [5,22,8]
    pg 10.1d is stuck undersized for 7m, current state active+undersized+degraded, last acting [3,23,8]
    pg 10.1f is stuck undersized for 7m, current state active+undersized+degraded, last acting [3,4,21]
    pg 11.11 is stuck undersized for 7m, current state active+undersized, last acting [3,11,1]
    pg 11.13 is stuck undersized for 7m, current state active+undersized, last acting [14,23,8]
    pg 11.1c is stuck undersized for 7m, current state active+undersized+degraded, last acting [9,23,21]
    pg 11.1e is stuck undersized for 7m, current state active+undersized, last acting [22,3,2]
    pg 12.14 is stuck undersized for 7m, current state active+undersized, last acting [17,16,18]
    pg 12.15 is stuck undersized for 7m, current state active+undersized, last acting [5,22,2]
    pg 12.19 is stuck undersized for 7m, current state active+undersized, last acting [9,16,18]
    pg 12.1a is stuck undersized for 7m, current state active+undersized, last acting [23,9,8]
    pg 12.1b is stuck undersized for 7m, current state active+undersized, last acting [20,10,21]
[WRN] SLOW_OPS: 4048 slow ops, oldest one blocked for 418 sec, daemons [mon.ceph-pdhiran-22ki39-node2,mon.ceph-pdhiran-22ki39-node3,mon.ceph-pdhiran-22ki39-node6] have slow ops.


5. However, the IOs hang, and other ceph commands are not executed and remain stuck. This is not expected.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
As mentioned above

Actual results:
The cluster becomes unresponsive during netsplits.

Expected results:
The cluster should work as expected during netsplit scenarios.

Additional info:
[1]. https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.14/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/metro-dr-solution#hardware_requirements 


1. Is the cluster getting stuck expected behaviour?
2. We observed that when a Mgr daemon exists on the arbiter site, there were no issues and the cluster worked as expected in netsplit scenarios. Is a Mgr daemon necessary on the arbiter node? If yes, this needs to be documented in the above guide, in the daemon placement section. (See the placement sketch below.)
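
If a Mgr daemon on the arbiter site does turn out to be necessary, a minimal sketch of placing one there with cephadm (the label name mgr is the conventional choice, but it is an assumption about this deployment):

# Label the arbiter host and schedule mgr daemons by label:
ceph orch host label add ceph-pdhiran-22ki39-node1-installer mgr
ceph orch apply mgr 'label:mgr'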

Comment 56 errata-xmlrpc 2024-11-25 09:00:02 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:10216

