Description of problem (please be as detailed as possible and provide log snippets):

[vSphere]: On a fresh Arbiter deployment (3M + 6W), the PG state is not active+clean, which blocks creation of the default OSD PDB.

Version of all relevant components (if applicable):
ocs-registry:4.16.0-135

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
3/3

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Install an Arbiter deployment.
2. Check that all PDBs are created.

Actual results:

$ oc get pdb
NAME                                               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem    1               N/A               1                     7h54m
rook-ceph-mgr-pdb                                  N/A             1                 1                     7h52m
rook-ceph-mon-pdb                                  N/A             2                 2                     7h52m
rook-ceph-osd-zone-data-2                          N/A             0                 0                     7h51m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore   1               N/A               1                     7h54m

Expected results:
The default OSD PDB "rook-ceph-osd" should be present.

Additional info:
Sometimes we see two PDBs created for the OSDs.

> rook-ceph-operator log

2024-07-09 17:59:41.547696 I | clusterdisruption-controller: osd is down in failure domain "data-2". pg health: "cluster has no PGs"
2024-07-09 17:59:41.547749 I | clusterdisruption-controller: deleting the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
.
.
.
2024-07-09 17:59:48.866945 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down but no node drain is detected
2024-07-09 17:59:49.241674 I | clusterdisruption-controller: osd is down in failure domain "data-1". pg health: "cluster has no PGs"
2024-07-09 17:59:49.241775 I | clusterdisruption-controller: creating temporary blocking pdb "rook-ceph-osd-zone-data-2" with maxUnavailable=0 for "zone" failure domain "data-2"

> All OSDs are up

$ oc get pods | grep -i osd | egrep -v "Running|Completed"
$

> sh-5.1$ ceph -s
  cluster:
    id:     230697f0-ea45-4ceb-9d8a-ee9341ec19c5
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum a,b,c,d,e (age 8h)
    mgr: a(active, since 8h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 12 osds: 12 up (since 8h), 12 in (since 8h); 3 remapped pgs
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 61 pgs
    objects: 2.34k objects, 6.9 GiB
    usage:   28 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     3846/9356 objects misplaced (41.107%)
             58 active+clean
             3  active+clean+remapped

  io:
    client:   1023 B/s rd, 263 KiB/s wr, 1 op/s rd, 2 op/s wr
sh-5.1$

> sh-5.1$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME               STATUS  REWEIGHT  PRI-AFF
 -1         6.00000  root default
-10         3.00000      zone data-1
-13         1.00000          host compute-0
  3    ssd  0.50000              osd.3           up   1.00000  1.00000
  9    ssd  0.50000              osd.9           up   1.00000  1.00000
 -9         1.00000          host compute-2
  2    ssd  0.50000              osd.2           up   1.00000  1.00000
  7    ssd  0.50000              osd.7           up   1.00000  1.00000
-17         1.00000          host compute-4
 10    ssd  0.50000              osd.10          up   1.00000  1.00000
 11    ssd  0.50000              osd.11          up   1.00000  1.00000
 -4         3.00000      zone data-2
 -3         1.00000          host compute-1
  0    ssd  0.50000              osd.0           up   1.00000  1.00000
  6    ssd  0.50000              osd.6           up   1.00000  1.00000
 -7         1.00000          host compute-3
  1    ssd  0.50000              osd.1           up   1.00000  1.00000
  8    ssd  0.50000              osd.8           up   1.00000  1.00000
-15         1.00000          host compute-5
  4    ssd  0.50000              osd.4           up   1.00000  1.00000
  5    ssd  0.50000              osd.5           up   1.00000  1.00000
sh-5.1$

> job: https://url.corp.redhat.com/e4d0617
must gather: https://url.corp.redhat.com/7c182c1

> I tested the same scenario on the older build (4.16.0-120) and it works fine; no issue is seen with 4.16.0-120.
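
For anyone reproducing this, a minimal sketch of how to cross-check the PG state against the PDB reconciliation, assuming the default openshift-storage namespace, the toolbox pod labelled app=rook-ceph-tools, and the rook-ceph-operator deployment name (adjust to your environment):

# Check whether every PG is active+clean; per the log above, the
# clusterdisruption-controller keeps the zone-level blocking PDB until PGs are clean.
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage rsh "$TOOLS" ceph pg stat

# Re-list the PDBs; once PGs are clean, only the default "rook-ceph-osd" PDB
# is expected for the OSDs instead of rook-ceph-osd-zone-*.
oc -n openshift-storage get pdb

# Follow the controller's decisions in the operator log.
oc -n openshift-storage logs deploy/rook-ceph-operator | grep clusterdisruption-controller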
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591