Description of problem (please be as detailed as possible and provide log snippets):

mon pod scaledown is skipped if the mons are portable

Version of all relevant components (if applicable):
OCP 4.15 and ODF 4.15.0-150

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install OCP 4.15 and ODF 4.15.0-150 on a 6 worker node and 6 failure domain cluster on vsphere
2. Update the mon count to 5 and then change it back to 3 from the storagecluster ('monCount' attribute); an example patch is included at the end of this description
3. The monCount value is updated in the storagecluster and cephcluster, but five mons still exist

Actual results:
Even after scaling the mon count down to 3, five mon pods keep running

Expected results:
After scaling the mon count down to three, only three mon pods should be running

Additional info:

storagecluster CR:
  spec:
    arbiter: {}
    enableCephTools: true
    encryption:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster:
        monCount: 3
      cephConfig: {}

cephcluster CR:
      name: balancer
  mon:
    count: 3
    volumeClaimTemplate:

rook ceph operator log:
2024-02-28 14:40:10.746033 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"
2024-02-28 14:40:10.796538 I | ceph-cluster-controller: reporting cluster telemetry
2024-02-28 14:40:10.804990 I | ceph-cluster-controller: enabling ceph mon monitoring goroutine for cluster "openshift-storage"
2024-02-28 14:40:16.823208 I | ceph-cluster-controller: reporting node telemetry
2024-02-28 14:40:56.290615 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:40:56.290662 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:40:56.290666 I | op-mon: did not identify a mon to remove
2024-02-28 14:41:41.744893 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:41:41.744951 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:41:41.744954 I | op-mon: did not identify a mon to remove
2024-02-28 14:42:27.196955 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:42:27.196997 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:42:27.197000 I | op-mon: did not identify a mon to remove
2024-02-28 14:43:12.623301 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:43:12.623450 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:43:12.623470 I | op-mon: did not identify a mon to remove
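For reference, the monCount change in step 2 can be applied with a patch like the one below. This is only a sketch: the StorageCluster name "ocs-storagecluster" is an assumption (the ODF default), so adjust the name and namespace to match the cluster.

# set monCount to 5 (StorageCluster name "ocs-storagecluster" is assumed)
$ oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
    -p '{"spec":{"managedResources":{"cephCluster":{"monCount":5}}}}'

# change it back to 3 the same way
$ oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
    -p '{"spec":{"managedResources":{"cephCluster":{"monCount":3}}}}'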
Moving to 4.16, not a blocker.
Based on the previous comment/description, this seems like a Rook-specific issue; transferring it to Rook.
Verified with OCP 4.16.0-0.nightly-2024-05-08-222442 and ODF 4.16.0-96

Verification steps:
1. Installed OCP 4.16 and ODF 4.16.0-96 on a 6 worker node and 6 failure domain cluster on vsphere
2. Updated the mon count to 5 and then changed it back to 3 from the storagecluster ('monCount' attribute)
3. The monCount value is updated in the storagecluster and cephcluster, and the mon pod count is reduced to three

storagecluster yaml:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster:
        monCount: 5

[jopinto@jopinto 5mbug]$ oc get pods -n openshift-storage | grep mon
rook-ceph-mon-a-5cdd784484-zp6bl   2/2   Running   0   35m
rook-ceph-mon-b-5fdd68b844-4lb44   2/2   Running   0   34m
rook-ceph-mon-c-5f55dfb6bb-ch9ld   2/2   Running   0   34m
rook-ceph-mon-d-65ddfd5556-rlbdc   2/2   Running   0   8m38s
rook-ceph-mon-e-58d8475cd8-ndxdg   2/2   Running   0   8m18s

storagecluster yaml:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster:
        monCount: 3

[jopinto@jopinto 5mbug]$ oc get pods -n openshift-storage | grep mon
rook-ceph-mon-c-5f55dfb6bb-ch9ld   2/2   Running   0   41m
rook-ceph-mon-d-65ddfd5556-rlbdc   2/2   Running   0   15m
rook-ceph-mon-e-58d8475cd8-ndxdg   2/2   Running   0   15m

sh-5.1$ ceph health
HEALTH_OK
sh-5.1$

Also, upon changing the monCount back to three, the CephMonLowNumber alert is triggered, which is expected.
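For reference, the remaining mon pods and the quorum can also be checked with commands like the following. This is a sketch assuming the standard Rook mon pod label "app=rook-ceph-mon" and the default toolbox deployment name "rook-ceph-tools" (available because enableCephTools is true in the storagecluster).

# list only the mon pods (label assumed from standard Rook conventions)
$ oc get pods -n openshift-storage -l app=rook-ceph-mon

# confirm quorum size and overall health from the toolbox deployment (name assumed)
$ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph quorum_status --format json-pretty
$ oc rsh -n openshift-storage deploy/rook-ceph-tools ceph health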
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591