Bug 2266621

Summary: mon pod scaledown is skipped if the mons are portable
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Joy John Pinto <jopinto>
Component: rook
Assignee: Subham Rai <srai>
Status: CLOSED ERRATA
QA Contact: Joy John Pinto <jopinto>
Severity: high
Priority: unspecified
Version: 4.15
CC: nladha, odf-bz-bot, srai, tnielsen
Target Milestone: ---
Target Release: ODF 4.16.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.16.0-89
Doc Type: No Doc Update
Last Closed: 2024-07-17 13:14:31 UTC
Type: Bug

Description Joy John Pinto 2024-02-28 15:12:26 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
mon pod scaledown is skipped if the mons are portable

Version of all relevant components (if applicable):
OCP 4.15 and ODF 4.15.0-150

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install OCP 4.15 and ODF 4.15.0-150 on a 6-worker-node, 6-failure-domain cluster on vSphere
2. Update the mon count to 5 and then change it back to 3 in the storagecluster CR (the 'monCount' attribute); see the sketch after these steps for one way to apply the change
3. The monCount value is updated in the storagecluster and cephcluster, but five mon pods continue to run
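
One way to apply step 2 is to patch the StorageCluster CR with 'oc patch' (a sketch; the CR name 'ocs-storagecluster' is the usual ODF default and is assumed here, since it is not shown in this report):

# scale the mon count up to 5 (assumed StorageCluster name)
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
  -p '{"spec":{"managedResources":{"cephCluster":{"monCount":5}}}}'

# then scale it back down to 3
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
  -p '{"spec":{"managedResources":{"cephCluster":{"monCount":3}}}}'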


Actual results:
Even after scaling the mon count down to 3, five mon pods keep running

Expected results:
Upon scaling the mon count down to three, only three mon pods should be running

Additional info:
storagecluster CR:
 spec:
    arbiter: {}
    enableCephTools: true
    encryption:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster:
        monCount: 3
      cephConfig: {}

cephcluster CR:
    mon:
      count: 3
      volumeClaimTemplate:

rook ceph operator log:
2024-02-28 14:40:10.746033 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"
2024-02-28 14:40:10.796538 I | ceph-cluster-controller: reporting cluster telemetry
2024-02-28 14:40:10.804990 I | ceph-cluster-controller: enabling ceph mon monitoring goroutine for cluster "openshift-storage"
2024-02-28 14:40:16.823208 I | ceph-cluster-controller: reporting node telemetry
2024-02-28 14:40:56.290615 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:40:56.290662 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:40:56.290666 I | op-mon: did not identify a mon to remove
2024-02-28 14:41:41.744893 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:41:41.744951 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:41:41.744954 I | op-mon: did not identify a mon to remove
2024-02-28 14:42:27.196955 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:42:27.196997 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:42:27.197000 I | op-mon: did not identify a mon to remove
2024-02-28 14:43:12.623301 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:43:12.623450 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:43:12.623470 I | op-mon: did not identify a mon to remove
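
The operator repeatedly tries to remove an extra mon but never picks one ("removing arbitrary extra mon \"\""). One way to inspect the mon-to-node mapping the operator works from is to read the rook-ceph-mon-endpoints configmap (a diagnostic sketch; it assumes the mapping is kept in that configmap's 'mapping' key, as in upstream Rook, and that portable mons are not pinned to a specific node):

# show which node, if any, each mon is mapped to
oc get configmap rook-ceph-mon-endpoints -n openshift-storage -o jsonpath='{.data.mapping}'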

Comment 2 Travis Nielsen 2024-02-28 18:06:27 UTC
Moving to 4.16, not a blocker.

Comment 4 Nikhil Ladha 2024-02-29 06:25:45 UTC
Based on the previous comment/description, this seems like a rook-specific issue; transferring it to rook.

Comment 11 Joy John Pinto 2024-05-14 07:30:09 UTC
Verified with OCP 4.16.0-0.nightly-2024-05-08-222442 and ODF 4.16.0-96

Verification steps:
1. Installed OCP 4.16 and ODF 4.16.0-96 on a 6-worker-node, 6-failure-domain cluster on vSphere
2. Updated the mon count to 5 and then changed it back to 3 in the storagecluster CR (the 'monCount' attribute)
3. The monCount value is updated in the storagecluster and cephcluster, and the mon pod count is reduced to three; one way to confirm the propagated value is shown below
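
A quick way to confirm the new count propagated to the CephCluster CR (a sketch; the CephCluster name 'ocs-storagecluster-cephcluster' is the usual ODF default and is assumed here):

# print the mon count the operator reconciles toward (assumed CephCluster name)
oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage -o jsonpath='{.spec.mon.count}'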

storagecluster yaml:
    kms: {}
  externalStorage: {}
  managedResources:
    cephBlockPools: {}
    cephCluster:
      monCount: 5

[jopinto@jopinto 5mbug]$ oc get pods -n openshift-storage | grep mon
rook-ceph-mon-a-5cdd784484-zp6bl                                  2/2     Running           0             35m
rook-ceph-mon-b-5fdd68b844-4lb44                                  2/2     Running           0             34m
rook-ceph-mon-c-5f55dfb6bb-ch9ld                                  2/2     Running           0             34m
rook-ceph-mon-d-65ddfd5556-rlbdc                                  2/2     Running           0             8m38s
rook-ceph-mon-e-58d8475cd8-ndxdg                                  2/2     Running           0             8m18s

storagecluster yaml:
    kms: {}
  externalStorage: {}
  managedResources:
    cephBlockPools: {}
    cephCluster:
      monCount: 3

[jopinto@jopinto 5mbug]$ oc get pods -n openshift-storage | grep mon
rook-ceph-mon-c-5f55dfb6bb-ch9ld                                  2/2     Running     0               41m
rook-ceph-mon-d-65ddfd5556-rlbdc                                  2/2     Running     0               15m
rook-ceph-mon-e-58d8475cd8-ndxdg                                  2/2     Running     0               15m


sh-5.1$ ceph health
HEALTH_OK
sh-5.1$ 


Also, upon changing monCount back to three, the CephMonLowNumber alert is triggered, which is expected.

Comment 15 errata-xmlrpc 2024-07-17 13:14:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591