Bug 2266621 - mon pod scaledown is skipped if the mons are portable
Summary: mon pod scaledown is skipped if the mons are portable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Subham Rai
QA Contact: Joy John Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-02-28 15:12 UTC by Joy John Pinto
Modified: 2024-07-17 13:14 UTC
CC List: 4 users

Fixed In Version: 4.16.0-89
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-17 13:14:31 UTC
Embargoed:




Links:
- GitHub red-hat-storage/rook pull 638 (open): Bug 2266621: mon: fix mon scaledown when mons are portable (last updated 2024-04-29 05:13:43 UTC)
- Red Hat Product Errata RHSA-2024:4591 (last updated 2024-07-17 13:14:34 UTC)

Description Joy John Pinto 2024-02-28 15:12:26 UTC
Description of problem (please be as detailed as possible and provide log snippets):
mon pod scaledown is skipped if the mons are portable

Version of all relevant components (if applicable):
OCP 4.15 and ODF 4.15.0-150

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install OCP 4.15 and ODF 4.15.0-150 on a 6-worker-node, 6-failure-domain cluster on vSphere
2. Update the mon count to 5 and then change it back to 3 via the storagecluster 'monCount' attribute (an example patch is shown after these steps)
3. The monCount value is updated in the storagecluster and cephcluster, but five mons still keep running
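
For reference, step 2 can be applied with a merge patch against the StorageCluster. This is only a sketch: the resource name "ocs-storagecluster" is the usual default and is assumed here, so adjust it if the cluster uses a different name.

# Scale the mon count up to 5 (assumes the default StorageCluster name "ocs-storagecluster")
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
  -p '{"spec":{"managedResources":{"cephCluster":{"monCount":5}}}}'

# Scale it back down to 3
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
  -p '{"spec":{"managedResources":{"cephCluster":{"monCount":3}}}}'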


Actual results:
Even after scaling the mon count down to 3, five mon pods keep running

Expected results:
Upon scaling the mon count down to three, only three mon pods should be running

Additional info:
storagecluster CR:
 spec:
    arbiter: {}
    enableCephTools: true
    encryption:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster:
        monCount: 3
      cephConfig: {}

cephcluster CR:
name: balancer
    mon:
      count: 3
      volumeClaimTemplate:

rook ceph operator log:
2024-02-28 14:40:10.746033 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"
2024-02-28 14:40:10.796538 I | ceph-cluster-controller: reporting cluster telemetry
2024-02-28 14:40:10.804990 I | ceph-cluster-controller: enabling ceph mon monitoring goroutine for cluster "openshift-storage"
2024-02-28 14:40:16.823208 I | ceph-cluster-controller: reporting node telemetry
2024-02-28 14:40:56.290615 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:40:56.290662 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:40:56.290666 I | op-mon: did not identify a mon to remove
2024-02-28 14:41:41.744893 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:41:41.744951 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:41:41.744954 I | op-mon: did not identify a mon to remove
2024-02-28 14:42:27.196955 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:42:27.196997 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:42:27.197000 I | op-mon: did not identify a mon to remove
2024-02-28 14:43:12.623301 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:43:12.623450 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:43:12.623470 I | op-mon: did not identify a mon to remove
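
The log above shows the operator repeatedly selecting an empty mon name ("") to remove when the mons are portable, so the scaledown is silently skipped. While the issue is present, the leftover mons can be inspected as follows; this is a diagnostic sketch that assumes the standard Rook conventions (the app=rook-ceph-mon label and the rook-ceph-mon-endpoints ConfigMap):

# List the mon deployments still tracked by the operator (five remain instead of three)
oc get deployments -n openshift-storage -l app=rook-ceph-mon

# Show the mon map the operator keeps in its endpoints ConfigMap
oc get configmap rook-ceph-mon-endpoints -n openshift-storage -o jsonpath='{.data.data}{"\n"}'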

Comment 2 Travis Nielsen 2024-02-28 18:06:27 UTC
Moving to 4.16, not a blocker.

Comment 4 Nikhil Ladha 2024-02-29 06:25:45 UTC
Based on the previous comment/description, this seems like a Rook-specific issue; transferring it to the rook component.

Comment 11 Joy John Pinto 2024-05-14 07:30:09 UTC
Verified with OCP 4.16.0-0.nightly-2024-05-08-222442 and ODF 4.16.0-96

Verification steps:
1. Installed OCP 4.16 and ODF 4.16.0-96 on a 6-worker-node, 6-failure-domain cluster on vSphere
2. Updated the mon count to 5 and then changed it back to 3 via the storagecluster 'monCount' attribute
3. The monCount value is updated in the storagecluster and cephcluster, and the mon pod count is reduced to three

storagecluster yaml:
    kms: {}
  externalStorage: {}
  managedResources:
    cephBlockPools: {}
    cephCluster:
      monCount: 5

[jopinto@jopinto 5mbug]$ oc get pods -n openshift-storage | grep mon
rook-ceph-mon-a-5cdd784484-zp6bl                                  2/2     Running           0             35m
rook-ceph-mon-b-5fdd68b844-4lb44                                  2/2     Running           0             34m
rook-ceph-mon-c-5f55dfb6bb-ch9ld                                  2/2     Running           0             34m
rook-ceph-mon-d-65ddfd5556-rlbdc                                  2/2     Running           0             8m38s
rook-ceph-mon-e-58d8475cd8-ndxdg                                  2/2     Running           0             8m18s

storagecluster yaml:
    kms: {}
  externalStorage: {}
  managedResources:
    cephBlockPools: {}
    cephCluster:
      monCount: 3

[jopinto@jopinto 5mbug]$ oc get pods -n openshift-storage | grep mon
rook-ceph-mon-c-5f55dfb6bb-ch9ld                                  2/2     Running     0               41m
rook-ceph-mon-d-65ddfd5556-rlbdc                                  2/2     Running     0               15m
rook-ceph-mon-e-58d8475cd8-ndxdg                                  2/2     Running     0               15m


sh-5.1$ ceph health
HEALTH_OK
sh-5.1$ 
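
In addition to the pod listings above, propagation of the lowered count can be confirmed directly; a quick check, assuming the standard app=rook-ceph-mon label on the mon pods:

# Desired mon count as seen by the CephCluster CR
oc get cephcluster -n openshift-storage -o jsonpath='{.items[0].spec.mon.count}{"\n"}'

# Number of mon pods actually running (expected: 3)
oc get pods -n openshift-storage -l app=rook-ceph-mon --no-headers | wc -l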


Also, upon changing the monCount back to three, the CephMonLowNumber alert is triggered, which is expected.

Comment 15 errata-xmlrpc 2024-07-17 13:14:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

