Bug 2266621 - mon pod scaledown is skipped if the mons are portable
Summary: mon pod scaledown is skipped if the mons are portable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Subham Rai
QA Contact: Joy John Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-02-28 15:12 UTC by Joy John Pinto
Modified: 2024-07-17 13:14 UTC
CC List: 4 users

Fixed In Version: 4.16.0-89
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-17 13:14:31 UTC
Embargoed:




Links:
- GitHub red-hat-storage/rook pull 638 (open): Bug 2266621: mon: fix mon scaledown when mons are portable (last updated 2024-04-29 05:13:43 UTC)
- Red Hat Product Errata RHSA-2024:4591 (last updated 2024-07-17 13:14:34 UTC)

Description Joy John Pinto 2024-02-28 15:12:26 UTC
Description of problem (please be as detailed as possible and provide log snippets):
mon pod scaledown is skipped if the mons are portable

Version of all relevant components (if applicable):
OCP 4.15 and ODF 4.15.0-150

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install OCP 4.15 and ODF 4.15.0-150 on a 6-worker-node, 6-failure-domain cluster on vSphere
2. Update the mon count to 5 and then change it back to 3 via the storagecluster 'monCount' attribute (an example patch is shown after these steps)
3. The monCount value is updated in the storagecluster and cephcluster, but five mons still keep running
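
For reference, step 2 can be applied with a merge patch against the StorageCluster. This is only a sketch: the resource name "ocs-storagecluster" is the usual default and is assumed here, so adjust it if the cluster uses a different name.

# Scale the mon count up to 5 (assumes the default StorageCluster name "ocs-storagecluster")
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
  -p '{"spec":{"managedResources":{"cephCluster":{"monCount":5}}}}'

# Scale it back down to 3
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
  -p '{"spec":{"managedResources":{"cephCluster":{"monCount":3}}}}'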


Actual results:
Even after scaling the mon count down to 3, five mon pods keep running

Expected results:
Upon scaling the mon count down to three, only three mon pods should be running

Additional info:
storagecluster CR:
 spec:
    arbiter: {}
    enableCephTools: true
    encryption:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster:
        monCount: 3
      cephConfig: {}

cephcluster CR:
name: balancer
    mon:
      count: 3
      volumeClaimTemplate:

rook ceph operator log:
2024-02-28 14:40:10.746033 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"
2024-02-28 14:40:10.796538 I | ceph-cluster-controller: reporting cluster telemetry
2024-02-28 14:40:10.804990 I | ceph-cluster-controller: enabling ceph mon monitoring goroutine for cluster "openshift-storage"
2024-02-28 14:40:16.823208 I | ceph-cluster-controller: reporting node telemetry
2024-02-28 14:40:56.290615 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:40:56.290662 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:40:56.290666 I | op-mon: did not identify a mon to remove
2024-02-28 14:41:41.744893 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:41:41.744951 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:41:41.744954 I | op-mon: did not identify a mon to remove
2024-02-28 14:42:27.196955 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:42:27.196997 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:42:27.197000 I | op-mon: did not identify a mon to remove
2024-02-28 14:43:12.623301 I | op-mon: removing an extra mon. currently 5 are in quorum and only 3 are desired
2024-02-28 14:43:12.623450 I | op-mon: removing arbitrary extra mon ""
2024-02-28 14:43:12.623470 I | op-mon: did not identify a mon to remove
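
The log above shows the operator repeatedly selecting an empty mon name ("") to remove when the mons are portable, so the scaledown is silently skipped. While the issue is present, the leftover mons can be inspected as follows; this is a diagnostic sketch that assumes the standard Rook conventions (the app=rook-ceph-mon label and the rook-ceph-mon-endpoints ConfigMap):

# List the mon deployments still tracked by the operator (five remain instead of three)
oc get deployments -n openshift-storage -l app=rook-ceph-mon

# Show the mon map the operator keeps in its endpoints ConfigMap
oc get configmap rook-ceph-mon-endpoints -n openshift-storage -o jsonpath='{.data.data}{"\n"}'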

Comment 2 Travis Nielsen 2024-02-28 18:06:27 UTC
Moving to 4.16, not a blocker.

Comment 4 Nikhil Ladha 2024-02-29 06:25:45 UTC
Based on the previous comment/description, this seems like a Rook-specific issue; transferring it to the rook component.

Comment 11 Joy John Pinto 2024-05-14 07:30:09 UTC
Verified with OCP 4.16.0-0.nightly-2024-05-08-222442 and ODF 4.16.0-96

Verification steps:
1. Installed OCP 4.16 and ODF 4.16.0-96 on a 6-worker-node, 6-failure-domain cluster on vSphere
2. Updated the mon count to 5 and then changed it back to 3 via the storagecluster 'monCount' attribute
3. The monCount value is updated in the storagecluster and cephcluster, and the mon pod count is reduced to three

storagecluster yaml:
    kms: {}
  externalStorage: {}
  managedResources:
    cephBlockPools: {}
    cephCluster:
      monCount: 5

[jopinto@jopinto 5mbug]$ oc get pods -n openshift-storage | grep mon
rook-ceph-mon-a-5cdd784484-zp6bl                                  2/2     Running           0             35m
rook-ceph-mon-b-5fdd68b844-4lb44                                  2/2     Running           0             34m
rook-ceph-mon-c-5f55dfb6bb-ch9ld                                  2/2     Running           0             34m
rook-ceph-mon-d-65ddfd5556-rlbdc                                  2/2     Running           0             8m38s
rook-ceph-mon-e-58d8475cd8-ndxdg                                  2/2     Running           0             8m18s

storagecluster yaml:
    kms: {}
  externalStorage: {}
  managedResources:
    cephBlockPools: {}
    cephCluster:
      monCount: 3

[jopinto@jopinto 5mbug]$ oc get pods -n openshift-storage | grep mon
rook-ceph-mon-c-5f55dfb6bb-ch9ld                                  2/2     Running     0               41m
rook-ceph-mon-d-65ddfd5556-rlbdc                                  2/2     Running     0               15m
rook-ceph-mon-e-58d8475cd8-ndxdg                                  2/2     Running     0               15m


sh-5.1$ ceph health
HEALTH_OK
sh-5.1$ 
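
In addition to the pod listings above, propagation of the lowered count can be confirmed directly; a quick check, assuming the standard app=rook-ceph-mon label on the mon pods:

# Desired mon count as seen by the CephCluster CR
oc get cephcluster -n openshift-storage -o jsonpath='{.items[0].spec.mon.count}{"\n"}'

# Number of mon pods actually running (expected: 3)
oc get pods -n openshift-storage -l app=rook-ceph-mon --no-headers | wc -l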


Also, upon changing the monCount back to three, the CephMonLowNumber alert is triggered, which is expected.

Comment 15 errata-xmlrpc 2024-07-17 13:14:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

