Bug 1927922 - Allow shrinking the cluster by removing OSDs
Summary: Allow shrinking the cluster by removing OSDs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: OCS 4.6.4
Assignee: Travis Nielsen
QA Contact: Elad
URL:
Whiteboard:
Depends On: 1927552 1933736
Blocks:
 
Reported: 2021-02-11 20:04 UTC by Travis Nielsen
Modified: 2021-04-08 10:29 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1927552
Environment:
Last Closed: 2021-04-08 10:29:00 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift rook pull 186 0 None open Bug 1927922: ceph: Allow removal of arbitrary osds on pvcs and simplify pvc names 2021-03-09 19:11:03 UTC
Red Hat Product Errata RHBA-2021:1134 0 None None None 2021-04-08 10:29:28 UTC

Description Travis Nielsen 2021-02-11 20:04:49 UTC
+++ This bug was initially created as a clone of Bug #1927552 +++

Description of problem (please be as detailed as possible and provide log
snippets):

OCS storage capacity can currently be expanded, but not shrunk. The scenarios include:
- A device backing an OSD is failing
- A device needs to be wiped and provisioned for a purpose other than OCS
- Nodes with non-portable OSDs are being decommissioned

Today, if OSDs are removed, Rook will immediately reconcile and re-create any OSDs, or their backing PVCs, that have been deleted.

The one scenario that would work today is:
- Reduce the count of the storageClassDeviceSet. Rook will stop reconciling the OSD that is backed by the PVC with the highest index name.
- Remove the OSD that is backed by the PVC with the highest index.

However, this does not allow an arbitrary OSD to be removed.
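The one working scenario above hinges on the count field of the device set in the CephCluster CR. A minimal, illustrative excerpt (the set name and sizes are hypothetical, not taken from this cluster):

```yaml
# Hypothetical CephCluster CR excerpt. Reducing count from 3 to 2 makes
# Rook stop reconciling the OSD backed by the highest-index PVC, so that
# OSD and its PVC can then be removed without being re-created.
storage:
  storageClassDeviceSets:
    - name: ocs-deviceset        # illustrative name
      count: 2                   # reduced from 3
      portable: true
      volumeClaimTemplates:
        - spec:
            resources:
              requests:
                storage: 512Gi   # illustrative size
```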


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Large clusters require the ability to do hardware deprovisioning, particularly on prem.


Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

No, removing OSDs is not supported from the UI


If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install OCS
2. Reduce the count in the StorageCluster CR
3. Attempt to remove an arbitrary OSD

Actual results:

Rook will replace the OSD that was removed

Expected results:

Rook should allow for the reduced count of OSDs in the storageClassDeviceSet

--- Additional comment from Travis Nielsen on 2021-02-11 00:26:35 UTC ---

This has already been fixed in the upstream v1.5 Rook release and included downstream in OCS 4.7. The upstream PR: 
https://github.com/rook/rook/pull/6982

This BZ is to raise awareness of the feature of reducing cluster size and to get QE coverage. Given that it is late in the 4.7 cycle, we might keep it as dev or tech preview until 4.8 and allow an exception. Goldman is asking for it to be backported to 4.6 as well.

It is almost the same scenario as being able to replace an OSD, which is documented today where we have the osd-removal job that will purge the old OSD. The difference for this scenario is that we don’t want the operator to create a new OSD in its place. The only extra documentation should be to reduce the “count” of the device sets before you run the osd-removal job.
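The procedure described above can be sketched as follows. The commands follow the standard OCS device-replacement flow (the `ocs-osd-removal` template and its `FAILED_OSD_IDS` parameter are from that documented procedure); the OSD ID is an example, and this is an operational sketch, not a verified runbook:

```shell
# 1. First reduce the "count" of the device set in the StorageCluster CR,
#    so the operator does not create a new OSD in place of the removed one.
oc edit storagecluster -n openshift-storage ocs-storagecluster

# 2. Scale down the deployment of the OSD being removed (ID 2 is an example).
osd_id=2
oc scale deployment rook-ceph-osd-${osd_id} -n openshift-storage --replicas=0

# 3. Run the osd-removal job to purge the OSD from the Ceph cluster.
oc process -n openshift-storage ocs-osd-removal \
  -p FAILED_OSD_IDS=${osd_id} | oc create -f -

# 4. Delete the now-orphaned PVC that backed the removed OSD.
```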

This has more implications in OCS than in Rook. With the StorageCluster CR, reducing the device set count reduces it across all zones, effectively removing 3 OSDs from the cluster. In other words, with +3 scaling you get a -3 reduction and need to remove one OSD from each zone; with +1 scaling you get a -1 reduction and remove a single OSD.
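The zone-wide effect comes from the replica field of the StorageCluster device set. An illustrative excerpt (names and values are hypothetical):

```yaml
# Hypothetical StorageCluster CR excerpt. Total OSDs = count * replica,
# so with replica: 3 (one OSD per zone), lowering count by 1 removes
# 3 OSDs, one from each failure domain.
spec:
  storageDeviceSets:
    - name: ocs-deviceset   # illustrative name
      count: 1              # reduced from 2 -> 3 fewer OSDs cluster-wide
      replica: 3            # one OSD per failure domain (zone)
```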

Comment 2 Elad 2021-03-09 08:26:38 UTC
Verification should be based on regression testing, with an emphasis on device and node replacements

Comment 5 Travis Nielsen 2021-03-09 19:11:03 UTC
Backport PR: https://github.com/openshift/rook/pull/186

Comment 9 Elad 2021-03-29 15:01:12 UTC
Moving to VERIFIED based on the regression testing with v4.6.4-323.ci

Comment 13 errata-xmlrpc 2021-04-08 10:29:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.6.4 container bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1134

