Bug 1927552

Summary: [RFE] Allow shrinking the cluster by removing OSDs
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Travis Nielsen <tnielsen>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED DEFERRED
QA Contact: Elad <ebenahar>
Severity: unspecified
Priority: unspecified
Version: 4.6
CC: aclewett, dmoessne, etamir, madam, muagarwa, ocs-bugs, odf-bz-bot, owasserm, rhale
Target Milestone: ---
Target Release: ---
Keywords: FutureFeature
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Clones: 1927922, 1933736 (view as bug list)
Bug Blocks: 1927922, 1933736
Last Closed: 2022-01-21 14:50:04 UTC
Type: Bug
Regression: ---

Description Travis Nielsen 2021-02-11 00:20:18 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

OCS storage capacity can currently be expanded, but not shrunk. Scenarios that require shrinking include:
- A device backing an OSD is failing
- A device needs to be wiped and provisioned for a purpose other than OCS
- Nodes with non-portable OSDs are being decommissioned

Today, if OSDs are deleted in an attempt to clean them up, Rook immediately reconciles and re-creates any OSDs or backing PVCs that were removed.

The one scenario that would work today is:
- Reduce the count of the storageClassDeviceSet. Rook will stop reconciling the OSD that is backed by the PVC with the highest index name.
- Remove the OSD that is backed by the PVC with the highest index.

However, this only retires the highest-index OSD; it does not currently allow an arbitrary OSD to be removed.
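
For concreteness, a hedged sketch of the one working scenario, assuming the default rook-ceph namespace and CephCluster name; the device set index, count values, and PVC name are illustrative:

    # 1. Reduce the count of the first storageClassDeviceSet (e.g. from 3 to 2);
    #    Rook stops reconciling the OSD backed by the highest-index PVC.
    kubectl -n rook-ceph patch cephcluster rook-ceph --type json \
      -p '[{"op": "replace", "path": "/spec/storage/storageClassDeviceSets/0/count", "value": 2}]'

    # 2. Remove the OSD backed by the PVC with the highest index name, then
    #    delete that PVC (name here is hypothetical).
    kubectl -n rook-ceph delete pvc set1-data-2-xxxxx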


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Large clusters require the ability to deprovision hardware, particularly on-premises.


Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

No, removing OSDs is not supported from the UI


If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install OCS
2. Reduce the count in the StorageCluster CR
3. Attempt to remove an arbitrary OSD (e.g. as sketched below)
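
A minimal sketch of step 3, assuming the openshift-storage namespace and the standard rook-ceph-osd-<id> deployment naming; the OSD ID is illustrative:

    # Delete an arbitrary OSD deployment and watch the operator reconcile
    # it back into existence:
    oc -n openshift-storage delete deployment rook-ceph-osd-1
    oc -n openshift-storage get deployment rook-ceph-osd-1 -w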

Actual results:

Rook re-creates the OSD that was removed

Expected results:

Rook should honor the reduced OSD count in the storageClassDeviceSet and not re-create the removed OSD

Comment 1 Travis Nielsen 2021-02-11 00:26:35 UTC
This has already been fixed in the upstream Rook v1.5 release and is included downstream in OCS 4.7. The upstream PR:
https://github.com/rook/rook/pull/6982

This BZ is to raise awareness of the feature for reducing cluster size and to ensure QE coverage. Given that it is late in the 4.7 cycle, we might keep it as dev or tech preview until 4.8 and allow an exception. Goldman is asking for it to be backported to 4.6 as well.

It is almost the same scenario as replacing an OSD, which is documented today: the osd-removal job purges the old OSD. The difference in this scenario is that we don't want the operator to create a new OSD in its place. The only extra documentation needed is to reduce the "count" of the device sets before running the osd-removal job, as sketched below.
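
A hedged sketch of that flow; the ocs-osd-removal template and openshift-storage namespace match the documented OCS procedure, while the OSD ID is illustrative:

    # Reduce the device set "count" in the StorageCluster CR first, then purge
    # the OSD so the operator does not re-create it:
    oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=2 \
      | oc create -n openshift-storage -f -

    # Follow the removal job's pod until it completes (job/pod names vary
    # by release):
    oc get pods -n openshift-storage | grep ocs-osd-removal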

This has more implications in OCS than in Rook. With the StorageCluster CR, reducing the device set count reduces it across all zones, effectively removing 3 OSDs from the cluster. In other words, with +3 scaling you get a -3 reduction and need to remove one OSD from each zone; with +1 scaling you get -1 and remove a single OSD. The sketch below illustrates the arithmetic.
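
To illustrate: in the StorageCluster CR the total OSD count is count x replica, so with the usual replica of 3 (one failure domain per zone), dropping count by one removes three OSDs. A sketch, assuming the common resource name and device set index:

    # Drop the device set count from 3 to 2; with replica: 3 this removes
    # 3 OSDs (one per zone), with replica: 1 it removes a single OSD.
    oc -n openshift-storage patch storagecluster ocs-storagecluster --type json \
      -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 2}]'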

Comment 2 Travis Nielsen 2021-02-17 22:49:39 UTC
This has already been in the 4.7 builds for a few weeks; it just needs a qa_ack

Comment 9 Mudit Agarwal 2021-08-16 07:18:38 UTC
AFAIK, it is still in the same state.
Eran, can we add this as an Epic in 4.10?

Comment 13 Mudit Agarwal 2022-01-21 14:50:04 UTC
Tracked by https://issues.redhat.com/browse/RHSTOR-1934, closing the bug.
The code is already present, and the Jira can be used as a tracker for testing whenever we decide to test it.