Bug 1927922

Summary: Allow shrinking the cluster by removing OSDs
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Component: rook
Version: 4.6
Target Release: OCS 4.6.4
Keywords: ZStream
Status: CLOSED ERRATA
Reporter: Travis Nielsen <tnielsen>
Assignee: Travis Nielsen <tnielsen>
QA Contact: Elad <ebenahar>
CC: aclewett, dmoessne, ebenahar, madam, muagarwa, ocs-bugs, rcyriac, rhale
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clone Of: 1927552
Bug Depends On: 1927552, 1933736
Last Closed: 2021-04-08 10:29:00 UTC

Description Travis Nielsen 2021-02-11 20:04:49 UTC
+++ This bug was initially created as a clone of Bug #1927552 +++

Description of problem (please be as detailed as possible and provide log
snippets):

OCS storage capacity can currently be expanded, but not shrunk. The scenarios include:
- A device backing an OSD is failing
- A device needs to be wiped and provisioned for a purpose other than OCS
- Nodes with non-portable OSDs are being decommissioned

Today, if OSDs are removed for cleanup, Rook immediately reconciles and re-creates any OSDs, or their backing PVCs, that were deleted.

The one scenario that works today (see the sketch after this list) is:
- Reduce the count of the storageClassDeviceSet. Rook will stop reconciling the OSD that is backed by the PVC with the highest index name.
- Remove the OSD that is backed by the PVC with the highest index.
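
A minimal sketch of that working path, assuming the default resource names from an OCS install (StorageCluster "ocs-storagecluster" in the "openshift-storage" namespace); OSD id 1 stands in for whichever OSD is backed by the highest-index PVC:

  # Reduce the device set count (here 3 -> 2; values are illustrative).
  # The OCS operator propagates this to the CephCluster's
  # storageClassDeviceSets, and Rook stops reconciling the OSD backed by
  # the PVC with the highest index.
  oc patch storagecluster ocs-storagecluster -n openshift-storage \
    --type json \
    -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 2}]'

  # Remove that OSD's deployment (and afterwards its backing PVC); Rook
  # should not re-create it, because it is now beyond the reduced count.
  oc delete deployment rook-ceph-osd-1 -n openshift-storage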

However, this does not allow an arbitrary OSD to be removed.


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Large clusters require the ability to deprovision hardware, particularly on-premises.


Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

No, removing OSDs is not supported from the UI


If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install OCS
2. Reduce the count in the StorageCluster CR
3. Attempt to remove an arbitrary OSD (see the sketch below)
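
A hedged sketch of step 3 and the observed behavior, assuming the default openshift-storage namespace and an arbitrary OSD id of 1:

  # After reducing the count (step 2, as in the earlier patch example),
  # try to remove an OSD that is NOT backed by the highest-index PVC:
  oc delete deployment rook-ceph-osd-1 -n openshift-storage

  # The operator reconciles, and the deployment (and any deleted backing
  # PVC) comes back, which is the behavior this bug asks to change:
  oc get deployment -n openshift-storage -l app=rook-ceph-osd -w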

Actual results:

Rook will replace the OSD that was removed

Expected results:

Rook should honor the reduced count of OSDs in the storageClassDeviceSet

--- Additional comment from Travis Nielsen on 2021-02-11 00:26:35 UTC ---

This has already been fixed in the upstream v1.5 Rook release and included downstream in OCS 4.7. The upstream PR: 
https://github.com/rook/rook/pull/6982

This BZ is to raise awareness of the feature of reducing cluster size and to get QE coverage. Given it is late in the 4.7 cycle, we might ship it as dev or tech preview until 4.8 and allow an exception. Goldman is asking for it to be backported to 4.6 as well.

It is almost the same scenario as replacing an OSD, which is documented today: the osd-removal job purges the old OSD. The difference in this scenario is that we don’t want the operator to create a new OSD in its place. The only extra documentation needed is to reduce the “count” of the device sets before running the osd-removal job.
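
For illustration, the flow might look like the following, assuming the ocs-osd-removal template that ships with OCS (the parameter name is FAILED_OSD_ID in OCS 4.6 and FAILED_OSD_IDS in later releases) and an arbitrary OSD id of 0:

  # Hedged sketch, not the authoritative procedure: after reducing the
  # device set "count" (see the earlier patch example), run the documented
  # osd-removal job so the old OSD is purged without being re-created.
  # OSD id 0 is illustrative.
  oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_ID=0 | oc create -n openshift-storage -f -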

This has more implications in OCS than in Rook. With the StorageCluster CR, reducing the device set count reduces the count across all zones, effectively removing 3 OSDs from the cluster. In other words, with +3 scaling you get a -3 reduction and need to remove one OSD from each zone; with +1 scaling you get a -1 reduction and remove a single OSD.
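
As a worked example of that arithmetic (assuming the usual layout where the total OSD count is the device set count multiplied by its replica value, one replica per failure domain; the numbers are illustrative):

  total OSDs = count x replica   (replica = number of zones, typically 3)
  replica 3 ("+3 scaling"): count 3 -> 2 removes 3 OSDs, one per zone
  replica 1 ("+1 scaling"): count 3 -> 2 removes 1 OSD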

Comment 2 Elad 2021-03-09 08:26:38 UTC
Verification should be based on regression testing, with an emphasis on device and node replacements

Comment 5 Travis Nielsen 2021-03-09 19:11:03 UTC
Backport PR: https://github.com/openshift/rook/pull/186

Comment 9 Elad 2021-03-29 15:01:12 UTC
Moving to VERIFIED based on the regression testing with v4.6.4-323.ci

Comment 13 errata-xmlrpc 2021-04-08 10:29:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.6.4 container bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1134