+++ This bug was initially created as a clone of Bug #1927552 +++

Description of problem (please be detailed as possible and provide log snippets):

OCS storage capacity can currently be expanded, but not shrunk. The scenarios include:
- A device backing an OSD is failing
- A device needs to be wiped and provisioned for a purpose other than OCS
- Nodes with non-portable OSDs are being decommissioned

Today, if OSDs are cleaned up, Rook will immediately reconcile them and re-create any OSDs or backing PVCs that were deleted. The one scenario that does work today is:
- Reduce the count of the storageClassDeviceSet. Rook will stop reconciling the OSD backed by the PVC with the highest index name.
- Remove the OSD backed by the PVC with the highest index.

However, this does not allow an arbitrary OSD to be removed.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Large clusters require the ability to deprovision hardware, particularly on-premises.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

No, removing OSDs is not supported from the UI.

If this is a regression, please provide more details to justify this:

NA

Steps to Reproduce:
1. Install OCS
2. Reduce the count in the StorageCluster CR
3. Attempt to remove an arbitrary OSD

Actual results:
Rook replaces the OSD that was removed.

Expected results:
Rook should allow the reduced count of OSDs in the storageClassDeviceSet.

--- Additional comment from Travis Nielsen on 2021-02-11 00:26:35 UTC ---

This has already been fixed in the upstream Rook v1.5 release and is included downstream in OCS 4.7.
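As a sketch of step 2 in the reproduction above, the reduction is a one-field change in the StorageCluster CR. Field names follow the OCS StorageCluster schema; the resource name, device set name, and counts here are illustrative:

```yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster        # illustrative name
  namespace: openshift-storage
spec:
  storageDeviceSets:
  - name: ocs-deviceset           # illustrative name
    count: 2                      # reduced from 3; with replica: 3 this removes one OSD per zone
    replica: 3
    # ...remaining device set fields unchanged
```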
The upstream PR: https://github.com/rook/rook/pull/6982

This BZ is to raise awareness of the feature of reducing cluster size and to get QE coverage. Given it is late in the 4.7 cycle, we might ship it as dev or tech preview until 4.8 and allow an exception. Goldman is asking for it to be backported to 4.6 as well.

It is almost the same scenario as replacing an OSD, which is documented today: we have the osd-removal job that purges the old OSD. The difference in this scenario is that we don't want the operator to create a new OSD in its place. The only extra documentation needed should be to reduce the "count" of the device sets before running the osd-removal job.

This has more implications in OCS than in Rook. With the StorageCluster CR, if you reduce the device set count, the count is reduced across all zones, effectively removing 3 OSDs from the cluster. In other words, with +3 scaling you get a -3 reduction and need to remove one OSD from each zone; with +1 scaling you get a -1 reduction and remove a single OSD.
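The procedure described above (reduce the count first, then purge the OSD) can be sketched as the following cluster commands. This is a hedged outline, not the official documented procedure: the `ocs-osd-removal` template and its `FAILED_OSD_IDS` parameter match what OCS documents for OSD replacement, but the OSD ID and resource names here are illustrative and may differ per cluster and release:

```shell
# 1. Reduce the device set "count" in the StorageCluster CR first,
#    so the operator does not re-create the removed OSD.
oc edit storagecluster -n openshift-storage ocs-storagecluster

# 2. Scale down the deployment of the OSD being removed (ID 2 is illustrative).
oc scale deployment -n openshift-storage rook-ceph-osd-2 --replicas=0

# 3. Run the osd-removal job to purge that OSD from the Ceph cluster.
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=2 \
  | oc create -f -

# 4. Confirm the removal job completed before deleting the backing PVC.
oc get jobs -n openshift-storage
```

Repeat steps 2-4 once per zone when +3 scaling applies, since the count reduction removes one OSD from each zone.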
Verification should be based on regression testing, with an emphasis on device and node replacements.
Backport PR: https://github.com/openshift/rook/pull/186
Moving to VERIFIED based on the regression testing with v4.6.4-323.ci
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.6.4 container bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1134