Description of problem (please be as detailed as possible and provide log snippets):

RFE: automation around OSD removal to avoid data loss

We have support cases (at least 1 per week) where customers remove OSDs across nodes and failure domains.

Examples of what is driving this behavior:
- Too many disk resources are allocated to the Ceph backing store, so the customer takes 1 or 2 disks back from each OSD host.
- Too many OSDs require too much memory and CPU, so the customer swaps out 4 x 512 GiB disks for 1 x 2 TiB disk.
- There is a new VMware data store, so the customer migrates all Ceph OSDs from data store X to new OSDs in data store Y.

There is nothing stopping customers from removing an OSD. At least, there is nothing in place that a --force won't overcome.

The request is that some automation be created to do the work listed above, maybe even more. I realize this is not a trivial request. Just the same, maybe we start with automation that simply explains not to remove OSDs manually and directs the administrator to review a KCS link before moving forward. Anything to get started and help administrators avoid inadvertently abusing their Ceph backing storage. We can create a KCS (agreed upon by Eng/Dev/CS) as the landing spot for such requests. The ultimate goal should be automation that does all the work.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

BR
Manny
Created a Jira epic for tracking: https://issues.redhat.com/browse/RHSTOR-4838
My understanding is that customers who want to remove OSDs simply use the --force option without knowing the consequences. As designed, the --force option should do exactly that (i.e. go ahead with the removal no matter what). I believe the problem lies more in the documentation, where we need to warn customers clearly and correctly about the data loss that can result.

Subham was involved earlier in adding the --force removal parameter and has worked on it, so I am assigning this to him; he is best placed to help draft the documentation wording. I am also closing the attached Jira and removing it from the 4.15 discussions.
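To make the discussion concrete, here is a minimal sketch of the kind of guard the RFE asks for, combined with the --force semantics described above. It is not the ODF implementation. The function names, the injectable `run` hook, and the KCS placeholder text are hypothetical; the only real interface assumed is the Ceph command `ceph osd safe-to-destroy`, which exits with 0 when the OSD can be destroyed without reducing data durability.

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple

# Placeholder only: the actual KCS article has not been agreed yet.
KCS_PLACEHOLDER = "<KCS article agreed by Eng/Dev/CS>"


def ceph_says_safe(osd_id: int,
                   run: Optional[Callable[[Sequence[str]], int]] = None) -> bool:
    """Return True if `ceph osd safe-to-destroy` exits with status 0."""
    cmd = ["ceph", "osd", "safe-to-destroy", f"osd.{osd_id}"]
    if run is None:
        # Default runner: call the real ceph CLI and return its exit code.
        run = lambda c: subprocess.run(c, capture_output=True).returncode
    return run(cmd) == 0


def removal_decision(osd_id: int, force: bool = False,
                     run: Optional[Callable[[Sequence[str]], int]] = None
                     ) -> Tuple[bool, str]:
    """Decide whether removal may proceed, and explain why (hypothetical helper)."""
    if ceph_says_safe(osd_id, run):
        return True, f"osd.{osd_id}: safe to destroy, proceeding."
    if force:
        # --force still wins, but the administrator is warned first.
        return True, (f"osd.{osd_id}: NOT safe to destroy; --force given, "
                      f"data loss is possible. Review {KCS_PLACEHOLDER}.")
    return False, (f"osd.{osd_id}: NOT safe to destroy; refusing to remove. "
                   f"Review {KCS_PLACEHOLDER} before moving forward.")
```

The `run` parameter exists only so the decision logic can be exercised without a cluster, for example `removal_decision(3, run=lambda cmd: 1)` simulates an OSD that Ceph reports as unsafe to destroy.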
Tested on ODF 4.15 + OCP 4.15.

CLI tool, replace disk, internal mode:
https://docs.google.com/document/d/1r2b9lAasqxSOrnKA07qbUlqjVm3ogp5fH3wC-Ti-mXg/edit

CLI tool, replace disk, LSO on vSphere:
https://docs.google.com/document/d/1uP3hKL1tWB1rdoMmPPBDqCoF_y5Zw1TeRbPIUVBP468/edit

Verify that a healthy OSD cannot be removed:
https://docs.google.com/document/d/1LX6tTK0cZUdgiu_KgJzINLb6uxaEjcMA2avdN19tb_Y/edit
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383