2234479 – RFE: automation around OSD removal to avoid Data Loss

Summary: RFE: automation around OSD removal to avoid Data Loss

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ocs-operator
Sub Component:
Version:	4.13
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	ODF 4.15.0
Assignee:	Subham Rai
QA Contact:	Oded
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-08-24 15:18 UTC by Manny
Modified:	2024-05-14 08:01 UTC
CC List:	6 users
Fixed In Version:	4.15.0-103
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-03-19 15:23:04 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	red-hat-storage ocs-ci pull 9655	0	None	Merged	Test CLI tool for disk replacement procedure	2024-05-14 08:01:50 UTC
Red Hat Product Errata	RHSA-2024:1383	0	None	None	None	2024-03-19 15:23:07 UTC

Description Manny 2023-08-24 15:18:12 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

RFE: automation around OSD removal to avoid Data Loss

We have support cases (at least 1 per week) where customers remove OSDs across nodes and failure domains.

Examples of what is driving this behavior:

- Too many disk resources are allocated to the Ceph backing store, so the customer merely takes 1 or 2 disks back from each OSD host
- Too many OSDs require too much memory and CPU, so the customer swaps out 4 x 512 GiB disks for 1 x 2 TiB disk
- There is a new VMware data store, so the customer migrates all Ceph OSDs from data store X to new OSDs in data store Y

There is nothing stopping customers from removing an OSD. At least there is nothing in place that a --force won't overcome.

The request is that some automation be created to do the work listed above, maybe even more. I realize this is not a trivial request. Just the same, maybe we start with automation that simply tells the administrator not to remove OSDs manually and directs them to review a KCS link before moving forward. Anything to get started and help administrators avoid inadvertently abusing their Ceph backing storage. We can create a KCS (agreed upon by Eng/Dev/CS) as the landing spot for such requests.

The ultimate goal should be automation which does all the work.
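
A minimal sketch of the kind of pre-removal guard described above, not the shipped implementation. The names `decide_removal`, `RemovalDecision`, and `ceph_safe_to_destroy` are hypothetical. The sketch assumes the `ceph` CLI is on PATH with admin credentials (for example inside a toolbox pod) and that a non-zero exit status from `ceph osd safe-to-destroy` means the OSD is not safe to remove.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class RemovalDecision:
    allowed: bool
    reason: str


def decide_removal(safe_to_destroy: bool, force: bool) -> RemovalDecision:
    """Hypothetical policy: refuse removal unless Ceph reports the OSD safe,
    or the administrator explicitly forces it."""
    if safe_to_destroy:
        return RemovalDecision(True, "OSD can be removed without data loss")
    if force:
        return RemovalDecision(
            True, "forced: data loss is possible, review the KCS first"
        )
    return RemovalDecision(
        False, "OSD is not safe to destroy; review the KCS before proceeding"
    )


def ceph_safe_to_destroy(osd_id: int) -> bool:
    """Ask Ceph whether the OSD can be destroyed without risking data.

    Assumption: the `ceph` CLI is available; a non-zero exit status is
    treated as "not safe".
    """
    result = subprocess.run(
        ["ceph", "osd", "safe-to-destroy", str(osd_id)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A caller would combine the two, for example `decide_removal(ceph_safe_to_destroy(3), force=False)`, and print `reason` before touching the OSD. Keeping the policy in a pure function makes the refusal path testable without a live cluster.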

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

BR
Manny

Comment 5 Malay Kumar parida 2023-08-30 05:14:42 UTC

Created a Jira Epic for tracking: https://issues.redhat.com/browse/RHSTOR-4838

Comment 7 Malay Kumar parida 2023-09-12 11:32:43 UTC

My understanding is that customers trying to remove OSDs simply use the --force option without knowing the consequences. As it stands, the --force option should do exactly that (i.e. go ahead with the removal no matter what). I believe the problem lies more in the documentation, where we need to warn customers clearly and correctly about the data loss that would happen. Since Subham was involved earlier in the feature that added the --force removal parameter, I am assigning this to him so he can better help draft the doc wording. I am also closing the Jira attached here and removing it from the 4.15 discussions.

Comment 17 Oded 2024-01-09 09:33:15 UTC

Tested on ODF 4.15 + OCP 4.15

CLI tool, replace disk, internal mode: https://docs.google.com/document/d/1r2b9lAasqxSOrnKA07qbUlqjVm3ogp5fH3wC-Ti-mXg/edit
CLI tool, replace disk, LSO vSphere: https://docs.google.com/document/d/1uP3hKL1tWB1rdoMmPPBDqCoF_y5Zw1TeRbPIUVBP468/edit
Verify that a healthy OSD cannot be removed: https://docs.google.com/document/d/1LX6tTK0cZUdgiu_KgJzINLb6uxaEjcMA2avdN19tb_Y/edit

Comment 18 errata-xmlrpc 2024-03-19 15:23:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383
