Bug 2026007 - Use ceph 'osd safe-to-destroy' feature in OSD purge job
Summary: Use ceph 'osd safe-to-destroy' feature in OSD purge job
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: Sébastien Han
QA Contact: Itzhak
URL:
Whiteboard:
Depends On:
Blocks: 2027396 2056571 2106025 2106026 2106027
 
Reported: 2021-11-23 16:01 UTC by Vikhyat Umrao
Modified: 2023-08-09 17:03 UTC
CC List: 14 users

Fixed In Version: 4.10.0-132
Doc Type: Enhancement
Doc Text:
.OSDs are safe when multiple removal jobs are fired
Previously, when multiple OSD removal jobs were fired in parallel, there was a risk of losing data because the job forcefully removed the OSD. With this update, the removal job first checks whether the OSD is ok-to-stop and only then proceeds. The check waits indefinitely, retrying every minute, which keeps the OSD safe from data loss.
Clone Of:
Clones: 2027396 2106026 2106027
Environment:
Last Closed: 2022-04-13 18:50:37 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-ci pull 6114 0 None Merged Added FORCE_OSD_REMOVAL flag on ocs-osd-removal-job 2022-07-27 10:04:06 UTC
Github rook rook pull 9230 0 None open osd: check if osd is ok-to-stop before removal 2021-11-23 17:53:13 UTC
Red Hat Product Errata RHSA-2022:1372 0 None None None 2022-04-13 18:51:04 UTC

Description Vikhyat Umrao 2021-11-23 16:01:30 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Use the ceph 'osd safe-to-destroy' and 'osd ok-to-stop' features in the OSD purge job.

[1] mgr: implement 'osd safe-to-destroy' and 'osd ok-to-stop' commands
    https://github.com/ceph/ceph/pull/16976

An OSD is safe to destroy if:
- we have osd_stat for it
- osd_stat indicates no PGs stored
- all PGs are known
- no PGs map to it
i.e., overall data durability will not be affected.

An OSD is ok to stop if:
- we have the PG stats we need
- no PGs will drop below min_size
i.e., availability won't be immediately compromised.
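
A minimal Go sketch of the gating logic described above, assuming the ceph CLI is reachable from the process; this is an illustration only, not Rook's actual code (the real change is in the linked rook PR 9230):

    // Illustration: wait until `ceph osd ok-to-stop <id>` succeeds,
    // retrying every minute, before letting the purge proceed.
    package main

    import (
        "fmt"
        "os/exec"
        "time"
    )

    // okToStop shells out to `ceph osd ok-to-stop <id>`; the command
    // exits non-zero if stopping the OSD would drop a PG below min_size.
    func okToStop(osdID int) bool {
        return exec.Command("ceph", "osd", "ok-to-stop", fmt.Sprint(osdID)).Run() == nil
    }

    func main() {
        const osdID = 0 // hypothetical OSD targeted for removal
        // Wait indefinitely, retrying every minute, as the doc text describes.
        for !okToStop(osdID) {
            fmt.Printf("OSD.%d is not yet ok-to-stop; retrying in 1m\n", osdID)
            time.Sleep(time.Minute)
        }
        fmt.Printf("OSD.%d is ok-to-stop; proceeding with purge\n", osdID)
    }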

Comment 5 Travis Nielsen 2021-11-23 17:12:45 UTC
Not a blocker for 4.9. Moving out to 4.10, but could be considered for 4.9.z if needed.

Comment 17 Itzhak 2022-03-13 13:25:57 UTC
Should I add the 'osd safe-to-destroy' and 'osd ok-to-stop' parameters to the OSD removal job?
Please provide more details about the exact steps needed to test this.

Comment 18 Subham Rai 2022-03-14 09:54:57 UTC
I think you first need to mark the OSD safe to destroy and then pass the flag accordingly in the 'oc process' command.
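
For illustration, a hypothetical invocation (FORCE_OSD_REMOVAL is the flag added by the linked ocs-ci PR 6114, FAILED_OSD_IDS is the removal template's usual parameter; exact names can vary by release):

    oc process -n openshift-storage ocs-osd-removal \
        -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=false | oc create -f -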

Comment 19 Itzhak 2022-03-15 10:05:54 UTC
According to my comment https://bugzilla.redhat.com/show_bug.cgi?id=2027826#c16 in bug https://bugzilla.redhat.com/show_bug.cgi?id=2027826, I am moving this bug to Verified as well.

Comment 20 Mudit Agarwal 2022-03-31 15:02:14 UTC
Please add doc text.

Comment 22 Sébastien Han 2022-04-11 08:21:13 UTC
This is fine, thanks Shilpi.

Comment 24 errata-xmlrpc 2022-04-13 18:50:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372

