Bug 2026007

Summary: Use ceph 'osd safe-to-destroy' feature in OSD purge job
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Vikhyat Umrao <vumrao>
Component: rook Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA QA Contact: Itzhak <ikave>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.8 CC: bkunal, etamir, kramdoss, madam, muagarwa, nberry, ocs-bugs, odf-bz-bot, owasserm, shan, shilpsha, srai, tdesala, tnielsen
Target Milestone: --- Keywords: FutureFeature
Target Release: ODF 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.10.0-132 Doc Type: Enhancement
Doc Text:
.OSDs are safe when multiple removal jobs are fired
Previously, when multiple OSD removal jobs were fired in parallel, there was a risk of data loss because each job forcefully removed its OSD. With this update, a removal job first checks whether the OSD is ok-to-stop and only then proceeds. The job waits indefinitely, retrying every minute, which keeps the OSD safe from data loss.
Story Points: ---
Clone Of:
: 2027396 2106026 2106027 Environment:
Last Closed: 2022-04-13 18:50:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2027396, 2056571, 2106025, 2106026, 2106027    

Description Vikhyat Umrao 2021-11-23 16:01:30 UTC
Description of problem (please be as detailed as possible and provide log snippets):
Use the ceph 'osd safe-to-destroy' and 'osd ok-to-stop' features in the OSD purge job

[1] mgr: implement 'osd safe-to-destroy' and 'osd ok-to-stop' commands
    https://github.com/ceph/ceph/pull/16976

    An OSD is safe to destroy if:
    - we have osd_stat for it
    - osd_stat indicates no PGs stored
    - all PGs are known
    - no PGs map to it
    i.e., overall data durability will not be affected.

    An OSD is ok to stop if:
    - we have the PG stats we need
    - no PGs will drop below min_size
    i.e., availability won't be immediately compromised.
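
For illustration only, here is a minimal sketch of how a purge job could gate OSD removal on the 'osd safe-to-destroy' check and retry once a minute. It is not the Rook implementation. It assumes the ceph CLI is on the PATH and that 'ceph osd safe-to-destroy' exits with status 0 only when the OSD is safe to destroy. The names ceph_check and wait_until_safe_to_destroy, and the interval and max_attempts parameters, are hypothetical.

```python
import subprocess
import time
from typing import Callable, List, Optional


def ceph_check(args: List[str]) -> bool:
    """Run a ceph CLI subcommand and report whether it exited with status 0.

    Assumption: a zero exit status means the check passed.
    """
    result = subprocess.run(["ceph", *args], capture_output=True, text=True)
    return result.returncode == 0


def wait_until_safe_to_destroy(
    osd_id: int,
    check: Callable[[List[str]], bool] = ceph_check,
    sleep: Callable[[float], None] = time.sleep,
    interval: float = 60.0,
    max_attempts: Optional[int] = None,
) -> int:
    """Poll 'ceph osd safe-to-destroy' until it succeeds.

    Returns the number of checks performed. With max_attempts=None the
    loop retries indefinitely; otherwise TimeoutError is raised once the
    limit is reached. 'check' and 'sleep' are injectable for testing.
    """
    attempts = 0
    while True:
        attempts += 1
        if check(["osd", "safe-to-destroy", f"osd.{osd_id}"]):
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise TimeoutError(
                f"osd.{osd_id} not safe to destroy after {attempts} checks"
            )
        # Not safe yet: wait before asking the cluster again.
        sleep(interval)
```

A caller would run wait_until_safe_to_destroy(3) and purge osd.3 only after it returns. The same loop could poll 'osd ok-to-stop' by changing the subcommand passed to check.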

Comment 5 Travis Nielsen 2021-11-23 17:12:45 UTC
Not a blocker for 4.9. Moving out to 4.10, but could be considered for 4.9.z if needed.

Comment 17 Itzhak 2022-03-13 13:25:57 UTC
Should I add the parameters 'osd safe-to-destroy' and 'osd ok-to-stop' to the OSD removal job?
Please provide more details about the exact steps needed to test it.

Comment 18 Subham Rai 2022-03-14 09:54:57 UTC
I think you first need to mark the OSD safe to destroy and then pass the flag accordingly in the 'oc process' command.

Comment 19 Itzhak 2022-03-15 10:05:54 UTC
Per my comment https://bugzilla.redhat.com/show_bug.cgi?id=2027826#c16 in the BZ https://bugzilla.redhat.com/show_bug.cgi?id=2027826, I am also moving this bug to Verified.

Comment 20 Mudit Agarwal 2022-03-31 15:02:14 UTC
Please add doc text.

Comment 22 Sébastien Han 2022-04-11 08:21:13 UTC
This is fine, thanks Shilpi.

Comment 24 errata-xmlrpc 2022-04-13 18:50:37 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372