.OSDs are safe when multiple removal jobs are fired
Previously, when multiple OSD removal jobs were fired in parallel, there was a risk of data loss because the OSD was forcefully removed.
With this update, when multiple removal jobs are fired, the job first checks whether the OSD is ok-to-stop and only then proceeds. The check waits indefinitely, retrying every minute, thereby keeping the OSD safe from data loss.
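A minimal sketch of the behaviour described above, assuming the check is performed by shelling out to the real 'ceph osd ok-to-stop' command; the Go function name, logging, and polling structure are illustrative, not the actual Rook/ODF implementation:

```
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// waitUntilOKToStop blocks until `ceph osd ok-to-stop` reports that
// stopping the given OSD will not compromise availability.
func waitUntilOKToStop(osdID int) {
	for {
		// `ceph osd ok-to-stop` exits non-zero while stopping this OSD
		// would cause PGs to drop below min_size.
		cmd := exec.Command("ceph", "osd", "ok-to-stop", fmt.Sprint(osdID))
		if err := cmd.Run(); err == nil {
			return
		}
		fmt.Printf("osd.%d is not yet ok-to-stop, retrying in 1m\n", osdID)
		time.Sleep(time.Minute)
	}
}

func main() {
	waitUntilOKToStop(0)
	fmt.Println("osd.0 is ok to stop; proceeding with removal")
}
```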
Description of problem (please be as detailed as possible and provide log snippets):
Use the ceph 'osd safe-to-destroy' and 'osd ok-to-stop' features in the OSD purge job.
[1] mgr: implement 'osd safe-to-destroy' and 'osd ok-to-stop' commands
https://github.com/ceph/ceph/pull/16976
An OSD is safe to destroy if:
- we have osd_stat for it
- osd_stat indicates no PGs stored
- all PGs are known
- no PGs map to it
i.e., overall data durability will not be affected.
An OSD is ok to stop if:
- we have the PG stats we need
- no PGs will drop below min_size
i.e., availability won't be immediately compromised.
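As an illustration of how these two checks surface outside the mgr, here is a hedged sketch: both 'ceph osd safe-to-destroy <id>' and 'ceph osd ok-to-stop <id>' exit with status 0 when the condition holds and non-zero otherwise, printing the reason. The checkOSD helper and its output format are hypothetical:

```
package main

import (
	"fmt"
	"os/exec"
)

// checkOSD runs one of the two safety checks for the given OSD id and
// returns whether it passed along with the CLI's explanation.
func checkOSD(check string, osdID int) (bool, string) {
	out, err := exec.Command("ceph", "osd", check, fmt.Sprint(osdID)).CombinedOutput()
	return err == nil, string(out)
}

func main() {
	// safe-to-destroy: data durability will not be affected.
	if ok, msg := checkOSD("safe-to-destroy", 0); !ok {
		fmt.Printf("osd.0 is NOT safe to destroy: %s\n", msg)
	}
	// ok-to-stop: availability will not be immediately compromised.
	if ok, msg := checkOSD("ok-to-stop", 0); !ok {
		fmt.Printf("osd.0 is NOT ok to stop: %s\n", msg)
	}
}
```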
Should I add the 'osd safe-to-destroy' and 'osd ok-to-stop' checks to the OSD removal job?
Please provide more details about the exact steps needed to test it.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:1372