This bug was initially created as a copy of Bug #2026007.

I am copying this bug because:

An OCS operator update is needed to expose an option to force removal of an OSD if Ceph indicates the OSD is not safe-to-destroy. If https://bugzilla.redhat.com/show_bug.cgi?id=2027396 is approved for 4.9.z, we will also need this for 4.9.z.

Description of problem (please be detailed as possible and provide log snippets):

Use the Ceph 'osd safe-to-destroy' and 'osd ok-to-stop' features in the OSD purge job [1].

[1] mgr: implement 'osd safe-to-destroy' and 'osd ok-to-stop' commands
https://github.com/ceph/ceph/pull/16976

An OSD is safe to destroy if:
- we have osd_stat for it
- osd_stat indicates no PGs are stored on it
- all PGs are known
- no PGs map to it
i.e., overall data durability will not be affected.

An OSD is ok to stop if:
- we have the PG stats we need
- no PGs will drop below min_size
i.e., availability won't be immediately compromised.
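For reference, a minimal sketch of how these two checks can be run for a given OSD ID, assuming a rook-ceph toolbox pod (label app=rook-ceph-tools) is available in the openshift-storage namespace; the variable names below are placeholders:

# Assumption: a toolbox pod labeled app=rook-ceph-tools is running in openshift-storage.
TOOLS_POD=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
OSD_ID=0

# Returns an error (EBUSY/EAGAIN) while PGs are still mapped to the OSD or its stats are
# missing; succeeds only when destroying the OSD will not affect data durability.
oc rsh -n openshift-storage "$TOOLS_POD" ceph osd safe-to-destroy "$OSD_ID"

# Reports whether stopping the OSD would drop any PG below min_size.
oc rsh -n openshift-storage "$TOOLS_POD" ceph osd ok-to-stop "$OSD_ID"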
The OSD removal job now has the following option, as seen in https://github.com/red-hat-storage/rook/pull/313:

--force-osd-removal [true | false]

The OSD removal job template created by the OCS operator needs to expose a variable for this option. Otherwise, the OSD cannot be removed in scenarios where the PGs are never safe-to-destroy, i.e. in small clusters where no other OSD is available to backfill the PGs to.

This can be moved to 4.10 if https://bugzilla.redhat.com/show_bug.cgi?id=2027396 is not approved for 4.9.z.
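For illustration, a sketch of how the removal job would be invoked once the template exposes this option, using the FORCE_OSD_REMOVAL parameter name exercised in the verification steps further down (FAILED_OSD_IDS=0 is just an example):

# Default behaviour: the job keeps retrying 'ceph osd safe-to-destroy' until it succeeds.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -n openshift-storage -f -

# With the new parameter: the job proceeds with removal even though Ceph reports the OSD
# is not safe-to-destroy.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -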
Since this hasn't been looked at or even prioritized, moving this out to ODF 4.11. The change itself should be fairly trivial. Same for the backport, if approved.
Is it just a matter of finding someone to fix it? If so, I'll get someone from Rook to do it in 4.10. The osd removal job will really not be usable in some scenarios without this option exposed in the template.
Bringing it back to 4.10 and adding the devel ack. Travis, is this really a feature? It looks like a bug fix to me. PS: RFEs are not allowed at this point in 4.10.
I'm removing the RFE tag, as this is indeed a proper bug fix.
Agreed, a bug fix rather than RFE, thanks.
Should I add the 'osd safe-to-destroy' and 'osd ok-to-stop' parameters to the osd removal job? Please provide more details about the exact steps needed to test it.
I think you first need to check whether the OSD is safe to destroy, and then pass the flag accordingly in the oc process command.
Please see these instructions for testing: https://docs.google.com/document/d/1WHxEdmwTn1EmrNujjOzBGp6R-IHbwcdsYPkG0nrIW5o/edit
I tested it with vSphere 4.10 dynamic clusters.

Steps:

1. Check the output of the 'ceph osd safe-to-destroy' and 'ceph osd ok-to-stop' commands on osd 0:

sh-4.4$ ceph osd safe-to-destroy 0
Error EBUSY: OSD(s) 0 have 177 pgs currently mapped to them.
sh-4.4$ ceph osd ok-to-stop 0
{"ok_to_stop":true,"osds":[0],"num_ok_pgs":177,"num_not_ok_pgs":0,"ok_become_degraded":["1.0","1.1","1.2","1.3","1.4","1.5","1.6","1.7","1.8","1.9","1.a","1.b","1.c","1.d","1.e","1.f","1.10","1.11","1.12","1.13","1.14","1.15","1.16","1.17","1.18","1.19","1.1a","1.1b","1.1c","1.1d","1.1e","1.1f","2.0","2.1","2.2","2.3","2.4","2.5","2.6","2.7","3.0","3.1","3.2","3.3","3.4","3.5","3.6","3.7","4.0","4.1","4.2","4.3","4.4","4.5","4.6","4.7","5.0","5.1","5.2","5.3","5.4","5.5","5.6","5.7","6.0","6.1","6.2","6.3","6.4","6.5","6.6","6.7","7.0","7.1","7.2","7.3","7.4","7.5","7.6","7.7","8.0","9.0","9.1","9.2","9.3","9.4","9.5","9.6","9.7","9.8","9.9","9.a","9.b","9.c","9.d","9.e","9.f","9.10","9.11","9.12","9.13","9.14","9.15","9.16","9.17","9.18","9.19","9.1a","9.1b","9.1c","9.1d","9.1e","9.1f","10.0","10.1","10.2","10.3","10.4","10.5","10.6","10.7","10.8","10.9","10.a","10.b","10.c","10.d","10.e","10.f","10.10","10.11","10.12","10.13","10.14","10.15","10.16","10.17","10.18","10.19","10.1a","10.1b","10.1c","10.1d","10.1e","10.1f","11.0","11.1","11.2","11.3","11.4","11.5","11.6","11.7","11.8","11.9","11.a","11.b","11.c","11.d","11.e","11.f","11.10","11.11","11.12","11.13","11.14","11.15","11.16","11.17","11.18","11.19","11.1a","11.1b","11.1c","11.1d","11.1e","11.1f"]}
sh-4.4$

2. Delete the disk associated with osd-0 from the vSphere side. The osd-0 pod went to the "CrashLoopBackOff" state:

$ oc get pods | grep osd
rook-ceph-osd-0-6b7957b4bd-5fh4t                          1/2   CrashLoopBackOff   1 (13s ago)   3h17m
rook-ceph-osd-1-68f5f9f9bc-fp7mm                          2/2   Running            0             3h17m
rook-ceph-osd-2-78fcfb4748-drknv                          2/2   Running            0             3h17m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0jsqcp-nhs7w   0/1   Completed          0             3h17m
rook-ceph-osd-prepare-ocs-deviceset-1-data-045v6j-tlr8l   0/1   Completed          0             3h17m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0fwmcx-8dp8b   0/1   Completed          0             3h17m

3. Scale down the osd-0 deployment:

$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-0 scaled

4. Delete the osd-0 pod:

$ oc delete pod rook-ceph-osd-0-6b7957b4bd-5fh4t --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6b7957b4bd-5fh4t" force deleted

5. Check the output of the 'ceph osd safe-to-destroy' and 'ceph osd ok-to-stop' commands on osd 0 again:

sh-4.4$ ceph osd ok-to-stop 0
{"ok_to_stop":true,"osds":[0],"num_ok_pgs":0,"num_not_ok_pgs":0}
sh-4.4$ ceph osd safe-to-destroy 0
Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.

6. Run the osd removal job without the "FORCE_OSD_REMOVAL" param:

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
7. Check the osd removal job and see that it is stuck in the Running state:

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS    RESTARTS   AGE
ocs-osd-removal-job-x8bh6   1/1     Running   0          12m

When looking at the osd removal job logs, I saw this message repeatedly:

2022-03-14 15:20:06.805613 W | cephosd: osd.0 is NOT be ok to destroy, retrying in 1m until success
2022-03-14 15:21:06.806068 D | exec: Running command: ceph osd safe-to-destroy 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json

8. Delete the current osd removal job:

$ oc delete jobs ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

9. Create a new one with the "FORCE_OSD_REMOVAL" param:

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

10. Check the osd removal job and see that it completed successfully:

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-9lq2k   0/1     Completed   0          72s

Also check the osd removal job logs, which show this message:

2022-03-14 15:35:34.459609 I | cephosd: validating status of osd.0
2022-03-14 15:35:34.459626 I | cephosd: osd.0 is marked 'DOWN'
2022-03-14 15:35:34.459645 D | exec: Running command: ceph osd safe-to-destroy 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-03-14 15:35:34.735903 I | cephosd: osd.0 is NOT be ok to destroy but force removal is enabled so proceeding with removal

11. Delete the osd removal job:

$ oc delete job ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

12. Check the osd pods and see the new osd-0 pod running:

$ oc get pods | grep osd
rook-ceph-osd-0-585b9b6f6d-6plp2                          2/2   Running     0   5m13s
rook-ceph-osd-1-68f5f9f9bc-fp7mm                          2/2   Running     0   4h8m
rook-ceph-osd-2-78fcfb4748-drknv                          2/2   Running     0   4h8m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0xv8x8-5fskd   0/1   Completed   0   5m46s
rook-ceph-osd-prepare-ocs-deviceset-1-data-045v6j-tlr8l   0/1   Completed   0   4h8m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0fwmcx-8dp8b   0/1   Completed   0   4h8m

13. Check the PVC state:

$ oc get pvc | grep ocs-deviceset
ocs-deviceset-0-data-0xv8x8   Bound   pvc-18f210c1-dc33-4912-b8c9-7056883dc105   256Gi   RWO   thin   6m20s
ocs-deviceset-1-data-045v6j   Bound   pvc-a17d78f3-812d-494b-9e57-e20b0e62a15d   256Gi   RWO   thin   4h9m
ocs-deviceset-2-data-0fwmcx   Bound   pvc-6734a2f5-5981-462b-b003-952cdfa8e324   256Gi   RWO   thin   4h9m

14. Silence the osd crash warning:

ceph crash archive 2022-03-14T14:50:18.260644Z_dbd5458b-fea0-47b1-94eb-4d3335bb7913

15. Verify that ceph health is OK:

sh-4.4$ ceph health
HEALTH_OK
Additional info:

OCP version:
Client Version: 4.10.0-0.nightly-2022-03-05-023708
Server Version: 4.10.0-0.nightly-2022-03-13-040322
Kubernetes Version: v1.23.3+e419edf

OCS version:
ocs-operator.v4.10.0   OpenShift Container Storage   4.10.0   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-03-13-040322   True        False         4h20m   Cluster version is 4.10.0-0.nightly-2022-03-13-040322

Rook version:
rook: v4.10.0-0.2285b5b9c4a9993456f0b78b7b23a7399ca98731
go: go1.16.12

Ceph version:
ceph version 16.2.7-76.el8cp (f4d6ada772570ae8b05c62ad79e222fbd3f04188) pacific (stable)

Link to Jenkins job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10780/
I think that, based on the results above, we can move this bug to Verified. Please let me know what you think.
Yes, all the steps above look as expected; sounds good to move it to Verified.
According to the conclusions above, I am moving the bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1372