Description of problem (please be detailed as possible and provide log snippets):

While testing failed node replacement, the ocs-osd-removal-job does not complete.

[root@m4204001 ~]# oc -n openshift-storage get pod ocs-osd-removal-job-7k5kv
NAME                        READY   STATUS    RESTARTS   AGE
ocs-osd-removal-job-7k5kv   1/1     Running   0          141m
[root@m4204001 ~]#

Logs:
2022-12-14 08:35:06.698509 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-12-14 08:35:07.167520 I | cephosd: validating status of osd.1
2022-12-14 08:35:07.167559 I | cephosd: osd.1 is marked 'DOWN'
2022-12-14 08:35:07.167581 D | exec: Running command: ceph osd find 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-12-14 08:35:07.953915 I | cephosd: marking osd.1 out
2022-12-14 08:35:07.953971 D | exec: Running command: ceph osd out osd.1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-12-14 08:35:08.689277 D | exec: Running command: ceph osd safe-to-destroy 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2022-12-14 08:35:09.214305 W | cephosd: osd.1 is NOT be ok to destroy, retrying in 1m until success

Version of all relevant components (if applicable):
ODF 4.12.0-140

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue reproducible?
Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy ODF 4.12 on OCP 4.12
2. Fail one of the ODF nodes
3. Follow the node replacement steps

Actual results:
ocs-osd-removal-job stays in Running status for more than an hour.

Expected results:
ocs-osd-removal-job gets completed.

Additional info:
Must gather logs: https://drive.google.com/file/d/1GQI6vs7L0pt9D_Na1m8uSM93DQHdMFcE/view?usp=sharing
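For anyone triaging a similar hang, a minimal diagnostic sketch (assumes the default openshift-storage namespace and that the rook-ceph-tools deployment is enabled; the label selectors and deployment name are the usual ODF ones, not taken from this report):

```
# Check the removal job pod and its most recent log lines
oc -n openshift-storage get pod -l job-name=ocs-osd-removal-job
oc -n openshift-storage logs -l job-name=ocs-osd-removal-job --tail=20

# Ask Ceph directly whether the failed OSD is safe to destroy
# (osd id 1 matches the logs above; run via the rook-ceph-tools deployment)
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd safe-to-destroy 1
```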
When removing an OSD, Rook queries Ceph to check whether the OSD is safe to destroy:

2022-12-14 08:35:09.214305 W | cephosd: osd.1 is NOT be ok to destroy, retrying in 1m until success

Since the OSD is not safe to destroy, Rook waits and keeps checking indefinitely until it is. An OSD is safe to destroy once all of its PGs have been moved to other OSDs in the cluster and are active+clean, which means every replica of the data is stored safely elsewhere and there is no risk of data loss.

If there are no spare OSDs the PGs can be moved to, the OSD will never become safe to destroy. For example, with 3 OSDs in the cluster and pools using replica 3, a lost OSD will never be safe to destroy.

If you need to remove an OSD even though it is not safe to destroy, you must force the removal by setting the FORCE_OSD_REMOVAL flag on the removal template. See the OSD removal documentation for more details.
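To see whether the cluster can ever reach a state where the OSD becomes safe to destroy, a few standard Ceph commands run from the toolbox pod show the OSD layout, PG state, and pool replica size (a sketch, not taken from this report):

```
# From the rook-ceph toolbox pod
ceph osd tree              # how many OSDs exist, and which are down/out
ceph pg stat               # are all PGs active+clean, or stuck undersized/degraded?
ceph osd pool ls detail    # pool replica size (e.g. "replicated size 3")
ceph osd safe-to-destroy 1 # keeps returning an error until the PGs have recovered
```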
Hi,

You need to add the FORCE_OSD_REMOVAL flag to the removal job:

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -

Steps:
1. Delete the old ocs-osd-removal job:
$ oc delete job ocs-osd-removal -n openshift-storage
2. Run the ocs-osd-removal job with the force flag:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
3. Verify the job moved to the Completed state:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage

You can check from the tools pod whether the force flag is needed:
```
sh-4.4$ ceph osd ok-to-stop 0
{"ok_to_stop":true,"osds":[0],"num_ok_pgs":0,"num_not_ok_pgs":0}
sh-4.4$ ceph osd safe-to-destroy 0
Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
```
If "ceph osd safe-to-destroy <osd-id>" returns an error, the force flag is needed; otherwise it is not. A combined sketch of this check and the forced removal is shown below.

I opened a doc BZ to add the `-p` string to the command: https://bugzilla.redhat.com/show_bug.cgi?id=2139406
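A minimal shell sketch of the combined check-and-remove flow (not an official script; it assumes the rook-ceph-tools deployment is available, and `osd_id_to_remove` is a hypothetical placeholder you must set to the failed OSD's id):

```
#!/bin/bash
set -euo pipefail

osd_id_to_remove=1   # hypothetical: replace with the id of the failed OSD
ns=openshift-storage

# If safe-to-destroy fails, the force flag is required (per the check above)
if oc -n "$ns" rsh deploy/rook-ceph-tools ceph osd safe-to-destroy "$osd_id_to_remove"; then
  force=false
else
  force=true
fi

# Remove any previous removal job, then re-create it with the chosen flag
oc -n "$ns" delete job ocs-osd-removal --ignore-not-found
oc -n "$ns" process ocs-osd-removal \
  -p FAILED_OSD_IDS="$osd_id_to_remove" -p FORCE_OSD_REMOVAL="$force" \
  | oc -n "$ns" create -f -

# Watch for the job pod to reach the Completed state
oc -n "$ns" get pod -l job-name=ocs-osd-removal-job
```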
If this is not resolved by the force flag, please reopen the issue.