I tested multiple OSD replacements on a VMware LSO cluster. In Section 2.4.1, step 18, we need to add the FORCE_OSD_REMOVAL flag, for example:

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0,1 FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -

https://dxp-docp-prod.apps.ext-waf.spoke.prod.us-west-2.aws.paas.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_local_storage_devices#replacing_storage_nodes_on_vmware_infrastructure

Setup:
OCP Version: 4.11.0-0.nightly-2022-07-16-020951
ODF Version: 4.11.0-113
LSO Version: local-storage-operator.4.11.0-202207121147

Test process:

$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-7f557d75d-xggxv    2/2     Running   0          78m   10.129.2.22   compute-0   <none>           <none>
rook-ceph-osd-1-759bb46bc6-wth4l   2/2     Running   0          78m   10.128.2.45   compute-1   <none>           <none>
rook-ceph-osd-2-5bb4c984c7-zzm57   2/2     Running   0          78m   10.131.0.32   compute-2   <none>           <none>

Delete OSDs 0 and 1:

$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME                               READY   STATUS             RESTARTS      AGE   IP            NODE        NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-7f557d75d-xggxv    1/2     CrashLoopBackOff   2 (6s ago)    82m   10.129.2.22   compute-0   <none>           <none>
rook-ceph-osd-1-759bb46bc6-wth4l   1/2     CrashLoopBackOff   1 (18s ago)   82m   10.128.2.45   compute-1   <none>           <none>
rook-ceph-osd-2-5bb4c984c7-zzm57   2/2     Running            0             82m   10.131.0.32   compute-2   <none>           <none>

$ oc scale -n openshift-storage deployment rook-ceph-osd-0 --replicas=0
deployment.apps/rook-ceph-osd-0 scaled
$ oc scale -n openshift-storage deployment rook-ceph-osd-1 --replicas=0
deployment.apps/rook-ceph-osd-1 scaled

$ oc get -n openshift-storage pods -l ceph-osd-id=0
NAME                              READY   STATUS        RESTARTS   AGE
rook-ceph-osd-0-7f557d75d-xggxv   0/2     Terminating   4          84m
$ oc get -n openshift-storage pods -l ceph-osd-id=1
NAME                               READY   STATUS        RESTARTS   AGE
rook-ceph-osd-1-759bb46bc6-wth4l   0/2     Terminating   3          84m

$ oc delete -n openshift-storage pod rook-ceph-osd-0-7f557d75d-xggxv --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-7f557d75d-xggxv" force deleted
$ oc delete -n openshift-storage pod rook-ceph-osd-1-759bb46bc6-wth4l --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-1-759bb46bc6-wth4l" force deleted

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0,1 FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-dfdmr   0/1     Completed   0          92s

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
2022-07-18 13:43:19.207129 I | cephosd: completed removal of OSD 0
2022-07-18 13:43:25.063755 I | cephosd: completed removal of OSD 1

$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-5c3e0bc0   100Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-localblock-0-data-0svf5w   localblock   93m   compute-0
local-pv-a76136ea   100Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-localblock-0-data-29plq4   localblock   93m   compute-1

$ oc delete pv local-pv-5c3e0bc0
persistentvolume "local-pv-5c3e0bc0" deleted
$ oc delete pv local-pv-a76136ea
persistentvolume "local-pv-a76136ea" deleted

$ oc get pods | grep osd
rook-ceph-osd-0-6dbb477cc7-tggf9                               2/2   Running     0   80s
rook-ceph-osd-1-6d69d57d84-qfc5n                               2/2   Running     0   79s
rook-ceph-osd-2-5bb4c984c7-zzm57                               2/2   Running     0   94m
rook-ceph-osd-prepare-007401b8286106910c461cc5d73d9687-rvwt8   0/1   Completed   0   6m47s
rook-ceph-osd-prepare-e6e56ed144a9c3ed7d6873038aa03aee-6cs4b   0/1   Completed   0   6m46s
rook-ceph-osd-prepare-f2a2313c490e4d5c1d127f4f5c4e8141-5fkpq   0/1   Completed   0   95m
I tested the node replacement procedure on an LSO cluster (VMware) with cluster-wide encryption. We can add a new step, "Add a new disk to the new worker node", before step 16 ("Verify that the new localblock PV is available."): https://docs.google.com/document/d/1m720IElmcnqLMW_iNcSrx75tatFT7X2yMhYB7NZxhXc/edit
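A minimal sketch of how that proposed step could be verified, assuming the Local Storage Operator discovers the new disk automatically once it is attached to the replacement worker node in vSphere (the grep on Available mirrors the Released check used in the transcript above):

# after attaching the new disk to the new worker node, wait until the
# Local Storage Operator creates a new localblock PV in Available state
$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Available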
https://dxp-docp-prod.apps.ext-waf.spoke.prod.us-west-2.aws.paas.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/replacing_nodes/index?lb_target=preview#replacing-an-operational-node-using-local-storage-devices_vmware-upi-operational

Bug fixed:
1. Added the FORCE_OSD_REMOVAL flag, for example:
   $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1 FORCE_OSD_REMOVAL=true | oc create -f -
2. Fixed the <failed_osd_id> string.
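For reference, a sketch of how the fixed command reads once an actual OSD ID is substituted for the <failed_osd_id> placeholder (the ID and the extra -n flag on the create side follow the test run above and are illustrative):

# illustrative only: replace 1 with the ID(s) of the failed OSD(s)
$ failed_osd_id=1
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${failed_osd_id} FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -

# confirm the removal completed before proceeding with the node replacement
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'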