- The steps in 3.4 look good. It should be that way for every platform using Local Storage, i.e. all UPI and bare metal.
- Why is there a different section for every platform? I don't see significant differences, but maybe I missed them: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_nodes/index?lb_target=stage#replacing-failed-storage-nodes-on-bare-metal-infrastructure_rhocs
Clearing NI on me, as info was provided over email
I have tested the doc's steps: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhocs. The steps look fine, but they are still not complete. Here are my suggestions:

- After step 5, add two additional steps (see also the combined sketch after this comment):
  1. Find the PV that needs to be deleted:
     $ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
     local-pv-d6bf175b   100Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw   localblock   2d22h   compute-1
  2. Delete the PV:
     $ oc delete pv local-pv-d6bf175b
- In step 7, change the command from "oc describe localvolumeset localblock" to "oc -n openshift-local-storage describe localvolumeset localblock".
- After step 7, we need to add two more steps:
  1. Delete the ocs-osd-removal job:
     $ oc delete job ocs-osd-removal-${osd_id_to_remove}
  2. rsh to the ceph tools pod and silence the warning about the old OSD crash, caused by this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1896810. I assume we want the ocs-osd-removal job to do that eventually.
- Verification step 1: For me, the PV was in a "Bound" state, not an "Available" state. Maybe we need to mention that it can be in either a "Bound" or an "Available" state.
- Verification step 4: Full recovery of the data can take up to 50 minutes. We may need to mention this in the docs.
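On the two new PV steps: if the writers prefer a single command, here is a sketch that chains the lookup and the delete. The awk/xargs pipeline is my own suggestion based on the output format above, not something from the doc, and it assumes the only Released localblock PVs are the ones you intend to remove:

$ # Find Released PVs in the localblock storage class and delete them in one go
$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released | awk '{print $1}' | xargs -r oc delete pv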
@psurve is right. The recovery of the data depends on the capacity: if each OSD holds 256 MB, recovery will take less time than if each OSD holds 2 TiB, so there is no need to mention a specific time.
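If we want the verification step to stay capacity-independent, one option (my suggestion, assuming the rook-ceph-tools pod is deployed in openshift-storage) is to have the reader watch Ceph health until recovery finishes, rather than quoting a time:

$ # rsh into the ceph tools pod
$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph -s              # shows recovery/backfill progress while data rebuilds
sh-4.4$ ceph health detail   # reports HEALTH_OK once recovery is complete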
> 2. rsh to the ceph tools pod and silence the warning about the old OSD crash, caused by this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1896810. I assume we want the ocs-osd-removal job to do that eventually.

The job will handle that. But for now, we can mention it as a `known issue`.
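For the known-issue note, something like the following manual workaround could be documented until the job handles it. This is only a sketch: it assumes the rook-ceph-tools pod is running in openshift-storage, and <crash-id> is a placeholder for an ID from the crash list:

$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph crash ls                   # list crash entries, including the old OSD's
sh-4.4$ ceph crash archive <crash-id>   # silence the warning for one entry
sh-4.4$ ceph crash archive-all          # or archive all of them at once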
I tested the updated doc. There are still three small things to fix:

1. In step 3, the note reads: "Note: If the rook-ceph-osd pod is in terminating state for more than a few minutes, use the force option to delete the pod. # oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force". Change the command to:
   $ oc delete -n openshift-storage pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force
2. In steps 4 and 10, change the command from "oc delete job ocs-osd-removal-${osd_id_to_remove}" to "oc delete -n openshift-storage job ocs-osd-removal-${osd_id_to_remove}" (see also the quick completion check after this comment).
3. One consistency issue: in step 6, change the size to 1490Gi so the doc is consistent. Step 6 output should be:
   $ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
   local-pv-d6bf175b   1490Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw   localblock   2d22h   compute-1

Other than that, the doc looks good to me.
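One more small thing I used while re-testing, in case the writers want it: a quick check that the ocs-osd-removal job actually completed before deleting it. The grep pattern is my own assumption; adjust it to whatever the job actually logs:

$ # Confirm the removal job finished before cleaning it up
$ oc get job -n openshift-storage ocs-osd-removal-${osd_id_to_remove}
$ oc logs -n openshift-storage -l job-name=ocs-osd-removal-${osd_id_to_remove} | grep -i "completed removal"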
I tested the new doc with a vSphere LSO 4.6 cluster, and the process finished successfully. Additional information about the cluster I used:

OCP version:
  Client Version: 4.6.0-0.nightly-2020-12-06-095114
  Server Version: 4.6.0-0.nightly-2020-12-06-095114
  Kubernetes Version: v1.19.0+7070803

OCS version:
  ocs-operator.v4.6.0-183.ci   OpenShift Container Storage   4.6.0-183.ci   Succeeded

Cluster version:
  NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
  version   4.6.0-0.nightly-2020-12-06-095114   True        False         22h     Cluster version is 4.6.0-0.nightly-2020-12-06-095114

Rook version:
  rook: 4.6-74.92220e58.release_4.6
  go: go1.15.2

Ceph version:
  ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)
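For anyone reproducing this run, the version info above was collected with the usual commands; the pod labels are assumptions on my part and may differ per deployment:

$ oc version                  # client/server/Kubernetes versions
$ oc get clusterversion       # cluster version table
$ oc get csv -n openshift-storage   # OCS operator version
$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name) rook version
$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name) ceph version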