Bug 1892604

Summary: [Doc RFE] Update the replacing operational and failed devices procedure based on LSO UI replace enhancements
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Anjana Suparna Sriram <asriram>
Component: documentation
Assignee: Olive Lakra <olakra>
Status: CLOSED CURRENTRELEASE
QA Contact: Itzhak <ikave>
Severity: unspecified
Priority: unspecified
Version: 4.6
CC: ikave, nberry, ocs-bugs, olakra, prsurve, rohgupta, rojoseph, sabose, sdudhgao
Keywords: FutureFeature
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-08-25 14:55:08 UTC
Type: Bug
Bug Blocks: 1880905

Comment 8 Rohan CJ 2020-11-30 11:35:30 UTC
- The steps in 3.4 look good. It should work that way for every platform using local storage, i.e. all UPI and bare-metal deployments.
- Why is there a different section for every platform? I don't see significant differences, but maybe I missed them.

https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_nodes/index?lb_target=stage#replacing-failed-storage-nodes-on-bare-metal-infrastructure_rhocs

Comment 9 Sahina Bose 2020-12-01 07:38:23 UTC
Clearing NI on me, as info was provided over email

Comment 11 Itzhak 2020-12-03 10:52:10 UTC
I have tested the doc's steps at https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhocs. The steps look fine, but they are still not complete.

Here are my suggestions:

- After step '5':
Add 2 additional steps:
1. Find the PV that needs to be deleted with the command:
$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-d6bf175b                          100Gi      RWO            Delete           Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw                   localblock                             2d22h   compute-1
2. Delete the PV:
$ oc delete pv local-pv-d6bf175b

- In step '7':
Change the command from "oc describe localvolumeset localblock" to "oc -n openshift-local-storage describe localvolumeset localblock"

- After step '7' we need to add 2 more steps:
1. Delete the ocs-osd-removal job:
$ oc delete job ocs-osd-removal-${osd_id_to_remove}

2. We need to rsh into the ceph tools pod and silence the warning about the old OSD crash, due to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1896810. I assume we want the ocs-osd-removal job to do that (a minimal manual workaround is sketched below).
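For reference, a minimal sketch of how the warning can be silenced manually today, assuming the standard rook-ceph-tools pod in the openshift-storage namespace (the crash ID is illustrative):

$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
# ceph crash ls                    # list the crash entries reported for the old OSD
# ceph crash archive <crash_id>    # archive a single crash entry to silence the warning
# ceph crash archive-all           # or archive all crash entries at once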


- Verification step '1':
For me, the new PV was in a "Bound" state and not in an "Available" state. Maybe we need to mention that it can be in either a "Bound" or an "Available" state (it can be checked as shown below).
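For example, the state of the newly created PV can be checked in the STATUS column of the same PV listing used earlier in the doc; it should show either "Available" or "Bound":

$ oc get pv -L kubernetes.io/hostname | grep localblock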

- Verification step '4':
The full recovery of the data can take up to 50 minutes. We may need to mention that in the docs (one way to watch the recovery is sketched below).
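A minimal sketch of watching the recovery progress, assuming the same rook-ceph-tools pod as above:

$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
# ceph -s    # shows the recovery/rebalance progress until the cluster returns to HEALTH_OK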

Comment 14 Itzhak 2020-12-06 10:13:33 UTC
@psurve is right. The recovery of the data depends on the capacity: if we have 256 MB per OSD, it will take less time than if we have 2Ti per OSD, so there is no need to mention a specific time.

Comment 15 Servesha 2020-12-07 06:28:33 UTC
> 
> 2. We need to rsh into the ceph tools pod and silence the warning about
> the old OSD crash, due to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1896810.
> I assume we want the ocs-osd-removal job to do that.
> 

The job will handle that. But for now, we can mention it as a `known issue`.

Comment 17 Itzhak 2020-12-07 13:53:16 UTC
I tested the updated doc.
There are still 3 tiny things to fix:

1. In step 3: "Note If the rook-ceph-osd pod is in terminating state for more than a few minutes, use the force option to delete the pod. # oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force"
We need to change the command "oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force" to
"oc delete -n openshift-storage pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force".

2. In steps '4' and '10', change the command from "oc delete job ocs-osd-removal-${osd_id_to_remove}" to
"oc delete -n openshift-storage job ocs-osd-removal-${osd_id_to_remove}".

3. One minor consistency issue: in step 6, change the size to 1490Gi so the doc stays consistent.
The step 6 output should then be:
$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

local-pv-d6bf175b           1490Gi       RWO         Delete          Released            openshift-storage/ocs-deviceset-0-data-0-6c5pw      localblock      2d22h       compute-1


Other than that, the doc looks good to me.

Comment 19 Itzhak 2020-12-07 15:51:59 UTC
I tested the new doc with a vSphere LSO 4.6 cluster, and the process finished successfully.

Additional information about the cluster I used:

OCP version:
Client Version: 4.6.0-0.nightly-2020-12-06-095114
Server Version: 4.6.0-0.nightly-2020-12-06-095114
Kubernetes Version: v1.19.0+7070803

OCS version:
ocs-operator.v4.6.0-183.ci   OpenShift Container Storage   4.6.0-183.ci              Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-06-095114   True        False         22h     Cluster version is 4.6.0-0.nightly-2020-12-06-095114

Rook version:
rook: 4.6-74.92220e58.release_4.6
go: go1.15.2

Ceph version:
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)