Bug 1892604 - [Doc RFE] Update the replacing operational and failed devices procedure based on LSO UI replace enhancements
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: documentation
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Olive Lakra
QA Contact: Itzhak
URL:
Whiteboard:
Depends On:
Blocks: 1880905
 
Reported: 2020-10-29 10:15 UTC by Anjana Suparna Sriram
Modified: 2021-08-25 14:55 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-25 14:55:08 UTC
Embargoed:



Comment 8 Rohan CJ 2020-11-30 11:35:30 UTC
- The steps in 3.4 look good. It should be that way for every platform using local storage, i.e. all UPI and bare metal.
- Why is there a different section for every platform? I don't see significant differences, but maybe I missed them

https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_nodes/index?lb_target=stage#replacing-failed-storage-nodes-on-bare-metal-infrastructure_rhocs

Comment 9 Sahina Bose 2020-12-01 07:38:23 UTC
Clearing the needinfo on me, as the info was provided over email.

Comment 11 Itzhak 2020-12-03 10:52:10 UTC
I have tested the doc's steps at https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhocs. The steps look fine, but they are still not complete.

Here are my suggestions:

- After step '5':
Add 2 additional steps (a combined sketch follows below):
1. Find the PV that needs to be deleted with the command:
$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-d6bf175b                          100Gi      RWO            Delete           Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw                   localblock                             2d22h   compute-1
2. Delete the PV:
$ oc delete pv local-pv-d6bf175b
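
A minimal shell sketch that combines those two steps, assuming every Released PV in the localblock storage class belongs to the replaced device and can be removed (verify the list from the first command before deleting anything):

$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
$ # delete every Released localblock PV found above
$ oc get pv | grep localblock | grep Released | awk '{print $1}' | xargs -r oc delete pv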

- In step '7':
Change the command from "oc describe localvolumeset localblock" to "oc -n openshift-local-storage describe localvolumeset localblock"
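
As a quicker check than the full describe output, a hedged alternative (assuming the LocalVolumeSet is named localblock, as in the doc) is to pull the object from the same namespace and inspect its status:

$ # check the status section of the output to confirm the new device was discovered and provisioned
$ oc -n openshift-local-storage get localvolumeset localblock -o yaml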

- After step '7' we need to add 2 more steps:
1. Delete the ocs-osd-removal job:
$ oc delete job ocs-osd-removal-${osd_id_to_remove}

2. rsh to the ceph tools pod and silence the warning about the old OSD crash, which is tracked in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1896810 (see the sketch after this list). I assume we want the ocs-osd-removal job to do that.
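
A minimal sketch of that crash-silencing step, assuming the standard rook-ceph-tools pod label; list the crashes first and archive only the one belonging to the removed OSD if archiving everything is too broad:

$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph crash ls                  # list recorded crashes; the removed OSD should appear here
sh-4.4$ ceph crash archive <crash-id>  # or: ceph crash archive-all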


- Verification steps '1':
For me, it was in a "Bound" state and not in an "Available" state. Maybe we need to mention that it can be in either a "Bound" or an "Available" state (see the quick check below).
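
A quick way to check the state of the new PV (a sketch, reusing the localblock filter from the earlier steps):

$ oc get pv -L kubernetes.io/hostname | grep localblock
The STATUS column of the newly created PV may show either Available or Bound.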

- Verification steps '4':
The full recovery of the data can take up to 50 minutes. We may need to mention that in the docs (a sketch for watching the recovery follows below).
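
A minimal sketch for watching the recovery from the ceph tools pod, assuming the standard rook-ceph-tools pod label; recovery is complete when ceph status reports all PGs as active+clean:

$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph status          # recovery/rebalance progress is shown while it is running
sh-4.4$ ceph health detail   # any remaining warnings, e.g. degraded or misplaced objects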

Comment 14 Itzhak 2020-12-06 10:13:33 UTC
@psurve is right. The recovery of the data depends on the capacity. If each OSD has 256 MB, it will take less time than if each OSD has 2Ti, so there is no need to mention a specific time.

Comment 15 Servesha 2020-12-07 06:28:33 UTC
> 
> 2. Need to rsh to ceph tools pod and silence the warning of the old osd
> crash due to this BZ https://bugzilla.redhat.com/show_bug.cgi?id=1896810. I
> assume we want the ocs-osd-removal job will do that
> 

The job will handle that. But for now, we can mention it as a `known issue`.

Comment 17 Itzhak 2020-12-07 13:53:16 UTC
I tested the updated doc. 
There are still 3 small things to fix (the corrected commands from points 1 and 2 are collected in a sketch after the list):

1. In step 3: "Note If the rook-ceph-osd pod is in terminating state for more than a few minutes, use the force option to delete the pod. # oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force"
The command "oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force" needs to be changed to
"oc delete -n openshift-storage pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force"

2. In steps '4' and '10' change the command from "oc delete job ocs-osd-removal-${osd_id_to_remove}" to 
"oc delete -n openshift-storage job ocs-osd-removal-${osd_id_to_remove}"

3. One consistency issue: in step 6, change the size to 1490Gi so the doc stays consistent.
The step 6 output should then be:
$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

local-pv-d6bf175b           1490Gi       RWO         Delete          Released            openshift-storage/ocs-deviceset-0-data-0-6c5pw      localblock      2d22h       compute-1
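
A minimal sketch collecting the corrected, fully namespaced commands from points 1 and 2, using the example pod name and the ${osd_id_to_remove} variable from the doc:

$ # force-delete the stuck OSD pod in the openshift-storage namespace (example pod name from the doc)
$ oc delete -n openshift-storage pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force
$ # delete the completed removal job in the same namespace
$ oc delete -n openshift-storage job ocs-osd-removal-${osd_id_to_remove}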


Other than that, the doc looks good to me.

Comment 19 Itzhak 2020-12-07 15:51:59 UTC
I tested the new doc with a vSphere LSO 4.6 cluster, and the process finished successfully.

Additional information about the cluster I used:

OCP version:
Client Version: 4.6.0-0.nightly-2020-12-06-095114
Server Version: 4.6.0-0.nightly-2020-12-06-095114
Kubernetes Version: v1.19.0+7070803

OCS version:
ocs-operator.v4.6.0-183.ci   OpenShift Container Storage   4.6.0-183.ci              Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-06-095114   True        False         22h     Cluster version is 4.6.0-0.nightly-2020-12-06-095114

Rook version
rook: 4.6-74.92220e58.release_4.6
go: go1.15.2

Ceph version
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)
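
For reference, a sketch of the commands that would typically produce the version info above (the rook and ceph versions assume the standard operator and tools pod labels):

$ oc version
$ oc get csv -n openshift-storage     # OCS operator version
$ oc get clusterversion               # cluster version
$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name) rook version
$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name) ceph version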

