Bug 1895796 - Update node replacement procedure for local storage devices for local volume set changes, upgraded cluster scenario, OCS 4.6 job update
Summary: Update node replacement procedure for local storage devices for local volume set changes, upgraded cluster scenario, OCS 4.6 job update
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: documentation
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Laura Bailey
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On:
Blocks: 1882363
 
Reported: 2020-11-09 05:38 UTC by Kusuma
Modified: 2021-08-25 14:55 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-25 14:55:03 UTC
Embargoed:



Comment 9 Servesha 2020-11-18 07:00:46 UTC
@kusuma, there will be modifications in steps 13 and 14 according to the new LSO changes and the OCS 4.6 OSD removal job, respectively. I have commented in the doc. Let me know if you have any doubts!
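
(For reference, and assuming I have the 4.6 syntax right: the removal job in 4.6 is created from the ocs-osd-removal template with the FAILED_OSD_IDS parameter, along the lines of

  $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed-osd-id> | oc create -f -

where <failed-osd-id> is a placeholder for the ID of the OSD on the node being replaced.)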

Comment 11 Servesha 2020-11-18 11:44:26 UTC
@Laura ack. thanks

Comment 20 Itzhak 2020-12-03 17:55:49 UTC
I didn't get to test the steps from the doc on a cluster yet,
but from what I remember from trying the node replacement procedure, I have a few comments about the doc:

1. I think we need to run the ocs-osd-removal job before deleting the PV (steps 15 and 16).
After the ocs-osd-removal job completes, the PV will be in the 'Released' state, and then we can delete it safely (see the commands at the end of this comment).

2. In step 19, there is no need to delete "rook-ceph-operator" in 4.6.
Maybe we can write something like the device replacement doc https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhocs does:
"If the new OSD does not show as Running after a few minutes, restart the rook-ceph-operator pod to force a reconciliation."


Other than these 2 comments, the doc looks good to me.
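
To spell out what I mean for points 1 and 2 (template, parameter, and label names as I recall them from the 4.6 docs; the values in angle brackets are placeholders):

  # run the removal job first
  $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed-osd-id> | oc create -f -
  # once the job completes, the PV backing that OSD should show as Released
  $ oc get pv <pv-name>
  # only then delete it
  $ oc delete pv <pv-name>

and, if the new OSD stays out of Running, force a reconciliation with:

  $ oc delete pod -n openshift-storage -l app=rook-ceph-operator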

Comment 21 Rohan CJ 2020-12-04 06:34:07 UTC
> After the ocs-osd-removal job completes, the PV will be in the 'Released' state, and then we can delete it safely.

Also, the PV should eventually be deleted after it is released, as the ReclaimPolicy on the LSO storage class is "Delete".
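
(A quick way to confirm the policy, assuming the LSO storage class is named 'localblock' - the name varies per deployment:

  $ oc get storageclass localblock -o jsonpath='{.reclaimPolicy}'

which should print 'Delete'.)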

Comment 22 Rohan CJ 2020-12-04 06:35:54 UTC
> 1. I think we need to run the ocs-osd-removal job before deleting the PV (steps 15 and 16).
> After the ocs-osd-removal job completes, the PV will be in the 'Released' state, and then we can delete it safely.


+1

Comment 24 Rohan CJ 2020-12-04 07:55:18 UTC
> Asking Rohan in chat now whether his comment 21 means I should delete "Delete the PV associated with the failed node" or just add a sentence about the ReclaimPolicy meaning that the PV will eventually be deleted automatically.

I made a mistake. The PV will not get cleaned up on failed nodes.

Comment 31 Itzhak 2020-12-07 19:21:05 UTC
I tested doc section 3.1.1; I didn't test the other sections yet.

There are 2 things we may need to fix in the doc:

1. In step 18.1 - the ocs-osd-removal job deletes the PVC, so we can't get the PVC after executing the ocs-osd-removal job.
Instead, we need to perform these steps (sketched after this list):
- Figure out the PV from the PVC (and don't delete the PV or the PVC yet).
- Execute the ocs-osd-removal job.
- Delete the PV.

2. In step 20 - I don't think we need to delete the rook-ceph-operator in 4.6.
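
A rough sketch of that order (the PVC/PV names and OSD ID in angle brackets are placeholders; template and parameter names as I recall them from the 4.6 docs):

  # record the PV bound to the OSD's PVC, but don't delete the PV or the PVC yet
  $ oc get pvc <osd-pvc-name> -n openshift-storage -o jsonpath='{.spec.volumeName}'
  # run the removal job (this is what deletes the PVC)
  $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed-osd-id> | oc create -f -
  # now delete the PV recorded above
  $ oc delete pv <pv-name>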



Also, another note: one of the mons was in a Pending state for a short time, and then went back to the "Running" state.
Other than that, the doc looks good to me.
The Ceph health was back to OK at the end.

