Bug 1895796

Summary: Update node replacement procedure for local storage devices for local volume set changes, upgraded cluster scenario, OCS 4.6 job update
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Kusuma <kbg>
Component: documentation
Assignee: Laura Bailey <lbailey>
Status: CLOSED CURRENTRELEASE
QA Contact: Pratik Surve <prsurve>
Severity: unspecified
Priority: unspecified
Version: 4.6
CC: asriram, ebenahar, ikave, lbailey, nberry, ocs-bugs, olakra, prsurve, rohgupta, rojoseph, sabose, sdudhgao
Target Milestone: ---
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2021-08-25 14:55:03 UTC
Type: Bug
Regression: ---
Bug Blocks: 1882363

Comment 9 Servesha 2020-11-18 07:00:46 UTC
@kusuma, there will be modifications to steps 13 and 14 to account for the new LSO changes and the OCS 4.6 OSD removal job, respectively. I have commented in the doc. Let me know if you have any doubts!

Comment 11 Servesha 2020-11-18 11:44:26 UTC
@Laura ack. thanks

Comment 20 Itzhak 2020-12-03 17:55:49 UTC
I haven't been able to test the steps from the doc on a cluster yet,
but from what I remember from trying the node replacement procedure, I have a few comments about the doc:

1. I think we need to run the ocs-osd-removal job before deleting the PV (steps 15 and 16).
After the ocs-osd-removal job completes, the PV will be in the 'Released' state, and then we can delete it safely.

2. In step 19, there is no need to delete "rook-ceph-operator" in 4.6.
Maybe we can write something like what is in the device replacement doc https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhocs:
"If the new OSD does not show as Running after a few minutes, restart the rook-ceph-operator pod to force a reconciliation." (see the sketch at the end of this comment)


Other than these two comments, the doc looks good to me.
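
For reference, a rough sketch of what that check could look like on the CLI (just a sketch; it assumes the default openshift-storage namespace and the usual app= labels, so adjust to whatever the doc already uses):

# Check whether the new OSD pod reaches the Running state
oc get pods -n openshift-storage -l app=rook-ceph-osd

# If it is still not Running after a few minutes, restart the rook-ceph-operator
# pod to force a reconciliation (its Deployment recreates the pod automatically)
oc delete pod -n openshift-storage -l app=rook-ceph-operator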

Comment 21 Rohan CJ 2020-12-04 06:34:07 UTC
> After executing the ocs-osd-removal job, the pv will be in status 'Released', and then we can delete it safely.

Also, the PV should eventually be deleted after it is released, since the ReclaimPolicy on the LSO storage class is "Delete".
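
If it helps, a quick way to verify that (a sketch; "localblock" is only an assumed name for the LSO storage class, use the actual one):

# Show the reclaim policy of the LSO storage class
oc get storageclass localblock -o jsonpath='{.reclaimPolicy}{"\n"}'

# List the PVs of that storage class together with their status and reclaim policy
oc get pv | grep localblock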

Comment 22 Rohan CJ 2020-12-04 06:35:54 UTC
> 1. I think we need to run the ocs-osd-removal job before deleting the pv(steps 15, 16). 
> After executing the ocs-osd-removal job, the pv will be in status 'Released', and then we can delete it safely.


+1

Comment 24 Rohan CJ 2020-12-04 07:55:18 UTC
> Asking Rohan in chat now whether his comment 21 means I should remove the step "Delete the PV associated with the failed node", or just add a sentence about the ReclaimPolicy meaning that the PV will eventually be deleted automatically.

I made a mistake. The PV will not get cleaned up on failed nodes.

Comment 31 Itzhak 2020-12-07 19:21:05 UTC
I tested section 3.1.1 of the doc; I haven't tested the other sections yet.

There are two things we may need to fix in the doc:

1. In step 18.1 - the ocs-osd-removal job deletes the PVC, so we can't get the PVC after executing the ocs-osd-removal job.
Instead, we need to perform these steps (see the sketch after this list):
- Identify the PV from the PVC (and don't delete the PV or the PVC yet).
- Execute the ocs-osd-removal job.
- Delete the PV.

2. In step 20 - I don't think we need to delete the rook-ceph-operator in 4.6.
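
To make the ordering concrete, something along these lines (only a sketch: the PVC/PV names and OSD ID are placeholders, the openshift-storage namespace is assumed, and the ocs-osd-removal template with the FAILED_OSD_IDS parameter is the 4.6 removal job as I understand it):

# 1. Record the PV bound to the OSD PVC before running the removal job
oc get pvc <osd-pvc-name> -n openshift-storage -o jsonpath='{.spec.volumeName}{"\n"}'

# 2. Run the OSD removal job for the failed OSD
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed-osd-id> | oc create -n openshift-storage -f -

# 3. After the job completes, the PV should show as Released; then delete it
oc get pv <pv-name>
oc delete pv <pv-name>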



One more note: one of the mons was in a Pending state for a short time, and then went back to the "Running" state.
Other than that, the doc looks good to me.
Ceph health went back to OK at the end.
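
For reference, those states can be checked with something like this (a sketch; it assumes the rook-ceph-tools toolbox deployment exists in openshift-storage):

# Watch the mon pods settle back into the Running state
oc get pods -n openshift-storage -l app=rook-ceph-mon

# Confirm Ceph health from the toolbox pod
oc rsh -n openshift-storage deploy/rook-ceph-tools ceph status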