Bug 2118344

Summary: Documentation for replacing nodes on IBM Z is incomplete
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: tstober
Component: documentation
Assignee: Melanie Manley <mmanley>
Status: CLOSED CURRENTRELEASE
QA Contact: Neha Berry <nberry>
Severity: medium
Docs Contact: Olive Lakra <olakra>
Priority: unspecified
Version: 4.11
CC: asriram, ebenahar, ocs-bugs, odf-bz-bot, olakra
Target Milestone: ---   
Target Release: ---   
Hardware: s390x   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-09-08 14:07:58 UTC
Type: Bug

Description tstober 2022-08-15 14:32:06 UTC
Describe the issue:
Documentation for replacing nodes on IBM Z is incomplete

Describe the task you were trying to accomplish:
Steps to reset Ceph are missing.

Suggestions for improvement:

Document URL:

Chapter/Section Number and Title:
2.2.1

Product Version:
4.11

Environment Details:
IBM Z

Any other versions of this document that also needs this update:

Additional information:

For this section, the documentation for IBM Z is incomplete.
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_local_storage_devices#replacing-operational-nodes-on-ibmz-infrastructure_ibm-z

It should include instructions for cleaning up Ceph similar to those provided for the bare metal infrastructure section (2.2.1):
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_local_storage_devices#replacing-failed-storage-nodes-on-ibm-power-infrastructure_ibm-power

In particular:
Steps 1-6 (2.2.1), as described in the bare metal section, are missing and need to be added.
Step 7 would be called “Get a new zSystem storage node as replacement”.
After step 7, add CSR approval as described in steps 9-10 (2.2.1); see the sketch after this list.
Steps 12-19 (2.2.1) need to be added as well, in order to cleanly remove the OSD from ODF.
There should also be a troubleshooting section, especially for step 18 (2.2.1), to verify that the ocs-osd-removal-job pod worked correctly (also see the sketch after this list). It may be necessary to manually clean up the removed OSD (e.g. ID 2) as follows:
	ceph osd crush remove osd.REMOVED_OSD_ID
	ceph osd rm REMOVED_OSD_ID
	ceph auth del osd.REMOVED_OSD_ID
	ceph osd crush rm REMOVED_NODE
ODF should now be able to replace the node; verify via ceph status and the rook-ceph-osd-prepare pod.
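
As a reference for the CSR approval and the removal-job verification above, here is a minimal sketch of what these commands could look like (the openshift-storage namespace, the job-name=ocs-osd-removal-job label and the <...> placeholders are assumptions, adjust to your cluster):
	# Approve the pending CSRs for the new node (steps 9-10 in 2.2.1):
	oc get csr
	oc adm certificate approve <csr_name>

	# Check that the OSD removal job finished cleanly (step 18 in 2.2.1):
	oc get pods -n openshift-storage | grep ocs-osd-removal
	oc logs -n openshift-storage -l job-name=ocs-osd-removal-job

	# Verify the replacement from the cluster side:
	oc get pods -n openshift-storage | grep rook-ceph-osd-prepare
	oc rsh -n openshift-storage <rook-ceph-tools_pod> ceph status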
Hint: You can speed up rebalancing after adding the replacement node with the following Ceph commands; please make sure to return them to their default values on a production cluster (a sketch for restoring the defaults follows the commands):
ceph tell 'osd.*' injectargs --osd-max-backfills=16 --osd-recovery-max-active=4
ceph tell 'osd.*' config set osd_recovery_sleep_hdd 0
ceph tell 'osd.*' config set osd_recovery_sleep_ssd 0
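
For the hint above, a minimal sketch of how to check and later restore these settings (run from the rook-ceph toolbox; the default values shown are assumptions, verify them with ceph config show on your Ceph release):
	# Check the values currently in effect on one OSD:
	ceph config show osd.0 osd_max_backfills
	ceph config show osd.0 osd_recovery_max_active
	ceph config show osd.0 osd_recovery_sleep_hdd

	# Once rebalancing has finished, return to the defaults (commonly 1 / 3 / 0.1 / 0, assumed here):
	ceph tell 'osd.*' injectargs --osd-max-backfills=1 --osd-recovery-max-active=3
	ceph tell 'osd.*' config set osd_recovery_sleep_hdd 0.1
	ceph tell 'osd.*' config set osd_recovery_sleep_ssd 0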

Comment 4 tstober 2022-08-23 12:08:28 UTC
Manuel and I have verified the content. Looks good, thanks