Bug 2118344

Summary: Documentation for replacing nodes on IBM Z is incomplete
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: tstober
Component: documentation
Assignee: Melanie Manley <mmanley>
Status: CLOSED CURRENTRELEASE
QA Contact: Neha Berry <nberry>
Severity: medium
Docs Contact: Olive Lakra <olakra>
Priority: unspecified
Version: 4.11
CC: asriram, ebenahar, ocs-bugs, odf-bz-bot, olakra
Target Milestone: ---   
Target Release: ---   
Hardware: s390x   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-09-08 14:07:58 UTC
Type: Bug

Description tstober 2022-08-15 14:32:06 UTC
Describe the issue:
Documentation for replacing nodes on IBM Z is incomplete

Describe the task you were trying to accomplish:
Steps to reset Ceph are missing.

Suggestions for improvement:

Document URL:

Chapter/Section Number and Title:
2.2.1

Product Version:
4.11

Environment Details:
IBM Z

Any other versions of this document that also needs this update:

Additional information:

For this section, the documentation for IBM Z is incomplete.
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_local_storage_devices#replacing-operational-nodes-on-ibmz-infrastructure_ibm-z

It should include instructions for cleaning up Ceph similar to those provided for the bare metal infrastructure section (2.2.1):
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_local_storage_devices#replacing-failed-storage-nodes-on-ibm-power-infrastructure_ibm-power

In particular:
Steps 1-6 (2.2.1), as described in the bare metal section, are missing and need to be added.
Step 7 would be called “Get a new zSystem storage node as replacement”.
After step 7, add CSR approval as described in steps 9-10 (2.2.1); see the sketch after this list.
Steps 12-19 (2.2.1) need to be added as well, in order to cleanly remove the OSD from ODF.
There should also be a troubleshooting section, especially for step 18 (2.2.1), to verify that the ocs-osd-removal-job pod worked correctly (also see the sketch after this list). It may be necessary to manually clean up the removed OSD (e.g. ID 2) as follows:
	ceph osd crush remove osd.REMOVED_OSD_ID
	ceph osd rm REMOVED_OSD_ID
	ceph auth del osd.REMOVED_OSD_ID
	ceph osd crush rm REMOVED_NODE
ODF should now be able to replace the node; verify via ceph status and the rook-ceph-osd-prepare pod.
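
As a reference for the CSR approval and the removal-job verification above, here is a minimal sketch of what these commands could look like (the openshift-storage namespace, the job-name=ocs-osd-removal-job label and the <...> placeholders are assumptions, adjust to your cluster):
	# Approve the pending CSRs for the new node (steps 9-10 in 2.2.1):
	oc get csr
	oc adm certificate approve <csr_name>

	# Check that the OSD removal job finished cleanly (step 18 in 2.2.1):
	oc get pods -n openshift-storage | grep ocs-osd-removal
	oc logs -n openshift-storage -l job-name=ocs-osd-removal-job

	# Verify the replacement from the cluster side:
	oc get pods -n openshift-storage | grep rook-ceph-osd-prepare
	oc rsh -n openshift-storage <rook-ceph-tools_pod> ceph status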
Hint: You can speed up rebalancing after adding the replacement node with the following Ceph commands; please make sure to return them to their default values on a production cluster (a sketch for restoring the defaults follows the commands):
ceph tell 'osd.*' injectargs --osd-max-backfills=16 --osd-recovery-max-active=4
ceph tell 'osd.*' config set osd_recovery_sleep_hdd 0
ceph tell 'osd.*' config set osd_recovery_sleep_ssd 0
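
For the hint above, a minimal sketch of how to check and later restore these settings (run from the rook-ceph toolbox; the default values shown are assumptions, verify them with ceph config show on your Ceph release):
	# Check the values currently in effect on one OSD:
	ceph config show osd.0 osd_max_backfills
	ceph config show osd.0 osd_recovery_max_active
	ceph config show osd.0 osd_recovery_sleep_hdd

	# Once rebalancing has finished, return to the defaults (commonly 1 / 3 / 0.1 / 0, assumed here):
	ceph tell 'osd.*' injectargs --osd-max-backfills=1 --osd-recovery-max-active=3
	ceph tell 'osd.*' config set osd_recovery_sleep_hdd 0.1
	ceph tell 'osd.*' config set osd_recovery_sleep_ssd 0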

Comment 4 tstober 2022-08-23 12:08:28 UTC
Manuel and I have verified the content. Looks good, thanks