Describe the issue:
Documentation for replacing nodes on IBM Z is incomplete.

Describe the task you were trying to accomplish:
Steps are missing to reset Ceph.

Suggestions for improvement:

Document URL:

Chapter/Section Number and Title:
2.2.1

Product Version:
4.11

Environment Details:
IBM Z

Any other versions of this document that also need this update:

Additional information:
For this section, the documentation for IBM Z is incomplete:
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_local_storage_devices#replacing-operational-nodes-on-ibmz-infrastructure_ibm-z

It should have instructions for cleaning up Ceph similar to those provided for bare metal infrastructure (2.2.1):
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html/replacing_nodes/openshift_data_foundation_deployed_using_local_storage_devices#replacing-failed-storage-nodes-on-ibm-power-infrastructure_ibm-power

In particular:
- Steps 1-6 (2.2.1) as described in the bare metal section are missing and need to be added.
- Step 7 would be called "Get a new zSystem storage node as replacement".
- After step 7, add CSR approval as described in steps 9-10 (2.2.1); see the CSR sketch below.
- Steps 12-19 (2.2.1) need to be added as well, in order to cleanly remove the OSD from ODF; see the OSD removal sketch below.

There should also be a troubleshooting section, especially for step 18 (2.2.1), to verify that the ocs-osd-removal-job pod worked correctly (see the verification sketch below). It may be necessary to manually clean up the removed OSD (for example, OSD ID 2) as follows:

ceph osd crush remove osd.REMOVED_OSD_ID
ceph osd rm REMOVED_OSD_ID
ceph auth del osd.REMOVED_OSD_ID
ceph osd crush rm REMOVED_NODE

ODF should now be able to replace the node; check via ceph status and the rook-ceph-osd-prepare pod (see the status check sketch below).

Hint: You can speed up rebalancing after adding the replacement node with the following Ceph commands. Please make sure to return them to their default values on a production cluster (see the reset sketch below):

ceph tell 'osd.*' injectargs --osd-max-backfills=16 --osd-recovery-max-active=4
ceph tell 'osd.*' config set osd_recovery_sleep_hdd 0
ceph tell 'osd.*' config set osd_recovery_sleep_ssd 0
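CSR sketch: a minimal example of the approval flow referenced above, using standard OpenShift commands; the exact wording in steps 9-10 of the bare metal section may differ:

# List certificate signing requests; a new node shows Pending entries
oc get csr
# Approve each pending CSR for the replacement node by name
oc adm certificate approve <csr_name>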
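OSD removal sketch: roughly how the removal (steps 12-19) is driven in the 4.10 bare metal instructions. The template name ocs-osd-removal and the FAILED_OSD_IDS parameter are taken from those instructions; verify them against the release in use:

# Scale down the deployment of the OSD to be removed (example: ID 2)
oc scale -n openshift-storage deployment rook-ceph-osd-2 --replicas=0
# Run the removal job from the ocs-osd-removal template
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=2 | oc create -n openshift-storage -f -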
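Verification sketch for step 18: check that the removal job completed. The label selector and log message are assumptions based on the 4.10 docs:

# The job pod should reach Completed status
oc get pod -n openshift-storage -l job-name=ocs-osd-removal-job
# The log should report that the OSD was removed
oc logs -n openshift-storage -l job-name=ocs-osd-removal-job --tail=-1 | grep -i 'completed removal'

If the job did not clean up everything, fall back to the manual ceph cleanup commands listed above.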
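Status check sketch: the ceph commands above have to run where the Ceph CLI is available, typically the rook-ceph-tools pod (the label app=rook-ceph-tools is an assumption and may differ by release):

# Open a shell in the toolbox pod
oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
# Inside the pod: confirm cluster health and that the new OSD joined
ceph status
ceph osd tree
# Back on the workstation: the prepare pod for the new node should be Completed
oc get pods -n openshift-storage | grep rook-ceph-osd-prepare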
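Reset sketch: values to restore once rebalancing finishes. The defaults below are the upstream Ceph defaults as I know them; please confirm them for the deployed Ceph version, for example with 'ceph config help osd_max_backfills':

# Assumed defaults: osd_max_backfills=1, osd_recovery_max_active=3,
# osd_recovery_sleep_hdd=0.1, osd_recovery_sleep_ssd=0
ceph tell 'osd.*' injectargs --osd-max-backfills=1 --osd-recovery-max-active=3
ceph tell 'osd.*' config set osd_recovery_sleep_hdd 0.1
ceph tell 'osd.*' config set osd_recovery_sleep_ssd 0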
Manuel Gotin has approved the changes as per MR: https://gitlab.cee.redhat.com/red-hat-openshift-container-storage-documentation/openshift-data-foundation-documentation-4.11/-/merge_requests/119
Manuel and I have verified the content. Looks good, thanks.