Bug 1210543 - Replacing failed CEPH Node
Summary: Replacing failed CEPH Node
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Documentation
Version: 1.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 1.3.3
Assignee: John Wilkins
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-04-10 03:09 UTC by Vasu Kulkarni
Modified: 2016-11-30 09:52 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-30 09:52:01 UTC
Embargoed:



Description Vasu Kulkarni 2015-04-10 03:09:42 UTC
Description of problem:
In a multi-node Ceph cluster, a hardware failure can cause some or all of the Ceph services on that node to go down. Documentation explaining how to identify the failed services and replace the hardware should be provided, covering the following points:

(1)
Identify, from the live nodes, all the services that the failed node was running (see the example sketch after this list).

(2)
Is this document sufficient to bring up a failed MON service?
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/

(3)
Since the drives might survive the failure but their UUIDs can change, what would be the appropriate steps in this case?

A BZ already exists for the failed OSD document:
https://bugzilla.redhat.com/show_bug.cgi?id=1210539

(4)
For MDS, the document is missing steps:
http://ceph.com/docs/v0.78/rados/deployment/ceph-deploy-mds/
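
For (1), for example, the documentation could show how to spot the failed daemons from any surviving node. A rough sketch with standard commands (nothing here is node-specific):

# overall cluster health and monitor quorum
ceph -s
ceph quorum_status --format json-pretty
# which OSDs are down and which host they were on
ceph osd tree | grep down
# MDS state, if a metadata server ran on the failed node
ceph mds stat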


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
N/A

Actual results:
N/A

Expected results:
Documentation needed to identify/replace failed services

Additional info:

Comment 1 Ken Dreyer (Red Hat) 2015-04-23 14:27:45 UTC
I'm targeting this to 1.3.0. John, please feel free to re-target if that's not appropriate.

Comment 2 John Wilkins 2015-06-09 22:04:08 UTC
We will have to address this after the 1.3 release. It seems to have been duplicated multiple times.

Comment 5 Vasu Kulkarni 2015-09-14 17:39:26 UTC
John,

I checked that document. It explains a few things about adding and removing an OSD, which is a Ceph disk replacement, whereas this BZ is about Ceph node replacement (i.e., all OSDs/MONs on a particular node). Also, I feel the doc link you sent doesn't completely explain how to remove and add a new OSD.

Comment 6 John Wilkins 2015-09-15 17:30:07 UTC
This document--https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/blob/v1.3/add-remove-node.adoc--is meant to provide high-level guidance to a user when changing an OSD node, but not a monitor node. Changing OSD nodes can't easily be made into a generic procedure, because it depends on the hardware configuration, the means by which the node was configured, and the reason the node must be changed (e.g., motherboard failure, hardware upgrade, etc.), all of which are unknown. I have stated this multiple times. I am not clear on how I can document the procedure for unknown hardware, configuration, and rationale for swapping out the node, so I have provided the high-level guidance a system administrator should know: what to expect as a performance impact and the steps they can take to mitigate it.

The doc provides a hyperlink to https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/blob/v1.3/replace-osds.adoc, which describes how to change an OSD that has failed, as well as to https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/blob/v1.3/cluster-size.adoc, which generically describes adding and removing OSDs.

The add/remove doc has been available in that form for years. The doc on replacing an OSD disk that has failed is new, but tracks the add/remove OSD procedure fairly closely. Did you follow that procedure?
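
For reference, the generic sequence those docs walk through boils down to something like the following. This is only a sketch; <id>, <node>, and <device> are placeholders, and the exact ceph-deploy invocation depends on the release in use:

# remove the failed OSD from the cluster (run from a monitor or admin node)
ceph osd out osd.<id>
ceph osd crush remove osd.<id>
ceph auth del osd.<id>
ceph osd rm osd.<id>
# prepare and activate a replacement OSD on the new or repaired node
ceph-deploy osd create <node>:<device>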

Comment 8 Vasu Kulkarni 2015-10-07 21:26:55 UTC
John,

Sorry, I missed updating this earlier. I am not sure why changing an OSD is not a generic process; it shouldn't depend on the drive or chassis, and detecting the new drive is OS-specific and can take different forms. The one you had here before is, I think, generic: https://gitlab.cee.redhat.com/jowilkin/red-hat-ceph-storage-administration-guide/blob/v1.3/replace-osds.adoc

We need to document the process for replacing a MON, and I think the upstream document should be good enough.
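
For the MON piece, the upstream procedure is roughly the following. A sketch only, assuming ceph-deploy is in use; <mon-id> and <new-node> are placeholders, and ceph.conf (e.g. mon_initial_members/mon_host) has to be updated as well:

# remove the failed monitor from the monmap (run from a surviving node)
ceph mon remove <mon-id>
# deploy a replacement monitor on the new node
ceph-deploy mon add <new-node>
# confirm the new monitor joined the quorum
ceph quorum_status --format json-pretty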

Comment 9 John Wilkins 2015-11-05 16:55:09 UTC
Please re-open this bug for Infernalis (RHCS 2.0). We are going to rewrite the hardware guide and will address this with information we receive from the reference architecture team.

Comment 14 Tejas 2016-09-19 12:25:27 UTC
hi,

   In the doc:
https://access.qa.redhat.com/documentation/en/red-hat-ceph-storage/2/single/administration-guide/#adding_and_removing_osd_nodes

Section 8.3 step 3:

osd_recovery_priority = 1
is not a valid config setting.

[root@magna009 ceph-config]# ceph tell osd.* injectargs '--osd_recovery_priority 1'
osd.0:  failed to parse arguments: --osd_recovery_priority,1



I think it should be:
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'


[root@magna009 ceph-config]# ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
osd.0: osd_recovery_op_priority = '1'
osd.1: osd_recovery_op_priority = '1'
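
Also note that injectargs only changes the running daemons; to keep the value across restarts it would also need to be set in ceph.conf, for example (illustrative snippet):

[osd]
osd recovery op priority = 1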

Moving this back.

Thanks,
Tejas

Comment 15 John Wilkins 2016-09-19 18:37:47 UTC
Fixed. There were three instances in the doc. https://access.qa.redhat.com/documentation/en/red-hat-ceph-storage/2/single/administration-guide#recommendations

Comment 19 John Wilkins 2016-10-04 17:01:03 UTC
Fixed.

Comment 20 Vasu Kulkarni 2016-10-06 16:54:49 UTC
Sorry,

But can we also document how to replace the "root" drive that holds the OSD and MON databases? I was hoping this would cover the scenario where the OSD drives are intact but the root drive needs replacement.

Comment 21 Harish NV Rao 2016-10-06 17:37:09 UTC
Vasu, this defect tracks node replacement. Comment 20 may be applicable to
1210539 (Replacing failed disks on CEPH nodes).

Can you please check and do the needful.

We have almost completed verifying this defect.

Comment 22 Vasu Kulkarni 2016-10-06 18:32:52 UTC
Harish,

That BZ covers only the failed "osd" drives; it does not cover a failed "system" drive. Since the OSD drives survive, we need to replace the system drive and check that the other services (the MON/OSD/MDS daemons that exist on the node) come up after a quick restore of Ceph.
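
In other words, after the system drive is reinstalled and Ceph is restored on the node, the doc should show how to confirm everything came back. A sketch, assuming the systemd unit names used by RHCS 2 (<hostname> and <id> are placeholders):

# on the restored node, check the local daemons
systemctl status ceph-mon@<hostname> ceph-osd@<id> ceph-mds@<hostname>
# from any node, confirm the cluster is healthy again
ceph -s
ceph osd tree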

Comment 23 Harish NV Rao 2016-10-07 08:42:26 UTC
Should comment 22 be incorporated as part of this BZ?
If yes, please move the defect to assigned state.

Comment 24 Harish NV Rao 2016-10-18 10:03:29 UTC
Vasu, a gentle reminder.

We have already completed verifying this defect and would like to move it to the verified state, but without resolution on comments 21, 22, and 23 we can't do that.

I feel comment 22 should be part of 1210539 (Replacing failed disks on CEPH nodes) and that we should move this defect to the verified state.

