Bug 1318389

Summary: [RFE] Tool for putting node into maintenance mode
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: arkady kanevsky <arkady_kanevsky>
Component: Unclassified
Assignee: ceph-eng-bugs <ceph-eng-bugs>
Status: CLOSED DEFERRED
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Priority: unspecified
Version: 1.3.2
CC: alan_bishop, anharris, arkady_kanevsky, cdevine, christopher_dearborn, flucifre, gmeno, john_terpstra, John_walsh, kbader, kdreyer, kurt_hey, michael_tondee, morazi, nlevine, Paul_Dardeau, randy_perryman, rsussman, sreichar
Target Milestone: rc
Keywords: FutureFeature, Reopened
Target Release: 3.*
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2019-01-30 14:59:18 UTC

Description arkady kanevsky 2016-03-16 17:56:42 UTC
Description of problem:
Currently, customers who want to put a node into maintenance mode need to follow the set of instructions in chapter 17 of https://access.redhat.com/documentation/en/red-hat-ceph-storage/1.3/red-hat-ceph-administration-guide/part-v-adding-and-removing-osd-nodes.
Since this is a common procedure for node replacement and firmware (FW) upgrades, a tool that helps with it would be beneficial.
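
For reference, the per-node core of that manual procedure amounts to roughly the following (a minimal sketch using standard Ceph and systemd commands; on systemd-managed releases the OSD unit is ceph-osd.target, older releases use the sysvinit ceph service instead):

        # planned maintenance: tell the cluster not to mark OSDs out while the node is down
        ceph osd set noout
        # on the node being serviced, stop its OSD daemons
        systemctl stop ceph-osd.target
        # ... perform the maintenance (FW upgrade, part replacement, reboot) ...
        systemctl start ceph-osd.target
        # re-enable normal failure handling and wait for recovery
        ceph osd unset noout
        ceph health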

Version-Release number of selected component (if applicable):
1.3

How reproducible:
N/A

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 arkady kanevsky 2016-06-06 12:53:02 UTC
I do not have access to Ceph 2.0 documentation.
Assuming that putting a node into maintenance mode changes little for Ceph 2.x, the request is for a new command of the form
          ceph fw-update --"url for fw version" --user --password
where the last two parameters are optional credentials for FW access.

The script would cycle through every node in the OSD cluster, take one node at a time into maintenance mode, and update its FW to the specified version. Whether cluster rebuilding is disabled during the update is left to the implementer.
The script should first verify that the cluster has sufficient spare capacity to do this.
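
A rough, hypothetical sketch of such a wrapper follows. Only the Ceph commands shown (ceph osd ls-tree, ceph osd ok-to-stop, ceph osd add-noout/rm-noout, ceph health) are existing Luminous-era commands; the host inventory, the vendor firmware tool, and the wait logic are placeholders:

        #!/bin/bash
        # hypothetical rolling FW update, one OSD host at a time
        HOSTS="osd-node-1 osd-node-2"        # placeholder inventory of OSD hosts
        FW_URL="$1"                          # the proposed firmware URL argument
        for host in $HOSTS; do
            ids=$(ceph osd ls-tree "$host")  # OSD ids beneath this CRUSH host
            # spare-capacity check: refuse if stopping these OSDs would block I/O
            ceph osd ok-to-stop $ids || { echo "insufficient spare capacity for $host"; exit 1; }
            for id in $ids; do ceph osd add-noout "osd.$id"; done          # planned outage: do not mark out
            ssh "$host" 'systemctl stop ceph-osd.target'
            ssh "$host" "vendor-fw-tool --image $FW_URL && systemctl reboot"  # hypothetical FW step
            sleep 300                        # placeholder: wait for the host to come back up
            for id in $ids; do ceph osd rm-noout "osd.$id"; done
            until ceph health | grep -q HEALTH_OK; do sleep 60; done       # recover before the next host
        done

Having the wrapper record which host the loop is currently on would also provide the progress/status reporting asked for below.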

I recommend that the operation be asynchronous, since it takes a long time to complete.
As a bonus, an additional command could check the status of fw-update and show the percentage of nodes updated and the node currently being updated.
Ditto for the Ceph UI.

Comment 6 Ian Colle 2017-07-10 17:11:16 UTC
Arkady,

Please take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1464945. Does this accomplish what you're looking for?

Comment 7 Ian Colle 2017-08-01 23:20:28 UTC
Closing as a duplicate due to lack of response from the originator.

*** This bug has been marked as a duplicate of bug 1464945 ***

Comment 8 arkady kanevsky 2017-08-07 13:48:39 UTC
It is not a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1464945.
This BZ is not specific to disk replacement, even though some of the steps used in https://bugzilla.redhat.com/show_bug.cgi?id=1464945 will be applicable here.
The documentation will be very different.
One needs a generic way to put a node into maintenance mode. The goal is to minimize data transfer and potentially to introduce a new state for a node: "not available".
For maintenance mode we know that we will bring the node back online, so it should not be treated as a failure.

Once in maintenance mode, one can do whatever is required: for example, update the FW or BIOS of the node or of any of its components, or replace any component, such as a disk, NIC, processor, or even the motherboard. Some specific steps may be required depending on which components were replaced.

Reopening.

Comment 9 John Spray 2017-08-07 14:09:13 UTC
So it sounds like you're looking for a host-wide equivalent of the "ceph osd add-noout" command?

Is there any behaviour you're looking for other than for the OSDs on a particular host to not be marked out?
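
For illustration, a host-wide approximation using those per-OSD flags might look like this (a sketch, assuming ceph osd ls-tree is available; the host name is a placeholder):

        # flag every OSD under the host so they are not marked out during maintenance
        for id in $(ceph osd ls-tree osd-node-1); do ceph osd add-noout "osd.$id"; done
        # ... maintenance ...
        for id in $(ceph osd ls-tree osd-node-1); do ceph osd rm-noout "osd.$id"; done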

Comment 11 Drew Harris 2019-01-30 14:59:18 UTC
I have closed this issue because it has been inactive for some time now. If you feel it still deserves attention, feel free to reopen it.