Bug 1941918

Summary:	[Doc] [Arbiter] Disable mon failover in stretch mode
Product:	[Red Hat Storage] Red Hat OpenShift Container Storage	Reporter:	Mudit Agarwal <muagarwa>
Component:	documentation	Assignee:	Olive Lakra <olakra>
Status:	CLOSED WONTFIX	QA Contact:	Elad <ebenahar>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.7	CC:	agantony, bniver, ebenahar, etamir, gfarnum, madam, mbukatov, muagarwa, nberry, ocs-bugs, olakra, owasserm, prsurve, rcyriac, shan, sostapov, tnielsen
Clone Of:	1939617	Environment:
Last Closed:	2022-03-08 08:31:28 UTC	Type:	---
Bug Depends On:	1939007, 1939617, 1939766
Bug Blocks:

Comment 4 Mudit Agarwal 2021-03-24 13:37:31 UTC

Right now Ceph doesn't support mon failover for stretch clusters. The Ceph folks are working on it, but it looks like the fix won't make it in time for OCS4.7 to consume.

See https://bugzilla.redhat.com/show_bug.cgi?id=1939766, which is currently targeted for RHCS4.2z1-async.

So, to work around that in OCS, we have stopped doing mon failover from the OCS (rook) side.
We need to document this, as well as what the customer is supposed to do in such a situation. Sebastien/Martin would be able to help elaborate on that.


Note: If we decide to consume RHCS4.2z1-async (with the fix for #1939766), then we have to remove this section from the doc, but right now plan B is plan A :)

Comment 5 Travis Nielsen 2021-03-29 17:44:08 UTC

There isn't much to document, other than that the customer needs to bring the node where the failed mon was running back up. Since the feature is in tech preview, do we really need to document this? Perhaps it's more appropriate for release notes.

Comment 6 Mudit Agarwal 2021-03-30 09:06:24 UTC

It's either release notes, or an important note or warning in the Tech Preview section of the relevant guide. Whichever is preferable.

Comment 7 Martin Bukatovic 2021-04-01 15:27:02 UTC

I see that we are considering enabling mon failover after all:

https://bugzilla.redhat.com/show_bug.cgi?id=1939617#c17

What to document, and how, depends on that decision. If we decide that mon failover stays disabled, we also need to settle how exactly it should work, as we saw some unexpected behaviour related to this (see https://bugzilla.redhat.com/show_bug.cgi?id=1939617#c15).

Comment 8 Travis Nielsen 2021-04-01 18:05:33 UTC

I proposed here that we leave mon failover disabled until 4.8. https://bugzilla.redhat.com/show_bug.cgi?id=1939617#c19
If that holds, we can simply document that if a node goes down with a mon, it needs to be brought back up to avoid the risk of losing more mons.

Comment 11 Mudit Agarwal 2021-05-26 10:56:40 UTC

No, we don't need to port it to 4.8, as we are re-enabling mon failover in 4.8.

It should be
>> If a node with a failed mon goes down, it is important to fix the node to avoid the risk of losing more mons.

Comment 12 Travis Nielsen 2021-05-26 15:22:07 UTC

I might phrase it this way:

>> If a node with a failed mon goes down, it is important to bring the node back online to restore the mon. If three mons are permanently down, the cluster stops working.

Comment 13 Martin Bukatovic 2021-05-31 14:27:30 UTC

Suggestion from Mudit looks good to me. Suggestion from Travis looks good as well. That said, if I understand it right, we can't avoid losing more than one mon, since with 2 mons down, the cluster loses quorum and basically stops working.

Comment 14 Martin Bukatovic 2021-05-31 14:29:15 UTC

(In reply to Martin Bukatovic from comment #13)
> Suggestion from Mudit looks good to me. Suggestion from Travis looks good as
> well, that said if I understand it right, we can't avoid losing more than
> one mon, since with 2 mons down, cluster loses quorum and basically stops
> working.

Retracting my claim above. In arbiter mode, we have 5 mons ...

Suggestion from Travis looks like the best option.
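
For reference, the quorum arithmetic behind the wording in comment 12 can be sketched as below. This is only an illustration, assuming the usual rule that Ceph monitors need a strict majority to form quorum and the 5 mons mentioned in comment 14. The function names are hypothetical, and the sketch ignores stretch-mode specifics such as which zone the surviving mons are in.

```python
# Illustrative quorum arithmetic only; not taken from rook or Ceph code.
# Assumption: monitors need a strict majority of the configured mons.

def quorum_size(total_mons: int) -> int:
    """Smallest number of mons that forms a strict majority."""
    return total_mons // 2 + 1


def has_quorum(total_mons: int, mons_down: int) -> bool:
    """True if the mons still running can form a majority."""
    return total_mons - mons_down >= quorum_size(total_mons)


ARBITER_MODE_MONS = 5  # per comment 14: arbiter mode runs 5 mons

if __name__ == "__main__":
    for down in range(ARBITER_MODE_MONS + 1):
        state = "quorum kept" if has_quorum(ARBITER_MODE_MONS, down) else "quorum lost"
        print(f"{down} mon(s) down: {state}")
```

With 5 mons, quorum needs 3, so the cluster tolerates 2 mons down and stops working once a third is lost. Since failover is disabled in 4.7, a lost mon is only restored by bringing its node back online, which is why the proposed note stresses doing so.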