Bug 1941918 - [Doc] [Arbiter] Disable mon failover in stretch mode
Summary: [Doc] [Arbiter] Disable mon failover in stretch mode
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: documentation
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Olive Lakra
QA Contact: Elad
URL:
Whiteboard:
Depends On: 1939007 1939617 1939766
Blocks:
Reported: 2021-03-23 07:06 UTC by Mudit Agarwal
Modified: 2022-04-05 03:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1939617
Environment:
Last Closed: 2022-03-08 08:31:28 UTC
Embargoed:



Comment 4 Mudit Agarwal 2021-03-24 13:37:31 UTC
Right now Ceph doesn't support mon failover for stretch clusters. The Ceph folks are working on it, but it looks like the fix won't make it in time for OCS4.7 to consume.

See https://bugzilla.redhat.com/show_bug.cgi?id=1939766, which is currently targeted for RHCS4.2z1-async.

So, to work around that in OCS, we have stopped doing a failover from the OCS (rook) side.
We need to document this, as well as what the customer is supposed to do in such a situation. Sebastien/Martin would be able to help elaborate on that.


Note: In case we decide to consume RHCS4.2z1-async (with the fix for #1939766), we will have to remove this section from the doc, but right now plan B is plan A :)

Comment 5 Travis Nielsen 2021-03-29 17:44:08 UTC
There isn't much to document, other than they need to bring the node back up where the failed mon was running. Since the feature is in tech preview, do we really need to document this? Perhaps it's more appropriate for release notes.

Comment 6 Mudit Agarwal 2021-03-30 09:06:24 UTC
It's either release notes or an important note or warning in the Tech Preview section of the relevant guide, whichever is preferable.

Comment 7 Martin Bukatovic 2021-04-01 15:27:02 UTC
I see that we are considering enabling mon failover after all:

https://bugzilla.redhat.com/show_bug.cgi?id=1939617#c17

What to document, and how, depends on that decision. If we decide that mon failover will stay disabled, we also need to settle how exactly it should work, as we saw some unexpected behaviour related to this (see https://bugzilla.redhat.com/show_bug.cgi?id=1939617#c15).

Comment 8 Travis Nielsen 2021-04-01 18:05:33 UTC
I proposed here that we leave mon failover disabled until 4.8. https://bugzilla.redhat.com/show_bug.cgi?id=1939617#c19
If that holds, we can simply document that if a node goes down with a mon, it needs to be brought back up to avoid the risk of losing more mons.

Comment 11 Mudit Agarwal 2021-05-26 10:56:40 UTC
No, we don't need to port it to 4.8, as we are re-enabling mon failover in 4.8.

It should be
>> If a node with a failed mon goes down, it is important to fix the node to avoid the risk of losing more mons.

Comment 12 Travis Nielsen 2021-05-26 15:22:07 UTC
I might phrase it this way:

>> If a node with a failed mon goes down, it is important to bring the node back online to restore the mon. If three mons are permanently down, the cluster stops working.

Comment 13 Martin Bukatovic 2021-05-31 14:27:30 UTC
Suggestion from Mudit looks good to me. Suggestion from Travis looks good as well. That said, if I understand it right, we can't afford to lose more than one mon, since with 2 mons down the cluster loses quorum and basically stops working.

Comment 14 Martin Bukatovic 2021-05-31 14:29:15 UTC
(In reply to Martin Bukatovic from comment #13)
> Suggestion from Mudit looks good to me. Suggestion from Travis looks good as
> well. That said, if I understand it right, we can't afford to lose more than
> one mon, since with 2 mons down the cluster loses quorum and basically stops
> working.

Retracting my claim above. In arbiter mode, we have 5 mons ...

Suggestion from Travis looks like the best option.
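
For reference, the quorum arithmetic behind the "three mons" wording can be sketched as below. This is only an illustration of the strict-majority rule that Ceph monitors use to form quorum, applied to the 5 mons of arbiter mode mentioned above. The function names are hypothetical, and the sketch ignores any stretch-mode specifics such as which site the surviving mons are in.

```python
def quorum_size(total_mons: int) -> int:
    """Smallest strict majority of total_mons (hypothetical helper)."""
    return total_mons // 2 + 1


def has_quorum(total_mons: int, mons_down: int) -> bool:
    """True if the mons still up can form a strict majority."""
    return total_mons - mons_down >= quorum_size(total_mons)


if __name__ == "__main__":
    TOTAL_MONS = 5  # arbiter mode, per the comment above
    for down in range(4):
        state = "quorum held" if has_quorum(TOTAL_MONS, down) else "quorum lost"
        print(f"{down} mon(s) down: {state}")
```

With 5 mons the majority is 3, so two mons can be down without losing quorum, while a third failure stops the cluster. With a 3-mon cluster, the second failure would already lose quorum, which is what the retracted claim assumed.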

