Right now Ceph doesn't support mon failover for stretch clusters. The Ceph folks are working on it, but it looks like the fix won't make it in time for OCS4.7 to consume. See https://bugzilla.redhat.com/show_bug.cgi?id=1939766, which is currently targeted for RHCS4.2z1-aync. So, to work around that in OCS, we have stopped doing mon failover from the OCS (rook) side. We need to document this, as well as what the customer is supposed to do in such a situation. Sebastien/Martin would be able to help elaborate on that. Note: If we decide to consume RHCS4.2z1-aync (with the fix for #1939766), then we have to remove this section from the doc, but right now plan B is plan A :)
There isn't much to document, other than that they need to bring the node where the failed mon was running back up. Since the feature is in tech preview, do we really need to document this? Perhaps it's more appropriate for release notes.
It's either release notes or adding an important note or warning in the Tech preview section of the relevant guide. Whichever is preferable.
I see that we are considering enabling mon failover after all: https://bugzilla.redhat.com/show_bug.cgi?id=1939617#c17 What to document, and how, depends on that decision. If we decide that mon failover will still be disabled, we also need to clarify how exactly it should work, as we had some unexpected behaviour related to this (see https://bugzilla.redhat.com/show_bug.cgi?id=1939617#c15).
I proposed here that we leave mon failover disabled until 4.8: https://bugzilla.redhat.com/show_bug.cgi?id=1939617#c19 If that holds, we can simply document that if a node goes down with a mon, it needs to be brought back up to avoid the risk of losing more mons.
No, we don't need to port it to 4.8, as we are re-enabling mon failover in 4.8. It should be: >> If a node with a failed mon goes down, it is important to fix the node to avoid the risk of losing more mons.
I might phrase it this way: >> If a node with a failed mon goes down, it is important to bring the node back online to restore the mon. If three mons are permanently down, the cluster stops working.
Suggestion from Mudit looks good to me. Suggestion from Travis looks good as well. That said, if I understand it right, we can't avoid losing more than one mon, since with 2 mons down, the cluster loses quorum and basically stops working.
(In reply to Martin Bukatovic from comment #13)
> Suggestion from Mudit looks good to me. Suggestion from Travis looks good as well, that said if I understand it right, we can't avoid losing more than one mon, since with 2 mons down, cluster loses quorum and basically stops working.
Retracting my claim above. In arbiter mode, we have 5 mons ... Suggestion from Travis looks like the best option.
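To spell out the quorum arithmetic behind the retraction: the sketch below assumes only that mons need a strict majority to form quorum, and it ignores stretch-mode specifics such as which site each mon runs in. The function names are made up for illustration and are not part of Ceph or Rook.

```python
def quorum_size(total_mons: int) -> int:
    """Smallest strict majority of total_mons."""
    if total_mons < 1:
        raise ValueError("total_mons must be at least 1")
    return total_mons // 2 + 1


def max_tolerated_failures(total_mons: int) -> int:
    """How many mons can be down while a majority is still up."""
    return total_mons - quorum_size(total_mons)


def has_quorum(total_mons: int, mons_down: int) -> bool:
    """True if the mons that are still up form a strict majority."""
    if not 0 <= mons_down <= total_mons:
        raise ValueError("mons_down must be between 0 and total_mons")
    return total_mons - mons_down >= quorum_size(total_mons)


if __name__ == "__main__":
    # Compare a regular 3-mon cluster with the 5-mon arbiter setup.
    for total in (3, 5):
        print(
            f"{total} mons: quorum needs {quorum_size(total)}, "
            f"tolerates {max_tolerated_failures(total)} down"
        )
```

Under that assumption, a 3-mon cluster loses quorum with 2 mons down, while the 5-mon arbiter setup survives 2 mons down and loses quorum at 3, which matches the wording Travis suggested.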