Description of problem (please be as detailed as possible and provide log snippets):
CephMonQuorumAtRisk alert is raised when two of the mon pods are down in a five-mon cluster.

Version of all relevant components (if applicable):
OCP 4.15
ODF 4.15.2-1

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install OCP 4.15 and ODF 4.15.2-1.
2. Create a six-node cluster with a rack/host based failure domain.
3. Update monCount to 5 on the configure modal. (In the mon PDB, the values for maxUnavailable and allowedDisruptions are set to 2, which is valid for a five-mon cluster.)
4. Drain three of the six nodes to test the node drain scenario for mon pods. After the drain, three of the mon pods are in Running state, but the cluster shows the critical error 'CephMonQuorumAtRisk'.

Actual results:
CephMonQuorumAtRisk error is shown even when three mon pods are active in a five-mon cluster.

Expected results:
CephMonQuorumAtRisk error should be shown even when less than three mon pods are active in five mon cluster

Additional info:
This bug was found upon verification of https://bugzilla.redhat.com/show_bug.cgi?id=2264553
Providing some observations:

Alert: CephMonQuorumAtRisk
Alert Expression: `count by (namespace) (ceph_mon_quorum_status{job="rook-ceph-mgr"} == 1) <= (floor(count by (namespace) (ceph_mon_metadata{job="rook-ceph-mgr"}) / 2) + 1)`

Decomposing the expression query: the above expression has the form (A) <= (B), where
(A) represents the number of active ceph mons
(B) represents the integer half of (total number of mons) + 1

Examples
============

Example 1: In a 3 mon cluster in which 1 mon is inactive and 2 mons are up,
(A) will be 2 (active number of mons)
(B) will be 2
PS: (B) is evaluated as follows: total number of mons == 3, integer half of (total number of mons) == 3/2 == 1, integer half of (total number of mons) + 1 == 1 + 1 == 2
In this scenario (a total of 3 mons, of which only 2 are active and one is inactive), (A) <= (B) is true and the alert will be triggered.

Example 2: Now let's see the scenario where we have a total of 5 mons, of which 3 mons are up and 2 mons are down:
(A) will be 3
(B) will be 3
PS: (B) is evaluated as follows: total number of mons == 5, integer half of (total number of mons) == 5/2 == 2, adding one to it == 2 + 1 == 3
So according to the expression, (A) <= (B) is true and we get the 'CephMonQuorumAtRisk' alert if there are 3 or fewer active mons (in a 5 mon cluster).
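The arithmetic in the two examples can be checked with a small Python sketch. This only mirrors the (A) <= (B) comparison from the alert expression; it does not query Prometheus, and the function name `quorum_at_risk` is a hypothetical one chosen for illustration.

```python
def quorum_at_risk(total_mons: int, active_mons: int) -> bool:
    """Mirror of the alert condition (A) <= (B).

    (A) is the number of active mons.
    (B) is floor(total mons / 2) + 1.
    """
    threshold = total_mons // 2 + 1  # integer half of the total, plus one
    return active_mons <= threshold


# Walk through the cases discussed above, plus the fully healthy ones.
for total, active in [(3, 3), (3, 2), (5, 5), (5, 4), (5, 3)]:
    print(f"total={total} active={active} alert={quorum_at_risk(total, active)}")
```

Under these assumptions the condition holds for 2 of 3 mons and for 3 of 5 mons, matching Example 1 and Example 2.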
@jopinto what is the expected behavior here?
According to the initial comment (by Joy), the expected behavior is:
`CephMonQuorumAtRisk error should be shown even when less than three mon pods are active in five mon cluster`

But we raise this alert ('CephMonQuorumAtRisk') when the cluster is at risk, that is, when we have reached the point where we have lost half of the mons.

Another proposal is this: we will lose the whole mon quorum when we have only ONE mon left (quorum is completely lost). So we can introduce a new alert (hypothetically named `CephMonQuorumAtCriticalRisk`) which will be triggered when we have exactly TWO mons active/running.
PS: we may have to suppress the 'CephMonQuorumAtRisk' alert in a 3 mon cluster, otherwise both alerts, the new one (CephMonQuorumAtCriticalRisk) and the existing one (CephMonQuorumAtRisk), will be triggered. Please see comment#4 for more details.

So the underlying conclusion (IMHO) is that:
a. the current alert, CephMonQuorumAtRisk, is working as expected; it is triggered at a high-risk time (we have lost half the mons)
b. this can be an RFE/proposal to add a new alert for when the cluster faces a critical time (when we are about to lose the mon quorum)
c. the severity of the BZ can be reduced

Joy, Nishanth, what do you guys think?
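A rough sketch of how the two alerts would interact under the proposal as worded above. Everything here is an assumption for illustration: `alerts_under_proposal` is a hypothetical helper, `CephMonQuorumAtCriticalRisk` does not exist today, and the suppression rule is modeled simply as "do not raise CephMonQuorumAtRisk in a 3 mon cluster".

```python
def alerts_under_proposal(total_mons: int, active_mons: int) -> list:
    """Return the alert names that would fire under the proposed scheme."""
    alerts = []

    # Existing alert: active mons <= floor(total / 2) + 1,
    # suppressed in a 3 mon cluster as suggested in the PS above.
    at_risk = active_mons <= total_mons // 2 + 1
    if at_risk and total_mons != 3:
        alerts.append("CephMonQuorumAtRisk")

    # Proposed (hypothetical) alert: exactly two mons active.
    if active_mons == 2:
        alerts.append("CephMonQuorumAtCriticalRisk")

    return alerts


print(alerts_under_proposal(3, 2))
print(alerts_under_proposal(5, 3))
```

With the suppression in place, a 3 mon cluster with two active mons raises only the new alert, while a 5 mon cluster with three active mons raises only the existing one.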
@tnielsen , thoughts pls
Just to be clear about the mons and quorum status:

TotalMons  InQuorum  QuorumStatus
3          3         Healthy
3          2         Healthy
3          1         Quorum down
5          5         Healthy
5          4         Healthy
5          3         Healthy
5          2         Quorum down

Is the intention of the alert to tell them that their mon quorum is actually down? That's what I'm understanding from the description of this BZ. But if quorum is already down, the cluster is unresponsive and unusable, and seems too late for an alert. If we alert when 2 mons are down in a 5 mon cluster, then the cluster is still functional when the user gets the alert and they will have time to respond and bring the mons up again before quorum is lost.

So I would think the current behavior is expected to alert before the quorum is down.

But when there are only three mons, it would also be unexpected to alert when only one mon is down since that will be a frequent occurrence during upgrades. I don't think we should alert in that case.

My view is that whether there are 3 or 5 mons, either way we should alert if 2 or more mons are down.
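The table follows from the majority rule (quorum needs more than half of the mons), and the suggested "2 or more mons down" rule can be compared against it in a few lines of Python. This is only a sketch; `quorum_status` and `alert_if_two_down` are hypothetical names, not part of any existing alert definition.

```python
def quorum_status(total_mons: int, in_quorum: int) -> str:
    """Quorum holds only while a strict majority of mons is in quorum."""
    return "Healthy" if 2 * in_quorum > total_mons else "Quorum down"


def alert_if_two_down(total_mons: int, in_quorum: int) -> bool:
    """Suggested rule: alert once two or more mons are down."""
    return total_mons - in_quorum >= 2


# Reproduce the rows of the table and show where the suggested rule would alert.
for total, up in [(3, 3), (3, 2), (3, 1), (5, 5), (5, 4), (5, 3), (5, 2)]:
    print(total, up, quorum_status(total, up), alert_if_two_down(total, up))
```

Under this rule a 3 mon cluster with one mon down stays silent, and a 5 mon cluster with two mons down alerts while quorum is still healthy.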
As the alert name suggests, the goal of the alert is to let the admin know that quorum is at risk, i.e. further mon disruption will lead to an outage (please correct me if I'm wrong).

The original statement in comment #1, which expects the alert to fire when quorum is lost, is not achievable. The metrics required by the alert query are provided by a mgr module, so once quorum is lost you won't see any metrics sent to prometheus/alertmanager.

HTH
Based on the comments above, I assume that the alert is working as expected. I will go ahead and close the BZ.
Thanks Travis for the confirmation. @nthomas Yes sure. Please close the BZ.
Thanks Paul, Travis, Nishanth. Closing the BZ.
(In reply to Travis Nielsen from comment #8)
> Just to be clear about the mons and quorum status:
>
> TotalMons  InQuorum  QuorumStatus
> 3          3         Healthy
> 3          2         Healthy
> 3          1         Quorum down
> 5          5         Healthy
> 5          4         Healthy
> 5          3         Healthy
> 5          2         Quorum down
>
> Is the intention of the alert to tell them that their mon quorum is actually
> down? That's what I'm understanding from the description of this BZ. But if
> quorum is already down, the cluster is unresponsive and unusable, and seems
> too late for an alert. If we alert when 2 mons are down in a 5 mon cluster,
> then the cluster is still functional when the user gets the alert and they
> will have time to respond and bring the mons up again before quorum is lost.
>
> So I would think the current behavior is expected to alert before the quorum
> is down.
>
> But when there are only three mons, it would also be unexpected to alert
> when only one mon is down since that will be a frequent occurrence during
> upgrades. I don't think we should alert in that case.
>
> My view is that whether there are 3 or 5 mons, either way we should alert if
> 2 or more mons are down.

Would it make sense to send the alert as 'warning' instead of 'error'?
(In reply to krishnaram Karthick from comment #13)
> (In reply to Travis Nielsen from comment #8)
> > Just to be clear about the mons and quorum status:
> >
> > TotalMons  InQuorum  QuorumStatus
> > 3          3         Healthy
> > 3          2         Healthy
> > 3          1         Quorum down
> > 5          5         Healthy
> > 5          4         Healthy
> > 5          3         Healthy
> > 5          2         Quorum down
> >
> > Is the intention of the alert to tell them that their mon quorum is actually
> > down? That's what I'm understanding from the description of this BZ. But if
> > quorum is already down, the cluster is unresponsive and unusable, and seems
> > too late for an alert. If we alert when 2 mons are down in a 5 mon cluster,
> > then the cluster is still functional when the user gets the alert and they
> > will have time to respond and bring the mons up again before quorum is lost.
> >
> > So I would think the current behavior is expected to alert before the quorum
> > is down.
> >
> > But when there are only three mons, it would also be unexpected to alert
> > when only one mon is down since that will be a frequent occurrence during
> > upgrades. I don't think we should alert in that case.
> >
> > My view is that whether there are 3 or 5 mons, either way we should alert if
> > 2 or more mons are down.
>
> Would it make sense to send the alert as 'warning' instead of 'error'?

If there are five mons and two are down, then the cluster is still in a working condition. Warning does seem more appropriate.