Bug 2276823 - CephMonQuorumAtRisk alert is raised when two of the mon pods are down in five mon cluster
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: arun kumar mohan
QA Contact: Joy John Pinto
URL:
Whiteboard:
Depends On:
Blocks: 2264553
Reported: 2024-04-24 06:14 UTC by Joy John Pinto
Modified: 2024-04-29 15:53 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-04-29 06:10:36 UTC
Embargoed:


Description Joy John Pinto 2024-04-24 06:14:01 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The CephMonQuorumAtRisk alert is raised when two of the mon pods are down in a five-mon cluster.

Version of all relevant components (if applicable):
OCP 4.15
ODF 4.15.2-1

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. Install OCP 4.15 and ODF 4.15.2-1.
2. Create a six-node cluster with a rack/host-based failure domain.
3. Update monCount to 5 in the configure modal. (In the monPDB, maxUnavailable and allowedDisruptions are set to 2, which is valid for a five-mon cluster.)
4. Drain three of the six nodes to test the node drain scenario for mon pods. After the drain, three of the mon pods are in the Running state, but the cluster shows the critical error 'CephMonQuorumAtRisk'.

Actual results:
The CephMonQuorumAtRisk error is shown even when three mon pods are active in a five-mon cluster.

Expected results:
The CephMonQuorumAtRisk error should be shown only when fewer than three mon pods are active in a five-mon cluster.


Additional info:
This bug was found upon verification of https://bugzilla.redhat.com/show_bug.cgi?id=2264553

Comment 4 arun kumar mohan 2024-04-25 11:03:40 UTC
Providing some observations:

Alert: CephMonQuorumAtRisk
Alert Expression:
`count by (namespace) (ceph_mon_quorum_status{job="rook-ceph-mgr"} == 1) <= (floor(count by (namespace) (ceph_mon_metadata{job="rook-ceph-mgr"}) / 2) + 1)`

Decomposing the expression query: the above expression is in the following format

(A) <= (B)

where,
(A) represents number of active ceph mons
(B) represents integer half of (total number of mons) + 1


Examples
============

Example 1:

For example, in a 3 mon cluster, in which 1 is inactive and 2 mons are up
(A) will be 2 (active number of mons)
(B) will be 2
PS: (B) is evaluated as follows,
total number of mons == 3, integer half of (total number of mons) == 3/2 == 1, integer half of (total number of mons) + 1 == 1+1 == 2

In this scenario (where in a total of 3 mons, only 2 are active and one inactive) (A) <= (B) is true and the alert will be triggered

Example 2:

Now let's see the scenario where we have a total of 5 mons where 3 mons are up and 2 mons down
(A) will be 3
(B) will be 3
PS: (B) is,
total number of mons == 5, integer half of (total number of mons) == 5/2 == 2, adding one to it == 2 + 1 == 3

So according to the expression, (A) <= (B) is true and we get the 'CephMonQuorumAtRisk' alert whenever there are 3 or fewer active mons (in a 5 mon cluster).
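
The arithmetic in the two examples can be checked with a small sketch. This is only an illustration of the (A) <= (B) comparison, not the actual alert rule; the function name `quorum_at_risk` and its parameters are made up for this example.

```python
def quorum_at_risk(active_mons: int, total_mons: int) -> bool:
    """Evaluate (A) <= (B), where (B) = floor(total_mons / 2) + 1.

    Hypothetical helper: it only mirrors the arithmetic of the alert
    expression, it does not query Prometheus.
    """
    threshold = total_mons // 2 + 1  # integer half of total, plus one
    return active_mons <= threshold


# Example 1: 3 mons, 2 active -> (A) = 2, (B) = 2, alert fires
print(quorum_at_risk(active_mons=2, total_mons=3))
# Example 2: 5 mons, 3 active -> (A) = 3, (B) = 3, alert fires
print(quorum_at_risk(active_mons=3, total_mons=5))
# 5 mons, 4 active -> (A) = 4, (B) = 3, alert does not fire
print(quorum_at_risk(active_mons=4, total_mons=5))
```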

Comment 5 Nishanth Thomas 2024-04-25 14:55:12 UTC
@jopinto what is the expected behavior here?

Comment 6 arun kumar mohan 2024-04-25 15:51:16 UTC
According to the initial comment (by Joy), the expected behavior is,
`CephMonQuorumAtRisk error should be shown only when fewer than three mon pods are active in a five-mon cluster`

But we raise this alert ('CephMonQuorumAtRisk') when the cluster is at risk, that is, when we have reached the point where we have lost half of the mons.

Another proposal is this:
We lose the whole mon quorum when we have only ONE mon left (quorum is completely lost). So we could introduce a new alert (hypothetically named `CephMonQuorumAtCriticalRisk`) which is triggered when exactly TWO mons are active/running.
PS: we may have to suppress the 'CephMonQuorumAtRisk' alert in a 3 mon cluster, otherwise both alerts, the new one (CephMonQuorumAtCriticalRisk) and the existing one (CephMonQuorumAtRisk), will be triggered. Please see comment#4 for more details.
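
One possible reading of this proposal, as a rough sketch: the function `alerts_for` is hypothetical, `CephMonQuorumAtCriticalRisk` does not exist today, and the suppression for 3 mon clusters is modelled as a plain check on the total mon count, which is an assumption about how it might be done.

```python
def alerts_for(active_mons: int, total_mons: int) -> list[str]:
    """Return the alerts that would fire under the proposal (hypothetical)."""
    fired = []
    # Proposed new alert: exactly two mons are active/running.
    if active_mons == 2:
        fired.append("CephMonQuorumAtCriticalRisk")
    # Existing expression, suppressed for 3 mon clusters so that
    # both alerts do not fire together there.
    if total_mons != 3 and active_mons <= total_mons // 2 + 1:
        fired.append("CephMonQuorumAtRisk")
    return fired


print(alerts_for(active_mons=2, total_mons=3))
print(alerts_for(active_mons=3, total_mons=5))
```

Under this reading, a 3 mon cluster with 2 active mons raises only the new alert, and a 5 mon cluster with 3 active mons raises only the existing one.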

So the underlying conclusion (IMHO) is that:
a. the current alert, CephMonQuorumAtRisk, is working as expected and is triggered at a high-risk time (we have lost half the mons)
b. this can be an RFE/proposal to add a new alert for when the cluster faces a critical time (when we are about to lose the mon quorum)
c. the severity of the BZ can be reduced

Joy, Nishanth, what do you guys think?

Comment 7 Nishanth Thomas 2024-04-25 16:52:06 UTC
@tnielsen , thoughts pls

Comment 8 Travis Nielsen 2024-04-25 16:59:45 UTC
Just to be clear about the mons and quorum status:

TotalMons InQuorum QuorumStatus
3         3        Healthy
3         2        Healthy
3         1        Quorum down
5         5        Healthy
5         4        Healthy
5         3        Healthy
5         2        Quorum down

Is the intention of the alert to tell them that their mon quorum is actually down? That's what I'm understanding from the description of this BZ. But if quorum is already down, the cluster is unresponsive and unusable, and seems too late for an alert. If we alert when 2 mons are down in a 5 mon cluster, then the cluster is still functional when the user gets the alert and they will have time to respond and bring the mons up again before quorum is lost. 

So I would think the current behavior is expected to alert before the quorum is down. 

But when there are only three mons, it would also be unexpected to alert when only one mon is down since that will be a frequent occurrence during upgrades. I don't think we should alert in that case. 

My view is that whether there are 3 or 5 mons, either way we should alert if 2 or more mons are down.
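
To compare the options side by side, here is a small sketch that reproduces the quorum table and evaluates two rules for each row: the existing expression (in quorum <= floor(total / 2) + 1) and the "2 or more mons down" rule suggested here. The function names are made up for illustration, and quorum is modelled simply as a strict majority of the total mons.

```python
def has_quorum(in_quorum: int, total: int) -> bool:
    # Quorum needs a strict majority of the mons.
    return in_quorum > total // 2


def current_rule(in_quorum: int, total: int) -> bool:
    # Existing alert expression: (A) <= floor(total / 2) + 1
    return in_quorum <= total // 2 + 1


def two_down_rule(in_quorum: int, total: int) -> bool:
    # Suggested rule: alert if 2 or more mons are down.
    return total - in_quorum >= 2


rows = [(3, 3), (3, 2), (3, 1), (5, 5), (5, 4), (5, 3), (5, 2)]
print("TotalMons InQuorum Quorum CurrentRule TwoDownRule")
for total, in_quorum in rows:
    print(f"{total:<9} {in_quorum:<8} "
          f"{str(has_quorum(in_quorum, total)):<6} "
          f"{str(current_rule(in_quorum, total)):<11} "
          f"{two_down_rule(in_quorum, total)}")
```

The two rules differ only for the 3 mon cluster with one mon down: the existing expression fires there, the "2 or more down" rule does not.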

Comment 9 Paul Cuzner 2024-04-25 21:20:28 UTC
As the alert name suggests, the goal of the alert is to let the admin know that quorum is at risk, i.e. further mon disruption will lead to an outage. Please correct me if I'm wrong.

The original statement in comment #1 expecting the alert to fire when the quorum is lost is not possible. The metrics required by the alert query are provided by a mgr module, so once quorum is lost you won't be seeing any metrics sent to prometheus/alertmanager.

HTH

Comment 10 Nishanth Thomas 2024-04-26 07:54:50 UTC
Based on the comments above, I assume that the alert is working as expected. I will go ahead and close the BZ.

Comment 11 Joy John Pinto 2024-04-29 04:28:56 UTC
Thanks Travis for the confirmation. @nthomas Yes sure. Please close the BZ.

Comment 12 arun kumar mohan 2024-04-29 06:10:36 UTC
Thanks Paul, Travis, Nishanth.
Closing the BZ.

Comment 13 krishnaram Karthick 2024-04-29 11:39:26 UTC
(In reply to Travis Nielsen from comment #8)

Would it make sense to send the alert as 'warning' instead of 'error'?

Comment 14 Travis Nielsen 2024-04-29 15:53:41 UTC
(In reply to krishnaram Karthick from comment #13)
> Would it make sense to send the alert as 'warning' instead of 'error'?

If there are five mons and two are down, then the cluster is still in a working condition. Warning does seem more appropriate.

