Bug 2064736 - CephMonQuorumLost is not triggered when 2 of 3 monitors are down [NEEDINFO]
Summary: CephMonQuorumLost is not triggered when 2 of 3 monitors are down
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Filip Balák
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-16 13:03 UTC by Filip Balák
Modified: 2023-08-09 16:37 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-30 12:58:26 UTC
Embargoed:
nthomas: needinfo? (fbalak)


Attachments

Description Filip Balák 2022-03-16 13:03:00 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The CephMonQuorumLost alert is not triggered when 2 of 3 monitors are down.

Version of all relevant components (if applicable):
OCS 4.8.9-1
OCP 4.8

Is this issue reproducible?
Yes, it seems that the problem is also present in ODF 4.10.


Steps to Reproduce:
1. Get the list of monitor deployments.
2. Scale all of them except one to 0.
3. Check alerting.
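
For reference, a minimal sketch of these steps driven through the `oc` CLI from Python. The `openshift-storage` namespace and the `app=rook-ceph-mon` deployment label are assumptions based on a default ODF install, not values taken from this cluster:

# Hypothetical sketch of the reproduction steps above, driven through the
# `oc` CLI. The namespace and the `app=rook-ceph-mon` label are assumptions
# based on a default ODF deployment.
import json
import subprocess

NAMESPACE = "openshift-storage"  # assumed ODF namespace

def get_mon_deployments():
    """Step 1: list the rook-ceph monitor deployments."""
    out = subprocess.run(
        ["oc", "-n", NAMESPACE, "get", "deployments",
         "-l", "app=rook-ceph-mon", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [item["metadata"]["name"] for item in json.loads(out)["items"]]

def scale_down_all_but_one(deployments):
    """Step 2: scale every monitor deployment except the first to 0 replicas."""
    for name in deployments[1:]:
        subprocess.run(
            ["oc", "-n", NAMESPACE, "scale", "deployment", name, "--replicas=0"],
            check=True,
        )

if __name__ == "__main__":
    mons = get_mon_deployments()
    print("mon deployments:", mons)
    scale_down_all_but_one(mons)
    # Step 3: check alerting, e.g. in the OCP console under
    # Observe -> Alerting, or via the Prometheus alerts API.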


Actual results:
The CephMonQuorumLost alert is not triggered in time.

Expected results:
The CephMonQuorumLost alert should be triggered.

Additional info:
Based on this test case run (OCS 4.8.9-1):
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3425/testReport/junit/tests.manage.monitoring.prometheus/test_deployment_status/test_ceph_mons_quorum_lost_True_/
Reproduced in ODF 4.10:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3516

Comment 4 arun kumar mohan 2022-03-17 10:51:57 UTC
Checked with the latest ODF 4.10 (4.10.0-198 at this time) on the latest OpenShift 4.10.0-0.nightly-2022-01-31-012936.

The CephMonQuorumLost alert is triggered at the expected time, i.e. after 5m.

The alert is raised and stays in the 'Pending' state for the configured number of minutes (here 5m), and only after that does it fire.
Screenshots are provided.
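
For reference, the pending-to-firing delay comes from the alerting rule's `for` duration. Below is a hedged sketch that reads that duration for CephMonQuorumLost from the Prometheus rules API; the route URL and bearer token are placeholders, not values from this cluster:

# Hedged sketch: read the `for` duration of the CephMonQuorumLost rule from
# the Prometheus rules API. PROM_URL and TOKEN are placeholders and must be
# replaced with the cluster's actual Prometheus route and a valid bearer token.
import requests

PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"  # placeholder
TOKEN = "sha256~replace-with-a-real-token"  # placeholder

resp = requests.get(
    f"{PROM_URL}/api/v1/rules",
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # lab clusters often use self-signed certs
)
resp.raise_for_status()

for group in resp.json()["data"]["groups"]:
    for rule in group.get("rules", []):
        if rule.get("name") == "CephMonQuorumLost":
            # `duration` is the rule's `for:` value in seconds; the alert stays
            # Pending for this long before it transitions to Firing.
            print(rule["name"], "for:", rule.get("duration"), "seconds")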

The steps followed are the same as Filip's:
Scaled down the deployments of 2 mon pods (out of 3) to zero, leaving only ONE mon in the running state.

@fbalak, can you confirm what you mean when you say the alert is not fired in time:

a. the alert is fired after 5m
OR
b. the alert stays in the Pending state for a very long wait (like 10-20 minutes) and is fired only after that
OR
c. the alert is not fired at all

Thanks

Comment 5 arun kumar mohan 2022-03-17 11:04:11 UTC
One more thing: we haven't backported the CephMonQuorumLost alert to 4.8.
It is available from 4.9 onwards, as mentioned in this JIRA: https://issues.redhat.com/browse/RHSTOR-2491

Filip, can you please check the same with the latest 4.10 release?

Comment 6 Filip Balák 2022-03-17 11:09:16 UTC
The test waits 14 minutes for the alert: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/conftest.py#L122. During that time the alert does not even reach the Pending state.
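
Roughly what that wait does, as a hedged sketch (not the actual ocs-ci implementation): poll the Prometheus alerts API until CephMonQuorumLost shows up, giving up after the 14-minute timeout. The URL and token are placeholders:

# Hedged approximation of the wait-for-alert logic (not the actual ocs-ci
# code). Polls the Prometheus alerts API until CephMonQuorumLost appears,
# or the ~14-minute timeout used by the test expires.
import time
import requests

PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"  # placeholder
TOKEN = "sha256~replace-with-a-real-token"  # placeholder
TIMEOUT = 14 * 60      # seconds, matching the wait used by the test
POLL_INTERVAL = 30     # seconds between polls

def wait_for_alert(name, states=("pending", "firing")):
    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        resp = requests.get(
            f"{PROM_URL}/api/v1/alerts",
            headers={"Authorization": f"Bearer {TOKEN}"},
            verify=False,
        )
        resp.raise_for_status()
        for alert in resp.json()["data"]["alerts"]:
            if alert["labels"].get("alertname") == name and alert["state"] in states:
                return alert
        time.sleep(POLL_INTERVAL)
    raise TimeoutError(f"{name} did not reach state {states} within {TIMEOUT}s")

# Example: the bug is that this times out even though two of three mons are down.
# wait_for_alert("CephMonQuorumLost")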

I see in one test run that it is triggered correctly: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3440/
That means that there could be some flakiness.

Also, as part of the test suite, a test that scales down one monitor, waits for the CephMonQuorumAtRisk alert, and scales the monitor back up is executed first; only after that test finishes is this test, which scales down all monitors except one, executed.

Comment 7 arun kumar mohan 2022-03-22 10:24:31 UTC
I think this is some flakiness in the way the test is being executed.
A wait of 14 minutes (or more) is required for 'CephMonQuorumAtRisk', where we expect this alert to be triggered after 15 minutes when ONE mon goes down (out of THREE).
Can we execute the test for the CephMonQuorumLost alert independently and verify it is triggered in time?

Comment 8 Filip Balák 2022-03-23 15:21:43 UTC
I will try to do more testing and update here.

Comment 9 Mudit Agarwal 2022-03-23 15:29:01 UTC
Based on an offline discussion with QE, it was agreed that this is not an RC blocker. Keeping it open and moving it out of 4.10.

Comment 10 Nishanth Thomas 2022-04-27 14:24:22 UTC
Filip, can you please run the tests and provide the details requested?

Comment 11 arun kumar mohan 2022-05-11 10:51:52 UTC
Filip, are you able to reproduce this issue? If not, can we close this?
Thanks

Comment 12 arun kumar mohan 2022-05-25 05:34:13 UTC
Putting it back to QA for reproduction. Please let me know (with the steps as well) if it is reproducible; otherwise we can close this as a flaky test issue.
Thanks

Comment 13 Mudit Agarwal 2022-05-30 12:58:26 UTC
ON_QA is not the correct state here.
Closing the BZ; let's reopen if we see it again.

