Bug 2064736

Summary: CephMonQuorumLost is not triggered when 2 of 3 monitors are down
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Filip Balák <fbalak>
Component: ceph-monitoring
Assignee: Filip Balák <fbalak>
Status: CLOSED WORKSFORME
QA Contact: Filip Balák <fbalak>
Severity: high
Priority: unspecified
Version: 4.10
CC: mmuench, muagarwa, nthomas, ocs-bugs, odf-bz-bot
Keywords: AutomationBlocker, Regression
Flags: nthomas: needinfo? (fbalak)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-05-30 12:58:26 UTC
Type: Bug

Description Filip Balák 2022-03-16 13:03:00 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
CephMonQuorumLost is not triggered when 2 of 3 monitors are down.

Version of all relevant components (if applicable):
OCS 4.8.9-1
OCP 4.8

Is this issue reproducible?
Yes. The problem also seems to be present in ODF 4.10.


Steps to Reproduce:
1. Get the list of monitor (mon) deployments.
2. Scale all of them down to 0 except one.
3. Check alerting (a minimal sketch of these steps follows below).
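
For reference, a minimal sketch of these steps, assuming a default ODF install in the openshift-storage namespace with mon deployments labelled app=rook-ceph-mon (the namespace and label are assumptions, adjust for your cluster):

# Hypothetical reproduction sketch; namespace and label values are assumptions,
# not taken from this bug report.
import json
import subprocess

NAMESPACE = "openshift-storage"   # assumed ODF namespace
MON_LABEL = "app=rook-ceph-mon"   # assumed label on mon deployments

def oc(*args):
    """Run an oc command and return its stdout."""
    return subprocess.run(
        ["oc", *args], check=True, capture_output=True, text=True
    ).stdout

# 1. Get the list of monitor deployments.
deployments = json.loads(
    oc("get", "deployments", "-n", NAMESPACE, "-l", MON_LABEL, "-o", "json")
)
mon_names = [d["metadata"]["name"] for d in deployments["items"]]
print("mon deployments:", mon_names)

# 2. Scale all of them down to 0 except one.
for name in mon_names[1:]:
    oc("scale", "deployment", name, "-n", NAMESPACE, "--replicas=0")

# 3. Check alerting, e.g. in the OCP console under Observe -> Alerting,
#    or via the Prometheus API as sketched in the comments below.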


Actual results:
The CephMonQuorumLost alert is not raised in time.

Expected results:
The CephMonQuorumLost alert should be triggered.

Additional info:
Test case from the test run (OCS 4.8.9-1):
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3425/testReport/junit/tests.manage.monitoring.prometheus/test_deployment_status/test_ceph_mons_quorum_lost_True_/
Reproduced in ODF 4.10:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3516

Comment 4 arun kumar mohan 2022-03-17 10:51:57 UTC
Checked with the latest ODF 4.10 (at this time 4.10.0-198) on the latest OpenShift 4.10.0-0.nightly-2022-01-31-012936.

The CephMonQuorumLost alert is triggered at the expected time, that is, after 5m.

The alert is raised in the 'Pending' state for the configured number of minutes (here 5m), and only after that is it fired.
Providing the screenshots.
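
To illustrate the Pending -> Firing transition, a minimal sketch that queries the in-cluster Prometheus alerts API; the route host and token handling are assumptions and will differ per cluster:

# Illustrative sketch only; the Prometheus/Thanos route and token are assumed values.
import requests

PROM_URL = "https://thanos-querier-openshift-monitoring.apps.example.com"  # assumed route
TOKEN = "sha256~..."  # e.g. output of `oc whoami -t`

resp = requests.get(
    f"{PROM_URL}/api/v1/alerts",
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # test clusters often use self-signed certificates
)
resp.raise_for_status()

for alert in resp.json()["data"]["alerts"]:
    if alert["labels"].get("alertname") == "CephMonQuorumLost":
        # "state" is "pending" while the rule's 'for' window (5m here) is running,
        # and flips to "firing" once the expression has held for that long.
        print(alert["labels"]["alertname"], alert["state"])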

The steps followed are the same as Filip's:
Scaled the deployments of 2 mon pods (out of 3) down to ZERO, leaving only ONE mon in the running state.

@fbalak, can you confirm what you mean by "the alert is not fired in time":

a. the alert is fired after 5m
OR
b. the alert is fired only after a very long wait (e.g. it stays in the Pending state for 10-20 minutes) and fires after that
OR
c. the alert is not fired at all

Thanks

Comment 5 arun kumar mohan 2022-03-17 11:04:11 UTC
One more thing: we haven't backported the CephMonQuorumLost alert to 4.8.
It is available from 4.9 onwards, as mentioned in this JIRA: https://issues.redhat.com/browse/RHSTOR-2491

Filip, can you please check the same with the latest 4.10 release?

Comment 6 Filip Balák 2022-03-17 11:09:16 UTC
The test waits 14 minutes for the alert: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/conftest.py#L122. During that time the alert is not even in the Pending state.
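
For context, a hypothetical simplification of such a wait (not the actual ocs-ci code; the Prometheus route and token are assumed values, as in the sketch in comment 4):

# Hypothetical polling loop, not the actual ocs-ci implementation.
import time
import requests

PROM_URL = "https://thanos-querier-openshift-monitoring.apps.example.com"  # assumed
TOKEN = "sha256~..."  # assumed, e.g. `oc whoami -t`

def alert_states(name):
    """Return the states of all active alerts with the given alertname."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/alerts",
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,
    )
    resp.raise_for_status()
    return [
        a["state"]
        for a in resp.json()["data"]["alerts"]
        if a["labels"].get("alertname") == name
    ]

# Poll for up to 14 minutes, as the test does.
deadline = time.time() + 14 * 60
while time.time() < deadline:
    states = alert_states("CephMonQuorumLost")
    if "firing" in states:
        print("alert fired")
        break
    print("current states:", states or "not present (not even Pending)")
    time.sleep(30)
else:
    raise AssertionError("CephMonQuorumLost did not fire within 14 minutes")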

I see in one test run that it is triggered correctly: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3440/
That suggests there could be some flakiness.

Also, as part of the test suite, a test that scales down one monitor, waits for the CephMonQuorumAtRisk alert, and scales the monitor back up runs first; only after that test finishes does this test, which scales down all monitors except one, run.

Comment 7 arun kumar mohan 2022-03-22 10:24:31 UTC
I think this is some flakiness in the way the test is being executed.
A wait of 14 minutes (or more) is required for 'CephMonQuorumAtRisk', which we expect to be triggered after 15 minutes when ONE mon (out of THREE) goes down.
Can we execute the test for the CephMonQuorumLost alert independently and verify that it is triggered in time?

Comment 8 Filip Balák 2022-03-23 15:21:43 UTC
I will try to do more testing and update here.

Comment 9 Mudit Agarwal 2022-03-23 15:29:01 UTC
Based on an offline discussion with QE, it was agreed that this is not an RC blocker. Keeping it open and moving it out of 4.10.

Comment 10 Nishanth Thomas 2022-04-27 14:24:22 UTC
Filip, can you please run the tests and provide the requested details?

Comment 11 arun kumar mohan 2022-05-11 10:51:52 UTC
Filip, are you able to reproduce this issue? If not, can we close it?
Thanks

Comment 12 arun kumar mohan 2022-05-25 05:34:13 UTC
Putting it back on QA for reproduction. Please let me know (including the steps) if it is reproducible; otherwise we can close this as a flaky issue.
Thanks

Comment 13 Mudit Agarwal 2022-05-30 12:58:26 UTC
ON_QA is not the correct state here.
Closing the BZ; let's reopen it if we see this again.