Description of problem (please be as detailed as possible and provide log snippets):
CephMonQuorumLost is not triggered when 2 of 3 monitors are down.

Version of all relevant components (if applicable):
OCS 4.8.9-1
OCP 4.8

Is this issue reproducible?
Yes; it seems that the problem is also present in ODF 4.10.

Steps to Reproduce:
1. Get the list of monitor deployments.
2. Scale all of them to 0 except for one.
3. Check alerting.
(A sketch of these steps with oc commands is given below.)

Actual results:
The CephMonQuorumLost alert is not triggered in time.

Expected results:
The CephMonQuorumLost alert should be triggered.

Additional info:
Based on this test case run (OCS 4.8.9-1): https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3425/testReport/junit/tests.manage.monitoring.prometheus/test_deployment_status/test_ceph_mons_quorum_lost_True_/
Reproduced in ODF 4.10: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3516
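A minimal sketch of the reproduction, assuming the default openshift-storage namespace and Rook's rook-ceph-mon deployment names and labels (adjust for the actual cluster):

# List the monitor deployments (typically rook-ceph-mon-a/b/c)
oc get deployments -n openshift-storage -l app=rook-ceph-mon

# Scale all of them to 0 except one, e.g. keep rook-ceph-mon-a running
oc scale deployment rook-ceph-mon-b -n openshift-storage --replicas=0
oc scale deployment rook-ceph-mon-c -n openshift-storage --replicas=0

# Then check whether CephMonQuorumLost shows up (Pending/Firing) in the
# console alerting UI or via the monitoring API (see the query sketch below).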
Checked with the latest ODF 4.10 (at this time 4.10.0-198) on the latest OpenShift 4.10.0-0.nightly-2022-01-31-012936. The CephMonQuorumLost alert is triggered at the expected time, i.e. after 5m. The alert is raised and stays in 'Pending' state for the configured duration (here 5m), and only after that is it fired. Attaching screenshots. The steps followed are the same as Filip's: scaled down two of the three mon deployments to zero, leaving only one mon running.
@fbalak, can you confirm what you mean by the alert not being fired in time:
a. the alert is fired after 5m, OR
b. the alert is fired after a very long wait (e.g. it stays in Pending state for 10-20 minutes and only fires after that), OR
c. the alert is not fired at all?
Thanks
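For reference, the alert's Pending/Firing state can also be checked from the CLI by querying the ALERTS metric; a sketch assuming the default thanos-querier route in openshift-monitoring and a token with permission to query cluster monitoring:

TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')

# ALERTS carries an alertstate label ("pending" or "firing") for each active alert
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode 'query=ALERTS{alertname="CephMonQuorumLost"}' \
  | jq '.data.result'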
One more thing: we haven't backported the CephMonQuorumLost alert to 4.8. It is available from 4.9 onwards, as mentioned in this JIRA: https://issues.redhat.com/browse/RHSTOR-2491 Filip, can you please check the same with the latest 4.10 release?
The test waits 14 minutes for the alert: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/conftest.py#L122. During that time the alert is not even in Pending state. I see in one test run that it is triggered correctly: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3440/ so there could be some flakiness. Also, as part of the test suite, a test that scales down one monitor, waits for the CephMonQuorumAtRisk alert, and scales the monitor back up runs first; only after it finishes does this test, which scales down all monitors except one, get executed.
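As a hypothetical way to distinguish "fires late" from "never fires" when reproducing manually, the ALERTS metric could be polled while the mons are scaled down (reusing TOKEN and HOST from the query sketch above, and roughly mirroring the ~14-minute wait the test uses):

# Poll every 30s for up to 14 minutes and print the alert state as it changes
for i in $(seq 1 28); do
  date
  curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=ALERTS{alertname="CephMonQuorumLost"}' \
    | jq -r '.data.result[].metric.alertstate'
  sleep 30
done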
I think this is some flakiness in the way the test is being executed. A 14-minute (or longer) wait is required for 'CephMonQuorumAtRisk', where we expect that alert to be triggered after 15 minutes when ONE mon goes down (out of THREE). Can we execute the test for the CephMonQuorumLost alert independently and verify that it is triggered in time?
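To confirm which 'for:' duration is actually configured for each of these alerts on a given build, the deployed rules can be inspected directly; a sketch assuming they are shipped as a PrometheusRule object in openshift-storage (the exact object name may vary):

# Show the rule definitions, including the 'for:' duration after which
# a pending alert transitions to firing
oc get prometheusrules -n openshift-storage -o yaml \
  | grep -B2 -A8 'alert: CephMonQuorum'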
I will try to do more testing and update here.
Based on an offline discussion with QE, it was agreed that this is not an RC blocker. Keeping it open and moving it out of 4.10.
Filip, can you please run the tests and provide the requested details?
Filip, are you able to reproduce this issue? If not, can we close this? Thanks
Putting it back to QA for reproduction. Please let me know if it is reproducible (along with the steps); otherwise we can close this as a flaky issue. Thanks
ON_QA is not the correct state here. Closing the BZ; let's reopen if we see it again.