Bug 2064736
| Summary: | CephMonQuorumLost is not triggered when 2 of 3 monitors are down | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Filip Balák <fbalak> |
| Component: | ceph-monitoring | Assignee: | Filip Balák <fbalak> |
| Status: | CLOSED WORKSFORME | QA Contact: | Filip Balák <fbalak> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.10 | CC: | mmuench, muagarwa, nthomas, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | Keywords: | AutomationBlocker, Regression |
| Target Release: | --- | Flags: | nthomas: needinfo? (fbalak) |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-05-30 12:58:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Filip Balák
2022-03-16 13:03:00 UTC
Checked with the latest ODF 4.10 (at this time it is 4.10.0-198) on the latest OpenShift 4.10.0-0.nightly-2022-01-31-012936. The CephMonQuorumLost alert is triggered at the specified time, that is, after 5m. The alert is raised and stays in the 'Pending' state for the configured number of minutes (here 5m), and only after that does it fire. Providing the screenshots. The steps followed are the same as Filip's: scaled down the deployments of 2 mon pods (out of 3) to zero, leaving only one mon in the running state.

@fbalak, can you confirm what you mean when you say the alert is not fired in time:
a. the alert is fired after 5m, OR
b. the alert is fired after a very long wait (for example, it stays in the Pending state for 10-20 minutes) and fires after that, OR
c. the alert is not fired at all?
Thanks

One more thing: we haven't backported the CephMonQuorumLost alert to 4.8. It is available from 4.9 onwards, as mentioned in this JIRA: https://issues.redhat.com/browse/RHSTOR-2491

Filip, can you please check the same with the latest 4.10 release?

The test waits 14 minutes for the alert: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/conftest.py#L122. During that time it is not in the Pending state. I see in one test run that it is triggered correctly: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3440/ That means that there could be some flakiness. Also, as part of the test suite, a test is executed first that turns down one monitor, waits for the CephMonQuorumAtRisk alert, and scales the monitor back up. Only after that test finishes is this test, which turns down all monitors except one, executed.

I think this is some flakiness in the way the test is being executed. A wait of 14 mins (or more) is required for 'CephMonQuorumAtRisk', where we expect the alert to be triggered after 15 mins when ONE mon (out of THREE) goes down. Can we execute the test for the CephMonQuorumLost alert independently and verify that it is triggered in time?
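The timing discussed in the comments above can be illustrated with a small sketch. It models only the Pending-then-firing behaviour described in the thread and the majority rule for a 3-monitor cluster. The 5-minute and 15-minute durations are the values quoted in the comments, not values read from the shipped alert rules, and `AlertRule`, `has_quorum`, `alert_state` and `fires_within` are hypothetical names that do not come from ocs-ci or Prometheus. A real Prometheus server also adds delay from its rule evaluation interval, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical, simplified stand-in for an alerting rule."""
    name: str
    for_seconds: int  # how long the condition must hold before the alert fires


# Durations as quoted in the bug comments (assumptions, not verified rule values).
QUORUM_LOST = AlertRule("CephMonQuorumLost", 5 * 60)
QUORUM_AT_RISK = AlertRule("CephMonQuorumAtRisk", 15 * 60)


def has_quorum(mons_up: int, mons_total: int) -> bool:
    """Quorum needs a strict majority of monitors to be up."""
    return mons_up > mons_total // 2


def alert_state(rule: AlertRule, condition_held_seconds: Optional[float]) -> str:
    """Return 'inactive', 'pending' or 'firing' for a rule.

    condition_held_seconds is None when the alert condition is not true,
    otherwise the time for which it has been continuously true.
    """
    if condition_held_seconds is None:
        return "inactive"
    if condition_held_seconds < rule.for_seconds:
        return "pending"
    return "firing"


def fires_within(rule: AlertRule, wait_seconds: float) -> bool:
    """Would the alert be firing if the condition held for the whole wait?"""
    return alert_state(rule, wait_seconds) == "firing"


if __name__ == "__main__":
    # 1 of 3 monitors up: no majority, so the quorum-lost condition is true.
    print("quorum with 1/3 mons:", has_quorum(1, 3))
    # A 14-minute test wait compared against the two 'for' durations.
    for rule in (QUORUM_LOST, QUORUM_AT_RISK):
        print(rule.name, "fires within 14m:", fires_within(rule, 14 * 60))
```

Under these assumptions, a 14-minute wait covers an alert that needs 5 minutes in Pending but not one that needs 15. It does not explain an alert that never enters Pending at all, which is the behaviour reported for the failing runs.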
I will try to do more testing and update here.

Based on an offline discussion with QE, it was agreed that this is not an RC blocker. Keeping it open and moving it out of 4.10.

Filip, can you please run the tests and provide the details requested?

Filip, are you able to repro this issue? If not, can we close this? Thanks

Putting it back to QA for repro. Please let me know (with the steps as well) if it is reproducible, or else we can close this as a flaky issue. Thanks

ON_QA is not the correct state here. Closing the BZ; let's reopen it if we see the issue again. |