Bug 2075062
| Summary: | activeAt value of CephClusterWarningState changes in time | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Filip Balák <fbalak> |
| Component: | ceph-monitoring | Assignee: | arun kumar mohan <amohan> |
| Status: | CLOSED WORKSFORME | QA Contact: | Harish NV Rao <hnallurv> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | mmuench, muagarwa, nthomas, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-10-17 09:54:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Not a 4.11 blocker.

We have a few alerts which get fired (or get activated) when we downscale an OSD to zero:

a. CephClusterWarningState
   Description: Storage cluster is in warning state for more than 15m.
b. CephDataRecoveryTakingTooLong (4 alerts, i.e. one alert for each pool)
   Description: Data recovery has been active for too long. Contact Support.
c. CephOSDDiskNotResponding
   Description: Disk device /dev/xvdba not responding, on host ip-10-0-128-75.us-west-2.compute.internal.

I could not see `CephClusterWarningState` being fired twice (at two different time intervals). Need to check the automation setup to get more info.

As per the above comment, closing the BZ as works-for-me. To re-iterate, this is a test-environment-only issue (not a product/cluster issue). If we hit this again, we should be looking at (debugging) the test framework or test setup.

Please re-open this if the issue is hit again.
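If the test framework does need debugging, one thing worth checking is how the automation decides whether an alert entry is "new": two samples of the same alert that differ only in `activeAt` should arguably be treated as one alert. The following is a minimal, illustrative sketch of such a comparison (the function names are hypothetical and this is not the actual ocs_ci code); the alert dicts are assumed to be shaped like the entries in the logs below.

```python
def alert_identity(alert):
    """Identity of an alert entry, ignoring volatile fields such as activeAt.

    `alert` is assumed to be a dict shaped like the Prometheus /api/v1/alerts
    entries seen in the test logs (labels, annotations, state, activeAt, value).
    """
    return (
        tuple(sorted(alert.get("labels", {}).items())),
        alert.get("state"),
    )


def add_alert(alert_list, new_alert):
    """Append new_alert to alert_list unless an equivalent entry is already there."""
    if any(alert_identity(a) == alert_identity(new_alert) for a in alert_list):
        return alert_list  # same alert; only activeAt (or value) drifted
    alert_list.append(new_alert)
    return alert_list
```

With such a comparison, the two `pending` entries from the logs below (which differ only in `activeAt`) would be counted as a single alert.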
Description of problem (please be as detailed as possible and provide log snippets):

During a test run that downscales one OSD to 0 replicas, the alert CephClusterWarningState is correctly raised. However, from the test logs it seems that the alert was created twice.

Version of all relevant components (if applicable):
ODF 4.10.0-220
OCP 4.10

Is this issue reproducible?
Not sure

Can this issue be reproduced from the UI?
Yes

Steps to Reproduce:
1. Downscale one OSD to 0 replicas.
2. Check the Prometheus alerts of the cluster and wait 20 minutes.

Actual results:

During the test, the following changes were logged from the Prometheus API:

2022-04-10 19:25:01 17:25:01 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-04-10T17:24:36.986140336Z', 'value': '1e+00'} to alert list

2022-04-10 19:25:41 17:25:41 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-04-10T17:25:36.986140336Z', 'value': '1e+00'} to alert list

2022-04-10 19:39:39 17:39:39 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-04-10T17:24:36.986140336Z', 'value': '1e+00'} to alert list

2022-04-10 19:40:40 17:40:39 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-04-10T17:25:36.986140336Z', 'value': '1e+00'} to alert list

It seems that two otherwise identical alerts were logged whose activeAt values differ by one minute (or that a single alert's activeAt value changes over time). A rough sketch of how such samples can be collected follows.
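The sketch below illustrates reproduce steps 1 and 2: scale one OSD deployment to zero and then watch the CephClusterWarningState alert via the Prometheus API. The `/api/v1/alerts` endpoint and its response shape are standard Prometheus; the Prometheus route, bearer token, and OSD deployment name are placeholders that depend on the cluster under test, and this is not the ocs_ci implementation.

```python
import subprocess
import time

import requests

# Placeholders -- adjust for the cluster under test.
PROM_URL = "https://<prometheus-route>"   # e.g. the route in openshift-monitoring
TOKEN = "<bearer-token>"                  # e.g. output of `oc whoami -t`
OSD_DEPLOYMENT = "rook-ceph-osd-0"        # OSD deployment name varies per cluster

# Step 1: downscale one OSD deployment to 0 replicas.
subprocess.run(
    ["oc", "-n", "openshift-storage", "scale", "deployment",
     OSD_DEPLOYMENT, "--replicas=0"],
    check=True,
)

# Step 2: poll the Prometheus alerts API for ~20 minutes and watch activeAt.
for _ in range(40):
    resp = requests.get(
        f"{PROM_URL}/api/v1/alerts",
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # test clusters often use self-signed certificates
    )
    resp.raise_for_status()
    for alert in resp.json()["data"]["alerts"]:
        if alert["labels"].get("alertname") == "CephClusterWarningState":
            print(alert["state"], alert["activeAt"])
    time.sleep(30)
```

In the failing run, this kind of polling produced the four log entries above: two pending and two firing samples with activeAt values one minute apart.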
Expected results:

The activeAt value of an alert raised for a single event (here the CephClusterWarningState alert for the OSD-down event) should stay the same over time.

Additional info:

Found in test case test_ceph_osd_stopped from test run: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3884
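For completeness, a minimal sketch of the expected behaviour expressed as a check, assuming the polled samples for one alert are collected as (state, activeAt) pairs; the function name and sample list are illustrative only, and the example values come from the logs above.

```python
def active_at_is_stable(samples):
    """Return True if activeAt never changes while the alert stays active.

    `samples` is an ordered list of (state, activeAt) pairs observed for one
    alert instance. A pending -> firing transition for the same event should
    keep the same activeAt timestamp.
    """
    active = [ts for state, ts in samples if state in ("pending", "firing")]
    return len(set(active)) <= 1


# Values taken from the log entries above: activeAt differs between samples,
# which is the behaviour this BZ reports.
observed = [
    ("pending", "2022-04-10T17:24:36.986140336Z"),
    ("pending", "2022-04-10T17:25:36.986140336Z"),
    ("firing", "2022-04-10T17:24:36.986140336Z"),
    ("firing", "2022-04-10T17:25:36.986140336Z"),
]
assert not active_at_is_stable(observed)
```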