Description of problem (please be as detailed as possible and provide log snippets):

During a test run that downscales one OSD to 0 replicas, the alert CephClusterWarningState is raised as expected. However, from the test logs it appears that the alert was created twice.

Version of all relevant components (if applicable):
ODF 4.10.0-220
OCP 4.10

Is this issue reproducible?
Not sure

Can this issue be reproduced from the UI?
Yes

Steps to Reproduce:
1. Downscale one OSD to 0 replicas
2. Check the Prometheus alerts of the cluster and wait 20 minutes
(a scripted sketch of these steps is included at the end of this description)

Actual results:
During the test, the following changes were logged from the Prometheus API:

2022-04-10 19:25:01 17:25:01 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-04-10T17:24:36.986140336Z', 'value': '1e+00'} to alert list

2022-04-10 19:25:41 17:25:41 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-04-10T17:25:36.986140336Z', 'value': '1e+00'} to alert list

2022-04-10 19:39:39 17:39:39 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-04-10T17:24:36.986140336Z', 'value': '1e+00'} to alert list

2022-04-10 19:40:40 17:40:39 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-04-10T17:25:36.986140336Z', 'value': '1e+00'} to alert list

It seems that two identical alerts were logged, differing only in the activeAt value by one minute (or that a single alert was logged whose activeAt value changed over time).
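For reference, a minimal sketch (not ocs-ci code; the helper name is hypothetical) of how entries like the ones above can be grouped offline to decide between the two interpretations: if a single label set maps to more than one activeAt value, Prometheus reported the "same" alert with a shifted activation time.

from collections import defaultdict

def group_by_labels(alert_entries):
    # Group alert entries by their full label set and collect the
    # distinct activeAt values seen for each group.
    groups = defaultdict(set)
    for entry in alert_entries:
        key = tuple(sorted(entry["labels"].items()))
        groups[key].add(entry["activeAt"])
    return groups

# Two of the log entries above, reduced to the fields that matter here.
entries = [
    {"labels": {"alertname": "CephClusterWarningState",
                "pod": "rook-ceph-mgr-a-7cbfb75889-9xwbw",
                "severity": "warning"},
     "state": "firing",
     "activeAt": "2022-04-10T17:24:36.986140336Z"},
    {"labels": {"alertname": "CephClusterWarningState",
                "pod": "rook-ceph-mgr-a-7cbfb75889-9xwbw",
                "severity": "warning"},
     "state": "firing",
     "activeAt": "2022-04-10T17:25:36.986140336Z"},
]

for labels, active_at_values in group_by_labels(entries).items():
    # More than one activeAt for the same label set reproduces the
    # behaviour described in the analysis above.
    print(dict(labels), "->", sorted(active_at_values))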
Expected results:
The activeAt value of an alert raised for a single event (here, CephClusterWarningState for the OSD-down event) should stay the same.

Additional info:
Found in test case test_ceph_osd_stopped from test run: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3884
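The reproduction steps above could be scripted roughly as follows. This is only a sketch, not the test framework code: the OSD deployment name is an example, and the Prometheus route and bearer token are assumed to be provided by the environment.

import os
import subprocess
import time

import requests

# Assumed to be provided externally: the route to the cluster's
# Prometheus instance and a bearer token with permission to query it.
PROMETHEUS_URL = os.environ["PROMETHEUS_URL"]
TOKEN = os.environ["PROMETHEUS_TOKEN"]

# Step 1: downscale one OSD deployment to zero replicas.  The deployment
# name is an example; use any rook-ceph-osd-<id> present in the cluster.
subprocess.run(
    ["oc", "-n", "openshift-storage", "scale", "deployment",
     "rook-ceph-osd-0", "--replicas=0"],
    check=True,
)

# Step 2: poll the alerts endpoint for ~20 minutes and record every
# (state, activeAt) pair reported for CephClusterWarningState.
seen = set()
for _ in range(20):
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/alerts",
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,
    )
    resp.raise_for_status()
    for alert in resp.json()["data"]["alerts"]:
        if alert["labels"].get("alertname") == "CephClusterWarningState":
            seen.add((alert["state"], alert["activeAt"]))
    time.sleep(60)

# Expected: a single activeAt value (first pending, later firing).
# More than one distinct activeAt reproduces the reported behaviour.
print(sorted(seen))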
Not a 4.11 blocker
We have a few alerts which get fired (or activated) when we downscale an OSD to zero:

a. CephClusterWarningState
   Description: Storage cluster is in warning state for more than 15m.
b. CephDataRecoveryTakingTooLong (4 alerts, i.e. one alert for each pool)
   Description: Data recovery has been active for too long. Contact Support.
c. CephOSDDiskNotResponding
   Description: Disk device /dev/xvdba not responding, on host ip-10-0-128-75.us-west-2.compute.internal.

I could not see `CephClusterWarningState` being fired twice (at two different time intervals). We need to check the test automation setup to get more info.
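For completeness, a minimal sketch (again an assumption-laden illustration, not the check that was actually run) of how the active alerts in the openshift-storage namespace can be listed and grouped by alertname, which makes it easy to see whether CephClusterWarningState appears once or with two different activeAt values, and that CephDataRecoveryTakingTooLong appears once per pool:

from collections import defaultdict

import requests

def active_storage_alerts(prometheus_url, token):
    # Return active alerts from the openshift-storage namespace,
    # grouped by alertname.
    resp = requests.get(
        f"{prometheus_url}/api/v1/alerts",
        headers={"Authorization": f"Bearer {token}"},
        verify=False,
    )
    resp.raise_for_status()
    grouped = defaultdict(list)
    for alert in resp.json()["data"]["alerts"]:
        labels = alert["labels"]
        if labels.get("namespace") == "openshift-storage":
            grouped[labels["alertname"]].append(
                {"state": alert["state"], "activeAt": alert["activeAt"]}
            )
    return grouped

# Example usage (URL and TOKEN as in the earlier sketch):
# for name, instances in active_storage_alerts(URL, TOKEN).items():
#     print(name, len(instances), instances)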
As per the above comment, closing the BZ as works-for-me. To re-iterate, this is a test environment issue only (not a product/cluster issue). If we hit this again, we should look at (debug) the test framework (or test setup). Please re-open this if the issue is hit again.