Bug 2075062

Summary: activeAt value of CephClusterWarningState changes in time
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Filip Balák <fbalak>
Component: ceph-monitoring
Assignee: arun kumar mohan <amohan>
Status: CLOSED WORKSFORME
QA Contact: Harish NV Rao <hnallurv>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.10
CC: mmuench, muagarwa, nthomas, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-10-17 09:54:14 UTC
Type: Bug

Description Filip Balák 2022-04-13 14:09:40 UTC
Description of problem (please be as detailed as possible and provide log snippets):
During a test run that downscales one OSD to 0 replicas, the alert CephClusterWarningState is raised correctly. However, from the test logs it seems that the alert was created twice.

Version of all relevant components (if applicable):
ODF 4.10.0-220
OCP 4.10

Is this issue reproducible?
Not sure

Can this issue be reproduced from the UI?
Yes

Steps to Reproduce:
1. Downscale one OSD to 0 replicas
2. Check the Prometheus alerts of the cluster and wait 20 minutes (a scripted sketch of both steps follows below)
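A minimal sketch of both steps (not ocs-ci code), assuming the oc CLI is already logged in with sufficient rights; the deployment name rook-ceph-osd-0 and the prometheus-k8s route/token handling are assumptions and will differ per cluster:

    import subprocess
    import time

    import requests

    NAMESPACE = "openshift-storage"
    OSD_DEPLOYMENT = "rook-ceph-osd-0"  # hypothetical; pick any one OSD deployment

    # Step 1: downscale one OSD deployment to 0 replicas.
    subprocess.run(
        ["oc", "-n", NAMESPACE, "scale", "deployment", OSD_DEPLOYMENT, "--replicas=0"],
        check=True,
    )

    # Step 2: poll the in-cluster Prometheus alerts API for ~20 minutes and watch
    # the state and activeAt of CephClusterWarningState.
    prom_host = subprocess.run(
        ["oc", "-n", "openshift-monitoring", "get", "route", "prometheus-k8s",
         "-o", "jsonpath={.spec.host}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    token = subprocess.run(
        ["oc", "whoami", "-t"], capture_output=True, text=True, check=True,
    ).stdout.strip()

    for _ in range(20):
        resp = requests.get(
            f"https://{prom_host}/api/v1/alerts",
            headers={"Authorization": f"Bearer {token}"},
            verify=False,  # lab clusters often use self-signed certs
        )
        for alert in resp.json()["data"]["alerts"]:
            if alert["labels"]["alertname"] == "CephClusterWarningState":
                print(alert["state"], alert["activeAt"])
        time.sleep(60)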

Actual results:
During the test, the following changes were logged from the Prometheus API:
2022-04-10 19:25:01  17:25:01 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-04-10T17:24:36.986140336Z', 'value': '1e+00'} to alert list
2022-04-10 19:25:41  17:25:41 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-04-10T17:25:36.986140336Z', 'value': '1e+00'} to alert list
2022-04-10 19:39:39  17:39:39 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-04-10T17:24:36.986140336Z', 'value': '1e+00'} to alert list
2022-04-10 19:40:40  17:40:39 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-04-10T17:25:36.986140336Z', 'value': '1e+00'} to alert list

It seems that two identical alerts were logged whose activeAt values differ by one minute (or that there was one alert whose activeAt value changed over time).
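A small sketch of the check implied here, using two of the log entries above trimmed to the relevant fields: grouping the collected alerts by their full label set and collecting the distinct activeAt values makes the one-minute shift visible.

    from collections import defaultdict

    # Two entries from the log above, trimmed to the fields that matter here.
    collected_alerts = [
        {"labels": {"alertname": "CephClusterWarningState",
                    "pod": "rook-ceph-mgr-a-7cbfb75889-9xwbw",
                    "severity": "warning"},
         "state": "pending", "activeAt": "2022-04-10T17:24:36.986140336Z"},
        {"labels": {"alertname": "CephClusterWarningState",
                    "pod": "rook-ceph-mgr-a-7cbfb75889-9xwbw",
                    "severity": "warning"},
         "state": "pending", "activeAt": "2022-04-10T17:25:36.986140336Z"},
    ]

    # Group by the full label set; a single alert instance should keep one activeAt.
    active_at_by_labels = defaultdict(set)
    for alert in collected_alerts:
        key = tuple(sorted(alert["labels"].items()))
        active_at_by_labels[key].add(alert["activeAt"])

    for labels, timestamps in active_at_by_labels.items():
        if len(timestamps) > 1:
            print("activeAt changed:", sorted(timestamps))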

Expected results:
The value of activeAt for an alert raised for a single event (the CephClusterWarningState alert for the OSD-down event) should stay the same.

Additional info:
Found in test case test_ceph_osd_stopped from test run: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3884

Comment 2 Mudit Agarwal 2022-06-20 14:12:44 UTC
Not a 4.11 blocker

Comment 3 arun kumar mohan 2022-10-07 11:34:31 UTC
We have a few alerts which get fired (or activated) when we downscale an OSD to zero:

a. CephClusterWarningState
   Description: Storage cluster is in warning state for more than 15m.

b. CephDataRecoveryTakingTooLong (4 alerts, i.e. one alert for each pool)
   Description: Data recovery has been active for too long. Contact Support.

c. CephOSDDiskNotResponding
   Description: Disk device /dev/xvdba not responding, on host ip-10-0-128-75.us-west-2.compute.internal.


I could not see `CephClusterWarningState` being fired twice (at two different time intervals).
Need to check the automation setup to get more info.
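One possible way to check this from the cluster side rather than from the test logs (a sketch only; the Prometheus route and bearer token below are placeholders): query the ALERTS time series over the test window and see whether the alert stays active continuously or resolves and re-fires.

    import requests

    PROM_HOST = "<prometheus-k8s-route>"  # placeholder, cluster-specific
    TOKEN = "<bearer-token>"              # placeholder, e.g. from `oc whoami -t`

    resp = requests.get(
        f"https://{PROM_HOST}/api/v1/query_range",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={
            "query": 'ALERTS{alertname="CephClusterWarningState"}',
            "start": "2022-04-10T17:20:00Z",
            "end": "2022-04-10T17:45:00Z",
            "step": "30s",
        },
        verify=False,
    )

    # A pending stretch followed directly by a firing stretch indicates one alert
    # instance; the alert resolving and coming back would show up as a gap and a
    # second pending stretch.
    for series in resp.json()["data"]["result"]:
        print(series["metric"].get("alertstate"), len(series["values"]), "samples")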

Comment 5 arun kumar mohan 2022-10-17 09:54:14 UTC
As per the above comment, closing the BZ as works-for-me.
To reiterate, this is a test-environment-only issue (and not a product/cluster issue).
If we are hitting this again, we should be looking at (debugging) the test framework (or test setup).

Please re-open this if this issue is hit again.