Bug 2075062

Summary: activeAt value of CephClusterWarningState changes in time
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Filip Balák <fbalak>
Component: ceph-monitoring
Assignee: arun kumar mohan <amohan>
Status: CLOSED WORKSFORME
QA Contact: Harish NV Rao <hnallurv>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.10
CC: mmuench, muagarwa, nthomas, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-10-17 09:54:14 UTC
Type: Bug

Description Filip Balák 2022-04-13 14:09:40 UTC
Description of problem (please be as detailed as possible and provide log snippets):
During a test run that downscales one OSD to 0 replicas, the alert CephClusterWarningState is raised correctly. However, from the test logs it seems that the alert was created twice.

Version of all relevant components (if applicable):
ODF 4.10.0-220
OCP 4.10

Is this issue reproducible?
Not sure

Can this issue be reproduced from the UI?
Yes

Steps to Reproduce:
1. Downscale one OSD to 0 replicas
2. Check the Prometheus alerts of the cluster and wait 20 minutes (a scripted sketch of both steps follows below)
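A minimal sketch of both steps (not ocs-ci code), assuming the oc CLI is already logged in with sufficient rights; the deployment name rook-ceph-osd-0 and the prometheus-k8s route/token handling are assumptions and will differ per cluster:

    import subprocess
    import time

    import requests

    NAMESPACE = "openshift-storage"
    OSD_DEPLOYMENT = "rook-ceph-osd-0"  # hypothetical; pick any one OSD deployment

    # Step 1: downscale one OSD deployment to 0 replicas.
    subprocess.run(
        ["oc", "-n", NAMESPACE, "scale", "deployment", OSD_DEPLOYMENT, "--replicas=0"],
        check=True,
    )

    # Step 2: poll the in-cluster Prometheus alerts API for ~20 minutes and watch
    # the state and activeAt of CephClusterWarningState.
    prom_host = subprocess.run(
        ["oc", "-n", "openshift-monitoring", "get", "route", "prometheus-k8s",
         "-o", "jsonpath={.spec.host}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    token = subprocess.run(
        ["oc", "whoami", "-t"], capture_output=True, text=True, check=True,
    ).stdout.strip()

    for _ in range(20):
        resp = requests.get(
            f"https://{prom_host}/api/v1/alerts",
            headers={"Authorization": f"Bearer {token}"},
            verify=False,  # lab clusters often use self-signed certs
        )
        for alert in resp.json()["data"]["alerts"]:
            if alert["labels"]["alertname"] == "CephClusterWarningState":
                print(alert["state"], alert["activeAt"])
        time.sleep(60)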

Actual results:
During the test, the following changes were logged from the Prometheus API:
2022-04-10 19:25:01  17:25:01 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-04-10T17:24:36.986140336Z', 'value': '1e+00'} to alert list
2022-04-10 19:25:41  17:25:41 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-04-10T17:25:36.986140336Z', 'value': '1e+00'} to alert list
2022-04-10 19:39:39  17:39:39 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-04-10T17:24:36.986140336Z', 'value': '1e+00'} to alert list
2022-04-10 19:40:40  17:40:39 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-04-10T17:25:36.986140336Z', 'value': '1e+00'} to alert list

It seems that two identical alerts were logged whose activeAt values differ by one minute (or that there was one alert whose activeAt value changed over time).
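A small sketch of the check implied here, using two of the log entries above trimmed to the relevant fields: grouping the collected alerts by their full label set and collecting the distinct activeAt values makes the one-minute shift visible.

    from collections import defaultdict

    # Two entries from the log above, trimmed to the fields that matter here.
    collected_alerts = [
        {"labels": {"alertname": "CephClusterWarningState",
                    "pod": "rook-ceph-mgr-a-7cbfb75889-9xwbw",
                    "severity": "warning"},
         "state": "pending", "activeAt": "2022-04-10T17:24:36.986140336Z"},
        {"labels": {"alertname": "CephClusterWarningState",
                    "pod": "rook-ceph-mgr-a-7cbfb75889-9xwbw",
                    "severity": "warning"},
         "state": "pending", "activeAt": "2022-04-10T17:25:36.986140336Z"},
    ]

    # Group by the full label set; a single alert instance should keep one activeAt.
    active_at_by_labels = defaultdict(set)
    for alert in collected_alerts:
        key = tuple(sorted(alert["labels"].items()))
        active_at_by_labels[key].add(alert["activeAt"])

    for labels, timestamps in active_at_by_labels.items():
        if len(timestamps) > 1:
            print("activeAt changed:", sorted(timestamps))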

Expected results:
The value of activeAt for an alert raised for a single event (the CephClusterWarningState alert for the OSD-down event) should stay the same.

Additional info:
Found in test case test_ceph_osd_stopped from test run: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3884

Comment 2 Mudit Agarwal 2022-06-20 14:12:44 UTC
Not a 4.11 blocker

Comment 3 arun kumar mohan 2022-10-07 11:34:31 UTC
We have a few alerts which get fired (or activated) when we downscale an OSD to zero:

a. CephClusterWarningState
   Description: Storage cluster is in warning state for more than 15m.

b. CephDataRecoveryTakingTooLong (4 alerts, i.e. one alert for each pool)
   Description: Data recovery has been active for too long. Contact Support.

c. CephOSDDiskNotResponding
   Description: Disk device /dev/xvdba not responding, on host ip-10-0-128-75.us-west-2.compute.internal.


I could not see `CephClusterWarningState` being fired twice (at two different time intervals).
Need to check the automation setup to get more info.
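One possible way to check this from the cluster side rather than from the test logs (a sketch only; the Prometheus route and bearer token below are placeholders): query the ALERTS time series over the test window and see whether the alert stays active continuously or resolves and re-fires.

    import requests

    PROM_HOST = "<prometheus-k8s-route>"  # placeholder, cluster-specific
    TOKEN = "<bearer-token>"              # placeholder, e.g. from `oc whoami -t`

    resp = requests.get(
        f"https://{PROM_HOST}/api/v1/query_range",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={
            "query": 'ALERTS{alertname="CephClusterWarningState"}',
            "start": "2022-04-10T17:20:00Z",
            "end": "2022-04-10T17:45:00Z",
            "step": "30s",
        },
        verify=False,
    )

    # A pending stretch followed directly by a firing stretch indicates one alert
    # instance; the alert resolving and coming back would show up as a gap and a
    # second pending stretch.
    for series in resp.json()["data"]["result"]:
        print(series["metric"].get("alertstate"), len(series["values"]), "samples")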

Comment 5 arun kumar mohan 2022-10-17 09:54:14 UTC
As per the above comment, closing the BZ as works-for-me.
To reiterate, this is a test-environment-only issue (and not a product/cluster issue).
If we are hitting this again, we should be looking at (debugging) the test framework (or test setup).

Please re-open this if this issue is hit again.