Bug 2075062 - activeAt value of CephClusterWarningState changes in time
Summary: activeAt value of CephClusterWarningState changes in time
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: arun kumar mohan
QA Contact: Harish NV Rao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-13 14:09 UTC by Filip Balák
Modified: 2023-08-09 16:37 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-17 09:54:14 UTC
Embargoed:



Description Filip Balák 2022-04-13 14:09:40 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
During a test run that downscales one OSD to 0 replicas, the CephClusterWarningState alert is correctly raised. From the test logs it seems that the alert was created twice.

Version of all relevant components (if applicable):
ODF 4.10.0-220
OCP 4.10

Is this issue reproducible?
Not sure

Can this issue be reproduced from the UI?
Yes

Steps to Reproduce:
1. Downscale one OSD to 0 replicas.
2. Check the Prometheus alerts of the cluster and wait 20 minutes (a sketch of such a check is shown below).
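
As an illustration only (not part of the original report), a minimal way to watch this alert is to poll the Prometheus alerts API and print the activeAt value of CephClusterWarningState. The route URL, the PROM_TOKEN environment variable, and the use of `requests` are assumptions for this sketch, not the actual ocs_ci tooling:

import os
import requests

# Minimal sketch: poll the in-cluster Prometheus alerts API and print the
# activeAt value of CephClusterWarningState.
# PROM_URL and PROM_TOKEN are assumptions -- substitute the route and bearer
# token of your own cluster (e.g. from `oc whoami -t`).
PROM_URL = os.environ.get(
    "PROM_URL",
    "https://prometheus-k8s-openshift-monitoring.apps.example.com",
)
TOKEN = os.environ["PROM_TOKEN"]

resp = requests.get(
    f"{PROM_URL}/api/v1/alerts",
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # lab clusters often use self-signed certificates
)
resp.raise_for_status()

for alert in resp.json()["data"]["alerts"]:
    if alert["labels"].get("alertname") == "CephClusterWarningState":
        # For a single OSD-down event, activeAt is expected to stay constant
        # while the alert moves from 'pending' to 'firing'.
        print(alert["state"], alert["activeAt"])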

Actual results:
During the test, the following changes were logged from the Prometheus API:
2022-04-10 19:25:01  17:25:01 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-04-10T17:24:36.986140336Z', 'value': '1e+00'} to alert list
2022-04-10 19:25:41  17:25:41 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-04-10T17:25:36.986140336Z', 'value': '1e+00'} to alert list
2022-04-10 19:39:39  17:39:39 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-04-10T17:24:36.986140336Z', 'value': '1e+00'} to alert list
2022-04-10 19:40:40  17:40:39 - Thread-14 - ocs_ci.utility.workloadfixture - INFO - Adding {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.131.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7cbfb75889-9xwbw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-04-10T17:25:36.986140336Z', 'value': '1e+00'} to alert list

It seems that two identical alerts were logged whose activeAt values differ by one minute (or that it is one alert whose activeAt value changes over time).
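
As a further illustration (again not part of the report), one way to confirm whether this is a single alert whose activeAt drifts, rather than two distinct alert instances, is to group the collected alert dicts by their full label set and list the distinct activeAt timestamps per group. The `collected_alerts` name below is hypothetical and does not refer to an actual ocs_ci variable:

from collections import defaultdict

# Minimal sketch: `collected_alerts` is assumed to hold the alert dicts exactly
# as logged above (hypothetical name, not part of ocs_ci).
def active_at_by_labels(collected_alerts):
    # Group alerts by their complete label set; identical labels mean the same
    # alert instance from Prometheus' point of view.
    groups = defaultdict(set)
    for alert in collected_alerts:
        key = tuple(sorted(alert["labels"].items()))
        groups[key].add(alert["activeAt"])
    return groups

def report_drift(collected_alerts):
    # Any group with more than one activeAt value matches the behaviour seen
    # here: the same CephClusterWarningState instance reports activeAt values
    # one minute apart.
    for labels, timestamps in active_at_by_labels(collected_alerts).items():
        if len(timestamps) > 1:
            alertname = dict(labels).get("alertname", "<unknown>")
            print(f"{alertname}: {sorted(timestamps)}")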

Expected results:
The activeAt value of an alert raised for a single event (the CephClusterWarningState alert for the OSD-down event) should stay the same.

Additional info:
Found in test case test_ceph_osd_stopped from test run: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3884

Comment 2 Mudit Agarwal 2022-06-20 14:12:44 UTC
Not a 4.11 blocker

Comment 3 arun kumar mohan 2022-10-07 11:34:31 UTC
We have a few alerts which get fired (activated) when we downscale an OSD to zero:

a. CephClusterWarningState
   Description: Storage cluster is in warning state for more than 15m.

b. CephDataRecoveryTakingTooLong (4 alerts, i.e. one alert for each pool)
   Description: Data recovery has been active for too long. Contact Support.

c. CephOSDDiskNotResponding
   Description: Disk device /dev/xvdba not responding, on host ip-10-0-128-75.us-west-2.compute.internal.


I could not see `CephClusterWarningState` being fired twice (at two different time intervals).
We need to check the automation setup to get more info.

Comment 5 arun kumar mohan 2022-10-17 09:54:14 UTC
As per the above comment, closing the BZ as works-for-me.
To reiterate, this is a test-environment-only issue (not a product/cluster issue).
If we hit this again, we should look at (debug) the test framework (or test setup).

Please re-open this if this issue is hit again.

