Hi, the alert `CephClusterWarningState` is based on the query `ceph_health_status == 1` (a value of 1 means the Ceph cluster's health is in WARNING state). In the scenario above, where the cluster was being tested for BZ#2218593 (StorageCluster goes to an error state on its own) and the StorageCluster was recovered using the workaround of restarting the OCS Operator, `ceph_health_status` was still returning 1 (I believe, Aman, you were referring to some MDR crash). That is, the Ceph cluster health was still in WARNING state, so the alert remaining active is expected behavior. @Aman, can you please try to reproduce a case where the `CephClusterWarningState` alert is still firing while the Ceph cluster is in HEALTH_OK state?
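For reference, one way to confirm the value the alert is actually evaluating is to query the metric directly from in-cluster monitoring. This is a minimal sketch, assuming the default thanos-querier route in openshift-monitoring and a token with monitoring read access:

$ TOKEN=$(oc whoami -t)
$ HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query?query=ceph_health_status"
(expected values: 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR)

Comparing this value against `ceph -s` from the tools pod tells us whether the alert is stale or the exporter is genuinely still reporting a warning.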
I was not able to reproduce the bug. The CephClusterWarningState alert was correctly cleared once the Ceph health state was restored to HEALTH_OK. I recommend checking the Ceph status directly via the tools pod when this issue is encountered in the future:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage | grep tool | awk '{print $1}') ceph -s

It is possible that Ceph was in a state that prevented it from returning to HEALTH_OK.

Tested with:
ODF 4.13.1-9
OCP 4.13.0-0.nightly-2023-07-20-222544
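If the alert keeps firing, `ceph health detail` from the same tools pod shows which health check is holding the cluster in HEALTH_WARN. As a sketch (and, given the crash mentioned above, it may be worth listing recorded crashes as well, assuming the warning is crash-related):

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage | grep tool | awk '{print $1}') ceph health detail
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage | grep tool | awk '{print $1}') ceph crash ls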