Bug 2218874

Summary: CephClusterWarningState alert doesn't disappear from UI when storage cluster recovers
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Aman Agrawal <amagrawa>
Component: ceph-monitoring    Assignee: arun kumar mohan <amohan>
Status: CLOSED NOTABUG QA Contact: Harish NV Rao <hnallurv>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.13    CC: dkamboj, fbalak, ocs-bugs, odf-bz-bot
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-07-24 06:19:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---

Comment 3 arun kumar mohan 2023-07-19 07:31:28 UTC
Hi,
The alert, `CephClusterWarningState`, depends on the query `ceph_health_status == 1` (a value of 1 means the Ceph cluster's health is in warning state).

In the above scenario, where the cluster was being tested for BZ#2218593 (StorageCluster goes to an error state on its own), the StorageCluster was recovered using the workaround of restarting the OCS operator, but `ceph_health_status` was still returning 1 (I believe, Aman, you were referring to some MDR crash). That is, the Ceph cluster health was still in warning state, so the alert remained, as expected.

@Aman, can you please try to reproduce the issue where we still see the `CephClusterWarningState` alert while the Ceph cluster is in HEALTH_OK state?
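
For reference, one way to confirm what the alert expression actually sees is to query the `ceph_health_status` metric through the cluster monitoring stack. A minimal sketch, assuming the default thanos-querier route in openshift-monitoring and a logged-in user with monitoring access:

$ TOKEN=$(oc whoami -t)
$ HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
$ # 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR
$ curl -k -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query?query=ceph_health_status"

If the query still returns 1 while `ceph -s` reports HEALTH_OK, that would point to a stale metric rather than a problem with the alert itself.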

Comment 5 Filip Balák 2023-07-21 13:59:18 UTC
I was not able to reproduce the bug. The CephClusterWarningState alert was correctly cleared when the Ceph health state was restored to HEALTH_OK. I recommend checking the Ceph status directly with the tools pod when this issue is encountered in the future:
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph -s

It is possible that Ceph was in a state that prevented it from returning to HEALTH_OK.
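
To see which specific warnings are keeping the cluster at HEALTH_WARN, `ceph health detail` can be run the same way (a sketch, assuming the tools pod carries the app=rook-ceph-tools label):

$ oc rsh -n openshift-storage \
    $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name) \
    ceph health detail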

Tested with:
ODF 4.13.1-9
OCP 4.13.0-0.nightly-2023-07-20-222544