1962161 – Alert 'ClusterObjectStoreState' is not triggered when RGW interface is unavailable

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	ceph-monitoring
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	OCS 4.7.2
Assignee:	Anmol Sachan
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:
Depends On:	1948378
Blocks:

Reported:	2021-05-19 12:16 UTC by Anmol Sachan
Modified:	2022-02-14 08:26 UTC
CC List:	12 users
Fixed In Version:	v4.7.2-429.ci
Doc Type:	Bug Fix
Doc Text:	Previously, the ClusterObjectStoreState alert was not generated if the RADOS Object Gateway (RGW) was unavailable or unhealthy. With this update, a fix has been implemented in the OpenShift Container Storage operator, and users can now see the ClusterObjectStoreState alert when the RGW is unavailable or unhealthy.
Clone Of:	1948378
Environment:
Last Closed:	2022-02-14 08:26:57 UTC
Embargoed:
Dependent Products:

Attachments
rgw alert test (52.84 KB, text/plain) 2021-06-30 06:49 UTC, Abdul Kandathil (IBM)	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift ocs-operator pull 1174	None	closed	fix ClusterObjectStoreState Alert empty spec	2021-06-21 11:26:02 UTC
Github	openshift ocs-operator pull 1233	None	open	Bug 1962161: [release-4.7] fix ClusterObjectStoreState Alert empty spec	2021-06-21 11:26:30 UTC
Red Hat Product Errata	RHBA-2021:2632	None	None	None	2021-06-30 19:23:01 UTC

Comment 5 Mudit Agarwal 2021-06-23 11:16:04 UTC

Please add doc text

Comment 14 Abdul Kandathil (IBM) 2021-06-30 06:49:50 UTC

Created attachment 1796115
rgw alert test

We ran the OCS CI test "tests/manage/monitoring/prometheus/test_rgw.py::test_rgw_unavailable" on IBM Z.
The alert itself appears to be working, but the test is failing due to a label mismatch in the alert.

The log is attached for your reference.

Comment 15 Anmol Sachan 2021-06-30 08:31:23 UTC

(In reply to Abdul Kandathil (IBM) from comment #14)
> Created attachment 1796115 [details]
> rgw alert test
> 
> we ran ocs ci test
> "tests/manage/monitoring/prometheus/test_rgw.py::test_rgw_unavailable" on
> IBM Z. 
> Alert itself looks to be working but test is failing due to label mismatch
> in the alert.

Can you please elaborate on what is expected here? Also, if the alert is working as expected, shouldn't the test be fixed in this case?

Comment 16 Abdul Kandathil (IBM) 2021-06-30 08:49:02 UTC

@asachan, I am not able to see what info I need to provide. Looks like I don't have permission to view many comments.

Comment 17 Raz Tamir 2021-06-30 09:50:44 UTC

Hi Anmol,

Is there any chance the alert itself was changed as part of fixing this BZ?
If so, could you please confirm that this was intentional, and we will "fix" the test to align with the new alert.

Comment 18 Mudit Agarwal 2021-06-30 12:21:51 UTC

Hi Abdul,

Are you talking about the mismatch in the alert message? I can't see from the logs exactly why the test is failing. Can you please elaborate?
The target label is the alert name, which has not changed.

Expected message in CI:

>> Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health or RGW connection.

Message we are getting while running the test:

>> Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health.

Is this the issue, or am I missing something here?

Comment 19 Abdul Kandathil (IBM) 2021-06-30 12:52:11 UTC

Hi Mudit,

Yes, it looks like your observation is right. It is the alert message difference as you noticed.

The test expects message: 
'Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health or RGW connection.'

And the alert message generated is:
'Cluster Object Store is in unhealthy state. Please check Ceph cluster health.'
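
The mismatch described above can be illustrated with a minimal sketch. This is not the ocs-ci implementation: the helper name `find_matching_alerts`, the alert dictionary layout, and the `message` annotation key are assumptions made only for illustration. It shows how a check that requires both the alert name and the exact message text finds nothing, even though an alert with the right name is firing.

```python
# Hypothetical sketch; names and alert structure are assumptions,
# not taken from the ocs-ci test suite.
EXPECTED_NAME = "ClusterObjectStoreState"
EXPECTED_MSG = (
    "Cluster Object Store is in unhealthy state for more than 15s. "
    "Please check Ceph cluster health or RGW connection."
)


def find_matching_alerts(alerts, name, message):
    """Return the alerts whose name and message both match exactly."""
    return [
        alert
        for alert in alerts
        if alert.get("labels", {}).get("alertname") == name
        and alert.get("annotations", {}).get("message") == message
    ]


# Alert text as quoted in comment 19; the surrounding dict layout is assumed.
fired = [
    {
        "labels": {"alertname": EXPECTED_NAME},
        "annotations": {
            "message": "Cluster Object Store is in unhealthy state. "
            "Please check Ceph cluster health."
        },
    }
]

# Strict check: name and message must both match.
matches = find_matching_alerts(fired, EXPECTED_NAME, EXPECTED_MSG)

# Name-only check: the alert itself is present.
name_only = [a for a in fired if a["labels"]["alertname"] == EXPECTED_NAME]
```

Under these assumptions, `matches` is empty while `name_only` contains one alert, which is consistent with the observation that the alert works but the test fails on the message text.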

Comment 26 errata-xmlrpc 2021-06-30 19:22:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.2 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2632

Comment 27 Filip Balák 2022-02-11 13:18:25 UTC

From test runs it seems that this bug was never fixed: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3140

Comment 28 Filip Balák 2022-02-14 08:26:57 UTC

After further investigation, I see that the alert is correctly raised in the last two 4.7 runs:

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3175/
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3229/

--> Putting back to CLOSED
