Bug 1962161 - Alert 'ClusterObjectStoreState' is not triggered when RGW interface is unavailable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.2
Assignee: Anmol Sachan
QA Contact: Elad
URL:
Whiteboard:
Depends On: 1948378
Blocks:
Reported: 2021-05-19 12:16 UTC by Anmol Sachan
Modified: 2022-02-14 08:26 UTC
CC List: 12 users

Fixed In Version: v4.7.2-429.ci
Doc Type: Bug Fix
Doc Text:
Previously, the ClusterObjectStoreState alert was not generated when the RADOS Object Gateway (RGW) was unavailable or unhealthy. With this update, a fix has been implemented in the OpenShift Container Storage operator, and users now see the ClusterObjectStoreState alert when the RGW is unavailable or unhealthy.
Clone Of: 1948378
Environment:
Last Closed: 2022-02-14 08:26:57 UTC
Embargoed:


Attachments
rgw alert test (52.84 KB, text/plain)
2021-06-30 06:49 UTC, Abdul Kandathil (IBM)


Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 1174 0 None closed fix ClusterObjectStoreState Alert empty spec 2021-06-21 11:26:02 UTC
Github openshift ocs-operator pull 1233 0 None open : Bug 1962161: [release-4.7] fix ClusterObjectStoreState Alert empty spec 2021-06-21 11:26:30 UTC
Red Hat Product Errata RHBA-2021:2632 0 None None None 2021-06-30 19:23:01 UTC

Comment 5 Mudit Agarwal 2021-06-23 11:16:04 UTC
Please add doc text

Comment 14 Abdul Kandathil (IBM) 2021-06-30 06:49:50 UTC
Created attachment 1796115
rgw alert test

We ran the OCS CI test "tests/manage/monitoring/prometheus/test_rgw.py::test_rgw_unavailable" on IBM Z.
The alert itself appears to be working, but the test is failing due to a label mismatch in the alert.

The log is attached for your reference.

Comment 15 Anmol Sachan 2021-06-30 08:31:23 UTC
(In reply to Abdul Kandathil (IBM) from comment #14)
> Created attachment 1796115 [details]
> rgw alert test
> 
> we ran ocs ci test
> "tests/manage/monitoring/prometheus/test_rgw.py::test_rgw_unavailable" on
> IBM Z. 
> Alert itself looks to be working but test is failing due to label mismatch
> in the alert.

Can you please elaborate on what is expected here? Also, if the alert is working as expected, shouldn't the test be fixed in this case?

Comment 16 Abdul Kandathil (IBM) 2021-06-30 08:49:02 UTC
@asachan, I am not able to see what info I need to provide. Looks like I don't have permission to view many comments.

Comment 17 Raz Tamir 2021-06-30 09:50:44 UTC
Hi Anmol,

Is there any chance that the alert itself was changed as part of fixing this BZ?
If so, could you please confirm that this was intentional? We will then "fix" the test to align with the new alert.

Comment 18 Mudit Agarwal 2021-06-30 12:21:51 UTC
Hi Abdul,

Are you talking about the mismatch in the alert message? I can't see from the logs exactly why the test is failing; can you please elaborate?
The target label is the alert name, which has not changed.

Expected message in CI:

>> Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health or RGW connection.

Message we are getting while running the test:

>> Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health.

Is this the issue, or am I missing something here?

Comment 19 Abdul Kandathil (IBM) 2021-06-30 12:52:11 UTC
Hi Mudit,

Yes, it looks like your observation is right. It is the alert message difference as you noticed.

The test expects the message:
'Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health or RGW connection.'

And the alert message generated is:
'Cluster Object Store is in unhealthy state. Please check Ceph cluster health.'
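
For illustration, the comparison that fails here can be sketched in a few lines of Python. This is a minimal sketch, not the ocs-ci implementation: the Alert class, its field names, and the find_message_mismatches helper are hypothetical and exist only to show how an alert with the right name can still fail an exact message check.

```python
# Hypothetical sketch, not taken from ocs-ci: report alerts whose name
# matches but whose message differs from what the test expects.
from dataclasses import dataclass


@dataclass
class Alert:
    name: str      # assumed to hold the alert name, e.g. "ClusterObjectStoreState"
    message: str   # assumed to hold the human-readable alert message


def find_message_mismatches(alerts, name, expected_message):
    """Return messages of alerts called `name` that differ from `expected_message`."""
    return [
        alert.message
        for alert in alerts
        if alert.name == name and alert.message != expected_message
    ]


# Message the test expects, as quoted in this comment.
expected = (
    "Cluster Object Store is in unhealthy state for more than 15s. "
    "Please check Ceph cluster health or RGW connection."
)

# Message generated by the alert, as quoted in this comment.
fired = [
    Alert(
        name="ClusterObjectStoreState",
        message="Cluster Object Store is in unhealthy state. "
                "Please check Ceph cluster health.",
    )
]

mismatches = find_message_mismatches(fired, "ClusterObjectStoreState", expected)
for message in mismatches:
    print(f"expected: {expected!r}")
    print(f"received: {message!r}")
```

With an exact string comparison like this, the alert counts as a mismatch even though it fired under the expected name, which is consistent with the alert working while the test fails.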

Comment 26 errata-xmlrpc 2021-06-30 19:22:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.2 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2632

Comment 27 Filip Balák 2022-02-11 13:18:25 UTC
From test runs it seems that this bug was never fixed: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3140

Comment 28 Filip Balák 2022-02-14 08:26:57 UTC
After further investigation, I see that the alert is correctly raised in the last two 4.7 runs:

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3175/
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3229/

--> Putting back to CLOSED

