Bug 2218190

Summary:	[release-4.14] Alert 'ClusterObjectStoreState' is not triggered when RGW interface is unavailable
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	Divyansh Kamboj <dkamboj>
Component:	ceph-monitoring	Assignee:	Divyansh Kamboj <dkamboj>
Status:	CLOSED ERRATA	QA Contact:	akarsha <akrai>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.14	CC:	akrai, amohan, dkamboj, hnallurv, jolmomar, kramdoss, muagarwa, nthomas, odf-bz-bot, uchapaga
Target Milestone:	---	Keywords:	Regression
Target Release:	ODF 4.14.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.14.0-130	Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:	2214524	Environment:
Last Closed:	2023-11-08 18:52:10 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2214524
Bug Blocks:

Description Divyansh Kamboj 2023-06-28 11:55:40 UTC

+++ This bug was initially created as a clone of Bug #2214524 +++

Description of problem (please be as detailed as possible and provide log
snippets):

During the ocs-ci tier4c run, the following test fails because no "ClusterObjectStoreState" alerts are generated when the RGW interface is unavailable on a 4.13 cluster:

"tests/manage/monitoring/prometheus/test_rgw.py::test_rgw_unavailable"

Console output: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25635/console
TestReport: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25635/testReport/
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/akrai-vm/akrai-vm_20230613T063110/logs/ocs-ci-logs-1686642761/tests/manage/monitoring/prometheus/test_rgw.py/test_rgw_unavailable/logs
must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/akrai-vm/akrai-vm_20230613T063110/logs/testcases_1686642761/akrai-vm/

Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-06-12-183948
ODF: 4.13.0-218

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
This was already fixed in bz https://bugzilla.redhat.com/show_bug.cgi?id=2144532

Steps to Reproduce:
1. Install OCP 4.13 nightly build and ODF 4.13
2. Run the ocs-ci test "tests/manage/monitoring/prometheus/test_rgw.py::test_rgw_unavailable". The test fails with the error below:

AssertionError: Incorrect number of ClusterObjectStoreState alerts (0 instead of 2 with states: ['pending', 'firing']).
Alerts: []

3. To verify manually, scale down the deployment rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a:
$ oc -n openshift-storage scale --replicas=0 deployment/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a

Then check for the alert "ClusterObjectStoreState", which should be generated when the RGW interface is unavailable.
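As a rough illustration of the manual check above, here is a minimal Python sketch that filters a parsed Prometheus `/api/v1/alerts` response by alert name and reports the states found. It is not the ocs-ci implementation; the function name `alert_states`, the sample payload, and the `SomeOtherAlert` entry are assumptions made for the example. It inspects a single snapshot, whereas the test expects to observe both 'pending' and 'firing' over time.

```python
from typing import Any, Dict, List


def alert_states(alerts_response: Dict[str, Any], alert_name: str) -> List[str]:
    """Return the sorted states of all alerts whose alertname label matches."""
    alerts = alerts_response.get("data", {}).get("alerts", [])
    return sorted(
        alert["state"]
        for alert in alerts
        if alert.get("labels", {}).get("alertname") == alert_name
    )


# Hypothetical response body, shaped like the output of /api/v1/alerts.
sample = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "ClusterObjectStoreState"}, "state": "firing"},
            {"labels": {"alertname": "SomeOtherAlert"}, "state": "pending"},
        ]
    },
}

print(alert_states(sample, "ClusterObjectStoreState"))  # prints ['firing']

# An empty alert list corresponds to the "Alerts: []" failure reported above.
empty = {"status": "success", "data": {"alerts": []}}
print(alert_states(empty, "ClusterObjectStoreState"))  # prints []
```

An empty result for "ClusterObjectStoreState" after the RGW deployment has been scaled down matches the behavior reported in this bug.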

Actual results:
No alert is generated.

Expected results:
An alert should be generated when the RGW interface is unavailable.

Additional info:
To verify, I ran the same test on version 4.12, where it succeeds:
Console output: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25623/console
TestReport: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25623/testReport/
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/akrai-j13-vm/akrai-j13-vm_20230613T050513/logs/ocs-ci-logs-1686636880/tests/manage/monitoring/prometheus/test_rgw.py/test_rgw_unavailable/logs
must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/akrai-j13-vm/akrai-j13-vm_20230613T050513/logs/testcases_1686636880/

--- Additional comment from RHEL Program Management on 2023-06-13 08:35:12 UTC ---

This bug having no release flag set previously, is now set with release flag 'odf‑4.13.0' to '?', and so is being proposed to be fixed at the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-06-13 08:36:13 UTC ---

This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.

--- Additional comment from Nishanth Thomas on 2023-06-13 10:20:50 UTC ---

It's a regression, but not a blocker for the release. We are looking into it and will update once we identify the root cause. I propose moving this to 4.13.z.

--- Additional comment from arun kumar mohan on 2023-06-13 11:47:57 UTC ---

The alert `ClusterObjectStoreState` depends on the metric `ocs_rgw_health_status`, which is generated and exposed by ocs-metrics-exporter.
In the ocs-metrics-exporter logs (provided in the BZ), we see the following error message from the ceph-object-store component (which exposes this metric):

-------
CephObjectStore in unexpected phase. Must be "Connected", "Progressing" or "Failure"
-------

The exporter is therefore unable to provide the metric, and without it the alert ClusterObjectStoreState cannot fire.
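One way to confirm that the metric is missing is to look for its samples in the text scraped from the exporter's metrics endpoint. The Python sketch below assumes the standard Prometheus text exposition format; the function name `metric_samples` and the label and value in the sample text are hypothetical and not taken from the actual exporter output.

```python
from typing import List


def metric_samples(exposition_text: str, metric_name: str) -> List[str]:
    """Return the sample lines for metric_name from Prometheus text-format output."""
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample starts with the metric name, followed by "{" (labels)
        # or a space (value); this avoids matching longer metric names.
        if line.startswith(metric_name) and line[len(metric_name):][:1] in ("{", " "):
            samples.append(line)
    return samples


# Hypothetical scrape output; the label and the value are made up.
scrape = """\
# TYPE ocs_rgw_health_status gauge
ocs_rgw_health_status{name="example"} 0
ocs_rgw_health_status_other 1
"""

print(metric_samples(scrape, "ocs_rgw_health_status"))
print(metric_samples("", "ocs_rgw_health_status"))
```

An empty list for `ocs_rgw_health_status` would be consistent with the situation described in this comment, where the alert has no data to evaluate.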

@dkamboj can you please take a look?

--- Additional comment from Nishanth Thomas on 2023-06-13 13:06:32 UTC ---

Moving out to 4.13.z, per agreement in the program call.

Comment 8 Divyansh Kamboj 2023-09-25 10:59:19 UTC

Hey Akarsha, you'll need to change the test case for this. ODF/Rook used to run a routine that regularly created a bucket and then wrote to and read from it to test RGW health; that status check has now been removed. We now need to reflect the readiness of the deployment and the "Connected" status of the CephObjectStore.

You can force the RGW pods to not be "Ready", which triggers the alert. One way to do that is to change the ReadinessProbe to something that always returns failure.

Comment 9 Divyansh Kamboj 2023-09-25 12:41:20 UTC

Moving it to QA, as it needs changes in the test.

Comment 10 Divyansh Kamboj 2023-09-27 08:49:51 UTC

@akrai you can force the RGW pods to become not ready using these steps:

Step 1: Get the current YAML configuration of the deployment
kubectl get deployment rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a -n <namespace> -o yaml > deployment.yaml

Step 2: Edit the deployment.yaml file to update the readiness probe

You can use any text editor to modify the file. For example, using nano:
nano deployment.yaml

Inside the deployment.yaml file, locate the readinessProbe section and update it as follows:
readinessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    - |
      #!/usr/bin/env bash
      exit 100
  initialDelaySeconds: 30
  periodSeconds: 10

Save the changes and exit the text editor.

Step 3: Apply the updated YAML configuration to the deployment
kubectl apply -f deployment.yaml -n <namespace>
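For test automation, the manual edit in Step 2 could be expressed as a small Python helper that works on the manifest as a dictionary. This is only a sketch under assumptions: the helper name `with_failing_readiness_probe` is hypothetical, it changes only containers that already define a readinessProbe, and it simplifies the probe command to `exit 100`.

```python
import copy
from typing import Any, Dict

# Probe from the steps above: the command always exits non-zero, so the
# container never reports Ready.
FAILING_PROBE: Dict[str, Any] = {
    "exec": {"command": ["/bin/bash", "-c", "exit 100"]},
    "initialDelaySeconds": 30,
    "periodSeconds": 10,
}


def with_failing_readiness_probe(deployment: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a Deployment manifest whose readiness probes always fail."""
    patched = copy.deepcopy(deployment)
    containers = patched["spec"]["template"]["spec"]["containers"]
    for container in containers:
        # Assumption: only containers that already define a readinessProbe
        # are changed; containers without one are left untouched.
        if "readinessProbe" in container:
            container["readinessProbe"] = copy.deepcopy(FAILING_PROBE)
    return patched


# Hypothetical, heavily trimmed manifest used only to exercise the helper.
manifest = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "rgw", "readinessProbe": {"httpGet": {"path": "/"}}},
                    {"name": "sidecar"},
                ]
            }
        }
    }
}

patched = with_failing_readiness_probe(manifest)
print(patched["spec"]["template"]["spec"]["containers"][0]["readinessProbe"])
```

The input manifest can be fetched with `kubectl get deployment ... -o json` and loaded with `json.load`; the patched result can be written back to a file and applied with `kubectl apply -f`, as in Step 3.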

Comment 13 errata-xmlrpc 2023-11-08 18:52:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832