+++ This bug was initially created as a clone of Bug #2214524 +++

Description of problem (please be detailed as possible and provide log snippets):

During the ocs-ci tier4c run, the following test fails because the "ClusterObjectStoreState" alert is not generated when the RGW interface is unavailable on a 4.13 cluster:
"tests/manage/monitoring/prometheus/test_rgw.py::test_rgw_unavailable"

Console output: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25635/console
TestReport: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25635/testReport/
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/akrai-vm/akrai-vm_20230613T063110/logs/ocs-ci-logs-1686642761/tests/manage/monitoring/prometheus/test_rgw.py/test_rgw_unavailable/logs
must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/akrai-vm/akrai-vm_20230613T063110/logs/testcases_1686642761/akrai-vm/

Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-06-12-183948
ODF: 4.13.0-218

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
This was already fixed in https://bugzilla.redhat.com/show_bug.cgi?id=2144532

Steps to Reproduce:
1. Install an OCP 4.13 nightly build and ODF 4.13.
2. Run the ocs-ci test "tests/manage/monitoring/prometheus/test_rgw.py::test_rgw_unavailable"; it fails with:
   AssertionError: Incorrect number of ClusterObjectStoreState alerts (0 instead of 2 with states: ['pending', 'firing']). Alerts: []
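The failing assertion can also be checked outside of ocs-ci. Below is a minimal sketch (the helper name and the sample payload are illustrative, not part of ocs-ci) that counts ClusterObjectStoreState alerts in a Prometheus `/api/v1/alerts`-shaped response and verifies the expected 'pending' and 'firing' states, mirroring what the test asserts:

```python
def check_cluster_object_store_alerts(alerts_payload, expected_states=("pending", "firing")):
    """Count ClusterObjectStoreState alerts and verify the expected states are present."""
    alerts = [
        a for a in alerts_payload.get("data", {}).get("alerts", [])
        if a.get("labels", {}).get("alertname") == "ClusterObjectStoreState"
    ]
    states = sorted(a.get("state") for a in alerts)
    ok = states == sorted(expected_states)
    return len(alerts), states, ok

# Sample payload shaped like Prometheus' /api/v1/alerts response (illustrative data).
sample = {
    "data": {
        "alerts": [
            {"labels": {"alertname": "ClusterObjectStoreState"}, "state": "pending"},
            {"labels": {"alertname": "ClusterObjectStoreState"}, "state": "firing"},
        ]
    }
}
count, states, ok = check_cluster_object_store_alerts(sample)
print(count, states, ok)  # 2 ['firing', 'pending'] True
```

On the failing 4.13 cluster the alerts list comes back empty, so the count is 0 instead of 2.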
To verify manually, scale down the deployment rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a:

$ oc -n openshift-storage scale --replicas=0 deployment/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a

Then check for the alert "ClusterObjectStoreState", which should be generated while the RGW interface is unavailable.

Actual results:
No alert is generated.

Expected results:
An alert should be generated when the RGW interface is unavailable.

Additional info:
For comparison, the same test was run on a 4.12 cluster and it succeeds.
Console output: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25623/console
TestReport: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25623/testReport/
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/akrai-j13-vm/akrai-j13-vm_20230613T050513/logs/ocs-ci-logs-1686636880/tests/manage/monitoring/prometheus/test_rgw.py/test_rgw_unavailable/logs
must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/akrai-j13-vm/akrai-j13-vm_20230613T050513/logs/testcases_1686636880/

--- Additional comment from RHEL Program Management on 2023-06-13 08:35:12 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.13.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have been reset, since Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-06-13 08:36:13 UTC ---

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also proposed as a blocker for this release. Please resolve ASAP.

--- Additional comment from Nishanth Thomas on 2023-06-13 10:20:50 UTC ---

It's a regression but not a blocker for the release. We are looking at it.
Will update once we identify the root cause. I propose moving this to 4.13.z.

--- Additional comment from arun kumar mohan on 2023-06-13 11:47:57 UTC ---

The alert `ClusterObjectStoreState` depends on the metric `ocs_rgw_health_status`, which is generated/exposed by ocs-metrics-exporter. The ocs-metrics-exporter logs (provided in the BZ) show the following error from the ceph-object-store collector (which exposes this metric):

-------
CephObjectStore in unexpected phase. Must be "Connected", "Progressing" or "Failure"
-------

The exporter is therefore unable to provide the metric, and without it the alert ClusterObjectStoreState cannot fire.

@dkamboj can you please take a look?

--- Additional comment from Nishanth Thomas on 2023-06-13 13:06:32 UTC ---

Moving out to 4.13.z, per agreement in the program call.
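For context, the alert's dependency on `ocs_rgw_health_status` looks roughly like the following PrometheusRule fragment. This is an illustrative sketch only; the exact expression, duration, and labels live in the PrometheusRule shipped by ocs-operator and may differ:

```yaml
# Illustrative fragment -- check the ocs-operator PrometheusRule CR for the real rule.
groups:
  - name: cluster-object-store-state.rules
    rules:
      - alert: ClusterObjectStoreState
        expr: ocs_rgw_health_status > 1  # assumed threshold; exporter encodes health as a numeric state
        for: 15s
        labels:
          severity: critical
        annotations:
          description: Cluster object store is in an unhealthy state.
```

If the exporter never emits `ocs_rgw_health_status` (as in the error above), the expression evaluates over an empty series and the alert can never reach 'pending' or 'firing'.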
Hey Akarsha,

You'll need to change the test case for this. ODF/Rook used to run a routine that regularly created a bucket and then wrote to and read from it to test RGW health; that status check has now been removed. We now need to reflect the "Readiness" of the deployment and the "Connected" status of the CephObjectStore. You can force the RGW pods to not be "Ready", which triggers the alert. One way to do that is to change the ReadinessProbe to something that always returns failure.
Moving it to QA, as this needs changes in the test.
@akrai you can forcefully make the rgw pods not ready using these steps:

Step 1: Get the current YAML configuration of the deployment

kubectl get deployment rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a -n <namespace> -o yaml > deployment.yaml

Step 2: Edit deployment.yaml to update the readiness probe

You can use any text editor to modify the file. For example, using nano:

nano deployment.yaml

Inside deployment.yaml, locate the readinessProbe section and update it as follows:

readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - |
        #!/usr/bin/env bash
        exit 100
  initialDelaySeconds: 30
  periodSeconds: 10

Save the changes and exit the text editor.

Step 3: Apply the updated YAML configuration to the deployment

kubectl apply -f deployment.yaml -n <namespace>
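The three steps above can also be collapsed into a single patch instead of exporting and re-applying the whole deployment. A sketch, assuming the RGW container in the deployment is named `rgw` (verify the actual container name in the deployment spec first):

```yaml
# probe-patch.yaml -- strategic-merge patch that makes the readiness probe always fail.
# Strategic merge matches list entries in `containers` by their `name` key.
spec:
  template:
    spec:
      containers:
        - name: rgw
          readinessProbe:
            exec:
              command: ["/bin/bash", "-c", "exit 100"]
```

Apply it with:

kubectl patch deployment rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a -n <namespace> --patch-file probe-patch.yaml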
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832