Bug 2189982

Summary: [RDR] ocs_rbd_client_blocklisted datapoints and the corresponding alert is not getting generated
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Aman Agrawal <amagrawa>
Component: ceph-monitoring
Assignee: Divyansh Kamboj <dkamboj>
Status: CLOSED ERRATA
QA Contact: Aman Agrawal <amagrawa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.13
CC: dkamboj, muagarwa, nberry, nthomas, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ODF 4.13.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.13.0-184
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-21 15:25:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 5 Divyansh Kamboj 2023-05-02 11:32:30 UTC
PR is up with the changes, https://github.com/red-hat-storage/ocs-operator/pull/2043
My account does not have the permission to do devel-acks.

Comment 11 Divyansh Kamboj 2023-05-24 01:22:00 UTC
From a quick look at the ocs-metrics-exporter log, it's not able to get the blocklist data. Authentication on the mon client is timing out, which is why it doesn't fetch any data directly from Ceph.

```
2023-05-24T00:59:51.073+0000 7f01540d1b80  0 monclient(hunting): authenticate timed out after 300
```
Looking into possible reasons why that's happening.
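
To check a saved exporter log for the same symptom, a minimal sketch along these lines could help. It assumes the log line format quoted above; the function name `find_auth_timeouts` is hypothetical and is not part of ocs-metrics-exporter.

```python
import re

# Matches Ceph client log lines shaped like the one quoted above, e.g.
# "<timestamp> <thread> <level> monclient(hunting): authenticate timed out after 300"
AUTH_TIMEOUT_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+.*monclient\((?P<state>[^)]*)\): "
    r"authenticate timed out after (?P<seconds>\d+)"
)


def find_auth_timeouts(log_text):
    """Return (timestamp, monclient state, timeout in seconds) for each matching line."""
    hits = []
    for line in log_text.splitlines():
        match = AUTH_TIMEOUT_RE.match(line.strip())
        if match:
            hits.append(
                (match["timestamp"], match["state"], int(match["seconds"]))
            )
    return hits
```

Running it over the exporter log text gives one tuple per timeout, which makes it easier to see how often authentication fails and at what times.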

Comment 12 Divyansh Kamboj 2023-05-26 09:29:08 UTC
Based on my investigations, it seems that the ocs-metrics-exporter is unable to retrieve blocklist data, likely due to a timeout during authentication on the mon-client.

I investigated the must-gather logs from your QE cluster and didn't find any issues that directly point to this bug. I also didn't encounter the same issue when testing on a separate DR cluster. This suggests that the problem might be related to some unique conditions or configuration in your QE cluster, potentially connection issues or node management.

Additionally, it's worth noting that we have made a change in how the alert is raised. Now, the alert is only triggered if pods using the blocklisted PV go into `CreateContainerError`.
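
As a rough illustration of that condition (not the actual alert rule from the PR), the check can be sketched as follows. The `Pod` type, its fields, and `pods_triggering_alert` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    # Hypothetical, simplified view of a pod for illustration only.
    name: str
    pv_names: set = field(default_factory=set)         # PVs the pod mounts
    waiting_reasons: set = field(default_factory=set)  # container waiting reasons


def pods_triggering_alert(pods, blocklisted_pvs):
    """Return names of pods that use a blocklisted PV AND are in CreateContainerError."""
    blocklisted = set(blocklisted_pvs)
    return [
        pod.name
        for pod in pods
        if pod.pv_names & blocklisted
        and "CreateContainerError" in pod.waiting_reasons
    ]
```

Under this reading, a blocklisted PV alone does not raise the alert; at least one pod using it must also be stuck in `CreateContainerError`.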

To further investigate this, could you please:

1. Attempt to reproduce the issue once more and monitor for any connection issues (https://rook.io/docs/rook/v1.11/Troubleshooting/ceph-common-issues/#solution_1) during the process?
2. Provide more details on how consistently you're encountering this issue. If possible, provide a cluster where this issue is currently present for deeper investigation. (In the current cluster, the mon pods are not running.)

Given the potential impact of this issue, we might have to drop this metric from version 4.13 if we can't resolve it in time, so it's essential to identify the root cause.

Comment 16 errata-xmlrpc 2023-06-21 15:25:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742