Bug 2189982 - [RDR] ocs_rbd_client_blocklisted datapoints and the corresponding alert is not getting generated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Divyansh Kamboj
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-04-26 16:37 UTC by Aman Agrawal
Modified: 2023-08-09 16:37 UTC (History)

Fixed In Version: 4.13.0-184
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-21 15:25:28 UTC
Embargoed:



Links
| System | ID | Private | Priority | Status | Summary | Last Updated |
|---|---|---|---|---|---|---|
| Github | red-hat-storage ocs-operator pull 2043 | 0 | None | open | fix changes overwriting of clients in the RBDClientMap | 2023-05-02 11:32:44 UTC |
| Github | red-hat-storage ocs-operator pull 2046 | 0 | None | open | Bug 2189982: [release-4.13] fix changes overwriting of clients in the RBDClientMap | 2023-05-03 08:28:08 UTC |
| Red Hat Product Errata | RHBA-2023:3742 | 0 | None | None | None | 2023-06-21 15:25:44 UTC |

Comment 5 Divyansh Kamboj 2023-05-02 11:32:30 UTC
The PR is up with the changes: https://github.com/red-hat-storage/ocs-operator/pull/2043
My account does not have the permission to do devel-acks.

Comment 11 Divyansh Kamboj 2023-05-24 01:22:00 UTC
From a quick look at the ocs-metrics-exporter log, the exporter is not able to get the blocklist data. Authentication on the mon-client is timing out, which is why it doesn't fetch any data directly from Ceph.

```
2023-05-24T00:59:51.073+0000 7f01540d1b80  0 monclient(hunting): authenticate timed out after 300
```
I'm looking into possible reasons why that's happening.
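
For anyone checking their own exporter logs for the same symptom, here is a minimal sketch of a scanner for this kind of line. The pattern is derived only from the single log line quoted above, so other Ceph versions may format the message differently. The function name `find_auth_timeouts` is hypothetical and is not part of ocs-metrics-exporter.

```python
import re

# Pattern based solely on the one log line quoted in this comment (assumption:
# timestamp, thread id, log level, then the monclient message).
AUTH_TIMEOUT_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+\S+\s+\d+\s+monclient\((?P<state>[^)]*)\): "
    r"authenticate timed out after (?P<seconds>\d+)"
)


def find_auth_timeouts(log_text):
    """Return (timestamp, state, seconds) for each monclient auth timeout line."""
    hits = []
    for line in log_text.splitlines():
        match = AUTH_TIMEOUT_RE.match(line.strip())
        if match:
            hits.append(
                (match["timestamp"], match["state"], int(match["seconds"]))
            )
    return hits


sample = (
    "2023-05-24T00:59:51.073+0000 7f01540d1b80  0 "
    "monclient(hunting): authenticate timed out after 300"
)
print(find_auth_timeouts(sample))
```

Lines that do not match the pattern are ignored, so the function can be run over a full log dump.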

Comment 12 Divyansh Kamboj 2023-05-26 09:29:08 UTC
Based on my investigations, it seems that the ocs-metrics-exporter is unable to retrieve blocklist data, likely due to a timeout during authentication on the mon-client.

I investigated the must-gather logs from your QE cluster and didn't find any issues that directly point to this bug. I also didn't encounter the same issue when testing on a separate DR cluster. This suggests that the problem might be related to some unique conditions or configuration in your QE cluster, potentially related to connection issues or node management.

Additionally, it's worth noting that we have made a change in how the alert is raised. Now, the alert is only triggered if pods using the blocklisted PV go into `CreateContainerError`.
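
The condition described above can be sketched roughly as follows. This is an illustration of the stated rule only, not the operator's implementation: the `Pod` class, its fields, and `should_raise_alert` are hypothetical names, and how the real alert is evaluated is not shown in this report.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Simplified, hypothetical view of a pod for illustration."""
    name: str
    pv_names: list  # names of the PVs the pod uses
    waiting_reasons: list = field(default_factory=list)  # container waiting reasons


def should_raise_alert(pods, blocklisted_pvs):
    """True only if some pod uses a blocklisted PV AND is in CreateContainerError."""
    blocklisted = set(blocklisted_pvs)
    for pod in pods:
        uses_blocklisted_pv = bool(blocklisted.intersection(pod.pv_names))
        if uses_blocklisted_pv and "CreateContainerError" in pod.waiting_reasons:
            return True
    return False


pods = [
    Pod("app-a", ["pv-1"], ["CreateContainerError"]),
    Pod("app-b", ["pv-2"]),
]
print(should_raise_alert(pods, ["pv-1"]))  # blocklisted PV and failing pod
print(should_raise_alert(pods, ["pv-2"]))  # blocklisted PV, but pod is healthy
```

Under this rule, a blocklisted client on its own would not raise the alert unless a pod using that PV is also failing with `CreateContainerError`.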

To further investigate this, could you please:

1. Attempt to reproduce the issue once more and monitor for any connection issues (https://rook.io/docs/rook/v1.11/Troubleshooting/ceph-common-issues/#solution_1) during the process?
2. Provide more details on how consistently you're encountering this issue. If possible, provide a cluster where this issue is currently present for deeper investigation (in the current cluster, the mon pods are not running).

Considering the potential impact of this issue, we might have to consider dropping this metric from version 4.13 if we can't resolve it in time. It's essential to identify the root cause of this issue.

Comment 16 errata-xmlrpc 2023-06-21 15:25:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

