The PR with the changes is up: https://github.com/red-hat-storage/ocs-operator/pull/2043 My account does not have permission to give devel-acks.
From a quick look at the ocs-metrics-exporter log, the exporter is not able to get the blocklist data. Authentication on the mon client is timing out, which is why it doesn't fetch any data directly from Ceph.

```
2023-05-24T00:59:51.073+0000 7f01540d1b80 0 monclient(hunting): authenticate timed out after 300
```

I'm looking into possible reasons why that's happening.
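To check how often this timeout shows up in a saved exporter log, a small script along these lines may help. It is only a sketch: the field layout (timestamp, thread id, log level) is assumed from the single log line quoted above, and `find_auth_timeouts` is a hypothetical helper, not part of ocs-metrics-exporter or Ceph.

```python
import re

# Pattern for the monclient timeout line quoted above. The field layout
# (timestamp, thread id, log level) is an assumption based on that one sample.
TIMEOUT_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<thread>[0-9a-f]+)\s+(?P<level>-?\d+)\s+"
    r"monclient\((?P<state>[^)]*)\): authenticate timed out after (?P<seconds>\d+)"
)


def find_auth_timeouts(log_text):
    """Return one dict per monclient authentication timeout found in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.match(line.strip())
        if match:
            hit = match.groupdict()
            hit["seconds"] = int(hit["seconds"])  # timeout value as a number
            hits.append(hit)
    return hits


if __name__ == "__main__":
    sample = (
        "2023-05-24T00:59:51.073+0000 7f01540d1b80  0 "
        "monclient(hunting): authenticate timed out after 300"
    )
    for hit in find_auth_timeouts(sample):
        print(hit["timestamp"], hit["state"], hit["seconds"])
```

Feeding it the exporter pod log from a must-gather would show whether the timeout is a one-off or repeats on every scrape.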
Based on my investigation, the ocs-metrics-exporter is unable to retrieve blocklist data, likely because authentication on the mon client times out.

I went through the must-gather logs from your QE cluster and didn't find anything that directly points to this bug. I also couldn't reproduce the issue when testing on a separate DR cluster. This suggests that the problem might be tied to some condition or configuration specific to your QE cluster, potentially connection issues or node management.

It's also worth noting that we have changed how the alert is raised. The alert is now triggered only if pods using the blocklisted PV go into `CreateContainerError`.

To investigate further, could you please:

1. Attempt to reproduce the issue once more and monitor for any connection issues (https://rook.io/docs/rook/v1.11/Troubleshooting/ceph-common-issues/#solution_1) during the process?
2. Provide more details on how consistently you're encountering this issue. If possible, provide a cluster where the issue is currently present for deeper investigation. (In the current cluster, the mon pods are not running.)

Given the potential impact, we might have to drop this metric from version 4.13 if we can't resolve the issue in time, so it's essential to identify the root cause.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742