Bug 2228359
| Summary: | Alerts CephOSDSlowOps and CephMdsMissingReplicas do not appear on 4.10 cluster | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Daniel Osypenko <dosypenk> |
| Component: | ceph-monitoring | Assignee: | Divyansh Kamboj <dkamboj> |
| Status: | CLOSED WORKSFORME | QA Contact: | Daniel Osypenko <dosypenk> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | nthomas, odf-bz-bot |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-07 12:51:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Daniel Osypenko
2023-08-02 07:44:20 UTC
I tried reproducing CephMdsMissingReplicas on an ODF 4.10.14 cluster and was able to get the alert on a fresh cluster, but I noticed a caveat: the alert will only trigger when there is one mds pod; if there are zero mds pods in total, we do not get the alert. For CephOSDSlowOps, can you provide the steps you took that would result in the alert firing?

(apologies, Bugzilla posts the message if I press Enter)

@dosypenk can you confirm the behaviour in comment 4?

Steps to reproduce CephOSDSlowOps:
1. Reduce the osd_op_complaint_time value to 0.1: `ceph config set osd osd_op_complaint_time 0.1`
2. Create 2 PVCs with overall capacity equal to 90% of the storage capacity.
3. Fill up the PVCs and verify, using the Prometheus API, that CephOSDSlowOps appears while data is being written to the PVCs.

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27670/console
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27662/console
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/27671/console

Tried reproducing CephOSDSlowOps, and the alert works fine on a fresh cluster. @dosypenk can you provide details for a cluster that has this issue reproduced, so we can look at the metric data? That isn't available in the must-gather.

(In reply to Divyansh Kamboj from comment #8)

Hello Divyansh, I've added must-gather logs to the body of the bug report. Unfortunately the cluster has already been destroyed.

(In reply to Daniel Osypenko from comment #9)
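The "verify via the Prometheus API" step in the reproduction instructions above can be sketched as follows. The payload shape follows the standard Prometheus `/api/v1/alerts` response; the helper name and the sample response are illustrative, not taken from the bug's cluster:

```python
import json

def alert_is_firing(alerts_response: dict, alert_name: str) -> bool:
    """Return True if the named alert is present and in the 'firing'
    state in a Prometheus /api/v1/alerts response body."""
    alerts = alerts_response.get("data", {}).get("alerts", [])
    return any(
        a.get("labels", {}).get("alertname") == alert_name
        and a.get("state") == "firing"
        for a in alerts
    )

# Hypothetical payload shaped like a Prometheus /api/v1/alerts response.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "alerts": [
      {"labels": {"alertname": "CephOSDSlowOps", "severity": "warning"},
       "state": "firing",
       "activeAt": "2023-08-07T07:37:58Z"}
    ]
  }
}
""")

print(alert_is_firing(sample, "CephOSDSlowOps"))          # True
print(alert_is_firing(sample, "CephMdsMissingReplicas"))  # False
```

In a real test the response body would come from the in-cluster Prometheus route rather than a hard-coded string.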
must-gather logs don't provide the metric data of the cluster. We'll need to look at the metrics for a few hours and then correlate them with the alerts and the logs to understand why the issue is happening (or which component is malfunctioning). If possible, can you reproduce it and send the details? The issues are not reproducible on the fresh clusters I create following the instructions provided in the bug.

(In reply to Divyansh Kamboj from comment #10)

With a fresh cluster I reran the test with osd_op_complaint_time set to 0.1; at 14:23 IST it filled the capacity to 84.4% and the test failed. During 50 minutes, CephOSDSlowOps did not appear.
credentials -> https://url.corp.redhat.com/cluster

Thanks for investigating.

(In reply to Daniel Osypenko from comment #11)

What time was the test run? I can see the query for the alert pop up values around 7th Aug, 13:07 IST for almost 1.5 minutes; the threshold for the metric is 30 s. So it looks like the alert was triggered when SLOW_OPS was reported by Ceph. The Ceph logs have the last mention of SLOW_OPS around 2023-08-07T07:38:28.770+0000 (7th Aug, 13:08 IST), which correlates with the data the metric shows us. The alert behaves as intended in the cluster above. Closing as not reproducible.
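The closing reasoning above (alert values present for ~1.5 minutes against a 30 s threshold) can be sketched as a rough analogue of a Prometheus `for:` hold; the sampling interval and function name are illustrative, not the actual ODF alert rule:

```python
def condition_held(samples, threshold_seconds=30.0):
    """Given (timestamp, value) samples of a slow-ops metric, return True
    if value > 0 held continuously for at least threshold_seconds --
    a rough analogue of a Prometheus 'for:' clause."""
    start = None
    for ts, value in samples:
        if value > 0:
            if start is None:
                start = ts
            if ts - start >= threshold_seconds:
                return True
        else:
            start = None  # condition broke; the hold timer resets
    return False

# SLOW_OPS reported for ~90 seconds (sampled every 15 s) crosses the 30 s hold:
print(condition_held([(0, 0), (15, 1), (30, 1), (45, 2),
                      (60, 1), (75, 1), (90, 1), (105, 0)]))  # True

# A single 15 s blip does not:
print(condition_held([(0, 0), (15, 1), (30, 0)]))  # False
```

This matches the conclusion in the comment: ~1.5 minutes of reported SLOW_OPS comfortably exceeds the 30 s threshold, so the alert firing there was expected behaviour.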