Bug 2009396

Summary: Alert CephDataRecoveryTakingTooLong is not propagated to PagerDuty
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Component: odf-managed-service
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Version: 4.8
Reporter: Filip Balák <fbalak>
Assignee: Dhruv Bindra <dbindra>
QA Contact: Filip Balák <fbalak>
CC: aeyal, dbindra, ocs-bugs, omitrani, rperiyas, sabose
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-12-16 19:50:04 UTC

Description Filip Balák 2021-09-30 14:23:31 UTC
Description of problem:
When alert CephDataRecoveryTakingTooLong is triggered in Prometheus, it is not propagated into PagerDuty.

Version-Release number of selected component (if applicable):
ocs-operator.v4.8.1
ocs-osd-deployer-qe.v1.1.0

How reproducible:
1/1

Steps to Reproduce:
1. Drain all nodes in one rack that contains OSDs.
2. Check in Prometheus that the alert CephDataRecoveryTakingTooLong is in the Pending state.
3. Wait 2 hours.
4. Check that the alert is propagated to PagerDuty.

Actual results:
The alert is not propagated to PagerDuty.

Expected results:
The alert is propagated to PagerDuty.

Additional info:
To check Prometheus, the user needs to forward a port:
 $ oc port-forward svc/prometheus-operated 9090 -n openshift-storage
Then the user can access http://localhost:9090/alerts in a browser and see the managed alerts.
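As a sketch, the alert state can also be checked programmatically: the Prometheus HTTP API exposes active alerts at /api/v1/alerts once the port-forward above is running. The helper below filters such a response for a named alert; the sample payload is illustrative only, not captured from this cluster.

```python
import json

# Illustrative /api/v1/alerts response body (NOT captured from this cluster).
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {
                    "alertname": "CephDataRecoveryTakingTooLong",
                    "severity": "warning",
                },
                "state": "firing",
            }
        ]
    },
})


def alert_state(response_body, alertname):
    """Return the state ('pending'/'firing') of the named alert, or None."""
    data = json.loads(response_body)
    for alert in data["data"]["alerts"]:
        if alert["labels"].get("alertname") == alertname:
            return alert["state"]
    return None


# With the port-forward active, the body would come from
# http://localhost:9090/api/v1/alerts instead of SAMPLE_RESPONSE.
print(alert_state(SAMPLE_RESPONSE, "CephDataRecoveryTakingTooLong"))
```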

Comment 1 Filip Balák 2021-12-14 15:11:56 UTC
The alert is propagated correctly after 2 hours, and when the nodes are uncordoned again, the alert is cleared correctly. --> VERIFIED

Tested with:
ocs-operator.v4.8.5
ocs-osd-deployer-qe.v1.1.2
ocp 4.9.9