Description of problem:
When the alert CephDataRecoveryTakingTooLong is triggered in Prometheus, it is not propagated to PagerDuty.

Version-Release number of selected component (if applicable):
ocs-operator.v4.8.1
ocs-osd-deployer-qe.v1.1.0

How reproducible:
1/1

Steps to Reproduce:
1. Drain all nodes in one rack that contains OSDs.
2. Check in Prometheus that the alert CephDataRecoveryTakingTooLong is Pending.
3. Wait 2 hours.
4. Check whether the alert is propagated to PagerDuty.

Actual results:
The alert is not propagated to PagerDuty.

Expected results:
The alert is propagated to PagerDuty.

Additional info:
To check Prometheus, forward the service port first:
$ oc port-forward svc/prometheus-operated 9090 -n openshift-storage
Then open http://localhost:9090/alerts in a browser to see the managed alerts.
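The Prometheus check in step 2 can also be scripted against the HTTP API instead of browsing /alerts. A minimal sketch, assuming the port-forward above is already running and using the standard Prometheus v1 alerts endpoint (the helper names here are illustrative, not part of any product tooling):

```python
import json
from urllib.request import urlopen

# Reachable via `oc port-forward svc/prometheus-operated 9090 -n openshift-storage`
PROM_URL = "http://localhost:9090"

def alert_states(alerts, name):
    """Return the states (e.g. "pending"/"firing") of alerts matching `name`."""
    return [a["state"] for a in alerts if a["labels"].get("alertname") == name]

def fetch_alerts(base_url=PROM_URL):
    """Fetch the active alerts from the Prometheus v1 API."""
    with urlopen(base_url + "/api/v1/alerts") as resp:
        return json.load(resp)["data"]["alerts"]

if __name__ == "__main__":
    states = alert_states(fetch_alerts(), "CephDataRecoveryTakingTooLong")
    print(states if states else "alert not active")
```

During step 2 this should report "pending", and after the 2-hour wait in step 3 it should report "firing", at which point the alert is expected to reach PagerDuty.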
The alert is propagated correctly after 2 hours, and when the nodes are uncordoned again, the alert is cleared correctly. --> VERIFIED

Tested with:
ocs-operator.v4.8.5
ocs-osd-deployer-qe.v1.1.2
OCP 4.9.9