Bug 2006222

Summary: CephNodeDown alert is not propagated into PagerDuty
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Filip Balák <fbalak>
Component: odf-managed-service
Assignee: Dhruv Bindra <dbindra>
Status: CLOSED CURRENTRELEASE
QA Contact: Elena Bondarenko <ebondare>
Severity: high
Priority: high
Version: 4.8
CC: aeyal, ebenahar, ebondare, ocs-bugs, omitrani, sabose, sheggodu
Target Milestone: ---
Keywords: TestBlocker
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-12-16 19:49:42 UTC
Type: Bug
Bug Depends On: 2006323    

Description Filip Balák 2021-09-21 08:51:58 UTC
Description of problem:
When the kubelet process is stopped on a node and the CephNodeDown alert fires in Prometheus, the alert is not propagated to PagerDuty.

Version-Release number of selected component (if applicable):
ocs-operator.v4.8.1
ocs-osd-deployer-qe.v1.1.0

How reproducible:
2/2

Steps to Reproduce:
1. Identify one of the worker nodes:
 $ oc get nodes
2. Log in to that node:
 $ oc debug node/<some worker node>
 $ chroot /host
3. Stop the kubelet:
 $ systemctl stop kubelet
4. Wait for the node to reach the NotReady state:
 $ watch oc get nodes
5. Once the node is NotReady, check PagerDuty for alerts.
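The NotReady check in step 4 can also be scripted instead of watched by eye. This is a minimal sketch, not part of the original report. The helper name `not_ready_nodes` is hypothetical, and it assumes the default `oc get nodes --no-headers` column layout (NAME, STATUS, ROLES, AGE, VERSION), where STATUS may carry extra flags such as `,SchedulingDisabled`.

```shell
# Hypothetical helper (not from the original report).
# Reads `oc get nodes --no-headers` output on stdin and prints the NAME of
# every node whose STATUS column begins with "NotReady". The prefix match
# also catches combined states such as "NotReady,SchedulingDisabled".
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}
```

For example, `oc get nodes --no-headers | not_ready_nodes` should list the worker from step 2 once the node controller has marked it NotReady. That is the point at which step 5 applies.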

Actual results:
The CephNodeDown alert does not appear in PagerDuty, and no other alert is sent to indicate a problem with the cluster.

Expected results:
The appropriate alerts, including CephNodeDown, are raised in PagerDuty.

Additional info:

Comment 1 Elena Bondarenko 2021-12-14 15:39:40 UTC
Filip Balák ran all PagerDuty tests with:
ocs-operator.v4.8.5
ocs-osd-deployer-qe.v1.1.2
OCP 4.9.9

Alerts are working correctly.