Bug 2322173 - CephNodeDown Alert not triggered when Worker node is gracefully stopped
Summary: CephNodeDown Alert not triggered when Worker node is gracefully stopped
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.17
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nishanth Thomas
QA Contact: Harish NV Rao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-10-28 17:19 UTC by Daniel Osypenko
Modified: 2024-10-29 09:56 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OCSBZM-9442 0 None None None 2024-10-28 17:20:12 UTC

Description Daniel Osypenko 2024-10-28 17:19:52 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

`test_stop_start_node_validate_topology` consistently fails on multiple deployments.
The CephNodeDown alert does not fire after the node has been down for 30 seconds and is not visible via either Prometheus or the management console. The issue was detected on vSphere clusters.

**Failed: one or more checks did not pass:**

| Check                                                       | check_state (1 = pass, 0 = fail) |
|-------------------------------------------------------------|----------------------------------|
| prometheus_CephNodeDown_alert_fired                         | 0                                |
| cluster_in_danger_state_check_pass                          | 1                                |
| ceph_node_down_alert_found_check_pass                       | 0                                |
| ceph_node_down_alert_found_on_idle_node_check_pass          | 1                                |
| prometheus_CephNodeDown_alert_removed                       | 1                                |
| ceph_node_down_alert_found_after_node_turned_on_check_pass  | 1                                |
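
For reference, the first check above (prometheus_CephNodeDown_alert_fired) can be reproduced manually against the Prometheus alerts API, with $TOKEN and $ROUTE set as in the curl request under Steps to Reproduce. This is only a sketch of the query, not the test's exact code:

curl -k -s -H "Authorization: Bearer $TOKEN" https://$ROUTE/api/v1/alerts | jq '.data.alerts[] | select(.labels.alertname == "CephNodeDown") | {state: .state, severity: .labels.severity}'

An empty result, or a state other than "firing", after the node has been down for 30+ seconds corresponds to the failed check.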


Version of all relevant components (if applicable):
ODF 4.17.0-105, OCP 4.17

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
in some cases

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:
Regression. We had successful test executions, but long ago; no history is saved.

Steps to Reproduce:
1. Power off a worker node from the VMware console (or by any other means)
2. Log in to the management console and verify that the CephNodeDown alert appears, or query Prometheus for fired alerts with:
curl -k -H "Authorization: Bearer $TOKEN" -X GET https://$ROUTE/api/v1/alerts | jq '.data.alerts[].labels.alertname'
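For completeness, one way to populate the variables used in the request above (a sketch; it assumes the cluster-monitoring thanos-querier route is used as $ROUTE and that the logged-in user is allowed to query alerts):

TOKEN=$(oc whoami -t)
ROUTE=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -k -s -H "Authorization: Bearer $TOKEN" https://$ROUTE/api/v1/alerts | jq '.data.alerts[].labels.alertname'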


Actual results:
CephNodeDown fails to fire.

Expected results:
CephNodeDown fires when a Ceph node has been down for more than 30 seconds.
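To confirm the expected threshold, the alert definition can be inspected on the cluster. A sketch, assuming the Ceph rules are deployed as PrometheusRule objects in the default openshift-storage namespace; the output should show the rule's expr and its "for:" duration:

oc -n openshift-storage get prometheusrule -o yaml | grep -B 2 -A 6 "alert: CephNodeDown"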

Additional info:
OCP and OCS must-gather logs https://url.corp.redhat.com/2a64b06

