1935342 – [RFE] Add OSD flapping alert

Summary: [RFE] Add OSD flapping alert

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	rook
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	OCS 4.7.0
Assignee:	Anmol Sachan
QA Contact:	suchita
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1938134

Reported:	2021-03-04 16:35 UTC by Anmol Sachan
Modified:	2021-08-05 12:49 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	.Prometheus alert for OSD restart
This enhancement adds a Prometheus alert that notifies you if an OpenShift Container Storage OSD restarts more than 5 times in 5 minutes. The alert message is as follows:
----
Storage daemon osd.x has restarted 5 times in the last 5 minutes. Please check the pod events or ceph status to find out the cause.
----
Here, x represents the OSD number.
Clone Of:
Environment:
Last Closed:	2021-05-19 09:20:08 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift rook pull 192	None	closed	Bug 1935342: ceph: add osd flapping alert	2021-03-15 05:23:21 UTC
Github	rook rook pull 7358	None	open	ceph: add osd flapping alert	2021-03-05 12:47:09 UTC
Red Hat Product Errata	RHSA-2021:2041	None	None	None	2021-05-19 09:20:59 UTC

Comment 8 Elad 2021-03-21 12:04:13 UTC

Hi Anmol,

Could you please suggest how to test this alert?

Comment 9 Yaniv Kaul 2021-03-21 14:00:18 UTC

(In reply to Elad from comment #8)
> Hi Anmol,
> 
> Could you please suggest how to test this alert?

Latency, and especially a sufficient number of packet drops toward that OSD, would cause it.

Comment 13 suchita 2021-03-24 09:46:56 UTC

Verified on a cluster with the following versions:
OCP: 4.7.0-0.nightly-2021-03-21-181832
OCS: ocs-operator.v4.7.0-307.ci

Created the scenario by repeatedly deleting one of the "rook-ceph-osd-*-*" pods until the metric showed 5 flaps. The following alert message was then displayed:

"Storage daemon osd.2 has restarted 5 times in last 5 minutes. Please check the pod events or ceph status to find out the cause."

The alert disappears as soon as the pod restarts are stopped.

Based on the above observation, marking the BZ as verified.
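
For readers who want to reason about the behaviour observed above (the alert appears at 5 flaps and clears once restarts stop), here is a minimal Python sketch of a sliding-window restart counter. It is not the rule shipped by rook, whose expression is not quoted in this report. The class name `RestartWindow`, the `>=` comparison, and the treatment of one pod deletion as one restart are all assumptions made for illustration.

```python
from collections import deque

WINDOW_SECONDS = 5 * 60  # 5-minute window, taken from the alert text
THRESHOLD = 5            # restart count quoted in the alert message


class RestartWindow:
    """Hypothetical helper: counts restarts of one OSD in a sliding window."""

    def __init__(self, window=WINDOW_SECONDS, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self.restarts = deque()  # restart timestamps in seconds, oldest first

    def record_restart(self, timestamp):
        # Timestamps are assumed to arrive in non-decreasing order.
        self.restarts.append(timestamp)

    def count(self, now):
        # Drop restarts that have aged out of the window, then count the rest.
        while self.restarts and self.restarts[0] <= now - self.window:
            self.restarts.popleft()
        return len(self.restarts)

    def firing(self, now):
        # Assumption: the alert fires at exactly 5 restarts, as seen in comment 13.
        return self.count(now) >= self.threshold


osd2 = RestartWindow()
for t in (0, 60, 120, 180, 240):  # one pod deletion per minute
    osd2.record_restart(t)
print(osd2.firing(240))  # five restarts inside the window
print(osd2.firing(400))  # older restarts have aged out
```

Under these assumptions the counter fires while restarts keep arriving and stops firing once older restarts age out of the window, which matches the alert clearing after the pod deletions were stopped.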

Comment 18 errata-xmlrpc 2021-05-19 09:20:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041