Description of problem (please be as detailed as possible and provide log snippets):
Alerts that are auto-resolved should not be fired. For instance, during an upgrade or node replacement the CephOSDDiskNotResponding alert fires. This leads to creating PagerDuty incidents for the ODF-MS SREs.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Creates PagerDuty incidents for SREs for alerts that get auto-resolved quickly.

Is there any workaround available to the best of your knowledge?

Is this issue reproducible? yes

Can this issue be reproduced from the UI? yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Remove an OCS node.
2. Check the PagerDuty system.

Actual results:
Alerts which are auto-resolved are sent to PagerDuty.

Expected results:
Alerts which are auto-resolved should not be sent to PagerDuty; some of the rules fire after a delay of just 1 minute. Since the Ceph Prometheus rules get reconciled by the Ceph component, it is not possible for ODF-MS to patch them in order to increase the delay for a particular rule.

Additional info:
Tested on a ROSA cluster.
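For context on why the short delay pages: a Prometheus alerting rule with a `for:` clause only moves from "pending" to "firing" once its expression has been continuously true for that duration. A minimal sketch (the 10-minute flap length is an illustrative assumption, not a measured value):

```python
from datetime import timedelta

def alert_fires(condition_active: timedelta, for_duration: timedelta) -> bool:
    """Model of Prometheus `for:` behavior: the alert fires only if the
    expression stays true continuously for at least `for_duration`;
    otherwise it remains pending and resolves silently."""
    return condition_active >= for_duration

# A transient OSD flap during node replacement (assume ~10 minutes of the
# expression being true): a 1m delay pages the SRE, a 15m delay does not.
transient = timedelta(minutes=10)
print(alert_fires(transient, timedelta(minutes=1)))   # True  -> pages
print(alert_fires(transient, timedelta(minutes=15)))  # False -> auto-resolves
```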
Had a chat about this with Paul; the discussion mainly revolved around:
- Can these alerts be silenced using Alertmanager's 'silence' feature?
- How can we generalize which, let's say, 'N' alerts we want to silence because they are quickly resolvable? "Quickly resolvable" is a loose term that varies from platform to platform, and we cannot (IMO) come up with a general solution agreeable to all the components.
With that note, isn't it a good idea to have our own alert mechanism in ODF-MS SRE? CC-ing Paul as well.
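For reference, a one-off silence for a planned maintenance window could be created with amtool; the URL, author, and duration below are illustrative placeholders, not values from this cluster:

```shell
# Sketch: silence the flapping alert for the duration of a planned
# node replacement. All values here are placeholders.
amtool silence add \
  --alertmanager.url=http://alertmanager.example:9093 \
  --author="odf-ms-sre" \
  --comment="planned node replacement" \
  --duration=1h \
  alertname=CephOSDDiskNotResponding
```

This still requires knowing in advance which alerts to silence and for how long, which is exactly the generalization problem described above.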
After discussion with a wider audience, we've decided to increase the delay (`for:` duration) of the following alerts to 15m:
- CephMonHighNumberOfLeaderChanges
- CephOSDDiskNotResponding
- CephClusterWarningState
PR: https://github.com/rook/rook/pull/8896
Travis, Sebastian, please take a look.
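For illustration, the change amounts to raising the `for:` clause on the affected rules in the Ceph PrometheusRule. The sketch below is an assumption of the shape of the change; expressions and labels are abbreviated, so see the PR above for the authoritative definitions:

```yaml
# Illustrative sketch only -- refer to the rook PR for the actual rules.
groups:
  - name: ceph-cluster-alerts
    rules:
      - alert: CephClusterWarningState
        expr: ceph_health_status == 1
        for: 15m   # raised so brief HEALTH_WARN during node replacement auto-resolves silently
        labels:
          severity: warning
      - alert: CephOSDDiskNotResponding
        expr: ceph_osd_in == 1 and ceph_osd_up == 0
        for: 15m   # raised from 1m so transient OSD flaps during upgrades do not page
        labels:
          severity: critical
```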
As per the discussion we had (over email, subject: Need some input on BZ#2006865), we should not increase the delay for the 'CephMonHighNumberOfLeaderChanges' alert, so I added one more PR to revert the change for that alert: https://github.com/rook/rook/pull/8909
As per comments 5 and 6:
- The delay for CephMonHighNumberOfLeaderChanges is 5 minutes.
- The delay for CephOSDDiskNotResponding and CephClusterWarningState is 15 minutes.

Tested with:
OCS 4.9.0-214.ci
OCP 4.9.0-0.nightly-2021-10-30-120753
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086