Bug 2006865

Summary: Ceph alerts that are auto-resolved should not be fired
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Kesavan <kvellalo>
Component: ceph-monitoring
Assignee: arun kumar mohan <amohan>
Status: CLOSED ERRATA
QA Contact: Filip Balák <fbalak>
Severity: urgent
Docs Contact:
Priority: high
Version: 4.8
CC: amohan, ebenahar, mhackett, muagarwa, nberry, nschiede, ocs-bugs, odf-bz-bot, owasserm, pcuzner, rcyriac, rperiyas, sabose, shan, tnielsen
Target Milestone: ---
Target Release: ODF 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: v4.9.0-182.ci
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2010394
Environment:
Last Closed: 2021-12-13 17:46:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2010394

Description Kesavan 2021-09-22 14:50:42 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Alerts that are auto-resolved should not be fired. For instance, the CephOSDDiskNotResponding alert fires during an upgrade or node replacement. This creates PagerDuty incidents for the ODF-MS SRE team.

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
It creates PagerDuty incidents for SREs for alerts that are auto-resolved quickly.

Is there any workaround available to the best of your knowledge?


Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Remove an OCS node.
2. Check the PagerDuty system.


Actual results:
Alerts that are auto-resolved are sent to PagerDuty.

Expected results:
Alerts that are auto-resolved should not be sent to PagerDuty. Some of the rules wait only 1 minute before firing.
Because the Ceph Prometheus rules are reconciled by the Ceph component, ODF-MS cannot patch them to lengthen the firing delay for a particular rule.
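
To illustrate why a short delay pages SREs for conditions that clear on their own, here is a minimal sketch of how a hold period suppresses a transient condition. It assumes the delay in question maps to the `for` clause of a Prometheus alerting rule and that rules are evaluated every 60 seconds; the function name `firing_times` and the sample data are hypothetical, not taken from the Ceph rules.

```python
from typing import Iterable, List


def firing_times(samples: Iterable[bool], eval_interval_s: int, for_s: int) -> List[int]:
    """Return the evaluation timestamps (in seconds) at which the alert fires.

    The condition must stay true for at least `for_s` seconds before firing;
    any false evaluation resets the pending state.
    """
    firing = []
    active_since = None
    for i, condition in enumerate(samples):
        now = i * eval_interval_s
        if not condition:
            active_since = None  # condition cleared: alert resolves or never fires
            continue
        if active_since is None:
            active_since = now  # alert enters the pending state
        if now - active_since >= for_s:
            firing.append(now)
    return firing


# Hypothetical blip: condition holds for five evaluations, then clears by itself.
blip = [True] * 5 + [False] * 5

print(firing_times(blip, eval_interval_s=60, for_s=60))   # [60, 120, 180, 240]
print(firing_times(blip, eval_interval_s=60, for_s=900))  # []
```

With a 1 minute hold the blip fires (and would page), while with a 15 minute hold it never leaves the pending state.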

Additional info:
Tested on ROSA cluster.

Comment 3 arun kumar mohan 2021-09-23 16:45:24 UTC
Had a chat about this with Paul; the discussion mainly revolved around the following questions:

Can these alerts be silenced using Alertmanager's silence feature?
How can we generalize the set of, say, 'N' alerts that we want to silence because they resolve quickly?
'Quickly resolvable' is a loose term that varies from platform to platform, and we cannot (IMO) provide a general solution agreeable to all the components.
With that in mind, wouldn't it be a good idea for ODF-MS SRE to have its own alert mechanism?

CC-ing Paul as well
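
For reference, a rough sketch of what silencing a hand-picked set of alerts could look like. It only builds the JSON body that the Alertmanager v2 API accepts at `POST /api/v2/silences` and does not send it; the helper name `build_silence`, the creator, the comment, and the two-hour window are assumptions for illustration. Note that the alert names must still be enumerated by hand, which is the generalization problem raised above.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alert_names, hours, created_by, comment, now=None):
    """Build a silence body matching any of the given alert names."""
    now = now or datetime.now(timezone.utc)
    # One regex matcher on the alertname label covers all listed alerts.
    pattern = "|".join(sorted(alert_names))
    return {
        "matchers": [{"name": "alertname", "value": pattern, "isRegex": True}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Hypothetical usage: silence two alerts for the duration of a node replacement.
silence = build_silence(
    ["CephOSDDiskNotResponding", "CephClusterWarningState"],
    hours=2,
    created_by="odf-ms-sre",           # assumed value
    comment="planned node replacement",  # assumed value
)
print(json.dumps(silence, indent=2))
```

A silence is time-bounded and has to be created ahead of each maintenance window, so it does not by itself address alerts that resolve quickly at unplanned times.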

Comment 5 arun kumar mohan 2021-09-30 20:17:29 UTC
After discussion with a wider audience, we've decided to increase the delay of the following alerts to 15m:

CephMonHighNumberOfLeaderChanges
CephOSDDiskNotResponding
CephClusterWarningState

PR: https://github.com/rook/rook/pull/8896

Travis, Sebastian, please take a look.

Comment 6 arun kumar mohan 2021-10-04 06:02:04 UTC
As per the discussion we had (over email, subject: Need some input on BZ#2006865), we should not increase the delay for the 'CephMonHighNumberOfLeaderChanges' alert.
So I added one more PR, https://github.com/rook/rook/pull/8909, to revert the change for that alert.

Comment 10 Filip Balák 2021-11-01 10:29:48 UTC
As per comments 5 and 6:
The delay for CephMonHighNumberOfLeaderChanges is 5 minutes.
The delay for CephOSDDiskNotResponding and CephClusterWarningState is 15 minutes.
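
A check like the one above could be scripted roughly as follows. This is a sketch, not the procedure used for verification: it assumes the deployed rules have already been loaded into a list of dicts with `alert` and `for` keys (for example from the rules section of a PrometheusRule object), and the helper names are hypothetical. Only single-unit durations such as `5m` or `15m` are handled.

```python
import re

# Expected delays as stated in this comment.
EXPECTED_FOR = {
    "CephMonHighNumberOfLeaderChanges": "5m",
    "CephOSDDiskNotResponding": "15m",
    "CephClusterWarningState": "15m",
}

_UNITS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration: str) -> int:
    """Convert a single-unit duration such as '15m' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def mismatches(rules, expected):
    """Return {alert: actual 'for' value or None} for alerts that deviate."""
    found = {r["alert"]: r.get("for", "0s") for r in rules if "alert" in r}
    result = {}
    for name, want in expected.items():
        got = found.get(name)
        if got is None or to_seconds(got) != to_seconds(want):
            result[name] = got
    return result


# Hypothetical rule data in which one alert still has the old 1 minute delay.
rules = [
    {"alert": "CephMonHighNumberOfLeaderChanges", "for": "5m"},
    {"alert": "CephOSDDiskNotResponding", "for": "1m"},
    {"alert": "CephClusterWarningState", "for": "15m"},
]
print(mismatches(rules, EXPECTED_FOR))  # {'CephOSDDiskNotResponding': '1m'}
```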

Tested with:
OCS 4.9.0-214.ci
OCP 4.9.0-0.nightly-2021-10-30-120753

Comment 12 errata-xmlrpc 2021-12-13 17:46:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086