Description of problem (please be as detailed as possible and provide log snippets):
Alerts that are auto-resolved should not be fired. For instance, during an upgrade or node replacement the CephOSDDiskNotResponding alert fires. This leads to creating PagerDuty incidents for the ODF-MS SREs.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Creates PagerDuty incidents for SREs for alerts that get auto-resolved quickly.

Is there any workaround available to the best of your knowledge?

Is this issue reproducible? yes

Can this issue be reproduced from the UI? yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Remove an OCS node.
2. Check the PagerDuty system.

Actual results:
Alerts which are auto-resolved are sent to PagerDuty.

Expected results:
Alerts which are auto-resolved should not be sent to PagerDuty; some of the rules fire after a delay of just 1 minute. Since the Ceph Prometheus rules get reconciled by the Ceph component, it is not possible for ODF-MS to patch them in order to increase the delay for a particular rule.

Additional info:
Tested on a ROSA cluster.
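For context on why the short delay pages: a Prometheus alerting rule with a `for:` clause only moves from "pending" to "firing" once its expression has been continuously true for that duration. A minimal sketch (the 10-minute flap length is an illustrative assumption, not a measured value):

```python
from datetime import timedelta

def alert_fires(condition_active: timedelta, for_duration: timedelta) -> bool:
    """Model of Prometheus `for:` behavior: the alert fires only if the
    expression stays true continuously for at least `for_duration`;
    otherwise it remains pending and resolves silently."""
    return condition_active >= for_duration

# A transient OSD flap during node replacement (assume ~10 minutes of the
# expression being true): a 1m delay pages the SRE, a 15m delay does not.
transient = timedelta(minutes=10)
print(alert_fires(transient, timedelta(minutes=1)))   # True  -> pages
print(alert_fires(transient, timedelta(minutes=15)))  # False -> auto-resolves
```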
Had a chat about this with Paul; the discussion mainly revolved around:
- Can these alerts be silenced using Alertmanager's 'silence' feature?
- How can we generalize which, let's say, 'N' alerts we want to silence because they are quickly resolvable? "Quickly resolvable" is a loose term that varies from platform to platform, and we cannot (IMO) come up with a general solution agreeable to all the components.
With that note, isn't it a good idea to have our own alert mechanism in ODF-MS SRE? CC-ing Paul as well.
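For reference, a one-off silence for a planned maintenance window could be created with amtool; the URL, author, and duration below are illustrative placeholders, not values from this cluster:

```shell
# Sketch: silence the flapping alert for the duration of a planned
# node replacement. All values here are placeholders.
amtool silence add \
  --alertmanager.url=http://alertmanager.example:9093 \
  --author="odf-ms-sre" \
  --comment="planned node replacement" \
  --duration=1h \
  alertname=CephOSDDiskNotResponding
```

This still requires knowing in advance which alerts to silence and for how long, which is exactly the generalization problem described above.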
After discussion with a wider audience, we've decided to increase the delay (`for:` duration) of the following alerts to 15m:
- CephMonHighNumberOfLeaderChanges
- CephOSDDiskNotResponding
- CephClusterWarningState
PR: https://github.com/rook/rook/pull/8896
Travis, Sebastian, please take a look.
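For illustration, the change amounts to raising the `for:` clause on the affected rules in the Ceph PrometheusRule. The sketch below is an assumption of the shape of the change; expressions and labels are abbreviated, so see the PR above for the authoritative definitions:

```yaml
# Illustrative sketch only -- refer to the rook PR for the actual rules.
groups:
  - name: ceph-cluster-alerts
    rules:
      - alert: CephClusterWarningState
        expr: ceph_health_status == 1
        for: 15m   # raised so brief HEALTH_WARN during node replacement auto-resolves silently
        labels:
          severity: warning
      - alert: CephOSDDiskNotResponding
        expr: ceph_osd_in == 1 and ceph_osd_up == 0
        for: 15m   # raised from 1m so transient OSD flaps during upgrades do not page
        labels:
          severity: critical
```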
As per the discussion we had (over email, subject: Need some input on BZ#2006865), we should not increase the delay for the 'CephMonHighNumberOfLeaderChanges' alert, so I added one more PR to revert the change for that alert: https://github.com/rook/rook/pull/8909
As per comments 5 and 6:
- The delay for CephMonHighNumberOfLeaderChanges is 5 minutes.
- The delay for CephOSDDiskNotResponding and CephClusterWarningState is 15 minutes.

Tested with:
OCS 4.9.0-214.ci
OCP 4.9.0-0.nightly-2021-10-30-120753
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086