Bug 2006865
| Summary: | Ceph alerts that are auto-resolved should not be fired | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Kesavan <kvellalo> | |
| Component: | ceph-monitoring | Assignee: | arun kumar mohan <amohan> | |
| Status: | CLOSED ERRATA | QA Contact: | Filip Balák <fbalak> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | high | |||
| Version: | 4.8 | CC: | amohan, ebenahar, mhackett, muagarwa, nberry, nschiede, ocs-bugs, odf-bz-bot, owasserm, pcuzner, rcyriac, rperiyas, sabose, shan, tnielsen | |
| Target Milestone: | --- | |||
| Target Release: | ODF 4.9.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | v4.9.0-182.ci | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2010394 (view as bug list) | Environment: | ||
| Last Closed: | 2021-12-13 17:46:30 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2010394 | |||
Description
Kesavan
2021-09-22 14:50:42 UTC
Had a chat about this with Paul. The discussion mainly revolved around two questions: Can these alerts be silenced using Alertmanager's silencing feature? And how can we generalize which, let's say, 'N' alerts I want to silence because they are quickly resolvable? "Quickly resolvable" is a loose term that varies from platform to platform, and we cannot (IMO) come up with a general solution agreeable to all the components. On that note, isn't it a good idea to have our own alert mechanism in ODF-MS SRE? CC-ing Paul as well.

After discussion with a wider audience, we've decided to increase the delay of the following alerts to 15m: CephMonHighNumberOfLeaderChanges, CephOSDDiskNotResponding, CephClusterWarningState.
PR: https://github.com/rook/rook/pull/8896
Travis, Sebastian, please take a look.

As per the discussion we had (over email, subject: Need some input on BZ#2006865), we should not increase the delay for the alert 'CephMonHighNumberOfLeaderChanges'. So I added one more PR, https://github.com/rook/rook/pull/8909, to revert the change for that alert.

As per comments 5 and 6:
The delay for CephMonHighNumberOfLeaderChanges is 5 minutes.
The delay for CephOSDDiskNotResponding and CephClusterWarningState is 15 minutes.
Tested with:
OCS 4.9.0-214.ci
OCP 4.9.0-0.nightly-2021-10-30-120753

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:5086
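
To illustrate why a longer delay keeps auto-resolved alerts from firing, here is a minimal Python sketch. It assumes the "delay" in this bug behaves like a hold period: the alert condition must stay true continuously for the whole delay before the alert fires. The function name `fires`, the one-minute evaluation interval, and the sample data are hypothetical and are not taken from the Rook PRs.

```python
from datetime import timedelta


def fires(samples, delay, interval=timedelta(minutes=1)):
    """Return True if the alert would fire under a simplified hold-period model.

    `samples` holds one boolean per evaluation (True = condition met).
    The alert fires once the condition has been continuously true for at
    least `delay`; any False sample resets the pending state.
    """
    active_since = None  # index of the first sample in the current true streak
    for i, active in enumerate(samples):
        if not active:
            active_since = None  # condition cleared, alert auto-resolves
            continue
        if active_since is None:
            active_since = i  # alert becomes pending here
        if (i - active_since) * interval >= delay:
            return True
    return False


# Hypothetical data: a condition that clears on its own after 8 evaluations.
blip = [False] * 2 + [True] * 8 + [False] * 5
# Hypothetical data: a condition that persists for 20 evaluations.
sustained = [True] * 20

print(fires(blip, timedelta(minutes=5)))
print(fires(blip, timedelta(minutes=15)))
print(fires(sustained, timedelta(minutes=15)))
```

Under this model, the short-lived condition fires with a 5-minute delay but not with a 15-minute one, while a condition that persists still fires. That is the trade-off the comments weigh: a longer delay hides transient states but also postpones notification of real problems.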