Bug 1935342
| Summary: | [RFE] Add OSD flapping alert | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Anmol Sachan <asachan> |
| Component: | rook | Assignee: | Anmol Sachan <asachan> |
| Status: | CLOSED ERRATA | QA Contact: | suchita <sgatfane> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.7 | CC: | asachan, dwalveka, ebenahar, etamir, madam, mbukatov, muagarwa, nberry, nthomas, ocs-bugs, sgatfane |
| Target Milestone: | --- | Keywords: | AutomationBackLog, FutureFeature |
| Target Release: | OCS 4.7.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Enhancement | |
| Doc Text: |
.Prometheus alert for OSD restart
This enhancement adds a Prometheus alert to notify if an OpenShift Container
Storage OSD restarts more than 5 times in 5 minutes.
The alert message is as follows:
----
Storage daemon osd.x has restarted 5 times in the last 5 minutes. Please check the pod events or ceph status to find out the cause.
----
Here, x represents the OSD number.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-05-19 09:20:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1938134 | ||
|
Comment 8
Elad
2021-03-21 12:04:13 UTC
(In reply to Elad from comment #8)
> Hi Anmol,
>
> Could you please suggest how to test this alert?

Latency, and especially enough packet drops towards that OSD, would cause it.

Verified on a cluster with the following versions:
OCP: 4.7.0-0.nightly-2021-03-21-181832
OCS: ocs-operator.v4.7.0-307.ci

Created the scenario by repeatedly deleting one of the "rook-ceph-osd-*-*" pods until the metric showed 5 flaps. The following alert message is then shown:

"Storage daemon osd.2 has restarted 5 times in last 5 minutes. Please check the pod events or ceph status to find out the cause."

The alert disappears as soon as the pod restarts stop.

Based on the above observation, marking the BZ as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041
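For readers who want to reason about when the alert fires and clears, the condition from the Doc Text (an OSD restarting about 5 times within 5 minutes) can be modeled as a sliding-window count. The following Python sketch is illustrative only and is not the Prometheus rule shipped with the product. The function names, the `>= 5` comparison, and the use of raw restart timestamps are all assumptions made for this example; the real alert is evaluated by Prometheus against Ceph metrics.

```python
from datetime import datetime, timedelta

# Values taken from the alert text; the exact comparison (>=) is an assumption.
WINDOW = timedelta(minutes=5)
THRESHOLD = 5


def restarts_in_window(restart_times, now, window=WINDOW):
    """Count restarts whose timestamp falls in the interval (now - window, now]."""
    start = now - window
    return sum(1 for t in restart_times if start < t <= now)


def is_flapping(restart_times, now, threshold=THRESHOLD, window=WINDOW):
    """Return True when the restart count in the window reaches the threshold."""
    return restarts_in_window(restart_times, now, window) >= threshold


def alert_message(osd_id, count):
    """Format a message in the same wording as the alert quoted in this bug."""
    return (
        f"Storage daemon osd.{osd_id} has restarted {count} times in last 5 minutes. "
        "Please check the pod events or ceph status to find out the cause."
    )


if __name__ == "__main__":
    now = datetime(2021, 3, 21, 12, 0, 0)
    # Hypothetical data: five pod deletions, 50 seconds apart.
    restarts = [now - timedelta(seconds=50 * i) for i in range(5)]
    if is_flapping(restarts, now):
        print(alert_message(2, restarts_in_window(restarts, now)))
    # Six minutes later with no further restarts, the window is empty again.
    print(is_flapping(restarts, now + timedelta(minutes=6)))
```

This mirrors the behavior observed during verification: the condition holds once five restarts fall inside the five-minute window, and it stops holding once the restarts age out of the window.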