Description of problem (please be as detailed as possible and provide log snippets):
When alerts such as CephMgrIsAbsent and CephNodeDown are fired, they are missing the label namespace=openshift-storage, so the alerts are only visible in the Alertmanager UI but no incident is raised in PagerDuty.

Is this issue reproducible? Yes
Can this issue be reproduced from the UI? Yes
If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Scale the rook-ceph-mgr-a deployment to 0.
2. Check the PagerDuty system.

Actual results: No alert is sent to PagerDuty within 10 minutes.
Expected results: An alert should be sent to PagerDuty according to the alerting rule (after 5 minutes).
Additional info: Tested on a ROSA cluster.
This has two parts. The alert CephNodeDown can be fixed, since the metric results are available for its query. For CephMgrIsAbsent, however, we check for the absence of the metric, and since the metric is absent we cannot figure out which namespace the (missing) result belongs to. Will talk further with Dhruv and see what we can do.
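To illustrate the problem (a PromQL sketch, not the shipped rule): absent() synthesizes a result series that only carries labels spelled out literally in its argument's equality matchers, so labels like namespace that exist on the real metric cannot be recovered once the metric is gone:

```promql
# When the metric is missing, this returns a single series
# {job="rook-ceph-mgr"} => 1 -- the namespace label is lost,
# because it is not written explicitly in the selector.
absent(up{job="rook-ceph-mgr"} == 1)

# The only way to get a namespace on the result is to inject
# it manually, e.g. with label_replace:
label_replace(absent(up{job="rook-ceph-mgr"} == 1), "namespace", "openshift-storage", "", "")
```

This is why the absence-based alerts need special handling, while alerts whose queries return real series (like CephNodeDown) can simply surface the namespace label already present on the metric.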
I have raised a PR to fix CephNodeDown: https://github.com/rook/rook/pull/8793, which adds the 'namespace' field to the metric result.

PS: this PR is on top of a general PR: https://github.com/rook/rook/pull/8774

Travis, Sebastian, please take a look.
(In reply to arun kumar mohan from comment #8)
> I have raised the PR to fix the CephNodeDown :
> https://github.com/rook/rook/pull/8793, which adds the 'namespace' filed
> into the metric result
>
> PS: this PR is on top of a general PR: https://github.com/rook/rook/pull/8774
>
> Travis, Sebastian, please take a look.

Reviewed the base PR yesterday and am waiting for a reply: https://github.com/rook/rook/pull/8774#pullrequestreview-761712175
Thanks Sebastian. Addressed the comments and updated both PRs.
For the mgr-related alerts:

CephMgrIsAbsent

  label_replace(absent(up{job="rook-ceph-mgr"} == 1), "namespace", "openshift-storage", "", "")

This will yield a metric with a namespace label based on the absent function.

Alternatively, you could skip the absent function and just use up. The up metric denotes scrape job success, so if it fails, it implies either the mgr is down or the Prometheus endpoint is not reachable. By using up directly, the label set already includes the namespace, so there is no need for label_replace:

  up{job="rook-ceph-mgr"} == 0

...would indicate the scrape from mgr/prometheus is failing, so alerts/metrics are compromised.

Another way to look at this is to look for the pod:

  kube_pod_status_phase{namespace="openshift-storage", pod=~"rook-ceph-mgr.*", phase="Running"} == 1

However, I think sticking with the "up"-based solution works better, since if the scrape job is failing, you'll get no alerts firing and the dashboards will be incomplete.

CephMgrIsMissingReplicas

Do we really need this one? We run with a single mgr, and since the CephMgrIsAbsent alert handles the case where the mgr is not present, I'm not sure what the value is here.
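A minimal sketch of what the up-based rule could look like as a Prometheus alerting rule (the group name, for: duration, severity, and annotation text here are assumptions for illustration, not the values merged in Rook):

```yaml
groups:
  - name: cluster-mgr.rules   # hypothetical group name
    rules:
      - alert: CephMgrIsAbsent
        expr: up{job="rook-ceph-mgr"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: >
            The Ceph mgr scrape target is down: either the mgr pod is
            absent or its Prometheus endpoint is unreachable, so
            monitoring data and dashboards may be incomplete.
```

Because up is a real series produced by the scrape, the alert inherits the target's namespace label automatically, with no label_replace needed, which is exactly what PagerDuty routing requires here.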
For the alert 'CephMgrIsAbsent', out of the options/solutions listed in comment #11, we have decided to go with the one Paul recommended (the up{job="rook-ceph-mgr"} == 0 query instead of 'absent'). Making the changes.
Rook PR sent: https://github.com/rook/rook/pull/8882
Travis, Sebastian, please take a look.
@nberry Yes, there are a few more alerts that are missing the label, and I'm compiling a list of alerts that are missing the namespace label. I'm not sure whether the alerts for data/cluster unavailability are affected by this bug, but it will be clear once we have the list.
(In reply to arun kumar mohan from comment #17) > Rook PR send: https://github.com/rook/rook/pull/8882 > Travis, Sebastian please take a look Merged.
@amohan @nberry This is the list of alerts that are missing the namespace label: https://docs.google.com/document/d/1u3mG_zq4RJAJDL_w54leBxQfeTEFFSsxETLTVhywC88/edit?usp=sharing
According to Dhruv's document, these are the alerts that require the additional 'namespace' field:

- CephMdsMissingReplicas
- CephMgrIsAbsent
- CephMgrIsMissingReplicas
- CephOSDVersionMismatch
- CephMonVersionMismatch
- CephMonQuorumAtRisk
- CephNodeDown [not done]

Except for the CephNodeDown query, the following PR adds the 'namespace' field to these alerts:
PR: https://github.com/rook/rook/pull/8901

Travis, Sebastian (again) please take a look... =)
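For readers following along, the kind of change involved generally looks like one of the following two patterns (a sketch with an assumed expression, not the actual diff in the PR above):

```yaml
# Sketch only -- the real expressions are in rook/rook#8901.
- alert: CephMdsMissingReplicas
  # Option 1: aggregate "by (namespace)" so the label survives
  # the sum and is carried on the firing alert.
  expr: sum by (namespace) (ceph_mds_metadata == 1) < 2
  for: 5m
  labels:
    severity: warning
    # Option 2: when the query result cannot carry the label at all
    # (e.g. absent()-style checks), attach it statically instead:
    # namespace: openshift-storage
```

Either way, the firing alert ends up with a namespace label that Alertmanager routing (and hence the PagerDuty integration) can match on.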
Thanks Sebastian for merging the Rook PR. Created the backport PRs:
release-4.8: https://github.com/red-hat-storage/rook/pull/295
release-4.9: https://github.com/red-hat-storage/rook/pull/296

@muagarwa, please take a look.
Based on regression tests, this fix doesn't break any alert in OCS 4.9.1-227.ci. The Managed Service environment will be tested in the corresponding bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=2006222
https://bugzilla.redhat.com/show_bug.cgi?id=2009397
https://bugzilla.redhat.com/show_bug.cgi?id=2009396
https://bugzilla.redhat.com/show_bug.cgi?id=2006342
https://bugzilla.redhat.com/show_bug.cgi?id=2004478

--> VERIFIED

Tested with:
OCS 4.9.1-227.ci
OCP 4.9.0-0.nightly-2021-11-06-034743
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086