Bug 2005290 - namespace: openshift-storage label missing for few OCS alerts
Summary: namespace: openshift-storage label missing for few OCS alerts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: arun kumar mohan
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks: 2006323
 
Reported: 2021-09-17 10:53 UTC by Dhruv Bindra
Modified: 2023-08-09 16:37 UTC

Fixed In Version: v4.9.0-182.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2006323 (view as bug list)
Environment:
Last Closed: 2021-12-13 17:46:17 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | red-hat-storage rook pull 296 | 0 | None | open | Bug 2005290: [release-4.9] add namespace to ceph queries | 2021-10-04 08:55:14 UTC
Github | rook rook pull 8774 | 0 | None | Merged | ceph: prometheus rules format changes | 2021-10-04 08:52:29 UTC
Github | rook rook pull 8793 | 0 | None | Merged | monitoring: add namespace to ceph node down query | 2021-10-04 08:52:31 UTC
Github | rook rook pull 8882 | 0 | None | Merged | ceph: change CephAbsentMgr to use 'up' query | 2021-09-30 10:29:34 UTC
Github | rook rook pull 8901 | 0 | None | Merged | ceph: adding 'namespace' field to the needed ceph queries | 2021-10-04 05:39:01 UTC
Red Hat Bugzilla | 2004478 | 1 | high | CLOSED | MGR related alerts are not working | 2021-12-16 19:49:18 UTC
Red Hat Product Errata | RHSA-2021:5086 | 0 | None | None | None | 2021-12-13 17:46:45 UTC

Internal Links: 2004478

Description Dhruv Bindra 2021-09-17 10:53:27 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When alerts such as CephMgrIsAbsent and CephNodeDown fire, they are missing the label namespace:openshift-storage. As a result, the alerts are visible only in the Alertmanager UI and no incident is raised in PagerDuty.
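
To illustrate the symptom, here is a minimal Python sketch of label-based alert routing. The matcher `namespace=openshift-storage`, the receiver names, and the `route` helper are all assumptions made for illustration; this report does not show the actual Alertmanager configuration.

```python
# Hypothetical model of label-based routing. The matcher and receiver names
# are assumptions for illustration, not the real Alertmanager configuration.

def route(alert_labels, matchers, receiver, default="ui-only"):
    """Return `receiver` if every matcher equals the alert's label, else `default`."""
    for name, value in matchers.items():
        if alert_labels.get(name) != value:
            return default
    return receiver

MATCHERS = {"namespace": "openshift-storage"}  # assumed routing condition

with_label = {"alertname": "CephNodeDown", "namespace": "openshift-storage"}
without_label = {"alertname": "CephNodeDown"}  # label missing, as in this bug

print(route(with_label, MATCHERS, "pagerduty"))     # pagerduty
print(route(without_label, MATCHERS, "pagerduty"))  # ui-only
```

Under this assumption, an alert without the namespace label never reaches the paging receiver, which would match the behaviour described above.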

Is this issue reproducible?
Yes


Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Scale deployment of rook-ceph-mgr-a to 0.
2. Check PagerDuty system


Actual results:
No alert is sent to PagerDuty within 10 minutes.


Expected results:
An alert should be sent to PagerDuty according to the alerting rule (after 5 minutes).

Additional info:
Tested on ROSA cluster.

Comment 7 arun kumar mohan 2021-09-21 15:54:57 UTC
This has two parts:

Alert CephNodeDown: this can be fixed, because metric results are available for this query.

Alert CephMgrIsAbsent: here we check for the absence of the metrics, and since the metrics are absent we cannot determine which namespace the (nonexistent) result belongs to.

Will talk further with Dhruv and see what we can do.
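
The CephMgrIsAbsent difficulty can be sketched in Python. This is a simplified model, not Prometheus code: samples are (labels, value) pairs, and the `absent` helper below models only the case where the argument is an expression rather than a bare selector (for a bare selector, PromQL copies the equality matchers into the result labels).

```python
# Simplified model of PromQL absent() applied to a non-selector expression
# (for example a comparison). Samples are (labels, value) pairs.

def absent(vector):
    """Empty input -> one sample with value 1 and no labels; otherwise -> empty."""
    if vector:
        return []
    return [({}, 1.0)]

# mgr metrics present: the sample carries its namespace, absent() yields nothing.
present = [({"job": "rook-ceph-mgr", "namespace": "openshift-storage"}, 1.0)]
# mgr scaled to 0: there are no samples to take a namespace from.
missing = []

print(absent(present))  # []
print(absent(missing))  # [({}, 1.0)]
```

The only sample produced when the mgr is gone has an empty label set, so there is no namespace to propagate to the alert.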

Comment 8 arun kumar mohan 2021-09-22 14:56:13 UTC
I have raised a PR to fix CephNodeDown: https://github.com/rook/rook/pull/8793, which adds the 'namespace' field to the metric result.

PS: this PR is on top of a general PR: https://github.com/rook/rook/pull/8774

Travis, Sebastian, please take a look.

Comment 9 Sébastien Han 2021-09-23 08:15:05 UTC
(In reply to arun kumar mohan from comment #8)
> I have raised the PR to fix the CephNodeDown :
> https://github.com/rook/rook/pull/8793, which adds the 'namespace' field
> into the metric result
> 
> PS: this PR is on top of a general PR: https://github.com/rook/rook/pull/8774
> 
> Travis, Sebastian, please take a look.

I reviewed the base PR yesterday and am waiting for a reply: https://github.com/rook/rook/pull/8774#pullrequestreview-761712175

Comment 10 arun kumar mohan 2021-09-23 19:50:50 UTC
Thanks Sebastian.
Addressed and updated both the PRs.

Comment 11 Paul Cuzner 2021-09-23 23:58:24 UTC
For the mgr related alerts

CephMgrIsAbsent
label_replace(absent(up{job="rook-ceph-mgr"} == 1),"namespace", "openshift-storage","","")
This yields a metric with a namespace label, built on the result of the absent function.

Alternatively, you could skip the absent function and just use up. The up metric denotes scrape job success, so if it fails, either the mgr is down or the prometheus endpoint is not reachable. When up is used directly, the labelset already includes the namespace, so there is no need for label_replace.
up{job="rook-ceph-mgr"} == 0 ... would indicate that the scrape from mgr/prometheus is failing, so alerts/metrics are compromised.

Another way to look at this is to look for the pod.
kube_pod_status_phase{namespace="openshift-storage", pod=~"rook-ceph-mgr.*",phase="Running"} == 1

However, I think sticking with the "up" based solution works better, since if the scrape job is failing, no alerts will fire and the dashboards will be incomplete.

CephMgrIsMissingReplicas
Do we really need this one? We run with a single mgr, and since the CephMgrIsAbsent alert handles the case where the mgr is not present, I'm not sure what the value is here.
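
The two CephMgrIsAbsent options above can be compared with a small Python model. This is a sketch under simplifying assumptions, not Prometheus code: samples are (labels, value) pairs, `label_replace` treats the replacement as a literal string (capture-group expansion is not modeled), and `equals_zero` stands in for the `== 0` filter.

```python
import re

def label_replace(vector, dst, replacement, src, regex):
    """Set label `dst` to `replacement` when `regex` fully matches the value of `src`."""
    out = []
    for labels, value in vector:
        new_labels = dict(labels)
        # A missing source label is treated as the empty string.
        if re.fullmatch(regex, labels.get(src, "")) is not None:
            new_labels[dst] = replacement
        out.append((new_labels, value))
    return out

def equals_zero(vector):
    """Model of `vector == 0`: keep samples whose value is 0, labels intact."""
    return [(labels, value) for labels, value in vector if value == 0]

# Option 1: absent() produced a label-less sample; patch the namespace onto it.
absent_result = [({}, 1.0)]
patched = label_replace(absent_result, "namespace", "openshift-storage", "", "")
print(patched)  # [({'namespace': 'openshift-storage'}, 1.0)]

# Option 2: a failing scrape still reports `up` with its own labels.
up = [({"job": "rook-ceph-mgr", "namespace": "openshift-storage"}, 0.0)]
print(equals_zero(up))
```

In the first option the namespace value is hard-coded in the query, while in the second it comes from the labels already attached to the up sample.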

Comment 16 arun kumar mohan 2021-09-29 15:51:34 UTC
For the 'CephMgrIsAbsent' alert, of the options listed in Comment#11, we have decided to go with the one Paul recommended (using 'up{job="rook-ceph-mgr"} == 0' instead of 'absent'). Making changes.

Comment 17 arun kumar mohan 2021-09-29 19:52:51 UTC
Rook PR sent: https://github.com/rook/rook/pull/8882
Travis, Sebastian, please take a look.

Comment 18 Dhruv Bindra 2021-09-30 06:21:23 UTC
@nberry Yes, a few more alerts are missing the label, and I'm compiling a list of the alerts that are missing the namespace label. I'm not sure whether the alert for data/cluster unavailability is affected by this bug, but it will be clear once we have the list.

Comment 19 Sébastien Han 2021-09-30 07:06:22 UTC
(In reply to arun kumar mohan from comment #17)
> Rook PR send: https://github.com/rook/rook/pull/8882
> Travis, Sebastian please take a look

Merged.

Comment 20 Dhruv Bindra 2021-09-30 11:17:33 UTC
@amohan @

Comment 21 Dhruv Bindra 2021-09-30 11:18:59 UTC
@amohan @nberry This is the list of alerts that are missing the namespace label:
https://docs.google.com/document/d/1u3mG_zq4RJAJDL_w54leBxQfeTEFFSsxETLTVhywC88/edit?usp=sharing

Comment 22 arun kumar mohan 2021-10-01 10:43:55 UTC
According to Dhruv's document, these are the alerts that require an additional 'namespace' field:

CephMdsMissingReplicas
CephMgrIsAbsent
CephMgrIsMissingReplicas
CephOSDVersionMismatch
CephMonVersionMismatch
CephMonQuorumAtRisk

CephNodeDown [not done]

The following PR adds the 'namespace' field to all of these alerts except the 'CephNodeDown' query:

PR: https://github.com/rook/rook/pull/8901

Travis, Sebastian (again) please take a look... =)

Comment 23 arun kumar mohan 2021-10-01 12:25:16 UTC
Thanks, Sebastian, for merging the rook PR.
Created the backport PRs:
release-4.8: https://github.com/red-hat-storage/rook/pull/295
release-4.9: https://github.com/red-hat-storage/rook/pull/296

@muagarwa , please take a look.

Comment 25 Filip Balák 2021-11-16 14:16:57 UTC
Based on regression tests, this fix doesn't break any alert in OCS 4.9.1-227.ci. The Managed Service environment will be tested in the corresponding bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=2006222
https://bugzilla.redhat.com/show_bug.cgi?id=2009397
https://bugzilla.redhat.com/show_bug.cgi?id=2009396
https://bugzilla.redhat.com/show_bug.cgi?id=2006342
https://bugzilla.redhat.com/show_bug.cgi?id=2004478

--> VERIFIED

Tested with:
OCS 4.9.1-227.ci
OCP 4.9.0-0.nightly-2021-11-06-034743

Comment 27 errata-xmlrpc 2021-12-13 17:46:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086

