Bug 2256725 - The MDSCacheUsageHigh alert is lacking a call to action
Summary: The MDSCacheUsageHigh alert is lacking a call to action
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Santosh Pillai
QA Contact: Nagendra Reddy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-01-04 08:02 UTC by Prasad Desala
Modified: 2024-03-19 15:30 UTC
CC: 7 users

Fixed In Version: 4.15.0-123
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:30:36 UTC
Embargoed:




Links:
- GitHub openshift/runbooks pull 160 (open): Add runbooks for CephMdsCacheUsageHigh (2024-01-19 17:27:29 UTC)
- GitHub red-hat-storage/ocs-operator pull 2401 (draft): add runbook url for MDSCacheUsageHigh alert (2024-01-22 04:39:00 UTC)
- GitHub red-hat-storage/ocs-operator pull 2412 (open): Bug 2256725: [release-4.15] add runbook url for MDSCacheUsageHigh alert (2024-01-23 05:17:32 UTC)
- Red Hat Product Errata RHSA-2024:1383 (2024-03-19 15:30:39 UTC)

Description Prasad Desala 2024-01-04 08:02:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
===================================================================================
The MDSCacheUsageHigh alert below fires when MDS cache usage breaches 95% of the mds_cache_memory_limit, but the alert does not provide clear instructions or steps to take in response.

The alert should include a call to action, providing either steps to increase the memory request for MDS pods or a link to the documentation on what to do when the cache oversize alert is received.


Alert:

Name: MDSCacheUsageHigh
Severity: Critical
Description: MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-b has exceeded above 95% of the requested value. Increase the memory request for mds.ocs-storagecluster-cephfilesystem-b pod.
Message: High MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-b.


Version of all relevant components (if applicable):
odf version: 4.15.0-102


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
No, this is a new feature in 4.15

Steps to Reproduce:
===================
1) Create a 3-master, 3-worker OCP cluster and install ODF on it.
2) Create multiple CephFS PVCs with the RWX access mode (a sample PVC is sketched below).
3) Attach multiple pods to those PVCs and start continuous file creation and metadata operations.
4) Wait until MDS cache usage breaches 95% and the alert fires.
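
For illustration, a minimal CephFS RWX PVC for step 2 could look roughly like the following; the PVC name, namespace, and size are hypothetical, and ocs-storagecluster-cephfs is the default ODF CephFS storage class:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-rwx-pvc-1          # hypothetical name
  namespace: mds-cache-test       # hypothetical namespace
spec:
  accessModes:
    - ReadWriteMany               # RWX, so many pods can share the volume
  resources:
    requests:
      storage: 10Gi               # assumed size; any size works for this test
  storageClassName: ocs-storagecluster-cephfs

Creating many such PVCs, mounting each in several pods, and continuously creating files and running metadata-heavy operations (stat, rename, small-file churn) drives MDS cache usage towards the 95% threshold.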

Actual results:
===============
The MDSCacheUsageHigh alert is lacking a call to action

Expected results:
=================
The alert should include a call to action, providing either steps to increase the memory request for MDS pods or a link to the documentation on what to do when the cache oversize alert is received.

Comment 2 Santosh Pillai 2024-01-04 08:14:16 UTC
(In reply to Prasad Desala from comment #0)
> Description of problem (please be detailed as possible and provide log

> Actual results:
> ===============
> The MDSCacheUsageHigh alert is lacking a call to action
> 
> Expected results:
> =================
> The alert should include a call to action, providing either steps to
> increase the memory request for MDS pods or a link to the documentation on
> what to do when the cache oversize alert is received.


Agree that the current message in the alert, "Increase the memory request for mds.ocs-storagecluster-cephfilesystem-b pod", might not be very useful for the customer. I can link the existing article (https://access.redhat.com/solutions/6959127) on increasing the memory resources via the StorageCluster yaml.

Let me know if the above link would suffice. IMO, it should be good.
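
For illustration only (the linked KCS article is the authoritative procedure), raising the MDS resources through the StorageCluster CR would look roughly like the sketch below; the CPU and memory values are placeholders, not recommendations:

apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  resources:
    mds:
      requests:
        cpu: "3"          # placeholder values; size these for the workload
        memory: 16Gi
      limits:
        cpu: "3"
        memory: 16Gi

Once the change is applied (for example by editing the CR with oc edit storagecluster -n openshift-storage), the operator should reconcile and restart the MDS pods with the new resource settings.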

Comment 3 Mudit Agarwal 2024-01-04 08:19:30 UTC
Adding Bipin to get his thoughts.

Comment 7 Malay Kumar parida 2024-01-09 04:10:29 UTC
*** Bug 2257310 has been marked as a duplicate of this bug. ***

Comment 8 Malay Kumar parida 2024-01-18 05:49:26 UTC
Hi Santosh, Any plans for this BZ in 4.15? else we can move it out to 4.16.

Comment 9 Santosh Pillai 2024-01-18 06:59:14 UTC
(In reply to Malay Kumar parida from comment #8)
> Hi Santosh, Any plans for this BZ in 4.15? else we can move it out to 4.16.

It will be in 4.15.

Comment 13 Nagendra Reddy 2024-01-29 14:59:16 UTC
I can see the below link has been provided in the alert.

RunBook: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md

This looks like upstream documentation. Do we need to have downstream documentation as well?

I feel it's better to link it as a KCS article.


@ Bipin,

What are your thoughts?

Comment 14 Santosh Pillai 2024-01-30 03:44:56 UTC
Using runbooks is the recommended way to add a call to action to alerts. You can refer to the `runbook_url` annotation of the existing alerts; they all point to the same documentation.

https://github.com/red-hat-storage/ocs-operator/blob/2ef8a88377a956cb6a0b2543c72873a03eb8d3c9/metrics/deploy/prometheus-ocs-rules.yaml#L72
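
For context, the wiring is just an extra annotation on the PrometheusRule entry. A rough sketch of what the updated rule could look like is below; the description, message, and runbook_url values come from this bug, while the templated label, expression, for, and severity fields are placeholders standing in for whatever prometheus-ocs-rules.yaml actually defines:

- alert: MDSCacheUsageHigh
  annotations:
    description: >-
      MDS cache usage for the daemon {{ $labels.name }} has exceeded above 95%
      of the requested value. Increase the memory request for
      {{ $labels.name }} pod.
    message: High MDS cache usage for the daemon {{ $labels.name }}.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md
  expr: <expression from prometheus-ocs-rules.yaml>
  for: 5m
  labels:
    severity: critical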

Comment 17 errata-xmlrpc 2024-03-19 15:30:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

