Bug 2261881 - Documentation need to be corrected for MDSCacheUsageHigh alert. [NEEDINFO]
Summary: Documentation need to be corrected for MDSCacheUsageHigh alert.
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Santosh Pillai
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: 2246375
Reported: 2024-01-30 08:41 UTC by Nagendra Reddy
Modified: 2025-04-15 08:28 UTC (History)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: Ceph returns the `ceph_mds_mem_rss` metric in kilobytes (KB).
Consequence: When the user searches for the metric in the OCS UI, the graph shows the y-axis in MB. This can cause confusion when the user compares the results with the `MDSCacheUsageHigh` alert.
Workaround (if any): Use `ceph_mds_mem_rss * 1000` when searching for this metric in the OpenShift UI to see the graph y-axis in GB.
Result: With `ceph_mds_mem_rss * 1000`, the graph is shown in GB, and the user can easily compare it with the results shown in the `MDSCacheUsageHigh` alert.
Clone Of:
Environment:
Last Closed:
Embargoed:
sapillai: needinfo? (nagreddy)




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift runbooks pull 169 | 0 | None | open | Remove info about default MDS cache size | 2024-03-04 10:35:14 UTC

Description Nagendra Reddy 2024-01-30 08:41:58 UTC
Created attachment 2013978
s1

Description of problem (please be as detailed as possible and provide log
snippets):

The document provided as part of BZ-2256725 needs corrections. The main use case of this document is adding memory to the MDS pod whenever the MDSCacheUsageHigh alert is seen.

Link to the doc:
https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1
Is this issue reproducible?
1

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Maintain the MDS load so that cache usage reaches 95% of the cache limit.

2. The MDSCacheUsageHigh alert will be triggered in the dashboard.

3. Go to the alert and click on the document linked to the alert.

4. The document needs to be clearer in the "Impact" and "Mitigation" sections.


Actual results:
The document has steps to apply the default memory to the MDS pod.

Expected results:
The document should have steps to increase the MDS pod memory from the default to the recommended value, based on the alert.

Please refer to the attachment for more information.

Additional info:

Comment 2 Nagendra Reddy 2024-02-29 02:20:39 UTC
Below changes are required:

1. ceph_mds_mem_rss gives the wrong output. When the actual cache usage is 3GB, it is shown as 3MB. Please fix either the query or the documentation. Based on our previous discussions, we used "ceph_mds_mem_rss*1000" for testing.


2. Default is 4GB, but the recommended minimum is 8GB.

--> The default was 4GB in 4.14, but after upgrading to 4.15 we observed that the default was reduced to 3GB. This needs to be corrected in the documentation.

3. The patch command needs to be corrected.

--> When a minimum cache limit of 8GB is recommended, the MDS memory should be increased to 16GB; only then will the user get the 8GB cache limit that is recommended when the alert is firing. Please give the patch command with the recommended values.

Given patch:

oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"memory": "8Gi"},"requests": {"memory": "8Gi"}}}}}' 

Expected patch, to get the recommended cache limit (8GB):

oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"memory": "16Gi"},"requests": {"memory": "16Gi"}}}}}'
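
The sizing behind the expected patch can be sketched in Python. This is only an illustration: it assumes, based solely on the 16GB-memory-to-8GB-cache example in this comment, that the MDS cache limit is half of the MDS pod memory. `mds_memory_patch` is a hypothetical helper, and the ratio is kept as a parameter because it is inferred here rather than confirmed.

```python
import json
import math


def mds_memory_patch(cache_limit_gib: int, cache_fraction: float = 0.5) -> str:
    """Build merge-patch JSON for a desired MDS cache limit.

    cache_fraction is the ASSUMED share of pod memory that becomes the
    cache limit (0.5 is inferred from the 16Gi -> 8GB example above).
    """
    if cache_limit_gib <= 0 or not 0 < cache_fraction <= 1:
        raise ValueError("cache limit must be > 0 and fraction in (0, 1]")
    # Round up so the resulting cache limit is never below the target.
    memory = f"{math.ceil(cache_limit_gib / cache_fraction)}Gi"
    resources = {"limits": {"memory": memory}, "requests": {"memory": memory}}
    return json.dumps({"spec": {"resources": {"mds": resources}}})


# Patch body for the 8GB cache limit requested in this comment.
patch = mds_memory_patch(8)
```

Under that assumption, the returned string has the same structure as the value passed to `--patch` in the command above, with both the limit and the request set to 16Gi.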

Comment 3 Santosh Pillai 2024-03-04 10:35:14 UTC
Opened a PR for point number 2 - https://github.com/openshift/runbooks/pull/169

We don't need to change anything for point number 3.

We decided to discuss the changes for point number 1 in 4.16.

Comment 4 Nagendra Reddy 2024-03-05 13:11:06 UTC
(In reply to Santosh Pillai from comment #3)
> Opened a PR for point number 2 -
> https://github.com/openshift/runbooks/pull/169
> 
> We don't need to change anything for point number 3.
> 
> We decided to discuss the changes for point number 1 in 4.16.

We discussed giving instructions/notes on using the metric correctly, such as "ceph_mds_mem_rss*1000", to pull the accurate MDS memory usage. It can be fixed in 4.16; until then we should provide instructions to use the metric with a multiplier of 1000 to convert the data from MB to GB.

Please make changes accordingly.
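
The effect of the multiplier can be illustrated with a small Python sketch. It assumes, as the Doc Text states, that `ceph_mds_mem_rss` is reported in kilobytes, and it further assumes that the graph labels the raw number as bytes with decimal prefixes. `format_as_bytes` is a hypothetical stand-in for that labelling, not the console's actual code, and the sample value is made up.

```python
def format_as_bytes(value: float) -> str:
    """Label a raw number as bytes using decimal (1000-based) prefixes.

    Hypothetical stand-in for how a graph axis might label an unscaled value.
    """
    units = ["B", "KB", "MB", "GB", "TB"]
    index = 0
    while value >= 1000 and index < len(units) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {units[index]}"


# Assumed sample: an MDS using about 3GB, reported by the metric in kilobytes.
rss_kb = 3_000_000

raw_label = format_as_bytes(rss_kb)            # metric plotted as-is
scaled_label = format_as_bytes(rss_kb * 1000)  # ceph_mds_mem_rss * 1000
```

With these assumptions, `raw_label` is "3 MB" and `scaled_label` is "3 GB", which matches the 3GB-shown-as-3MB behaviour reported in comment 2.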

Comment 5 Santosh Pillai 2024-03-06 04:09:20 UTC
(In reply to Nagendra Reddy from comment #4)
> (In reply to Santosh Pillai from comment #3)
> > Opened a PR for point number 2 -
> > https://github.com/openshift/runbooks/pull/169
> > 
> > We don't need to change anything for point number 3.
> > 
> > We decided to discuss the changes for point number 1 in 4.16.
> 
> we discussed to give instructions/notes to use metric in a correct way like
> "ceph_mds_mem_rss*1000" to pull the accurate mds memory usage. It can be
> fixed in 4.16, till then we should provide instructions to use the metric
> with multiplier 1000 to convert the data MB to GB.

This will add more confusion for the customer. The customer can anyway see the correct units in the graph in the alert itself, correct?
> 
> Please make changes accordingly.

Comment 6 Harish NV Rao 2024-03-06 05:52:58 UTC
(In reply to Santosh Pillai from comment #5)
> (In reply to Nagendra Reddy from comment #4)
> > (In reply to Santosh Pillai from comment #3)
> > > Opened a PR for point number 2 -
> > > https://github.com/openshift/runbooks/pull/169
> > > 
> > > We don't need to change anything for point number 3.
> > > 
> > > We decided to discuss the changes for point number 1 in 4.16.
> > 
> > we discussed to give instructions/notes to use metric in a correct way like
> > "ceph_mds_mem_rss*1000" to pull the accurate mds memory usage. It can be
> > fixed in 4.16, till then we should provide instructions to use the metric
> > with multiplier 1000 to convert the data MB to GB.
> 
> This will add more confusion to the customer. The customer can anyway see
> the correct units in the graph in the alert itself, correct? 
> > 
> > Please make changes accordingly.

Let's document this as a known issue in 4.15 and work toward fixing it in 4.16.
Santosh, could you please provide the doc text for the known issue?

Comment 7 Santosh Pillai 2024-04-25 13:42:40 UTC
Since the documentation was fixed in 4.15 to show how to use the query (ceph_mds_mem_rss * 1000), and changing the `ceph_mds_mem_rss` unit might require changes in Ceph, I'll move this to 4.17 for now.

