Bug 2261881

Summary:	Documentation needs to be corrected for the MDSCacheUsageHigh alert.
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	Nagendra Reddy <nagreddy>
Component:	ocs-operator	Assignee:	Santosh Pillai <sapillai>
Status:	ASSIGNED ---	QA Contact:	Elad <ebenahar>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.15	CC:	hnallurv, muagarwa, nigoyal, odf-bz-bot, sapillai
Target Milestone:	---	Flags:	sapillai: needinfo? (nagreddy)
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Known Issue
Doc Text:	Cause: Ceph returns the `ceph_mds_mem_rss` metric in kilobytes (KB). Consequence: When the user searches for the metric in the OCS UI, the graph shows the y axis in MB. This can cause confusion when the user compares the results with the `MDSCacheUsageHigh` alert. Workaround (if any): Use `ceph_mds_mem_rss * 1000` when searching for this metric in the OpenShift UI to see the graph y axis in GB. Result: With `ceph_mds_mem_rss * 1000`, the graph is shown in GB, and the user can easily compare it with the results shown in the `MDSCacheUsageHigh` alert.	Story Points:	---
Clone Of:		Environment:
Last Closed:		Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2246375

Description Nagendra Reddy 2024-01-30 08:41:58 UTC

Created attachment 2013978 [details]
s1

Description of problem (please be as detailed as possible and provide log
snippets):

The document provided as part of BZ-2256725 needs corrections. The main use case of this document is adding memory to the MDS pod whenever the MDSCacheUsageHigh alert is seen.

Link to the doc:
https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?
1

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Keep the MDS under load until its cache usage reaches 95% of the cache limit.

2. The MDSCacheUsageHigh alert is triggered in the dashboard.

3. Go to the alert and click on the document linked to the alert.

4. The document needs to be clearer in the "Impact" and "Mitigation" sections.
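
The trigger condition in steps 1 and 2 amounts to a ratio check. The following is a minimal sketch, assuming both values are already expressed in the same unit. The function name and the sample values are hypothetical, and the real alert expression is not reproduced here.

```python
def is_cache_usage_high(used: float, cache_limit: float, threshold: float = 0.95) -> bool:
    """Return True when usage has reached the given fraction of the cache limit."""
    if cache_limit <= 0:
        raise ValueError("cache_limit must be positive")
    return used / cache_limit >= threshold


# Hypothetical sample values, both in GB.
print(is_cache_usage_high(3.9, 4.0))  # True
print(is_cache_usage_high(3.0, 4.0))  # False
```

Comparing values in mismatched units (for example, a usage figure read as MB against a limit in GB) would make this check meaningless, which is why the unit shown in the graph matters.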


Actual results:
The document has steps to apply the default memory to the MDS pod.

Expected results:
The document should have steps to increase the MDS pod memory from the default to the recommended value based on the alert.

Please refer to the attachment for more information.

Additional info:

Comment 2 Nagendra Reddy 2024-02-29 02:20:39 UTC

The following changes are required:

1. ceph_mds_mem_rss gives the wrong output. When the actual cache usage is 3GB, it is shown as 3MB. Please fix either the query or the documentation. Based on our previous discussions, we used "ceph_mds_mem_rss*1000" for testing.


2. Default is 4GB, but the recommended minimum is 8GB.

--> The default was 4GB in 4.14, but after upgrading to 4.15 we observed that the default was reduced to 3GB. This needs to be corrected in the documentation.

3. The patch command needs to be corrected

--> Since you recommend a minimum cache limit of 8GB, the MDS memory should be increased to 16GB; only then will the user get the 8GB cache limit that is recommended when the alert is firing. Please give the patch command with the recommended values.

Given patch:

oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"memory": "8Gi"},"requests": {"memory": "8Gi"}}}}}' 

Expected patch to get the recommended cache limit [8GB]:

oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"memory": "16Gi"},"requests": {"memory": "16Gi"}}}}}'
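
The reasoning behind the two patch commands can be sketched as a small calculation. It assumes, based only on the statement above that 16GB of MDS memory yields an 8GB cache limit, that the cache limit is half of the pod memory. The 0.5 ratio, the function names, and the generated patch body are illustrative and are not taken from the runbook.

```python
import json

# Assumption inferred from this comment: cache limit = 50% of MDS pod memory.
CACHE_TO_MEMORY_RATIO = 0.5


def required_pod_memory_gi(cache_limit_gi: int) -> int:
    """Pod memory (Gi) needed to obtain the requested cache limit (Gi)."""
    return int(cache_limit_gi / CACHE_TO_MEMORY_RATIO)


def build_mds_patch(memory_gi: int) -> str:
    """Build the JSON body for `oc patch ... --type merge --patch '<body>'`."""
    memory = f"{memory_gi}Gi"
    patch = {
        "spec": {
            "resources": {
                "mds": {
                    "limits": {"memory": memory},
                    "requests": {"memory": memory},
                }
            }
        }
    }
    return json.dumps(patch)


print(required_pod_memory_gi(8))  # 16
print(build_mds_patch(required_pod_memory_gi(8)))
```

Under the same assumption, the 8Gi patch would yield a 4GB cache limit, which is the gap this comment points out.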

Comment 3 Santosh Pillai 2024-03-04 10:35:14 UTC

Opened a PR for point number 2 - https://github.com/openshift/runbooks/pull/169

We don't need to change anything for point number 3.

We decided to discuss the changes for point number 1 in 4.16.

Comment 4 Nagendra Reddy 2024-03-05 13:11:06 UTC

(In reply to Santosh Pillai from comment #3)
> Opened a PR for point number 2 -
> https://github.com/openshift/runbooks/pull/169
> 
> We don't need to change anything for point number 3.
> 
> We decided to discuss the changes for point number 1 in 4.16.

We discussed giving instructions/notes on using the metric correctly, for example "ceph_mds_mem_rss*1000", to pull the accurate MDS memory usage. It can be fixed in 4.16; until then, we should provide instructions to use the metric with a multiplier of 1000 to convert the data from MB to GB.

Please make changes accordingly.
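
The effect of the multiplier can be checked with a short calculation. The sketch below is illustrative only: it assumes, as the Doc Text states, that `ceph_mds_mem_rss` is reported in kilobytes, and it further assumes that the console formats the raw number as if it were bytes, using decimal units (which would explain 3GB appearing as 3MB). The function names are hypothetical.

```python
# Illustrative only: assumes ceph_mds_mem_rss is in kilobytes (per the Doc Text)
# and that the console formats the raw sample as bytes with decimal units.

def format_as_bytes(value: float) -> str:
    """Format a number the way a bytes-based graph axis might (decimal units)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    index = 0
    while value >= 1000 and index < len(units) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {units[index]}"


def apply_workaround(raw_sample: float) -> float:
    """Equivalent of the query `ceph_mds_mem_rss * 1000`."""
    return raw_sample * 1000


raw = 3_000_000  # a 3 GB resident set size reported as kilobytes
print(format_as_bytes(raw))                    # 3 MB (without the workaround)
print(format_as_bytes(apply_workaround(raw)))  # 3 GB (with the workaround)
```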

Comment 5 Santosh Pillai 2024-03-06 04:09:20 UTC

(In reply to Nagendra Reddy from comment #4)
> (In reply to Santosh Pillai from comment #3)
> > Opened a PR for point number 2 -
> > https://github.com/openshift/runbooks/pull/169
> > 
> > We don't need to change anything for point number 3.
> > 
> > We decided to discuss the changes for point number 1 in 4.16.
> 
> we discussed to give instructions/notes to use metric in a correct way like
> "ceph_mds_mem_rss*1000" to pull the accurate mds memory usage. It can be
> fixed in 4.16, till then we should provide instructions to use the metric
> with multiplier 1000 to convert the data MB to GB.

This will add more confusion for the customer. The customer can see the correct units in the graph in the alert itself anyway, correct?
> 
> Please make changes accordingly.

Comment 6 Harish NV Rao 2024-03-06 05:52:58 UTC

(In reply to Santosh Pillai from comment #5)
> (In reply to Nagendra Reddy from comment #4)
> > (In reply to Santosh Pillai from comment #3)
> > > Opened a PR for point number 2 -
> > > https://github.com/openshift/runbooks/pull/169
> > > 
> > > We don't need to change anything for point number 3.
> > > 
> > > We decided to discuss the changes for point number 1 in 4.16.
> > 
> > we discussed to give instructions/notes to use metric in a correct way like
> > "ceph_mds_mem_rss*1000" to pull the accurate mds memory usage. It can be
> > fixed in 4.16, till then we should provide instructions to use the metric
> > with multiplier 1000 to convert the data MB to GB.
> 
> This will add more confusion to the customer. The customer can anyway see
> the correct units in the graph in the alert itself, correct? 
> > 
> > Please make changes accordingly.

Let's document this as a known issue in 4.15 and work toward fixing it in 4.16.
Santosh, could you please provide the doc text for known issue?

Comment 7 Santosh Pillai 2024-04-25 13:42:40 UTC

Since the documentation was fixed in 4.15 to explain how to use the query (ceph_mds_mem_rss * 1000), and changing the `ceph_mds_mem_rss` unit might require changes in Ceph, I'll move this to 4.17 for now.