Created attachment 2013978 [details]
s1

Description of problem (please be detailed as possible and provide log snippets):
The document provided as part of BZ-2256725 needs corrections. Its main use case is adding memory to the MDS pod whenever the MDSCacheUsageHigh alert is seen.
Link to the doc: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue reproducible?
1

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Keep the MDS under load until cache usage reaches 95% of the cache limit.
2. The MDSCacheUsageHigh alert is triggered in the dashboard.
3. Open the alert and click the document linked to it.
4. The document needs to be clearer in the "Impact" and "Mitigation" sections.

Actual results:
The document has steps to apply the default memory to the MDS pod.

Expected results:
The document should have steps to increase MDS pod memory from the default to the recommended value, based on the alert. Please refer to the attachment for more information.

Additional info:
The following changes are required:

1. ceph_mds_mem_rss gives the wrong output. When actual cache usage is 3GB, it is shown as 3MB. Please fix either the query or the documentation. Based on our previous discussions, we used "ceph_mds_mem_rss*1000" for testing.

2. Default is 4GB, but the recommended minimum is 8GB.
--> The default was 4GB in 4.14, but after upgrading to 4.15 we observed that the default was reduced to 3GB. This needs to be corrected in the documentation.

3. The patch command needs to be corrected.
--> Since a minimum cache limit of 8GB is recommended, MDS memory should be increased to 16GB; only then will the user get the 8GB cache limit that is recommended when the alert is firing. Please give the patch command with the recommended values.

Given patch:

oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"memory": "8Gi"},"requests": {"memory": "8Gi"}}}}}'

Expected patch, to get the recommended cache limit [8GB]:

oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"memory": "16Gi"},"requests": {"memory": "16Gi"}}}}}'
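
For reference, the arithmetic behind point 3 can be sketched in shell. This is only an illustration: it assumes the MDS cache limit is half of the pod memory, which is what the 16Gi memory to 8GB cache limit figures above imply. The helper names mds_memory_for_cache_gib and print_mds_patch are hypothetical and are not part of the runbook. The script prints the patch command and does not run it.

```shell
#!/bin/sh
# Hypothetical helper: pod memory (GiB) needed for a target MDS cache limit (GiB).
# Assumption: cache limit = 50% of pod memory, inferred from the 16Gi -> 8GB
# figures in this comment.
mds_memory_for_cache_gib() {
    echo $(( $1 * 2 ))
}

# Hypothetical helper: print (do not execute) the storagecluster patch command
# for the given target cache limit in GiB.
print_mds_patch() {
    mem="$(mds_memory_for_cache_gib "$1")Gi"
    printf '%s\n' "oc patch -n openshift-storage storagecluster ocs-storagecluster --type merge --patch '{\"spec\": {\"resources\": {\"mds\": {\"limits\": {\"memory\": \"${mem}\"},\"requests\": {\"memory\": \"${mem}\"}}}}}'"
}

# Target the recommended minimum cache limit of 8GB.
print_mds_patch 8
```

With a target of 8, the printed command carries "16Gi" for both limits and requests, matching the expected patch above.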
Opened a PR for point number 2 - https://github.com/openshift/runbooks/pull/169

We don't need to change anything for point number 3.

We decided to discuss the changes for point number 1 in 4.16.
(In reply to Santosh Pillai from comment #3)
> Opened a PR for point number 2 -
> https://github.com/openshift/runbooks/pull/169
>
> We don't need to change anything for point number 3.
>
> We decided to discuss the changes for point number 1 in 4.16.

We discussed providing instructions/notes on using the metric correctly, for example "ceph_mds_mem_rss*1000", to get the accurate MDS memory usage. The metric itself can be fixed in 4.16; until then we should provide instructions to use it with a multiplier of 1000 so that a value shown in MB reads correctly in GB.

Please make changes accordingly.
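
A minimal sketch of the multiplier workaround, for illustration only. The helper name corrected_mds_rss and the sample value are hypothetical. The only facts taken from this bug are the "* 1000" factor and the observation that 3GB of actual usage is shown as 3MB.

```shell
#!/bin/sh
# PromQL form of the workaround used during testing in this bug.
MDS_RSS_QUERY='ceph_mds_mem_rss * 1000'

# Hypothetical helper: apply the same x1000 scaling to one raw sample value.
corrected_mds_rss() {
    echo $(( $1 * 1000 ))
}

# Assumed example: a raw sample of 3000000 is displayed as roughly 3MB;
# after scaling it reads as roughly 3GB.
corrected_mds_rss 3000000
```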
(In reply to Nagendra Reddy from comment #4)
> (In reply to Santosh Pillai from comment #3)
> > Opened a PR for point number 2 -
> > https://github.com/openshift/runbooks/pull/169
> >
> > We don't need to change anything for point number 3.
> >
> > We decided to discuss the changes for point number 1 in 4.16.
>
> we discussed to give instructions/notes to use metric in a correct way like
> "ceph_mds_mem_rss*1000" to pull the accurate mds memory usage. It can be
> fixed in 4.16, till then we should provide instructions to use the metric
> with multiplier 1000 to convert the data MB to GB.

This will add more confusion for the customer. The customer can see the correct units in the graph in the alert itself anyway, correct?

> Please make changes accordingly.
(In reply to Santosh Pillai from comment #5)
> (In reply to Nagendra Reddy from comment #4)
> > (In reply to Santosh Pillai from comment #3)
> > > Opened a PR for point number 2 -
> > > https://github.com/openshift/runbooks/pull/169
> > >
> > > We don't need to change anything for point number 3.
> > >
> > > We decided to discuss the changes for point number 1 in 4.16.
> >
> > we discussed to give instructions/notes to use metric in a correct way like
> > "ceph_mds_mem_rss*1000" to pull the accurate mds memory usage. It can be
> > fixed in 4.16, till then we should provide instructions to use the metric
> > with multiplier 1000 to convert the data MB to GB.
>
> This will add more confusion to the customer. The customer can anyway see
> the correct units in the graph in the alert itself, correct?
>
> > Please make changes accordingly.

Let's treat this as a known issue in 4.15 and work toward fixing it in 4.16.

Santosh, could you please provide the doc text for the known issue?
Since the documentation was fixed in 4.15 to explain how to use the query (ceph_mds_mem_rss * 1000), and changing the `ceph_mds_mem_rss` unit might require changes in Ceph, I'll move this to 4.17 for now.