Description of problem (please be detailed as possible and provide log snippets):
===================================================================================
The MDSCacheUsageHigh alert shown below fires when MDS cache usage exceeds 95% of the mds_cache_memory_limit, but the alert does not provide clear instructions or steps to take in response. The alert should include a call to action, providing either steps to increase the memory request for MDS pods or a link to the documentation on what to do when the cache oversize alert is received.

Alert:
Name: MDSCacheUsageHigh
Severity: Critical
Description: MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-b has exceeded above 95% of the requested value. Increase the memory request for mds.ocs-storagecluster-cephfilesystem-b pod.
Message: High MDS cache usage for the daemon mds.ocs-storagecluster-cephfilesystem-b.

Version of all relevant components (if applicable):
odf version: 4.15.0-102

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue reproducible?
Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:
No, this is a new feature in 4.15

Steps to Reproduce:
===================
1) Create a 3m, 3w OCP cluster and install ODF on it.
2) Create multiple CephFS PVCs with RWX access mode.
3) Attach multiple pods to those PVCs and start continuous file creation + metadata operations.
4) Wait until MDS cache usage breaches 95% and the alert fires.

Actual results:
===============
The MDSCacheUsageHigh alert is lacking a call to action

Expected results:
=================
The alert should include a call to action, providing either steps to increase the memory request for MDS pods or a link to the documentation on what to do when the cache oversize alert is received.
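For step 3 of the reproduction, a workload along the following lines could be run inside each pod. This is only a minimal sketch, not the script used for this report: the function name `metadata_churn` and the file counts are hypothetical, and the demo writes to a temporary directory where a real run would point at the CephFS PVC mount path. It keeps the files in place so that the number of inodes the MDS has to track keeps growing.

```python
import os
import tempfile


def metadata_churn(root: str, files: int = 100) -> int:
    """Create, stat, and rename small files under root, then list the directory.

    Returns the number of metadata operations issued. Files are left in
    place so the number of inodes keeps growing.
    """
    ops = 0
    os.makedirs(root, exist_ok=True)
    for i in range(files):
        path = os.path.join(root, f"f{i}")
        with open(path, "w") as fh:  # create
            fh.write("x")
        ops += 1
        os.stat(path)  # getattr
        ops += 1
        os.rename(path, path + ".r")  # rename
        ops += 1
    os.listdir(root)  # readdir
    ops += 1
    return ops


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for the PVC mount path.
    with tempfile.TemporaryDirectory() as tmp:
        print(metadata_churn(tmp, files=10))
```

Running this in a loop from several pods against the same RWX PVCs approximates the "continuous file creation + metadata operations" described in the steps.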
(In reply to Prasad Desala from comment #0)
> Description of problem (please be detailed as possible and provide log
> Actual results:
> ===============
> The MDSCacheUsageHigh alert is lacking a call to action
>
> Expected results:
> =================
> The alert should include a call to action, providing either steps to
> increase the memory request for MDS pods or a link to the documentation on
> what to do when the cache oversize alert is received.

Agreed that the current message in the alert, "Increase the memory request for mds.ocs-storagecluster-cephfilesystem-b pod", might not be very useful for the customer.

I can link the existing article (https://access.redhat.com/solutions/6959127) on increasing the memory resources via the StorageCluster YAML.

Let me know if that link suffices. IMO, it should be good.
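As a rough illustration of the kind of StorageCluster change referred to above, the sketch below builds a JSON merge patch body for the MDS resources. It rests on assumptions that do not come from this thread: that the StorageCluster accepts per-component overrides under `spec.resources` with an `mds` key, and that `8Gi` is a sensible value. The helper name `mds_resources_patch` is hypothetical. The linked article remains the authoritative procedure, including whether CPU values need to be set alongside memory.

```python
import json
from typing import Optional


def mds_resources_patch(memory: str, cpu: Optional[str] = None) -> str:
    """Build a JSON merge patch body for the MDS requests and limits.

    Assumption: the StorageCluster takes overrides under spec.resources.mds.
    """
    resources = {"memory": memory}
    if cpu is not None:
        resources["cpu"] = cpu
    patch = {
        "spec": {
            "resources": {
                "mds": {
                    "requests": dict(resources),
                    "limits": dict(resources),
                }
            }
        }
    }
    return json.dumps(patch, indent=2)


if __name__ == "__main__":
    # "8Gi" is an illustrative value, not a recommendation from this report.
    print(mds_resources_patch("8Gi"))
```

The printed body could then be applied as a merge patch to the StorageCluster resource, after checking the field names against the article.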
Adding Bipin to get his thoughts.
*** Bug 2257310 has been marked as a duplicate of this bug. ***
Hi Santosh, Any plans for this BZ in 4.15? else we can move it out to 4.16.
(In reply to Malay Kumar parida from comment #8)
> Hi Santosh, Any plans for this BZ in 4.15? else we can move it out to 4.16.

It will be in 4.15.
I can see that the link below has been provided in the alert.

RunBook: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md

This looks like upstream documentation. Do we need downstream documentation as well? I feel it would be better to link it as a KCS article.

@Bipin, what are your thoughts?
Using runbooks is the recommended way to add calls to action to alerts. You can refer to the `runbook_url` annotation of the existing alerts; they all point to the same documentation.

https://github.com/red-hat-storage/ocs-operator/blob/2ef8a88377a956cb6a0b2543c72873a03eb8d3c9/metrics/deploy/prometheus-ocs-rules.yaml#L72
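To make the `runbook_url` convention concrete, here is a small sketch of a check that flags alerting rules lacking that annotation. It assumes the usual Prometheus rule layout (groups containing rules, each alert rule carrying an `annotations` map) already loaded into Python dicts. The function name and the second sample alert are hypothetical, and the sample rules are trimmed to the fields the check needs; it is not part of ocs-operator.

```python
def alerts_missing_runbook(groups):
    """Return the names of alerting rules without a runbook_url annotation."""
    missing = []
    for group in groups:
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rules have no "alert" key
            annotations = rule.get("annotations") or {}
            if not annotations.get("runbook_url"):
                missing.append(name)
    return missing


# Sample data: the first rule uses the runbook link quoted in this thread,
# the second is a made-up alert used only to exercise the check.
SAMPLE_GROUPS = [
    {
        "name": "example-group",
        "rules": [
            {
                "alert": "MDSCacheUsageHigh",
                "annotations": {
                    "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md",
                },
            },
            {"alert": "ExampleAlertWithoutRunbook", "annotations": {}},
        ],
    }
]

if __name__ == "__main__":
    print(alerts_missing_runbook(SAMPLE_GROUPS))
```

A check like this could run in CI against the rules file so that new alerts do not ship without a call to action.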
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383