Bug 2265563
Summary: | Increasing MDS memory is erasing CPU values when pods are in CLBO state. | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Nagendra Reddy <nagreddy> |
Component: | ocs-operator | Assignee: | Santosh Pillai <sapillai> |
Status: | ASSIGNED | QA Contact: | Elad <ebenahar> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.15 | CC: | hnallurv, kbg, mparida, muagarwa, nberry, nigoyal, odf-bz-bot, sapillai |
Target Milestone: | --- | Flags: | nigoyal: needinfo? (sapillai) |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Known Issue |
Doc Text: |
.Increasing MDS memory is erasing CPU values when pods are in CLBO state
When the metadata server (MDS) memory is increased while the MDS pods are in a CrashLoopBackOff (CLBO) state, the CPU request and limit for the MDS pods are removed. As a result, the CPU request or limit that was previously set for the MDS no longer applies. (A quick check for this state is sketched after the metadata table below.)
Workaround: Run the `oc patch` command to adjust the CPU limits.
For example:
----
$ oc patch -n openshift-storage storagecluster ocs-storagecluster \
--type merge \
--patch '{"spec": {"resources": {"mds": {"limits": {"cpu": "3"},
"requests": {"cpu": "3"}}}}}'
----
|
Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2246375 |
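To confirm a cluster is in the state described in the Doc Text before applying the workaround, the MDS pod status and the currently applied MDS resources can be checked. A minimal sketch, assuming the MDS pods carry the usual Rook label `app=rook-ceph-mds`:

----
# Check whether the MDS pods are in CrashLoopBackOff and whether the last
# container termination was an OOM kill.
$ oc get pods -n openshift-storage -l app=rook-ceph-mds
$ oc get pods -n openshift-storage -l app=rook-ceph-mds \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'

# Show the CPU/memory requests and limits currently set for MDS on the StorageCluster.
$ oc get storagecluster ocs-storagecluster -n openshift-storage \
    -o jsonpath='{.spec.resources.mds}{"\n"}'
----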
The system was patched per https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md, which removed the CPU limit. Needs retesting.

(In reply to Mudit Agarwal from comment #4)
> The system was patched per
> https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md,
> which removed the CPU limit. Needs retesting.

Mudit, yes. This issue is not related to the upgrade. The main problem is that the patch applied to increase the MDS pod memory erases the CPU values. Because the CPU values are no longer present in the resources, the formula used in the Prometheus rules to calculate 67% usage fails, so the alert did not trigger. This is observed only when the MDS pods are in CLBO state due to OOMKilled and the patch is applied in that state.

Patch used to increase memory:

oc patch -n openshift-storage storagecluster ocs-storagecluster \
  --type merge \
  --patch '{"spec": {"resources": {"mds": {"limits": {"memory": "8Gi"},"requests": {"memory": "8Gi"}}}}}'

Workaround: we are able to resolve this issue by simply applying the patch for the CPU values:

oc patch -n openshift-storage storagecluster ocs-storagecluster \
  --type merge \
  --patch '{"spec": {"resources": {"mds": {"limits": {"cpu": "3"}, "requests": {"cpu": "3"}}}}}'

It needs to be investigated why the memory patch command erases the CPU values, and why this happens only when the MDS pods are in CLBO state.

Not a 4.15 blocker.

Latest reproduction logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/sosreports/nagendra/2265563/repro1/

Santosh, are we fixing this BZ in 4.16?

I'll move this to 4.17 to investigate more. This is not a blocker. According to comment #5, the issue only happens when the pod is in CLBO, and there is a workaround: apply the patch with the CPU values.
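Until the root cause is understood, one way to avoid losing the CPU values may be to set memory and CPU together in a single merge patch rather than patching memory alone. This is a minimal sketch, not a verified fix, reusing the StorageCluster name, namespace, and sizes from the commands above:

----
# Patch memory and CPU for MDS in one merge patch so the CPU values are
# written explicitly instead of relying on the existing spec being preserved.
$ oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {
        "limits":   {"cpu": "3", "memory": "8Gi"},
        "requests": {"cpu": "3", "memory": "8Gi"}}}}}'
----

Re-running the jsonpath check shown earlier against `.spec.resources.mds` confirms whether both the CPU and memory values are present after the patch.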
Created attachment 2018213 [details]
no alert for mdscpuhighusage in notifications

Description of problem (please be detailed as possible and provide log snippets):
There is no alert triggered for MDSCPUHighUsage after upgrading the cluster from 4.14 to 4.15.

Version of all relevant components (if applicable):
odf: 4.15.0-147
ocp: 4.15.0-rc.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a cluster with 4.14, run file-creation IO, and upgrade to 4.15.
2. Run the file-creator IO to push MDS CPU utilization to 67% and keep the same load for at least 6 hours [the time can be tweaked in the Prometheus rules YAML to test quickly].
3. Verify whether the alert is generated when the condition is met.

Actual results:
No alert seen for MDSCPUHighUsage after upgrade.

Expected results:
The alert should be triggered when the CPU utilization condition is met after upgrading to 4.15.

Additional info:
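When re-testing this scenario, it can help to confirm that the MDS CPU alert rule survived the upgrade and that the MDS pods still carry the CPU requests and limits that its usage ratio depends on. A minimal sketch; the rule name is matched loosely because the exact alert/rule naming may differ by release, and the `app=rook-ceph-mds` label is assumed:

----
# Confirm the ODF alert rule for MDS CPU usage exists after the upgrade and review its expression.
$ oc get prometheusrule -n openshift-storage -o yaml | grep -B2 -A10 -i mdscpu

# Confirm the MDS pods still have CPU requests/limits set; if they are missing,
# the usage-vs-limit ratio in the rule cannot be evaluated and the alert never fires.
$ oc get pod -n openshift-storage -l app=rook-ceph-mds \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'
----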