Bug 2265563

Summary: Increasing MDS memory is erasing CPU values when pods are in CLBO state.
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Nagendra Reddy <nagreddy>
Component: ocs-operator
Assignee: Santosh Pillai <sapillai>
Status: ASSIGNED
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.15
CC: hnallurv, kbg, mparida, muagarwa, nberry, nigoyal, odf-bz-bot, sapillai
Target Milestone: ---
Flags: nigoyal: needinfo? (sapillai)
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Increasing MDS memory is erasing CPU values when pods are in CLBO state

When the metadata server (MDS) memory is increased while the MDS pods are in a crash loop back off (CLBO) state, the CPU request or limit for the MDS pods is removed. As a result, the CPU request or limit that was set for the MDS changes.

Workaround: Run the `oc patch` command to adjust the CPU limits. For example:

----
$ oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"cpu": "3"}, "requests": {"cpu": "3"}}}}}'
----
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Embargoed:
Bug Depends On:    
Bug Blocks: 2246375    

Description Nagendra Reddy 2024-02-22 18:16:07 UTC
Created attachment 2018213 [details]
no alert for mdscpuhighusage in notifications

Description of problem (please be as detailed as possible and provide log snippets):

There is no alert triggered for MDSCPUHighUsage after upgrading the cluster from 4.14 to 4.15.
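
For context, one way to check from the CLI whether the alert is currently firing (a sketch, not part of the original report; the alertmanager-main route is the OpenShift monitoring default, and token-based access is assumed):

# List the active alerts from the in-cluster Alertmanager and look for
# MDSCPUHighUsage by name.
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" \
    "https://${HOST}/api/v2/alerts" | grep -i mdscpuhighusage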

Version of all relevant components (if applicable):

odf:4.15.0-147
ocp: 4.15.0-rc.8 

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes
Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a cluster with 4.14, run file-creation IO, and upgrade to 4.15.
2. Run file-creator IO to drive MDS CPU utilization to 67% and keep the same load running for at least 6 hours (the duration can be tweaked in the Prometheus rules YAML to test quickly; see the sketch after these steps).
3. Verify whether the alert is generated once the condition is met.
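
For step 2, a minimal way to locate the rule and its evaluation window from the CLI (a sketch; the grep pattern is derived from the alert name used in this BZ, and the rule's exact location may differ):

# Find the PrometheusRule entry behind the MDSCPUHighUsage alert, including
# its "for:" duration, so the 6h window can be shortened for testing.
oc get prometheusrule -n openshift-storage -o yaml | grep -i -B2 -A8 mdscpu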



Actual results:

No alert is seen for MDSCPUHighUsage after the upgrade.

Expected results:
The alert should be triggered when the CPU utilization condition is met after upgrading to 4.15.

Additional info:

Comment 4 Mudit Agarwal 2024-02-28 08:04:03 UTC
The system was patched per this runbook: https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md, which removed the CPU limit.
Needs retesting.

Comment 5 Nagendra Reddy 2024-03-01 14:55:18 UTC
(In reply to Mudit Agarwal from comment #4)
> The system was patched per this runbook:
> https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCacheUsageHigh.md,
> which removed the CPU limit. Needs retesting.

Mudit,

Yes. This issue is not related to upgrade.

The main problem here is that the patch applied to increase the MDS pod memory erases the CPU values. Since the CPU values are no longer present in the resources spec, the formula used in the Prometheus rules to calculate the 67% usage threshold fails, so the alert did not trigger. This issue is observed only when the MDS pods are in CLBO state due to OOMKilled and the patch is applied in that state.
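
To illustrate the failure mode, a hedged sketch (the thanos-querier route and the kube-state-metrics metric name are assumptions, not taken from this BZ): if the MDS containers have no CPU limit, the query below returns no series, so any ratio the alert rule builds on it evaluates to no data and the alert cannot fire.

# Query the CPU limit metric for the MDS pods through the in-cluster
# Thanos querier; an empty result means the denominator of the alert
# expression is missing.
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" \
    "https://${HOST}/api/v1/query" \
    --data-urlencode 'query=kube_pod_container_resource_limits{namespace="openshift-storage",resource="cpu",pod=~"rook-ceph-mds-.*"}'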

patch used to increase memory:

oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"memory": "8Gi"},"requests": {"memory": "8Gi"}}}}}' 


WA: We were able to resolve this issue by simply applying a patch that restores the CPU values:

oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"cpu": "3"}, "requests": {"cpu": "3"}}}}}'

This issue needs to be investigated: why is the memory patch command erasing the CPU values, and why does it happen only when the MDS pods are in CLBO state?

Comment 6 Mudit Agarwal 2024-03-04 10:25:16 UTC
Not a 4.15 blocker

Comment 7 Nagendra Reddy 2024-03-04 14:09:27 UTC
Latest reproduction logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/sosreports/nagendra/2265563/repro1/

Comment 11 Mudit Agarwal 2024-05-07 05:50:28 UTC
Santosh, are we fixing this BZ in 4.16?

Comment 12 Santosh Pillai 2024-05-07 06:33:17 UTC
I'll move this to 4.17 to investigate further. This is not a blocker. According to comment #5, the issue only happens when the pods are in CLBO, and there is a workaround: apply the patch with the CPU values.