Bug 2002545

Summary: [GSS][RFE] Adjust memory limits due to mds pods getting oom killed during pod start (mds replay)
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: MAYANK PANDEY <mpandey>
Component: ocs-operator
Assignee: Jose A. Rivera <jrivera>
Status: CLOSED WONTFIX
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: high
Version: 4.6
CC: ajuarez, assingh, bniver, hnallurv, jrivera, madam, muagarwa, nravinas, ocs-bugs, odf-bz-bot, pdonnell, sostapov
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-30 10:50:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 4 Scott Ostapovicz 2021-09-09 18:39:23 UTC
Not sure there is anything to do here until it gets reproduced.

Comment 11 Scott Ostapovicz 2021-09-22 14:15:23 UTC
Travis, please see comment #7 from Patrick about making Rook transiently increase MDS memory.

Comment 12 Travis Nielsen 2021-09-22 17:33:43 UTC
The resource limits on the MDS are currently set by the ocs operator, as seen here [1]:


"mds": {
    Requests: corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("3"),
        corev1.ResourceMemory: resource.MustParse("8Gi"),
    },
    Limits: corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("3"),
        corev1.ResourceMemory: resource.MustParse("8Gi"),
    },
},

K8s does not allow the limits to be changed at pod runtime; if the requests or limits are changed, the pod will be restarted. But if the MDS really needs to burst to 12Gi sometimes, it seems like we should leave the "requests" at 8Gi and increase the "limits" to 12Gi so the MDS won't be killed prematurely. So please move this to the ocs operator component for that change if there is no other way to constrain the MDS memory.

[1] https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/defaults/resources.go#L32-L41

Comment 15 Scott Ostapovicz 2021-09-28 13:38:56 UTC
Please see comment 12

Comment 16 Jose A. Rivera 2021-10-15 14:53:32 UTC
This would have other implications for the QoS class of the MDS Pod: having different requests and limits bumps it down a level, so it has fewer guarantees of survivability when node resources become constrained.

At this point this is basically an RFE, and given where we are in the schedule I'm pushing this to ODF 4.10.
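
For illustration only, the trade-off between comment 12 and comment 16 can be sketched in a few lines of Go. This is not code from ocs-operator or Kubernetes: the Resources type and the qosClass function are hypothetical stand-ins that apply the documented Kubernetes QoS rules to a single-container pod, using plain integers instead of resource.Quantity. It assumes the MDS container is the only container that matters for the pod's QoS class, and it treats zero as "unset" without modeling the API server's defaulting of requests to limits.

```go
package main

import "fmt"

// Resources is a hypothetical, simplified stand-in for
// corev1.ResourceRequirements. CPU is in millicores and memory in MiB;
// a value of 0 means "not set".
type Resources struct {
	CPURequestMilli int64
	CPULimitMilli   int64
	MemRequestMiB   int64
	MemLimitMiB     int64
}

// qosClass applies the Kubernetes QoS rules to a single-container pod:
//   - BestEffort: no requests and no limits are set
//   - Guaranteed: CPU and memory both have limits, and requests equal limits
//   - Burstable:  everything else
func qosClass(r Resources) string {
	if r.CPURequestMilli == 0 && r.CPULimitMilli == 0 &&
		r.MemRequestMiB == 0 && r.MemLimitMiB == 0 {
		return "BestEffort"
	}
	if r.CPULimitMilli > 0 && r.MemLimitMiB > 0 &&
		r.CPURequestMilli == r.CPULimitMilli &&
		r.MemRequestMiB == r.MemLimitMiB {
		return "Guaranteed"
	}
	return "Burstable"
}

func main() {
	// Current defaults quoted in comment 12: 3 CPU and 8Gi for both
	// requests and limits.
	current := Resources{
		CPURequestMilli: 3000, CPULimitMilli: 3000,
		MemRequestMiB: 8 * 1024, MemLimitMiB: 8 * 1024,
	}
	// Proposal from comment 12: keep the 8Gi request, raise the memory
	// limit to 12Gi.
	proposed := Resources{
		CPURequestMilli: 3000, CPULimitMilli: 3000,
		MemRequestMiB: 8 * 1024, MemLimitMiB: 12 * 1024,
	}

	fmt.Println("current: ", qosClass(current))
	fmt.Println("proposed:", qosClass(proposed))
}
```

Under these assumptions the current settings classify as Guaranteed and the proposed settings as Burstable, which is the "bumps it down a level" effect described in comment 16.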