Bug 2002545 - [GSS][RFE] Adjust memory limits due to mds pods getting oom killed during pod start (mds replay)
Summary: [GSS][RFE] Adjust memory limits due to mds pods getting oom killed during pod...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jose A. Rivera
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-09-09 07:28 UTC by MAYANK PANDEY
Modified: 2023-08-09 17:00 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-30 10:50:31 UTC
Embargoed:



Comment 4 Scott Ostapovicz 2021-09-09 18:39:23 UTC
Not sure there is anything to do here until it gets reproduced.

Comment 11 Scott Ostapovicz 2021-09-22 14:15:23 UTC
Travis, please see comment #7 from Patrick about making Rook transiently increase MDS memory.

Comment 12 Travis Nielsen 2021-09-22 17:33:43 UTC
The resource limits on the MDS are currently set by the OCS operator, as seen here [1]:


"mds": {
    Requests: corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("3"),
        corev1.ResourceMemory: resource.MustParse("8Gi"),
    },
    Limits: corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("3"),
        corev1.ResourceMemory: resource.MustParse("8Gi"),
    },
},

K8s does not allow the limits to be changed while a pod is running. If the requests or limits are changed, the pod will be restarted. But if the MDS really needs to burst to 12Gi sometimes, it seems we should leave the "requests" at 8Gi and increase the "limits" to 12Gi so that the MDS won't be killed prematurely. So please move this to the ocs-operator component for that change, unless there is another way to constrain the MDS memory.

[1] https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/defaults/resources.go#L32-L41

Comment 15 Scott Ostapovicz 2021-09-28 13:38:56 UTC
Please see comment 12

Comment 16 Jose A. Rivera 2021-10-15 14:53:32 UTC
This would have implications for the QoS class of the MDS pod. With requests that differ from limits, the pod drops from Guaranteed to Burstable, which gives it fewer guarantees of survivability when node resources become constrained.
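
To make the trade-off concrete, here is a minimal sketch of how Kubernetes derives the QoS class for a single-container pod. It is not ocs-operator or Rook code: the `Resources` type and the `qosClass` helper are hypothetical, stdlib-only stand-ins for `corev1.ResourceList` and the kubelet's own logic, and the model ignores multi-container pods and resources other than CPU and memory.

```go
package main

import "fmt"

// Resources is a hypothetical, simplified stand-in for a container's
// requests or limits. A zero value means "not set".
type Resources struct {
	CPUMillis   int64 // CPU in millicores (3 CPUs = 3000)
	MemoryBytes int64
}

const gi = int64(1) << 30 // one GiB in bytes

// qosClass applies the Kubernetes QoS rules to a single-container pod:
//   - BestEffort: no requests and no limits are set.
//   - Guaranteed: CPU and memory limits are both set and equal the requests.
//   - Burstable:  everything else.
func qosClass(requests, limits Resources) string {
	unset := Resources{}
	if requests == unset && limits == unset {
		return "BestEffort"
	}
	// Kubernetes defaults an unset request to the corresponding limit.
	if requests.CPUMillis == 0 {
		requests.CPUMillis = limits.CPUMillis
	}
	if requests.MemoryBytes == 0 {
		requests.MemoryBytes = limits.MemoryBytes
	}
	if limits.CPUMillis > 0 && limits.MemoryBytes > 0 && requests == limits {
		return "Guaranteed"
	}
	return "Burstable"
}

func main() {
	// Current defaults from comment 12: requests == limits (3 CPU, 8Gi).
	current := qosClass(
		Resources{CPUMillis: 3000, MemoryBytes: 8 * gi},
		Resources{CPUMillis: 3000, MemoryBytes: 8 * gi},
	)
	// Proposed change from comment 12: keep requests at 8Gi, raise the limit to 12Gi.
	proposed := qosClass(
		Resources{CPUMillis: 3000, MemoryBytes: 8 * gi},
		Resources{CPUMillis: 3000, MemoryBytes: 12 * gi},
	)
	fmt.Println("current defaults:", current)
	fmt.Println("proposed change: ", proposed)
}
```

Under these assumptions the current defaults classify as Guaranteed and the 8Gi request with a 12Gi limit classifies as Burstable, which is the downgrade described above.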

At this point this is basically an RFE, and given where we are in the schedule I'm pushing this to ODF 4.10.

