2002545 – [GSS][RFE] Adjust memory limits due to mds pods getting oom killed during pod start (mds replay)

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ocs-operator
Sub Component:
Version:	4.6
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Jose A. Rivera
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-09-09 07:28 UTC by MAYANK PANDEY
Modified:	2023-08-09 17:00 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-05-30 10:50:31 UTC
Embargoed:

Comment 4 Scott Ostapovicz 2021-09-09 18:39:23 UTC

Not sure there is anything to do here until it gets reproduced.

Comment 11 Scott Ostapovicz 2021-09-22 14:15:23 UTC

Travis, please see comment #7 from Patrick about having Rook transiently increase the MDS memory.

Comment 12 Travis Nielsen 2021-09-22 17:33:43 UTC

The resource limits on the MDS are currently set by the ocs-operator, as seen here [1]:


"mds": {
    Requests: corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("3"),
        corev1.ResourceMemory: resource.MustParse("8Gi"),
    },
    Limits: corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("3"),
        corev1.ResourceMemory: resource.MustParse("8Gi"),
    },
},

K8s does not allow limits to be changed while a pod is running: if the requests or limits are changed, the pod is restarted. But if the MDS really does need to burst to 12Gi at times, it seems we should leave the "requests" at 8Gi and raise the "limits" to 12Gi so the MDS isn't killed prematurely. Please move this to the ocs-operator component for that change, unless there is another way to constrain the MDS memory.

[1] https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/defaults/resources.go#L32-L41

Comment 15 Scott Ostapovicz 2021-09-28 13:38:56 UTC

Please see comment 12

Comment 16 Jose A. Rivera 2021-10-15 14:53:32 UTC

This would have other implications for the QoS class of the MDS pod: having different requests and limits bumps it down a level (from Guaranteed to Burstable), so it has fewer guarantees of survivability when node resources become constrained.

At this point this is basically an RFE, and given where we are in the schedule I'm pushing this to ODF 4.10.
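
To illustrate the QoS trade-off discussed in comments 12 and 16, here is a minimal sketch of how Kubernetes classifies a single-container pod. The `Resources` type and `QoSClass` function are hypothetical simplifications written for this example (whole CPU cores, memory in GiB, zero meaning "unset"); they are not part of ocs-operator, Rook, or the Kubernetes API. The values come from the snippet in comment 12 and from the 8Gi request / 12Gi limit suggested there.

```go
package main

import "fmt"

// Resources is a hypothetical, simplified stand-in for a container's
// resource settings. CPU is in whole cores, memory in GiB; 0 means "unset".
type Resources struct {
	CPURequest, CPULimit       int
	MemRequestGiB, MemLimitGiB int
}

// QoSClass applies the Kubernetes QoS rules to a single-container pod:
//   - BestEffort: no requests and no limits at all
//   - Guaranteed: CPU and memory limits are both set and equal their requests
//   - Burstable:  everything else
func QoSClass(r Resources) string {
	if r.CPURequest == 0 && r.CPULimit == 0 && r.MemRequestGiB == 0 && r.MemLimitGiB == 0 {
		return "BestEffort"
	}
	// Kubernetes defaults an unset request to the limit when a limit is given.
	if r.CPURequest == 0 {
		r.CPURequest = r.CPULimit
	}
	if r.MemRequestGiB == 0 {
		r.MemRequestGiB = r.MemLimitGiB
	}
	if r.CPULimit > 0 && r.MemLimitGiB > 0 &&
		r.CPURequest == r.CPULimit && r.MemRequestGiB == r.MemLimitGiB {
		return "Guaranteed"
	}
	return "Burstable"
}

func main() {
	// Current defaults from comment 12: requests == limits.
	current := Resources{CPURequest: 3, CPULimit: 3, MemRequestGiB: 8, MemLimitGiB: 8}
	// Suggested change from comment 12: keep the 8Gi request, raise the limit to 12Gi.
	proposed := Resources{CPURequest: 3, CPULimit: 3, MemRequestGiB: 8, MemLimitGiB: 12}

	fmt.Println("current: ", QoSClass(current))
	fmt.Println("proposed:", QoSClass(proposed))
}
```

With these inputs the current settings classify as Guaranteed and the suggested ones as Burstable, which is the downgrade comment 16 refers to: a Burstable pod can use memory up to its limit, but under node memory pressure it is a candidate for eviction ahead of Guaranteed pods.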
