Description of problem:
Refer KCS: https://access.redhat.com/solutions/5920011
Refer BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1944148

We need the ability to automatically scale the memory of the MDS pod (currently set to 4GB) when it encounters "MDS cache too large" scenarios. Raising the pod memory in turn raises mds_cache_memory_limit, which is set to 50% of that memory. Being able to adjust this dynamically is vital for managed ODF and for ODF deployments in general.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
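For illustration only (not the defaults of any particular release), a minimal sketch of the relationship described above, using the standard Kubernetes container resources stanza: the pod memory limit is the value that would need to scale, and the Ceph-side cache limit follows at 50% of it.

# Example values only: MDS pod memory resources and the cache limit derived from them.
resources:
  requests:
    memory: 8Gi
  limits:
    memory: 8Gi
# With an 8Gi pod memory limit, mds_cache_memory_limit would land at 50% of it,
# i.e. 4Gi (4294967296 bytes).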
As discussed with the Rook team, this is not something the Rook operator will do (dynamically change the pod memory resources). Today the memory resources are built and passed by ocs-operator. This component would likely be responsible for editing those resources to increase/decrease the memory available in the MDS pod. We probably need to react to an alert coming from Prometheus, which will result in adapting the memory resources of that pod. Moving to ocs-operator.
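As a rough sketch of what such an alert might look like, the PrometheusRule below is hypothetical: the alert name, threshold, and duration are invented for illustration, and it assumes the Ceph mgr Prometheus module exposes the MDS_CACHE_OVERSIZED health check through a ceph_health_detail metric.

# Hypothetical alert rule; the name, expression, and duration are illustrative only.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mds-cache-oversized-example
  namespace: openshift-storage
spec:
  groups:
    - name: mds-memory-example.rules
      rules:
        - alert: MDSCacheOversizedExample
          expr: ceph_health_detail{name="MDS_CACHE_OVERSIZED"} == 1
          for: 15m
          labels:
            severity: warning
          annotations:
            description: >-
              The MDS cache has exceeded its configured limit; ocs-operator could
              react to this alert by raising the MDS pod memory resources.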
Adjusting the resource limits dynamically may not be desirable, since it requires an update to the mds pod spec, which restarts the mds. Instead of dynamically updating the limits, we should consider:
1. Overriding the limits in the StorageCluster CR with the "mds" key when the workload requires it (see the example below)
2. Setting higher limits instead of using the same values for limits and requests

See the resource requests/limits currently set to 8Gi here:
https://github.com/red-hat-storage/ocs-operator/blob/e871f8953e3a32bc82b27a174ae6fe7f85a22d3e/controllers/defaults/resources.go#L32-L41

For now, the first option at least helps those clusters with higher mds loads.
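For the first option, here is a minimal sketch of the override described above, assuming the StorageCluster CR accepts per-component resource entries under spec.resources keyed by "mds"; the values are illustrative, not recommendations.

# Illustrative StorageCluster snippet overriding the MDS resources via the "mds" key.
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  resources:
    mds:
      requests:
        cpu: "3"
        memory: 8Gi
      limits:
        cpu: "3"
        memory: 16Gi   # higher limit than request, in line with the second option above

Note that such an override still flows through to the mds pod spec, so the mds pods will restart when it takes effect, as noted above.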
Since this is not a blocker, moving this out to ODF 4.12. We'll definitely need some prioritization on this, as we don't have the bandwidth to take in something like this without adjusting the schedule to accommodate it.