Bug 2028854

Summary:	SystemMemoryExceedsReservation Alert
Product:	OpenShift Container Platform	Reporter:	Shubham Jadhav <shujadha>
Component:	Node	Assignee:	Swarup Ghosh <swghosh>
Node sub component:	Kubelet	QA Contact:	Sunil Choudhary <schoudha>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	aos-bugs, gagore, harpatil, hasingh, hgomes, kir, mmarkand, nagrawal, openshift-bugs-escalate, openshift-bugzilla-robot, swghosh
Version:	4.8	Flags:	swghosh: needinfo-
Target Milestone:	---
Target Release:	4.8.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The Prometheus query (PromQL) behind the system memory reservation alert counted hugepages memory consumption, which should not have been included. Consequence: The alert fired unnecessarily on OCP 4.8 clusters. Fix: The fix, which already exists in later versions of OCP, was backported to 4.8. It removes Linux hugepages from the system memory calculation. Result: The unnecessary alerts should no longer fire.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-03-16 11:30:09 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1979297
Bug Blocks:
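
The Doc Text above describes the fix as removing Linux hugepages from the system memory calculation. The following is only a minimal sketch of that idea in Python, not the PromQL that ships with OCP. The function name, parameters, and all numbers are hypothetical and chosen for illustration, and the sketch assumes the alert simply compares a system memory figure against the configured reservation.

```python
GIB = 1024 ** 3


def exceeds_reservation(system_bytes, hugepages_bytes, reserved_bytes,
                        exclude_hugepages=True):
    """Return True if the (hypothetical) alert condition would hold.

    system_bytes:     system memory figure that still includes hugepages
    hugepages_bytes:  memory set aside for Linux hugepages
    reserved_bytes:   configured system reservation
    """
    usage = system_bytes
    if exclude_hugepages:
        # Behaviour after the fix, as described in the Doc Text:
        # hugepages are left out of the system memory calculation.
        usage = system_bytes - hugepages_bytes
    return usage > reserved_bytes


# Hypothetical node: 6 GiB counted for the system, of which 4 GiB is
# hugepages, with a 5 GiB reservation.
before_fix = exceeds_reservation(6 * GIB, 4 * GIB, 5 * GIB, exclude_hugepages=False)
after_fix = exceeds_reservation(6 * GIB, 4 * GIB, 5 * GIB, exclude_hugepages=True)
```

With these made-up values the condition holds only while hugepages are counted, which is the kind of unnecessary alert the Doc Text refers to.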

Description Shubham Jadhav 2021-12-03 15:06:22 UTC

Description of problem:

The customer is seeing the SystemMemoryExceedsReservation alert on worker nodes.



How reproducible: Every time


Actual results:

The kubelet consumes a large amount of memory, around 40GB.

Expected results:

The SystemMemoryExceedsReservation alert should clear after the System Reserved Memory is increased to 9GB.


Additional info:

We increased the System Reserved Memory to 9GB, following the KCS article [0] and the documentation [1].

Even after this increase, the kubelet on all the worker nodes still consumes a large amount of memory, with resident set sizes (RES) between 37.2g and 43.9g in the top output below.

~~~
[core@ocpnonprod-xxxx-worker-xxxx ~]$ top
top - 09:48:50 up 21:45,  1 user,  load average: 7.69, 7.79, 8.49
Tasks: 379 total,   1 running, 377 sleeping,   0 stopped,   1 zombie
%Cpu(s): 74.7 us, 16.5 sy,  0.2 ni,  7.8 id,  0.0 wa,  0.7 hi,  0.2 si,  0.0 st
MiB Mem : 128919.9 total,  68571.9 free,  45363.9 used,  14984.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  82294.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1976 root      20   0   41.5g  37.2g  71340 S 614.6  29.5   2000:12 kubelet

------------------------------------------------------------------------------

[root@ocpnonprod-xxxx-worker-xxxx /]# top
top - 09:49:47 up 22:19,  1 user,  load average: 15.75, 13.10, 14.09
Tasks: 1058 total,   7 running, 1048 sleeping,   0 stopped,   3 zombie
%Cpu(s): 64.6 us, 28.2 sy,  0.2 ni,  5.2 id,  0.0 wa,  1.0 hi,  0.8 si,  0.0 st
MiB Mem : 128919.9 total,  58637.4 free,  50217.3 used,  20065.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  78033.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1980 root      20   0   46.9g  41.8g  70724 S 522.0  33.2   1987:18 kubelet

------------------------------------------------------------------------------

[root@ocpnonprod-xxxx-worker-xxxx ~]# top
top - 09:50:27 up 21:51,  1 user,  load average: 8.48, 9.43, 11.14
Tasks: 507 total,   1 running, 505 sleeping,   0 stopped,   1 zombie
%Cpu(s):  7.7 us, 64.4 sy,  0.4 ni, 25.9 id,  0.0 wa,  1.1 hi,  0.5 si,  0.0 st
MiB Mem : 128919.9 total,  58985.9 free,  53438.4 used,  16495.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  74253.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1973 root      20   0   50.1g  43.9g  71068 S 198.0  34.9   2236:23 kubelet
~~~
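
To put the three samples above next to the 9GB reservation, here is a small Python sketch of the arithmetic. The RES and total memory values are copied from the top output. Treating top's "g" suffix as GiB and the reservation as 9 GiB is an assumption, and the variable and function names are illustrative only.

```python
# Kubelet RES values (GiB) from the three top samples in this report.
kubelet_res_gib = [37.2, 41.8, 43.9]

# Total memory reported by top on each node (MiB).
TOTAL_MEM_MIB = 128919.9

# Configured system reservation; assumed here to mean 9 GiB.
SYSTEM_RESERVED_GIB = 9


def times_over_reservation(res_gib, reserved_gib=SYSTEM_RESERVED_GIB):
    """How many times larger the kubelet RES is than the reservation."""
    return res_gib / reserved_gib


def percent_of_total(res_gib, total_mib=TOTAL_MEM_MIB):
    """Kubelet RES as a percentage of total node memory."""
    return res_gib * 1024 / total_mib * 100


ratios = [round(times_over_reservation(r), 1) for r in kubelet_res_gib]
percents = [round(percent_of_total(r), 1) for r in kubelet_res_gib]

for res, ratio, pct in zip(kubelet_res_gib, ratios, percents):
    print(f"RES {res}g: {ratio}x the reservation, {pct}% of node memory")
```

Under these assumptions the kubelet alone is more than four times the reservation on every sampled node, and the computed percentages line up with the %MEM column that top reports.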


[0] https://access.redhat.com/solutions/5843241
[1] https://docs.openshift.com/container-platform/4.8/nodes/nodes/nodes-nodes-resources-configuring.html#nodes-nodes-resources-configuring-auto_nodes-nodes-resources-configuring

Comment 3 Mridul Markandey 2021-12-10 08:43:28 UTC

Hello Team,

I have a customer who is facing a similar issue in their RHOCP v4.8.14 cluster. The customer is getting a "SystemMemoryExceedsReservation" warning on all the master and worker nodes of the cluster, even after configuring the reservation to 12G. The customer has provided a must-gather, which I will share on this Bugzilla. Let me know if more information is needed from the customer's environment for further analysis.

Regards,
Mridul Markandey

Comment 52 Harshal Patil 2022-02-21 11:51:17 UTC

*** Bug 2056502 has been marked as a duplicate of this bug. ***

Comment 61 Sunil Choudhary 2022-03-10 07:31:39 UTC

Verified on 4.8.34

Comment 63 errata-xmlrpc 2022-03-16 11:30:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.34 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0795

Comment 64 Harshal Patil 2022-03-30 09:55:35 UTC

*** Bug 2067292 has been marked as a duplicate of this bug. ***