Bug 2028854

Summary: SystemMemoryExceedsReservation Alert
Product: OpenShift Container Platform Reporter: Shubham Jadhav <shujadha>
Component: Node Assignee: Swarup Ghosh <swghosh>
Node sub component: Kubelet QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aos-bugs, gagore, harpatil, hasingh, hgomes, kir, mmarkand, nagrawal, openshift-bugs-escalate, openshift-bugzilla-robot, swghosh
Version: 4.8 Flags: swghosh: needinfo-
Target Milestone: ---   
Target Release: 4.8.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The PromQL expression behind the system memory reservation alert counted hugepages memory consumption, which should not have been included.
Consequence: The alert fired unnecessarily on OCP 4.8 clusters.
Fix: The fix, which already exists in later versions of OCP, was backported to 4.8. It removes Linux huge pages from the system memory calculation.
Result: The alert should no longer fire unnecessarily.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-16 11:30:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1979297    
Bug Blocks:    

Description Shubham Jadhav 2021-12-03 15:06:22 UTC
Description of problem:

Customer is facing an issue with the SystemMemoryExceedsReservation alert on worker nodes.



How reproducible: Every time


Actual results:

The kubelet is consuming a large amount of memory, around 40GB.

Expected results:

The SystemMemoryExceedsReservation alert should stop firing after the System Reserved Memory is increased to 9GB.


Additional info:

We increased the System Reserved Memory to 9GB as per the KCS article [0] and the documentation [1].

Even after this increase, the kubelet on all the worker nodes is still consuming a large amount of memory.

~~~
[core@ocpnonprod-xxxx-worker-xxxx ~]$ top
top - 09:48:50 up 21:45,  1 user,  load average: 7.69, 7.79, 8.49
Tasks: 379 total,   1 running, 377 sleeping,   0 stopped,   1 zombie
%Cpu(s): 74.7 us, 16.5 sy,  0.2 ni,  7.8 id,  0.0 wa,  0.7 hi,  0.2 si,  0.0 st
MiB Mem : 128919.9 total,  68571.9 free,  45363.9 used,  14984.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  82294.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1976 root      20   0   41.5g  37.2g  71340 S 614.6  29.5   2000:12 kubelet

------------------------------------------------------------------------------

[root@ocpnonprod-xxxx-worker-xxxx /]# top
top - 09:49:47 up 22:19,  1 user,  load average: 15.75, 13.10, 14.09
Tasks: 1058 total,   7 running, 1048 sleeping,   0 stopped,   3 zombie
%Cpu(s): 64.6 us, 28.2 sy,  0.2 ni,  5.2 id,  0.0 wa,  1.0 hi,  0.8 si,  0.0 st
MiB Mem : 128919.9 total,  58637.4 free,  50217.3 used,  20065.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  78033.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1980 root      20   0   46.9g  41.8g  70724 S 522.0  33.2   1987:18 kubelet

------------------------------------------------------------------------------

[root@ocpnonprod-xxxx-worker-xxxx ~]# top
top - 09:50:27 up 21:51,  1 user,  load average: 8.48, 9.43, 11.14
Tasks: 507 total,   1 running, 505 sleeping,   0 stopped,   1 zombie
%Cpu(s):  7.7 us, 64.4 sy,  0.4 ni, 25.9 id,  0.0 wa,  1.1 hi,  0.5 si,  0.0 st
MiB Mem : 128919.9 total,  58985.9 free,  53438.4 used,  16495.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  74253.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1973 root      20   0   50.1g  43.9g  71068 S 198.0  34.9   2236:23 kubelet
~~~
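
For a rough sense of scale, the kubelet RES values in the three top samples above can be compared with the 9GB reservation. The sketch below is illustrative only: it assumes the `g` suffix in the RES column means GiB, reads the 9GB reservation as 9 GiB, and uses hypothetical helper names (`res_to_mib`, `summarize`) that are not part of any OpenShift tooling.

```python
# Illustrative only: compare kubelet RES values from the top samples above
# with the configured system reservation. Helper names are hypothetical.
from typing import Tuple

TOTAL_MEM_MIB = 128919.9      # "MiB Mem : 128919.9 total" in each sample
SYSTEM_RESERVED_GIB = 9.0     # assumption: the 9GB reservation read as 9 GiB

# RES column of the kubelet process on the three worker nodes
KUBELET_RES = ["37.2g", "41.8g", "43.9g"]

_UNITS_IN_MIB = {"m": 1.0, "g": 1024.0, "t": 1024.0 * 1024.0}


def res_to_mib(res: str) -> float:
    """Convert a top RES value such as '37.2g' to MiB (no suffix means KiB)."""
    suffix = res[-1].lower()
    if suffix in _UNITS_IN_MIB:
        return float(res[:-1]) * _UNITS_IN_MIB[suffix]
    return float(res) / 1024.0


def summarize(res: str) -> Tuple[float, float]:
    """Return (percent of total memory, multiple of the reservation)."""
    mib = res_to_mib(res)
    pct = 100.0 * mib / TOTAL_MEM_MIB
    ratio = mib / (SYSTEM_RESERVED_GIB * 1024.0)
    return pct, ratio


if __name__ == "__main__":
    for res in KUBELET_RES:
        pct, ratio = summarize(res)
        print(f"kubelet RES {res}: {pct:.1f}% of RAM, {ratio:.1f}x the reservation")
```

Under these assumptions the computed percentages line up with the %MEM column reported by top, and the kubelet's resident memory on each node is roughly four to five times the 9GB reservation.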


[0] https://access.redhat.com/solutions/5843241
[1] https://docs.openshift.com/container-platform/4.8/nodes/nodes/nodes-nodes-resources-configuring.html#nodes-nodes-resources-configuring-auto_nodes-nodes-resources-configuring

Comment 3 Mridul Markandey 2021-12-10 08:43:28 UTC
Hello Team,

I have a customer who is facing a similar issue in their RHOCP v4.8.14 cluster. The customer is getting a "SystemMemoryExceedsReservation" warning on all the master and worker nodes of the cluster, even after setting the reservation to 12G. The customer has shared a must-gather, which I will attach to this Bugzilla. Let me know if more information is needed from the customer's environment for further analysis.

Regards,
Mridul Markandey

Comment 52 Harshal Patil 2022-02-21 11:51:17 UTC
*** Bug 2056502 has been marked as a duplicate of this bug. ***

Comment 61 Sunil Choudhary 2022-03-10 07:31:39 UTC
Verified on 4.8.34

Comment 63 errata-xmlrpc 2022-03-16 11:30:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.34 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0795

Comment 64 Harshal Patil 2022-03-30 09:55:35 UTC
*** Bug 2067292 has been marked as a duplicate of this bug. ***