Bug 1953846

Summary: SystemMemoryExceedsReservation alert should consider hugepage reservation
Product: OpenShift Container Platform Reporter: Xingbin Li <xingli>
Component: Node Assignee: Harshal Patil <harpatil>
Node sub component: Kubelet QA Contact: Weinan Liu <weinliu>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: alegrand, anpicker, aos-bugs, erooth, jerzhang, kakkoyun, lcosic, mchebbi, nagrawal, pkrupa, saniyer, snalawad, spasquie, surbania, weinliu
Version: 4.6   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:04:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Xingbin Li 2021-04-27 05:35:39 UTC
The SystemMemoryExceedsReservation alert, which was added in OCP 4.6, should take hugepage reservations into account.

The SystemMemoryExceedsReservation alert uses the following Prometheus query:

~~~
sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)
~~~

With this query, if hugepages are configured on a worker node, the right side of the comparison includes those hugepages, which are meant to be allocated by applications. The left side is the working memory used by the system processes that support the containers running on the node.
In this case, the right side is inflated by a large amount of application memory that is unrelated to system-reserved memory, so the alert becomes meaningless.




For example, if a node has 30GiB of hugepages like below:

~~~
$ oc describe node <node-name>

...
Capacity:
cpu:                      80
ephemeral-storage:        2096613Mi
hugepages-1Gi:            30Gi
hugepages-2Mi:            0
memory:                   527977304Ki
openshift.io/dpdk_ext0:   0
openshift.io/f1u:         10
openshift.io/sriov_ext0:  10
pods:                     250

Allocatable:
cpu:                      79500m
ephemeral-storage:        1977538520680
hugepages-1Gi:            30Gi
hugepages-2Mi:            0
memory:                   495369048Ki
openshift.io/dpdk_ext0:   0
openshift.io/f1u:         10
openshift.io/sriov_ext0:  10
pods:                     250
..
~~~

The computed system-reserved value includes the 30GiB of huge pages, which will be allocated by the applications:

SystemReserved = kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}
               = 527977304Ki - 495369048Ki = 32608256Ki ≈ 31.1GiB

Meanwhile, container_memory_rss{id="/system.slice"} is unlikely to ever be larger than the right side, because the underlying system processes rarely use huge pages, as far as I know.
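
To make the effect on the alert threshold concrete, here is a minimal Python sketch using the capacity and allocatable values from the `oc describe node` output above. The helper `alert_threshold_ki` and its `hugepages_ki` parameter are hypothetical names introduced for illustration. Subtracting the hugepages is only one possible adjustment and is not necessarily how the alert was eventually changed.

```python
KI_PER_GI = 1024 * 1024  # number of KiB in one GiB

# Values copied from the `oc describe node` output in this report.
capacity_ki = 527977304
allocatable_ki = 495369048
hugepages_ki = 30 * KI_PER_GI  # hugepages-1Gi: 30Gi


def alert_threshold_ki(capacity_ki, allocatable_ki, hugepages_ki=0, factor=0.9):
    """Right side of the alert: factor * (capacity - allocatable - hugepages).

    hugepages_ki=0 reproduces the query as quoted in this report; a non-zero
    value models a hypothetical variant that excludes hugepages.
    """
    return factor * (capacity_ki - allocatable_ki - hugepages_ki)


current = alert_threshold_ki(capacity_ki, allocatable_ki)
adjusted = alert_threshold_ki(capacity_ki, allocatable_ki, hugepages_ki)

print(f"threshold as written:      {current / KI_PER_GI:.2f} GiB")
print(f"threshold minus hugepages: {adjusted / KI_PER_GI:.2f} GiB")
```

With the query as written, system.slice RSS would have to reach roughly 28 GiB before the alert fires; once the 30Gi of hugepages is excluded, the threshold drops to about 1 GiB.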

I am not sure whether my understanding is correct; if I am wrong, please let me know.

Comment 1 Simon Pasquier 2021-04-27 06:40:36 UTC
This alert is managed by the machine-config-operator [1], reassigning to the team.

[1] https://github.com/openshift/machine-config-operator/blob/f86955971533aacbb4bb66f5c7041057d3f33566/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L53-L60

Comment 2 Yu Qi Zhang 2021-04-27 21:49:33 UTC
Passing over to the node team to take a look as well, since it's a kubelet warning.

Comment 3 Sanket N 2021-04-29 12:46:49 UTC
*** Bug 1955044 has been marked as a duplicate of this bug. ***

Comment 7 Weinan Liu 2021-06-11 09:40:43 UTC
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           125293548Ki
  hugepages-1Gi:               5Gi
  hugepages-2Mi:               0
  memory:                      7935292Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1500m
  ephemeral-storage:           115470533646
  hugepages-1Gi:               5Gi
  hugepages-2Mi:               0
  memory:                      1541436Ki
  pods:                        250

7935292Ki-1541436Ki-5Gi=1.097Gi
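
The arithmetic above can be double-checked with a short Python sketch. The helper `quantity_to_bytes` is a hypothetical, simplified parser that handles only the integer `Ki`/`Mi`/`Gi` quantities seen in this output, not the full Kubernetes quantity syntax.

```python
# Binary suffixes used by the quantities in the node output above.
_SUFFIX_BYTES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '7935292Ki' or '5Gi' to bytes (integers only)."""
    for suffix, factor in _SUFFIX_BYTES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count, e.g. '0'


# capacity - allocatable - hugepages, using the values from this comment
reserved = (
    quantity_to_bytes("7935292Ki")
    - quantity_to_bytes("1541436Ki")
    - quantity_to_bytes("5Gi")
)

print(f"{reserved / 1024**2:.0f} Mi = {reserved / 1024**3:.4f} Gi")
```

The result is exactly 1124Mi, or about 1.098Gi, which matches the figure above up to rounding.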

Verified as fixed on:
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-10-071057   True        False         6h28m   Cluster version is 4.8.0-0.nightly-2021-06-10-071057

Comment 10 errata-xmlrpc 2021-07-27 23:04:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 11 Xingbin Li 2021-07-28 03:09:33 UTC
Do we have any plans to backport this to OCP 4.7?