Bug 1979297

Summary: SystemExceedsMemoryReservation PrometheusRule handles hugepage reservations incorrectly
Product: OpenShift Container Platform Reporter: Alberto Losada <alosadag>
Component: Node Assignee: Harshal Patil <harpatil>
Node sub component: Kubelet QA Contact: Weinan Liu <weinliu>
Status: CLOSED ERRATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: aos-bugs, djuran, harpatil
Version: 4.8   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:38:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2028854, 2056502    

Description Alberto Losada 2021-07-05 14:08:45 UTC
Description of problem:

The Prometheus rule for the SystemExceedsMemoryReservation alert is incorrect: when hugepages are configured, the right side of its expression evaluates to a negative value.

The query, which was changed in 4.8, is here: https://github.com/openshift/machine-config-operator/blob/release-4.8/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L54

sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"}) - sum by (node) (kube_node_status_capacity{resource="hugepages_1Gi"}) - sum by (node) (kube_node_status_capacity{resource="hugepages_2Mi"}) - sum by (node) (kube_node_status_allocatable{resource="memory"}) - sum by (node)  (kube_node_status_allocatable{resource="hugepages_1Gi"}) - sum by (node)  (kube_node_status_allocatable{resource="hugepages_2Mi"})) * 0.9)


These are the resources of my node:

Capacity:
  cpu:                                     80
  ephemeral-storage:                       3750202692Ki
  hugepages-1Gi:                           16Gi
  hugepages-2Mi:                           0
  management.workload.openshift.io/cores:  80k
  memory:                                  197707916Ki
Allocatable:
  cpu:                                     76
  ephemeral-storage:                       3456186795225
  hugepages-1Gi:                           16Gi
  hugepages-2Mi:                           0
  management.workload.openshift.io/cores:  80k
  memory:                                  179804300Ki


Comparing the query with the values from my node, you can see that the right side subtracts the hugepages-1Gi value (16Gi) twice: once through capacity and once through allocatable.

Executing the query in Thanos Querier (or Prometheus, if you prefer) returns these values for that node:

* sum by (node) (container_memory_rss{id="/system.slice"}) = 685961216
* sum by (node) (kube_node_status_capacity{resource="memory"}) = 202452905984
* sum by (node) (kube_node_status_allocatable{resource="memory"}) = 184119603200
* sum by (node) (kube_node_status_capacity{resource="hugepages_1Gi"}) = 17179869184
* sum by (node) (kube_node_status_allocatable{resource="hugepages_1Gi"}) = 17179869184
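As a cross-check, the byte values returned by the queries match the quantities in the node description once the Ki and Gi suffixes are expanded. A minimal Python sketch (the helper name to_bytes is illustrative and not part of any tool):

```python
# Convert Kubernetes binary-suffix quantities (Ki, Mi, Gi) to bytes.
# to_bytes is a hypothetical helper written only for this check.
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def to_bytes(quantity: str) -> int:
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # plain number, already in bytes

capacity_memory = to_bytes("197707916Ki")
allocatable_memory = to_bytes("179804300Ki")
hugepages_1gi = to_bytes("16Gi")

print(capacity_memory)     # 202452905984
print(allocatable_memory)  # 184119603200
print(hugepages_1gi)       # 17179869184
```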

The right side of the query therefore evaluates to a negative number:

(sum by (node) (kube_node_status_capacity{resource="memory"}) - sum by (node) (kube_node_status_capacity{resource="hugepages_1Gi"}) - sum by (node) (kube_node_status_capacity{resource="hugepages_2Mi"}) - sum by (node) (kube_node_status_allocatable{resource="memory"}) - sum by (node)  (kube_node_status_allocatable{resource="hugepages_1Gi"}) - sum by (node)  (kube_node_status_allocatable{resource="hugepages_2Mi"})) = -16026435584

As a result, the alert always fires.
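The arithmetic can be reproduced offline with the values reported above. The sketch below evaluates the right side of the rule as written and then, for comparison, a variant that subtracts the hugepage pool only once. That variant rests on the assumption that allocatable memory already excludes the hugepage pool; it is shown for illustration and is not necessarily the expression shipped in the fix.

```python
# Values in bytes, taken from the query results in this report.
system_slice_rss = 685_961_216
capacity_memory = 202_452_905_984
allocatable_memory = 184_119_603_200
capacity_hugepages_1gi = 17_179_869_184
allocatable_hugepages_1gi = 17_179_869_184
capacity_hugepages_2mi = 0
allocatable_hugepages_2mi = 0

# Right side of the 4.8 rule, before the 0.9 factor is applied.
reservation_as_written = (
    capacity_memory
    - capacity_hugepages_1gi
    - capacity_hugepages_2mi
    - allocatable_memory
    - allocatable_hugepages_1gi
    - allocatable_hugepages_2mi
)
print(reservation_as_written)  # -16026435584

# Any positive RSS exceeds a negative threshold, so the condition holds.
alert_fires = system_slice_rss > reservation_as_written * 0.9
print(alert_fires)  # True

# Assumed alternative (not confirmed by this report): subtract hugepages once.
reservation_once = (
    capacity_memory
    - allocatable_memory
    - capacity_hugepages_1gi
    - capacity_hugepages_2mi
)
print(reservation_once)  # 1153433600
print(system_slice_rss > reservation_once * 0.9)  # False
```

Under that assumption the reservation comes out at 1100Mi, and the system slice RSS stays below 90% of it, so the alert would not fire on this node.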

Version-Release number of selected component (if applicable):


How reproducible:

Always, when hugepages are configured.

Steps to Reproduce:
1. Configure hugepages on your OpenShift cluster. In my case it is a SNO cluster that has been configured using the performance addon operator to reserve 16Gi of 1Gi hugepages
2. Execute the Prometheus query in Thanos querier or Prometheus console
3.
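For step 2, the query can also be sent to the HTTP API instead of the console. The sketch below only builds the instant-query URL for the /api/v1/query endpoint; the host name is a placeholder and build_query_url is a hypothetical helper, so adjust both to your cluster.

```python
from urllib.parse import urlencode

# Left side of the alert expression, as quoted in this report.
QUERY = 'sum by (node) (container_memory_rss{id="/system.slice"})'

def build_query_url(base_url: str, promql: str) -> str:
    """Return an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

# thanos-querier.example.com is a placeholder host, not a real route.
url = build_query_url("https://thanos-querier.example.com", QUERY)
print(url)
```

On an OpenShift cluster the request would most likely also need a bearer token in the Authorization header; that part is left out here.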

Actual results:

The alert always fires, because the right side of the expression is negative when hugepages are configured.

Expected results:


Additional info:

My tests were done on a single-node OpenShift (SNO) cluster; however, the results should be the same on a regular OCP cluster.

Comment 8 errata-xmlrpc 2021-10-18 17:38:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759