Bug 1943265 - Negative Memory Utilization for Cluster Compute Resources Dashboard
Summary: Negative Memory Utilization for Cluster Compute Resources Dashboard
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: s390x
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.9.0
Assignee: Simon Pasquier
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On:
Blocks: ocp-48-z-tracker
 
Reported: 2021-03-25 17:00 UTC by jhusta
Modified: 2021-10-11 03:03 UTC
CC: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-01 14:22:23 UTC
Target Upstream Version:
Embargoed:


Attachments
Screen Shots of Dashboard and usage by nodes (285.36 KB, application/pdf)
2021-03-25 17:17 UTC, jhusta


Links
- Github kubernetes-monitoring/kubernetes-mixin pull 626 (open): dashboards/resources: use one datasource for calculating memory consumption (last updated 2021-06-16 12:30:03 UTC)
- Github openshift/cluster-monitoring-operator pull 1127 (closed): Bug 1946865: Update kube prometheus and related assets (last updated 2021-07-20 09:28:41 UTC)

Description jhusta 2021-03-25 17:00:50 UTC
Description of problem:
When using the Kubernetes / Compute Resources / Cluster dashboard, Memory Utilization shows a negative percentage and incorrect usage.


Version-Release number of selected component (if applicable):
Server Version: 4.8.0-0.nightly-s390x-2021-03-22-155743


How reproducible:
Observed on a large environment consisting of 3 masters and 10 workers.


Looking at the values from the dashboard's "Inspect" view, the expression is:
1 - sum(:node_memory_MemAvailable_bytes:sum{cluster=""}) / sum(kube_node_status_allocatable_memory_bytes{cluster=""})

sum of node_memory_MemAvailable_bytes = 611205861376
sum of kube_node_status_allocatable_memory_bytes = 387137921024
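
For reference, plugging those two sums into the expression reproduces the negative value. A quick sketch in Python, using only the numbers above:

    # sums taken from the dashboard's Inspect view (see above)
    mem_available = 611205861376   # sum of node_memory_MemAvailable_bytes
    allocatable   = 387137921024   # sum of kube_node_status_allocatable_memory_bytes
    print(1 - mem_available / allocatable)   # approximately -0.579, i.e. roughly -58%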

I experimented with some other variables to get to a proper value. I have included my screenshots and trials with and without a memory workload running to see how the current values change.





Steps to Reproduce:
1. Compute memory utilization on an environment manually and check whether it matches the value displayed in the dashboard.

Actual results:
Memory Utilization shows a negative percentage.

Expected results:
A correct, non-negative memory utilization value.


Additional info:

Comment 1 jhusta 2021-03-25 17:17:26 UTC
Created attachment 1766360
Screen Shots of Dashboard and usage by nodes

Comment 2 Andrew Pickering 2021-04-02 13:41:43 UTC
I see that this query has now been changed to `1 - sum(:node_memory_MemAvailable_bytes:sum{cluster=""}) / sum(kube_node_status_allocatable{resource="memory",cluster=""})` (changed by https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/534).

Not sure if this change would be expected to resolve this issue. Pawel, could you confirm?

FWIW, I am not seeing negative values with my test cluster.

Comment 3 Pawel Krupa 2021-04-06 07:42:22 UTC
It seems this can happen when a large chunk of memory is reserved for other uses. In that scenario the node's available memory is much higher than what the scheduler is allowed to allocate, so the right-hand part of the expression (`sum(:node_memory_MemAvailable_bytes:sum{cluster=""}) / sum(kube_node_status_allocatable{resource="memory",cluster=""})`) becomes greater than one, and the overall result goes negative.

The PR https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/534 won't fix this; we need a different way to track this, preferably one where we don't subtract metric values from 1.
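
To make the failure mode concrete, here is a minimal Python sketch; the node size and reservation figures below are hypothetical, only the relationship (allocatable = capacity minus kube-reserved, system-reserved and eviction threshold) comes from how kubelet computes node allocatable:

    GiB = 1024**3
    mem_total     = 64 * GiB                 # hypothetical node with 64 GiB of physical memory
    reserved      = 24 * GiB                 # hypothetical kube-reserved + system-reserved + eviction threshold
    allocatable   = mem_total - reserved     # what kube_node_status_allocatable{resource="memory"} reports
    mem_available = 48 * GiB                 # node_memory_MemAvailable_bytes is not limited by the reservation
    print(1 - mem_available / allocatable)   # about -0.2: available exceeds allocatable, so "utilisation" goes negative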

Comment 5 Junqi Zhao 2021-08-05 08:40:50 UTC
Checked with 4.9.0-0.nightly-2021-08-04-131508: in the Kubernetes / Compute Resources / Cluster dashboard, the "Memory Utilisation" expression is now
1 - sum(:node_memory_MemAvailable_bytes:sum{cluster=""}) / sum(node_memory_MemTotal_bytes{cluster=""})
which guarantees the value can no longer go negative.
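
For comparison, reusing the same hypothetical node from the sketch in comment 3, the new expression stays within [0, 1] because MemAvailable can never exceed MemTotal:

    GiB = 1024**3
    mem_total     = 64 * GiB
    mem_available = 48 * GiB
    print(1 - mem_available / mem_total)     # 0.25: bounded, since mem_available <= mem_total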

Comment 10 jhusta 2021-10-01 14:22:23 UTC
@juzhao I was able to validate the fix on 4.9. This defect can be closed. Thank you!

